Thread: Missing Raid Set, Xserve Late 2009

  1. #1
    Join Date
    Nov 2012
    Posts
    5

    Missing Raid Set, Xserve Late 2009

    Hello, I've had the misfortune of having a raid set fail on me.
    We had a technician in to run diagnostics on our server to see why it was running at 100% fans. He plugged in a USB disk, restarted into diagnostics from that disk, and found a problem with drive 2's temperature sensor. He then proceeded to hot-swap drives 2 and 3 while the system was still running and ran the test again. The error moved with the drive.

    He was surprised when, on reboot, there was no boot volume. He booted to a Lion installation he had on a different USB disk, and in its RAID Utility there were no volumes listed, and all 3 disks showed up as healthy and 'roaming'.

    Well, the technician and Apple support are all telling me we have to rebuild from backup - unfortunately our backups are not quite up to date... looks like it's been having trouble for a month or so.

    Only drives 2 and 3 were moved - why did drive 1 lose its RAID reference...
    Are we really right out of luck? Some of these data recovery places claim they can rebuild it.. I imagine that would be thousands of dollars and a couple weeks, at least. If they can do it, there must be something we can do locally.

    Any help would be appreciated!
    MH

  2. #2
    Join Date
    Feb 2001
    Location
    on the landline, Mr. Smith
    Posts
    7,791

    So, maybe I am missing something. 3 HDs, what was the original RAID configuration?


    You said drive 1 was part of RAID......was it a RAID 1 (mirror) with drive 2 or 3? Or were all 3 in a RAID 5 with the RAID card? Or something else?
    "Imagine if every Thursday your shoes exploded if you tied them the usual way. This happens to us all the time with computers, and nobody thinks of complaining." -- Jef Raskin

  3. #3
    Join Date
    Nov 2012
    Posts
    5

    Oh, my apologies. It was originally using all 3 drives in a raid 5 array. Running Snow Leopard Server - I believe it was 10.6.1 because we had compatibility issues after performing the 10.6.3 upgrade with our main software package.

  4. #4
    Join Date
    Aug 2001
    Location
    Grangeville, ID USA
    Posts
    9,145

    I have never seen a hardware RAID controller that can handle drives being physically moved around. Failing one drive and replacing it, yeah. Physically moving drives forces them out of the RAID set and there are very few RAID administrative packages that can reconstitute the RAID after it is broken that way. With hardware RAID5 the physical drive location determines the logic of the RAID. Moving drives completely breaks the relationship.

    If you had a three drive RAID5 and two of the drives had their locations changed then the RAID is toast. With some RAID software, if you can put the drives back into their original proper locations, you *may* be able to force it to mount the RAID set. I wouldn't want to bet on it though.
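    To illustrate why bay order can matter, here is a toy Python sketch. It assumes a controller that reassembles data purely by bay position, with no on-disk metadata saying which member is which. Real controllers (including the Apple card) may well behave differently, there is no parity here, and the names `stripe` and `unstripe` are made up for the example.

```python
def stripe(data: bytes, n_drives: int, chunk: int) -> list[bytes]:
    """Deal fixed-size chunks of data round-robin onto n_drives 'disks'."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return [bytes(d) for d in drives]


def unstripe(drives: list[bytes], chunk: int) -> bytes:
    """Reassemble data by reading one chunk per disk, in bay order."""
    out = bytearray()
    offset = 0
    while any(offset < len(d) for d in drives):
        for d in drives:
            out += d[offset:offset + chunk]
        offset += chunk
    return bytes(out)


original = b"ABCDEFGHIJKL"
bay1, bay2, bay3 = stripe(original, n_drives=3, chunk=2)

in_order = unstripe([bay1, bay2, bay3], chunk=2)
swapped = unstripe([bay1, bay3, bay2], chunk=2)  # bays 2 and 3 exchanged

print(in_order)  # b'ABCDEFGHIJKL'
print(swapped)   # b'ABEFCDGHKLIJ'
```

    Every byte is still on the disks after the swap, but a reader that trusts bay order puts the chunks back together wrong.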

    Some years back, had an advertising company running a hardware RAID5 that failed a drive. The tech, through inattention, pulled the wrong drive to replace it. As far as the RAID controller was concerned that RAID had now failed 2 drives. They hauled that RAID set to the developer and spent a week trying to get it rebuilt, to no avail. That company was no longer in business as they had no backup and all their product was on that one RAID.

    These days administrative utilities are getting better and better. The controller will usually not attempt to mount an incomplete RAID set, but will instead ask what to do. The reason is that if a RAID set is incomplete (missing one drive) and you force it to mount, then even if you don't transfer data to it, the missing drive has to be reformatted and rebuilt, an hours-long process. If you can get the missing drive recognized and only mount the RAID after it is complete, then no rebuild is required. Knowing that, you may find that by putting the drives back into their original positions you can get the RAID recognized, but only as long as it wasn't manipulated by the admin utility when the drives were moved around. Worth a try when there is data to get off.

    Rick
    molṑn labe'
    "I am a mortal enemy to arbitrary government and unlimited power. I am naturally very jealous for the rights and liberties of my country, and the least encroachment of those invaluable privileges is apt to make my blood boil."
--Ben Franklin

  5. #5
    Join Date
    Feb 2001
    Location
    on the landline, Mr. Smith
    Posts
    7,791

    Agreed 100%.

    That "tech" hosed you......may not have realized that it was a RAID 5 array.

    Some newer, higher end RAID hardware will let you swap drives around, but traditionally each drive must be in the correct bay/slot. With a RAID 5, you can only ever lose/swap one HD. Period. I always treat them with great care, knowing that breaking a RAID 5 array is a really big deal. And each controller and software package can differ in capability and forgiveness, as Rick said, so the safe thing to do is assume the worst.....be conservative on changes. A good time to verify backups is before troubleshooting issues that have any possibility of upsetting HDs or controllers.
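    The one-drive limit comes from how parity works. Below is a minimal Python sketch, assuming a simplified layout with two data blocks and one fixed XOR parity block (a real RAID 5 rotates parity across all members). `xor_blocks` is a hypothetical helper for the example, not part of any RAID tool.

```python
from functools import reduce
from operator import xor


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


drive1 = b"HELLO"
drive2 = b"WORLD"
parity = xor_blocks(drive1, drive2)  # what the third member would hold

# One member lost: the survivor plus parity gives it back exactly.
rebuilt_drive2 = xor_blocks(drive1, parity)
rebuilt_drive1 = xor_blocks(drive2, parity)

# Two members lost: parity alone fits any guess for drive1,
# so there is nothing to tell the real contents from a wrong one.
guess_drive1 = b"XXXXX"
implied_drive2 = xor_blocks(guess_drive1, parity)
still_consistent = xor_blocks(guess_drive1, implied_drive2) == parity
```

    With one member gone the rebuild is exact; with two gone, every guess satisfies the parity equally well, which is why the set is unrecoverable by the controller at that point.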

    There are ways to extend RAID 5 with hot failover, or to use RAID 6, etc. But for your setup, none of that matters.

    Be sure the HDs are back in their original position, and then look at the RAID software to see if you have any options to diagnose, rebuild, etc.

    And don't hold your breath.

  6. #6
    Join Date
    Nov 2012
    Posts
    5

    Quote: Originally posted by ricks
    These days administrative utilities are getting better and better. The controller will usually not attempt to mount an incomplete RAID set, but will instead ask what to do. The reason is that if a RAID set is incomplete (missing one drive) and you force it to mount, then even if you don't transfer data to it, the missing drive has to be reformatted and rebuilt, an hours-long process. If you can get the missing drive recognized and only mount the RAID after it is complete, then no rebuild is required. Knowing that, you may find that by putting the drives back into their original positions you can get the RAID recognized, but only as long as it wasn't manipulated by the admin utility when the drives were moved around. Worth a try when there is data to get off.

    Rick
    Thanks for the replies. I'm wondering if perhaps you know of another utility that we can use to examine the raid, or are we limited to using apple's raid utility? We didn't use the administration utility to do anything to the array, or confirm any rebuild in AXD 3x104. It appears it just cleared the array when we moved the drives in the EFI diags. The people at Applecare sounded surprised but not shocked when I explained the results to them.

    MH

  7. #7
    Join Date
    Feb 2001
    Location
    on the landline, Mr. Smith
    Posts
    7,791

    Rick's story gives me the willies. Have seen a couple catastrophic failures too.

    I break out into a cold sweat (literally) and check everything about 10 times before I pull a failed HD on a RAID array. Kinda like combat: People that aren't scared don't understand what is going on.

  8. #8
    Join Date
    Feb 2001
    Location
    on the landline, Mr. Smith
    Posts
    7,791

    Many cards have a command line tool too. Within Terminal, type:

    man raidutil

    Here's the admin guide, and the more extensive dev guide.

    Apple may have a few more clues, such as this.

  9. #9
    Join Date
    Aug 2001
    Location
    Grangeville, ID USA
    Posts
    9,145

    The reason Apple HAS a RAID card is no one would write the administrative utility for Mac OS, so Apple contracted with LSI for the card and wrote the utility themselves.

    The administrative utility is at least 75% of the cost of a high end RAID card. The utility is 100% of the ability to recover from problems, of which there are too many possible routes to a failure on a RAID 5 for us to count. If the engineers that developed the utility didn't think of the problem, whatever it is, and didn't build in a repair or remediation in the utility, then that type of failure screws you out of your volume structure.

    In my opinion, only ATTO, ARECA and CalDigit write RAID utilities worth owning. I run hot and cold on CalDigit, but for the most part they do well. Apple is not on that list. They are WAY too slow updating drivers and firmware to handle environment changes. And the utility doesn't have any kind of track record for repairing volume structure corruptions - hence all the threads here and elsewhere with folks losing their data.

    There are no outside developers of RAID utilities that will work with the firmware on an Apple LSI card. No outside utility developer is ever going to write a way for you to use their precious and expensive utility with a card they don't sell to you and make money on. Besides, as stated before, the card is the least costly part of a hardware RAID controller. If a good card + utility costs $1000, then $750 of that is in the utility and support. I wouldn't want to hack in a RAID utility for a card not fully supported by the developer. That is unless the RAID is unimportant in data content. Then you can knock yourself out. But there will be no assistance available anywhere for the inevitable problems.

    In my opinion, if the data is important, there is never an excuse to shortchange the quality of the hardware manufacturer. And even more important, you absolutely MUST have a good backup plan in place. The backup is of much greater importance than the type of RAID used.

    Always keep in mind that RAID never adds data security. All it adds is the convenience of continuing to operate when a drive fails. This gives the administrator time to replace said failed drive and rebuild the RAID without interrupting service to the network. Data security is only available by virtue of a backup.

    Rick

  10. #10
    Join Date
    Nov 2012
    Posts
    5

    Thanks for your help, Gents!

  11. #11
    Join Date
    Mar 2001
    Posts
    540

    I believe there are RAID experts out there who can rebuild the drives. If I were you I would ditch one of drives 2/3 and put a new one in. The drives must be in their original positions.

    Chances are the RAID will rebuild the third drive. One thing I learnt is that once a drive has been taken out of a RAID, you had better be sure it has been wiped and is working before you put it back in again.

  12. #12
    Join Date
    Feb 2001
    Location
    on the landline, Mr. Smith
    Posts
    7,791

    Good tips.

    Wonder what the original issue was......bad HD, or something else? If one is bad—as you say—get the other two in the right bays, and replace the bad HD.

    REALLY NEED TO KNOW which one was bad. If HD 1 was untouched, then it had to be either HD 2 or HD 3 that should be replaced. If the trays were labeled, hopefully you can get the good one back in the right bay.....and replace the HD in the bay that originally had an issue.

    If you are not 100% sure which one was diagnosed (the problem followed it), I am not sure if there is a safe way to test them without risk of fouling the RAID info on the good drive.

  13. #13
    Join Date
    Nov 2012
    Posts
    5

    Well, we're a bit further on, so I figured I'd update.

    We were able to recover almost all our data from backups. It's not a terrible loss, and we also found out a great deal about the quality of our backups. It could have been a bigger deal in the new year: our new backup process, implemented in April, was only copying the current year's data... we could have lost more than a decade if this had happened in January.
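    For anyone wanting to catch this kind of gap earlier, here is a rough Python sketch of a backup check. It assumes the backup is a plain mirrored folder tree (not a Time Machine bundle), and `verify_backup` and `file_digest` are hypothetical names for this example, not part of any backup product.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def verify_backup(source: Path, backup: Path) -> dict[str, list[str]]:
    """List source files that are absent from, or differ in, the backup."""
    missing: list[str] = []
    mismatched: list[str] = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            mismatched.append(rel.as_posix())
    return {"missing": missing, "mismatched": mismatched}
```

    Run against a source tree and its mirror, a backup that only copied the current year would show every older file under "missing".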

    All of the drives test good, and we have them running again. When we recreated the RAID set, the utility said the drives were already part of a set. Disappointing that we couldn't find the reference. We all seem to agree that the technician damaged it when he swapped drives 2 and 3, but why the set went missing we have no idea. Perhaps the fact that we were in an EFI diagnostic contributed.

    We now have a full Time Machine backup and a SuperDuper! backup of the RAID drive. We still have to replace drive 2 when it arrives from Apple next week. All of this from a failed temperature sensor.

    Unfortunately, we didn't try the wipe and reinsert method. Wish we could go back now and try, but it's done.

    MH

  14. #14
    Join Date
    Feb 2001
    Location
    on the landline, Mr. Smith
    Posts
    7,791

    Welcome to the IT dept.

    Glad it was not catastrophic. Time Machine has its place, but not really enterprise grade. I like it as a secondary option, as you have it configured.

    Be sure to get multiple copies of key data, hopefully with more than one backup tool. Lots of good options out there. I like CrashPlan.

    Most important, regardless of software or setup: Test your backups regularly! Near misses like these help keep me on the straight and narrow regarding backup schedules, versions, data verification and restore practices.