Apple APFS and Removing Drives From Your JBOD RAID

JBOD drive case copies data on Apple APFS JBOD RAID

I did a bit of searching on whether it's possible to remove a failing drive from an Apple JBOD RAID. It's all theoretically possible. I'll also admit to having some ZFS envy, but MacZFS turned out to be a bit of a bust in terms of usability: I couldn't format a dataset volume without errors in Disk Utility, and I wasn't going to get into deep terminal work to make it function. Going back to Disk Utility and just messing around with a few drives, formatting them in various ways, I finally figured out how to remove a failing volume from a JBOD RAID.

First, a few clarifications. You have to have formatted the JBOD using OS X, so it's a software (AppleRAID) set. If you built it via the drive case's firmware, this isn't going to work, because OS X has no access to the case's underlying RAID controller.
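
If you're not sure which kind you have, a quick read-only check in Terminal settles it:

    # List any software (AppleRAID) sets the OS knows about. A JBOD
    # built in Disk Utility shows up as a "Concat" set along with its
    # member drives and UUIDs. If this prints "No AppleRAID sets found",
    # the RAID lives in the case's firmware and none of this applies.
    diskutil appleRAID list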

Another is that you need enough free space in the JBOD to hold whatever is on the drive you want to remove. If you want to remove a 6TB drive, you need at least 6TB of free space to do this successfully. Having a 5 to 10% margin of additional space is a good idea, just in case things don't fit down to the last few bytes. The workaround if you don't have the space is to add another drive to the JBOD, giving it room to internally shuffle the data around in the next steps.
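
The drive-add can be done from Terminal too; I did mine in Disk Utility, so treat this as a sketch (disk5 and the volume name are placeholders for your own system):

    # Check current free space on the mounted JBOD volume.
    df -h /Volumes/MyJBOD

    # Grow the concatenated set by adding a fresh drive as a new member.
    # "MyJBOD" is whatever your RAID set is named; disk5 is the new drive.
    # WARNING: anything on the new drive is erased.
    diskutil appleRAID add member disk5 /Volumes/MyJBOD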

Cost of failure: you lose all the data on the JBOD if the operation fails. This is a rough one, but I think Apple's diskutil is still a work in progress in this area. It apparently does check the space situation, but if you try this and it throws up a warning that there isn't enough space, I'd take that warning seriously.

Odd limitations:

  1. Disk Utility requires at least 3 drives in the JBOD for this to work. If you have just 2, it won't let you remove a volume, so you may need to add a drive just to remove another one.
  2. Disk Utility only lets you remove the last-mounted drive in the JBOD set. I know this seems weird, since you really should be able to remove any volume except maybe the first one, which holds the directory and other RAID metadata. There is a workaround below.

Workaround: Figure out which way your drive case or dock physically mounts the drives: top to bottom or bottom to top, left to right or right to left. How? Unmount the RAID, power down the case, and remove a drive at the very top or bottom, right or left. Power the case up and see which drive is now reported missing in Disk Utility. If it's the topmost disk in the list, the drive at the other end is the last. Knowing this, unmount the RAID (though it's probably not mounted right now) and power down. If you know which physical volume you want to remove, due to, say, a SMART error or it simply making noise, move it to what will be the last physical slot to mount. Fire everything up and it should mount. (Terminal can also save some of this guesswork; see the commands below.)
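
Before playing musical chairs with drive bays, it may be worth mapping members to physical drives from Terminal (device names here are examples):

    # Show the members of each RAID set in order, with their device
    # nodes (disk2, disk3, ...) and member UUIDs.
    diskutil appleRAID list

    # Map a device node to a physical drive: the Device / Media Name
    # field usually includes the drive's model, which helps tell the
    # physical disks apart without pulling any of them.
    diskutil info disk3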

Now, one thing I have seen with JBODs is that if you don't have the drives in the order the RAID was built, it might not mount. Put the drives back in, with appropriate powering down and unmounting, and once it mounts again, power down and move the drive back. On second tries I've had this work with a seemingly random order of drives. Maybe the actual issue is getting the first drive of the set into the first slot to mount, and after that it doesn't matter.

Once you have the drive you want to remove in the last physical slot to mount, and the RAID is mounted:

  1. Go to Disk Utility and select the RAID set.
  2. In the view of the volumes in the set, select the last one in the list, which is presumably the drive you want to remove. If not, unmount, power down, swap drives, and repeat until it is.
  3. If the selection is good, hit the minus button at the bottom left of the window.
  4. A warning will come up that the operation may fail and you'll lose everything if there isn't enough free space. Click Remove.
  5. Wait.
  6. Disk Utility will now copy any data from the drive being removed onto free space across the other drives. This may take quite a few hours depending on your case interface and drive speed, plus how much data actually needs to be moved. An eSATA or USB 3.1+ interface will be a lot faster than USB 2, which could take days. Unfortunately, Disk Utility doesn't provide a progress bar, an estimate to completion, or any other indication of how long it will take. You simply repeat step 5 until it's done. Fair warning: don't write anything to the RAID while this operation is going on. And given that things like large block sizes can inflate how much space small files eat up, don't start this operation unless you have a little margin of free space.
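
For the terminal-inclined, diskutil appears to expose the same operation; I did mine through the GUI, so consider this a sketch (the member UUID comes from the list command, and the volume name is a placeholder):

    # Get the UUID of the member you want to remove.
    diskutil appleRAID list

    # Remove that member from the set. For a concatenated (JBOD) set
    # this shrinks the set, which means the same long, progress-free
    # data relocation described in step 6 above.
    diskutil appleRAID remove <memberUUID> /Volumes/MyJBOD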

If all went well, you should still have your data, and the drive you wanted to remove can now be physically pulled from the case while everything keeps working.

What if it didn't go well? That happened to me the first time: the operation ended after 8 hours of moving bits around, and Disk Utility simply displayed "An error has occurred -69000 something". DU was also beachballed, requiring a force quit. I rebooted, since DU was in a hopeless state.

Once I was back up, I opened the array I was trying to trim the drive from. My guess was that there hadn't been enough free space, and that DU had already shuffled most of the data off the drive I wanted to remove. I found a project using 1.15TB of space and deleted it; I had 2 other copies of it and this was the headline restore version. It could go, as could a few hundred gigs of other dead projects. I now had another 2TB of free space and decided to try again. Note: NO data loss, despite DU's dire warning that there would be. My good fortune may not be yours; proceed with caution.

With more space free, I opened DU one more time, set up the remove operation, and went to bed. I have no idea how long it ran, at least 20 minutes. In the morning it was done, with no error message; in fact, DU said it was safe to remove the physical volume now. The JBOD I had removed the drive from showed the correct volume size. I had to reboot again because, well, I actually did remove the drive from the case, and that caused DU to beachball again. Once restarted, everything was good. I could now reuse the drive from the original array and move it into the new one. Yes folks, this does work, with some tedious time and effort. You might need to decide whether the time and risk are worth it versus just buying a new drive or several and copying everything over. That of course assumes things are in working order. If this is a salvage operation, where there isn't any backup and you are trying to save the data, this process might be worthwhile: add a new drive, mess around until the failing drive mounts last, then use DU to remove it from the set.

When Your Backup Fails

You know it was only a matter of time before data loss happens, or, well, at least before a drive case or drive fails.

Lightning does strike sometimes. We had a few power hits this week due to storms. Not big ones, just enough to flicker the lights once or twice. Hmmm, the backup array isn't powered up. Weird. Maybe that explains the emails I've gotten from Chronosync saying it's had errors. I usually follow up on these but hadn't on this particular set. Let's take a look. First I found out the battery in my UPS is not holding a charge. It may show charged, but put a load on it and even a 10W LED bulb crashes it. I replaced the battery with a larger external one, tested the UPS, and it works. However, the array powers down again while everything else stays up. Oh look, I had plugged its power supply into the filtered power outlet, not one actually backed by the battery! DUH!

As for the power supply for the drive dock, in this case it's soft-failing. The case lights up, but the drives weren't getting enough juice to spin. OK, no big deal. I ordered a new drive case with space for another drive, as I'd been getting tight on space, and found an Amazon deal of the day with 10TB WD Reds on sale for $129 off, so I bought 2. That will double my space, as the current JBOD array is 4TB + 6TB + 4TB + 6TB. That's just how I bought drives; I think I started with 2 originally, then added on as I needed and had drives. The 4TB drives are matching HGSTs, plus one 6TB WD Blue and one Red NAS version.

Skipping many details, I'm checking Neofinder to see what was on the big backup volume. It's mostly folder-per-volume backups and a couple of odd files. All the original drives are online and have backups onto a second tier of drives. However, several folders show empty. That's not good, but I'm guessing Neofinder was running when the volumes outright quit. No problem, let me just pull a backup db file from Time Machine. Hmmm, the server only has one from an hour ago; that won't do, since it looks like the backup RAID failed a couple of days ago. Checking across the other machines, only the laptop has TM going back a normal length of time. Restore the Neofinder db file from there.

Why isn't TM backing up correctly? No good reason; TM has always been flaky for no reason. The TM network volume has plenty of space, and there have been no errors for many months from TM complaining a backup is corrupted and it needs to create a new one. In fact, the last time I got that error, it was from the machine with the deep set of backups. I'll investigate later.

Things I did right: I indexed everything with Neofinder, with automatic updates 3 times a week. I have copies of all the Neofinder db files on all machines. They get updated by an automatic Chronosync batch backup that runs twice a week and cross-syncs all the machines. While I could run from a central server, I opted for local copies, for performance and because when you travel, you need the local copy. I'm not getting into serving the files over the internet if I don't have to, as some of them are several hundred megs each. No fun on a hotel internet connection.

New idea: put the original drives from the RAID into another case. Another thing done right: don't use a case's hardware RAID controller; use the OS, for just such emergencies. When a case fails, you don't have to worry about finding an exact match years after you bought it just to get things working again.
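
Disk Utility's RAID Assistant handles this, but the Terminal version is roughly one line (drive identifiers are examples, and everything on them is erased):

    # Build a concatenated (JBOD) AppleRAID set named "Backup" from
    # three drives, formatted as journaled HFS+. Because the set lives
    # in software, the drives can move to any dumb enclosure later.
    diskutil appleRAID create concat Backup JHFS+ disk2 disk3 disk4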

First try: the RAID shows up, then unmounts, every 30 seconds. Disk Utility complains things aren't right. The drives are in the case as 4TB, 4TB, 6TB Blue, 6TB Red. I messed around a bit, thinking maybe OS X was running a background repair, and looked at log files. Nothing good after 30 minutes. I hoped OS X would correct the problem and mount the drives. Nope.

Crazy idea: what if the order of the drives in the case matters? It shouldn't, but it's one thing I've never tested in this config. On a hunch, I swapped the middle two drives so that, bottom to top, it's 4TB, 6TB Blue, 4TB, 6TB Red. I know the bottom drive is the first volume of the original set. It works! Who would have thought that in a JBOD the order of the drives makes any difference? Well, at least for OS X it does. New thing learned.

I ran Disk First Aid and everything is good. So now I wait for the drives and the new case to show up, at which point I'll decide how to set up the new backup array and then copy everything back. At least I have no work hanging over my head to make managing this a pressure situation.
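
(First Aid's Terminal equivalent, if you want output you can scroll back through; the volume name is an example:)

    # Read-only check of the RAID volume's file system.
    diskutil verifyVolume /Volumes/Backup

    # If verify reports problems, attempt a repair.
    diskutil repairVolume /Volumes/Backup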

More learning: Apple JBOD with HFS+ isn't a full implementation. The way it's supposed to work is that each drive is complete in itself, with its own directory and files, just presented together as one giant volume. Apple apparently keeps the directory entirely on the first volume of the set. Worse, if a data volume drops off, the entire RAID goes down. By spec, you should only lose what's on that one drive and everything else is OK; that's why I picked this setup. With HFS+ it doesn't work this way. Since I'm starting over, I'm going to take a more detailed look at the RAID options in APFS, since it actually uses containers. I'll do some testing and see whether Apple has improved their JBOD formatting to act and perform more like the spec, or not. I'm going to do some active volume-failure tests before committing to a copy of 15TB of data.
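
My rough test plan, sketched as commands. Spare drives only; everything here is destructive, and all the names are placeholders:

    # Build a throwaway 3-drive JBOD from spares and copy sample data on.
    diskutil appleRAID create concat TestSet JHFS+ disk5 disk6 disk7
    cp -R ~/SampleData /Volumes/TestSet/

    # ...power down, pull the middle drive, power back up...

    # See what survives: does the set mount degraded, or not at all?
    diskutil appleRAID list
    ls /Volumes/TestSet 2>/dev/null || echo "set did not mount"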

It looks like no data was lost, after checking everything. Only time and money, which is the best case for a significant failure. I'd planned for this to happen and had even tested a few failures. Mostly things worked because I had enough copies, meaning at least 3 and in some cases 5, so that in the case of TM I still had one machine with a good backup I could reach into.

Living on the edge of the digital abyss –

S