When Your Backup Fails

July 16, 2020

You know it’s only a matter of time before data loss happens, or at least before a drive or drive case fails.

Lightning does strike sometimes. We had a few power hits this week due to storms. Not big ones, just enough to flicker the lights once or twice. Hmmm, the backup array isn’t powered up. Weird. Maybe that explains the emails I’ve gotten from Chronosync saying it’s had errors. I usually follow up on these but hadn’t on this particular set. Let’s take a look. First I found out the battery in my UPS is not holding a charge. It may show charged, but put a load on it and even a 10 W LED bulb crashes it. I replaced the battery with a larger external one, tested the UPS, and it works. However, the array powered down again while everything else stayed up. Oh look, I’d plugged the power supply into the filtered power outlet, not one actually on the UPS! DUH!

As for the power supply for the drive dock, in this case it’s soft failing. The case lights up, but the drives weren’t getting enough juice to spin. OK, no big deal. I ordered a new drive case with space for another drive, as I’d been getting tight on space. I found an Amazon deal of the day with 10 TB WD Reds on sale for $129 off and bought two. That will double my space, as the current JBOD array is 4 TB + 6 TB + 4 TB + 6 TB. That’s just how I bought drives: I think I originally started with two, then added on as I needed space and had drives. The 4 TBs are matching HGSTs; the 6 TBs are one WD Blue and one Red NAS version.
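For anyone who wants to check the "double my space" math, here is a tiny sketch. The drive sizes are the ones listed above; the variable names are just mine for illustration.

```python
# Drive sizes in TB, as listed in the post.
current_drives_tb = [4, 6, 4, 6]   # the existing JBOD set
new_drives_tb = [10, 10]           # the two new 10 TB drives

current_total_tb = sum(current_drives_tb)
added_tb = sum(new_drives_tb)
combined_tb = current_total_tb + added_tb

print(f"current: {current_total_tb} TB, added: {added_tb} TB, combined: {combined_tb} TB")
```

A JBOD simply adds up its members, so the mismatched 4s and 6s don't cost any capacity.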

Skipping many details, I’m checking Neofinder to see what was on the big backup volume. It’s mostly Folder = Volume backups and a couple of odd files. All the original drives are online and have backups on a second tier of drives. However, several folders show as empty. That’s not good, but I guess Neofinder was running when the volumes outright quit. No problem, let me just pull a backup db file from Time Machine. Hmmm, the server only has one from an hour ago. That won’t do, since it looks like the backup RAID failed a couple of days ago. Checking across the other machines, only the laptop has TM going back the normal length of time. I restored the Neofinder db file from it.

Why isn’t TM backing up correctly? No good reason; TM has always been flaky for no reason. The TM network volume has plenty of space, and for many months there have been no errors from TM complaining that the backup is corrupted and it needs to create a new one. In fact, the last time I got that error it was from the machine with the deep set of backups. I’ll investigate later.

Things I did right: I indexed everything with Neofinder, with automatic updates 3 times a week. I have copies of all Neofinder db files on all machines. They get updated by an automatic Chronosync batch backup that runs twice a week and cross-syncs all machines. While I could run from a central server, I opted for local copies for performance, and when you travel, you need the local copy. I’m not getting into serving the files over the internet if I don’t have to, as some of them are several hundred megs each. No fun on a hotel internet connection.

New idea: put the original drives from the RAID into another case. Another thing done right: don’t use a case’s hardware RAID controller; use the OS instead, for just such emergencies. When a case fails, you don’t have to worry about finding an exact match years after you bought it to get things working again.

First try: the RAID shows up, then unmounts, every 30 seconds. Disk Utility complains things aren’t right. The drives are in the case as 4 TB, 4 TB, 6 TB Blue, 6 TB Red. I mess around a bit, thinking maybe OS X is running a background repair, and look at log files. Nothing good after 30 minutes. I hope maybe OS X corrects the problem and mounts the drives. Nope.

Crazy idea: what if the order of the drives in the case matters? It shouldn’t, but it’s one thing I’ve never tested in this config. Randomly, I swap the middle two drives so that bottom to top it’s 4 TB, 6 TB Blue, 4 TB, 6 TB Red. I know the bottom drive is the first volume of the original set. It works! Who would have thought that in a JBOD the order of the drives makes any difference? Well, at least for OS X it does. New thing learned.
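I got lucky with the first swap. If I hadn't, and assuming the bottom slot really had to hold the first volume of the set, the worst case would have been a short list of orderings to try. A quick sketch that lists them (the drive labels are made up for illustration, not anything OS X reports):

```python
from itertools import permutations

def candidate_orders(first_drive, other_drives):
    """Every bottom-to-top ordering that keeps first_drive in the bottom slot."""
    return [(first_drive, *rest) for rest in permutations(other_drives)]

# Hypothetical labels for the four drives in the set.
orders = candidate_orders("4TB-a", ["4TB-b", "6TB-blue", "6TB-red"])

for order in orders:
    print(" / ".join(order))
```

With one slot pinned down that is only 3! = 6 arrangements, so brute force by hand is perfectly doable.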

I run Disk Utility’s First Aid and everything is good. So now I wait for the drives and new case to show up, then I’ll decide how to set up the new backup array and copy everything back. At least I have no work hanging over my head to turn managing this into a pressure situation.

More learning: Apple’s JBOD with HFS+ isn’t a full implementation. The way it’s supposed to work is that each drive is complete with its own directory and files, while the set just appears as one giant volume. Apple apparently keeps the directory all on the first volume of the set. Worse, if a data volume drops off, the entire RAID goes down. By spec, you should only lose what’s on that one drive and everything else is OK. That’s why I picked this setup. With HFS+ it doesn’t work this way. Since I’m starting over, I’m going to take a more detailed look at RAID options in APFS, since it’s actually using containers. I’ll do some testing and see whether Apple has improved their JBOD formatting to act and perform more like the spec. I’m going to do some active volume failure tests before committing to a copy of 15 TB of data.
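To make the difference concrete, here is a toy model of the two behaviors described above: the one I expected (each drive stands on its own) and the one I actually got (lose a member, lose the whole set). It is only a sketch of my understanding, not how Apple's code works, and the file names and placement are invented.

```python
def surviving_files(placement, failed_drive, members_independent):
    """Return the files still readable after one drive drops out.

    placement maps file name -> index of the drive holding it.
    members_independent=True models the behavior I expected from a JBOD;
    False models what I observed, where the whole set goes down.
    """
    if not members_independent:
        return []
    return sorted(name for name, drive in placement.items() if drive != failed_drive)

# Hypothetical layout: one file per drive in a four-drive set.
placement = {"a.mov": 0, "b.psd": 1, "c.raw": 2, "d.zip": 3}

print("expected:", surviving_files(placement, failed_drive=2, members_independent=True))
print("observed:", surviving_files(placement, failed_drive=2, members_independent=False))
```

The failure tests I have planned for APFS are basically a check of which of these two branches it follows.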

No data lost, it looks like, after checking everything. Only time and money, which is the best case for a significant failure. I’d planned for this to happen and even tested a few failures. Mostly things worked because I had enough copies, meaning at least 3 and in some cases 5, so that in the case of TM I still had one machine with a good backup I could reach into.
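The "at least 3 copies" rule is easy to turn into a quick check. This is a minimal sketch: the inventory below is entirely made up, and in practice you would have to fill in the copy counts yourself from whatever catalog you keep.

```python
MIN_COPIES = 3  # my floor: at least 3 copies, 5 for some things

def under_replicated(copy_counts, minimum=MIN_COPIES):
    """Return the names of items that exist in fewer than `minimum` copies."""
    return sorted(name for name, count in copy_counts.items() if count < minimum)

# Hypothetical inventory: item name -> number of copies that exist.
copies = {
    "photo archive": 5,
    "Neofinder db files": 4,
    "scratch renders": 2,
}

print(under_replicated(copies))
```

Anything it prints is one failed drive case away from being a much less cheerful blog post.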

Living on the edge of the digital abyss –