Big (large-diameter) fans can move a lot of air even at low RPM, and they are much more energy efficient.
Oxide Computer, in one of their presentations, talks about using 80mm fans because they are quiet and (more importantly) don't use much power. They observed that in other servers as much as 25% of the power went just to running the fans, versus ~1% in theirs.
Not sure what to believe, but I like having my ZFS NAS running so it can regularly run scrubs and check the data. FWIW, I've run my 4-drive system for 10 years with 2 drive failures in that time, but they were not enterprise-grade drives (WD Green).
Is anyone using it around here?
Many years ago I was running an 8x500G array in an old Dell server in my basement. The drives were all factory-new Seagates - 7200RPM and may have been the "enterprise" versions (i.e. not cheap). Over 5 years I ended up averaging a drive failure every 6 months. I ran with 2 parity drives, kept spares around and RMA'd the drives as they broke.
I moved houses and ended up with a room dedicated to lab stuff. With the same setup I ended up going another 5 years without a single failure. It wasn't a surprise that the new environment was better, but it was surprising how much better a cleaner, more stable environment ended up being.
There is another (very rare) failure a UPS protects against, and that's irregularities in the incoming power.
You can get a spike or sag (both can be destructive) if there is construction in your area and something goes wrong with the electrical supply, or if lightning hits a pylon close enough to your house.
The first job I worked at had multiple servers die like that, roughly 10 years ago. It's the only time I've ever heard of such an issue, however.
To my understanding, a UPS protects against such spikes as well, as it will die before letting your servers get damaged.
I have 4TB HGST drives that have been running for over a decade. OK, not 24/7, more like 8 hours a day, and also 0 failures. But I'm also lucky, like you. Some of the people I know have several RMAs with the same drives, so there's that.
My main question is: what is it that takes up 71TB but can be turned off most of the time? Is this the server where you store backups?
Probably fine, and definitely cheaper power costs. Those extra-grease-on-the-spindle drives were a blip in time.
I wonder if Backblaze has done any lifetime stats modelling on drives being powered on and off? I think they're in the always-on problem space.
https://lackofimagination.org/2022/04/our-experience-with-po...
It's a whitebox RAID6 running NTFS (tried ReFS, didn't like it), and has been around for 12+ years, although I've upgraded the drives a couple times (2TB --> 4TB --> 16TB) - the older Areca RAID controllers make it super simple to do this. Tools like Hard Disk Sentinel are awesome as well, to help catch drives before they fail.
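If you don't run Hard Disk Sentinel, a rough DIY equivalent is to watch a few SMART attributes yourself. A minimal sketch, assuming smartmontools is installed and that the device list is adjusted for your machine (not part of the setup described above):

    #!/usr/bin/env python3
    # Rough SMART early-warning check: flag drives with reallocated or
    # pending sectors. Assumes smartmontools (smartctl) is installed and
    # that DEVICES matches your system.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # adjust for your machine
    WATCHED = {5: "Reallocated_Sector_Ct",
               197: "Current_Pending_Sector",
               198: "Offline_Uncorrectable"}

    def raw_values(device):
        """Return {attribute_id: raw_value} parsed from `smartctl -A` output."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        values = {}
        for line in out.splitlines():
            fields = line.split()
            # Attribute rows start with the numeric ID; the raw value is the 10th column.
            if len(fields) >= 10 and fields[0].isdigit():
                attr_id = int(fields[0])
                if attr_id in WATCHED:
                    try:
                        values[attr_id] = int(fields[9])
                    except ValueError:
                        pass  # some firmware reports decorated raw values; skip those
        return values

    for dev in DEVICES:
        for attr_id, raw in raw_values(dev).items():
            if raw > 0:
                print(f"WARNING {dev}: {WATCHED[attr_id]} = {raw}")

Anything non-zero on those three attributes is usually a good cue to order a replacement before the drive actually dies.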
I have an additional, smaller array that runs 24x7, which has been through similar upgrade cycles, plus a handful of clients with whitebox storage arrays that have lasted over a decade. Usually the client ones are more abused (poor temperature control when they delay fixing their server room A/C for months but keep cramming in new heat-generating equipment, UPS batteries not replaced diligently after staff turnover, etc.).
Do I notice a difference in drive lifespan between the ones that are mostly-off vs. the ones that are always-on? Hard to say. It's too small a sample size, and possibly too much variance in 'abuse' between them. But I've definitely seen a failure rate differential between the ones that have been maintained and kept cool vs. the ones allowed to get hotter than is healthy.
I can attest those 4TB HGST drives mentioned in the article were tanks. Anecdotally, they're the most reliable ones I've ever owned. And I have a more reasonable sample size there as I was buying dozens at a time for various clients back in the day.
The 2, 4, and 8TB HGST UltraStar disks are particularly reliable. All of my desktop PCs currently host mirrors of 2009-vintage 2TB drives that I got when they were put out of service. I have heaps of spare, good 2TB drives (and a few hundred still running in production after all these years).
For some reason, 14TB drives seem to have a much higher failure rate than helium drives of all sizes. In a fleet of only about 40 14TB drives, I had more failures than in a fleet of over 1,000 12TB and 16TB drives.
My NAS is basically constantly in use, between video footage being dumped and then pulled for editing, uploading and editing photos, keeping my devices in sync, media streaming in the evening, and backups from my other devices at night.
The only time I had problems was when I tried to add a 5th disk using a USB hub, which caused drives attached to the hub to get disconnected randomly under load. This happened with 3 different hubs, so I've since stopped trying to expand that monstrosity and just replace drives with larger ones instead. Don't use hubs for storage; the majority of them are shitty.
Currently ~64TiB (less with redundancy).
Same as OP. No data loss, no broken drives.
A couple of years ago I also added an off-site 46TiB system with similar software, but a regular ATX with 3 or 4 internal drives because the spiderweb of mini PC + dangling USBs + power supplies for HDDs is too annoying.
I do weekly scrubs.
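For anyone wanting to automate that, here's a minimal weekly-scrub helper you could call from cron or a systemd timer. The pool name "tank" and the plain zpool CLI calls are assumptions for the sketch, not details from the comment above:

    # Kick off a ZFS scrub and report pool health. Assumes a pool named
    # "tank" (adjust) and the standard zpool CLI on PATH.
    import subprocess
    import sys

    POOL = "tank"

    def scrub_and_report(pool):
        # Start a scrub; ZFS runs it in the background.
        subprocess.run(["zpool", "scrub", pool], check=True)
        # `zpool status -x <pool>` prints "pool '<name>' is healthy" when there
        # are no known errors.
        status = subprocess.run(["zpool", "status", "-x", pool],
                                capture_output=True, text=True, check=True).stdout
        print(status.strip())
        return "healthy" in status

    if __name__ == "__main__":
        sys.exit(0 if scrub_and_report(POOL) else 1)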
Some notes: https://lostmsu.github.io/ReFS/
Edit: looks like for the Netherlands (where he lives) this is more significant -- $0.50/kWh is the average price, so ~$32/year.
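The arithmetic behind that figure, as a sketch (the wattages below are assumptions for illustration, not numbers from the article):

    # Annual electricity cost from an assumed average draw.
    # ~$32/year at $0.50/kWh corresponds to about 64 kWh/year, i.e. roughly
    # a 7 W average draw if spread over the whole year.
    PRICE_PER_KWH = 0.50          # the rate quoted above
    HOURS_PER_YEAR = 24 * 365

    def annual_cost(avg_watts):
        kwh_per_year = avg_watts / 1000 * HOURS_PER_YEAR
        return kwh_per_year * PRICE_PER_KWH

    print(round(annual_cost(7.3)))   # ~32, matching the ~$32/year estimate
    print(round(annual_cost(150)))   # ~657, what an always-on 150 W box would cost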
And that no matter how amazing the industrial revolution has been, we can build reliability at the residential level but not the industrial level.
And certainly at the price points.
The whole “at FAANG scale” framing is a misnomer - we aren't supposed to use residential quality (possibly the only quality there is) at that scale - maybe we are supposed to park our cars in our garages and drive them on a Sunday.
Maybe we should keep our servers at home, just like we keep our insurance documents and our notebooks.
The article mentions backups near the end, saying e.g. "most of the data is not important" and that the "most important" data is backed up. Feeling lucky, I guess.
I've given up on striped RAID. Residential use requires easy expandability to keep costs down. Expanding an existing parity-striped RAID setup means failing every drive and slowly replacing it, one by one, with a bigger one while the whole array sits in a degraded state under heavy I/O load. It's easier and safer to build a new array and move the data over, so you pretty much need to buy the entire thing up front, which is expensive.
Btrfs has a flexible allocator which makes expansion easier, but btrfs just isn't trustworthy. I spent years waiting for RAID-Z expansion, only for it to end up being a suboptimal solution that leaves the array in a kind of split-parity state: old data in one layout, new data in another.
It's just so tiresome. Just give up on the "storage efficiency" nonsense. Make a pool of double or triple mirrors instead and call it a day. It's simpler to set up, easier to understand, and more performant. It allows heterogeneous pools of drives, which lowers the risk of systemic failure from bad batches. Gradual expansion is not only possible but actually easy, and doesn't take literal weeks. It avoids loading the entire pool during resilvering after a failure. And it offers so much redundancy that the only way you'll lose data is if your house literally burns down.
https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs...
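To put rough numbers on the mirrors-vs-RAID-Z trade-off, here's a sketch; the drive counts and sizes are made up for illustration:

    # Usable capacity of a pool of 2-way mirrors vs. a single RAID-Z2 vdev.
    # Illustrative only; ignores metadata overhead and rounding.
    def mirror_pool(drive_sizes_tb, copies=2):
        """Pool of N-way mirrors: each mirror contributes its smallest member."""
        groups = [drive_sizes_tb[i:i + copies]
                  for i in range(0, len(drive_sizes_tb), copies)]
        return sum(min(g) for g in groups if len(g) == copies)

    def raidz2_vdev(drive_sizes_tb):
        """Single RAID-Z2 vdev: (N - 2) data drives, each capped at the smallest disk."""
        return (len(drive_sizes_tb) - 2) * min(drive_sizes_tb)

    eight_by_8tb = [8] * 8
    print(mirror_pool(eight_by_8tb))   # 32 TB usable; expand two drives at a time
    print(raidz2_vdev(eight_by_8tb))   # 48 TB usable; growing it classically means
                                       # a whole new vdev or replacing every drive

    # Mirrors also tolerate mismatched pairs, so you can grow with whatever is cheap:
    mixed = [8, 8, 14, 14, 20, 20]
    print(mirror_pool(mixed))          # 42 TB usable from three mismatched pairs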
I think it was used on nVidia Tegra systems, maybe? I'd be interested to find it again, if anyone knows. :)
I run an 8x8TB array with RAID-Z2 redundancy. Initially it was an 8x2TB array, but drives started failing once every 4 months; after 3 drives failed I upgraded the remaining ones.
Only downside to hosting your own is power consumption. OS upgrades have been surprisingly easy.
I know it's not a guarantee of no-corruption, and ZFS without ECC is probably no more dangerous than any other file system without ECC, but if data corruption is a major concern for you, and you're building out a pretty hefty system like this, I can't imagine not using ECC.
Slow on-disk data corruption resulting from gradual and near-silent RAM failures may be like doing regular 3-2-1 backups -- you either mitigate against the problem because you've been stung previously, or you're in that blissful pre-sting phase of your life.
EDIT: I found TFA's link to the original build-out - and happily they are in fact running a Xeon with ECC. Surprisingly it's a 16GB box (I thought ZFS was much hungrier on the RAM:disk ratio). Obviously it hasn't helped with physical disk failures, but the success of the storage array owes a lot to this component.
I've had the (small) SSD in a NAS fail before any of the drives, due to exceeding its TBW rating.
It's not on 24/7.
No mention of I/O metrics or data stored.
For all we know, OP is storing their photos and videos and never actually needs to have 80% of the drives powered on and connected.
A UPS provides more than just that: it delivers steady power without fluctuations and thus makes your hardware last longer.
So they mostly sit idle? Mine are ~4 years old with ~35,000 hours.
4 drives: 42k hours (4.7 years), 27k hours (3 years), 15k hours (1.6 years), and the last drive I don't know, because apparently it doesn't report SMART data.
0 errors according to scrub process.
... but I guess I can't claim 0 HDD failures. There have been 1 or 2, but not for years now. Knock on wood. No data loss because of mirroring; I just can't lose 2 drives in the same pair. (Never run RAID5, BTW - I lost my whole rack doing that.)
I also do not back up photos and videos locally. It's a major headache, and they just take up a crap ton of space when Amazon Prime will give you photo storage for free.
Anecdotally, the only drives that have failed on me were enterprise-grade HDDs, and they all failed within a year, in an always-on system. I also think RAID is over-utilized and, frankly, a big money pit outside of enterprise-level environments.
Polite disagree. Data integrity is the natural expectation humans have from computers, and thus we should stick to filesystems with data checksums such as ZFS, as well as ECC memory.
24 drives. Same model. Likely the same batch. Similar wear. Imagine most of them failing at around the same time, and the rest failing while you're rebuilding, due to the increased load, because they're already near the same point in their lifespan.
Reliable storage is tricky.
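A back-of-the-envelope way to see why correlated wear is scary; the failure rates here are assumptions for illustration, and treating drives as independent (which same-batch drives are not) makes the estimate optimistic:

    # Chance that at least one more surviving drive dies during the rebuild window.
    def p_additional_failure(surviving_drives, annual_failure_rate, rebuild_days):
        p_single = annual_failure_rate * rebuild_days / 365   # per-drive risk during rebuild
        return 1 - (1 - p_single) ** surviving_drives         # 1 - P(nobody else dies)

    # 24-drive array with one dead drive and a 3-day rebuild:
    print(p_additional_failure(23, 0.02, 3))   # ~0.4% at a healthy 2% AFR
    print(p_additional_failure(23, 0.30, 3))   # ~5.5% if the tired batch is at 30% AFR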