Doesn’t this thing look cool as heck? 102 hard drives and 172 TB+ of usable space! This is our storage server project: we wanted something cool, but we didn't want to spend a lot of money on something completely insane.
Well, as far as we’re concerned it can look like crap, as long as it works well.
This is our storage server, at least for the moment. Our needs will evolve over time and so we’ll want flexibility, but we’re doing this companion article for two reasons: One, we’ll detail our setup. Two, and more importantly, we want to detail our thought process as professionals.
Our system is rather formidable to have been made from used, borrowed, surplus and specialty components. We ended up with:
- Dual Xeon X5670, up to 3.33 GHz
- 12 cores (24 threads)
- 192 GB RAM
- 102 disk drives
- 10-gigabit Ethernet
- 4x 1-gigabit Ethernet
- Intel 750 SLOGs
- 2-3 gigabytes/sec streaming performance (internal)
- Wire speed on 10-gigabit Ethernet
…and more. Is there room to upgrade? Plenty! We could do one of those Xeon 2670 or 2690 builds for sure! More PCIe, more Ethernet! More SSDs! But for now, for our needs, we think we’re pretty well set.
Experimentation and real-world testing is a big part of the process here.
There are several things in this project we’d never do “professionally” as part of a “real” business solution. The first and most important of those is buying used spinning mechanical hard drives. This almost never works out, and it can be very hard to find hard drives that are truly new (especially “enterprise”-class drives, which all of these are). We used SMART monitoring attributes to determine the power-on hours of each drive in the enclosures, and found the average power-on time of these drives was just over a year. The design lifetime of mechanical drives is about 5 years – anything more, and you’re on borrowed time.
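The power-on check above boils down to simple arithmetic over SMART attribute 9 (Power_On_Hours), which tools like `smartctl -A` report per drive. A minimal sketch of the averaging, using made-up hour counts rather than our actual drives:

```python
# Sketch: estimate average power-on time across a batch of used drives.
# The hours values below are illustrative, not real readings; in practice
# they come from SMART attribute 9 (Power_On_Hours), e.g. via `smartctl -A`.

HOURS_PER_YEAR = 24 * 365

def average_power_on_years(power_on_hours):
    """Return the mean power-on time, in years, for a list of drives."""
    return sum(power_on_hours) / len(power_on_hours) / HOURS_PER_YEAR

# Illustrative readings from a handful of drives:
sample_hours = [9100, 9800, 8700, 10400, 9300]
print(f"average: {average_power_on_years(sample_hours):.2f} years")  # just over a year
```

With numbers in that ballpark, the whole batch sits comfortably inside the ~5-year design lifetime.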
Something else that would be problematic in a “professional” solution here is that it is generally undesirable for this one machine to be doing so much different stuff. We’re going to get away with it in this special case because, at max, there is a very small number of clients that would be using this system (about five or six at peak). It doesn’t (usually) make sense for your storage server, or archive server, to also be a host for virtual machines, run other network services and otherwise mix workload types. Many details from memory allocation to how the physical disks are arranged to ensure good performance depend heavily on the workload.
What's the hardware?
It isn't without a sense of irony that we've repurposed an old Google Search Appliance, a.k.a. a Dell R710. We've upgraded the processors and RAM. The disk shelves are rebadged LSI SAS3 enclosures with QSFP+ SAS connections. The disk controllers are simple LSI SAS Host Bus Adapters (HBAs) -- no "hardware" RAID controllers here. Curiously, almost as if Google knows what they are doing, the LSI RAID controller that came inside the Dell supported IT/passthrough mode without having to reflash the controller, as would normally have been the case with a Dell R710. It may have been an upgraded controller -- I suspect it is, or was, a PERC H710 originally. This may have been part of Google's OEM specifications.
We've cabled the enclosures and controllers into two groups of two such that each group is connected to each controller and each secondary chassis is daisy-chained off the first chassis. If we can get our hands on some 16-channel LSI SAS HBAs, that would be a great upgrade for us and would probably improve performance. We'll keep our eyes peeled for a good deal on better SAS HBAs.
We had no problem setting up Active/Active connections on our multipath configuration under both FreeBSD and FreeNAS.
What operating system?
We’re running Fedora 25 at the moment; we explored FreeBSD and FreeNAS 10 extensively. We identified a bug in FreeNAS 10 that was preventing our disks from being recognized, and hacked together a patch. That fix was formalized into a real patch by the FreeNAS team in expert fashion. FreeNAS 10 is really something, but we wanted a bit more flexibility. FreeBSD was nearly perfect, but we ran into some trouble trying to utilize 10 gigabit Ethernet inside a Windows Server 2016 virtual machine. In fact, we have only been using Fedora 25 for the last couple of weeks, after a month of testing on FreeBSD and FreeNAS.
Why ZFS? Why not (btrfs, “hardware” raid, etc)?
There isn’t really another filesystem, other than ZFS, that will work for us here. We need integrity and we’re willing to pay a performance penalty for reliability.
We’re using two ZFS pools with Intel 750 NVMe drives for the ZFS SLOG devices (only about 100 GB total for the SLOGs, though). The first ZFS pool is for the video archives; the second is for VM storage and is tuned for better IOPS than the archive pool. The first pool consists of all our external storage (plus NVMe) and the second pool consists of our internal 4 TB of storage (plus NVMe). I am hoping we can find someone willing to loan us 1 TB+ SSDs eventually (when we need them) for the internal storage. For now, this is more than fine. The other benefit of two storage pools is that, until we’re using a significant portion of our storage capacity, the two pools make it easy to destroy and reconfigure the pools themselves in order to experiment – simply copy the data from one pool to the other.
In this setup, the ZFS storage tank is highly portable, even between systems. In fact, the storage tank was originally created on FreeBSD, but we had no problem migrating it between operating systems because ZFS has a very robust design.
Why not just get a nice rack enclosure and add a bunch of disks? Why not use the EqualLogic 6510?
Keeping the enclosures and storage separate from the host computer is highly desirable in case of host failure. We’ve already benefited from that while migrating host operating systems with our existing storage tank. The EqualLogic is also extremely limited – it just does storage, it works best with iSCSI, and anything else is sketchy at best.
It’s possible to use an iSCSI device with multiple client computers – this actually works great – but it requires that the client computers use (and understand) clustered file systems. This works well on macOS and Linux, but not so much on Windows (which isn’t really a huge concern for us – more of an annoyance).
Rather than build a crazy rackmount server with a billion drives, we wanted to ensure that we had a redundant interface to each drive in the array. This means you need a SAS expander backplane that supports redundant paths (sometimes called multipath) to each disk. The enclosures we’re using have both redundant paths and redundant controllers. In effect, it is a switched Serial Attached SCSI (SAS) fabric into which all of the disks plug.
So, for us, a fancy rack enclosure was an added expense and a worse position than external disk shelves in terms of flexibility. The Google server has dual-port redundant SAS controllers. Depending on how we configure ZFS, we could lose an entire shelf without losing the pool. A PCIe RAID controller simply will not work well with a large number of disks like this, and it is folly to expect this of the tiny computer, with its limited RAM, that lives on a RAID card. Can you imagine trying to migrate this many disks to a new RAID controller from a faulted one? I’ve done it; it was harrowing, to say the least.
Another reason the EqualLogic 6510 was not part of our solution is that the software is too much of a black box. The software, while sophisticated, doesn’t offer the same level of data integrity features as ZFS at the nuts-and-bolts level. It’s easy to see why when one considers the built-in controllers, which have limited CPU power and only 2 GB of RAM. That just can’t match the performance and sophistication of software like ZFS running on dual Xeon CPUs with 192 GB of RAM.
ZFS, and ZFS datasets, also have a great many tweaking options, called tunables, for squeezing out peak performance. When you’re dealing with systems like these as an architect, the strategy is to re-arrange the bottlenecks to minimize their impact; nothing is ever bottleneck-free. This can translate into strategies for arranging disks and their paths through the controllers, tuning the ZFS record size, strategies for the Ethernet interfaces, testing various ashift= configurations of the ZFS pool(s), and many other important but obscure parameters. Doing this right depends on having a clear understanding of what it is you are trying to do, and getting that understanding takes time and careful consideration.
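Of those tunables, ashift is one of the simplest to reason about: it sets the pool's minimum allocation size as a power of two, and it is fixed at vdev creation time. A quick sketch of what the values mean:

```python
# Sketch: ashift is the base-2 logarithm of the pool's sector/allocation size.
# ashift=9 means 2**9 = 512-byte sectors (older drives); ashift=12 means
# 2**12 = 4096-byte sectors (modern "Advanced Format" disks). Setting it too
# low on 4K-sector drives forces read-modify-write cycles and hurts performance.

def sector_size(ashift):
    """Sector size in bytes implied by a given ashift value."""
    return 2 ** ashift

for a in (9, 12, 13):
    print(f"ashift={a} -> {sector_size(a)}-byte sectors")
```

Because it cannot be changed after the fact, ashift is exactly the kind of parameter worth testing before committing data to a pool.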
ZFS does have a lot of overhead compared to “normal” file systems like XFS, NTFS and ext4, which means there is a bit of a performance penalty versus raw hardware. The goal, as an architect, was to tune the array and its redundancy to clear at least 2 gigabytes per second of streaming performance in order to keep up with 20 gigabits of Ethernet connectivity.
A ZFS pool consists of one or more VDEVs, and each VDEV is responsible for its own redundancy. More VDEVs generally means better performance because reads and writes are striped across VDEVs. A VDEV can be a mirror, or RaidZ1 through RaidZ3 (with an arbitrary number of drives). The loss of a single VDEV will lead to the loss of the entire ZFS pool, which is bad. However, since we have so many identical enclosures, and our storage requirements are modest at the start of our little project, we can do things like adding extra mirrors for more speed/IOPS, or simply turning off the shelves we don’t need. Adding them later works out well because the additional hardware would form VDEVs identical to the existing ones. (There would be some headache if we wanted to rebalance the data evenly across all the VDEVs in the enlarged pool, but it is manageable.)
A popular notion I’ve noticed among “ZFS fellows” is that you want a power-of-two number of data disks in each VDEV (plus the redundancy disks), but we found this didn’t really matter much in our real-world testing and experiments.
There are three interrelated metrics when speaking of storage performance: IOPS (I/O operations per second), streaming performance and latency. Video storage and retrieval is mostly about streaming performance, which measures how fast data can be transferred to and from the storage medium. Mechanical hard drives can service one I/O operation at a time and have latency typically measured in milliseconds; the lower the latency, the more I/O operations can be completed in a unit of time. SSDs, however, have latency typically measured in microseconds, and SSDs (at least good ones) can service more than one I/O operation at a time.
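The IOPS/latency relationship above can be sketched in a couple of lines: for a device that handles one operation at a time, IOPS is simply the reciprocal of per-operation latency, and parallelism multiplies it. The latency and queue-depth figures below are rough illustrative assumptions:

```python
# Sketch: IOPS from latency. A one-at-a-time device does 1/latency operations
# per second; a queue-capable SSD multiplies that by its effective parallelism.

def iops(latency_seconds, queue_depth=1):
    """Approximate IOPS from per-operation latency and effective parallelism."""
    return queue_depth / latency_seconds

hdd = iops(0.008)        # ~8 ms seek + rotate: on the order of 125 IOPS
ssd = iops(0.0001, 8)    # ~100 us latency, 8 ops in flight: tens of thousands
print(f"HDD ~{hdd:.0f} IOPS, SSD ~{ssd:.0f} IOPS")
```

This is why a pool built for IOPS (VMs) and a pool built for streaming (archives) want different disk arrangements.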
For our setup, we experimented with three Intel 750 NVMe SSDs (400 GB and 1.2 TB) and Samsung 850 Evo SSDs. We ended up using a couple of small slices totaling 100 GB on the 750 NVMe drives as our SLOG, which is perhaps unusually large for a SLOG, but it is tuned to our specific use case and will work fine for us. We found it unnecessary to use the NVMe drives or Samsung 850s as L2ARC cache devices, opting instead to tune the ZFS module to allow it to use more RAM. Only about 16 GB of RAM is used for virtual machines and other tasks.
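To see why 100 GB is generous for a SLOG, consider what it has to hold: only a few transaction groups' worth of in-flight synchronous writes before they are committed to the pool. A sketch of the sizing math, assuming the common 5-second txg flush interval and treating the write rate and txg count as illustrative worst-case inputs:

```python
# Sketch: a SLOG only needs to buffer the synchronous writes of the
# transaction groups currently in flight. Assuming ZFS flushes a txg every
# 5 seconds (a common default) and a worst-case 2 GB/s of sync writes:

def slog_size_needed_gb(write_rate_gb_s, txg_seconds=5, txgs_in_flight=2):
    """Upper-bound SLOG capacity: in-flight txgs times worst-case write rate."""
    return write_rate_gb_s * txg_seconds * txgs_in_flight

print(slog_size_needed_gb(2.0), "GB")  # 2 GB/s * 5 s * 2 txgs = 20 GB
```

By that estimate, even a firehose of fully synchronous writes needs a fraction of the 100 GB we carved out, which is why we call the size generous rather than necessary.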
What’s the workflow?
We expect our workflow to be the following: We collect footage from the camera memory cards. Our editing computer has a fast NVMe disk that holds a few projects. We intend to transcode the 4k footage to something like 1080p in the CineForm codec. Whether the video project is in Adobe Premiere, Kdenlive or even DaVinci Resolve, we expect to be able to use a proxy workflow. In a proxy workflow, if you aren’t familiar, editing is done against a low-compression, lower-resolution codec (which dramatically speeds things up – ask Dimitri). When the video is rendered, the original 4k high-res footage is substituted transparently (ideally – sometimes there are bugs) and the render works well.
In our tests, rendering from NVMe footage + proxies versus rendering with the original footage stored on the network made a negligible difference in render speed.
In a nutshell, our high-speed proxy footage lives on high-speed NVMe -- 2x-3x as fast as 10 gigabit Ethernet, with extremely low latency -- and that proxy footage is backed by the 4k footage stored on this server. The 10 gigabit connection, at ~1 gigabyte/sec actual speed, is fast enough not to bottleneck the render process, which is (for now, at least on our editing rig) CPU bound.
What about power?
With all four enclosures running, we're using about 1,800 watts of power. With only two enclosures powered on, we use a bit less than 1,000 watts. At the higher figure, it costs about $100 per month in electricity to operate. We do have the option of spinning down the drives when the system is not in use/during off hours, and we may do that to reduce power utilization by 75-80%. That cost is a bit steep, but we think it is worth it.
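The ~$100/month figure checks out with straightforward kilowatt-hour math; the electricity rate below is an assumption for illustration (actual utility rates vary quite a bit by region):

```python
# Sketch of the electricity math: 1,800 W around the clock, at an assumed
# rate of $0.075 per kWh and ~730 hours in a month.

def monthly_cost(watts, dollars_per_kwh=0.075, hours=730):
    """Monthly energy cost: kilowatts x hours in a month x rate per kWh."""
    return watts / 1000 * hours * dollars_per_kwh

print(f"${monthly_cost(1800):.2f}/month")  # lands right around $100
```

The same function also shows what spinning down buys: cutting draw by 75-80% during off hours takes a proportional bite out of that bill.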
What other cool features do you have with this setup?
We intend to run a Steam cache server. This means that games can be installed on benchmark computers at wire speed (anywhere from 1 gigabit to 10 gigabit). It is as if we have a 10 gigabit connection to Steam; it is great.
We also intend to run a PXE boot server that can be used to deploy and image Windows onto benchmark systems. That means we could deploy a fresh install of Windows in ~10 minutes on new hardware. This is much less Sisyphean for us than constantly reinstalling Windows for testing.
Of course local mirrors of our favorite Linux distributions are even easier, so we can also deploy Linux from the network for testing in a matter of minutes as well.
We’re hoping that this platform will be the basis for interesting future videos as well. Perhaps we can do videos on the Steam cache server, the Windows boot server, and projects like that?
Thanks to our Patrons for making it possible to purchase the few new bits of hardware we needed to “thread the needle” on this project. I feel like our team did really well finding a good deal, and leveraging our expertise to build something innovative, but also not too elaborate or wasteful.
What about backups?
Backups are an important item on the to-do list. Until we’ve got more than about 30 terabytes, we’re using ZFS's built-in capabilities to back up key data. Obviously it isn’t necessary to back up the Steam game cache, Windows virtual machines, etc. -- just the configuration files and our original videos and project files. ZFS makes it easy to sync differential data to another system running elsewhere.
It may even be possible for us to do a differential sync of the important video files offsite, via the internet, because we only have one to two video shooting days per week.
Time will tell how good of a platform our solution ends up being as the foundation for Level1’s storage and automation needs.