Over the past few months you have worked with SQL, interacted with data via Python and R, and looked at the impacts of indexing. It is now worth discussing the hardware that runs these applications and stores the data as well as how we ensure the integrity of the data stored on the machines and how we reduce risk by working to protect ourselves from hardware failure.
We won't spend too much time on the history of drives, so we will start with what is actually in use today. One of the most common interfaces is SATA (Serial Advanced Technology Attachment), which was introduced in 2003. This is the interface used to connect storage devices such as hard disk drives (HDDs), optical drives, and solid state drives (SSDs) to the motherboards of servers and desktops.
With SATA, there are three revisions you may encounter, with SATA III being the most common today. The largest difference between the revisions is the speed of the interface.
| Standard | Bandwidth* | Data Transfer Speed |
|---|---|---|
| SATA I | 1.5 Gb/sec | 150 MB/sec |
| SATA II | 3 Gb/sec | 300 MB/sec |
| SATA III | 6 Gb/sec | 600 MB/sec |
* Note that a gigabit (Gb) is not the same as a gigabyte (GB). 1 GB = 8 Gb.
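Dividing the bandwidth column by 8 does not quite give the data transfer column (1.5 Gb/sec divided by 8 would be 187.5 MB/sec, not 150). The difference comes from SATA's 8b/10b line encoding, where every 8 bits of data are sent as 10 bits on the wire. The sketch below reproduces the table under that assumption; the function name `sata_payload_mb_per_s` is made up for illustration.

```python
def sata_payload_mb_per_s(line_rate_gbit_per_s):
    """Convert a SATA line rate in Gb/sec to usable MB/sec.

    Assumes 8b/10b encoding: 10 bits on the wire carry 8 bits of data.
    """
    wire_bits_per_s = line_rate_gbit_per_s * 1e9
    data_bits_per_s = wire_bits_per_s * 8 / 10   # strip encoding overhead
    return data_bits_per_s / 8 / 1e6             # bits -> bytes -> megabytes


for name, rate in [("SATA I", 1.5), ("SATA II", 3.0), ("SATA III", 6.0)]:
    print(f"{name}: {sata_payload_mb_per_s(rate):.0f} MB/sec")
```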
Hard Disk Drives (HDDs) and Solid State Drives (SSDs) are the two primary types of storage devices you are likely to encounter. Let's compare:
Since most of our systems will exist in data centers far away from the users, we will probably want to focus most of our thoughts on cost and speed. Reliability is also really important, but we actually have other ways of dealing with that!
Drive pooling allows you to combine multiple physical storage drives into a single logical storage pool. This pool appears as a single, unified storage volume to the operating system, making it easier to manage. Drive pooling is often used in situations where you have multiple hard drives or SSDs and want to create a larger, more flexible storage solution.
One of the best parts of pooling drives together is that we can use many smaller, less expensive, and/or slower drives to create a single and possibly faster volume. Once the pool is created, the data is distributed across the pooled drives. This can be done in various ways, including:
RAID, or Redundant Array of Independent (sometimes called inexpensive) Disks, is a method that combines multiple physical hard drives into a single logical unit to enhance data storage, performance, and redundancy. RAID configurations are commonly used in server environments and high-capacity storage systems. The way RAID works depends on the specific RAID level, each of which offers different features and trade-offs. Here's an overview of how RAID works:
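One common way a RAID level can provide redundancy is parity: an extra block is computed from the data blocks so that any single missing block can be rebuilt. The following is only a toy sketch of single XOR parity (in the style of RAID 5), not how a real controller is implemented; the block contents and the helper name `xor_blocks` are invented for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per data drive.
data_blocks = [b"SQL!", b"RAID", b"NVMe"]

# The parity block would be stored on an additional drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block from
# the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

Because XOR is its own inverse, combining the survivors with the parity block cancels out everything except the lost block.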
We talked about the SATA interface earlier, specifically because SATA and SAS
are the most common drive interfaces in storage systems that use hardware RAID. If
we were to pool together 24 SATA drives, you could imagine that the throughput
could become as high as 24 * 600 MB/s = 14.4 GB/s. But in
practice you are unlikely to achieve these ideal numbers.
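The back-of-the-envelope number above can be written as a small helper. This is a sketch of the ideal case only: it assumes every drive saturates its SATA III link at the same time, which real arrays do not do. The function name is made up for illustration.

```python
SATA3_MB_PER_S = 600  # SATA III data transfer speed from the table above


def ideal_pool_throughput_gb_per_s(drive_count, per_drive_mb_per_s=SATA3_MB_PER_S):
    """Upper bound on pool throughput: every drive running at full link speed."""
    return drive_count * per_drive_mb_per_s / 1000


print(ideal_pool_throughput_gb_per_s(24))
```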
Let's do a quick comparison between hard drives and solid state drives again:
What we see in practice is that we can take a large number of inexpensive drives
and, by using RAID, increase the overall speed of the array and provide some
redundancy in the event of failure. If we were using an array of hard drives, it
would be common to see numbers on the order of 1-2 GB/s for a RAID with 20
hard drives, and closer to 5 GB/s for a similar number of SSDs.
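To put those array figures in perspective, it can help to divide them back down to a per-drive share and compare that with the 600 MB/s SATA III ceiling. A minimal sketch using the rough figures quoted above; the labels and the function name are illustrative only.

```python
SATA3_MB_PER_S = 600  # SATA III data transfer speed


def per_drive_mb_per_s(array_gb_per_s, drive_count):
    """Average share of the array's throughput contributed by each drive."""
    return array_gb_per_s * 1000 / drive_count


# Rough array throughputs (GB/s) for about 20 drives, as quoted in the text.
observed = {"HDD array (low)": 1.0, "HDD array (high)": 2.0, "SSD array": 5.0}

for label, gb_per_s in observed.items():
    share = per_drive_mb_per_s(gb_per_s, 20)
    print(f"{label}: ~{share:.0f} MB/s per drive, "
          f"{share / SATA3_MB_PER_S:.0%} of the SATA III limit")
```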
Typically we would use hardware RAID controllers that have battery-backed memory.
These cards cache the data that is waiting to be written to disk.
This provides a bit more speed, while also offering some protection against significant
events like power failures, because that data will remain on the card for up to 24 hours
and allow administrators to fix larger problems without losing data.
We have looked at Solid State Drives that are connected via SATA/SAS interfaces, but what would happen if we were to connect a solid state drive directly to the main bus of a computer?
Modern computers use the PCIe (Peripheral Component Interconnect Express) interface to connect hardware components such as network or video cards via a direct link to the CPU. Each PCIe slot is connected to the CPU (or a similar controller chip) via a number of lanes. For reference, the three primary PCIe standards in use now, PCIe versions 3, 4, and 5, have speeds of roughly 1 GB/s, 2 GB/s, and 4 GB/s per lane. Most slots are rated at x1, x4, x8, or x16, meaning that they have 1, 4, 8, or 16 lanes available.
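Since slot bandwidth is just the per-lane speed times the lane count, the combinations can be tabulated quickly. A small sketch using the rounded per-lane figures from the paragraph above; the dictionary and function names are made up for illustration.

```python
# Rounded per-lane speeds in GB/s, keyed by PCIe version.
PER_LANE_GB_PER_S = {3: 1, 4: 2, 5: 4}
SLOT_WIDTHS = (1, 4, 8, 16)


def slot_bandwidth_gb_per_s(pcie_version, lanes):
    """Approximate one-direction bandwidth of a slot with the given lane count."""
    return PER_LANE_GB_PER_S[pcie_version] * lanes


for version in PER_LANE_GB_PER_S:
    row = ", ".join(
        f"x{lanes}: {slot_bandwidth_gb_per_s(version, lanes)} GB/s"
        for lanes in SLOT_WIDTHS
    )
    print(f"PCIe {version} -> {row}")
```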
What if we connected the flash storage within a solid state drive directly via
a PCIe interface? This is actually the purpose of NVMe (Non-Volatile Memory Express).
An NVMe drive typically uses an x4 connection over PCIe 3, 4, or 5. Let's just do some tabletop
math before we move forward! That's 4 x 1 GB/s, 4 x 2 GB/s, or 4 x 4 GB/s
per drive! In practice the rates are closer to 3500 MB/s, 7500 MB/s, and 13500 MB/s,
but again, this is per drive.
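The tabletop math can be checked against the practical figures, and against a single SATA III drive at 600 MB/s. This is a sketch that uses only the rounded numbers quoted in the text; the names are illustrative.

```python
NVME_LANES = 4
SATA3_MB_PER_S = 600

# Rounded per-lane speeds and typical real-world drive rates, in MB/s.
PER_LANE_MB_PER_S = {3: 1000, 4: 2000, 5: 4000}
TYPICAL_MB_PER_S = {3: 3500, 4: 7500, 5: 13500}


def nvme_summary(pcie_version):
    """Return (theoretical MB/s, typical MB/s, fraction of theoretical achieved)."""
    theoretical = NVME_LANES * PER_LANE_MB_PER_S[pcie_version]
    typical = TYPICAL_MB_PER_S[pcie_version]
    return theoretical, typical, typical / theoretical


for version in (3, 4, 5):
    theoretical, typical, fraction = nvme_summary(version)
    print(f"PCIe {version}: {theoretical} MB/s on paper, {typical} MB/s typical "
          f"({fraction:.0%}), about {typical / SATA3_MB_PER_S:.1f}x one SATA III drive")
```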
Currently with these drive setups, we can achieve considerably faster input/output operations, but it becomes much harder to use hardware RAID. Typically with NVMe storage, software RAID is used to combine the drives at the expense of CPU usage.
The Computer Science Department's main file server uses hardware RAID6 across nine 4TB SATA SSDs, while our virtualization servers use 24 NVMe drives.
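RAID6 sets aside two drives' worth of space for parity, so usable capacity is roughly the drive size times the number of drives minus two. A minimal sketch for the file server described above; the function name is made up, and the result ignores filesystem overhead and the difference between vendor terabytes and the binary units that tools such as df report.

```python
def raid6_usable_tb(drive_count, drive_size_tb):
    """Approximate usable capacity of a RAID6 array (two drives' worth of parity)."""
    if drive_count < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (drive_count - 2) * drive_size_tb


print(raid6_usable_tb(9, 4))
```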
The /home drive was constructed from 9 drives in RAID6. This drive
is mounted as /home on every lab workstation as well as ssh.cs.usna.edu
and desktop.cs.usna.edu.
df -BG
Filesystem 1G-blocks Used Available Use% Mounted on
tmpfs 2G 1G 2G 1% /run
/dev/sda2 234G 41G 181G 19% /
tmpfs 8G 1G 8G 1% /dev/shm
tmpfs 1G 1G 1G 1% /run/lock
/dev/sda1 1G 1G 1G 2% /boot/efi
nfs.cs.usna.edu:/home ?????G ???G ?????G ??% /home
Given that /home is built from 9 drives in RAID6, what is the size
of each of the 9 drives?
Assume each drive can sustain writes of 500 MB/s, and let's imagine that
our RAID controller adds no latency and can stripe across the data drives perfectly.
What is the maximum possible write speed to this array?
Filesystem 1G-blocks Used Available Use% Mounted on
backups.cs.usna.edu:/snapshot/shared 204081G 13097G 190984G 7% /backups
This backup server has an array of 12 hard disk drives (with about 150 MB/s sustained write speeds) in RAID6.
What are the sizes of the individual drives, and what's the total possible throughput of the system?
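For questions like these, the same two relationships keep coming up: RAID6 leaves (drives - 2) drives' worth of space for data, and under perfect striping only those data drives contribute to write speed. The helpers below are a sketch under those idealized assumptions. The function names are made up, and the example numbers are hypothetical and deliberately not taken from either server above.

```python
def raid6_drive_size_gb(usable_gb, drive_count):
    """Estimate per-drive size from usable capacity (two drives' worth is parity)."""
    return usable_gb / (drive_count - 2)


def raid6_ideal_write_mb_per_s(drive_count, per_drive_mb_per_s):
    """Ideal write speed if data stripes perfectly across the data drives."""
    return (drive_count - 2) * per_drive_mb_per_s


# Hypothetical array: 6 drives, 8000G usable, 200 MB/s sustained writes per drive.
print(raid6_drive_size_gb(8000, 6))
print(raid6_ideal_write_mb_per_s(6, 200))
```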