Physical Storage

Classification of Physical Storage Media

  • volatile storage: loses contents when power is switched off
  • non-volatile:
    • contents persist even when power is off
    • includes secondary/tertiary storage, as well as battery-backed-up main memory

Factors affecting choice of storage media include:

  • speed with which data can be accessed
  • cost per unit of data
  • reliability

Storage Hierarchy

  • Primary: fastest media but volatile (cache, main memory)
  • Secondary: next level in hierarchy, non-volatile, moderately fast access time
    • Also called on-line storage
    • E.g., flash memory, magnetic disks
  • Tertiary: lowest level in hierarchy, non-volatile, slow access time
    • also called off-line storage and used for archival storage
    • e.g., magnetic tape, optical storage
    • Magnetic tape
      • Sequential access; capacities of 1 to 12 TB
      • A few drives with many tapes
      • Juke boxes with petabytes (1000s of TB) of storage.

Storage Interfaces

Disk interface standards families:

  • SATA (Serial ATA)
    • SATA 3 supports data transfer speeds of up to 6 gigabits/sec
  • SAS (Serial Attached SCSI)
    • SAS Version 3 supports 12 gigabits/sec
  • NVMe (Non-Volatile Memory Express) interface
    • works with PCIe connectors to support lower latency and higher transfer rates
    • supports data transfer rates of up to 24 gigabits/sec

Disks are usually connected directly to the computer system.

In a Storage Area Network (SAN), a large number of disks are connected by a high-speed network to a number of servers. In Network Attached Storage (NAS), networked storage provides a file system interface using networked file system protocols, instead of providing a disk system interface.

Magnetic Disks

  • Read-write head
  • Surface of platter divided into circular tracks
    • Typically 50K to 100K tracks per platter on hard disks
  • Each track is divided into sectors
    • A sector is the smallest unit of data that can be read or written
    • Sector size typically 512 bytes
    • Typical sectors per track: 500 to 1,000 on inner tracks, 1,000 to 2,000 on outer tracks
  • To read/write a sector
    • disk arm swings to position head on right track
    • platter spins continually; data is read/written as sector passes under head
  • Head-disk assemblies
    • multiple disk platters on a single spindle (1 to 5 usually)
    • one head per platter, mounted on a common arm
  • Cylinder i consists of the i-th track of all the platters
  • Disk controller: interfaces between the computer system and the disk drive hardware
    • accepts high-level commands to read or write a sector
    • initiates actions such as moving the disk arm to the right track and actually reading or writing the data
    • computes and attaches checksums to each sector to verify that data is read back correctly
      • if data is corrupted, with very high probability stored checksum won't match recomputed checksum
    • ensures successful writing by reading back sector after writing it
    • performs remapping of bad sectors
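As a rough sketch of the controller's checksum step, the snippet below appends a CRC-32 to each 512-byte sector on write and verifies it on read. The helper names and the choice of CRC-32 are illustrative assumptions; real controllers use stronger error-correcting codes in firmware.

```python
import os
import zlib

SECTOR_SIZE = 512  # bytes of user data per sector (typical value from above)

def write_sector(data: bytes) -> bytes:
    """Append a CRC-32 checksum to the sector payload before storing it."""
    assert len(data) == SECTOR_SIZE
    checksum = zlib.crc32(data)
    return data + checksum.to_bytes(4, "little")

def read_sector(stored: bytes) -> bytes:
    """Recompute the checksum on read; a mismatch signals a corrupted sector."""
    data = stored[:SECTOR_SIZE]
    stored_crc = int.from_bytes(stored[SECTOR_SIZE:], "little")
    if zlib.crc32(data) != stored_crc:
        raise IOError("sector corrupted: checksum mismatch")
    return data

payload = os.urandom(SECTOR_SIZE)
stored = write_sector(payload)
assert read_sector(stored) == payload              # clean read succeeds
corrupted = bytes([stored[0] ^ 0xFF]) + stored[1:]
try:
    read_sector(corrupted)                         # flipped bits are detected
except IOError as e:
    print(e)
```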

Performance Measures of Disks

Access time is the time it takes from when a read or write request is issued to when data transfer begins. Consists of:

  • Seek time, the time it takes to reposition the arm over the correct track
    • Average seek time is 1/2 the worst-case seek time
    • 4-10 ms on typical disks
  • Rotational latency, the time it takes for the sector to be accessed to appear under the head
    • 4-11 ms on typical disks (5400 to 15000 r.p.m)
    • Average rotational latency is 1/2 of the worst-case (full-rotation) time above
  • Overall latency is 5 to 20 ms depending on the disk model

Data transfer rate, the rate at which data can be retrieved from or stored to the disk.

  • 25 to 200 MB/s max rate

Disk block is a logical unit for storage allocation and retrieval, typically 4 to 16 KB.

  • Smaller blocks mean more transfers from disk
  • Larger blocks waste more space due to partially filled blocks

Sequential access pattern issues successive requests for successive disk blocks. A disk seek is required only for the first block.

Random access pattern uses successive requests for blocks that can be anywhere on disk. Each access requires a seek. Transfer rates are low since a lot of time is wasted in seeks.

I/O operations per second (IOPS) is the number of random block reads that a disk can support per second: 50 to 200 IOPS on current-generation magnetic disks.
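A back-of-the-envelope calculation ties these measures together; the seek time, rotation speed, transfer rate, and block size below are assumed example values, not figures for any particular drive.

```python
seek_ms = 4.0                  # assumed average seek time
rpm = 10_000
rotation_ms = 60_000 / rpm     # time for one full rotation (6 ms)
avg_rot_latency_ms = rotation_ms / 2
transfer_mb_s = 100.0          # assumed sustained transfer rate
block_kb = 4

block_transfer_ms = block_kb / 1024 / transfer_mb_s * 1000
random_access_ms = seek_ms + avg_rot_latency_ms + block_transfer_ms
iops = 1000 / random_access_ms

print(f"random access time ~ {random_access_ms:.2f} ms -> ~ {iops:.0f} IOPS")
print(f"random throughput  ~ {iops * block_kb / 1024:.2f} MB/s vs "
      f"{transfer_mb_s:.0f} MB/s sequential")
```

With these numbers a random 4 KB read costs about 7 ms, giving roughly 140 IOPS and well under 1 MB/s of random throughput, compared to 100 MB/s for sequential access.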

Mean time to failure (MTTF), the average time the disk is expected to run continuously without any failure.

  • Typically 3 to 5 years
  • Probability of failure of new disks is quite low, corresponding to a “theoretical MTTF” of 500,000 to 1,200,000 hours for a new disk
    • E.g., an MTTF of 1,200,000 hours for a new disk means that, given 1000 relatively new disks, on average one will fail every 1200 hours
  • MTTF decreases as disk ages

RAID

Redundant Arrays of Independent Disks (RAID) are disk organization techniques that manage a large number of disks, providing a view of a single disk with

  • high capacity and high speed, by using multiple disks in parallel
  • high reliability by storing data redundantly, so that data can be recovered even if a disk fails

The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.

  • E.g., a system with 100 disks, each with an MTTF of 100,000 hours, will have a system MTTF of only 1,000 hours
  • Techniques for using redundancy to avoid data loss are critical with large numbers of disks
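The example above follows from the usual independent-failures approximation, under which the failure rate of the array is the sum of the per-disk failure rates:

$$\text{MTTF}_{\text{array}} \approx \frac{\text{MTTF}_{\text{disk}}}{N} = \frac{100{,}000 \text{ hours}}{100} = 1{,}000 \text{ hours} \approx 42 \text{ days}$$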

Improvement of Reliability via Redundancy

  • Redundancy: store extra information that can be used to rebuild information lost in a disk failure
  • Mirroring (shadowing)
    • duplicate every disk; a logical disk consists of two physical disks
    • every write is carried out on both disks
      • reads can take place from either disk
    • if one disk in a pair fails, data still available in the other
      • data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
        • probability of combined event is very small, except for dependent failure modes such as fire or building collapse or electrical power surges
  • Mean time to data loss depends on mean time to failure, and mean time to repair
    • MTTF of 100,000 hours and mean time to repair of 10 hours give a mean time to data loss of 500 * 10^6 hours (or about 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
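The figure above comes from the standard mirrored-pair approximation, assuming independent failures and that data is lost only if the second disk fails within the repair window of the first:

$$\text{MTTDL} \approx \frac{\text{MTTF}^2}{2 \cdot \text{MTTR}} = \frac{(100{,}000)^2}{2 \cdot 10} = 5 \times 10^{8} \text{ hours} \approx 57{,}000 \text{ years}$$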

Improvement in Performance via Parallelism

  • Two main goals of parallelism in a disk system:
    1. Load balance multiple small accesses to increase throughput
    2. Parallelize large accesses to reduce response time
  • Improve transfer rate by striping data across multiple disks
  • Bit-level striping, split the bits of each byte across multiple disks
    • in an array of eight disks, write bit i of each byte to disk i
    • each access can read data at eight times the rate of a single disk
    • but seek/access time worse than for a single disk
      • bit level striping is not used much any more
  • Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1
    • Requests for different blocks can run in parallel if the blocks reside on different disks
    • A request for a long sequence of blocks can utilize all disks in parallel
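A minimal sketch of the block-to-disk mapping just described; the function name and the within-disk stripe offset are illustrative assumptions.

```python
def block_location(block_num: int, n_disks: int) -> tuple[int, int]:
    """Map logical block i to (disk, stripe) under block-level striping.

    Disks are numbered 1..n, matching the (i mod n) + 1 formula above;
    the stripe index says which row of the striping pattern the block
    lands in on that disk.
    """
    disk = (block_num % n_disks) + 1
    stripe = block_num // n_disks
    return disk, stripe

# With 4 disks, blocks 0..7 of a file are spread round-robin:
for i in range(8):
    print(f"block {i} -> disk {block_location(i, 4)[0]}")
```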

RAID Levels

These schemes provide redundancy at lower cost by combining disk striping with parity bits. Different RAID organizations, or RAID levels, have differing cost, performance, and reliability characteristics.

  • RAID level 0:
    • Block striping; non-redundant
    • Used in high-performance applications where data loss is not critical
  • RAID level 1:
    • Mirrored disks
    • Offers best write performance
    • Popular for applications such as storing log files in a database system
  • Parity blocks (see the parity sketch after this list)
    • parity block j stores the XOR of bits from block j of each disk
    • when writing data to a block j, parity block j must also be computed and written to disk
      • can be done using the old parity block, the old value of the current block, and the new value of the current block (2 block reads + 2 block writes)
      • or by recomputing the parity value from the new values of all blocks corresponding to the parity block
      • the latter is more efficient for writing large amounts of data sequentially
    • to recover the data of a block, compute the XOR of bits from all other blocks in the set, including the parity block
  • RAID level 5:
    • block-interleaved distributed parity
    • partitions data and parity among all N + 1 disks, rather than storing data on N disks and parity on 1 disk
    • block writes occur in parallel if the blocks and their parity blocks are on different disks
  • RAID level 6:
    • P+Q redundancy
    • similar to level 5, but stores two error correction blocks (P, Q) instead of a single parity block, to guard against multiple disk failures
      • better reliability than Level 5 at a higher cost
        • becoming more important as storage sizes increase
  • Other levels
    • RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.
    • RAID Level 3: Bit-Interleaved Parity
    • RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate parity disk for corresponding blocks from N other disks.
      • RAID 5 is better than RAID 4, since with random writes under RAID 4 the parity disk gets a much higher write load than the other disks and becomes a bottleneck
  • Factors in choosing a RAID level
    • Monetary cost
    • Performance: Number of I/O operations per second, and bandwidth during normal operations
    • Performance during failure
    • Performance during rebuild of failed disk
      • including time taken to rebuild failed disk
  • RAID 0 is used only when data safety is not important
    • e.g. data can be recovered quickly from other sources
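A toy sketch of the parity mechanics referenced in the RAID-level list above; the function names are illustrative assumptions, and real implementations operate on full disk blocks inside the controller or the kernel rather than on small byte strings.

```python
def parity(blocks: list[bytes]) -> bytes:
    """Parity block: bytewise XOR of the corresponding data blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def update_parity(old_parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    """Small-write update: new parity = old parity XOR old data XOR new data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

def recover(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Rebuild a lost block by XORing the surviving blocks with the parity."""
    return parity(surviving + [parity_block])

blocks = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
p = parity(blocks)
assert recover([blocks[0], blocks[2]], p) == blocks[1]   # lost block rebuilt
new_b1 = bytes([9, 9, 9])
assert update_parity(p, blocks[1], new_b1) == parity([blocks[0], new_b1, blocks[2]])
```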

Choice of RAID Level

  • Level 1 provides much better write performance than level 5
    • Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
  • Level 1 has higher storage cost than level 5
  • Level 5 is preferred for applications where writes are sequential and large (many blocks), and that need large amounts of data storage
  • RAID 1 is preferred for applications with many random/small updates
  • Level 6 gives better data protection than RAID 5 since it can tolerate two disk (or disk block) failures
    • Increasing in importance, since latent block failures on one disk, coupled with the failure of another disk, can result in data loss with RAID 1 and RAID 5

Hardware Issues

  • Software RAID: RAID implementations done entirely in software, with no special hardware support
  • Hardware RAID: RAID implementations with special hardware
    • Use non-volatile RAM to record writes that are being executed
    • Beware: power failure during write can result in corrupted disk
      • E.g., failure after writing one block but before writing the second in a mirrored system
      • Such corrupted data must be detected when power is restored
      • Recovery from corruption is similar to recovery from failed disk
      • NV-RAM helps to efficiently detect potentially corrupted blocks
      • Otherwise all blocks of disk must be read and compared with mirror/parity block
  • Latent failures: data successfully written earlier gets damaged
    • can result in data loss even if only one disk fails
  • Data scrubbing:
    • continually scan for latent failures, and recover from copy/parity
  • Hot swapping: replacement of disk while system is running, without power down
    • Supported by some hardware RAID systems
    • Reduces time to recovery and improves availability greatly
  • Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
    • Reduces time to recovery greatly
  • Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
    • Redundant power supplies with battery backup
    • Multiple controllers and multiple interconnections to guard against controller/interconnection failures