-
-
-
-
- Benchmarking Storage Systems
-
- White Paper
- by
- Ohad I. Unna
-
- DynaTek Automation Systems Inc.
-
-
-
-
- September, 1992
-
-
-
-
- Benchmarking Storage Systems
-
- Abstract
-
- Existing disk benchmarking programs fall short in measuring the
- performance of the storage subsystem as a whole, in the sense of
- missing the forest for the trees. In order to obtain a meaningful
- performance index of a storage subsystem that is congruent with the
- user-perceived usefulness of that system, a benchmark has to measure
- the performance in an application-level manner, simulating real-life
- computing environments. A method for constructing such a benchmark is
- presented in this paper.
-
-
- Foreword
- or
- Who needs another disk benchmarking program?
-
- It is said that if you have a watch you know what time it is, but if
- you have two -- you are never sure. If you have half a dozen disk
- benchmarking utilities, you probably have six different sets of
- results, and there's a good chance none of these results bear any
- resemblance to the way your system performs under its normal workload.
-
- Back in the old days (1982-3) things were pretty simple - PC hard disk
- drives were invariably spinning at 3600 revolutions per minute, almost
- always had 17 sectors per track, and built-in bad-sector remapping and
- multi-megabyte cache were practically unheard of in the microcomputer
- world. When the IDE and SCSI disk interfaces were introduced to the PC
- arena, things got really complicated; disk drive vendors started
- employing sector-mapping schemes in order to avoid the 1024-cylinder
- restriction, controllers were enhanced with a large onboard cache, and
- on-the-fly disk compression adapters and software became more and more
- popular, rendering most of the old disk benchmarking software useless.
-
- If you are not familiar with terms such as "sector-mapping" and
- "compression adapters," don't worry. As a user, you shouldn't really
- care whether your drive has "skewed-track interleave" or "low BIOS
- overhead latency," as long as you get good performance out of it.
-
-
- Application Level Benchmarks
- or
- Why "your mileage may vary"
-
- Have you ever been tempted to buy one of those new "effective access
- time: 0.5 milliseconds" disk drives? Look before you leap: obtaining
- such a figure for a current technology drive means the drive
- manufacturer is using a less-than-adequate disk benchmarking program.
- Such a program can easily be fooled by the presence of a small onboard
- cache; by continuously reading and writing data to the same area of
- the disk, the benchmark merely updates the RAM on the disk controller,
- while very little disk activity actually takes place. Updating a few
- kilobytes of RAM in one half of a millisecond isn't that impressive,
- and this will become evident as soon as you start running real-life
- applications. If you are using a software disk cache, as is now
- standard on most operating systems, you will hardly notice the
- presence of an additional 256KB controller-based cache, unless you
- have a nonstandard configuration or device.
-
- Actual storage system performance depends on an assortment of
- parameters: bus bandwidth, cache strategy, scatter-gather algorithm,
- cluster size and fragmentation, to name a few. None of these
- parameters are influenced by the physical characteristics of the
- specific hard drive used, and each of them has more impact on overall
- storage system performance than the proverbial "average seek time."
-
- The more advanced disk benchmarking programs try to eliminate the
- effects of operating-system-level parameters by communicating
- directly with the disk. Unfortunately, results obtained this way
- are about as
- useful as knowing the top RPM of a car engine when running in neutral.
-
- Two cases are even harder to pin down: networks and disk arrays.
- In both cases the same hardware can easily be configured to work
- in different modes, producing a vast range of performance. With
- networks, the topology may be redefined and the server
- configuration altered, resulting in an overall system throughput
- that has little to do with the actual performance of the drives
- themselves. The second special case is a disk array configured as
- a RAID cluster; the performance of the array as a unit can be as
- high as twice that of a single drive or more, or as low as a
- fraction of the performance of a single drive.
-
- The experienced user will try to estimate the importance of each
- performance feature of the storage system, weigh them against each
- other, and make a decision based more on intuition than on hard facts.
-
- Eventually, the user will run the ultimate benchmark: a test drive of
- the system with their favorite application. As unscientific as it may
- seem, the results of this test are more meaningful than any of the
- standard benchmarks.
-
- The application level benchmarking approach is defined as testing a
- system in a manner that resembles actual usage as closely as possible.
- This seems to be the most consistent and useful approach to the
- problem of measuring storage system performance.
-
-
-
- Shortcomings of Existing Benchmarks
- or
- Why results are only as good as the measurement tool
-
- * Cache
-
- Since most existing disk benchmarking programs were designed to
- give results within a few seconds of activation, they usually
- perform their tests by moving very small amounts of data. With
- modern high-end storage devices you could have anywhere from a
- few kilobytes to 64 megabytes or more of cache between the CPU
- and the storage system. If a benchmarking program uses standard
- file accesses, most or all of the data wouldn't even reach the
- drive by the time the measurement period is over, yielding
- ludicrous results, such as a transfer rate of 20 megabytes per
- second. If the cache is embedded within the hardware, as is
- typical of high-end SCSI adapters, it would present the same
- problem for benchmarking software which uses direct access to the
- drive. In order to measure the true effect of a cache, the
- benchmark should move vast amounts of data during the test, use
- huge files for disk access, and generally behave like a typical
- application. A good benchmark shouldn't eliminate the cache
- effect, but rather measure the true benefit an actual application
- will gain by the presence of that cache.
-
-
- * Compression
-
- Disk compression hardware and utilities are new players in the
- small-computer arena. Before data is written to the drive it is
- compressed either by a software device driver or by a dedicated
- hardware adapter. During reads, data is decompressed back to its
- original form. Best performance is achieved when the data is
- highly regular, such as is common in large spreadsheets or
- databases. However, as the ratio between CPU speed and disk
- throughput grows, new applications will have built-in data
- compression, especially those dealing with highly compressible
- data, such as video and music. When this happens, the benefits of
- disk-based compression will diminish. A typical disk benchmarking
- program uses either fixed data blocks or whatever garbage data
- happens to be in memory at the time. In either case, the data
- tends to be highly compressible, resulting in exceptionally high
- test scores for transfer rate. In order to obtain true results,
- the benchmark should use data patterns which resemble the actual
- data as closely as possible. Since this is virtually impossible,
- the next best thing is to select a reference pattern which can be
- easily defined and whose maximum compression ratio is known in
- advance. The only data pattern that meets both of these criteria
- is an incompressible pattern. Information theory assures us that
- a sequence with no redundancy cannot be compressed, and a
- suitable pseudorandom sequence exhibits no redundancy that any
- practical algorithm can exploit. As far as the compression logic
- is concerned, such a sequence is indistinguishable from
- pre-compressed data.
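-
- To see why, consider the following illustration (not part of
- RAIDmark, which names no compression library): a 32KB pseudorandom
- buffer and an equally regular buffer are both handed to zlib's
- compress() routine; only the regular one shrinks.
-
-     /* A sketch, not part of RAIDmark: compress a pseudorandom
-        32KB buffer and a regular one and compare the results.
-        Build with: cc demo.c -lz */
-     #include <stdio.h>
-     #include <string.h>
-     #include <zlib.h>
-
-     int main(void)
-     {
-         enum { N = 32768 };            /* 32KB, one RAIDmark block */
-         static unsigned char src[N];
-         static unsigned char dst[N + N / 100 + 64];
-         unsigned long seed = 1;
-         uLongf dlen;
-         int i;
-
-         for (i = 0; i < N; i++) {      /* pseudorandom fill (LCG)  */
-             seed = (seed * 69069UL + 1UL) & 0xffffffffUL;
-             src[i] = (unsigned char)(seed >> 16);
-         }
-         dlen = sizeof dst;
-         compress(dst, &dlen, src, (uLong)N);
-         printf("random : %d -> %lu bytes\n", N, (unsigned long)dlen);
-
-         memset(src, 'A', N);           /* highly regular fill      */
-         dlen = sizeof dst;
-         compress(dst, &dlen, src, (uLong)N);
-         printf("regular: %d -> %lu bytes\n", N, (unsigned long)dlen);
-         return 0;
-     }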
-
-
- * Multi-threading
-
- A drive with a long seek time may perform very well for
- sequential accesses, but it will cease doing so as soon as
- several threads are running concurrently, even if all threads
- access the same file, and all do so sequentially. The necessity
- to switch tasks results in multiple seeks up to a point where,
- with a sufficiently large number of threads, the behavior of the
- drive will resemble totally random access. The extent to which
- this effect degrades performance cannot be computed a priori
- from average random access time and transfer rate alone, and
- thus has to be measured directly. A small number (3-6) of
- concurrent threads is usually enough to expose the full effect
- of sequential access degradation.
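-
- A minimal sketch of such a load is given below, using POSIX
- threads and pread() purely for illustration (both postdate the
- DOS-era systems discussed here): each thread reads its own region
- of one file strictly sequentially, yet the drive must seek between
- regions on every task switch.
-
-     /* A sketch: NTHREADS readers walk disjoint 16MB regions of
-        one file, each strictly sequentially.  Assumes the file
-        already exists and is at least NTHREADS * 16MB long.
-        Build with: cc demo.c -lpthread */
-     #include <fcntl.h>
-     #include <pthread.h>
-     #include <stdio.h>
-     #include <sys/types.h>
-     #include <unistd.h>
-
-     #define NTHREADS 4                 /* 3-6 suffices, per the text */
-     #define CHUNK    8192
-     #define CHUNKS   2048              /* 16MB per thread            */
-
-     static int fd;
-
-     static void *reader(void *arg)
-     {
-         long id = (long)arg;
-         char buf[CHUNK];
-         off_t base = (off_t)id * CHUNK * CHUNKS;
-         int i;
-         for (i = 0; i < CHUNKS; i++)   /* sequential within region   */
-             if (pread(fd, buf, CHUNK, base + (off_t)i * CHUNK) < 0)
-                 perror("pread");
-         return NULL;
-     }
-
-     int main(void)
-     {
-         pthread_t t[NTHREADS];
-         long i;
-         fd = open("testfile.dat", O_RDONLY);
-         if (fd < 0) { perror("open"); return 1; }
-         for (i = 0; i < NTHREADS; i++)
-             pthread_create(&t[i], NULL, reader, (void *)i);
-         for (i = 0; i < NTHREADS; i++)
-             pthread_join(t[i], NULL);
-         close(fd);
-         return 0;
-     }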
-
-
- * Sequential vs. Random
-
- Existing benchmarks generally divide disk accesses into two
- categories: sequential and random. In real life, however, very
- few applications actually behave this way. A modern disk
- controller with high bandwidth
- and deferred-write caching can almost totally eliminate seek
- latency for applications performing only a small portion of
- random accesses. The same drive, when used with a write-through
- cache, might perform almost as slowly as fully random access. A
- good benchmark should present the system with the correct mixture
- of sequential and random accesses in order to get valid results.
-
-
- * Combining reads and writes
-
- Much as with mixing sequential and random accesses, a mixture
- of read and write requests may yield results which differ
- substantially from the 'expected' weighted average of the two.
- With the proper logic programmed in the drive controller or
- device driver, and the appropriate scatter/gather method
- implemented in the operating system, read and write requests can
- be overlapped from the CPU's point of view, resulting in superior
- throughput. Some of the more rudimentary disk benchmarking
- programs do not perform any writes at all during the test, and
- simply assume the transfer rate is the same as for reads. This
- assumption falls apart where RAID-class devices are concerned.
- Depending on RAID level and architecture, the time needed to
- write a block may differ from the time it takes to read the same
- block by a factor of two or more. An inherently sluggish RAID
- device may seem exceptionally fast when tested by only reading
- existing data, or when the mixture of read and write operations
- does not match the actual application mixture.
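-
- Both of the mixtures discussed above reduce to a pair of ratios.
- The sketch below drives a workload from two such ratios; the 10%
- random / 30% write figures and the file layout are illustrative
- assumptions, not RAIDmark's definitions.
-
-     /* A sketch of a two-ratio workload generator.  Assumes
-        testfile.dat already exists at its full size. */
-     #include <stdio.h>
-     #include <stdlib.h>
-
-     #define RECSZ 64
-     #define NRECS (128L * 1024 * 1024 / RECSZ)
-
-     int main(void)
-     {
-         double p_random = 0.10, p_write = 0.30;  /* assumed mixture */
-         long pos = 0, i;
-         char rec[RECSZ] = {0};
-         FILE *f = fopen("testfile.dat", "r+b");
-         if (!f) { perror("fopen"); return 1; }
-
-         for (i = 0; i < 10000; i++) {
-             if ((double)rand() / RAND_MAX < p_random)
-                 pos = (long)((double)rand() / RAND_MAX * (NRECS - 1));
-             else
-                 pos = (pos + 1) % NRECS;  /* next sequential record */
-             fseek(f, pos * RECSZ, SEEK_SET);
-             if ((double)rand() / RAND_MAX < p_write)
-                 fwrite(rec, RECSZ, 1, f);
-             else
-                 fread(rec, RECSZ, 1, f);
-         }
-         fclose(f);
-         return 0;
-     }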
-
-
- Defining a Reference Disk Benchmark
- or
- What's a RAIDmark?
-
- The DynaTek RAIDmark utility is intended to measure the integrity,
- performance, and cost-efficiency of any read/write random access
- (blocked I/O) storage subsystem in terms that are meaningful to the
- layman, and that reflect the actual usefulness the user gets from the
- system. RAIDmark is also intended to measure some low-level metrics of
- the storage device, which can be utilized for testing and fine-tuning
- such devices throughout the phases of on-site installation.
-
- The RAIDmark utility defines a set of six standard tests, each
- simulating a typical I/O-bound computing environment. For each test,
- the tested system receives a grade proportional to its throughput. A
- unit of 'One RAIDmark' is defined as the measured throughput (for each
- test) of an IBM PS/2 model 95 with the standard IBM 150 MB hard disk.
- Integrity is tested on a go/no-go basis using 32-bit checksum
- verification on each record.
-
- * Low-level performance
-
- The utility measures raw indexes of storage device performance:
- average access time, read/write transfer rate, effective cache
- size and cache strategy.
-
- * High-level performance
-
- The utility tests the performance of the storage subsystem on the
- application level, checking the subsystem's ability to respond to
- the harsh real-world data traffic typical of heavy-load
- environments.
- The utility simulates data flow of several typical work
- environments and measures system response and throughput in each
- of the test cases, giving a combined figure of merit for the
- system's performance in each scenario.
-
- The following scenarios are simulated and tested:
-
- - Typical workstation applications - multiuser (3-10 users)
- environment, many small- to medium-size file requests, most
- of which are read-only accesses with less frequent large
- file read/write operations (as in CAD/CAM file save or
- memory swap).
-
- - Transaction processing environment - 5-50 users, many small
- record read and update requests, reports and backup
- processes running in the background, transaction logbook as
- a large sequential file.
-
- - Video/audio/multimedia applications - single or few users,
- huge sequential access files, with both read and write
- requests.
-
- - Database query systems - 5-200 users, many small record
- accesses, almost all of which are read-only; updates,
- reports and backup are executed during system off-line time.
-
- - Data logging - up to several hundred sites, many very
- small random write accesses (typically 20-100 bytes),
- reports and backup are executed during system off-line time.
-
- - Archiving - few users with large sequential read/write
- accesses.
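-
- One way to feed these scenarios to a generic test routine is a
- table of profiles, sketched below in C. The user counts follow the
- descriptions above; the record sizes and access mixtures shown are
- illustrative assumptions only.
-
-     /* A sketch of scenario profiles for a generic test routine. */
-     struct scenario {
-         const char *name;
-         int  min_users, max_users;  /* simulated concurrent users  */
-         long record_size;           /* bytes per request (assumed) */
-         int  pct_random;            /* share of random accesses, % */
-         int  pct_write;             /* share of write accesses, %  */
-     };
-
-     static const struct scenario scenarios[] = {
-         { "Workstation",  3,  10,  4096, 40, 20 },
-         { "Transaction",  5,  50,   128, 80, 50 },
-         { "Multimedia",   1,   4, 65536,  5, 50 },
-         { "Database",     5, 200,   128, 90,  5 },
-         { "Data logging", 1, 300,    64, 95, 95 },
-         { "Archiving",    1,   4, 65536,  5, 50 },
-     };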
-
- * Degraded subsystem performance
-
- When the subject of the tests is a RAID-class storage system, the
- user can disable one of the subsystem elements, and then proceed
- to measure the degraded performance in this hindered setup.
-
- * Design constraints and limitations
-
- - Free disk space
-
- To conduct its tests, the RAIDmark utility must be able to
- create files of several megabytes each, or even larger if
- caching effects are to be measured
- accurately. For best accuracy, at least 150MB of free disk
- space should be available on the tested drive.
-
- - Operating system interface
-
- In a multi-processing environment, the benchmark should be run
- as the sole active process; otherwise, time measurements will
- be of little value.
-
- - Integrity and stability
-
- The ability of RAID-class devices to operate with parts of
- the subsystem inoperable (or altogether missing) might
- create a situation in which the user is not aware that the
- system is operating under highly unfavorable conditions
- which adversely affect the benchmark. Moreover, modern disk
- controllers can overcome some physical mishaps, such as bad
- sectors, unaligned heads, etc., relying on firmware logic to
- correct these problems at the expense of degraded
- performance. The user should identify and isolate such
- conditions, possibly by testing for stability over time and
- over different areas of the disk platters.
-
- - DOS FAT Latency
-
- When using top-level MS-DOS file I/O requests, each random
- file seek requires the operating system to scan the file
- allocation table (FAT) from the location pointing to the
- beginning of the file, to the location pointing to the
- requested cluster. Although the FAT is usually kept in RAM,
- for large files this sequential scanning process may
- actually take longer than the physical movement of the disk
- arm. Some advanced applications (database managers, etc.)
- overcome this delay by keeping several logical entry points
- to the file ("file handles") scattered at different points
- in the file. Access is then achieved using the nearest file
- handle, reducing FAT latency by a factor of 10 to 20. Other
- applications, such as MS-Windows 3.0/3.1 Virtual Memory
- Manager, implement their own file system to achieve zero FAT
- latency. More modern file systems, such as OS/2's HPFS,
- achieve the same favorable results using advanced built-in
- random seek methods.
-
- To remain as general as possible, RAIDmark measures the
- actual FAT latency on the tested system by comparing the
- average access times for clusters located in a limited region
- near the beginning of the file to those of clusters located
- in a similarly sized region near the end of the file. The
- average FAT latency is then deducted from all random access
- test times, giving the theoretical limit for an 'optimally
- tuned' application.
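-
- The deduction itself is a single subtraction, sketched below under
- the assumption that the two regional averages have already been
- measured:
-
-     /* Clusters near the end of the file pay for a longer FAT
-        scan than clusters near the beginning; the difference of
-        the two averages is removed from all random access times. */
-     double fat_latency_ms(double avg_ms_near_start,
-                           double avg_ms_near_end)
-     {
-         return avg_ms_near_end - avg_ms_near_start;
-     }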
-
- * Test files
-
- Test files are homogeneous streams of random values, accessed
- logically as an array of 32-bit unsigned integers. The stream is
- subdivided into records of 16 double-words (long-words), each
- record containing an embedded checksum.
-
- Each record in a test file contains 16 double-words of 32 bits
- each. Fifteen of these double-words are generated by a
- pseudorandom number generator. The 16th value is calculated in
- such a way that the sum of all words in the record plus the
- offset of the record from the beginning of the file (in bytes)
- equals zero (modulo 2^32). This file structure enables the user
- to create files that are incompressible by nature, but still
- allows the user to verify the validity of the data with an
- error-detection probability of near certainty. The cost of this
- method is one multiplication and two additions per 32-bit word
- during file creation, but only one addition per double-word
- during read access.
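-
- The read-side check thus reduces to one pass of additions per
- record; a sketch, assuming an unsigned long of at least 32 bits:
-
-     /* Read-side validation: the sixteen double-words of a record
-        plus its byte offset in the file must sum to zero modulo
-        2^32.  One addition per double-word, as stated above. */
-     typedef unsigned long u32;
-
-     int record_ok(const u32 rec[16], u32 byte_offset)
-     {
-         u32 sum = byte_offset;
-         int i;
-         for (i = 0; i < 16; i++)
-             sum += rec[i];
-         return (sum & 0xffffffffUL) == 0;  /* truncate to 32 bits */
-     }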
-
- To prevent skewed results on slow-processor machines, where data
- generation would compete with the disk access itself, a 32KB
- block of data (512
- records of 64 bytes) is prepared in advance, and records from
- this block are written to the disk at file creation, with only
- one word in each 16-word record changed for each write operation
- to account for different file offsets. This will give even a slow
- CPU ample time to complete data preparation before the next write
- operation is due.
-
- Theoretically, data created by the method described above can be
- compressed. However, there is currently no compression algorithm
- capable of detecting the 256Kbit-long patterns created using this
- method.
-
- A question arises as to how well this random-data file serves as
- a test case, since in reality most data files are compressible by
- 30%-60% or, in some homogeneous data acquisition applications, by
- up to 80%. Using compressible data, however, exposes the benchmark
- to the vast range of compression algorithms, each with its own
- peculiarities regarding specific data patterns, sometimes
- resulting in a twofold difference in throughput for a specific
- instance. Hence, RAIDmark adopts the worst-case compression
- scenario, in which all algorithms behave in a similar fashion,
- namely, no compression occurs.
-
- In a future version the user may be given the option of
- generating semi-random ("pink noise") data files, permitting the
- compression mechanism to affect results.
-
-
- - Test File Creation
-
- If any of the disk tests is requested, a 128MB file is
- created. This process is timed as accurately as possible:
- both the initial file open operation and the writing of the
- data to disk. Measurement is done separately for chunks
- of data of varying sizes. The first two chunks are 16KB
- long, the third is 32KB long, the fourth -- 64KB, and so on,
- with the 8th chunk being 1MB long. The remaining 1MB chunks
- are written to bring the total to the final size.
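-
- A sketch of that schedule, with each chunk timed as it is
- written (clock() and stdio stand in for whatever timer and I/O
- path the real utility uses):
-
-     /* Two 16KB chunks, doubling up to 1MB, then 1MB chunks until
-        the target size is reached, each one timed.  fflush() only
-        pushes data to the operating system's cache. */
-     #include <stdio.h>
-     #include <time.h>
-
-     static long next_chunk(int n)      /* n = 0, 1, 2, ...         */
-     {
-         long sz = 16L * 1024;          /* chunks 0 and 1 are 16KB  */
-         int i;
-         for (i = 1; i < n && sz < 1024L * 1024; i++)
-             sz *= 2;                   /* double to 1MB, then hold */
-         return sz;
-     }
-
-     int main(void)
-     {
-         static char buf[1024L * 1024];
-         long total = 0, target = 128L * 1024 * 1024;
-         int n = 0;
-         FILE *f = fopen("testfile.dat", "wb");
-         if (!f) { perror("fopen"); return 1; }
-         while (total < target) {
-             long sz = next_chunk(n);
-             clock_t t0 = clock();
-             fwrite(buf, 1, (size_t)sz, f);
-             fflush(f);
-             printf("chunk %3d: %7ld bytes in %ld ticks\n",
-                    n, sz, (long)(clock() - t0));
-             total += sz;
-             n++;
-         }
-         fclose(f);
-         return 0;
-     }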
-
- The 128MB file size was chosen to be larger than today's
- biggest disk cache units, but less than half the size of
- modern high-capacity disks. This is a default
- value, intended to become the 'RAIDmark standard,' but
- values ranging from 1MB to 1GB can be selected through a
- command line option. Test files smaller than 16MB might
- produce inaccurate results, and the user is warned when such
- a selection is made. Using the default
- file size, each separate test will take between a few
- seconds and two minutes to complete, permitting accurate
- measurement without seizing the system for an intolerable
- duration. There are 14 separate tests in a full test-suite,
- so with all the testing overhead, running the full benchmark
- may take the better part of an hour.
-
-
- - Determine deferred-write cache size
-
- During the file creation phase, effective transfer rate
- throughput is measured for increasing data chunks as
- described above. If the system contains a deferred-write
- cache, the first few chunks will take a very short time to
- be written, usually exhibiting a throughput in excess of 10
- megabytes per second. At some point, throughput will decline
- sharply. This point in the data flow marks the end of the
- posted write cache. With systems employing dynamic caching,
- the value obtained for the cache size may be arbitrary, but
- if the test is run on an otherwise idle system, the size
- detected will be close to the maximum possible for that
- system.
-
- Note that in order for this test to run accurately, the
- system should be given enough time to "rest" before the test
- commences in order to enable previously cached data to be
- flushed to disk by the cache logic. The user should wait
- until all previous disk activity ceases before starting this
- test.
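-
- Finding the cache boundary in the recorded figures then amounts
- to scanning for a sharp drop in throughput. A sketch follows;
- the fourfold-drop threshold is an arbitrary assumption.
-
-     /* Scan per-chunk throughput recorded during file creation
-        for a sharp drop; the bytes written before the drop bound
-        the posted-write cache size. */
-     long cache_size_estimate(const double rate[],
-                              const long bytes[], int n)
-     {
-         double peak = 0.0;
-         long total = 0;
-         int i;
-         for (i = 0; i < n; i++) {
-             if (rate[i] > peak)
-                 peak = rate[i];
-             if (rate[i] < peak / 4.0)
-                 return total;      /* knee: cache exhausted        */
-             total += bytes[i];
-         }
-         return -1;                 /* no knee: cache never filled  */
-     }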
-
-
- - Determine raw transfer rate for write
-
- After obtaining an upper bound for the posted-write cache
- size, the actual non-cached transfer rate for write
- operations is measured by first writing data to a different
- location on the drive. The amount of data written should
- exceed the size of the computed deferred-write cache. The
- benchmark then compares total write times for data chunks of
- different sizes, specifically 8KB and 32KB blocks. By
- subtracting the two, all fixed-overhead delays are canceled
- out, and the net result is the time it takes to write 24KB
- of data.
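-
- In code, the cancellation is a single subtraction (the read test
- described further below uses the identical computation):
-
-     /* Fixed overhead (seek, BIOS, driver calls) appears in both
-        timings and cancels; what remains is the time to move
-        32KB - 8KB = 24KB of payload. */
-     double net_rate_kb_per_sec(double t_8kb_sec, double t_32kb_sec)
-     {
-         return 24.0 / (t_32kb_sec - t_8kb_sec);
-     }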
-
- This figure for transfer rate will usually be lower than the
- nominal figure quoted by the drive manufacturer, since the
- measurement involves drive head movements (seek and search
- operations), operating system and BIOS overhead, and
- cache-miss delays. However, the figure obtained this way is
- a more meaningful measure of subsystem performance, since
- this is the actual throughput as experienced by the
- application and the end user. Delays created by the benchmark
- utility itself are measured and removed from the results.
-
- During the "post-cache" phase of writing, each of the 1MB
- chunks of data should take approximately the same time to
- complete. If the variability is higher than expected, the
- user will be warned of the system's instability. This
- situation may arise when other processes in the system or
- the network are active, when the cache strategy involves
- some exotic algorithm, when the disk is highly fragmented,
- when many bad sectors are scattered on the disk's test area,
- or when a "flaky" drive forces some operations to be retried.
- The user has the choice of repeating the
- test, or else referring to the user guide for
- troubleshooting advice.
-
-
- - Determine raw transfer rate for read
-
- Transfer rate for sequential read access is measured by
- comparing total read times for data chunks of different
- sizes, specifically 8KB and 32KB blocks. By subtracting the
- two, all fixed-overhead delays are canceled out, and the net
- result is the time it takes to read 24KB of data.
-
-
- - Main test executive
-
- The main test executive procedure is a generic routine whose
- input parameters define the real-life environment each test
- case is supposed to simulate. Parameters include:
-
- - record size
-
- - distribution of access type
-
- - number of users or processing threads
-
- The procedure determines the number of times the tests are
- to be repeated as a function of the duration of a single
- test. The rule is that, by default, each test case should
- take from a few seconds to around two minutes. If the total
- desired testing time of the complete benchmark is given on
- the command line, the duration of each single test is
- adjusted accordingly. This time does not include benchmark
- overhead, such as file creation and cache overflowing, hence
- total benchmark duration may be considerably longer than the
- command line parameter.
-
-
- - Build random block
-
- This routine creates a 32KB block in memory for writing onto
- the disk during the file creation phase. The format of the
- block is described above. The algorithm used for generating
- random numbers is the Linear-Congruential method (described
- in D.E. Knuth, "The Art of Computer Programming" Vol. 2).
- The checksum word occupies the first four bytes of each record.
- To account for different file offsets, the word at this location
- is recomputed each time the 32KB block is written.
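-
- A sketch of the builder and of the per-write patch follows,
- assuming 32-bit unsigned longs; the linear-congruential constants
- shown are illustrative stand-ins for whichever values from Knuth
- the utility actually uses. Note that each random double-word costs
- one multiplication and two additions, matching the file creation
- cost quoted earlier.
-
-     typedef unsigned long u32;         /* assumed 32 bits          */
-     #define RECS  512                  /* records per 32KB block   */
-     #define WORDS 16                   /* double-words per record  */
-
-     static u32 seed = 12345;
-     static u32 lcg(void)               /* one multiply, one add    */
-     {
-         return seed = (seed * 1664525UL + 1013904223UL)
-                       & 0xffffffffUL;
-     }
-
-     /* Fill the block once; base[] keeps each record's checksum as
-        it would be for file offset zero. */
-     void build_block(u32 blk[RECS][WORDS], u32 base[RECS])
-     {
-         int r, w;
-         for (r = 0; r < RECS; r++) {
-             u32 sum = 0;
-             for (w = 1; w < WORDS; w++)  /* 15 random double-words */
-                 sum += (blk[r][w] = lcg());
-             base[r] = blk[r][0] = (0UL - sum) & 0xffffffffUL;
-         }
-     }
-
-     /* Before each write, patch the one checksum word per record
-        so that it again sums to zero with its actual byte offset. */
-     void patch_block(u32 blk[RECS][WORDS], const u32 base[RECS],
-                      u32 file_offset)
-     {
-         int r;
-         for (r = 0; r < RECS; r++)
-             blk[r][0] = (base[r] - (file_offset + (u32)r * 64UL))
-                         & 0xffffffffUL;
-     }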
-
-
- Conclusion
-
- In order to help the user select a storage system and configuration
- that will match their computing needs, a standard reference benchmark
- for measuring storage system performance has been defined. Existing
- disk benchmarking programs fall short when measuring modern 'smart'
- storage systems, or when unusual disk configurations are used. Many of
- the most important performance factors are not taken into account, and
- the characteristics of the user's application play no role in the
- measurement. The DynaTek RAIDmark utility has been created to overcome
- these limitations, operating as an application-level benchmark.
- RAIDmark measures storage system performance under a variety of
- typical operating environments, and provides the user with a good
- indication of the expected usefulness of their system.
-
-
-
- * Technical reference
-
- - Patterson, David A.; Gibson, Garth; Katz, Randy H.:
- "A Case for Redundant Arrays of Inexpensive Disks (RAID)";
- ACM SIGMOD (Chicago, 1988).
-
- - Patterson, David A.; Chen, Peter; Gibson, Garth; Katz, Randy H.:
- "An Introduction to Redundant Arrays of Inexpensive Disks
- (RAID)"; IEEE, 1989.
-
- - Graham, Richard A. (Integra Technologies, Inc.):
- "Redundant Arrays of Inexpensive Disks"; 1991.
-
- - Finney, Kenneth C.; Graham, Richard A.:
- "A Discussion of RAID Technology"; DynaTek/Integra, 1992.
-
-
-
-
-
- DynaTek and RAIDmark are trademarks of DynaTek Automation Systems Inc.
-
- HPFS, IBM, Integra, MS-DOS, MS-Windows, OS/2 and PS/2 are registered
- trademarks of their respective holders.
-