Chapter 12. CachesCaches store a small amount of information for repeated fast access. Many types of caches occur throughout computer hardware and software systems. We’ll take a look at the common principles, with examples of CPU cache, directory name lookup cache (DNLC), and name service cache. Then, we’ll discuss some caches used with local disks and caches used with networks. Cache PrinciplesCaches work on two basic principles that should be quite familiar to you from everyday life. The first is that if you spend a long time going to get something and you think you may need it again soon, you keep it nearby. For example, if you are working on your car, you find a 15 mm spanner in your tool box, then crawl under the car. If you then need a 10 mm spanner, you don’t put the 15 mm back in the tool box, you leave it under the car. When you have finished, there is a cache of tools in a pile under the car that make up your working set. If you allowed only one tool at a time under the car, you would waste a lot of time going back to the tool box. However, if the pile of tools gets too large, you may decide to put away a few tools that you think you won’t need any more. The second principle is that when you get something, you can save time by getting a few extra things that you think you might need soon. In the car example, you could grab a handful of spanners each time, thus saving yourself from having to make a few trips. You find that the pile under the car gets bigger much more quickly, you never use some tools, and it takes you longer to pick up and put down a handful of tools on each trip to the tool box. The first principle, called “temporal locality,” depends on reusing the same things over time. The second principle, called “spacial locality,” depends on using things at the same time that are located near each other. Caches work well only if there is good locality in what you are doing. Another good example is the “cache” of groceries that you keep in your house that saves you a trip to the supermarket every time you want a bite to eat. To take this analogy a bit further, let’s say you are the kind of person who hates shopping and lives off tinned and frozen food that you buy once a month. You have a very efficient cache, with few shopping trips, and you can spend all your spare time browsing the Internet. Now, let’s say that your mother comes to stay and tells you that you need a diet of fresh fruit and salad instead. You can’t buy in bulk as these items don’t freeze or keep, and you need to visit the shops almost every day. You now waste all your spare time shopping. You notice one day that your local shop is now also on the Internet, and you can use your browser to place orders for delivery. Now you can surf the Internet all the time, keep a good diet, and never go grocery shopping again! In cache terms, this is an asynchronous prefetch of stuff that you know you will need soon, but, by requesting it in advance, you never have to stop and wait for it to arrive. This analogy illustrates that some sequences of behavior work very efficiently with a cache, and others make little or no use of the cache. In some cases, “cache busting” behavior can be fixed by changing the system to work in a different way that bypasses the normal cache fetch. These alternative methods usually require extra hardware or software support in a computer system. Before going on to a specific kind of cache, let’s look at some generic terms and measurements that parameterize caches. When the cache is used to hold some kind of information, a read from the cache is very different from writing new information, so we’ll take a look at cache reads first, then look at writes. Cache Reads That SucceedMany kinds of information can be cached: the value of a memory location, the location of a file, or the network address of another system. When the information is needed, the system first looks for it in the cache. This approach requires that each item in the cache has a unique name or tag associated with the information. When a cache lookup occurs, the first thing that happens is a search during which the system tries to match the name you gave it with one of the tags. If the search succeeds, the associated data is returned; this case is called a read hit. This situation is the best case, as it is both quick and successful. The two measures of interest are the read hit rate and the read hit cost. The read hit rate is the number of read hits that occur per second; the read hit cost is the time it takes to search the cache and return the data. CPU Cache ExampleFor a CPU’s primary data cache, an address is used as the tag. Millions of read hits occur each second; the cost is known as the load use delay, a constant small number of CPU clock cycles. When the CPU issues a load instruction to move data from memory to a register, it has to wait a few cycles before it can use the register. A good compiler will try to optimize code by moving load instructions forward and filling in the load use delay with other work. It is hard to measure (and understand) what is going on in the CPU cache. Directory Name Lookup Cache ExampleFor the directory name lookup cache (DNLC), an open system call causes a pathname string, e.g., “export,” to be looked up to find its inode. The rate at which this lookup occurs is reported by sar -a as namei/s; it can reach a few thousand per second on a busy file server. The read hit cost is relatively low and consumes only CPU time. An efficient hashed search is used, so the size of the search is not an issue. Name Service Cache ExampleThe name service cache is implemented as a multithreaded daemon process ( nscd). It caches password and group information for users, and the host IP address information for systems. For example, a call to gethostbyname(3N) would ordinarily involve a search through the local /etc/hosts file, a call to a local NIS or NIS+ server, and perhaps a DNS lookup, as specified in /etc/nsswitch.conf. The nscd provides a systemwide cache for the results of these searches, and gethostbyname accesses it through a very fast, new, interprocess call mechanism known as a door call. The read hit cost is only a few tens of microseconds, depending on your CPU speed. Cache Reads That Miss the CacheIf the search fails and the data is not in the cache, the case is called a read miss. The rate at which this situation occurs is called the read miss rate, and the time taken is called the read miss cost. A read miss corresponds to the work that would be needed if the cache did not exist, plus the added overhead of checking the cache. Read misses are slow, and the read miss cost can be unpredictable, although it is usually possible to figure out the best case, the worst case, and the average. CPU Cache ExampleCPU caches come in many configurations. The most common setup is to have two separate primary caches built into the CPU chip, one to hold instructions and one to hold data, and often holding about 16 Kbytes each. These caches then refer to an external level-2 cache holding between 256 Kbytes and 4 Mbytes of data. A read miss in the small primary cache may result in a hit or a miss in the level-2 cache. A level-2 hit takes a few cycles longer than a level-1 hit but is still very efficient. A level-2 miss involves a main memory reference, and in a multiprocessor system, the system must access the backplane and check every CPU cache in the system as well. If the backplane is already very busy, there may be a considerable extra delay. This approach explains why heavily loaded multiprocessor servers run faster and scale better with the reduced number of read misses that come with larger CPU caches. For example, Sun currently offers a choice of 512-Kbyte, 1-Mbyte, 2-Mbyte, and 4-Mbyte caches on the UltraSPARC range. The 4-Mbyte cache choice is always faster, but it will have a proportionately greater benefit on performance and scalability of a 24-CPU E6000 than on a 4-CPU E3000. An anomaly can arise in some cases. It takes time to search the cache and to load new data into it, and these processes slow down the fetch operation. If you normally operate in a cache-busting mode, you actually go faster if there is no cache at all. Since caches add to the cost of a system (extra memory used or extra hardware), they may not be present on a simpler or cheaper system. The anomaly is that a cheaper system may be faster than a more expensive one for some kinds of work. The microSPARC processor used in the 110 MHz SPARCstation 5 is very simple: it has a high cache miss rate but a low cache miss cost. The SuperSPARC processor used in the SPARCstation 20 has much bigger caches, a far lower cache miss rate, but a much higher cache miss cost. For a commercial database workload, the large cache works well, and the SPARCstation 20 is much faster. For a large Verilog EDA simulation, the caches are all too small to help, and the low latency to access memory makes the SPARCstation 5 an extremely fast machine. The lessons learned from this experience were incorporated in the UltraSPARC design, which also has very low latency to access memory. To get the low latency, the latest, fastest cache memory and very fast, wide system buses are needed. Directory Name Lookup Cache ExampleIf we consider our other examples, the DNLC is caching a file name. If the system cannot find out the corresponding inode number from the DNLC, it has to read through the directory structures of the file system. This procedure may involve a linear search through a UFS directory structure or a series of readdir calls over NFS to an NFS server. There is some additional caching of blocks of directory data and NFS attributes that may save time, but often the search has to sleep for many milliseconds while waiting for several disk reads or network requests to complete. You can monitor the number of directory blocks read per second with sar -a as dirbk/s; you can also look at the number of NFS2 readdir calls and NFS3 readdirplus calls. NFS2 reads a single entry with each readdir call, whereas NFS3 adds the readdirplus call that reads a series of entries in one go for greater efficiency (but longer latency). To summarize the effects of a DNLC miss: A little extra CPU is consumed during the search, but a lot of sleeping—waiting for the disk and network—dominates the miss cost. DNLC hit rates (the number of read hits as a proportion of all lookups) are often in the 80%–100% range. If you run the find command, which walks through the directory structure, you will discover that the hit rate drops to perhaps 30%–40%. My rule for the DNLC hit rate is a bit too simple. It tells you to increase the DNLC size if it is fairly small, the hit rate drops below 80%, and the reference rate is above 100/s. In some cases, a lot of new files are being created or a find-like operation is needed, and there is no point having a big DNLC. Name Service Cache ExampleA read miss on the name service cache causes a sequence of lookups by the nscd; the sequence is controlled by the /etc/nsswitch.conf file. If the file contains the entry: hosts: files nisplus dns then the nscd first lookup uses the standard files “backend” to look in /etc/hosts. In recent releases of Solaris 2, the files backend itself caches the file and does a hashed lookup into it; so, very long host and passwd files are searched efficiently, and the read miss could be quite quick and CPU bound with no waiting. A NIS or NIS+ lookup involves a call over the network to a name server and so incurs at least a few milliseconds of delay. If the lookup is eventually passed up a hierarchy of DNS servers across the Internet, the lookup could take several seconds. The worst case is a lookup in a remote DNS domain for a hostname that does not exist. Unfortunately, some applications repeat the same lookup several times in quick succession, so a novel feature of the nscd is that it supports negative caching. It remembers that a lookup failed and tells you immediately that the host does not exist if you ask a second time. There is a five second time-out for negative entries by default. The full output from nscd -g (on a recently booted Solaris 2.5 system) is shown Figure 12-1. Figure 12.1. Example Name Service Cache Daemon ( nscd) Configuration
Cache Replacement PoliciesSo far, we have assumed that there is always some spare room in the cache. In practice, the cache will fill up and at some point cache entries will need to be reused. The cache replacement policy varies, but the general principle is that you want to get rid of entries that you will not need again soon. Unfortunately, the “won’t need soon” cache policy is hard to implement unless you have a very structured and predictable workload or you can predict the future! In its place, you will often find caches that use “least recently used” (LRU) or “not recently used” (NRU) policies, in the hope that your accesses have good temporal locality. CPU caches tend to use “direct mapping”—where there is no choice, the address in memory is used to fix an address in cache. For small, on-chip caches, there may be “n-way associative mapping”—where there are “n” choices of location for each cache line, usually two or four, and an LRU or random choice implemented in hardware. Random replacement (RR) policies seem to offend some of the engineering purists among us, as they can never work optimally, but I like them because the converse is true, and they rarely work badly either! Caches work well if the access pattern is somewhat random with some locality or if it has been carefully constructed to work with a particular cache policy (an example is SunSoft Performance WorkShop’s highly tuned math library). Unfortunately, many workloads have structured access patterns that work against the normal policies. Random replacement policies can be fast, don’t need any extra storage to remember the usage history, and can avoid some nasty interactions. Cache Writes and PurgesSo far, we have looked only at what is involved in reading existing information through a cache. When new information is created or old information is overwritten or deleted, there are extra problems to deal with. The first issue to consider is that a memory-based cache contains a copy of some information and the official master copy is kept somewhere else. When we want to change the information, we have several choices. We could update the copy in the cache only, which is very fast, but that copy now differs from the master copy. We could update both copies immediately, which may be slow, or we could throw away the cached copy and just update the master (if the cache write cost is high). Another issue to consider is that the cache may contain a large block of information and we want to update only a small part of it. Do we have to copy the whole block back to update the master copy? And what if there are several caches for the same data, as in a CPU with several levels of cache and other CPUs in a multiprocessor? There are many possibilities, so I’ll just look at the examples we have discussed already. The CPU cache is optimized for speed by hardware that implements relatively simple functions with a lot of the operations overlapped so they all happen at once. On an UltraSPARC, a cache block contains 64 bytes of data. A write will change between 1 and 8 bytes at a time. Each block contains some extra flags that indicate whether the data is shared with another CPU cache and whether the data has been written to. A write updates both the internal level-1 and the external level-2 caches. It is not written out to memory, but the first time a shared block in a multiprocessor system is written to, a special signal, invalidating any copies of the data that other CPUs hold, is sent to all the other CPUs. The flags are then changed from shared/clean to private/dirty. If another CPU then tries to reread that data, it is provided directly in a cache-to-cache transfer and becomes shared/dirty. The data is eventually written back to main memory only when that cache block is needed to hold different data by a read. In this case, the read operation is held in an input buffer while the dirty cache block is copied to an output buffer, then the read from memory completes, and finally the dirty cache line write completes. Another situation to consider is the case of a context switch or a process exiting. If an address space is no longer valid, the cache tags may also be invalid. For systems that store a virtual address in the cache tag, all the invalid addresses must be purged; for systems that store only a physical address in the cache tag, there is no such need. Older SPARC systems, the HyperSPARC range, and some other systems such as HP-PA use virtual caches. SuperSPARC and UltraSPARC use physical caches. In general, a virtual cache can be faster for uniprocessor systems and scientific workloads, and a physical cache scales much better on multiprocessor systems and commercial workloads that have lots of context switches. There are a lot more ways to implement caches. If you are interested, I recommend you read Computer Architecture — A Quantitative Approach by Hennessy and Patterson, published by Morgan Kaufmann. Returning to our consideration of the DNLC: When a file is created or deleted, the new directory entry is written to disk or sent over NFS immediately, so it is a slow operation. The extra overhead of updating or purging the DNLC is very small in comparison. The name service cache does not get involved in updates. If someone changes data in the name service, the nscd relies on time-outs and watching the modification dates on files to register the change. Cache EfficiencyA cache works well if there are a lot more reads than writes, and the reads or writes of the same or nearby data occur close together in time. An efficient cache has a low reference rate (doesn’t make unnecessary lookups), a very short cache hit time, a high hit ratio, the minimum possible cache miss time, and an efficient way of handling writes and purges. Some trade-offs can be made. A small, fast cache may have a lower hit ratio but can compensate by having a low miss cost. A big, slower cache can compensate for the extra miss cost by missing much less often. Beware of looking at hit ratios alone. A system may be running more efficiently by referencing the cache less often. If there are multiple levels of cache and the first level is working well, the second level will tend to see a higher proportion of misses but a low reference rate. It is actually the miss rate (in misses/second) multiplied by the miss cost (to give a total time delay) that matters more than the ratio of hits or misses as a percentage of the total references Generic Cache Behavior SummaryI’ve tried to illustrate some of the generic behavior of caches with examples that show what is similar and what is different. Computer systems are constructed from a large number of hardware and software caches of many types, and there can be huge differences in performance between a workload that works with a cache and one that works against a cache. Usually, all we can tune is the cache size for software caches like the DNLC. This tuning has some effect, but it is much more effective to change your workload (don’t run the find command any more often that you have to!) to work within the cache you have. Your computer is spending more time stalled, waiting for some kind of cache miss to complete, than you think. File Access Caching with Local DiskAccessing a file on disk or over a network is hundreds of times slower than reading a cached copy from memory. Many types of cache exist to speed up file accesses. Changing your workload to make it more “cache friendly” can result in very significant performance benefits. We’ll start by looking at the simplest configuration, the open, fstat, read, write, and mmap operations on a local disk with the default UFS file system, as shown in Figure 12-2. Figure 12-2. File Access Caches for Local Disk Access
There are a lot of interrelated caches. They are systemwide caches, shared by all users and all processes. The activity of one cache-busting process can mess up the caching of other well-behaved processes. Conversely, a group of cache-friendly processes working on similar data at similar times help each other by prefilling the caches for each other. The diagram in Figure 12-2 shows the main data flows and relationships. Directory Name Lookup CacheThe directory name lookup cache (DNLC) is a cache of directory information. A directory is a special kind of file that contains names and inode number pairs. The DNLC holds the name and a pointer to an inode cache entry. If an inode cache entry is discarded, any corresponding DNLC entries must also be purged. When a file is opened, the DNLC figures out the right inode from the file name given. If the name is in the cache, the system does a fast hashed lookup if directories need not be scanned. The UFS directory file structure is a sequence of variable length entries requiring a linear search. Each DNLC entry is a fixed size, so there is only space for a pathname component of up to 30 characters. Longer components are not cached (many older systems like SunOS 4 cache only up to 14 characters). Directories that have thousands of entries can take a long time to search, so a good DNLC hit rate is important if files are being opened frequently and very large directories are in use. In practice, file opening is not usually frequent enough to be a serious problem. NFS clients hold a file handle that includes the inode number for each open file, so each NFS operation can avoid the DNLC and go directly to the inode. The maximum tested size of the DNLC is 34906, corresponding to the maximum allowed maxusers setting of 2048. The largest size the DNLC will reach with no tuning is 17498, on systems with over 1 Gbyte of RAM. The size defaults to ( maxusers * 17) + 90, and maxusers is set to just under the number of megabytes of RAM in the system, with a default limit of 1024. I find that people are overeager in tuning ncsize; it really only needs to be increased manually on small-memory (512 Mbytes or less) NFS servers, and even then, any performance increase is unlikely to be measurable unless you are running the SPECsfs NFS stress test benchmark. Inode CacheThe fstat call returns the inode information about a file. This information includes file size and datestamps, as well as the device and inode numbers that uniquely identify the file. Every concurrently open file corresponds to an active entry in the inode cache, so if a file is kept open, its information is locked in the inode cache and is immediately available. A number (set by the tunable ufs_ninode) of inactive inode entries are also kept. ufs_ninode is set, using the same calculation as that for ncsize above, but the total size of the inode cache will be bigger because ufs_ninode limits only the inactive entries. ufs_ninode doesn’t normally need tuning, but if the DNLC is increased, make ncsize and ufs_ninode the same. Inactive files are files that were opened at some time in the past and that may be opened again in the future. If the number of inactive entries grows too large, entries that have not been used recently are discarded. Stateless NFS clients do not keep the inode active, so the pool of inactive inodes caches the inode data for files that are opened by NFS clients. The inode cache entry also provides the location of every data block on disk and the location of every page of file data in memory. If an inactive inode is discarded, all of its file data in memory is also discarded, and the memory is freed for reuse. The percentage of inode cache entries that had pages when they were freed, causing cached file data to be discarded, is reported by sar -g as %ufs_ipf. My inode cache rule described in “Inode Rule” on page 460 warns when non-zero values of %ufs_ipf are seen. The inode cache hit rate is often 90 percent or more, meaning that most files are accessed several times in a short period of time. If you run a cache-busting command, like find or ls -R, that looks at many files once only, you will see a much lower DNLC and inode cache hit rate. An inode cache hit is quick because a hashed lookup finds the entry efficiently. An inode cache miss varies because the inode may be found in the UFS metadata buffer cache or because a disk read may be needed to get the right block of inodes into the UFS metadata buffer cache. UFS Metadata Buffer CacheThis cache is often just referred to as “the buffer cache,” but there has been so much confusion about its use that I like to specifically mention metadata. Historically, Unix systems used a “buffer cache” to cache all disk data, assigning about 10 percent of total memory to this job. This changed in about 1988, when SunOS 4.0 came out with a combined virtual memory and I/O setup. This setup was later included in System V Release 4, and variants of it are used in most recent Unix releases. The buffer cache itself was left intact, but it was bypassed for all data transfers, changing it from a key role to a mostly inconsequential role. The sar -b command still reports on its activity, but I can’t remember the buffer cache itself being a performance bottleneck for many years. As the title says, this cache holds only UFS metadata. This includes disk blocks full of inodes (a disk block is 8 Kbytes, an inode is about 300 bytes), indirect blocks (used as inode extensions to keep track of large files), and cylinder group information (which records the way the disk space is divided up between inodes and data). The buffer cache sizes itself dynamically, hits are quick, and misses involve a disk access. In-memory Page CacheWhen we talk about memory usage and demand on a system, it is actually the behavior of the in-memory page cache that is the issue. This cache contains all data that is held in memory, including the files that make up executable code and normal data files, without making any distinction between them. A large proportion of the total memory in the system is used by this cache because it holds all the pages that make up the current working set of the system as a whole. All page-in and page-out operations occur between this cache and the underlying file systems on disk (or over NFS). Individual pages in the cache may currently be unmapped (e.g., a data file) or may be mapped into the address space of many processes (e.g., the pages that make up the libc.so.1 shared library). Some pages do not correspond to a named file (e.g., the stack space of a process); these anonymous pages have swap space reserved for them so that they can be written to disk if required. The vmstat and sar -pg commands monitor the activity of this cache. The cache is made up of 4-Kbyte or 8-Kbyte page frames. Each page of data may be located on disk as a file system or swap space data block, or in memory in a page frame. Some page frames are ready for reuse or empty and are kept on the free list (reported as free by vmstat). A cache hit occurs when a needed page is already in memory; this hit may be recorded as an attach to an existing page or as a reclaim if the page was on the free list. A cache miss occurs when the page needs to be created from scratch (zero fill fault), duplicated (copy on write), or read in from disk (page-in). Apart from the page-in, these operations are all quite quick, and all misses take a page frame from the free list and overwrite it. Consider a naive file-reading benchmark that opens a small file, then reads it to “see how fast the disk goes.” If the file was recently created, then all of the file may be in memory. Otherwise, the first read-through will load it into memory. Subsequent runs may be fully cached with a 100% hit rate and no page-ins from disk at all. The benchmark ends up measuring memory speed, not disk speed. The best way to make the benchmark measure disk speed is to invalidate the cache entries by unmounting and remounting the file system between each run of the test. The complexities of the entire virtual memory system and paging algorithm are beyond the scope of this book. The key thing to understand is that data is evicted from the cache only if the free memory list gets too small. The data that is evicted is any page that has not been referenced recently, where recently can means a few seconds to a few minutes. Page-out operations occur whenever files are written and also when data is reclaimed for the free list because of a memory shortage. Page-outs occur to all file systems but are often concentrated on the swap space. Disk Array Write CacheDisk array units, such as Sun’s SPARCstorage Array, or “Hardware RAID” subsystems, such as the RSM2000, contain their own cache RAM. This cache is so small in comparison to the amount of disk space in the array that it is not very useful as a read cache. If there is a lot of data to read and reread, it would be better to add large amounts of RAM to the main system than to add it to the disk subsystem. The in-memory page cache is a faster and more useful place to cache data. A common setup is to make reads bypass the disk array cache and to save all the space to speed up writes. If there is a lot of idle time and memory in the array, then the array controller may also look for sequential read patterns and prefetch some read data, but in a busy array, this practice can get in the way. The OS generates its own prefetch reads in any case, further reducing the need to do additional prefetch in the controller. Three main situations, described below, are helped by the write cache. When a lot of data is being written to a single file, it is often sent to the disk array in small blocks, perhaps 2 Kbytes to 8 Kbytes in size. The array can use its cache to coalesce adjacent blocks, which means that the disk gets fewer larger writes to handle. The reduction in the number of seeks greatly increases performance and cuts service times dramatically. This operation is safe only if the cache has battery backup for its cache (nonvolatile RAM) because the operating system assumes that when a write completes, the data is safely on the disk. As an example, 2-Kbyte raw writes during a database load can go two to three times faster. The simple Unix write operation is buffered by the in-memory page cache until the file is closed or data gets flushed out after 30 seconds. Some applications use synchronous writes to ensure that their data is safely on disk. Directory changes are also made synchronously. These synchronous writes are intercepted by the disk array write cache and safely stored in nonvolatile RAM. Since the application is waiting for the write to complete, this approach has a dramatic effect, often reducing the wait from as much as 20 ms to as little as 2 ms. For the SPARCstorage Array, use the ssaadm command to check that fast writes have been enabled on each controller and to see if they have been enabled for all writes or just synchronous writes. The ssaadm command defaults to off, so if someone has forgotten to enable fast writes, you could get a good speedup! Use ssaadm to check the SSA firmware revision and upgrade it first. Use the copy in /usr/lib/firmware/ssa, or get a later version from the patch database. The final use for a disk array write cache is to accelerate the RAID5 write operations in Hardware RAID systems. This usage does not apply to the SPARCstorage Array, which uses a slower, software-based RAID5 calculation in the host system. RAID5 combines disks, using parity for protection, but during writes, the calculation of parity means that all the blocks in a stripe are needed. With a 128-Kbyte interlace and a 6-way RAID5, each full stripe cache entry would use 768 Kbytes. Each individual small write is then combined into the full stripe before the full stripe is written back later on. This method needs a much larger cache than that needed for performing RAID5 calculations at the per-write level, but it is faster because the disks see fewer larger reads and writes. The SPARCstorage Array is very competitive for use in striped, mirrored, and read-mostly RAID5 configurations, but its RAID5 write performance is slow because each element of the RAID5 data is read into main memory for the parity calculation and then written back. With only 4 Mbytes or 16 Mbytes of cache, the SPARCstorage Array doesn’t have space to do hardware RAID5, although this is plenty of cache for normal use. Hardware RAID5 units have 64 Mbytes or more (sometimes much more); see “Disk Workloads” on page 169 for more on this topic. The Standard I/O BufferSimple text filters in Unix process data one character at a time, using the putchar and getchar macros, printf and the related stdio.h routines. To avoid a system call for every read or write of one character, stdio uses a buffer to cache the data for each file. The buffer size is 1 Kbyte, so every 1024 calls of getchar, a read system call of 1 Kbyte will occur. Every 8 system calls, a filesystem block will be paged in from disk. If your application is reading and writing data in blocks of 1 Kbyte or more, there is no point in using the stdio library—you can save time by using the open/ read/ write calls instead of fopen/ fread/ fwrite. Conversely, if you are using open/ read/ write for a few bytes at a time, you are generating a lot of unnecessary system calls and stdio would be faster. Read, Write, and Memory MappingWhen you read data, you must first allocate a buffer, then read into that buffer. The data is copied out of a page in the in-memory page cache to your buffer, so there are two copies of the data in memory. This duplication wastes memory and wastes the time it takes to do the copy. The alternative is to use mmap to map the page directly into your address space. Data accesses then occur directly to the page in the in-memory page cache, with no copying and no wasted space. The drawback is that mmap changes the address space of the process, which is a complex data structure. With a lot of memory mapped files, the data structure gets even more complex. The mmap call itself is more complex than a read or write, and a complex address space also slows down the fork operation. My recommendation is to use read and write for short-lived or small files. Use mmap for random access to large, long-lived files where the avoidance of copying and reduction in read/write/lseek system calls offsets the initial mmap overhead. Networked File AccessParalleling the discussion of local disk accesses, this section looks at networked access. Accessing a file over a network is hundreds of times slower than reading a cached copy from memory. Many types of cache exist to speed up file accesses. Changing your workload to make it more “cache friendly” can result in very significant performance benefits. NFS Access CachingAgain, we’ll start by looking at the simplest configuration, the open, fstat, read, write, and mmap operations on an NFS-mounted file system, as shown in Figure 12-3. Figure 12-3. File Access Caches for NFS and Cachefs
Compared with the previous diagram of UFS access caching, the diagram in Figure 12-3 has been split in the middle, pulled to each side, and divided between the two systems. Both systems contain an in-memory page cache, but the NFS file system uses an rnode (remote node) to hold information about the file. Like the UFS inode cache, the rnode keeps pointers to pages from the file that are in the local page cache. Unlike the UFS inode, it does not contain the disk block numbers; instead, it holds the NFS file handle. The handle has no meaning on the client, but on the server it is used to locate the mount point and inode number of the file so that NFS reads and writes can go directly to the right file on the server. An NFS file open on the client causes a DNLC lookup on the client; failure to find the file causes a DNLC lookup on the server that sets up both the DNLC entry and the rnode entry on the client, as shown in Figure 12-3. There are a lot of interrelated caches. They are systemwide caches, shared by all users and all processes. The activity of one cache-busting process can mess up the caching of other well-behaved processes. Conversely, a group of cache-friendly processes working on similar data at similar times help each other by prefilling the caches for each other. The diagram in Figure 12-3 shows the main data flows and relationships. Rnode CacheThe lookup NFS call returns the rnode information about a file. This information includes the file size and datestamps, as well as the NFS file handle that encodes server mount point and inode numbers that uniquely identify the file. Every concurrently open file corresponds to an active entry in the rnode cache, so if a file is kept open, its information is locked in the rnode cache and is immediately available. A number (set by the tunable nrnode) of rnode entries are kept. nrnode is set to twice the value of ncsize. It doesn’t normally need tuning, but if the DNLC is increased, nrnode increases as well. DNLC entries are filesystem independent; they refer to entries in the UFS inode cache as well as to those in the NFS rnode cache. Several NFS mount options affect the operation of NFS data and attribute caching. If you mount mail spool directories from an NFS server, you may have seen a warning message that advises you to use the noac option. This option turns off attribute and write caching. Why would this be a good idea? The access pattern for mail files is that sendmail on the server appends messages to the file. The mailtool on an NFS client checks the file attributes to see if new mail has been delivered. If the attributes have been cached, mailtool will not see the change until the attribute time-out expires. The access pattern of sendmail and mailtool involves multiple processes writing to the same file, so it is not a good idea to cache written data on the client. This warning highlights two issues. One issue is that with multiple clients, there may be multiple copies of the same data cached on the clients and the server, and NFS does not enforce cache coherency among the clients. The second is that in situations where the cache is of no benefit, it can be disabled. See the mount_nfs(1M) manual page for full details on the attribute cache time-out options. In-memory Page Cache on NFS ClientWhen we talk about memory usage and demand on a system, it is actually the behavior of this cache that is the issue. It contains all data that is held in memory, including the files that make up executable code and normal data files, without making any distinction between them. A large proportion of the total memory in the system is used by this cache because it holds all the pages that make up the current working set of the system as a whole. All page-in and page-out operations occur between this cache and the underlying file systems on disk and over NFS. Individual pages in the cache may currently be unmapped (e.g., a data file) or may be mapped into the address space of many processes (e.g., the pages that make up the libc.so.1 shared library). Some pages do not correspond to a named file (e.g., the stack space of a process); these anonymous pages have swap space reserved for them so that they can be written to disk if required. The vmstat and sar -pg commands monitor the activity of this cache. The cache is made up of 4-Kbyte or 8-Kbyte page frames. Each page of data may be located in a file system (local or NFS) or swap space data block, or in memory in a page frame. Some page frames are ready for reuse or empty and are kept on the free list (reported as free by vmstat). A cache hit occurs when a needed page is already in memory; this hit may be recorded as an attach to an existing page or as a reclaim if the page was on the free list. A cache miss occurs when the page needs to be created from scratch (zero fill fault), duplicated (copy on write), or read in from disk or over the network (page-in). Apart from the page-in, these operations are all quite quick, and all misses take a page frame from the free list and overwrite it. Page-out operations occur whenever data is reclaimed for the free list because of a memory shortage. Page-outs occur to all file systems, including NFS, but are often concentrated on the swap space (which may itself be an NFS-mounted file on diskless systems). Disk Array Write Cache or PrestoserveWhen the client system decides to do an NFS write, it wants to be sure that the data is safely written before continuing. With NFS V2, each 8-Kbyte NFS write is performed synchronously. On the server, the NFS write may involve several disk writes to update the inode and indirect blocks before the write is acknowledged to the client. Files that are over a few Mbytes in size will have several indirect blocks randomly spread over the disk. When the next NFS write arrives, it may make the server rewrite the same indirect blocks. The effect of this process is that writing a large sequential file over NFS V2 causes perhaps three times as many writes on the server, and those writes are randomly distributed, not sequential. The network is idle while the writes are happening. Thus, an NFS V2 write to a single disk will often show a throughput of 100 Kbytes/s or less over the network and 300 Kbytes/s of writes at the disk (about 40 random 8-Kbyte writes). And although the network is only 10% busy, the disk may be 80% busy or more, and the data rate being sustained is very poor. There are several possible fixes for this situation. Increasing the amount of data written per NFS write increases throughput, but 8 Kbytes are the practical maximum for the UDP-based transport used by older NFS2 implementations. Providing nonvolatile memory on the server with a Prestoserve or a SPARC Storage Array greatly speeds up the responses and also coalesces the rewrites of the indirect blocks. The amount of data written to disk is no longer three times that sent over NFS, and the written data can be coalesced into large sequential writes that allow several Mbytes/s to be sustained over a 100-Mbit network. NFS V3 and TCP/IP ImprovementsIn Solaris 2.5, two new NFS features were introduced. The NFS version 3 protocol uses a two-phase commit to safely avoid the bottleneck of synchronous writes at the server. Basically, the client has to buffer for a longer time the data being sent, in case the commit fails. However, the amount of outstanding write data at the server is increased to make the protocol more efficient. As a separate feature, the transport used by NFS V2 and V3 can now be TCP as well as UDP. TCP handles segment retransmissions for NFS, so there is no need to resend the whole NFS operation if a single packet is lost. This approach allows the write to be safely increased from 8 Kbytes (6 Ethernet packets) to 32 Kbytes (20 Ethernet packets) and makes operation over lossy wide-area networks practical. The larger NFS reads and writes reduce both protocol and disk seek overhead to give much higher throughput for sequential file accesses. The mount protocol defaults to both NFS V3 and TCP/IP with 32-Kbyte blocks if they are supported by both the client and the server, although NFS V3 over UDP with 32-Kbyte blocks is a little faster on a “clean” local network where retransmissions are unlikely. The use of larger blocks with NFS V3 assumes that there is good spacial locality. When the workload consists of small random reads, the extra size of each transfer slows down the read rate. It may be helpful to use a mount option to reduce the block size to 8 Kbytes and tune the server to allow a larger multiblock read-ahead. The Cache File SystemCachefs was introduced in Solaris 2.3 and is normally used to reduce network traffic by caching copies of NFS-mounted data to local disk. Once a cache directory has been set up with cfsadmin(1M), file data is cached in chunks of 64 Kbytes for files of up to 3 Mbytes total size (by default). Bigger files are left uncached, although it may be worth checking the sizes of commonly used commands in your environment to make sure that large cache-mounted executables (e.g., Netscape Navigator) are not being excluded. When new data is read, extra work must be done. The read is rounded up to a 64-Kbyte chunk and issued to the NFS server. When the data returns, it is written to the cachefs store on disk. If the data is not subsequently reread several times, a lot of extra work has gone to waste. When data is written, any data in the cache is invalidated by default. A subsequent read is needed to reload the cachefs entry on disk. This requirement is not too bad: A copy of the data is being written in RAM in the page cache, so it is likely that subsequent reads will be satisfied from RAM in the short term. Another issue is the relative speed and utilization of the local disk compared to the NFS server. A fast NFS server, over a lightly loaded 100-Mbit network, with striped NVRAM accelerated disks, may make cache misses faster than cache hits to a slow, busy, local disk! If the disk is also handling paging to and from a swap partition, the activity from paging and the cache may often be synchronized at the time a new application starts up. The best use for cachefs is with a busy, low-bandwidth network—a reduction in network load can dramatically improve client performance for all the users on the network. The best file systems to cache are read-only or read-mostly. You can check the cache hit rate with the cachefsstat(1M) command that was included in Solaris 2.5. If you are running Solaris 2.3 or 2.4, you can’t determine the hit rate, but I was getting between 97% and 99% hit rate on a read-only, application-distribution NFS mount. A development of cachefs is the Solstice Autoclient application (see http://www./Solstice). This application lets you run systems with local disks as cache-only clients. The local disks run the entire operating system installation from cache, avoiding the high network loads and low performance associated with diskless systems while providing highly automated operation with low administration costs. |
|