The OS has a number of routine responsibilities in a paged virtual memory environment:
New Process Creation: determine how large the process image will be, and create a page table for the process. Create and initialize a 'swap area' for the process on disk.
Process Scheduled to Run: Reset the MMU, flush the TLB. Set a pointer to the new process' page table. Optionally, bring some or all of the process' pages into memory before it runs.
Page Fault: Determine which page is needed, and locate it on the disk. Find an available frame for the new page, evicting some other page if needed. Read the required page from disk into the memory frame. Reset the Program Counter so that execution starts over with the instruction that caused the page fault.
Process Exit: Release the process' page table, its pages, and the disk space the pages occupy when swapped out.
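The per-process bookkeeping above can be sketched in a few lines. The `Process` class, `PAGE_SIZE` constant, and field names below are hypothetical stand-ins, not a real kernel interface:

```python
# Sketch of per-process paging state (hypothetical names): creation builds a
# page table and reserves swap slots; exit releases both.

PAGE_SIZE = 4096

class Process:
    def __init__(self, pid, image_bytes):
        self.pid = pid
        # One page-table entry per page of the process image (round up).
        npages = (image_bytes + PAGE_SIZE - 1) // PAGE_SIZE
        self.page_table = [{"present": False, "frame": None} for _ in range(npages)]
        # Reserve a swap-area slot on disk for every page.
        self.swap_slots = list(range(npages))

    def exit(self):
        # Release the page table, pages, and swap slots.
        self.page_table = []
        self.swap_slots = []

p = Process(pid=1, image_bytes=10000)   # 10,000 bytes -> 3 pages of 4 KB
```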
The OS also needs to manage a wide variety of related issues that can occur in a paged virtual memory environment:
Local vs. Global Allocation
As noted in the previous lecture, when a virtual memory page is chosen for replacement, it may be chosen from among the pages used by the process causing the page fault (local replacement), or it may be chosen from among all the pages currently in memory (global replacement), as illustrated in the following example:
Local vs. Global replacement. (a) Process A issues a page fault for page A6. (b) A local policy might replace page A5. (c) A global policy could choose to replace page B3, from process B.
In general, global algorithms perform better, especially when a process' working set size varies over its lifetime.
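The difference can be sketched with a toy frame table (all names below are hypothetical), using least-recently-used timestamps to pick the victim; it reproduces the figure's outcome, with A5 chosen locally and B3 globally:

```python
# Local replacement picks a victim among the faulting process's own frames;
# global replacement may pick any resident frame. Victim = least recently used.

frames = [  # frame table: owner process, page, and a last-use timestamp
    {"owner": "A", "page": "A1", "last_used": 5},
    {"owner": "A", "page": "A5", "last_used": 1},
    {"owner": "B", "page": "B3", "last_used": 0},
]

def pick_victim(frames, faulting, policy):
    pool = [f for f in frames if f["owner"] == faulting] if policy == "local" else frames
    return min(pool, key=lambda f: f["last_used"])["page"]

local_victim = pick_victim(frames, "A", "local")    # oldest of A's frames
global_victim = pick_victim(frames, "A", "global")  # oldest frame overall
```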
Determining the proper page size is important for performance. A common page size is about 4K.
Larger page size: leads to more internal fragmentation, because on average half of a process' last page is wasted, and many programs are small and will need only a fraction of a single page.
Smaller page size: reduces internal fragmentation, but increases the size of the page table. It also reduces the fraction of the address space the TLB can cover (a larger TLB is needed to achieve the same TLB hit ratio).
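The tradeoff can be quantified with a classic back-of-the-envelope model due to Tanenbaum: for average process size s and page-table entry size e, the space overhead of page size p is s*e/p (the page table) plus p/2 (average internal fragmentation in the last page), which is minimized at p = sqrt(2se):

```python
import math

# Overhead model: overhead(p) = s*e/p + p/2, minimized at p = sqrt(2*s*e).
def overhead(s, e, p):
    return s * e / p + p / 2

s, e = 1 << 20, 8             # assume a 1 MB average process and 8-byte PTEs
p_opt = math.sqrt(2 * s * e)  # optimal page size: 4096 bytes for these values
```

With these (illustrative) numbers the optimum lands exactly on the common 4 KB page size; both much smaller and much larger pages cost more overall.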
Programs may share memory pages that include code, data, or both. Suppose processes A and B are sharing memory pages. Process A is chosen for eviction, having its pages swapped to disk, but process B keeps running. Process B must generate page faults to reclaim all the shared pages, which is not efficient. Similarly, if process A actually terminates, the OS needs to know the shared memory pages should not be reclaimed, since B still needs them. This is usually done by counting pointers to a page.
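The reference-counting bookkeeping can be sketched as follows (hypothetical structures, not a real kernel interface): a shared frame is reclaimed only when its count drops to zero.

```python
# Per-frame reference counts: increment on map, decrement on unmap,
# reclaim the frame only when no process maps it any longer.

refcount = {}

def map_page(frame):
    refcount[frame] = refcount.get(frame, 0) + 1

def unmap_page(frame):
    refcount[frame] -= 1
    if refcount[frame] == 0:
        del refcount[frame]          # safe to reclaim: no process maps it
        return "reclaimed"
    return "kept"                    # another process still maps it

map_page(7)                          # process A maps frame 7
map_page(7)                          # process B shares frame 7
status_after_A_exits = unmap_page(7)   # kept: B still needs it
status_after_B_exits = unmap_page(7)   # reclaimed
```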
Copy on Write
After a UNIX fork system call, the parent and child processes get a copy of the parent's program code and data. In a paged system, each process (parent and child) may be given its own page table, but initially both of them point to the same set of pages. The data pages are marked as Read-Only. If both processes just read data, there's no problem. But, if one process modifies the data, the Read-Only protection causes an exception and a mode switch to the OS, which makes a copy of the page, so each process now has its own local copy. This strategy is called copy on write (COW). It is efficient because it takes a lazy approach, copying memory pages only when necessary. If the OS always made copies of data pages after a fork, it would inevitably be a wasted effort much of the time.
Processes A and B may share pages, for example if B forks from A. Initially, A and B's page tables point to the same pages. If either A or B modifies data in a page, a copy is made and each process gets a local copy. This lazy approach to making local page copies is called copy-on-write (COW).
Note: a vulnerability in the Linux kernel's copy-on-write implementation was identified in 2016. Dubbed "Dirty COW" (CVE-2016-5195), it has since been patched, but at the time it enabled a privilege escalation attack against Linux.
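A minimal COW sketch, using dictionaries as stand-in page tables (all names below are hypothetical): fork shares every page, and the first write copies just the page being modified.

```python
# Copy-on-write: parent and child share physical pages after "fork";
# a write to a still-shared page triggers a lazy copy of that page only.

pages = {0: bytearray(b"data0"), 1: bytearray(b"data1")}   # physical pages
parent = {0: 0, 1: 1}            # virtual page -> physical page
child = dict(parent)             # fork: share all pages, copy nothing yet
next_phys = 2

def cow_write(table, vpage, data):
    global next_phys
    phys = table[vpage]
    if sum(t[vpage] == phys for t in (parent, child)) > 1:  # still shared?
        pages[next_phys] = bytearray(pages[phys])  # copy the page lazily
        table[vpage] = next_phys
        next_phys += 1
    pages[table[vpage]][:len(data)] = data

cow_write(child, 0, b"CHILD")    # child gets a private copy of page 0;
                                 # page 1 remains shared, untouched
```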
Shared libraries, such as .so or .dll files, support dynamic linking of executables. The OS needs to make sure that the needed shared libraries are loaded somewhere in memory, so executable programs can call them.
When a process invokes a shared library, the OS "maps" the shared library into the process' virtual address space. However, if a library is used by two processes the OS will likely not map a given library in the same address-space location for process 1 as for process 2 -- each believes the library is at a different virtual address. The problem can be fixed by creating multiple copies of the library in memory, but that defeats the purpose of shared libraries.
A shared library being used by two processes.
The better solution is to compile shared libraries with a special compiler flag (e.g., gcc's -fPIC) that produces position independent code. Position independent code uses relative references for all locations (i.e., n bytes up or down from the current location) instead of absolute references. As a result, the library can be mapped at a different virtual address in each process without modification. As an aside, attackers writing shellcode also need to make the code position independent, since they may not be able to control where the shellcode gets loaded in memory.
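The idea can be illustrated with simple address arithmetic: a relative reference resolves to the right place no matter where the code is loaded, while an absolute address is only correct for one load address.

```python
# A relative reference means "the target is rel bytes from here"; it is
# resolved against wherever the code actually landed in memory.

def resolve_relative(load_base, here_offset, rel):
    return load_base + here_offset + rel   # valid at any load address

# The same instruction (at offset 10, referring to offset 13) works
# whether the library is mapped at 0x1000 or 0x8000:
target_a = resolve_relative(0x1000, 10, 3)   # 0x100d
target_b = resolve_relative(0x8000, 10, 3)   # 0x800d
```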
Many operating systems implement the ability to memory map a file -- for example, using the POSIX-compliant mmap() system call. When a file is memory mapped, its pages are copied into RAM on demand, so the in-RAM version acts like a cached copy of the file. Changes the process makes go initially only to the version in RAM; the OS may occasionally write those dirty pages back to the file on disk, or write them back only when the process terminates or unmaps the file.
A memory mapped file.
If two processes map to the same file at the same time, they can communicate with each other directly via the shared memory area.
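Python exposes this facility through its standard mmap module, which makes the cache-like behavior easy to demonstrate: edits land in the RAM mapping first, and reach the file when dirty pages are flushed.

```python
import mmap
import os
import tempfile

# Memory map a small file, modify it through the mapping, and flush
# the dirty pages back to disk.

fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")

with mmap.mmap(fd, 0) as m:      # map the whole file read/write
    m[0:5] = b"HELLO"            # edit the cached copy in RAM
    m.flush()                    # write dirty pages back to the file

with open(path, "rb") as f:
    contents = f.read()          # the file now reflects the edit

os.close(fd)
os.remove(path)
```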
The Cleaning Policy specifies when modified pages should be written out from memory to disk. It's usually taken care of by the paging daemon. There are two basic policies:
Demand Cleaning: write the contents out to disk only when that page is selected for replacement.
Precleaning: write out the contents of some modified pages, even if they are not yet selected for replacement.
Taking either strategy to its extreme is inefficient. Constant precleaning, where every change to RAM is written out to disk right away, essentially creates a write-through cache and defeats the purpose of buffering the pages in RAM. Pure demand cleaning, with no precleaning at all, means that when memory fills up, a large batch of dirty pages may need to be written to disk before any frame can be reused. In practice, some blend of the two works well: keep at least a small pool of clean pages (their modifications already written to disk) so frames can be reclaimed quickly, and write dirty pages out less frequently, in groups.
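A blended policy can be sketched as a paging daemon that precleans a batch of dirty pages whenever the clean pool runs low (the names and thresholds below are illustrative, not from any real kernel):

```python
# Paging daemon sketch: when clean pages fall below a low-water mark,
# "write" a batch of dirty pages to disk, turning them clean, so that
# eviction never has to wait on a disk write.

LOW_WATER, BATCH = 2, 3

def paging_daemon_tick(clean, dirty):
    if len(clean) < LOW_WATER:
        batch, rest = dirty[:BATCH], dirty[BATCH:]
        return clean + batch, rest   # batch is now clean; rest stays dirty
    return clean, dirty

clean, dirty = paging_daemon_tick(["p1"], ["p2", "p3", "p4", "p5"])
```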
Page Fault Handling
A Page Fault is generally handled by the hardware and OS as follows:
A hardware interrupt is generated, which leads to a mode switch to the kernel. Some information about the state of the current instruction is saved.
An assembly routine (interrupt service routine, or ISR) saves the registers and flags, then calls the kernel's page fault handler.
The kernel identifies the virtual address required, then verifies it is a valid address and checks it against the access protections.
The kernel looks for a free page frame. If no frame is free, the page replacement algorithm executes, to identify a page to be replaced.
If the frame is dirty, it is scheduled for transfer to disk. A context switch occurs. The frame is marked as 'busy'. The process incurring the page fault is blocked, and another process runs while the needed page is copied from disk (swapped in).
When a disk interrupt indicates the needed page is in memory, the page tables are updated and the frame is marked as 'ready'. The kernel moves the faulting process from the blocked state to the ready state.
The faulting process is scheduled again. When selected for execution, its context is restored.
Some minor variation among architectures and operating systems is possible.
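The sequence above can be condensed into a sketch (hypothetical structures; the disk I/O and process blocking are elided to a comment):

```python
# Page fault handling sketch: validate the address, find or free a frame,
# "read" the page from disk, update the page table, restart the instruction.

VALID_PAGES = {0, 1, 2, 3}

def handle_fault(page_table, free_frames, vpage, pc):
    if vpage not in VALID_PAGES:
        return ("SIGSEGV", pc)              # invalid address: kill the process
    if free_frames:
        frame = free_frames.pop()
    else:                                   # no free frame: evict a victim
        victim = next(iter(page_table))
        frame = page_table.pop(victim)
    # (disk read of vpage into frame happens here; the process is blocked
    #  and another process runs until the disk interrupt arrives)
    page_table[vpage] = frame               # update the page table
    return ("RESTART", pc)                  # re-run the faulting instruction

pt = {}
result = handle_fault(pt, [5], vpage=2, pc=0x40)
```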
It is sometimes useful for the OS to 'pin', or lock, certain memory pages, which means they cannot be swapped out to disk. For example, the kernel itself and any page tables would normally be pinned. Additionally, a page holding a buffer that is receiving an I/O transfer (such as via DMA) may be pinned to memory while the I/O transfer is going on, to avoid the wasted effort that would result from swapping the page out, swapping it back in, and restarting the I/O transfer.
Swap Partition / Backing Store
When memory pages need to be swapped to disk in a virtual memory system, they need a defined location on disk. The simplest strategy for swapping memory pages to disk is to have a dedicated partition, sometimes called a swap partition or backing store. It can be a separate partition on the same disk, or a separate disk entirely. Each process has an identified area of the swap partition assigned to it.
There are two basic strategies for creating the swap area for a process:
Static: Identify the disk offset where the process' area begins, and pre-allocate a contiguous area for all the pages that process needs. This is simple to implement, but does not allow a process' memory image to grow beyond its initial size.
Dynamic: Create a 'disk map' that dynamically maps a process' swapped pages to their locations on disk, which may not be contiguous. This is more complex to implement, but is more space efficient and flexible when a process' memory area grows in size.
The strategies are illustrated in the figure:
(a) Paging to a static swap area. (b) Backing up pages dynamically. (Tanenbaum)
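The two strategies can be sketched side by side (hypothetical structures). Static: a contiguous reservation, so page i always lives at base + i. Dynamic: a per-page disk map, so pages can land in any free slot and the area can grow.

```python
# Static swap area: page i of a process lives at a fixed, contiguous offset.
def static_slot(base, page):
    return base + page

# Dynamic swap area: a disk map assigns each swapped page an arbitrary slot.
class DiskMap:
    def __init__(self, free_slots):
        self.map, self.free = {}, list(free_slots)

    def slot_for(self, page):
        if page not in self.map:
            self.map[page] = self.free.pop(0)   # allocate on first swap-out
        return self.map[page]

dm = DiskMap([9, 4, 7])
slots = [static_slot(100, 2),  # static: page 2 is always at 102
         dm.slot_for(0),       # dynamic: page 0 gets slot 9
         dm.slot_for(2),       # page 2 gets slot 4 (not contiguous)
         dm.slot_for(0)]       # page 0's slot is remembered
```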
Free Memory Management
The operating system must keep track of memory allocated to processes, as well as the free memory still available. There are two common techniques for tracking allocated and free memory:
Bitmaps and Linked Lists
Bitmaps. With bitmaps, memory is divided into equal-sized chunks as units of allocation (for example, pages). Each unit is assigned a bit in the bitmap. The bitmap may use 0 for free and 1 for allocated, or vice versa.
(a) Part of memory with five processes and three holes. (b) The corresponding bitmap. (c) The same information as a list. (image: Tanenbaum)
The size of the allocation unit is important. If the unit is too small, the bitmap will be huge. If the unit is too large, internal fragmentation will be excessive.
An important limitation of bitmaps is the cost of searching through the map for free units of allocation of a sufficient size. Especially if memory is highly fragmented, this search could be expensive.
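A minimal bitmap allocator sketch illustrates why the search is costly: finding a hole of n units requires a linear scan for a run of n zero bits.

```python
# Bitmap allocator: scan for a run of n free units (0 bits), mark them
# allocated (1), and return the starting unit index.

def bitmap_alloc(bits, n):
    run = 0
    for i, b in enumerate(bits):
        run = run + 1 if b == 0 else 0
        if run == n:                       # found n consecutive free units
            start = i - n + 1
            bits[start:i + 1] = [1] * n    # mark them allocated
            return start
    return None                            # no hole large enough

bits = [1, 1, 0, 0, 0, 1, 0, 0]
start = bitmap_alloc(bits, 3)              # allocates units 2..4
```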
Linked Lists. Regions of allocated and free memory can also be tracked using linked lists. For efficiency, a doubly-linked list can be used. The OS can keep one list of all memory regions, with each entry being either allocated or free, or the OS can use two lists -- a free list and an allocated list.
Coalescing Adjacent Deallocated Regions
When an allocated memory region belonging to a process is freed, the memory region may need to be merged with adjacent regions, depending on whether they are allocated or free, as in the following figure:
Four neighbor combinations for the termination process, X.
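Coalescing over a sorted free list can be sketched as follows (illustrative code, holes given as (start, length) pairs): freeing a region between two existing holes merges all three into one.

```python
# Free a region and coalesce: after inserting the new hole in sorted order,
# merge every pair of holes whose ranges are adjacent.

def free_region(free_list, start, length):
    free_list.append((start, length))
    free_list.sort()
    merged = []
    for s, l in free_list:
        if merged and merged[-1][0] + merged[-1][1] == s:   # adjacent holes
            ps, pl = merged.pop()
            merged.append((ps, pl + l))
        else:
            merged.append((s, l))
    return merged

holes = [(0, 10), (30, 5)]
holes = free_region(holes, 10, 20)   # fills the gap: one hole (0, 35)
```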
Free Region Selection
When a process is allocated a new memory region from the heap, an available free region of sufficient size must be found. The OS memory manager may implement any of the following algorithms to select the region:
First Fit. Scan the free list from the beginning, selecting the first free memory region of sufficient size. First Fit is fast because it searches as little as possible.
Next Fit. Scan the free list like in First Fit, but start the search from the location of the last allocation, instead of the beginning of the list. Similarly, choose the first encountered free memory region of sufficient size. Next fit also searches as little as possible, but performs slightly worse in experiments.
Best Fit. Search the entire free list from beginning to end. Choose the smallest free memory region that is at least of sufficient size for the request. Best Fit is slower than First Fit and Next Fit because it searches the entire list every time. It also tends to create tiny regions of useless external fragmentation, resulting in frequent memory compaction.
Worst Fit. Search the entire free list, but choose the largest free memory region. Worst Fit also performs poorly in experiments.
Quick Fit. Maintain separate lists for some of the more common sizes requested. For example, maintain a list of 4K holes, a list of 8K holes, etc. This finds right-sized holes very fast, but still requires more complexity when memory needs to be deallocated and merged.
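First Fit and Best Fit can be sketched over the same free list (illustrative code, holes given as (start, length) pairs), showing how the two policies pick different holes for the same request:

```python
# First Fit: return the first hole large enough, scanning from the front.
def first_fit(holes, size):
    for s, l in holes:
        if l >= size:
            return s
    return None

# Best Fit: scan the whole list and return the tightest adequate hole.
def best_fit(holes, size):
    candidates = [(l, s) for s, l in holes if l >= size]
    return min(candidates)[1] if candidates else None

holes = [(0, 8), (20, 3), (40, 12)]
ff = first_fit(holes, 3)   # the 8-unit hole at 0 comes first
bf = best_fit(holes, 3)    # the 3-unit hole at 20 fits exactly
```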
Sharing pages among processes is useful, and a copy-on-write strategy makes page copying efficient by deferring it until a shared page is actually modified.
Shared libraries are most easily implemented using position independent code.
Memory mapping a file creates a cached version of the file in RAM.
Mapped file pages may be precleaned or demand cleaned. The most efficient implementation avoids either extreme.
The OS and hardware work together to handle page faults.
The OS may use bitmaps or linked lists to manage allocation of free memory regions.