Defines how application programs invoke requests for services provided by the OS
Provides library calls for user programs to invoke OS services
Examples: macOS API, Windows API, Linux C runtime and libraries, etc.
The order in which multi-byte values are stored in a computer is called its Endianness. There are two choices, Big Endian and Little Endian. [ref1][ref2]
LSB/MSB: The "most significant" byte of a multi-byte value is the digit representing the largest value. In normal (Western) written numbers, it is the leftmost digit. For example, in the decimal number 1992, the digit '1' is the "most significant digit." In the hex number 0xAA55B2E3, the byte AA is the most significant byte (MSB), and E3 is the least significant byte (LSB).
The concept can be applied to bits, as well. In the number 01001001, the most significant bit is occupied by a 0, and the least significant bit is occupied by a 1.
Multi-byte storage is defined as follows:
Big Endian: the LSB is stored at the highest memory address, in a multi-byte field.
Little Endian: the LSB is stored at the lowest memory address, in a multi-byte field.
Example: Suppose a 32-bit value is stored using an instruction like the following pseudo-assembly: mov [R1], 0xAA55B2E3
Suppose register R1 points to memory address 0x100. The bytes would be stored as follows:

Address    Little Endian    Big Endian
0x100      E3               AA
0x101      B2               55
0x102      55               B2
0x103      AA               E3
Note: observe how, in Little Endian representations, the lower-order bytes do not change when interpreting the value beginning at address 0x100 as a 32-bit or 64-bit number. This fact leads to architectural efficiencies that tend to favor Little Endian systems, which are generally more prevalent.
Where Little Endian storage can lead to confusion is in the interpretation of multi-byte values, because, when reading them in order, from lower addresses to higher addresses, one byte at a time, they seem to be backwards, compared to normal written notation.
In computer systems, Endianness is important when storing multi-byte numbers in RAM, on disk, or even in network packets. In TCP/IP networks, multi-byte values for the IP address and port number are stored in the packet in Network Order, or Big Endian, and in RAM in Host Order, which is Little Endian. In socket programming, this requires a conversion between the formats.
A computer architecture may specify the use of only one Endianness (such as x86, which supports only Little Endian), or both (for example, RISC-V supports either Big or Little Endian implementations).
Integers may be stored in a number of different internal formats, according to the architectural specification. Because the ALU has to perform mathematical operations on integers, it needs to know how to interpret the bit values, according to their storage.
Unsigned Integers. Unsigned integers are normally stored as a direct conversion from their decimal value to the binary equivalent. For example, an unsigned 16-bit integer can store the values 0 through 65,535.
Signed Integers. Signed integers allow positive and negative values, but there are different ways to represent them.
Two's Complement: to store a negative value, represent its positive binary value, invert all the bits, and add 1. Two's Complement is architecturally efficient, because (disregarding overflows) the mathematical operations on positive and negative numbers work the same way.
Sign and Magnitude: in this format, the most significant bit represents the sign (0 for positive, 1 for negative), while the remaining bits represent the magnitude of the value.
Example: The eight bits 10000001 would represent the value 129 as an Unsigned Integer, the value -1 using Sign and Magnitude, or the value -127 using Two's Complement. Same bits, different meaning, depending on the architectural specification and data type.
Floating Point Numbers
Numbers with fractional parts can be represented in binary using a "fixed point" format, where the location of the radix point is fixed, or a "floating point" format, where the radix point can be anywhere relative to the significand (the significant digits). Similar to integers, the storage format must specify how to handle negative numbers.
Main Memory - Volatile, stores data and programs. Also called "real memory" or "primary memory"
I/O Modules - Move data between computer and external environment, such as hard disk, display, keyboard, etc.
System Bus - Facilitates communication among processors, main memory, and I/O modules
In smaller form factor computers, such as phones and tablets, many or all of the above components can be combined and fabricated into a single unit called a System on Chip, or SoC.
Inside the CPU
Processor architectures vary, but several components are common in all general purpose CPUs:
Registers. A CPU contains the following registers:
Instruction Pointer (IP). Also may be called Program Counter or PC. It holds the memory address of the next instruction to be executed. ($EIP in x86, $RIP in x64)
Base Pointer (BP). Holds the base address of the current stack frame on the execution stack. ($EBP in x86, $RBP in x64)
Stack Pointer (SP). Points to the address at the top of the execution stack. ($ESP in x86, $RSP in x64)
Flags or Program Status Word (PSW). This term is used for the collection of registers and flags that are used by the CPU to track the status of execution. For example, conditions like overflow and carry for arithmetic operations, current privilege level, and interrupt status flags are part of the PSW. Some other terms may be used, such as EFLAGS in the Intel architectures. These values are sometimes visible to the programmer and sometimes not. They are generally set as a result of execution and not modified directly by the programmer.
Execution Pipeline. The execution pipeline performs the fetch, decode, execute cycle. Modern processors are deeply pipelined, meaning many instructions are in flight at one time. The execution unit often includes components like a prefetch module, a load/store module, ALUs for integer and memory operations, and FPUs for floating point operations, as well as circuits to handle interrupts.
A superscalar processor has more than one execution unit in the same CPU. In this design, multiple different machine instructions (not dependent on each other) may be executed simultaneously.
Interrupts and Exceptions
There is a great deal of inconsistency in the use of the terms interrupt, exception, trap, hardware interrupt, software interrupt, syscall, etc. Our goal here is to adopt some standard definitions to use in this course. The term 'trap' has historically been used to mean a software interrupt, but we'll avoid it since it seems to be the least well defined.
Interrupts are the fundamental way in which execution activity occurs in an OS. There are two distinct types:
Hardware Interrupt. An asynchronous condition that occurs due to some condition not specific to the instructions being executed. For example, a timer, I/O ready, or hardware interrupt signal coming from a device. A hardware interrupt must be checked for at some point in the execution pipeline (fetch-decode-execute) by the processor, typically at the end. In a multi-stage execution pipeline, a hardware interrupt may lead to a flush of any in-flight instructions.
Software Interrupt. A software interrupt is a change in execution flow to kernel mode caused by an instruction. Examples in Intel architectures include INT 3 (used during debugging), INT 0x80 (general syscall), and SYSCALL/SYSENTER.
Separate from interrupts, we have exceptions:
Exception. An exception occurs as a byproduct of an unexpected result of program execution. Examples include overflow, division by zero, and a segmentation fault (segfault).
In all these cases, there is a transfer of control from the running process to some other code. In the case of both hardware and software interrupts, an interrupt service routine (ISR) is invoked. ISRs are kernel-mode code, supplied by the operating system (or perhaps a 3rd party device driver).
In the case of exceptions, the OS runtime system will look to see if there is a defined exception handler for the condition that occurred. Exception handlers can be supplied by the OS or by application code, and they may run in kernel mode of the processor or user mode.
Mode Switch vs. Context Switch
Context Switch. When privileged access is required, a context switch between the user program and the kernel must be performed. A context switch occurs when the user program execution is stopped, the current state is saved and offloaded from the processor, and the kernel is swapped in to complete the protected task. Once the operating system completes the request, the kernel will stage any results to be returned to the user process, and the kernel is swapped out in favor of the user process. Execution continues from that point. A context switch is performed by the operating system.
A context switch may occur, for example, due to a software interrupt, a page fault, or the OS swapping in a new user process to run on the CPU for a while. Even though context switches are highly optimized for performance, software that causes an excessive number of them can still incur a tremendous performance penalty.
Mode Switch. The term mode switch is used to describe a change in the processor between its unprivileged mode or "ring" (ring 3 on Intel CPUs), and its privileged "ring" (ring 0 on Intel CPUs). In ring 0, all machine instructions and all regions of memory are accessible; in ring 3, only user memory and unprivileged machine instructions are available. A mode switch is performed by the CPU.
The POSIX standard specifies a number of libraries that must be made available to programs. These libraries each have a set of functions that are available for user programs to call, such as printf(). To include a specific library, we just use the #include directive at the top of a C program. Included libraries are loaded by the OS as they are needed. When we refer to a function call to invoke services provided by an OS library, we often just refer to it as a library call.
Some library functions perform a calculation or service that can be accomplished entirely in user mode, without a mode switch. Many other library functions provided by the OS are actually just 'wrappers' that, when called, validate their input and then in turn invoke one or more system calls.
A system call is an entry point for requesting OS services that require privileged access to the hardware. System calls are fundamental to the interface between the architecture and the OS, and are among the first things an OS designer must define.
Tracing Library and System Calls
One method for understanding how programs use system calls and library calls is to trace program execution using ltrace and strace. These two programs monitor execution and report either the library calls used or the system calls, respectively.
Library Call Tracing
To begin, let's look at a simple program that prints "Hello World". This program defines a string called hello that references the string "Hello World." The string is then printed using puts(), which puts the string to standard out, like printf(). Since we are not doing any formatting, printf() is not required.
What we would like to do is see the library as it is used during execution. We can normally do this with ltrace. Current versions of Linux will compile programs using gcc in a way that breaks ltrace (preventing it from intercepting library calls), unless we disable something called Position Independent Execution, or PIE, with -no-pie. Doing that, we can see that, yes, indeed this program calls puts(), a library function call:
Notice there are a lot of SYS_* calls (shown when ltrace is run with the -S option). What are these? These are actual system calls being executed. The one of interest to us is SYS_write(1, "Hello, World!\n", 14).
Conclusion. From this example, we conclude that:
Our main function calls puts(), a library function defined in stdio.h.
The puts() library function invokes the write system call
The write system call is what the OS kernel actually executes.
Invoking a System Call Indirectly, using a Library Function
The Unix/Linux API (unistd.h) provides a way to cause a particular system call to be invoked: syscall(). The syscall() interface is as follows:
.--- System Call Number
syscall(long number, ...)
'---- Remaining Arguments to the system call
The first argument, the system call number, is a way to specify which system call you would like invoked. Each system call has a unique number assigned to it, and the numbering is architecture- and operating-system-dependent. For example, in the x86_64 (64-bit) Intel architecture, the write system call is number 1. Let's rewrite our program to use syscall() to write hello world:
#define _GNU_SOURCE
#include <unistd.h>   // syscall()

int main(void) {
    char *hello = "Hello, World!\n";
    syscall(1, 1, hello, 14);  // 1: number of the write syscall (x86_64)
                               // 1: file descriptor for stdout
                               // 14: number of bytes to write
    return 0;
}
Note that the arguments following the system call number match the arguments to the write() system call, which we learned from doing the ltrace above. Now we can run this program to see the output and do another trace of it (again using -no-pie to ensure ltrace sees the library calls):
Our main function calls syscall(), a library function defined in unistd.h.
The syscall() library function invokes the write system call.
The write system call is what the OS kernel actually executes.
Invoking a System Call Directly, using Assembly Language
The fact that there are library functions that will in turn invoke system calls for us is an abstraction, for simplicity and convenience. However, the actual mechanics of a system call, involving a mode switch to the kernel, are normally defined by assembly-language functions that execute a special machine instruction on the CPU.
The only way to directly invoke a system call (with an actual switch to kernel code -- a mode switch to ring 0 on the CPU) is by writing our program not in C, but rather in assembly language. In the listing below, the mov instructions act as assignments, placing values in the registers that hold the arguments to the system call. Each architecture defines the machine instruction for system call invocation. Intel architectures use int 0x80 (interrupt 0x80, for the 32-bit binary interface), sysenter (a fast 32-bit entry point), or syscall (for the 64-bit binary interface), for example.
;; char *hello = "Hello, World!\n";
hello db "Hello, World!", 0x0a
;; syscall(60, 0);   exit with status 0
mov rax, 60     ; system call number for exit (x86_64)
mov rdi, 0      ; exit status
syscall
Compiling and running this assembly program looks a bit different, but the result is the same.
If we run ltrace on this program to see the library calls, we get nothing! That's because we are no longer using a Unix/Linux userspace library at all -- we are now just using the OS system call interface in its purest form, directly to kernel code that communicates with the hardware architecture:
$ ltrace -S -n 3 ./hello > /dev/null
Couldn't find .dynsym or .dynstr in "/proc/292/exe"
To see what system calls are executing, we can use strace. When we do that, we see that, yes, in fact we are still executing a write(). We also see the execve() system call, which is the system call that starts the execution from the command shell. This illustrates invocation of an OS service at the lowest level possible by an application program:
The Instruction Set Architecture (ISA) is defined by the CPU designer.
The Application Binary Interface (ABI) is a set of agreements, between the CPU designer and OS architect, about how programs will execute on the hardware.
The Application Programming Interface (API) is designed by the OS designer, and represents how services are presented to applications by the OS.
A hardware interrupt is an asynchronous condition that occurs due to some condition not specific to the instructions being executed.
A software interrupt is a change in execution flow to kernel mode caused by an instruction.
An exception happens as a byproduct of the unexpected result of program execution.
A system call invokes OS services that must run in kernel mode for protection and safety.
The library functions provided by the Unix/Linux API are a convenient interface to cause execution of the real (machine-level) syscall routines by the kernel.
The shell command ltrace shows us the library functions invoked by a program.
The shell command strace shows us a list of kernel-level system calls invoked by a program.
The only way to bypass the library calls provided by the Unix/Linux API and invoke a syscall to the kernel directly instead of indirectly is to write assembly code. This essentially bypasses the API in order to access the ABI and ISA directly.