This assignment can be found here, and the starting code can be found in the pset3 directory here.
The goal is to implement functioning kernel virtual memory management, and use it to implement syscalls such as fork.
What we start off with is a basic kernel with example programs loaded in, along with a display that visualizes page tables:
By default, WeensyOS spawns 4 programs which map pages sequentially until its memory is exhausted.
You may notice physical memory looks nearly identical to the virtual memory of each process. This is because there is also no isolation between processes or the kernel, aka an 'identity mapping'.
Kernel and process isolation
The first step of this assignment is to implement basic isolation between processes, preventing them from touching memory they shouldn't.
I chose the strategy of first applying default protections to the kernel's page table, then copying it it to each process:
This gives us the desired isolation where processes can only read and write to their own memory:
Kernel page allocator
Processes are given dedicated space in physical memory (defined by PROC_START_ADDR), a limitation which prevents them from using more than PROC_SIZE bytes of memory.
First we need to implement a simple page allocator, I've chosen to implement an extremely simple free list.
How a free list works is that chunks of unallocated pages form a single-linked list:
To fill this list, we use allocatable_physical_address to check if a physical page is allocatable, pushing them all to the list in kernel_start:
kalloc and kfree now have a trivial implementation, pushing and popping from the list:
The pages array provided by WeensyOS is how the memory viewer keeps track of physical pages, including a reference count:
This reference count is useful because multiple processes can share the same physical memory, whether it be from COW or read-only executable sections.
To facilitate reference counting I've made functions to acquire and release references to shared physical pages:
The kacquire function simply increments the reference count of the page, and krelease decrements it, freeing the page when it hits zero.
Virtual memory
In order to implement virtual memory we need to replace the primitive 1:1 page mapping with ones that takes pages from kalloc, but first we will add support for execution protection.
No-execute (NX) protection
This wasn't part of the assignment but I thought it was a cool feature to add.
Without NX protection all memory is executable, this is problematic as it makes buffer overflow exploits extremely powerful. To solve this security issue we will mark pages that do not contain code as not executable with the NX bit.
In several places WeensyOS mistakenly uses a 32 bit integer to store page protection flags, this truncates the top 32 bits that contain NX and some other useful OS flags.
The fix is fairly simple, just replace all of these instances of int with uint64_t:
- inline void map(uintptr_t pa, int perm);
+ inline void map(uintptr_t pa, uint64_t perm);
- inline void map(void* kptr, int perm);
+ inline void map(void* kptr, uint64_t perm);
- [[gnu::warn_unused_result]] int try_map(uintptr_t pa, int perm);
- [[gnu::warn_unused_result]] inline int try_map(void* kptr, int perm);
+ [[gnu::warn_unused_result]] int try_map(uintptr_t pa, uint64_t perm);
+ [[gnu::warn_unused_result]] inline int try_map(void* kptr, uint64_t perm);
- int perm_;
+ uint64_t perm_;
- static constexpr int initial_perm = 0xFFF;
+ static constexpr uint64_t initial_perm = 0xFFE0000000000FFF;
-inline void vmiter::map(uintptr_t pa, int perm) {
+inline void vmiter::map(uintptr_t pa, uint64_t perm) {
-inline void vmiter::map(void* kp, int perm) {
+inline void vmiter::map(void* kp, uint64_t perm) {
-inline int vmiter::try_map(void* kp, int perm) {
+inline int vmiter::try_map(void* kp, uint64_t perm) {
Some extra defines can be added to x86-64.h:
In order to track which segments of memory are executable, a few more modifications to the kernel and its linker script are necessary:
Finally, execution protection can be enforced for the kernel in kernel_start:
To check if this actually works you can mark the text segment with the XD bit, WeensyOS will crash on startup because it's trying to execute memory that is now protected:
If the XD bit is cleared from the text segment, the OS continues running as normal, yay!
Process allocation
So far pages are manually mapped to a process from a few places, to make it simpler I've created a single function that takes care of allocation, mapping, copying, and dealing with edge cases:
We can then use this function along with the program_image_segment::executable function defined earlier in process_setup to load ELF segments:
Additionally the syscall_page_alloc function can use proc_alloc rather than mapping directly:
The result
Now that virtual memory is implemented, WeensyOS now looks like this:
Instead of a 1:1 mapping, pages are allocated on-demand, this allows greedy processes like #4 to consume much more than PROC_SIZE bytes of memory.
Fork
The final steps of the assignment are to implement the fork syscall.
If you are not familiar with fork it essentially duplicates your process incl. its memory and execution state, for more information see the Linux fork(2).
Thankfully, all of the low level stuff is taken care of already, adding a new syscall is as simple as defining it in lib.hh and adding a switch case to exception:
Instead of simply copying all of the processes memory in fork, we only copy pages when they are written to. This optimization is called copy on write (COW) and can be implemented by marking a page read-only then copying it when a page fault occurs.
To tell our exception handler a page is COW, we will use one of the user-defined flags in its page table entry:
The implementation of syscall_fork goes as follows:
In short this function copies the parent's state to a new process, remapping any writeable pages as read-only plus PTE_COW.
Now in the exception function, we can check for page faults occurring on a COW page:
The result
Along with the default program p-allocator.cc, WeensyOS also comes with p-fork.cc and p-forkexit.cc which can be booted using special key codes.
In this case we want to test the fork implementation using p-fork.cc which can be run by pressing f at startup:
This program is a variation of the first but starts with a single process which calls fork 3 times rather than the kernel initializing them.
You may notice that the first few pages are now labelled S, this indicates the same physical address has been mapped to multiple processes.
Exit and kill
The final task of the assignment is to implement the exit syscall and test it with p-forkexit.cc. Later on we will also implement the sleep, kill, mmap, and munmap syscalls as part of extra credit.
Like with fork, you handle syscalls in the kernel by adding cases to the syscall function:
Exit and kill basically do the same thing, so their functionality can just be combined into a single function.
The implementation of proc_kill turns out to be extremely simple:
p-forkexit.cc is a fairly intense test, along with consuming memory and forking it also exits at random intervals. It is quite hard to get past this point without leaking memory, there was a lot of debugging involved.
Like p-fork.cc, this program is also started through a special key combination e:
After running it for awhile we don't experience a crash and no pages are leaked!
Those large strips of Ses are COW pages that haven't been written to yet, they are shared between children just like the shared read-only segments from before.
Sleep
To implement this syscall you must first understand how the scheduler works in WeensyOS.
Here are a few notable functions we start with:
void schedule() - Picks the next runnable process and runs it.
void run(proc* p) - Sets the current variable and uses exception_return to resume the process.
void exception(regstate* regs) - The exception handler called by k-exception.S, including APIC timer interrupts.
On startup kernel_start initializes the APIC timer, a neat bit of hardware that sends periodic interrupts to the processor. Along with keeping track of time with the ticks variable, this timer allows the kernel to prevent processes from hogging CPU by interrupting them.
When an interrupt or exception happens the processor will save register state, disable interrupts, enter kernel mode, and call the appropriate exception handler. The exception function copies the register state to current->regstate, which is used in the run function when resuming the process.
To resume a process the kernel uses the exception_return function, a small bit of assembly that restores register state and invokes iretq to get back into user mode.
Note that the kernel's state is not saved, you are reset back to the bottom of the stack every time there is an exception (schedule, run, and exception never return).
With this knowledge implementing sleep is much more straightforward, to start I added a field to keep track of when a specific process should resume from sleep:
Like the global ticks variable, the wake field is based on the frequency of the APIC timer (HZ), but the syscall will take its duration in milliseconds:
Now rewrite the schedule function to skip any processes that are not ready to wake:
Hold up, this introduces some unexpected behavior!
It turns out the exception function didn't anticipate any non-fatal exceptions to happen while in kernel mode.
The problem is that it unconditionally stashes register states to the previous process regardless of if the exception actually happened in that process. This leads to clobbering of the the state of the last running process with the state of the kernel while its halted above.
A simple solution is to only set current->regs if the lower 2 bits of the cs register (the privilege level) is non-zero:
mmap, munmap
Currently all memory management in a process is done through the very primitive sys_page_alloc syscall, this has several limitations such as:
Can only map pages at specific address, can't find free ones automatically
No way to free pages (munmap).
No way to map executable or read-only pages (such as in a dynamic linker).
No way to share memory between a parent an child process.
To solve these issues I've implemented the mmap and munmap syscalls, starting with definitions in lib.hh / u-lib.hh:
Note that MAP_ANONYMOUS is not defined because there is no filesystem, all mappings are anonymous by default!
Like other syscalls, mmap and munmap get their own functions:
The syscall_mmap function is probably the most complex so far as it has to handle several edge cases and failure conditions, its implementation goes as follows:
Because try_map / kalloc can fail at any point, the implementation of mmap writes exiting page table entries to a journal that can be rolled back.
A similar is approach is used to release pages with munmap:
Testing all the things
In order to test all the new syscalls we need to create custom user-mode programs, thankfully the process is pretty straightforward.
First, create the program:
This test covers most of the functionality of mmap, sleep, kill, munmap, then panics with a success message after everything checks out.