Virtus OS v1 ran one task at a time. The idle task was: "wait for the next user call." This week you replace it with a scheduler that decides which of several running tasks gets the CPU next.
Reading
Required. Petzold, CODE, Ch 22 ("The Operating System"). The section on timesharing is the direct ancestor of this week's work. Petzold traces the motivation: "the computer sat idle most of the time while waiting for a human at a teletype to type the next character. Timesharing was the solution -- divide the CPU's time into small slices and give each user a slice in rotation." The timer interrupt and the context switch are the two hardware mechanisms that make this possible. You are building exactly what Petzold describes.
Required. Patterson and Hennessy, Computer Organization and Design: RISC-V Edition, Chapter 5, section 5.14 (Exceptions and Interrupts in pipelines) and the sidebar on the CLINT (Core-Local Interruptor). The DE10-Nano uses a CLINT-compatible timer for the scheduler tick.
Lecture: Preemption and the Round-Robin Scheduler
From cooperative to preemptive multitasking
Module 2's kernel is cooperative: the kernel runs only when a user process calls ECALL. If the user process enters an infinite loop, the kernel never gets control. Preemptive multitasking solves this with a hardware timer: at regular intervals, the timer fires an interrupt that forces control to the kernel regardless of what the user process is doing.
The RISC-V mechanism: the CLINT timer fires when mtime >= mtimecmp. The CPU raises a machine-timer interrupt (MIP.MTIP = 1). If MIE.MTIE = 1, the processor traps to the handler at mtvec. This is identical to the ECALL trap path from Module 2, except the trap cause is mcause = 0x80000007 (interrupt bit set; cause 7 = machine timer interrupt).
The CLINT
The CLINT (Core-Local Interruptor) is a memory-mapped peripheral. Its registers on the DE10-Nano:
CLINT base address: 0x02000000 (standard SiFive CLINT layout)
+0x0000: msip[0] -- machine software interrupt pending (hart 0)
+0x4000: mtimecmp[0]_lo -- low 32 bits of timer comparison value
+0x4004: mtimecmp[0]_hi -- high 32 bits
+0xBFF8: mtime_lo -- low 32 bits of free-running counter
+0xBFFC: mtime_hi -- high 32 bits
To schedule the next timer tick 10 ms from now (assuming a 50 MHz clock = 50,000 cycles per ms):
li t0, 500000 # 10ms * 50000 cycles/ms
lw t1, CLINT_MTIME_LO # read current mtime
add t1, t1, t0 # target mtime
sw x0, CLINT_MTIMECMP_LO # write 0 to hi first (avoids spurious trip)
sw t1, CLINT_MTIMECMP_LO # then write new deadline
Wait -- the sub-word write atomicity discipline from clint.v's comments: to avoid a spurious interrupt when updating mtimecmp, write the new low value to 0xFFFFFFFF first (pushing the deadline far into the future), then write the new high, then write the new low:
li t2, -1 # 0xFFFFFFFF
sw t2, CLINT_MTIMECMP_LO # set lo = max (deadline far future)
sw t0_hi, CLINT_MTIMECMP_HI # write new hi
sw t0_lo, CLINT_MTIMECMP_LO # write new lo
Process control block (PCB)
Each process has a PCB (Process Control Block) that stores its saved state:
struct pcb {
uint32_t regs[32]; // saved x0-x31
uint32_t pc; // saved program counter
uint32_t satp; // saved page table root
uint32_t mstatus; // saved mstatus (MPP field)
uint32_t state; // RUNNING / READY / BLOCKED / EXITED
uint32_t pid; // process identifier
uint32_t stack_base; // stack physical base address
uint32_t stack_size; // stack size in bytes
};
The OS maintains an array of PCBs: struct pcb process_table[MAX_PROCS].
Context switch
A context switch saves the current process's register state to its PCB and restores the next process's register state from its PCB. This happens at the timer interrupt handler:
# timer interrupt handler (extends the Module 2 trap handler)
# assume regs are saved to the kernel trap frame as before
# 1. Save current process state to its PCB
lw t0, current_pid # get current process index
la t1, process_table
slli t2, t0, PCB_SIZE_LOG2 # PCB offset
add t1, t1, t2 # t1 -> current PCB
# copy saved regs from trap frame to PCB.regs
addi a0, t1, PCB_REGS_OFFSET
la a1, trap_frame
li a2, 128 # 32 regs * 4 bytes
call memcpy
# 2. Save PC (mepc) and satp to PCB
csrr t0, mepc
sw t0, PCB_PC_OFFSET(t1)
csrr t0, satp
sw t0, PCB_SATP_OFFSET(t1)
sw ra, PCB_MSTATUS_OFFSET(t1) # save mstatus from trap entry
# 3. Run scheduler -- returns next PID in a0
call schedule_next
sw a0, current_pid
# 4. Load next process state from its PCB
la t1, process_table
slli t2, a0, PCB_SIZE_LOG2
add t1, t1, t2 # t1 -> next PCB
la a0, trap_frame
addi a1, t1, PCB_REGS_OFFSET
li a2, 128
call memcpy
# 5. Restore PC and satp
lw t0, PCB_PC_OFFSET(t1)
csrw mepc, t0
lw t0, PCB_SATP_OFFSET(t1)
csrw satp, t0
# 6. Reprogram timer for next tick
call reprogram_timer
# 7. Return to restored process
(restore trap frame registers, mret)
Round-robin scheduler
The simplest fair scheduler: each process gets an equal time slice; processes are served in circular order.
uint32_t schedule_next(void) {
uint32_t next = (current_pid + 1) % MAX_PROCS;
uint32_t iterations = 0;
while (process_table[next].state != READY && iterations < MAX_PROCS) {
next = (next + 1) % MAX_PROCS;
iterations++;
}
if (iterations == MAX_PROCS) {
// all processes blocked; run idle process (pid 0)
return 0;
}
process_table[current_pid].state = READY; // was RUNNING
process_table[next].state = RUNNING;
return next;
}
Context-switch cost
Lab 11.1 measures the cycle cost of one context switch. The cost has three components:
- Trap entry: save 32 registers to the trap frame (~40 cycles with sequential sw instructions).
- PCB save/restore: memcpy of the trap frame to/from PCB (~70 cycles for 128-byte copy).
- Timer reprogram and mret: ~20 cycles.
Total expected: ~130-150 cycles per context switch. Compare this against the Lab 2.1 trap round-trip cost (60-120 cycles); the additional cost is the PCB memcpy.
Architecture Comparison Sidebar: Schedulers
| Scheduler | Policy | Overhead | Production use |
|---|---|---|---|
| Round-robin (CSA-201) | Equal time slice, circular | Low | Embedded RTOSes; baseline for comparison |
| Linux CFS (Completely Fair Scheduler) | Weighted virtual runtime fairness | Medium (red-black tree) | Linux 2.6.23+ |
| Windows dispatcher | Priority classes + round-robin within class | Medium | Windows NT kernel |
| FreeRTOS (RTOS) | Fixed-priority preemptive | Very low | Microcontroller RTOSes |
| RISC-V OpenSBI | N/A (firmware, not scheduler) | N/A | Provides M-mode services to S-mode Linux |
Linux CFS maintains a virtual runtime (vruntime) for each process and uses a red-black tree to find the process with the smallest vruntime (the "most deserving" process). This gives a fair share of CPU time without the worst-case latency of strict round-robin (a high-priority task might wait behind many low-priority tasks in round-robin).
Lab exercises
See labs/lab-11-preemption-scheduler.md for the full specification.
Lab 11.1: Round-robin scheduler with two demo tasks; context-switch cost measured. You will implement the CLINT timer interrupt, the PCB structure, and the round-robin scheduler. Two user-mode tasks run concurrently: Task A increments a counter and writes to the OLED; Task B independently counts elapsed ticks and writes to a separate OLED line. Both counters advance independently.
Part A: Implement the CLINT timer interrupt handler in the kernel trap handler (extending the Module 2 handler). Test with a single process: verify the timer fires every 10 ms by measuring ECALL round-trip time before and after introducing the timer interrupt.
Part B: Implement the PCB structure and context-switch code. Test with two processes that each increment a counter; verify both counters advance (neither starves).
Part C: Measure. Read mcycle at the start and end of the context-switch handler (timer interrupt entry to mret). Record the cycle cost. Compare against Lab 2.1's trap round-trip measurement.
Independent practice
-
The mtimecmp sub-word write atomicity problem: if an OS writes the new
mtimecmp_lofirst (with the oldmtimecmp_hi), andmtime_hihappens to match the oldmtimecmp_hiat that instant, a spurious interrupt fires before the new deadline is fully programmed. Walk through the safe write sequence (lo=MAX, hi=new_hi, lo=new_lo) and show why it avoids the spurious interrupt. -
The round-robin scheduler gives each process the same time slice. If one process performs mostly I/O (sleeping most of the time) and another performs mostly computation, round-robin wastes time on the I/O-bound process's time slices. How does Linux CFS handle this differently?
-
Toolchain Diary entry:
OpenSBI. Record what OpenSBI is, why it exists (M-mode firmware that provides SBI calls to the S-mode OS), and how it relates to themsipregister in the CLINT. What SBI call does Linux use to setmtimecmp? -
A context switch saves 32 registers (128 bytes) + PC + satp + mstatus. If the OS runs 16 processes and switches every 10 ms, how many bytes per second does the context switch mechanism write to the PCB array? Is this significant compared to DRAM bandwidth on the DE10-Nano (~3.2 GB/s)?
Reflection prompts
-
The round-robin scheduler marks the outgoing process as READY before selecting the next. Why does it set the state to READY rather than leaving it as RUNNING? What invariant does the state field maintain?
-
The CLINT's
mtimecounter is 64 bits. At a 50 MHz clock (1 tick per cycle), how long beforemtimeoverflows? Does this create a scheduler correctness problem (hint: consider what happens tomtimecmp < mtimecomparisons asmtimewraps around)? -
Module 10's stop-the-world GC requires pausing the scheduler. In the CSA-201 implementation, "pausing the scheduler" means: don't reprogram
mtimecmpduring the GC (so no timer interrupts fire while the GC runs), and resume when done. What does this guarantee, and what does it NOT guarantee? (Hint: what if the GC itself needs to allocate memory?)
What's next
Module 12 is the driver-writing track. The OS is now complete in its core features: privilege separation, virtual memory, PMP protection, GC, and preemptive scheduling. Module 12 connects the OS to hardware peripherals: you will write drivers for the SSD1306 OLED (I2C display), the SD card (SPI block storage), and the ENC28J60 (SPI Ethernet controller), reading each device's datasheet from scratch.