Week 11: Preemption and the Round-Robin Scheduler · CSA-201

Virtus OS v1 ran one task at a time. The idle task was: "wait for the next user call." This week you replace it with a scheduler that decides which of several running tasks gets the CPU next.

Reading

Required. Petzold, CODE, Ch 22 ("The Operating System"). The section on timesharing is the direct ancestor of this week's work. Petzold traces the motivation: "the computer sat idle most of the time while waiting for a human at a teletype to type the next character. Timesharing was the solution -- divide the CPU's time into small slices and give each user a slice in rotation." The timer interrupt and the context switch are the two hardware mechanisms that make this possible. You are building exactly what Petzold describes.

Required. Patterson and Hennessy, Computer Organization and Design: RISC-V Edition, Chapter 5, section 5.14 (Exceptions and Interrupts in pipelines) and the sidebar on the CLINT (Core-Local Interruptor). The DE10-Nano uses a CLINT-compatible timer for the scheduler tick.

Lecture: Preemption and the Round-Robin Scheduler

From cooperative to preemptive multitasking

Module 2's kernel is cooperative: the kernel runs only when a user process calls ECALL. If the user process enters an infinite loop, the kernel never gets control. Preemptive multitasking solves this with a hardware timer: at regular intervals, the timer fires an interrupt that forces control to the kernel regardless of what the user process is doing.

The RISC-V mechanism: the CLINT timer fires when mtime >= mtimecmp. The CPU raises a machine-timer interrupt (MIP.MTIP = 1). If MIE.MTIE = 1, the processor traps to the handler at mtvec. This is identical to the ECALL trap path from Module 2, except the trap cause is mcause = 0x80000007 (interrupt bit set; cause 7 = machine timer interrupt).

The CLINT

The CLINT (Core-Local Interruptor) is a memory-mapped peripheral. Its registers on the DE10-Nano:

CLINT base address: 0x02000000 (standard SiFive CLINT layout)
  +0x0000: msip[0]            -- machine software interrupt pending (hart 0)
  +0x4000: mtimecmp[0]_lo     -- low 32 bits of timer comparison value
  +0x4004: mtimecmp[0]_hi     -- high 32 bits
  +0xBFF8: mtime_lo            -- low 32 bits of free-running counter
  +0xBFFC: mtime_hi            -- high 32 bits

To schedule the next timer tick 10 ms from now (assuming a 50 MHz clock = 50,000 cycles per ms):

li      t0, 500000              # 10ms * 50000 cycles/ms
lw      t1, CLINT_MTIME_LO      # read current mtime
add     t1, t1, t0              # target mtime
sw      x0, CLINT_MTIMECMP_LO   # write 0 to hi first (avoids spurious trip)
sw      t1, CLINT_MTIMECMP_LO   # then write new deadline

Wait -- the sub-word write atomicity discipline from clint.v's comments: to avoid a spurious interrupt when updating mtimecmp, write the new low value to 0xFFFFFFFF first (pushing the deadline far into the future), then write the new high, then write the new low:

li      t2, -1                   # 0xFFFFFFFF
sw      t2, CLINT_MTIMECMP_LO    # set lo = max (deadline far future)
sw      t0_hi, CLINT_MTIMECMP_HI  # write new hi
sw      t0_lo, CLINT_MTIMECMP_LO  # write new lo

Process control block (PCB)

Each process has a PCB (Process Control Block) that stores its saved state:

struct pcb {
    uint32_t regs[32];     // saved x0-x31
    uint32_t pc;           // saved program counter
    uint32_t satp;         // saved page table root
    uint32_t mstatus;      // saved mstatus (MPP field)
    uint32_t state;        // RUNNING / READY / BLOCKED / EXITED
    uint32_t pid;          // process identifier
    uint32_t stack_base;   // stack physical base address
    uint32_t stack_size;   // stack size in bytes
};

The OS maintains an array of PCBs: struct pcb process_table[MAX_PROCS].

Context switch

A context switch saves the current process's register state to its PCB and restores the next process's register state from its PCB. This happens at the timer interrupt handler:

# timer interrupt handler (extends the Module 2 trap handler)
# assume regs are saved to the kernel trap frame as before

# 1. Save current process state to its PCB
lw      t0, current_pid         # get current process index
la      t1, process_table
slli    t2, t0, PCB_SIZE_LOG2   # PCB offset
add     t1, t1, t2              # t1 -> current PCB
# copy saved regs from trap frame to PCB.regs
addi    a0, t1, PCB_REGS_OFFSET
la      a1, trap_frame
li      a2, 128                 # 32 regs * 4 bytes
call    memcpy

# 2. Save PC (mepc) and satp to PCB
csrr    t0, mepc
sw      t0, PCB_PC_OFFSET(t1)
csrr    t0, satp
sw      t0, PCB_SATP_OFFSET(t1)
sw      ra, PCB_MSTATUS_OFFSET(t1)   # save mstatus from trap entry

# 3. Run scheduler -- returns next PID in a0
call    schedule_next
sw      a0, current_pid

# 4. Load next process state from its PCB
la      t1, process_table
slli    t2, a0, PCB_SIZE_LOG2
add     t1, t1, t2              # t1 -> next PCB
la      a0, trap_frame
addi    a1, t1, PCB_REGS_OFFSET
li      a2, 128
call    memcpy

# 5. Restore PC and satp
lw      t0, PCB_PC_OFFSET(t1)
csrw    mepc, t0
lw      t0, PCB_SATP_OFFSET(t1)
csrw    satp, t0

# 6. Reprogram timer for next tick
call    reprogram_timer

# 7. Return to restored process
(restore trap frame registers, mret)

Round-robin scheduler

The simplest fair scheduler: each process gets an equal time slice; processes are served in circular order.

uint32_t schedule_next(void) {
    uint32_t next = (current_pid + 1) % MAX_PROCS;
    uint32_t iterations = 0;
    while (process_table[next].state != READY && iterations < MAX_PROCS) {
        next = (next + 1) % MAX_PROCS;
        iterations++;
    }
    if (iterations == MAX_PROCS) {
        // all processes blocked; run idle process (pid 0)
        return 0;
    }
    process_table[current_pid].state = READY;  // was RUNNING
    process_table[next].state = RUNNING;
    return next;
}

Context-switch cost

Lab 11.1 measures the cycle cost of one context switch. The cost has three components:

Trap entry: save 32 registers to the trap frame (~40 cycles with sequential sw instructions).
PCB save/restore: memcpy of the trap frame to/from PCB (~70 cycles for 128-byte copy).
Timer reprogram and mret: ~20 cycles.

Total expected: ~130-150 cycles per context switch. Compare this against the Lab 2.1 trap round-trip cost (60-120 cycles); the additional cost is the PCB memcpy.

Architecture Comparison Sidebar: Schedulers

Scheduler	Policy	Overhead	Production use
Round-robin (CSA-201)	Equal time slice, circular	Low	Embedded RTOSes; baseline for comparison
Linux CFS (Completely Fair Scheduler)	Weighted virtual runtime fairness	Medium (red-black tree)	Linux 2.6.23+
Windows dispatcher	Priority classes + round-robin within class	Medium	Windows NT kernel
FreeRTOS (RTOS)	Fixed-priority preemptive	Very low	Microcontroller RTOSes
RISC-V OpenSBI	N/A (firmware, not scheduler)	N/A	Provides M-mode services to S-mode Linux

Linux CFS maintains a virtual runtime (vruntime) for each process and uses a red-black tree to find the process with the smallest vruntime (the "most deserving" process). This gives a fair share of CPU time without the worst-case latency of strict round-robin (a high-priority task might wait behind many low-priority tasks in round-robin).

Lab exercises

See labs/lab-11-preemption-scheduler.md for the full specification.

Lab 11.1: Round-robin scheduler with two demo tasks; context-switch cost measured. You will implement the CLINT timer interrupt, the PCB structure, and the round-robin scheduler. Two user-mode tasks run concurrently: Task A increments a counter and writes to the OLED; Task B independently counts elapsed ticks and writes to a separate OLED line. Both counters advance independently.

Part A: Implement the CLINT timer interrupt handler in the kernel trap handler (extending the Module 2 handler). Test with a single process: verify the timer fires every 10 ms by measuring ECALL round-trip time before and after introducing the timer interrupt.

Part B: Implement the PCB structure and context-switch code. Test with two processes that each increment a counter; verify both counters advance (neither starves).

Part C: Measure. Read mcycle at the start and end of the context-switch handler (timer interrupt entry to mret). Record the cycle cost. Compare against Lab 2.1's trap round-trip measurement.

Independent practice

The mtimecmp sub-word write atomicity problem: if an OS writes the new mtimecmp_lo first (with the old mtimecmp_hi), and mtime_hi happens to match the old mtimecmp_hi at that instant, a spurious interrupt fires before the new deadline is fully programmed. Walk through the safe write sequence (lo=MAX, hi=new_hi, lo=new_lo) and show why it avoids the spurious interrupt.
The round-robin scheduler gives each process the same time slice. If one process performs mostly I/O (sleeping most of the time) and another performs mostly computation, round-robin wastes time on the I/O-bound process's time slices. How does Linux CFS handle this differently?
Toolchain Diary entry: OpenSBI. Record what OpenSBI is, why it exists (M-mode firmware that provides SBI calls to the S-mode OS), and how it relates to the msip register in the CLINT. What SBI call does Linux use to set mtimecmp?
A context switch saves 32 registers (128 bytes) + PC + satp + mstatus. If the OS runs 16 processes and switches every 10 ms, how many bytes per second does the context switch mechanism write to the PCB array? Is this significant compared to DRAM bandwidth on the DE10-Nano (~3.2 GB/s)?

Reflection prompts

The round-robin scheduler marks the outgoing process as READY before selecting the next. Why does it set the state to READY rather than leaving it as RUNNING? What invariant does the state field maintain?
The CLINT's mtime counter is 64 bits. At a 50 MHz clock (1 tick per cycle), how long before mtime overflows? Does this create a scheduler correctness problem (hint: consider what happens to mtimecmp < mtime comparisons as mtime wraps around)?
Module 10's stop-the-world GC requires pausing the scheduler. In the CSA-201 implementation, "pausing the scheduler" means: don't reprogram mtimecmp during the GC (so no timer interrupts fire while the GC runs), and resume when done. What does this guarantee, and what does it NOT guarantee? (Hint: what if the GC itself needs to allocate memory?)

What's next

Module 12 is the driver-writing track. The OS is now complete in its core features: privilege separation, virtual memory, PMP protection, GC, and preemptive scheduling. Module 12 connects the OS to hardware peripherals: you will write drivers for the SSD1306 OLED (I2C display), the SD card (SPI block storage), and the ENC28J60 (SPI Ethernet controller), reading each device's datasheet from scratch.