Week 13: External DRAM and the Filesystem

The Tang Primer 25K has 1,008 Kbit of BRAM -- about 126 KiB. That is enough for Virtus OS v1. It is not enough for Virtus OS v2 with GC, drivers, and two user processes. This week you use the DE10-Nano's 1 GB DDR3.


Reading

Required. Petzold, CODE, Ch 14 ("Feedback and Flip-Flops") and Ch 16 ("An Assemblage of Memory"). The memory hierarchy Petzold traces -- from single flip-flops through registers through SRAM through DRAM -- ends at external DRAM. Ch 14 explains why DRAM must be refreshed (charge leaks from the capacitors that store bits; without periodic refresh, data is lost). Ch 16 explains the timing constraints that arise from DRAM's internal structure. The DDR3 on the DE10-Nano is the same technology, scaled to 1 GB.

Required. The FAT16 on-disk format is publicly documented in Microsoft's "MS-DOS Programmer's Reference" and in the FATGEN103.doc specification that Microsoft released. For a cleaner, more readable treatment, Paul Stoffregen's "FAT Filesystem Documentation" is widely available. Read the sections on the boot record, the FAT structure (16-bit FAT entries), directory entries (32-byte structure), and file data cluster chains.


Lecture: External DRAM and FAT16

The BRAM ceiling

Tang Primer 25K has 56 BSRAM blocks, each 18 Kbit = 2,304 bytes. Total BRAM: 56 * 18 Kbit = 1,008 Kbit = 126 KiB. After Modules 1-12, your CPU's HDL uses approximately:

  • Instruction memory: 32 KiB (8K 32-bit words)
  • Data memory (heap + stack): 64 KiB
  • TLB, PMP registers, CSR file: ~2 KiB in flip-flops

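That is ~98 KiB of state in total, of which ~96 KiB lives in BRAM (the flip-flop structures do not consume BSRAM blocks), leaving about 30 KiB of BRAM headroom. Virtus OS v2 with two user processes, the GC, and the driver framebuffers needs more. The DE10-Nano's 1 GB DDR3 SDRAM is the solution.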
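The budget above can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures from the list; the variable names are illustrative, and it assumes the binary convention (1 Kbit = 1,024 bits) that the 126 KiB total implies. It reports BRAM usage and total state separately, since the ~2 KiB held in flip-flops does not occupy BSRAM blocks.

```python
# BRAM budget for the Tang Primer 25K, using the figures quoted in the text.
KBIT = 1024                    # bits per Kbit (binary convention, assumed)
BLOCKS = 56                    # BSRAM blocks on the device
BITS_PER_BLOCK = 18 * KBIT     # 18 Kbit per block

total_bytes = BLOCKS * BITS_PER_BLOCK // 8
total_kib = total_bytes / 1024

# Approximate usage after Modules 1-12, in KiB.
bram_usage_kib = {"instruction memory": 32, "data memory (heap + stack)": 64}
flipflop_kib = 2               # TLB, PMP registers, CSR file (not in BRAM)

bram_used = sum(bram_usage_kib.values())

print(f"bytes per block : {BITS_PER_BLOCK // 8}")
print(f"total BRAM      : {total_kib:.0f} KiB")
print(f"BRAM in use     : {bram_used} KiB")
print(f"BRAM headroom   : {total_kib - bram_used:.0f} KiB")
print(f"all CPU state   : {bram_used + flipflop_kib} KiB (BRAM + flip-flops)")
```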

Accessing DE10-Nano DDR3 via HPS

The DE10-Nano has a Hard Processor System (HPS -- a dual-core Cortex-A9) that shares the DDR3 SDRAM with the FPGA fabric. The FPGA accesses DDR3 through the HPS's FPGA-to-HPS bridge, which exposes an AXI bus interface (AXI-3 on the Cyclone V HPS).

For CSA-201, the simplest approach avoids AXI: use the HPS bridge in "lightweight HPS-to-FPGA" (LW-HPS2FPGA) mode, which provides a 32-bit memory-mapped window into the FPGA's register space. The FPGA side exposes the SDRAM through a custom memory controller.

Simpler alternative for Module 13: use the on-board HPS SDRAM controller in "FPGA-as-a-coprocessor" mode. The HPS boots first, sets up DDR3 initialization, and then releases the FPGA. The FPGA accesses a memory-mapped window (physical address 0xC0000000 on the HPS bus = 0x00000000 in FPGA address space) through the F2H_SDRAM_AXI bridge.

For the CSA-201 lab: configure the DE10-Nano such that:

  • Physical addresses 0x00000000-0x00FFFFFF (16 MiB window) are BRAM (fast, on-chip, 1-2 cycle latency)
  • Physical addresses 0x10000000-0x4FFFFFFF (1 GiB mapped) are DDR3 (slower, off-chip, ~200 ns latency)

Map the kernel text and critical data to BRAM; map the user heap, GC nursery, and driver framebuffers to DDR3.
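As a way to reason about the map before committing it to HDL, the two windows can be modeled as a small address decoder. This is a software sketch only: the `decode` function and the constant names are illustrative, and the real decode happens in your bus logic. The region bounds are the ones given in the two bullets above.

```python
# Software model of the lab's physical address map (bounds taken from the text).
BRAM_BASE, BRAM_END = 0x0000_0000, 0x00FF_FFFF   # 16 MiB window, on-chip
DDR3_BASE, DDR3_END = 0x1000_0000, 0x4FFF_FFFF   # 1 GiB window, off-chip


def decode(addr: int) -> str:
    """Return which memory a 32-bit physical address selects."""
    if not 0 <= addr <= 0xFFFF_FFFF:
        raise ValueError("physical addresses are 32 bits wide")
    if BRAM_BASE <= addr <= BRAM_END:
        return "BRAM"
    if DDR3_BASE <= addr <= DDR3_END:
        return "DDR3"
    return "unmapped"


for addr in (0x0000_0000, 0x00FF_FFFF, 0x0100_0000, 0x1000_0000, 0x4FFF_FFFF):
    print(f"{addr:#010x} -> {decode(addr)}")
```

Note the gap between 0x01000000 and 0x0FFFFFFF: an access there selects neither memory, so the bus logic needs a defined behavior (for example, raising an access fault) for that range.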

Memory latency trade-off

| Memory type | Access latency | Bandwidth | Use in Virtus OS v2 |
|---|---|---|---|
| BRAM (on-chip) | 1-2 cycles | ~4 bytes/cycle at 50 MHz | Kernel code; critical data structures; MMU page tables |
| DDR3 SDRAM (off-chip, via HPS) | ~10-30 cycles (after DDR3 init + HPS bridge) | ~1.6 GB/s (800 MT/s x 16-bit bus) | User heap; GC nursery; file read buffers; SD card DMA buffers |

The latency difference matters for the context-switch cost from Module 11: if the PCB array is in DDR3, each context switch adds ~200 ns per 4-byte word access in the PCB copy loop. For a 128-byte PCB copy, that is 32 DDR3 accesses * ~200 ns each = ~6.4 us of additional context-switch latency. Keep the PCB array in BRAM.
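The estimate above is easy to reproduce and to vary for other PCB sizes. The sketch below assumes one DDR3 access per 4-byte word at ~200 ns each, as in the paragraph, and converts the result to cycles at the 50 MHz clock quoted for BRAM; the function name is illustrative.

```python
import math

CLOCK_HZ = 50_000_000          # 50 MHz, the clock rate quoted for BRAM access


def pcb_copy_overhead_ns(pcb_bytes: int, word_bytes: int = 4,
                         access_ns: int = 200) -> int:
    """Extra latency when every word of the PCB copy goes to DDR3."""
    accesses = math.ceil(pcb_bytes / word_bytes)   # one access per word
    return accesses * access_ns


overhead_ns = pcb_copy_overhead_ns(128)
cycle_ns = 1e9 / CLOCK_HZ      # 20 ns per cycle at 50 MHz

print(f"{overhead_ns / 1000:.1f} us per context switch")
print(f"{overhead_ns / cycle_ns:.0f} CPU cycles at 50 MHz")
```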

FAT16 filesystem structure

FAT16 is the filesystem used by SD cards formatted for compatibility with older devices. The structure on a 32 MB SD card:

Sector layout (512 bytes/sector):

Sector 0:    Boot Sector (512 bytes)
             Bytes 0-2:   Jump instruction (EB XX 90)
             Bytes 3-10:  OEM name ("MSDOS5.0")
             Bytes 11-12: Bytes per sector (0x0200 = 512)
             Byte  13:    Sectors per cluster (e.g., 0x04 = 4 sectors = 2 KiB cluster)
             Bytes 14-15: Reserved sectors (0x0001 = 1)
             Byte  16:    Number of FATs (0x02 = 2 copies)
             Bytes 17-18: Root directory entries (0x0200 = 512 entries)
             Bytes 19-20: Total sectors 16-bit (for small volumes)
             Byte  21:    Media descriptor (0xF8 = fixed disk)
             Bytes 22-23: Sectors per FAT (16-bit)
             ...

Sectors 1 .. (1 + sectors_per_fat*2 - 1):  FAT1 + FAT2 (two copies)
Sectors (above) .. (above + 32 - 1):        Root directory (512 entries * 32 bytes = 32 sectors)
Sectors (above) ..  end:                    Data area (clusters 2, 3, ...)
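The layout above maps directly onto a fixed-offset unpack. The following is a minimal sketch, not the lab's reference implementation: `parse_boot_sector` and the dictionary keys are illustrative names, and the sample sector is built from made-up values purely to exercise the code. All multi-byte fields are little-endian.

```python
import struct


def parse_boot_sector(sector: bytes) -> dict:
    """Extract the BPB fields at bytes 11-23 and derive the region starts."""
    if len(sector) < 24:
        raise ValueError("need at least the first 24 bytes of sector 0")
    # Offsets 11..23: H B H B H H B H (little-endian, no padding).
    (bps, spc, reserved, num_fats, root_entries,
     total16, media, spf) = struct.unpack_from("<HBHBHHBH", sector, 11)
    root_start = reserved + num_fats * spf
    root_sectors = (root_entries * 32 + bps - 1) // bps   # round up
    return {
        "bytes_per_sector": bps,
        "sectors_per_cluster": spc,
        "reserved_sectors": reserved,
        "num_fats": num_fats,
        "root_dir_entries": root_entries,
        "total_sectors_16": total16,
        "media": media,
        "sectors_per_fat": spf,
        "fat_start": reserved,
        "root_start": root_start,
        "data_start": root_start + root_sectors,
    }


# Synthetic boot sector with made-up but plausible values.
sample = bytearray(512)
sample[0:3] = b"\xEB\x3C\x90"
sample[3:11] = b"MSDOS5.0"
struct.pack_into("<HBHBHHBH", sample, 11, 512, 4, 1, 2, 512, 65535, 0xF8, 64)

info = parse_boot_sector(bytes(sample))
print(info["root_start"], info["data_start"])
```

With these sample values the two FATs occupy sectors 1-128, the root directory starts at sector 129, and the data area starts at sector 161.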

FAT entry format: each 16-bit entry in the FAT corresponds to one cluster and is stored little-endian:

  • 0x0000: free cluster
  • 0x0002-0xFFEF: next cluster in chain
  • 0xFFF8-0xFFFF: last cluster in file (end-of-chain marker)
  • 0xFFF7: bad cluster
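A walker usually wraps these ranges in one small classification helper so the chain-following loop stays readable. The sketch below is illustrative (the function names are not from the lab spec); values that fall outside the four cases listed above, namely 0x0001 and 0xFFF0-0xFFF6, are reported as reserved.

```python
import struct


def fat_entry(fat: bytes, cluster: int) -> int:
    """Read the 16-bit little-endian FAT entry for a cluster."""
    return struct.unpack_from("<H", fat, cluster * 2)[0]


def classify(entry: int) -> str:
    """Map a raw FAT16 entry to its meaning."""
    if not 0 <= entry <= 0xFFFF:
        raise ValueError("FAT16 entries are 16 bits wide")
    if entry == 0x0000:
        return "free"
    if 0x0002 <= entry <= 0xFFEF:
        return "next"          # entry is the next cluster in the chain
    if entry == 0xFFF7:
        return "bad"
    if entry >= 0xFFF8:
        return "end"           # end-of-chain marker
    return "reserved"          # 0x0001 and 0xFFF0-0xFFF6


# Tiny made-up FAT: entries 0 and 1 are reserved, cluster 2 -> 3 -> end.
fat = struct.pack("<4H", 0xFFF8, 0xFFFF, 0x0003, 0xFFFF)
print(classify(fat_entry(fat, 2)), classify(fat_entry(fat, 3)))
```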

Directory entry format (32 bytes):

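Bytes 0-7:   Filename (8 bytes, space-padded, uppercase)
Bytes 8-10:  Extension (3 bytes, space-padded)
Byte  11:    Attributes (0x20 = archive, 0x10 = directory, etc.)
Bytes 14-15: Creation time (HHHHHMMMMMMSSSSS format; seconds stored in 2-second units)
Bytes 16-17: Creation date (YYYYYYYMMMMDDDDD format; year counted from 1980)
Bytes 22-23: Last-write time (same format as creation time)
Bytes 24-25: Last-write date (same format as creation date)
Bytes 26-27: First cluster (low 16 bits for FAT16)
Bytes 28-31: File size in bytes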
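A hedged sketch of decoding one 32-byte entry follows. The function names are illustrative. It reads the last-write time and date at offsets 22-25 (where the Microsoft FAT specification places them), unpacks the time word as 5 bits of hours, 6 bits of minutes, and 5 bits of two-second units, and skips slots a read-only walker should ignore: a first byte of 0x00 (never used) or 0xE5 (deleted), and long-filename entries with attribute 0x0F. The sample entry is made up.

```python
import struct

ATTR_DIRECTORY = 0x10
ATTR_LONG_NAME = 0x0F


def decode_time(word: int) -> tuple:
    """HHHHH MMMMMM SSSSS -> (hour, minute, second)."""
    return (word >> 11, (word >> 5) & 0x3F, (word & 0x1F) * 2)


def decode_date(word: int) -> tuple:
    """YYYYYYY MMMM DDDDD -> (year, month, day); years count from 1980."""
    return (1980 + (word >> 9), (word >> 5) & 0x0F, word & 0x1F)


def parse_dir_entry(raw: bytes):
    """Return a dict for a live 8.3 entry, or None for slots to skip."""
    if len(raw) != 32:
        raise ValueError("a directory entry is exactly 32 bytes")
    if raw[0] in (0x00, 0xE5) or raw[11] == ATTR_LONG_NAME:
        return None
    name = raw[0:8].decode("ascii", errors="replace").rstrip()
    ext = raw[8:11].decode("ascii", errors="replace").rstrip()
    # Offsets 22..31: write time, write date, first cluster, file size.
    wtime, wdate, first_cluster, size = struct.unpack_from("<HHHI", raw, 22)
    return {
        "name": f"{name}.{ext}" if ext else name,
        "is_dir": bool(raw[11] & ATTR_DIRECTORY),
        "write_time": decode_time(wtime),
        "write_date": decode_date(wdate),
        "first_cluster": first_cluster,
        "size": size,
    }


# Made-up entry: CONF.TXT, archive bit set, 4096 bytes, starting at cluster 2.
raw = bytearray(32)
raw[0:11] = b"CONF    TXT"
raw[11] = 0x20
struct.pack_into("<HH", raw, 22,
                 (13 << 11) | (45 << 5) | (30 // 2),
                 ((2024 - 1980) << 9) | (5 << 5) | 17)
struct.pack_into("<HI", raw, 26, 2, 4096)

entry = parse_dir_entry(bytes(raw))
print(entry)
```

When scanning a directory, a first byte of 0x00 also means no later slot is in use, so a real walker can stop there rather than merely skip the slot.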

Reading a file:

  1. Read the boot sector; extract bytes_per_sector, sectors_per_cluster, reserved_sectors, num_fats, root_dir_entries, sectors_per_fat.
  2. Compute the root directory start sector: root_start = reserved_sectors + num_fats * sectors_per_fat. Then compute the data area start: data_start = root_start + (root_dir_entries * 32) / bytes_per_sector.
  3. Scan the root directory entries for the filename match.
  4. From the directory entry, get the first cluster number and file size.
  5. Read the FAT to follow the cluster chain: cluster → FAT[cluster] → next cluster → ..., stopping at an end-of-chain marker (0xFFF8 or above).
  6. For each cluster, compute the data sector: data_start + (cluster - 2) * sectors_per_cluster.
  7. Read the sectors, copying no more than the file size in total (the last cluster is usually only partly used).
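Steps 5 through 7 can be sketched against an in-memory disk image, which is a convenient way to debug the chain logic before touching the SD card. Everything here is an assumption for illustration: `read_file`, its parameter list, and the ten-sector test image are not part of the lab specification, and the sector size is fixed at 512 bytes. The data area is taken to begin immediately after the root directory.

```python
import struct

SECTOR = 512


def read_file(image: bytes, first_cluster: int, size: int, reserved: int,
              num_fats: int, sectors_per_fat: int, root_entries: int,
              sectors_per_cluster: int) -> bytes:
    """Follow a FAT16 cluster chain and return the file's bytes."""
    fat_start = reserved
    root_start = reserved + num_fats * sectors_per_fat
    root_sectors = (root_entries * 32 + SECTOR - 1) // SECTOR
    data_start = root_start + root_sectors
    cluster_bytes = sectors_per_cluster * SECTOR
    max_clusters = sectors_per_fat * SECTOR // 2   # guards against loops

    out = bytearray()
    cluster, visited = first_cluster, 0
    while 0x0002 <= cluster <= 0xFFEF and len(out) < size:
        visited += 1
        if visited > max_clusters:
            raise ValueError("cluster chain does not terminate")
        sector = data_start + (cluster - 2) * sectors_per_cluster
        out += image[sector * SECTOR: sector * SECTOR + cluster_bytes]
        # FAT[cluster] holds the next cluster (little-endian 16-bit entry).
        (cluster,) = struct.unpack_from(
            "<H", image, fat_start * SECTOR + cluster * 2)
    return bytes(out[:size])                        # trim the final cluster


# Tiny made-up volume: 1 reserved sector, 2 FATs of 1 sector, 16 root
# entries (1 sector), 1 sector per cluster, so the data area starts at sector 4.
image = bytearray(10 * SECTOR)
struct.pack_into("<H", image, 1 * SECTOR + 2 * 2, 4)        # FAT[2] = 4
struct.pack_into("<H", image, 1 * SECTOR + 4 * 2, 0xFFFF)   # FAT[4] = end
data = (bytes(range(256)) * 3)[:700]                         # 700-byte file
image[4 * SECTOR:5 * SECTOR] = data[:512]                    # cluster 2
image[6 * SECTOR:6 * SECTOR + 188] = data[512:]              # cluster 4

result = read_file(bytes(image), 2, 700, 1, 2, 1, 16, 1)
print(len(result), result == data)
```

The test file deliberately occupies clusters 2 and 4 rather than 2 and 3, so a walker that assumes contiguous clusters instead of consulting the FAT would return the wrong bytes.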

FAT16 limitations. Maximum volume size: about 2 GiB (65,524 clusters * maximum 32 KiB/cluster). Maximum file size: the 32-bit file size field can encode up to 4 GiB - 1 byte, but a file cannot be larger than the volume that holds it, so the practical limit is just under 2 GiB. No file permissions, no ownership. Not suitable for production OS filesystems, but ideal for SD card interoperability.

Architecture Comparison Sidebar: Storage stacks

| Layer | Tang Primer 25K (CSA-101) | DE10-Nano (CSA-201) | Modern PC |
|---|---|---|---|
| Processor addressable memory | BRAM (~126 KiB) | BRAM + HPS DDR3 (1 GB) | DRAM (8-64 GB) |
| Persistent storage | None | SD card (FAT16) | NVMe SSD (ext4/NTFS) |
| Filesystem | None | FAT16 read-only walker | ext4/NTFS/APFS |
| Block interface | N/A | SPI block device (512-byte reads) | NVMe (PCIe, queue depth 65535) |
| Storage latency | N/A (all in BRAM) | ~10 ms (SD card random access) | ~50 us (NVMe SSD random access) |

The FAT16 walker you implement this week is read-only: it can list directory entries and read file contents but cannot create files or modify the directory. Write support requires implementing free-cluster allocation (scan the FAT for 0x0000 entries), directory entry update, and FAT chain maintenance. These are extensions for Module 13's optional bonus exercises.
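For the bonus write path, the free-cluster scan mentioned above is the simplest piece. The sketch below shows only that scan, under the assumption that the FAT has already been read into memory; `find_free_cluster` and its parameters are illustrative names, and the example FAT is made up. Data clusters are numbered from 2, so a volume with N data clusters uses FAT entries 2 through N + 1.

```python
import struct


def find_free_cluster(fat: bytes, cluster_count: int):
    """Return the first cluster whose FAT entry is 0x0000, or None if full."""
    for cluster in range(2, cluster_count + 2):
        (entry,) = struct.unpack_from("<H", fat, cluster * 2)
        if entry == 0x0000:
            return cluster
    return None


# Made-up FAT for a 4-cluster volume: clusters 2 and 3 in use, 4 and 5 free.
fat = struct.pack("<6H", 0xFFF8, 0xFFFF, 0x0003, 0xFFFF, 0x0000, 0x0000)
print(find_free_cluster(fat, 4))
```

A full write path would still have to update both FAT copies and the directory entry, which this sketch does not attempt.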


Lab exercises

See labs/lab-13-external-dram-filesystem.md for the full specification.

Lab 13.1: SD card FAT16 filesystem walker reads files; DDR3 heap setup. Two parallel deliverables this week (Parts A and B), followed by an integration step (Part C).

Part A (DDR3 heap): Configure the Quartus project to route physical addresses 0x10000000+ to the DDR3 bridge. Verify by writing a known pattern to DDR3 address 0x10000000 and reading it back. Move the user-heap base to 0x10000000 (change Memory.alloc's HEAP_BASE constant).
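A single known pattern at one address catches a dead bridge but not stuck or swapped address and data lines. One common refinement is to write an address-derived pattern over a range and then verify it in a second pass. The sketch below models that idea in software: `FakeBus`, `write32`, `read32`, `fill`, and `verify` are hypothetical names standing in for whatever load/store mechanism your CPU or testbench provides.

```python
class FakeBus:
    """Stand-in for the memory bus, backed by a dictionary (hypothetical)."""

    def __init__(self):
        self.mem = {}

    def write32(self, addr: int, value: int) -> None:
        self.mem[addr] = value & 0xFFFF_FFFF

    def read32(self, addr: int) -> int:
        return self.mem.get(addr, 0)


def pattern_for(addr: int) -> int:
    """Address-derived pattern, so each word has a distinct expected value."""
    return (addr ^ 0xA5A5_A5A5) & 0xFFFF_FFFF


def fill(bus, base: int, words: int) -> None:
    """Pass 1: write the pattern to every word in the range."""
    for i in range(words):
        bus.write32(base + 4 * i, pattern_for(base + 4 * i))


def verify(bus, base: int, words: int) -> list:
    """Pass 2: return the addresses whose contents do not match."""
    return [base + 4 * i for i in range(words)
            if bus.read32(base + 4 * i) != pattern_for(base + 4 * i)]


DDR3_BASE = 0x1000_0000        # start of the DDR3 window from the text
bus = FakeBus()
fill(bus, DDR3_BASE, 64)
print("failing addresses:", verify(bus, DDR3_BASE, 64))
```

Writing the whole range before reading any of it back matters: a write-then-immediately-read loop can pass even when several addresses alias to the same location.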

Part B (FAT16 walker): Implement fat16_mount() (reads boot sector), fat16_ls() (lists root directory), and fat16_read(filename, buf, max_bytes) (reads a file). Format a 64 MB SD card as FAT16 (use mkfs.fat -F 16 on Linux), copy three test files onto it, and verify your walker reads them correctly.

Part C (integration): Write a test program that reads a configuration file from the SD card (e.g., /CONF.TXT with one key=value per line), parses the key-value pairs, and applies the configuration. Verify on DE10-Nano.
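The parsing half of Part C can be prototyped on the host before porting it to your OS. This is a minimal sketch under stated assumptions: the function name is illustrative, keys and values are treated as plain strings with surrounding whitespace stripped, and lines starting with `#` are skipped as comments (a convention assumed here, not one the lab requires). Lines without an `=` are ignored.

```python
def parse_config(text: str) -> dict:
    """Parse one key=value pair per line into a dictionary of strings."""
    config = {}
    for line in text.splitlines():         # handles both \n and \r\n
        line = line.strip()
        if not line or line.startswith("#"):
            continue                        # blank line or assumed comment
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or not key:
            continue                        # malformed line: no '=' or no key
        config[key] = value.strip()
    return config


sample = "baud=115200\r\n\r\n# console settings\nname = virtus\nbad line\n"
settings = parse_config(sample)
print(settings)
```

Files written on Windows or by some editors use CR LF line endings, which is why the sample mixes both; a parser that splits only on LF would leave a stray carriage return on each value.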


Independent practice

  1. The FAT16 root directory is fixed-size (32 sectors = 512 entries on a standard format). What happens when the root directory needs more than 512 entries? (It cannot grow; the FAT16 root directory is a fixed region, although subdirectories are ordinary cluster chains and can grow.) How does FAT32 fix this limitation?

  2. FAT16 uses "8.3 filenames" (8 characters + 3 extension, uppercase only). The long filename extension (VFAT) stores long names in pseudo-directory entries with the attribute byte = 0x0F. Describe the long-filename entry structure well enough to implement reading. (Optional: implement it.)

  3. Toolchain Diary entry: mkfs.fat. Record the command to create a 64 MB FAT16 filesystem on an SD card partition, the command to mount it on Linux for copying files, and the fsck.fat command to verify the filesystem.

  4. The DDR3 on the DE10-Nano operates at 800 MT/s on a 16-bit bus. How many bytes per second is the theoretical maximum throughput? Compare this to BRAM access at 50 MHz (one 32-bit word per cycle). At what working-set size does the performance difference between BRAM-resident and DDR3-resident code become significant?


Reflection prompts

  1. The FAT16 boot sector contains the BIOS Parameter Block (BPB). Your driver reads it at mount time. If the SD card is physically damaged and sector 0 is unreadable, the filesystem is inaccessible. FAT16 has a second copy of the FAT (FAT2). Does it also have a backup boot sector? (No -- this is a known weakness of FAT16, fixed later in FAT32 with the backup boot sector at sector 6.)

  2. Moving the user heap from BRAM to DDR3 raises the latency of each heap access by roughly an order of magnitude (from 1-2 cycles for BRAM to ~10-30 cycles for a DDR3 access through the HPS bridge). For the GC sweep phase (which traverses the heap linearly), is this a significant regression? Calculate the cycle overhead difference for a 64 KiB heap sweep.

  3. The SD card's SPI interface reads one 512-byte block per transaction. Your FAT16 walker reads the boot sector (1 block), the FAT (multiple blocks), the root directory (32 blocks), and then the file data blocks. For a 4 KiB file (8 sectors = 2 clusters at 4 sectors per cluster), how many total SPI block-read operations does fat16_read() perform? Is there a natural caching opportunity?


What's next

Module 14 is the capstone. All the components are built: full RV32I + M extension, privileged ISA, improved compiler, virtual memory, PMP W^X, stack canaries, tracing GC, preemptive scheduler, three device drivers, and a filesystem. The capstone integrates them into Virtus OS v2 running on DE10-Nano and demonstrates all six Tier 1 gates. See CAPSTONE.md for the full specification.