~90 minutes. Read a CSV of students; group them by grade; emit a JSON summary. Practice lists, dicts, tuples, the CSV module, and the JSON module.
Goal: ship a Python CLI tool that reads a CSV like students.csv and writes a JSON summary that groups students by grade level.
Estimated time: 90 minutes
Prerequisites: Week 4 lecture (collections, comprehensions). Labs 1-3 complete.
Setup
mkdir -p ~/fnd-102/lab-4
cd ~/fnd-102/lab-4
Create a sample students.csv in a text editor (tab-separated files also work if you pass delimiter='\t' to the reader, but comma-separated is the lab default):
name,grade,homeroom
Alice Anderson,9,A101
Bob Brown,10,B202
Charlie Chen,9,A101
Dana Davis,11,C303
Eve Edwards,10,B202
Fred Foster,9,A102
Grace Green,11,C304
Henry Hall,12,D401
Aim for 20 rows for the lab (you can generate them or write them by hand). The CSV must have a header row and at least three columns: name, grade, and one other field of your choice.
Then create lab-4-roster.py.
Part A: Read the CSV (20 min)
Python's standard library has a csv module that parses CSV correctly (handling quoted fields, commas inside quotes, etc.). The minimum read pattern:
import csv

def read_roster(path):
    """Read a roster CSV and return a list of student dicts."""
    students = []
    # newline='' lets the csv module handle line endings itself.
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            students.append(row)
    return students
csv.DictReader reads the first row as field names and yields each subsequent row as a dict keyed by those field names. So row['name'] is the student's name; row['grade'] is the grade level (as a string; you will convert to int later).
The newline='' argument to open() is important: it tells open() not to translate line endings, so the csv module can handle them itself. Without it, line breaks embedded inside quoted fields can be misread.
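To see the DictReader behavior described above without touching a file, here is a minimal sketch that parses an in-memory CSV. The sample text and the use of io.StringIO are illustrative assumptions, not part of the lab's required code.

```python
import csv
import io

# Hypothetical sample data: the first name contains a comma, so it is quoted.
sample = 'name,grade,homeroom\n"Anderson, Alice",9,A101\nBob Brown,10,B202\n'

# newline='' mirrors the open() call used in read_roster.
reader = csv.DictReader(io.StringIO(sample, newline=''))
rows = list(reader)

print(rows[0]['name'])   # Anderson, Alice  (the quoted comma stays in one field)
print(rows[1]['grade'])  # 10  (still a string, not an int)
```

The same reader code works on a real file; only the source of the text changes.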
Test by adding at the bottom of your script:
if __name__ == '__main__':
    students = read_roster('students.csv')
    print(f'Read {len(students)} students.')
    print('First student:', students[0])
Run python3 lab-4-roster.py. With the eight-row sample above, you should see:
Read 8 students.
First student: {'name': 'Alice Anderson', 'grade': '9', 'homeroom': 'A101'}
Part B: Group by grade (25 min)
Write a function that takes the student list and returns a dict keyed by grade, where each value is the list of student names in that grade:
def group_by_grade(students):
    """Group students by grade level. Return {grade_int: [name, ...]}."""
    grouped = {}
    for student in students:
        grade = int(student['grade'])  # CSV fields arrive as strings
        grouped.setdefault(grade, []).append(student['name'])
    return grouped
The setdefault idiom is "if the key is not in the dict, create it with this default value, then return the value at the key." Equivalent to:
if grade not in grouped:
    grouped[grade] = []
grouped[grade].append(student['name'])
A defaultdict(list) from the collections module is the third equivalent. Use whichever reads most naturally to you.
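For comparison, here is a sketch of the defaultdict(list) variant. The function name group_by_grade_dd and the demo rows are hypothetical; the rows are shaped like what csv.DictReader yields.

```python
from collections import defaultdict

def group_by_grade_dd(students):
    """Same grouping as group_by_grade, using defaultdict(list)."""
    grouped = defaultdict(list)  # a missing key starts as an empty list
    for student in students:
        grouped[int(student['grade'])].append(student['name'])
    return dict(grouped)  # plain dict, so later lookups do not auto-create keys

# Hypothetical in-memory rows, shaped like csv.DictReader output.
demo = [
    {'name': 'Alice Anderson', 'grade': '9', 'homeroom': 'A101'},
    {'name': 'Bob Brown', 'grade': '10', 'homeroom': 'B202'},
    {'name': 'Charlie Chen', 'grade': '9', 'homeroom': 'A101'},
]
print(group_by_grade_dd(demo))
# {9: ['Alice Anderson', 'Charlie Chen'], 10: ['Bob Brown']}
```

Converting back to a plain dict at the end is a design choice: it keeps an accidental grouped[13] lookup elsewhere from silently adding an empty bucket.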
Notice the int conversion: grade = int(student['grade']). CSV fields are always strings; converting at the boundary is good practice (you want to compare grade 9 to grade 10 numerically, not lexicographically).
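A short illustration of why the conversion matters; the list of grade strings is made up for the demonstration.

```python
grades_as_text = ['9', '10', '11', '12']

text_order = sorted(grades_as_text)                 # compares character by character
int_order = sorted(int(g) for g in grades_as_text)  # compares numerically

print(text_order)          # ['10', '11', '12', '9']
print(int_order)           # [9, 10, 11, 12]
print('9' < '10', 9 < 10)  # False True
```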
Test:
if __name__ == '__main__':
    students = read_roster('students.csv')
    grouped = group_by_grade(students)
    print(grouped)
    # Expected: {9: ['Alice Anderson', 'Charlie Chen', 'Fred Foster'], 10: ['Bob Brown', 'Eve Edwards'], 11: ['Dana Davis', 'Grace Green'], 12: ['Henry Hall']}
Part C: Emit JSON (15 min)
The json module serializes Python dicts and lists to JSON:
import json

def write_summary(grouped, path):
    """Write the grouped roster as a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(grouped, f, indent=2, sort_keys=True)
Two arguments worth knowing:
- indent=2 produces pretty-printed JSON with 2-space indentation. Without it, you get a single line of JSON (compact but hard to read). For human-facing files, always indent.
- sort_keys=True writes the keys in sorted order. json.dump sorts the dict's keys before converting them to strings, so the int keys from Part B come out in numeric order (9, 10, 11, 12). If the keys were already strings, they would sort lexicographically ('10', '11', '12', '9'), which is one more reason to convert grades to int in Part B. JSON itself requires string keys, so the ints are written as "9", "10", and so on. Stretch exercise: get numeric ordering for a dict whose keys are already stringified ints, either by converting them back to int before writing or by building the output from a sorted list of keys.
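A quick way to check how sort_keys orders grade keys is to serialize the same data twice, once with int keys and once with string keys. This is a small sketch with made-up data; json.dumps is used instead of json.dump only so the result can be printed directly.

```python
import json

int_keys = {10: ['Bob Brown'], 9: ['Alice Anderson']}
str_keys = {str(k): v for k, v in int_keys.items()}

as_ints = json.dumps(int_keys, sort_keys=True)  # keys sorted as ints, then written as strings
as_strs = json.dumps(str_keys, sort_keys=True)  # keys sorted as strings

print(as_ints)  # {"9": ["Alice Anderson"], "10": ["Bob Brown"]}
print(as_strs)  # {"10": ["Bob Brown"], "9": ["Alice Anderson"]}
```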
Wire it up:
if __name__ == '__main__':
    students = read_roster('students.csv')
    grouped = group_by_grade(students)
    write_summary(grouped, 'roster-summary.json')
    print(f'Wrote roster-summary.json with {len(grouped)} grade buckets.')
Run and inspect:
python3 lab-4-roster.py
cat roster-summary.json
You should see well-indented JSON like:
{
  "9": [
    "Alice Anderson",
    "Charlie Chen",
    "Fred Foster"
  ],
  "10": [
    "Bob Brown",
    "Eve Edwards"
  ],
  "11": [
    "Dana Davis",
    "Grace Green"
  ],
  "12": [
    "Henry Hall"
  ]
}
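Notice: even though your dict keys were ints, json.dump writes them as strings, because JSON object keys must be strings. This is fine for the lab. Week 5 covers the default= argument to json.dump; note that default= customizes how otherwise unsupported values are serialized and does not change how keys are written. If you read the file back and need int keys, convert them yourself.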
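The key conversion is easiest to see in a round trip. The sketch below uses json.dumps and json.loads on made-up data so that no file is needed; restoring int keys with a dict comprehension is one possible approach, not something the lab requires.

```python
import json

grouped = {9: ['Alice Anderson'], 10: ['Bob Brown']}

text = json.dumps(grouped)   # int keys are written as "9" and "10"
loaded = json.loads(text)    # keys come back as strings
restored = {int(k): v for k, v in loaded.items()}

print(list(loaded))          # ['9', '10']
print(restored == grouped)   # True
```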
Part D: Add some statistics (15 min)
Extend the summary so each grade bucket includes a count, not just a list:
def group_by_grade_with_counts(students):
    """Group students by grade level. Return {grade: {'count': N, 'students': [name, ...]}}."""
    grouped = {}
    for student in students:
        grade = int(student['grade'])
        if grade not in grouped:
            # First student seen in this grade: start an empty bucket.
            grouped[grade] = {'count': 0, 'students': []}
        grouped[grade]['count'] += 1
        grouped[grade]['students'].append(student['name'])
    return grouped
The output JSON becomes:
{
  "9": {
    "count": 3,
    "students": [
      "Alice Anderson",
      "Charlie Chen",
      "Fred Foster"
    ]
  },
  "10": { ... }
}
You could instead compute each count as the len() of that grade's name list after the loop, but accumulating during the loop is the more general pattern (useful when you compute multiple statistics).
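For reference, the after-the-loop alternative might look like the sketch below. The helper name add_counts is hypothetical; it takes the {grade: [names]} dict that group_by_grade returns in Part B.

```python
def add_counts(grouped):
    """Turn {grade: [names]} into {grade: {'count': N, 'students': [names]}}."""
    return {
        grade: {'count': len(names), 'students': names}
        for grade, names in grouped.items()
    }

# Hypothetical input, shaped like the result of group_by_grade.
by_grade = {9: ['Alice Anderson', 'Charlie Chen'], 10: ['Bob Brown']}
with_counts = add_counts(by_grade)
print(with_counts[9]['count'], with_counts[10]['count'])  # 2 1
```

This works because count is derivable from the list itself. A statistic that is not, such as a running total of some other column, is where accumulating inside the loop pays off.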
Part E: Commit your work (15 min)
cd ~/fnd-102/lab-4
git add lab-4-roster.py students.csv roster-summary.json
git commit -m "lab-4: class-roster grouper reads CSV, groups by grade, writes JSON summary"
A second commit for the statistics extension (Part D) is a good multi-commit habit:
git add lab-4-roster.py
git commit -m "lab-4: extend summary with per-grade count"
Expected output / artifact
lab-4-roster.py should:
- Read students.csv (in the lab directory)
- Group students by grade level
- Write roster-summary.json with per-grade counts and name lists
- Use csv.DictReader and json.dump (not custom CSV parsing or string concatenation for JSON)
Files committed: lab-4-roster.py, students.csv, roster-summary.json.
What's the failure mode?
This tool's likely failure modes:
- CSV with embedded commas in a name. "Anderson, Alice" as a single field requires the name to be quoted: "Anderson, Alice",9,A101. csv.DictReader handles this correctly if the quoting is correct. Test by adding a row with a quoted comma-containing name; verify your output.
- Missing grade column. If the CSV header has grade_level instead of grade, student['grade'] raises KeyError. Defensive: print the header row's field names at startup to verify.
- Non-numeric grade. A row whose grade field is 9th instead of 9 (for example, a line ending in ,9th,A101) raises ValueError at int(student['grade']). The week 9 fix is try/except around the conversion plus skip-with-warning; for this week, a crash is acceptable.
- Empty CSV. A file with only a header row produces an empty students list; the output JSON is {}. This is correct behavior, not a bug, but worth verifying.
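As a preview of the defensive habits mentioned above, here is one possible sketch of a header check and a tolerant grade conversion. The names REQUIRED, missing_columns, and parse_grade are hypothetical, and none of this is required for the lab this week.

```python
import csv
import io

REQUIRED = ('name', 'grade')  # assumed minimum columns for the roster

def missing_columns(fieldnames):
    """Return the required column names that the header lacks."""
    present = fieldnames or []  # DictReader.fieldnames is None for an empty file
    return [col for col in REQUIRED if col not in present]

def parse_grade(text):
    """Return the grade as an int, or None if the field is not a whole number."""
    try:
        return int(text)
    except (TypeError, ValueError):  # TypeError covers None from a short row
        return None

# Hypothetical input whose header says grade_level instead of grade.
sample = 'name,grade_level,homeroom\nAlice Anderson,9,A101\n'
reader = csv.DictReader(io.StringIO(sample, newline=''))

print(missing_columns(reader.fieldnames))    # ['grade']
print(parse_grade('9'), parse_grade('9th'))  # 9 None
```

A caller could print a warning and skip the row when parse_grade returns None, or exit early when missing_columns returns a non-empty list.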
Common pitfalls
- Forgetting newline='' in open(). Reading usually still works, but line breaks inside quoted fields can be misparsed; when writing CSV on Windows, omitting it produces extra blank rows. Always include it for CSV files.
- Iterating the DictReader twice. csv.DictReader is a one-shot iterator. After the first for row in reader: exhausts it, a second iteration sees nothing. If you need to walk the data twice, collect into a list first: rows = list(reader).
- Comparing string grades. '9' < '10' is False in Python (lexicographic order: '1' < '9'). Always convert to int at the boundary.
- json.dumps vs json.dump. dumps returns a string (no file); dump writes to a file object. Easy to swap accidentally. The 's' is for "string."
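Two of these pitfalls are easy to reproduce in a few lines. The sketch below uses in-memory data (io.StringIO) as a stand-in for real files; the sample rows are made up.

```python
import csv
import io
import json

sample = 'name,grade\nAlice Anderson,9\nBob Brown,10\n'
reader = csv.DictReader(io.StringIO(sample, newline=''))

first_pass = list(reader)   # consumes every row
second_pass = list(reader)  # the reader is now exhausted
print(len(first_pass), len(second_pass))  # 2 0

text = json.dumps({'count': 2})   # returns a str
buffer = io.StringIO()
json.dump({'count': 2}, buffer)   # writes to a file-like object, returns None
print(text == buffer.getvalue())  # True
```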
Stretch (optional)
- Sort the names within each grade bucket alphabetically by last name. Hint: sorted(names, key=lambda n: n.split()[-1]).
- Read from stdin instead of a file. Allow python3 lab-4-roster.py < students.csv > summary.json. Use sys.stdin and sys.stdout instead of open(). Forward-pointer to Unix pipeline composition (week 9 subprocess).
- Add a --top-N argument (manual sys.argv parsing for now) that emits, instead of the full grouping, a list of the top N students per grade (alphabetically). Forward-pointer to week 6 argparse.
- Validate the CSV. Before processing, check that the required columns (name, grade) are present; print a clear error and exit nonzero if not. Practitioner habit; catches typos in the input.
Lab 4 v0.1.