When to use regex; when NOT to. The lab extracts IPv4 and IPv6 addresses from a network log file using re.findall.
Theme
A regular expression (regex) is a tiny pattern language for matching text. The Python re module gives you the standard regex engine in the stdlib. Used well, regex makes "find every email address in this 1 GB log" a one-liner. Used poorly, it produces unreadable patterns that match the wrong things and waste hours.
The week's discipline is two-part: learn the syntax, and learn when NOT to reach for it. Most text-processing tasks (split by character, find a fixed substring, replace one word with another) are clearer with str.split, 'x' in s, and s.replace. Regex earns its place when the pattern is genuinely a pattern: variable-length, optional parts, character classes, alternatives.
The lab is an IP-address extractor. Network logs are full of IP addresses in various formats: IPv4 (192.168.1.1), IPv6 (fe80::1 and the longer forms), addresses with ports (192.168.1.1:8080). Your scanner reads the log, finds all the IPs, and emits a sorted unique list. The matching is genuinely a pattern (IPv4 is four numbers 0-255 separated by dots; IPv6 is groups of hex digits separated by colons), so regex is the right tool.
By the end of week 7 you can: read a basic regex (character classes, quantifiers, anchors, groups); write a regex for the common patterns (numbers, words, IPs, dates, URLs); use re.findall, re.search, re.match, re.sub; recognize raw strings (r'...') and why regex patterns always use them; debug a regex that does not match (the re.findall(pattern, text) print trick).
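The "print trick" from the last outcome can be sketched as follows. This is a minimal illustration, not the lab's method: the log line, the `broken` pattern, and the `steps` list are all hypothetical. The idea is to grow the pattern one piece at a time and print what `re.findall` returns at each step; the first step that returns an empty list points at the piece that is wrong.

```python
import re

# Hypothetical log line, used only for illustration.
text = "2026-05-28 10:15:02 ERROR disk full"

# Hypothetical broken pattern: it expects 'T' between date and time,
# but the log line uses a space.
broken = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'
print(re.findall(broken, text))  # []

# Grow the pattern piece by piece and print the matches at each step.
steps = [
    r'\d{4}',
    r'\d{4}-\d{2}-\d{2}',
    r'\d{4}-\d{2}-\d{2}T',                  # first empty result: the 'T' is the culprit
    r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}',  # fixed: a space instead of 'T'
]
for pattern in steps:
    print(f'{pattern!r:45} -> {re.findall(pattern, text)}')
```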
Reading list (~1 hour)
- Matthes, Python Crash Course 2nd ed. does not have a regex chapter. Use Sweigart for the introduction.
- Sweigart, Automate the Boring Stuff with Python 2nd ed., Ch 7 ("Pattern Matching with Regular Expressions") at https://automatetheboringstuff.com/2e/chapter7/. Free online. The canonical beginner's introduction; the worked phone-number example is excellent.
- Python `re` module HOWTO at https://docs.python.org/3/howto/regex.html. ~30 min read. Covers `re.compile`, the `Match` object, groups, and the difference between `search` and `match`. Required reading; the lab assumes you read this.
- Real Python: "Regular Expressions in Python" at https://realpython.com/regex-python/. ~30 min read. Practitioner patterns and pitfalls.
- regex101.com (web tool, not a reading). The standard interactive regex sandbox. Try every pattern from this week's lecture; observe the matches highlighted.
Lecture outline (~1.5 hours, 2 sessions of ~50 min)
Session 1: Regex basics
Section 1.1: A first regex
- The simplest pattern: a literal string.
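
      import re
      re.findall(r'cat', 'the cat sat on the mat')  # ['cat']

- `re.findall(pattern, text)` returns all non-overlapping matches as a list of strings.
- The `r'...'` is a raw string. In a raw string, `\` is a literal backslash. Without the `r` prefix, Python processes escape sequences before the regex engine ever sees the pattern: `'\b'` becomes a backspace character instead of the regex word boundary. `'\d'` happens to survive only because Python has no `\d` escape, and recent Python versions emit a warning for it. That inconsistency is a hidden trap once patterns get more complex. ALWAYS use `r'...'` for regex.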
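A small sketch of why the `r` prefix matters, using `\b` as the example. The sample text is an assumption chosen for illustration.

```python
import re

# '\b' in a normal string is ONE character: backspace (\x08).
# r'\b' is TWO characters: a backslash followed by the letter b.
print(len('\b'), len(r'\b'))  # 1 2

text = 'the cat sat'

# Raw string: the regex engine receives \b and treats it as a word boundary.
raw_hits = re.findall(r'\bcat\b', text)

# Normal string: the engine looks for backspace + 'cat' + backspace.
plain_hits = re.findall('\bcat\b', text)

print(raw_hits)    # ['cat']
print(plain_hits)  # []
```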
Section 1.2: Character classes
- `\d` matches one digit (0-9). `\D` matches one non-digit.
- `\w` matches one "word" character (letter, digit, underscore). `\W` matches non-word.
- `\s` matches one whitespace character (space, tab, newline). `\S` matches non-whitespace.
- `.` matches any character (except newline by default).
- `[abc]` is a custom character class: matches `a` or `b` or `c`.
- `[a-z]` matches one lowercase letter. `[A-Za-z0-9]` matches one alphanumeric.
- `[^abc]` matches one character that is NOT a, b, or c.
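The classes above can be tried against a single line. The sample `line` is made up for illustration.

```python
import re

# Hypothetical sample line.
line = 'user_42 logged in at 09:15 from host-7'

print(re.findall(r'\d', line))        # ['4', '2', '0', '9', '1', '5', '7']
print(re.findall(r'\w+', line))       # ['user_42', 'logged', 'in', 'at', '09', '15', 'from', 'host', '7']
print(len(re.findall(r'\s', line)))   # 6
print(re.findall(r'[a-z]+', line))    # ['user', 'logged', 'in', 'at', 'from', 'host']
print(re.findall(r'[^\w\s]', line))   # [':', '-']  (neither word characters nor whitespace)
print(re.findall(r'h.st', line))      # ['host']    ('.' stands in for any one character)
```

Note that `_` counts as a word character, which is why `user_42` comes back as one token.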
Section 1.3: Quantifiers
- `*` means "zero or more of the preceding atom." `\d*` matches "", "1", "12", "12345", etc.
- `+` means "one or more." `\d+` matches "1", "12", ... but not "".
- `?` means "zero or one." `colou?r` matches "color" or "colour".
- `{n}` means "exactly n." `\d{4}` matches a 4-digit number.
- `{n,m}` means "between n and m." `\d{1,3}` matches a 1-3 digit number.
- Default is greedy: `\d+` matches as many digits as possible. `\d+?` is the non-greedy version (matches as few as possible).
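A short sketch of the quantifiers in action, including the greedy versus non-greedy difference. The sample strings are assumptions for illustration.

```python
import re

digits = 'a1b22c333'
print(re.findall(r'\d+', digits))      # ['1', '22', '333']
print(re.findall(r'\d{2,3}', digits))  # ['22', '333']

print(re.findall(r'colou?r', 'color colour colr'))  # ['color', 'colour']

quoted = 'say "hi" and "bye"'
# Greedy: .* runs to the LAST closing quote it can reach.
print(re.findall(r'".*"', quoted))     # ['"hi" and "bye"']
# Non-greedy: .*? stops at the FIRST closing quote.
print(re.findall(r'".*?"', quoted))    # ['"hi"', '"bye"']
```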
Section 1.4: Anchors
- `^` anchors to the start of the string (or start of line with `re.MULTILINE`).
- `$` anchors to the end of the string (or end of line with `re.MULTILINE`).
- `\b` is a word boundary (zero-width; matches the transition between `\w` and `\W`).
- `^ERROR` matches "ERROR" only at the start of the string; `\bERROR\b` matches "ERROR" as a whole word (not as part of "ERRORMESSAGE").
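The effect of `re.MULTILINE` on `^` and `$`, and of `\b`, can be seen on a three-line string. The `log` text is a made-up example.

```python
import re

# Hypothetical three-line log.
log = 'ERROR disk full\nINFO ok\nERROR net down'

# Without re.MULTILINE, ^ matches only at the very start of the string.
print(re.findall(r'^ERROR', log))                # ['ERROR']
# With re.MULTILINE, ^ also matches at the start of each line.
print(re.findall(r'^ERROR', log, re.MULTILINE))  # ['ERROR', 'ERROR']

# 'full' ends a line, not the string, so $ needs re.MULTILINE to see it.
print(re.findall(r'full$', log))                 # []
print(re.findall(r'full$', log, re.MULTILINE))   # ['full']

# \b keeps ERROR from matching inside a longer word.
print(re.findall(r'\bERROR\b', 'ERROR ERRORMESSAGE'))  # ['ERROR']
```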
Session 2: Groups, the re module, when not to use regex
Section 2.1: Groups
- Parentheses create a capturing group. The match object remembers what each group matched.

      m = re.match(r'(\d+)-(\d+)-(\d+)', '2026-05-28')
      m.group(0)   # '2026-05-28' (the whole match)
      m.group(1)   # '2026' (the first group)
      m.group(2)   # '05'
      m.group(3)   # '28'
      m.groups()   # ('2026', '05', '28')

- `re.findall(r'(\d+)-(\d+)-(\d+)', text)` returns a list of tuples (one per match).
- Non-capturing groups: `(?:...)` groups without capturing. Useful for applying quantifiers without inflating the group count.
- Named groups: `(?P<year>\d+)` lets you access via `m.group('year')`. Useful for readable code with many groups.
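A brief sketch contrasting the three kinds of group. The `text` string and the group names `month` and `day` are assumptions added for illustration.

```python
import re

# Hypothetical sample text with two dates.
text = 'released 2026-05-28, patched 2026-06-03'

# Capturing groups: findall returns one tuple per match.
print(re.findall(r'(\d+)-(\d+)-(\d+)', text))
# [('2026', '05', '28'), ('2026', '06', '03')]

# Non-capturing group: the {2} applies to the group, and findall
# returns the whole match because nothing is captured.
print(re.findall(r'\d+(?:-\d+){2}', text))
# ['2026-05-28', '2026-06-03']

# Named groups: access parts by name instead of by position.
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
print(m.group('year'), m.group('month'), m.group('day'))  # 2026 05 28
print(m.groupdict())  # {'year': '2026', 'month': '05', 'day': '28'}
```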
Section 2.2: The re module API
- `re.search(pattern, text)` returns the first match anywhere; None if no match.
- `re.match(pattern, text)` returns the first match ONLY at the start; None otherwise. Less useful than `search`.
- `re.findall(pattern, text)` returns all non-overlapping matches as a list.
- `re.finditer(pattern, text)` returns an iterator over Match objects (use for very long inputs).
- `re.sub(pattern, replacement, text)` returns text with all matches replaced.
- `re.compile(pattern)` pre-compiles a pattern; the resulting object has `.search`, `.findall`, etc. methods. Use for patterns used many times.
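The functions above, side by side on one line of text. The `line` string and the name `status_re` are hypothetical, chosen only for this sketch.

```python
import re

# Hypothetical key=value log line.
line = 'host=web01 status=500 status=200'

# search finds the first match anywhere in the string.
print(re.search(r'status=\d+', line).group())  # status=500

# match only succeeds at the very start of the string.
print(re.match(r'status=\d+', line))           # None
print(re.match(r'host=\w+', line).group())     # host=web01

# sub replaces every match.
print(re.sub(r'status=\d+', 'status=N', line)) # host=web01 status=N status=N

# compile once, then reuse the pattern object's methods.
status_re = re.compile(r'status=(\d+)')
print(status_re.findall(line))                 # ['500', '200']

# finditer yields Match objects, so positions are available too.
positions = [(m.start(), m.group(1)) for m in status_re.finditer(line)]
print(positions)                               # [(11, '500'), (22, '200')]
```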
Section 2.3: Alternation and groups
- `cat|dog` matches "cat" or "dog".
- Combine with grouping: `(cat|dog) food` matches "cat food" or "dog food".
- Be careful with alternation inside character classes: `[cd]` is "c or d"; `(cd)` is the two-character string "cd". Different patterns.
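A sketch of how alternation interacts with grouping. The sample `text` is an assumption. One detail worth seeing in practice: `re.findall` returns the group contents when the pattern has a capturing group, so a non-capturing group is used where the whole match is wanted.

```python
import re

# Hypothetical sample text.
text = 'cat food, dog food, cd player'

print(re.findall(r'cat|dog', text))           # ['cat', 'dog']

# Without grouping, | splits the WHOLE pattern: 'cat' OR 'dog food'.
print(re.findall(r'cat|dog food', text))      # ['cat', 'dog food']

# Grouping limits the alternation to the animal name.
print(re.findall(r'(?:cat|dog) food', text))  # ['cat food', 'dog food']

# With a capturing group, findall returns only what the group matched.
print(re.findall(r'(cat|dog) food', text))    # ['cat', 'dog']

# Character class vs group: one character at a time vs the string 'cd'.
print(re.findall(r'[cd]', 'cd'))              # ['c', 'd']
print(re.findall(r'(cd)', 'cd'))              # ['cd']
```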
Section 2.4: Common patterns
- A US phone number (rough): `r'\d{3}-\d{3}-\d{4}'` matches "555-123-4567". Replace the first `\d{3}` with `\(?\d{3}\)?` to allow optional parentheses around the area code.
- An email address (rough): `r'[\w.+-]+@[\w-]+\.[\w.-]+'`. Note: a "rough" pattern matches most real-world addresses; a complete pattern that handles the full RFC 5321 syntax, quoting included, is famously hundreds of characters long. Pick the simpler one for almost everything.
- An IPv4 address (rough): `r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'`. Matches "192.168.1.1" but also "999.999.999.999". For strict matching of valid octets (0-255), the pattern gets uglier; you typically validate after a rough match.
- An IPv6 address (rough): much harder; multiple valid forms. Your Lab 7 uses a pragmatic approximation.
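The "rough match, then validate" idea for IPv4 could look like the sketch below. It is one possible approach, not the Lab 7 solution: the names `ROUGH_IPV4` and `valid_ipv4` and the sample `log` are hypothetical, and the `\b` anchors around the rough pattern are an addition not present in the pattern above. Validation is delegated to the stdlib `ipaddress` module.

```python
import ipaddress
import re

# Rough pattern from the list above, with \b added on both sides
# (an assumption) so a match cannot start or end inside a longer run of digits.
ROUGH_IPV4 = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def valid_ipv4(text):
    """Return the rough matches in text that ipaddress accepts as IPv4."""
    valid = []
    for candidate in ROUGH_IPV4.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. an octet above 255
        valid.append(candidate)
    return valid

# Hypothetical log fragment.
log = 'ok 192.168.1.1:8080 bad 999.999.999.999 ok 10.0.0.7'
print(ROUGH_IPV4.findall(log))  # ['192.168.1.1', '999.999.999.999', '10.0.0.7']
print(valid_ipv4(log))          # ['192.168.1.1', '10.0.0.7']
```

The regex stays short and readable, and the range check lives in code that is already tested.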
Section 2.5: When NOT to use regex
- Fixed substring? Use `in`: `'ERROR' in line` is faster and clearer than `re.search(r'ERROR', line)`.
- Split on a single character? Use `str.split`: `line.split(',')` not `re.split(r',', line)`.
- Parsing nested structures (HTML, JSON, code)? Use a proper parser. Regex famously cannot parse balanced parentheses correctly.
- Validating a complex format (email, URL, IP)? Use a library: `email.utils`, `urllib.parse`, `ipaddress`. The stdlib already has parsers and validators for these formats; regex re-invents wheels and gets them wrong.
- A single fixed prefix? Use `str.startswith` not `re.match(r'^prefix')`.
The discipline: regex is a power tool. Reach for it when the pattern is variable-shaped (length, optional parts, alternatives). For fixed shapes, the string method is faster, clearer, and harder to get wrong.
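A quick sketch showing that the string methods and their regex counterparts agree on fixed shapes, followed by a case where the shape varies and regex is the better fit. The sample `line` and the device-name pattern are assumptions for illustration.

```python
import re

# Hypothetical comma-separated log line.
line = 'ERROR,disk full,/dev/sda1'

# Fixed shapes: the str method gives the same answer with less to read.
assert ('ERROR' in line) == bool(re.search(r'ERROR', line))
assert line.split(',') == re.split(r',', line)
assert line.startswith('ERROR') == bool(re.match(r'ERROR', line))
assert line.replace('disk', 'volume') == re.sub(r'disk', 'volume', line)

# Variable shape: one lowercase letter, then one or more digits.
# No single str method expresses this; regex earns its place.
devices = re.findall(r'/dev/sd[a-z]\d+', line)
print(devices)  # ['/dev/sda1']
```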
Labs (~90 minutes)
Lab 7: Regex IP Extractor (labs/lab-7-regex-ip-extractor.md)
- Goal: build a CLI tool that reads a network log file and extracts every IPv4 and IPv6 address, emits a sorted unique list
- Time: ~90 minutes
- Artifact: `lab-7-extract.py` + sample `network.log` in `~/fnd-102/lab-7/`, committed to Git
Independent practice (~4 hours)
- regex101.com tour (45 min). Open https://regex101.com/ (set the flavor to Python in the left sidebar). Paste the following patterns one at a time; paste sample text into the test string; observe the matches:
  - `\d+` against "I have 3 cats and 12 dogs"
  - `\b\w+ing\b` against "the running dog was chasing the singing cat"
  - `https?://\S+` against "visit http://example.com and https://www.example.org/path"
  - `(\d{4})-(\d{2})-(\d{2})` against "today is 2026-05-28"
  - `[A-Z][a-z]+` against "Alice and Bob met Charlie"

  For each, write down what the pattern means in English.
- str-method vs regex drill (30 min). For each task, decide str-method or regex. Implement both for two of them; compare readability:
  - "Find every line in a file that starts with `#`"
  - "Find every line that contains a 4-digit year"
  - "Replace every occurrence of `apple` with `orange`"
  - "Split a CSV line on commas (ignoring commas inside quoted fields)"
  - "Find every word ending in `-ing`"
  - "Count digits in a string"
- Email extractor (45 min). Write a Python script that, given a path to a text file, prints all email addresses found, one per line, sorted unique. Use `re.findall` with the rough email pattern. Test on a file that includes some borderline cases: `user.name+tag@example.co.uk`, `not.an.email@`, `@nodomain.com`.
- `re.sub` exercise (30 min). Write a Python script that reads a file and writes a new file with every phone number (US-style: `\d{3}-\d{3}-\d{4}` or `\(\d{3}\) \d{3}-\d{4}`) replaced with `[redacted]`. Useful for log scrubbing.
- Backreferences (30 min). Read the docs on backreferences (`\1`, `\2` inside a pattern; `(?P=name)` for named groups inside a pattern; `\g<name>` in `re.sub` replacement strings). Write a regex that finds doubled words: `(\b\w+\b)\s+\1`. Test against "the the cat sat on on the mat". Why does this only catch some duplicates and miss case differences?
- Optional stretch: IPv6 the proper way (60 min). Read https://docs.python.org/3/library/ipaddress.html. Write a script that, given a string, attempts `ipaddress.ip_address(s)`; if it succeeds, the string is a valid IP. Compare with your Lab 7 regex approach. Where does each win?
Reflection prompts (~30 minutes)
- You learned to recognize "this is a regex problem" vs "this is a str-method problem." Of your six tasks in practice item 2, how many felt clearly regex? Clearly str-method? In-between?
- Your Lab 7 regex matches `192.168.1.1` AND `999.999.999.999`. The second is not a valid IPv4 address. What is your fix: tighten the regex, or validate after matching? Why?
- Greedy vs non-greedy: `<.*>` against `<b>hello</b>` matches the WHOLE thing (greedy). `<.*?>` matches `<b>` (non-greedy). Which would you want for HTML extraction? Why?
- The `re` module supports a `re.VERBOSE` flag that lets you write multi-line patterns with comments. Did you encounter a pattern this week that would have been clearer as `re.VERBOSE`? Show before/after.
- One thing from this week you want to know more about?
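For the `re.VERBOSE` prompt, a before/after might look like the sketch below. The date pattern is only an illustration, not a pattern you are required to use; the names `compact`, `verbose`, and `sample` are hypothetical.

```python
import re

# Before: compact, correct, but dense.
compact = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# After: the same pattern with re.VERBOSE. Whitespace in the pattern is
# ignored and everything after '#' on a line is a comment.
verbose = re.compile(r'''
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
''', re.VERBOSE)

sample = 'today is 2026-05-28'
same = compact.search(sample).groupdict() == verbose.search(sample).groupdict()
print(same)  # True
```

Because whitespace is ignored under `re.VERBOSE`, a literal space in such a pattern has to be written as `\ ` or `[ ]`.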
Tool journal (week 7)
- `re` module: `findall`, `search`, `match`, `sub`, `compile`, `finditer`
- Character classes: `\d`, `\w`, `\s` and negations `\D`, `\W`, `\S`
- Custom classes: `[abc]`, `[a-z]`, `[^abc]`
- Quantifiers: `*`, `+`, `?`, `{n}`, `{n,m}`; non-greedy with `?`
- Anchors: `^`, `$`, `\b`
- Groups: `(...)` capturing, `(?:...)` non-capturing, `(?P<name>...)` named
- Raw strings: `r'pattern'` (always for regex)
- regex101.com: interactive sandbox for debugging patterns
- `ipaddress` stdlib module: when regex is the wrong tool
What comes next
Week 8 is the midterm. The exam is a 2-hour proctored practical: you receive a spec for a CLI tool you have not seen before and must implement it using the standard library and your notes. The exam exercises everything from weeks 1-7: language fluency, argparse, file I/O, basic regex, logging. There is no lab artifact this week beyond the exam itself.