Week 7: Regular Expressions · FND-102 · Virtus Cyber Academy Classroom

When to use regex; when NOT to. The lab extracts IPv4 and IPv6 addresses from a network log file using re.findall.

Theme

A regular expression (regex) is a tiny pattern language for matching text. The Python re module gives you the standard regex engine in the stdlib. Used well, regex makes "find every email address in this 1 GB log" a one-liner. Used poorly, it produces unreadable patterns that match the wrong things and waste hours.

The week's discipline is two-part: learn the syntax, and learn when NOT to reach for it. Most text-processing tasks (split by character, find a fixed substring, replace one word with another) are clearer with str.split, 'x' in s, and s.replace. Regex earns its place when the pattern is genuinely a pattern: variable-length, optional parts, character classes, alternatives.

The lab is an IP-address extractor. Network logs are full of IP addresses in various formats: IPv4 (192.168.1.1), IPv6 (fe80::1 and the longer forms), addresses with ports (192.168.1.1:8080). Your scanner reads the log, finds all the IPs, and emits a sorted unique list. The matching is genuinely a pattern (IPv4 is four numbers 0-255 separated by dots; IPv6 is groups of hex digits separated by colons), so regex is the right tool.

By the end of week 7 you can: read a basic regex (character classes, quantifiers, anchors, groups); write a regex for the common patterns (numbers, words, IPs, dates, URLs); use re.findall, re.search, re.match, re.sub; recognize raw strings (r'...') and why regex patterns always use them; debug a regex that does not match (the re.findall(pattern, text) print trick).

Reading list (~1 hour)

Matthes, Python Crash Course 2nd ed. does not have a regex chapter. Use Sweigart for the introduction.
Sweigart, Automate the Boring Stuff with Python 2nd ed., Ch 7 ("Pattern Matching with Regular Expressions") at https://automatetheboringstuff.com/2e/chapter7/. Free online. The canonical beginner's introduction; the worked phone-number example is excellent.
Python re module HOWTO at https://docs.python.org/3/howto/regex.html. ~30 min read. Covers re.compile, the Match object, groups, and the difference between search and match. Required reading; the lab assumes you read this.
Real Python: "Regular Expressions in Python" at https://realpython.com/regex-python/. ~30 min read. Practitioner patterns and pitfalls.
regex101.com (web tool, not a reading). The standard interactive regex sandbox. Try every pattern from this week's lecture; observe the matches highlighted.

Lecture outline (~1.5 hours, 2 sessions of ~50 min)

Session 1: Regex basics

Section 1.1: A first regex

The simplest pattern: a literal string.

import re
re.findall(r'cat', 'the cat sat on the mat')   # ['cat']

re.findall(pattern, text) returns all non-overlapping matches as a list of strings.
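The r'...' is a raw string. In a raw string, \ is a literal backslash. Without the r prefix, Python processes backslash escapes before the regex engine ever sees the pattern: '\b' becomes a backspace character and '\n' becomes a newline. Unrecognized escapes such as '\d' happen to pass through unchanged (recent Python versions warn about them), which hides the trap until a pattern uses an escape Python does recognize. ALWAYS use r'...' for regex.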
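A minimal sketch of the raw-string trap, using \b because it means "backspace" to Python and "word boundary" to the regex engine. The variable names are illustrative only.

```python
import re

text = 'the cat sat on the mat'

# Ordinary string: Python turns each '\b' into one backspace character.
plain = '\bcat\b'
# Raw string: the backslashes survive, so the regex engine sees \b (word boundary).
raw = r'\bcat\b'

plain_len = len(plain)   # 5 characters: backspace, c, a, t, backspace
raw_len = len(raw)       # 7 characters: \, b, c, a, t, \, b

plain_matches = re.findall(plain, text)   # searches for literal backspaces
raw_matches = re.findall(raw, text)       # searches for the whole word "cat"

print(plain_matches)   # []
print(raw_matches)     # ['cat']
```

The two patterns look identical in the source file apart from the r prefix, which is why the mistake is easy to miss.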

Section 1.2: Character classes

\d matches one digit (0-9; with str patterns it also matches other Unicode decimal digits unless re.ASCII is set). \D matches one non-digit.
\w matches one "word" character (letter, digit, underscore). \W matches non-word.
\s matches one whitespace character (space, tab, newline). \S matches non-whitespace.
. matches any character (except newline by default).
[abc] is a custom character class: matches a or b or c.
[a-z] matches one lowercase letter. [A-Za-z0-9] matches one alphanumeric.
[^abc] matches one character that is NOT a, b, or c.
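The classes above can be tried one character at a time. This is a small sketch with made-up sample strings; each pattern matches exactly one character per hit, so re.findall returns a list of single characters.

```python
import re

digits = re.findall(r'\d', 'room 42b')         # ['4', '2']
spaces = re.findall(r'\s', 'a b\tc')           # [' ', '\t']
any_char = re.findall(r'.', 'a\nb')            # ['a', 'b'] (newline is skipped)
in_class = re.findall(r'[abc]', 'cabbage')     # ['c', 'a', 'b', 'b', 'a']
not_in_class = re.findall(r'[^abc]', 'cabbage')  # ['g', 'e']
uppercase = re.findall(r'[A-Z]', 'Host-B')     # ['H', 'B']

print(digits, spaces, any_char, in_class, not_in_class, uppercase)
```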

Section 1.3: Quantifiers

* means "zero or more of the preceding atom." \d* matches "", "1", "12", "12345", etc.
+ means "one or more." \d+ matches "1", "12", ... but not "".
? means "zero or one." colou?r matches "color" or "colour".
{n} means "exactly n." \d{4} matches a 4-digit number.
{n,m} means "between n and m." \d{1,3} matches a 1-3 digit number.
Default is greedy: \d+ matches as many digits as possible. \d+? is the non-greedy version (matches as few as possible).
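A short sketch of the quantifiers in action, on invented sample text. Note the \d{1,3} result: a bounded quantifier will happily split a longer run of digits, which matters later for IP octets.

```python
import re

text = 'ports 80, 443 and 8080'

one_or_more = re.findall(r'\d+', text)        # ['80', '443', '8080']
exactly_four = re.findall(r'\d{4}', text)     # ['8080']
one_to_three = re.findall(r'\d{1,3}', text)   # ['80', '443', '808', '0']
optional_u = re.findall(r'colou?r', 'color and colour')  # ['color', 'colour']

html = '<b>hello</b>'
greedy = re.findall(r'<.*>', html)    # ['<b>hello</b>'] (as much as possible)
lazy = re.findall(r'<.*?>', html)     # ['<b>', '</b>'] (as little as possible)

print(one_to_three, greedy, lazy)
```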

Section 1.4: Anchors

^ anchors to the start of the string (or start of line with re.MULTILINE).
$ anchors to the end of the string (or end of line with re.MULTILINE).
\b is a word boundary (zero-width; matches the transition between \w and \W, and also the start or end of the string when a word character sits there).
^ERROR matches "ERROR" only at the start of the string; \bERROR\b matches "ERROR" as a whole word (not as part of "ERRORMESSAGE").
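The anchors can be compared side by side on a small multi-line string. This is a sketch with a made-up four-line log; the only library feature assumed beyond the text above is passing re.MULTILINE through the flags argument.

```python
import re

log = 'ERROR disk full\nINFO ok\nERROR net down\nERRORMESSAGE ignored'

# Without MULTILINE, ^ only matches at the very start of the string.
start_only = re.findall(r'^ERROR', log)

# With MULTILINE, ^ matches at the start of every line (lines 1, 3 and 4 here).
each_line = re.findall(r'^ERROR', log, flags=re.MULTILINE)

# \b on both sides rejects ERRORMESSAGE, leaving lines 1 and 3.
whole_word = re.findall(r'\bERROR\b', log)

# $ with MULTILINE matches at the end of every line.
last_words = re.findall(r'\w+$', log, flags=re.MULTILINE)

print(len(start_only), len(each_line), len(whole_word))   # 1 3 2
print(last_words)   # ['full', 'ok', 'down', 'ignored']
```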

Session 2: Groups, the re module, when not to use regex

Section 2.1: Groups

Parentheses create a capturing group. The match object remembers what each group matched.

m = re.match(r'(\d+)-(\d+)-(\d+)', '2026-05-28')
m.group(0)   # '2026-05-28' (the whole match)
m.group(1)   # '2026' (the first group)
m.group(2)   # '05'
m.group(3)   # '28'
m.groups()   # ('2026', '05', '28')

re.findall(r'(\d+)-(\d+)-(\d+)', text) returns a list of tuples (one per match).
Non-capturing groups: (?:...) groups without capturing. Useful for applying quantifiers without inflating the group count.
Named groups: (?P<year>\d+) lets you access via m.group('year'). Useful for readable code with many groups.
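The three kinds of group can be sketched on one sample string. The text and the group names month and day are illustrative; year is the name used above.

```python
import re

text = 'created 2026-05-28, updated 2026-06-02'

# Capturing groups: findall returns one tuple per match.
tuples = re.findall(r'(\d+)-(\d+)-(\d+)', text)
# [('2026', '05', '28'), ('2026', '06', '02')]

# Named groups: read parts by name instead of by position.
m = re.search(r'(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)', text)
year = m.group('year')       # '2026'
parts = m.groupdict()        # {'year': '2026', 'month': '05', 'day': '28'}

# Non-capturing group: (?:\d+-){2} repeats "digits then dash" twice without
# adding a group, so the only captured group is the day.
days = re.findall(r'(?:\d+-){2}(\d+)', text)   # ['28', '02']

print(tuples, year, parts, days)
```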

Section 2.2: The re module API

re.search(pattern, text) returns the first match anywhere; None if no match.
re.match(pattern, text) returns the first match ONLY at the start; None otherwise. Less useful than search.
re.findall(pattern, text) returns all non-overlapping matches as a list.
re.finditer(pattern, text) returns an iterator over Match objects (use for very long inputs).
re.sub(pattern, replacement, text) returns text with all matches replaced.
re.compile(pattern) pre-compiles a pattern; the resulting object has .search, .findall, etc. methods. Use for patterns used many times.
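The calls listed above, sketched against one invented log line. The dotted-number pattern here is deliberately loose and is only meant to show how the functions differ, not how to match IPs well.

```python
import re

line = 'WARN 10.0.0.5 retry 3; WARN 10.0.0.9 retry 7'
pattern = r'\d+\.\d+\.\d+\.\d+'

found_search = re.search(pattern, line)   # Match object: first hit anywhere
found_match = re.match(pattern, line)     # None: the line starts with "WARN"
first_ip = found_search.group(0)          # '10.0.0.5'

# Compile once when the same pattern is reused many times.
ip_re = re.compile(pattern)
all_ips = ip_re.findall(line)             # ['10.0.0.5', '10.0.0.9']

# finditer yields Match objects lazily, so positions are available too.
sliced = [line[m.start():m.end()] for m in ip_re.finditer(line)]

masked = ip_re.sub('[ip]', line)          # 'WARN [ip] retry 3; WARN [ip] retry 7'

print(first_ip, all_ips, masked)
```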

Section 2.3: Alternation and groups

cat|dog matches "cat" or "dog".
Combine with grouping: (cat|dog) food matches "cat food" or "dog food".
Alternation does not work inside character classes: [cd] matches one character, "c" or "d", while (cd) matches the two-character string "cd". Inside [...], | is just a literal pipe character, so [cat|dog] is not "cat or dog".
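A brief sketch contrasting alternation, grouping, and character classes on invented text. The grouped pattern uses a non-capturing group so that re.findall returns the whole match rather than just the group.

```python
import re

text = 'cat food, dog food, cd food'

either = re.findall(r'cat|dog', text)             # ['cat', 'dog']
grouped = re.findall(r'(?:cat|dog) food', text)   # ['cat food', 'dog food']

one_char = re.findall(r'[cd]', 'cd')     # ['c', 'd']: one character per match
two_chars = re.findall(r'(cd)', 'cd')    # ['cd']: the two-character string

# Inside a character class, | is an ordinary character.
pipe_class = re.findall(r'[c|d]', 'c|d')   # ['c', '|', 'd']

print(either, grouped, one_char, two_chars, pipe_class)
```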

Section 2.4: Common patterns

A US phone number (rough): r'\d{3}-\d{3}-\d{4}' matches "555-123-4567". Replace the leading \d{3} with \(?\d{3}\)? to allow optional parentheses around the area code.
An email address (rough): r'[\w.+-]+@[\w-]+\.[\w.-]+'. Note: a "rough" pattern matches the large majority of everyday addresses; a pattern that covers the full RFC 5322 address grammar, quoting included, is famously hundreds of characters long. Pick the simpler one for almost everything.
An IPv4 address (rough): r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'. Matches "192.168.1.1" but also "999.999.999.999". For strict matching of valid octets (0-255), the pattern gets uglier; you typically validate after a rough match.
An IPv6 address (rough): much harder; multiple valid forms. Your Lab 7 uses a pragmatic approximation.
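The "rough match, then validate" idea for IPv4 might look like the sketch below. The names ROUGH_IPV4 and valid_ipv4s are hypothetical, the \b anchors are an addition to the rough pattern above, and the octet check is only one possible validation step (it does not reject leading zeros, for example). It is not the Lab 7 solution.

```python
import re

# Rough pattern from above, wrapped in \b so it does not start mid-number.
ROUGH_IPV4 = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def valid_ipv4s(text):
    """Return rough IPv4 matches whose four octets are all in 0-255."""
    keep = []
    for candidate in ROUGH_IPV4.findall(text):
        octets = candidate.split('.')
        if all(0 <= int(octet) <= 255 for octet in octets):
            keep.append(candidate)
    return keep

sample = 'ok 192.168.1.1:8080 bad 999.999.999.999 ok 10.0.0.7'
rough = ROUGH_IPV4.findall(sample)
strict = valid_ipv4s(sample)

print(rough)    # ['192.168.1.1', '999.999.999.999', '10.0.0.7']
print(strict)   # ['192.168.1.1', '10.0.0.7']
```

Keeping the regex loose and the range check in plain Python leaves both halves readable, at the cost of a second pass over the matches.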

Section 2.5: When NOT to use regex

Fixed substring? Use in: 'ERROR' in line is faster and clearer than re.search(r'ERROR', line).
Split on a single character? Use str.split: line.split(',') not re.split(r',', line).
Parsing nested structures (HTML, JSON, code)? Use a proper parser. Regex famously cannot parse balanced parentheses correctly.
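Validating a complex format (email, URL, IP)? Use a library: ipaddress validates IP addresses, urllib.parse splits URLs into components, and email.utils parses address headers. The stdlib already handles the edge cases; a hand-rolled regex re-invents the wheel and usually gets them wrong.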
A single fixed prefix? Use s.startswith('prefix') not re.match(r'prefix', s).

The discipline: regex is a power tool. Reach for it when the pattern is variable-shaped (length, optional parts, alternatives). For fixed shapes, the string method is faster, clearer, and harder to get wrong.
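For comparison, the non-regex alternatives from this section can be sketched in a few lines. The sample line and the helper name is_ip are illustrative; ipaddress.ip_address raises ValueError for strings that are not valid IPv4 or IPv6 addresses.

```python
import ipaddress

line = 'ERROR,disk,full'

has_error = 'ERROR' in line                 # fixed substring
fields = line.split(',')                    # split on one character
is_error_line = line.startswith('ERROR')    # fixed prefix
replaced = line.replace('disk', 'volume')   # fixed replacement

def is_ip(s):
    """Return True if s parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(s)
        return True
    except ValueError:
        return False

checks = [is_ip('192.168.1.1'), is_ip('fe80::1'), is_ip('999.999.999.999')]

print(has_error, fields, is_error_line, replaced)
print(checks)   # [True, True, False]
```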

Labs (~90 minutes)

Lab 7: Regex IP Extractor (labs/lab-7-regex-ip-extractor.md)

Goal: build a CLI tool that reads a network log file, extracts every IPv4 and IPv6 address, and emits a sorted unique list.
Time: ~90 minutes
Artifact: lab-7-extract.py + sample network.log in ~/fnd-102/lab-7/, committed to Git

Independent practice (~4 hours)

regex101.com tour (45 min). Open https://regex101.com/ (set the flavor to Python in the left sidebar). Paste the following patterns one at a time; paste sample text into the test string; observe the matches:
- \d+ against "I have 3 cats and 12 dogs"
- \b\w+ing\b against "the running dog was chasing the singing cat"
- https?://\S+ against "visit http://example.com and https://www.example.org/path"
- (\d{4})-(\d{2})-(\d{2}) against "today is 2026-05-28"
- [A-Z][a-z]+ against "Alice and Bob met Charlie"
For each, write down what the pattern means in English.
str-method vs regex drill (30 min). For each task, decide str-method or regex. Implement both for two of them; compare readability:
- "Find every line in a file that starts with #"
- "Find every line that contains a 4-digit year"
- "Replace every occurrence of apple with orange"
- "Split a CSV line on commas (ignoring commas inside quoted fields)"
- "Find every word ending in -ing"
- "Count digits in a string"
Email extractor (45 min). Write a Python script that, given a path to a text file, prints all email addresses found, one per line, sorted unique. Use re.findall with the rough email pattern. Test on a file that includes some borderline cases: user.name+tag@example.co.uk, not.an.email@, @nodomain.com.
re.sub exercise (30 min). Write a Python script that reads a file and writes a new file with every phone number (US-style: \d{3}-\d{3}-\d{4} or (\d{3}) \d{3}-\d{4}) replaced with [redacted]. Useful for log scrubbing.
Backreferences (30 min). Read the docs on backreferences (\1, \2 or (?P=name) inside a pattern; \1 or \g<name> inside a re.sub replacement string). Write a regex that finds doubled words: (\b\w+\b)\s+\1. Test against "the the cat sat on on the mat". Why does this only catch some duplicates and miss case differences?
Optional stretch: IPv6 the proper way (60 min). Read https://docs.python.org/3/library/ipaddress.html. Write a script that, given a string, attempts ipaddress.ip_address(s); if it succeeds, the string is a valid IP. Compare with your Lab 7 regex approach. Where does each win?

Reflection prompts (~30 minutes)

You learned to recognize "this is a regex problem" vs "this is a str-method problem." Of your six tasks in practice item 2, how many felt clearly regex? Clearly str-method? In-between?
Your Lab 7 regex matches 192.168.1.1 AND 999.999.999.999. The second is not a valid IPv4 address. What is your fix: tighten the regex, or validate after matching? Why?
Greedy vs non-greedy: <.*> against <b>hello</b> matches the WHOLE thing (greedy). <.*?> matches <b> (non-greedy). Which would you want for HTML extraction? Why?
The re module supports a re.VERBOSE flag that lets you write multi-line patterns with comments. Did you encounter a pattern this week that would have been clearer as re.VERBOSE? Show before/after.
One thing from this week you want to know more about?

Tool journal (week 7)

re module: findall, search, match, sub, compile, finditer
Character classes: \d, \w, \s and negations \D, \W, \S
Custom classes: [abc], [a-z], [^abc]
Quantifiers: *, +, ?, {n}, {n,m}; non-greedy with ?
Anchors: ^, $, \b
Groups: (...) capturing, (?:...) non-capturing, (?P<name>...) named
Raw strings: r'pattern' (always for regex)
regex101.com: interactive sandbox for debugging patterns
ipaddress stdlib module: when regex is the wrong tool

What comes next

Week 8 is the midterm. The exam is a 2-hour proctored practical: you receive a spec for a CLI tool you have not seen before and must implement it using the standard library and your notes. The exam exercises everything from weeks 1-7: language fluency, argparse, file I/O, basic regex, logging. There is no lab artifact in week 8 beyond the exam itself.