Use grep, cut, sort, and uniq to extract and summarize information from a sample log file.
Goal: build grep-based pipelines to extract structured information from text, identify patterns, and produce summary statistics.
Estimated time: 60 minutes
Prerequisites: Week 8 lecture (grep flags, pipes, cut, sort, uniq); Lab 8.1 completed
Setup
cd ~/fnd-101/lab-8-1 # reuse the sample from Lab 8.1
You will work with sample/logs/access.log from Lab 8.1.
Part A: grep flags in practice
For each task, write the command and record the output (or count).
- Case-insensitive search: count lines containing "get" regardless of case:
grep -ic "get" sample/logs/access.log
- Inverted search: show only lines that do NOT contain "200":
grep -v "200" sample/logs/access.log
- Line numbers: show lines containing "403" with their line numbers:
grep -n "403" sample/logs/access.log
- Recursive search: create a second log file sample/logs/error.log with a few lines, then search for "error" in all files under sample/logs/:
echo "ERROR: connection refused" > sample/logs/error.log
echo "WARNING: disk space low" >> sample/logs/error.log
grep -ri "error" sample/logs/
Part B: cut + sort + uniq pipeline
Find the top 3 IP addresses by request count:
cut -d' ' -f1 sample/logs/access.log | sort | uniq -c | sort -rn | head -3
Break down what each stage does:
- cut -d' ' -f1: split each line on space, take the first field (the IP address)
- sort: sort alphabetically so identical IPs are adjacent
- uniq -c: count consecutive identical lines; produces "count IP" output
- sort -rn: sort numerically in reverse order (highest count first)
- head -3: keep only the top 3
In your worksheet, write the output of this pipeline AND write a sentence explaining what each stage in the pipeline does.
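If a stage is unclear, trace the pipeline one stage at a time. The four log lines below are invented for illustration (they are not from access.log), but they follow the same IP METHOD PATH STATUS format:

```shell
# Made-up mini log to trace each pipeline stage
printf '10.0.0.1 GET / 200\n10.0.0.2 GET /about 200\n10.0.0.1 POST /login 403\n10.0.0.1 GET / 200\n' > /tmp/mini.log

cut -d' ' -f1 /tmp/mini.log                                       # one IP per line, in file order
cut -d' ' -f1 /tmp/mini.log | sort                                # identical IPs now adjacent
cut -d' ' -f1 /tmp/mini.log | sort | uniq -c                      # "count IP" pairs
cut -d' ' -f1 /tmp/mini.log | sort | uniq -c | sort -rn | head -3 # highest counts first
```

Running the last line on this mini log puts 10.0.0.1 (3 requests) at the top.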
Part C: Extracting specific fields
The log format is: IP METHOD PATH STATUS
- Extract only the STATUS field (field 4) and list all unique values:
cut -d' ' -f4 sample/logs/access.log | sort | uniq
- Count requests per HTTP method (GET, POST):
cut -d' ' -f2 sample/logs/access.log | sort | uniq -c
- Find all paths that were requested more than once:
cut -d' ' -f3 sample/logs/access.log | sort | uniq -c | awk '$1 > 1'
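The awk stage at the end is new: awk splits each of uniq -c's "count path" lines on whitespace, $1 refers to the first field (the count), and '$1 > 1' prints only lines where that condition holds. A minimal illustration with made-up counts:

```shell
# Made-up lines in uniq -c's "count path" output shape;
# awk '$1 > 1' keeps /index.html and /login, drops /about
printf ' 3 /index.html\n 1 /about\n 2 /login\n' | awk '$1 > 1'
```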
Part D: Write the pipeline as a script
Save your "top IP addresses" pipeline as a reusable script:
cat > lab-8-2.sh << 'EOF'
#!/bin/bash
# Usage: ./lab-8-2.sh <logfile>
# Prints the top 5 IP addresses by request count
if [ -z "$1" ]; then
echo "Usage: $0 <logfile>"
exit 1
fi
if [ ! -f "$1" ]; then
echo "Error: $1 not found"
exit 1
fi
cut -d' ' -f1 "$1" | sort | uniq -c | sort -rn | head -5
EOF
chmod +x lab-8-2.sh
./lab-8-2.sh sample/logs/access.log
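One detail worth noticing: the heredoc delimiter is written as 'EOF' with quotes. Quoting it makes the shell copy the body verbatim; with an unquoted EOF, $1 and $0 would be expanded while the script file is being written, not when it runs. A quick comparison (using $HOME as the variable):

```shell
# Quoted delimiter: body is written literally, $HOME stays as-is
cat << 'EOF'
literal: $HOME
EOF

# Unquoted delimiter: the shell expands $HOME before cat sees it
cat << EOF
expanded: $HOME
EOF
```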
Expected output / artifact
lab-8-2-notes.txt with:
- Output of each Part A command
- Output and explanation for Part B pipeline
- Output of each Part C command
lab-8-2.sh (the script)
git add lab-8-2-notes.txt lab-8-2.sh
git commit -m "lab-8-2: text search pipelines + top-IP script"
Common pitfalls
- uniq -c requires sorted input: uniq only counts consecutive identical lines; if you skip the sort stage, it will under-count
- Field numbers in cut: field numbering starts at 1, not 0
- sort -rn vs sort -r: -n sorts numerically; without it, sort treats "10" as less than "9" (lexicographic ordering)
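The lexicographic trap is easy to demonstrate on three bare numbers:

```shell
printf '9\n10\n2\n' | sort     # lexicographic: 10, 2, 9 ("1" sorts before "2" and "9")
printf '9\n10\n2\n' | sort -n  # numeric: 2, 9, 10
```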
Stretch (optional)
Modify lab-8-2.sh to accept a second argument: a minimum count threshold. Only print IPs that appear at least N times. Test with N=2 on the sample log.
Lab 8.2 v0.1.