Lab 8.2: Text Search

~60 minutes. Use grep, cut, sort, and uniq to extract and summarize information from a sample log file.


Goal: build grep-based pipelines to extract structured information from text, identify patterns, and produce summary statistics.

Estimated time: 60 minutes

Prerequisites: Week 8 lecture (grep flags, pipes, cut, sort, uniq); Lab 8.1 completed


Setup

cd ~/fnd-101/lab-8-1    # reuse the sample from Lab 8.1

You will work with sample/logs/access.log from Lab 8.1.


Part A: grep flags in practice

For each task, write the command and record the output (or count).

  1. Case-insensitive search: count lines containing "get" regardless of case:

    grep -ic "get" sample/logs/access.log
    
  2. Inverted search: show only lines that do NOT contain "200":

    grep -v "200" sample/logs/access.log
    
  3. Line numbers: show lines containing "403" with their line numbers:

    grep -n "403" sample/logs/access.log
    
  4. Recursive search: create a second log file, sample/logs/error.log, with a few lines, then search for "error" in every file under sample/logs/:

    echo "ERROR: connection refused" > sample/logs/error.log
    echo "WARNING: disk space low" >> sample/logs/error.log
    grep -ri "error" sample/logs/
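If any of these flags behaves unexpectedly, check it against a tiny throwaway file first. The log lines below are invented for illustration; they are not taken from the real access.log:

```shell
# Three invented lines just to exercise the flags (not the real log).
printf 'GET /index 200\nget /about 403\nPOST /login 200\n' > /tmp/mini.log

grep -ic "get" /tmp/mini.log    # -i ignores case, -c counts matching lines
grep -vn "200" /tmp/mini.log    # -v inverts the match, -n adds line numbers
```

The first command prints 2 (both "GET" and "get" match), and the second prints only line 2, the single line without "200".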
    

Part B: cut + sort + uniq pipeline

Find the top 3 IP addresses by request count:

cut -d' ' -f1 sample/logs/access.log | sort | uniq -c | sort -rn | head -3

Break down what each stage does:

  1. cut -d' ' -f1: split each line on space, take the first field (the IP address)
  2. sort: sort alphabetically so identical IPs are adjacent
  3. uniq -c: count consecutive identical lines; produces "count IP" output
  4. sort -rn: sort numerically in reverse order (highest count first)
  5. head -3: keep only the top 3

In your worksheet, record the output of this pipeline and write one sentence per stage explaining what it does.
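To see what each stage contributes, you can trace the pipeline on a few invented lines (the IPs below are made up for illustration; your real output comes from access.log):

```shell
# Three invented requests; 10.0.0.2 appears twice, 10.0.0.1 once.
printf '10.0.0.2 GET /a 200\n10.0.0.1 GET /b 200\n10.0.0.2 GET /c 404\n' |
    cut -d' ' -f1 |   # keep only the IP column, one IP per line
    sort |            # groups the two 10.0.0.2 lines together
    uniq -c |         # "2 10.0.0.2" and "1 10.0.0.1"
    sort -rn |        # highest count first
    head -3           # at most three lines (only two exist here)
```

A good way to study any pipeline is to run it cumulatively: stop after cut, then after sort, and so on, watching the data change shape at each stage.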


Part C: Extracting specific fields

The log format is: IP METHOD PATH STATUS

  1. Extract only the STATUS field (field 4) and list all unique values:

    cut -d' ' -f4 sample/logs/access.log | sort | uniq
    
  2. Count requests per HTTP method (GET, POST):

    cut -d' ' -f2 sample/logs/access.log | sort | uniq -c
    
  3. Find all paths that were requested more than once:

    cut -d' ' -f3 sample/logs/access.log | sort | uniq -c | awk '$1 > 1'
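The awk '$1 > 1' stage keeps lines whose first field, the count added by uniq -c, is greater than 1. If you only need the repeated paths themselves and not their counts, uniq -d (print only duplicated lines) gets there without awk. The paths here are invented for illustration:

```shell
# uniq -d prints one copy of each line that repeats in sorted input.
printf '/a\n/b\n/a\n' | sort | uniq -d    # prints /a
```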
    

Part D: Write the pipeline as a script

Save your "top IP addresses" pipeline as a reusable script:

cat > lab-8-2.sh << 'EOF'
#!/bin/bash
# Usage: ./lab-8-2.sh <logfile>
# Prints the top 5 IP addresses by request count

if [ -z "$1" ]; then
    echo "Usage: $0 <logfile>"
    exit 1
fi

if [ ! -f "$1" ]; then
    echo "Error: $1 not found"
    exit 1
fi

cut -d' ' -f1 "$1" | sort | uniq -c | sort -rn | head -5
EOF

chmod +x lab-8-2.sh
./lab-8-2.sh sample/logs/access.log
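The script's two guard clauses rely on test operators worth knowing: [ -z "$1" ] is true when the first argument is empty or missing, and [ ! -f "$1" ] is true when the path is not a regular file. A quick sanity check of both:

```shell
# Demonstrate the two test operators used in lab-8-2.sh.
arg=""
[ -z "$arg" ] && echo "empty: script would print usage and exit 1"
[ ! -f /no/such/file ] && echo "missing: script would print error and exit 1"
```

Also try running the script with no argument and with a nonexistent file, and confirm with echo $? that it exits with status 1 in both cases.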

Expected output / artifact

lab-8-2-notes.txt with:

  • Output of each Part A command
  • Output and explanation for the Part B pipeline
  • Output of each Part C command

Also commit lab-8-2.sh (the script):

git add lab-8-2-notes.txt lab-8-2.sh
git commit -m "lab-8-2: text search pipelines + top-IP script"

Common pitfalls

  • uniq -c requires sorted input: uniq only counts consecutive identical lines; if you skip the sort stage, it will under-count
  • Field numbers in cut: field numbering starts at 1, not 0
  • sort -rn vs sort -r: -n sorts numerically; without it, sort treats "10" as less than "9" (lexicographic ordering)
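Both pitfalls are easy to demonstrate on tiny invented inputs:

```shell
# 1) uniq only merges adjacent duplicates: without sort, the two "a"
#    lines fall in separate runs, so you get 3 output lines instead of 2.
printf 'a\nb\na\n' | uniq -c
printf 'a\nb\na\n' | sort | uniq -c

# 2) Text sort compares character by character, so "10" sorts before "9";
#    sort -n compares the numeric values instead.
printf '9\n10\n' | sort      # 10 first
printf '9\n10\n' | sort -n   # 9 first
```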

Stretch (optional)

Modify lab-8-2.sh to accept a second argument: a minimum count threshold. Only print IPs that appear at least N times. Test with N=2 on the sample log.


Lab 8.2 v0.1.