Module: 1 (LLM Black-Box Mental Model)
Duration: 2 hours
Substrate: Pyodide in-browser workbench + local Python (SETUP.md §2-3)
Points: 5
Objectives
- Send your first structured prompts to an LLM API (OpenAI or Anthropic)
- Measure and log token costs for each request
- Explore basic model parameters: temperature, max_tokens, system prompt
- Optional: run the Karpathy micrograd companion (30 min extension)
Part 1: API Smoke Test (20 min)
Open the Pyodide workbench or your local Python environment. Run this starter:
import os
from openai import OpenAI
# Or: from anthropic import Anthropic
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is a large language model in one sentence?"}
    ],
    max_tokens=100,   # upper bound on output tokens
    temperature=0.0   # as deterministic as the API allows
)
print(response.choices[0].message.content)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
Record:
- What did the model say?
- How many input tokens was the request?
- How many output tokens was the response?
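To keep the recorded numbers in one place, the token counts can be appended to a CSV file after each request. The following is a minimal sketch, not part of the lab starter: the function name `log_usage`, the field list, and the file name in the usage note are all hypothetical choices.

```python
import csv
import os
from datetime import datetime, timezone

# Column order for the log file (a hypothetical layout, adjust as needed).
LOG_FIELDS = ["timestamp", "model", "prompt_tokens", "completion_tokens", "total_tokens"]

def log_usage(path, model, prompt_tokens, completion_tokens):
    """Append one request's token counts to a CSV file; write a header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

With the Part 1 starter, a call might look like `log_usage("token_log.csv", "gpt-4o-mini", response.usage.prompt_tokens, response.usage.completion_tokens)`.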
Part 2: Token Counting Before You Send (20 min)
Before sending any prompt, estimate its token count using tiktoken:
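import tiktoken

# gpt-4o-mini uses the o200k_base encoding (needs a recent tiktoken release);
# cl100k_base is the GPT-4 / GPT-3.5 encoding. Claude models use a different
# tokenizer, so treat tiktoken counts as a rough estimate for Anthropic.
enc = tiktoken.get_encoding("o200k_base")

# GPT-4o-mini input price in USD per token ($0.15 per million tokens at the
# time of writing; check the provider's pricing page before relying on it).
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000

prompts = [
    "What is a large language model?",
    "Explain prompt injection in three sentences.",
    "List all 10 OWASP LLM Top 10 entries from 2025.",
    "Write a 500-word essay on AI security."
]
for prompt in prompts:
    tokens = len(enc.encode(prompt))
    cost_estimate = tokens * PRICE_PER_INPUT_TOKEN
    print(f"Prompt: {prompt[:40]}...")
    print(f" Tokens: {tokens}, Est. input cost: ${cost_estimate:.7f}")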
Cost discipline rule: any loop or batch with more than 100 iterations against a paid API requires an explicit cost estimate before running. Use Ollama for high-volume lab work.
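The cost discipline rule can be turned into a pre-flight check that runs before any batch. This is a sketch under stated assumptions: `check_batch` is a hypothetical helper, and the price used in the example is a placeholder, not a real provider rate.

```python
def check_batch(token_counts, usd_per_million_tokens, budget_usd):
    """Estimate the input cost of a batch and refuse to proceed if it exceeds the budget."""
    cost = sum(token_counts) * usd_per_million_tokens / 1_000_000
    if cost > budget_usd:
        raise ValueError(
            f"Estimated input cost ${cost:.4f} exceeds budget ${budget_usd:.2f}"
        )
    return cost

# Example: 500 prompts of roughly 40 tokens each.
PLACEHOLDER_PRICE = 1.00  # USD per million input tokens; hypothetical value
cost = check_batch([40] * 500, PLACEHOLDER_PRICE, budget_usd=0.25)
print(f"Estimated input cost: ${cost:.4f}")
```

The check covers input tokens only. Output tokens are usually priced separately, so a real estimate would also bound `max_tokens` times the output price.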
Record:
- Which prompt had the most tokens?
- What is the estimated cost of sending all 4 prompts to GPT-4o-mini?
- At what token count does a single prompt start costing more than $0.01 (input only)?
Part 3: The 10 Structured Prompts (40 min)
Send exactly 10 prompts and record tokens + response for each. Choose from the categories below -- you must include at least one from each category:
Category A: Factual queries
- "What is the OWASP LLM Top 10?"
- "What does CVSS stand for and what does each component measure?"
- "Explain the difference between LLM01:2025 and LLM01:2023."
Category B: Role + constraint tests
- Send with system prompt: "You are a security researcher. Respond only in bullet points."
- Send the same user query with temperature=0.0 and temperature=1.0. Compare responses.
Category C: Multi-turn conversation
messages = [
    {"role": "system", "content": "You are a cybersecurity tutor."},
    {"role": "user", "content": "What is prompt injection?"},
    # Record the response, then add it and continue:
    {"role": "assistant", "content": "<RESPONSE FROM ABOVE>"},
    {"role": "user", "content": "Give me a one-sentence example of an indirect prompt injection."}
]
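The chat completions endpoint is stateless: the model sees only the messages included in each request, so a multi-turn conversation means resending the full history every time. The helper below is a minimal sketch of that bookkeeping; `add_turn` is a hypothetical name, not part of any SDK.

```python
def add_turn(messages, assistant_text, next_user_text):
    """Return a new message list with the assistant reply and the next user turn appended."""
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": next_user_text},
    ]

history = [
    {"role": "system", "content": "You are a cybersecurity tutor."},
    {"role": "user", "content": "What is prompt injection?"},
]
# In the lab, assistant_text would be response.choices[0].message.content
# from the first request; a stand-in string is used here.
history_2 = add_turn(
    history,
    "<first response text>",
    "Give me a one-sentence example of an indirect prompt injection.",
)
print(len(history), len(history_2))
```

Because the whole history is resent, input tokens grow with every turn, which is worth noting when you record token counts for this category.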
Category D: Edge tests
- Send max_tokens=5 and observe that the response is truncated
- Send temperature=2.0 (maximum) and observe incoherent output
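Temperature is commonly described as a divisor applied to the model's token scores (logits) before they are turned into probabilities. The toy sketch below illustrates that textbook formulation only. The logits are made-up values for four imaginary candidate tokens, and nothing here is a statement about how a particular provider implements sampling.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing each one by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability sits on the top-scoring token. At high temperature the distribution flattens, so unlikely tokens are sampled more often, which is one way to think about the incoherent output at temperature=2.0.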
Record for each of the 10 prompts:
- Input tokens
- Output tokens
- Model response (truncated to 100 chars)
- Any unexpected behavior
Total cost estimate: compute total tokens across all 10 prompts. Did you stay under $0.25?
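One way to organize the ten records and compute the total is a small record type plus a cost function. This is a sketch, not a required format: `PromptRecord`, `total_cost`, and the prices in the example are hypothetical, and real per-token prices should come from the provider's pricing page.

```python
from dataclasses import dataclass

@dataclass
class PromptRecord:
    category: str        # "A", "B", "C", or "D"
    input_tokens: int
    output_tokens: int
    response: str
    notes: str = ""      # any unexpected behavior

    def preview(self, limit=100):
        """Return the response truncated to `limit` characters for the lab log."""
        return self.response[:limit]

def total_cost(records, input_usd_per_million, output_usd_per_million):
    """Sum input and output token costs across all records, in USD."""
    in_tokens = sum(r.input_tokens for r in records)
    out_tokens = sum(r.output_tokens for r in records)
    return (in_tokens * input_usd_per_million
            + out_tokens * output_usd_per_million) / 1_000_000

records = [
    PromptRecord("A", 100, 50, "x" * 300),
    PromptRecord("D", 20, 5, "Prompt inj", notes="truncated by max_tokens=5"),
]
# Placeholder prices (USD per million tokens), not real rates.
print(f"${total_cost(records, 1.00, 2.00):.6f}")
```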
Part 4: System Prompt Exploration (30 min)
Observe how the system prompt changes model behavior:
system_prompts = [
"You are a helpful assistant.",
"You are a concise assistant. Respond in 20 words or fewer.",
"You are a skeptical assistant. Always question assumptions in the user's query.",
"You must respond in Spanish regardless of the language of the question."
]
user_message = "What is the main security risk of giving an AI too many permissions?"
for sp in system_prompts:
    # Same user message each time; only the system prompt changes.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": sp},
            {"role": "user", "content": user_message}
        ],
        max_tokens=150,
        temperature=0.0
    )
    print(f"System: {sp[:50]}...")
    print(f"Response: {response.choices[0].message.content[:200]}")
    print()
Record:
- Which system prompt produced the most constrained response?
- Could you override the Spanish instruction with a user message? (Try: "Please respond in English.")
- This question previews Module 2. What does the result suggest about the reliability of system prompt constraints?
Part 5: Ollama Local Model (20 min)
Run the same set of prompts against your local Ollama model and compare:
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API at localhost:11434.
# The client requires an api_key value, but Ollama ignores it.
local_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)
response = local_client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a cybersecurity assistant."},
        {"role": "user", "content": "What is prompt injection?"}
    ],
    max_tokens=200
)
print(response.choices[0].message.content)
Record:
- How does the local llama3.2:3b response compare to GPT-4o-mini for the same question?
- What is the response latency for Ollama vs. the OpenAI API?
- Ollama does not charge per token. What are the practical implications for lab work?
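For the latency comparison, a wall-clock timer around each call is usually enough. The wrapper below is a minimal sketch; `timed` is a hypothetical helper, and a single measurement is noisy, so it is worth repeating each call a few times.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in call so the sketch runs without a network connection.
result, seconds = timed(sum, [1, 2, 3])
print(result, f"{seconds:.6f}s")
```

In the lab, the same wrapper could be applied to both clients, for example `response, seconds = timed(local_client.chat.completions.create, model="llama3.2:3b", messages=messages, max_tokens=200)`, where `messages` is the list you are already sending.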
Optional: Karpathy micrograd Companion (30 min)
If you want to understand what the model is doing internally, clone and run micrograd:
git clone https://github.com/karpathy/micrograd.git
cd micrograd
pip install numpy matplotlib scikit-learn notebook  # the repo has no requirements.txt; these cover demo.ipynb
jupyter notebook demo.ipynb
Work through the demo notebook. Specifically observe:
- How a Value node stores data + gradient
- How backpropagation computes gradients
- How a simple MLP (multi-layer perceptron) is built from Value nodes
The connection to LLMs: the same backpropagation mechanism, applied to billions of parameters and trillions of tokens, is how models such as GPT-4 are trained. The key insight for security: training is a statistical optimization process, not rule-writing. The model has no separate mechanism that enforces an "instruction" -- it has only learned statistical associations.
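The idea of a node that stores data plus a gradient can be shown in a few lines. The class below is a simplified illustration written for this handout in the spirit of micrograd's Value node. It is not the library's code, it supports only addition and multiplication, and the name `Scalar` is hypothetical.

```python
class Scalar:
    """A number that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Scalar(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Scalar(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        """Apply the chain rule from this node back to every input."""
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

a, b, c = Scalar(2.0), Scalar(-3.0), Scalar(10.0)
loss = a * b + c
loss.backward()
print(a.grad, b.grad, c.grad)  # -3.0 2.0 1.0
```

Each gradient says how much `loss` changes per unit change in that input. Training repeatedly nudges parameters against their gradients, and nothing in that procedure encodes a rule.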
Record (optional):
- What does loss.backward() compute?
- Run the moons dataset demo. What accuracy does the MLP achieve?
Lab Report
Answer in writing (4-6 sentences total):
- In Module 1 we said "there is no structural separation between instruction and data" in an LLM. Based on your Part 4 experiment, what evidence supports this claim? What would you need to see to challenge it?
- You sent 10 prompts in Part 3. Pick the most expensive and the least expensive. What accounts for the token cost difference?
Grading (5 points)
| Item | Points |
|---|---|
| Parts 1-3: all 10 prompts sent; tokens recorded; cost under $0.25 | 2 |
| Part 4: system prompt experiment documented; question #3 answered | 1 |
| Part 5: Ollama comparison documented | 1 |
| Lab report: both questions answered with evidence from the lab | 1 |