Setup
The Brief
2 min
Work in pairs. You and your partner have been funded to study conversational clustering.
Design a complete evaluation plan: what you'll measure, how you'll run the evaluation, and what you'll report.
Be concrete and specific.
- Don't explain the three-part structure too much. Let them discover that they need to think about reporting before running experiments.
- Students can fill the markdown during the session or polish and push after class — the in-class thinking is what matters.
Phase 1
Design Your Evaluation
10 min
Evaluation Plan
Group: _______
1
Metrics
Primary metric
What it measures
What it does NOT measure
Why this metric and not alternatives
Secondary metric(s)
Success criterion
"We consider our method successful if..."
2
Evaluation Workflow
Ground truth source
Who provides labels
How many annotators
When annotators disagree
Known limitations
Data split strategy
Leakage prevention
Baselines (at least 3):
Baseline
What claim does beating it support?
1.
2.
3.
Statistical validation
"How will you know the difference isn't noise?"
3
Reporting
Your results table — what columns will you report?
| Method | __________ | __________ | __________ |
|---|---|---|---|
| Baseline 1 | |||
| Baseline 2 | |||
| Ours |
Failure cases to show
Ablations:
Remove...
Shows that...
What result means failure
What you will NOT claim
Push for specificity
- Metrics: "Not 'accuracy' — accuracy of what prediction, on what data, judged by whom?"
- Metrics: "You say NMI — NMI between what and what?"
- Workflow: "Which humans? How many? What exact question do you ask them?"
- Workflow: "Your success criterion says 'better than baselines' — by how much? How will you know it's not noise?"
- Reporting: "If your method loses on one metric but wins on another, what do you report?"
- Reporting: "What would a hostile reviewer want to see that you're not showing?"
- Common shortcuts: "We'll use human evaluation" (which humans?), "We'll compare against baselines" (which ones?), "We'll report accuracy" (one column isn't enough).
- If groups skip Part 3, nudge them: "If you don't decide what to report before you run experiments, you'll cherry-pick after."
- Phase 1 quality determines Phase 2 quality. The three-part structure helps — it's harder to be vague when you have to fill in specific columns for your results table.
Delivery via GitHub Classroom
- Accept the assignment on the Resources page (GitHub Classroom link). One repo per pair.
- Download the templates from Resources → Breakout Templates:
- README.md — team info, AI usage declaration, mini-presentation
- evaluation_plan.md — your evaluation plan
Fill them out and push to your repo.