Session 2 · Breakout 3

Evaluation Design

Design a complete evaluation plan: what you'll measure, how you'll run it, and what you'll report.

20 min total · Pairs of 2
Setup

The Brief

2 min
Work in pairs. You and your partner have been funded to study conversational clustering. Design a complete evaluation plan: what you'll measure, how you'll run the evaluation, and what you'll report. Be concrete and specific.
  • Don't over-explain the three-part structure. Let pairs discover for themselves that they need to think about reporting before running experiments.
  • Students can fill in the Markdown templates during the session or polish and push after class; the in-class thinking is what matters.
Phase 1

Design Your Evaluation

10 min
Evaluation Plan
Group: _______
Part 1

Metrics

Primary metric
What it measures
What it does NOT measure
Why this metric and not alternatives
Secondary metric(s)
Success criterion
"We consider our method successful if..."
Part 2

Evaluation Workflow

Ground truth source
Who provides labels
How many annotators
When annotators disagree
Known limitations
Data split strategy
Leakage prevention
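
One way to make the leakage-prevention answer concrete: if the unit being clustered is a turn or utterance, keep every turn from the same conversation on one side of the split. The following is a minimal sketch, not a prescribed method. It assumes each record is a dict with a `conversation_id` field; that field name, the `split_by_conversation` helper, and the 20% test fraction are illustrative and not part of the template.

```python
import random


def split_by_conversation(records, test_fraction=0.2, seed=0):
    """Split records so that no conversation appears in both train and test.

    `records` is assumed to be a list of dicts with a 'conversation_id' key
    (hypothetical schema, for illustration only).
    """
    # Split on conversation IDs, not on individual turns.
    conv_ids = sorted({r["conversation_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(conv_ids)

    n_test = max(1, round(len(conv_ids) * test_fraction))
    test_ids = set(conv_ids[:n_test])

    train = [r for r in records if r["conversation_id"] not in test_ids]
    test = [r for r in records if r["conversation_id"] in test_ids]
    return train, test


# Made-up data: 10 conversations with 3 turns each.
records = [{"conversation_id": c, "turn": t} for c in range(10) for t in range(3)]
train, test = split_by_conversation(records)
shared = {r["conversation_id"] for r in train} & {r["conversation_id"] for r in test}
print(len(train), len(test), "shared conversations:", len(shared))
```

Splitting on turns directly would let near-duplicate context from one conversation land on both sides, which tends to inflate scores.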

Baselines (at least 3):

#    Baseline      What claim does beating it support?
1.   __________    __________
2.   __________    __________
3.   __________    __________
Statistical validation
"How will you know the difference isn't noise?"
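
One possible answer to the noise question is a paired bootstrap over evaluation items. The sketch below assumes the metric decomposes into one score per conversation for each method; for a set-level metric such as NMI you would instead recompute the metric on each resample. The `paired_bootstrap` helper, the scores, and the 95% interval are illustrative assumptions, not part of the template.

```python
import random


def paired_bootstrap(scores_ours, scores_base, n_resamples=10_000, seed=0):
    """Return the mean paired difference (ours - baseline) and a 95% percentile CI."""
    if len(scores_ours) != len(scores_base):
        raise ValueError("both methods must be scored on the same items")

    rng = random.Random(seed)
    diffs = [o - b for o, b in zip(scores_ours, scores_base)]
    n = len(diffs)

    # Resample items with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, (lo, hi)


# Made-up per-conversation scores for two methods on the same 8 items.
ours = [0.71, 0.64, 0.80, 0.58, 0.77, 0.69, 0.73, 0.66]
base = [0.68, 0.66, 0.74, 0.55, 0.70, 0.70, 0.69, 0.61]
mean_diff, (lo, hi) = paired_bootstrap(ours, base, n_resamples=2_000)
print(f"mean difference {mean_diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval includes zero, the data do not support a claim that the method beats that baseline.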
Part 3

Reporting

Your results table — what columns will you report?

Method __________ __________ __________
Baseline 1
Baseline 2
Baseline 3
Ours
Failure cases to show

Ablations:

Remove...       Shows that...
__________      __________
What result would count as failure
What you will NOT claim

Push for specificity

  • Metrics: "Not 'accuracy' — accuracy of what prediction, on what data, judged by whom?"
  • Metrics: "You say NMI — NMI between what and what?"
  • Workflow: "Which humans? How many? What exact question do you ask them?"
  • Workflow: "Your success criterion says 'better than baselines' — by how much? How will you know it's not noise?"
  • Reporting: "If your method loses on one metric but wins on another, what do you report?"
  • Reporting: "What would a hostile reviewer want to see that you're not showing?"
  • Common shortcuts: "We'll use human evaluation" (which humans?), "We'll compare against baselines" (which ones?), "We'll report accuracy" (one column isn't enough).
  • If groups skip Part 3, nudge them: "If you don't decide what to report before you run experiments, you'll cherry-pick after."
  • Phase 1 quality determines Phase 2 quality. The three-part structure helps — it's harder to be vague when you have to fill in specific columns for your results table.
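
If a pair gets stuck on "NMI between what and what?", a small worked example can help. This is a sketch under stated assumptions: it compares a method's predicted cluster IDs against one annotator's topic labels for the same conversations, and it normalizes by the arithmetic mean of the two entropies, which is one common convention among several. The `nmi` helper and the labels are made up for illustration.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items.

    Normalized by the arithmetic mean of the two entropies (an assumed choice).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")

    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the joint and marginal counts.
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    denom = (entropy(count_a) + entropy(count_b)) / 2
    # Two trivial single-cluster labelings are treated as identical.
    return mi / denom if denom > 0 else 1.0


# Made-up example: six conversations, one labeling from the method, one from an annotator.
predicted_clusters = [0, 0, 1, 1, 2, 2]
annotator_topics = ["billing", "billing", "shipping", "shipping", "returns", "billing"]
print(f"NMI(predicted, annotator) = {nmi(predicted_clusters, annotator_topics):.3f}")
```

Writing the two arguments out forces the pair to state which labeling counts as ground truth and who produced it.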

Delivery via GitHub Classroom

  1. Accept the assignment on the Resources page (GitHub Classroom link). One repo per pair.
  2. Download the templates from Resources → Breakout Templates:
       • README.md: team info, AI usage declaration, mini-presentation
       • evaluation_plan.md: your evaluation plan
  3. Fill them out and push to your repo.