Breakout 3: Evaluation Design — Research Methodology

Setup

The Brief

2 min

Work in pairs. You and your partner have been funded to study conversational clustering. Design a complete evaluation plan: what you'll measure, how you'll run the evaluation, and what you'll report. Be concrete and specific.

Don't over-explain the three-part structure. Let students discover for themselves that they need to decide what to report before running experiments.
Students can fill in the markdown templates during the session or polish and push after class; the in-class thinking is what matters.

Phase 1

Design Your Evaluation

10 min

Evaluation Plan Group: _______

1

Metrics

Primary metric

What it measures

What it does NOT measure

Why this metric and not alternatives

Secondary metric(s)

Success criterion

"We consider our method successful if..."

2

Evaluation Workflow

Ground truth source

Who provides labels

How many annotators

When annotators disagree

Known limitations

Data split strategy

Leakage prevention

Baselines (at least 3):

1.

2.

3.

Statistical validation

"How will you know the difference isn't noise?"

3

Reporting

Your results table — what columns will you report?

Method	__________	__________	__________
Baseline 1
Baseline 2
Baseline 3
Ours

Failure cases to show

Ablations:

What result means failure

What you will NOT claim

Push for specificity

Metrics: "Not 'accuracy' — accuracy of what prediction, on what data, judged by whom?"
Metrics: "You say NMI — NMI between what and what?"
Workflow: "Which humans? How many? What exact question do you ask them?"
Workflow: "Your success criterion says 'better than baselines' — by how much? How will you know it's not noise?"
Reporting: "If your method loses on one metric but wins on another, what do you report?"
Reporting: "What would a hostile reviewer want to see that you're not showing?"
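If a group struggles with the "NMI between what and what?" prompt, a small worked example may help. The sketch below assumes NMI is computed between predicted cluster ids and one reference label per conversation (for instance, a majority vote over annotators). The function names and the toy labels are illustrative, and the arithmetic-mean normalization is one common choice among several.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(reference, predicted):
    """NMI with arithmetic-mean normalization (an assumption of this sketch)."""
    if len(reference) != len(predicted):
        raise ValueError("both labelings must cover the same conversations")
    denom = (entropy(reference) + entropy(predicted)) / 2
    if denom == 0:
        # Both labelings put everything in a single group.
        return 1.0
    return mutual_information(reference, predicted) / denom


# Hypothetical data: one annotator-derived topic per conversation,
# and the cluster id the method assigned to the same conversation.
reference = ["billing", "billing", "refund", "refund", "login", "login"]
predicted = [2, 2, 0, 0, 1, 1]
score = nmi(reference, predicted)
```

Cluster ids are arbitrary, so the score depends only on how the two groupings overlap, not on which id matches which topic. Swapping the reference (say, from majority-vote labels to a single annotator's labels) changes what the number means, which is why the plan should name both sides explicitly.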

Common shortcuts: "We'll use human evaluation" (which humans?), "We'll compare against baselines" (which ones?), "We'll report accuracy" (one column isn't enough).
If groups skip Part 3, nudge them: "If you don't decide what to report before you run experiments, you'll cherry-pick after."
Phase 1 quality determines Phase 2 quality. The three-part structure helps — it's harder to be vague when you have to fill in specific columns for your results table.
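For groups that get stuck on "how will you know it's not noise?", one possible answer is a paired bootstrap over the test conversations. The sketch below is a minimal illustration, not a prescribed method: it assumes each method produces a per-conversation score on the same test set, and the function name, parameters, and defaults are hypothetical.

```python
import random


def paired_bootstrap_ci(ours, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean paired difference.

    `ours` and `baseline` hold one score per test conversation, in the
    same order, so each difference compares the two methods on one item.
    """
    if len(ours) != len(baseline) or not ours:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [o - b for o, b in zip(ours, baseline)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample conversations with replacement, keeping pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-conversation scores for two methods.
ours = [0.9, 0.7, 0.8, 0.6, 0.9, 0.8, 0.7, 0.9]
baseline = [0.7, 0.7, 0.6, 0.6, 0.8, 0.7, 0.5, 0.8]
lower, upper = paired_bootstrap_ci(ours, baseline)
```

If the interval excludes zero, the observed gap is unlikely to be resampling noise alone. A success criterion written in these terms ("the 95% interval on the difference lies above zero") is far harder to reinterpret after the fact than "better than baselines". Metrics that only exist at corpus level, such as NMI, would need to be recomputed on each resample rather than averaged per item.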

Delivery via GitHub Classroom

Accept the assignment on the Resources page (GitHub Classroom link). One repo per pair.
Download the templates from Resources → Breakout Templates:

README.md — team info, AI usage declaration, mini-presentation
evaluation_plan.md — your evaluation plan

Fill them out and push to your repo.