Datasets & Resources

Public datasets of customer support conversations and knowledge bases. Useful for the running example (conversational clustering) and for student projects.

Tools & Links

GitHub Classroom

Accept the assignment to get your team repo. Work in pairs, with one repo per pair.

Accept Assignment →

Research Questions

View your suggestions and the analysis of 92 research questions from Session 1.

View results →

Readings

How to do Research at the MIT AI Lab

Classic guide by MIT AI Lab graduate students. Heuristics for methodology, topic selection, writing, and the emotional side of research.

Read PDF →

Breakout Templates

Download the markdown templates, fill them out with your partner, and push to your GitHub Classroom repo.

Scoping, SOTA, and Evaluation Design

Design a complete evaluation plan: metrics, workflow, and reporting.

Design and Eval Template →

(Noisy) Customer support conversation datasets - find your own

Customer Support on Twitter

Large collection of customer support conversations from major brands on Twitter. Multi-turn, real interactions.

Kaggle →

Ubuntu Dialogue Corpus

~1M multi-turn dialogues from Ubuntu IRC channels. Technical support conversations in natural, unscripted language.

arXiv →

CXM Arena

Customer experience management benchmark dataset from Sprinklr.

HuggingFace →

IBM Twitter Customer Care

Twitter conversations for conversational document prediction. Links support dialogs to resolution documents.

GitHub →

JDDC

JD.com customer service dialog corpus. Large-scale Chinese e-commerce support conversations.

arXiv →

Document-grounded & knowledge base datasets - find your own

Doc2Dial

Dialogs grounded in documents. Conversations where agents reference specific KB articles to resolve issues. 4,500 conversations across 450 documents from four domains (SSA, VA, DMV, StudentAid).

HuggingFace →

MultiDoc2Dial

Extension of Doc2Dial to multi-document grounding. Conversations that require reasoning across multiple KB articles.

ACL Anthology →

TechQA (IBM Research)

Technical support QA dataset. Questions derived from real IBM support forums, answered from Technotes.

ACL Anthology →

WixQA

Question-answer pairs from the Wix.com help center. Real user questions matched to KB article answers. 200 expert-written pairs across payments, bookings, domains, CMS, etc.

HuggingFace →

DoQA

Conversational QA over FAQ pages. Multi-turn questions grounded in real FAQ documents (cooking, travel, movies).

ACL Anthology →

Example Zendesk documentation - find your own

How real knowledge base systems work in production.

Metrics for Zendesk Knowledge

How Zendesk measures KB effectiveness: article views, votes, subscriptions, linked tickets.

Zendesk Docs →

Articles linked in tickets

How to track which KB articles agents link most often when resolving tickets.

Zendesk Docs →

Tickets solved by linked articles

Measuring how KB articles contribute to ticket resolution over time.

Zendesk Docs →