In noisy, multi-turn conversations, language models often lose track of context when the topic shifts rapidly. This tool disentangles a chat history into topic-wise threads using sentence embeddings and clustering, so that downstream LLM tasks such as summarization, retrieval, and reasoning can work on one topic at a time.
- Sentence Embedding: Captures semantic similarity using all-MiniLM-L6-v2
- Unsupervised Clustering: Detects coherent topics via Agglomerative Clustering
- Thread Segmentation: Groups utterances into clean topic-wise threads
- Optional Topic Labeling: Uses RAKE to name each thread
- LLM-Ready Context Windows: Improves downstream summarization and QA
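The embedding, clustering, and segmentation steps listed above could be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names, the use of unit-normalized embeddings with scikit-learn's default Ward linkage, and the `distance_threshold` value are all assumptions, and the threshold in particular would need tuning on real chats.

```python
from collections import defaultdict


def embed(messages, model_name="all-MiniLM-L6-v2"):
    """Encode each message as a unit-length sentence embedding."""
    # Imported lazily so the pure-Python helpers below work without the model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(messages, normalize_embeddings=True)


def cluster(embeddings, distance_threshold=1.2):
    """Label each embedding with a cluster id; the number of topics is not fixed.

    The threshold is a placeholder (assumption), not a recommended setting.
    Agglomerative clustering needs at least two messages.
    """
    from sklearn.cluster import AgglomerativeClustering

    clusterer = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide
        distance_threshold=distance_threshold,
    )
    return clusterer.fit_predict(embeddings)


def group_into_threads(messages, labels):
    """Group messages by cluster label, keeping the original order in each thread."""
    threads = defaultdict(list)
    for message, label in zip(messages, labels):
        threads[int(label)].append(message)
    return dict(threads)


def disentangle(chat, distance_threshold=1.2):
    """chat is a list of {"speaker": ..., "message": ...} dicts."""
    messages = [turn["message"] for turn in chat]
    labels = cluster(embed(messages), distance_threshold)
    return group_into_threads(messages, labels)
```

Leaving `n_clusters` unset means the number of threads is driven by the distance threshold rather than chosen up front, which suits chats where the topic count is unknown.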
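Clone the repository and install the dependencies:
git clone https://github.com/yourusername/chat-disentanglement.git
cd chat-disentanglement
pip install -r requirements.txt
Download the NLTK stopword list (in Python):
import nltk
nltk.download('stopwords')
Run the tool:
python main.py
Example input:
[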
{ "speaker": "user", "message": "Hey, can you help me with my WiFi?" },
{ "speaker": "agent", "message": "Sure, what seems to be the issue?" },
{ "speaker": "user", "message": "Also, I wanted to ask about roaming charges." },
{ "speaker": "agent", "message": "We have international plans for that." },
{ "speaker": "user", "message": "The WiFi cuts out randomly." },
{ "speaker": "agent", "message": "Let me look into that." }
]
Example output:
Thread 0: wifi issue
------------------------
• Hey, can you help me with my WiFi?
• The WiFi cuts out randomly.
• Let me look into that.
Thread 1: international plans
-------------------------------
• Also, I wanted to ask about roaming charges.
• We have international plans for that.
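A report in the style shown above can be produced from a mapping of thread ids to messages with a few lines of Python. The sketch below is illustrative only: `format_threads` and its arguments are hypothetical names, and the rule length is simply matched to the title, which may differ from what `main.py` prints.

```python
def format_threads(threads, labels=None):
    """Render {thread_id: [messages]} as a plain-text report.

    labels is an optional {thread_id: topic_label} mapping (hypothetical input).
    """
    labels = labels or {}
    blocks = []
    for thread_id in sorted(threads):
        title = f"Thread {thread_id}"
        if thread_id in labels:
            title += f": {labels[thread_id]}"
        lines = [title, "-" * len(title)]          # rule as wide as the title
        lines += [f"• {message}" for message in threads[thread_id]]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(format_threads(
        {0: ["The WiFi cuts out randomly.", "Let me look into that."]},
        {0: "wifi issue"},
    ))
```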
chat-disentanglement/
├── main.py
├── requirements.txt
├── README.md
├── sample_data/
│   └── sample_chat.json
└── disentangler/
    ├── embedder.py
    ├── clusterer.py
    ├── labeler.py
    └── utils.py
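To give a feel for what the optional labeling step does, here is a simplified, dependency-free take on the RAKE idea: split text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and pick the highest-scoring phrase. It is not the contents of `labeler.py` and not a full RAKE implementation; the tiny stopword set and the `rake_label` name are assumptions made for illustration (the tool itself downloads the NLTK stopword list).

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set (assumption), standing in for NLTK's list.
STOPWORDS = {
    "a", "an", "the", "and", "or", "i", "me", "my", "we", "you", "it",
    "is", "are", "to", "of", "for", "with", "in", "on", "out", "that",
    "this", "can", "have", "about", "what", "also",
}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_label(messages):
    """Return the top-scoring candidate phrase across a thread's messages."""
    phrases = [p for message in messages for p in candidate_phrases(message)]
    if not phrases:
        return ""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)   # co-occurrences, counting the word itself

    def score(phrase):
        return sum(degree[word] / freq[word] for word in phrase)

    return " ".join(max(phrases, key=score))
```

Because longer phrases accumulate higher scores, this scheme favors multi-word labels over single keywords.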
- Customer support: separate billing vs. tech queries
- Meeting analysis: extract task items from casual talk
- Multi-topic chat summarization for LLMs