Dialogue Memory Pipeline

dialogue_memory_pipeline turns a dialogue transcript into:

candidate topic-shift boundaries
per-utterance local discourse states
finalized dialogue segments
episode-style memory records for each segment

The package is built around an OpenAI-compatible JSON LLM client and is intended for dialogue understanding, segmentation, and memory construction workflows.

Project Status

This package is currently in active development and should be treated as an alpha release.

APIs, behavior, and output formats may still change.
The package is not yet production-ready.
At the moment, the only provider setup that is tested and supported is Bailian (DashScope) using its OpenAI-compatible endpoint.

Any published release should be treated as intended for early adopters, not as a stable, generally available version.

What It Does

Given a sequence of utterances, the pipeline runs four stages:

Candidate boundary generation: Scores every possible boundary between adjacent utterances and keeps only the highest-confidence candidates.
Local state extraction: Extracts a structured state for each utterance, including topic, intent, entities, cue markers, and obligation signals.
Transition judgment and segmentation: Walks candidate boundaries in order and decides whether each one starts a new segment.
Episodic memory building: Produces one memory record per final segment.

The top-level entrypoint is DialogueSegmentationPipeline.
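As a rough illustration of the segmentation stage, the sketch below turns a set of already-accepted boundaries into segment spans. It is not the package's implementation: the function name build_spans is hypothetical, the transition judge's decisions are taken as given input rather than produced by an LLM, and the rule of skipping any split that would leave a side shorter than the minimum length is an assumption about how a minimum segment length might be enforced.

```python
def build_spans(num_turns, accepted_boundaries, min_segment_len=2):
    """Return inclusive (start, end) turn-index spans.

    Hypothetical helper: each boundary is the index of the last turn
    on the left side of a split (compare boundary_after_turn).
    """
    if num_turns <= 0:
        return []
    spans = []
    start = 0
    for boundary in sorted(accepted_boundaries):
        left_len = boundary - start + 1
        right_len = num_turns - (boundary + 1)
        # Assumption: skip splits that would leave either side too short.
        if left_len < min_segment_len or right_len < min_segment_len:
            continue
        spans.append((start, boundary))
        start = boundary + 1
    # Whatever remains after the last accepted split is the final segment.
    spans.append((start, num_turns - 1))
    return spans


spans = build_spans(8, [0, 3, 6], min_segment_len=2)
print(spans)
```

With eight turns, the boundaries after turns 0 and 6 are skipped in this sketch because each would create a one-turn segment, so only the split after turn 3 survives.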

Features

End-to-end dialogue segmentation and memory generation
OpenAI-compatible client with optional custom base_url
Structured JSON outputs for every pipeline stage
Configurable candidate selection thresholding
Simple API for loading dialogues from JSON files

Installation

Install the package from source:

git clone https://github.com/Keyan0412/dialogue_memory_pipeline.git
cd dialogue_memory_pipeline
python -m venv .venv
source .venv/bin/activate
pip install .

Required dependencies:

openai
python-dotenv

Environment Variables

The pipeline currently expects a Bailian (DashScope) API key and endpoint through an OpenAI-compatible interface.

Supported environment variables:

OPENAI_API_KEY
OPENAI_BASE_URL (optional)
OPENAI_MODEL (optional when using from_env)

Example .env:

OPENAI_API_KEY=YOUR_API_KEY
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL=qwen3.5-plus

Other OpenAI-compatible providers are untested and not officially supported by this package; they may or may not work.

Quick Start

Use the pipeline with environment variables

from dialogue_memory_pipeline import (
    DialogueSegmentationPipeline,
    PipelineConfig,
    load_sample_dialogue,
)

dialogue = load_sample_dialogue()

config = PipelineConfig(
    top_p_candidates=0.30,
    min_candidate_score=0.20,
    right_preview_window=3,
    min_segment_len=2,
)

pipeline = DialogueSegmentationPipeline.from_env(config=config)
result = pipeline.run(dialogue)

print(result["segments"])
print(result["episodes"])

Use the pipeline with explicit credentials

from dialogue_memory_pipeline import DialogueSegmentationPipeline, load_sample_dialogue

dialogue = load_sample_dialogue()

pipeline = DialogueSegmentationPipeline.from_openai(
    model="qwen3.5-plus",
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

result = pipeline.run(dialogue)

Public API

The package exports:

from dialogue_memory_pipeline import (
    DialogueSegmentationPipeline,
    PipelineConfig,
    load_dialogue,
    load_sample_dialogue,
)

`DialogueSegmentationPipeline`

Constructors:

DialogueSegmentationPipeline(llm, config=None)
DialogueSegmentationPipeline.from_env(model=None, config=None)
DialogueSegmentationPipeline.from_openai(model, api_key=None, base_url=None, config=None)

Main method:

run(utterances) -> dict

`PipelineConfig`

Current configuration fields:

PipelineConfig(
    top_p_candidates=0.30,
    min_candidate_score=0.20,
    right_preview_window=3,
    min_segment_len=2,
)

Field meanings:

top_p_candidates: Fraction of available boundaries to keep after scoring. If a dialogue has N utterances, there are N - 1 possible boundaries. The retained candidate count is ceil((N - 1) * top_p_candidates), with a minimum of 1 whenever any boundary exists.
min_candidate_score: Minimum boundary score to keep before the top-p cap is applied.
right_preview_window: Number of right-side local states shown to the transition judge when evaluating a candidate split.
min_segment_len: Minimum allowed segment length used during segmentation and cleanup merge.
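The interaction between top_p_candidates and min_candidate_score can be sketched as follows. This is a minimal illustration of the arithmetic described above, not the package's code: the function name select_candidates and the (boundary_after_turn, score) pair format are hypothetical, and two points are assumptions: that the minimum of 1 applies to the cap rather than overriding the score threshold, and that survivors are returned in dialogue order.

```python
import math


def select_candidates(scores, top_p=0.30, min_score=0.20):
    """Filter scored boundaries.

    scores: list of (boundary_after_turn, score) pairs, one per pair of
    adjacent utterances, so len(scores) == N - 1.
    """
    if not scores:
        return []
    # Cap is ceil((N - 1) * top_p), with a minimum of 1.
    cap = max(1, math.ceil(len(scores) * top_p))
    # The score threshold is applied before the top-p cap.
    kept = [item for item in scores if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    # Assumption: return the survivors in dialogue order.
    return sorted(kept[:cap], key=lambda item: item[0])


# Six utterances give five possible boundaries; ceil(5 * 0.30) = 2.
scored = [(0, 0.10), (1, 0.90), (2, 0.50), (3, 0.25), (4, 0.05)]
print(select_candidates(scored))
```

Here three boundaries pass the 0.20 threshold, and the cap of 2 then keeps the two highest-scoring ones.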

Input Format

load_sample_dialogue() loads the packaged example dialogue shipped in the wheel.

load_dialogue(...) expects a JSON file containing a list of utterances:

[
  {
    "turn_id": 0,
    "speaker": "user",
    "text": "I need to reschedule my flight."
  },
  {
    "turn_id": 1,
    "speaker": "assistant",
    "text": "Sure, what is your booking number?"
  }
]

Each item must contain:

turn_id
speaker
text
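Before passing a file to load_dialogue(...), it can help to check that it matches the shape above. The following is a minimal sketch using only the standard library; check_utterances is a hypothetical helper and is not part of the package, which may validate input differently.

```python
import json

REQUIRED_KEYS = ("turn_id", "speaker", "text")


def check_utterances(raw_json):
    """Parse a JSON string and verify each item has the required keys."""
    utterances = json.loads(raw_json)
    if not isinstance(utterances, list):
        raise ValueError("expected a JSON list of utterances")
    for index, item in enumerate(utterances):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"item {index} is missing {missing}")
    return utterances


sample = '[{"turn_id": 0, "speaker": "user", "text": "I need to reschedule my flight."}]'
print(len(check_utterances(sample)))
```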

Output Structure

pipeline.run(...) returns a dictionary with these top-level keys:

candidates
local_states
decisions
segments
episodes
timing

`candidates`

Each candidate includes:

boundary_after_turn
score
left_turn_id
right_turn_id
left_text
right_text
reasoning
source

`local_states`

Each local state includes:

turn_id
speaker
summary_topic
intent
salient_entities
cue_markers
obligation.opens
obligation.resolves

`segments`

Each finalized segment includes:

segment_id
utterance_span
utterances
local_states
segment_state

segment_state contains:

stable_topic
discourse_goal
focus_topics
entity_core
open_obligations
dominant_relation

`episodes`

Each episodic memory record includes:

episode_id
utterance_span
utterances
retrieval_summary
key_entities
importance
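To show how the result dictionary might be consumed, here is a small sketch that formats one line per episode. The helper summarize_episodes and the mock values are hypothetical, and the sketch assumes utterance_span is a two-element [start, end] pair, which this document does not specify.

```python
def summarize_episodes(result):
    """Build one human-readable line per episodic memory record."""
    lines = []
    for episode in result.get("episodes", []):
        # Assumption: utterance_span holds [start_turn, end_turn].
        start, end = episode["utterance_span"]
        lines.append(
            f'{episode["episode_id"]} [{start}-{end}] {episode["retrieval_summary"]}'
        )
    return lines


# Hand-written mock that mirrors the documented keys; values are illustrative.
mock_result = {
    "episodes": [
        {
            "episode_id": "ep_0",
            "utterance_span": [0, 1],
            "utterances": [],
            "retrieval_summary": "User asks to reschedule a flight.",
            "key_entities": ["flight"],
            "importance": 0.8,
        }
    ]
}
for line in summarize_episodes(mock_result):
    print(line)
```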

Running the Included Scripts

Full pipeline demo

python scripts/demo.py

Optional flags:

python scripts/demo.py --output outputs/demo_output.json --model qwen3.5-plus

Candidate boundary generator test

python scripts/test_candidate_generator.py --top-p 0.30 --min-score 0.40

This writes a JSON report with:

all scored boundaries
filtered candidate boundaries
the effective candidate-generation config

Repository Layout

src/dialogue_memory_pipeline/: Importable package
src/dialogue_memory_pipeline/clients/: LLM client adapters
src/dialogue_memory_pipeline/core/: Shared dataclasses and schemas
src/dialogue_memory_pipeline/modules/: Pipeline modules for boundary generation, state extraction, transition judgment, and memory building
src/dialogue_memory_pipeline/data/: Packaged sample dialogue data included in the wheel
examples/: Small usage examples
scripts/: Runnable scripts for demos and module-level testing
tests/: Local test coverage for packaging and defensive parsing behavior
outputs/: Generated artifacts

Model and Provider Notes

The current tested provider is Bailian (DashScope) via its OpenAI-compatible endpoint.
OPENAI_BASE_URL should currently point to the Bailian compatible endpoint unless you are experimenting on your own.
from_env() uses OPENAI_MODEL when it is set; otherwise it falls back to qwen3.5-plus.
Support for additional providers is not finalized yet.
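The model fallback described in these notes can be pictured with the sketch below. It is not the package's code: resolve_model is a hypothetical helper, and the assumption that an explicitly passed model takes precedence over OPENAI_MODEL is inferred from the from_env(model=None, ...) signature rather than confirmed.

```python
import os

DEFAULT_MODEL = "qwen3.5-plus"


def resolve_model(explicit_model=None, env=None):
    """Pick a model name: explicit argument, then OPENAI_MODEL, then default."""
    env = os.environ if env is None else env
    # Assumption: an explicit argument wins over the environment variable.
    return explicit_model or env.get("OPENAI_MODEL") or DEFAULT_MODEL


print(resolve_model(env={}))
print(resolve_model(env={"OPENAI_MODEL": "my-model"}))
```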

Current Limitations

The project is still in alpha and may change in breaking ways.
Only Bailian (DashScope) API credentials and endpoint configuration are currently supported.
The implementation is fully LLM-driven; there is no local fallback model path in the package.
Transition-judge behavior is model-dependent because split decisions are generated by the LLM.
The implementation depends on an OpenAI-compatible JSON-capable model endpoint.

License

This project is released under the Apache License 2.0. See LICENSE.
