docguard 🛡️

Stop Documentation Drift. Start Semantic Linting.

docguard is an AI-powered semantic linter that bridges the gap between implementation and documentation. It doesn't just check if a docstring exists — it understands what your code does and ensures your comments tell the truth.

Features | Installation | Quick Start | Commands | Configuration | CI/CD | MCP | Architecture


💡 Why docguard?

Traditional linters check for formatting and existence. docguard checks for truth.

In fast-moving codebases, docstrings become "stale" in subtle ways:

  • A parameter is renamed in the code but stays unchanged in the comment.
  • A return type changes, but the documentation still points to the old model.
  • A new exception is raised, yet it's nowhere to be found in the Raises section.
  • An MCP tool description no longer reflects what the tool actually does — making it invisible to LLMs.

docguard acts as an automated Peer Reviewer that specialises in documentation quality.
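
To make the first kind of drift concrete, here is a small hypothetical function whose parameter was renamed while its docstring was not. The `documented_args` helper is an illustrative assumption, not part of docguard: it only compares names mechanically, whereas docguard relies on an LLM to judge whether the description still matches the behaviour.

```python
import inspect
import re


def get_user(user_id: int) -> dict:
    """Fetch a user record.

    Args:
        email: The address used to look up the user.

    Returns:
        A dict containing the user's profile data.
    """
    return {"id": user_id}


def documented_args(func) -> set[str]:
    """Collect parameter names listed under a Google-style ``Args:`` section."""
    doc = inspect.getdoc(func) or ""
    names, in_args = set(), False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
        elif in_args and stripped.endswith(":") and " " not in stripped:
            in_args = False  # reached the next section header, e.g. "Returns:"
        elif in_args:
            match = re.match(r"(\w+)\s*(\(.*\))?:", stripped)
            if match:
                names.add(match.group(1))
    return names


actual = set(inspect.signature(get_user).parameters)
stale = documented_args(get_user) - actual    # documented, but gone from the signature
missing = actual - documented_args(get_user)  # in the signature, but undocumented
print(stale, missing)
```

A name check like this catches renames only; a docstring that names the right parameters but describes the wrong behaviour passes it untouched, which is the gap semantic analysis is meant to cover.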


✨ Features

| Feature | Description |
| --- | --- |
| 🧠 Semantic Analysis | Deep logic verification using OpenAI, Gemini, or Ollama |
| ⚠️ Severity Levels | CRITICAL / ERROR / WARNING / INFO: choose which severities fail your build |
| 🔗 MCP-Aware Mode | Specialized analysis for @*.tool decorated functions |
| Streaming Interface | Results appear as soon as the first function is analysed, with no waiting for the full run |
| 🚀 Parallel File I/O | ThreadPoolExecutor scans large directories concurrently |
| 💾 Persistent Cache | Content-hashed results; unchanged functions are never re-analysed |
| 🪙 Token Budget | Smart truncation for large functions — keeps signatures and key statements |
| 🛠️ Interactive Fixes | Review and apply surgical docstring patches with docguard fix |
| 🔁 Smart Retry | Not satisfied? Provide a hint and ask the LLM again |
| 💡 Generate Docstrings | Auto-generate Google-style docstrings for undocumented functions |

📦 Installation

# Clone the repository
git clone https://github.com/skalogerakis/docguard.git
cd docguard

# Set up a virtual environment
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate

# Install runtime dependencies
pip install -e .

# Install development dependencies (tests + linting)
pip install -e ".[dev]"

LLM Provider Prerequisites

| Provider | Requirement |
| --- | --- |
| Ollama (default) | Install Ollama and run ollama pull llama3 |
| OpenAI | OPENAI_API_KEY environment variable |
| Gemini | GEMINI_API_KEY environment variable |

⚙️ Configuration

docguard reads configuration from a .env file (or any file passed via --config) and from environment variables.

# ── Provider API keys ──────────────────────────────────────────────
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIza...

# ── Model overrides (optional) ─────────────────────────────────────
OPENAI_MODEL=gpt-4o-mini
GEMINI_MODEL=gemini-2.5-flash
OLLAMA_MODEL=llama3

# ── Engine ─────────────────────────────────────────────────────────
MAX_CONCURRENCY=5          # Parallel LLM calls per run

# ── Analysis ───────────────────────────────────────────────────────
# Minimum severity that causes `check` to exit 1.
# Accepts: critical | error | warning | info
FAIL_ON=error

# Token budget per function (4 chars ≈ 1 token; default 8000 ≈ 2000 tokens)
MAX_FUNCTION_CHARS=8000

# ── Display ────────────────────────────────────────────────────────
# Max failures shown inline before the truncation notice
SHOW_MENU_LIMIT=50

# Cache directory (relative to cwd)
CACHE_DIR=.docguard_cache

# Extra directories to exclude (comma-separated)
EXTRA_EXCLUDE_DIRS=scripts,migrations

All settings can also be exported as shell environment variables — .env is a convenience.
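
The MAX_FUNCTION_CHARS budget is easiest to picture with a sketch. At roughly 4 characters per token, the default of 8000 characters works out to about 2000 tokens. The function below is a minimal illustration of the described behaviour (keep the signature and the return/raise lines, drop the rest of the body); the name `smart_truncate`, the marker text, and the line-prefix heuristic are assumptions, not docguard's actual implementation.

```python
def smart_truncate(source: str, max_chars: int = 8000) -> str:
    """Shrink a function's source to at most ``max_chars`` characters where possible."""
    if len(source) <= max_chars:
        return source  # small functions are sent unchanged

    lines = source.splitlines()
    marker = "    # ... body truncated ..."

    # Treat everything up to the first line ending in ":" as the signature
    # (a heuristic: a trailing comment on the signature line would defeat it).
    sig_end = next(
        (i for i, line in enumerate(lines) if line.rstrip().endswith(":")), 0
    )
    kept = lines[: sig_end + 1]
    used = sum(len(line) + 1 for line in kept)

    gap = False  # True while we are skipping body lines
    for line in lines[sig_end + 1 :]:
        is_key = line.lstrip().startswith(("return", "raise"))
        cost = len(line) + 1 + (len(marker) + 1 if gap else 0)
        if is_key and used + cost <= max_chars:
            if gap:
                kept.append(marker)  # show the reader that lines were dropped
            kept.append(line)
            used += cost
            gap = False
        else:
            gap = True
    if gap:
        kept.append(marker)
    return "\n".join(kept)
```

Keeping return and raise statements preserves the facts a docstring most often gets wrong (what comes back and what can fail) while spending very little of the budget.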

Severity Levels

docguard assigns a severity to every discrepancy so you can control exactly what breaks your build:

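| Severity | When it applies | Fails the build with the default --fail-on error? |
| --- | --- | --- |
| critical | Function does the opposite of what is documented | ✅ always (every threshold includes critical) |
| error | Wrong parameter names/types, wrong return type | ✅ yes (error is the default threshold); not with --fail-on critical |
| warning | Minor inaccuracy, incomplete description | ❌ unless --fail-on warning or --fail-on info |
| info | Cosmetic improvement only | ❌ unless --fail-on info |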
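
The threshold rule above reduces to a rank comparison. The sketch below borrows the name `severity_exceeds_threshold` from the design notes later in this README, but the body and the `Severity` enum values are assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity ranks, lowest to highest (numeric values are illustrative)."""

    INFO = 0
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def severity_exceeds_threshold(severity: str, fail_on: str = "error") -> bool:
    """Return True when a finding of ``severity`` should fail the run."""
    return Severity[severity.upper()] >= Severity[fail_on.upper()]


for level in ("critical", "error", "warning", "info"):
    print(level, severity_exceeds_threshold(level))
```

The comparison is inclusive: a finding whose severity equals the threshold fails the run, which is why error findings fail under the default setting.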

🚀 Quick Start

# Verify docstrings in a single file (uses local Ollama — free)
docguard check src/my_module.py

# Scan a directory with Gemini, fail only on CRITICAL issues
docguard check src/ --provider gemini --fail-on critical

# Generate docstrings for undocumented functions
docguard suggest src/ --provider ollama --model llama3

# Interactively review and apply fixes
docguard fix src/ --provider openai

# Analyse only MCP tool functions
docguard check src/mcp_server.py --mcp --provider gemini

📖 Commands

docguard check — Verify Docstring Accuracy

Analyses every documented function and flags any discrepancy between the docstring and the implementation.

Usage: docguard check [OPTIONS] PATH

  Verify that every docstring accurately reflects its function's implementation.

Arguments:
  PATH  Path to the Python file or directory to lint  [required]

Options:
  --provider TEXT       LLM provider: gemini, openai, ollama  [default: ollama]
  -m, --model TEXT      Model name, overrides provider default
  -f, --force           Ignore cache and re-analyse everything
  --show-all            Show all failures (no cap)
  -v, --verbose         Show per-function timing and token usage (+ cache path)
  --fail-on TEXT        Minimum severity to fail on: critical|error|warning|info  [default: error]
  --mcp                 MCP mode: only check @*.tool decorated functions
  -e, --exclude TEXT    Extra directory to exclude (repeatable)
  --config TEXT         Path to a custom .env file
  --help                Show this message and exit

Exit codes:

  • 0 — All docstrings pass, or every discrepancy found is below the --fail-on threshold.
  • 1 — One or more discrepancies at or above the --fail-on threshold were found, or a fatal error occurred.
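
Because the result is reported through the exit code, docguard can be driven from a script. The wrapper below is a small sketch that uses only the options listed above; the helper names `build_check_command` and `run_check` are hypothetical, and it assumes the docguard executable is on PATH.

```python
import subprocess


def build_check_command(path, provider="ollama", fail_on="error", mcp=False, exclude=()):
    """Assemble the argument list for ``docguard check``."""
    cmd = ["docguard", "check", path, "--provider", provider, "--fail-on", fail_on]
    if mcp:
        cmd.append("--mcp")
    for directory in exclude:
        cmd += ["--exclude", directory]  # --exclude is repeatable
    return cmd


def run_check(path, **options) -> bool:
    """Return True when docguard exits with code 0, False otherwise."""
    completed = subprocess.run(build_check_command(path, **options))
    return completed.returncode == 0
```

Note that exit code 1 covers both "discrepancies found" and "fatal error", so a caller that needs to tell them apart has to inspect the printed output as well.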

Examples:

# Quick check with local Ollama (no API key needed)
docguard check src/

# Fail only on critical or error severity issues
docguard check src/ --provider gemini --fail-on error

# Strict mode — fail on anything at all
docguard check src/ --provider openai --fail-on info

# MCP-only mode with verbose output
docguard check src/mcp_server.py --mcp --provider gemini --verbose

# Exclude generated code and migrations
docguard check src/ --exclude migrations --exclude generated --provider ollama

# Load a project-specific config file
docguard check src/ --config .env.production

# Force re-analysis, show all failures
docguard check src/ --provider openai --force --show-all

Sample output:

╭─────────────────────────────╮
│ DocGuard 🛡️  - Initializing │
╰─────────────────────────────╯
🤖 Provider: Ollama | Model: llama3
📂 Scanning: src/
💾 Cache: .docguard_cache
🧠 Analysis with ollama…

❌ src/api/users.py::get_user  [error]  (line 14)
   Reason: Docstring says "fetches by email" but implementation uses user_id
   Suggested Fix: Fetch a user record by their integer ID.

                  Args:
                      user_id: The unique integer identifier.

                  Returns:
                      A dict containing the user's profile data.

┌──────────────────┬───────┐
│ Total Functions  │ 42    │
│ Cached (Skipped) │ 38    │
│ New (Processed)  │ 4     │
│ Failed           │ 1     │
└──────────────────┴───────┘

❌ Found 1 docstring discrepancy.

docguard check --mcp — MCP Tool Analysis

When --mcp is passed, docguard switches to MCP mode:

  • Scans only functions decorated with @mcp.tool, @server.tool, @app.tool, @tool, or any @*.tool pattern.
  • Uses an MCP-specialised LLM prompt that evaluates:
    1. Discoverability — is the summary specific enough that an LLM knows when to call this tool?
    2. Parameter completeness — are all args described with type and purpose?
    3. Return value accuracy — does the doc describe the actual output format?
    4. Exception transparency — are failure modes documented?
  • Results include a discoverability_issues list with specific problems.
  • Uses a separate cache namespace (mcp) so results don't interfere with standard checks.

# Analyse MCP tool descriptions
docguard check src/mcp_server.py --mcp --provider gemini

# Strict MCP check — fail on any warning
docguard check src/mcp_server.py --mcp --fail-on warning --provider openai

Sample MCP output:

🔗 MCP mode: scanning only @*.tool functions
🧠 MCP tool analysis with gemini…

🔗 MCP src/mcp_server.py::search_documents  [warning]  (line 12)
   Reason: Summary is too vague — an LLM cannot distinguish this from other search tools
   Suggested Fix: Search the document store by keyword and return ranked results.
                  ...
   MCP Issues:
     • Summary does not clarify the scope of search (title-only vs full-text)
     • Missing Returns section — LLM cannot interpret the output format

docguard suggest — Generate Missing Docstrings

Finds all undocumented functions and generates accurate, Google-style docstrings.

Usage: docguard suggest [OPTIONS] PATH

  Generate Google-style docstring suggestions for undocumented functions.

Arguments:
  PATH  Path to the Python file or directory to scan  [required]

Options:
  --provider TEXT       LLM provider: gemini, openai, ollama  [default: ollama]
  -m, --model TEXT      Model name, overrides provider default
  -f, --force           Ignore cached suggestions and regenerate
  -v, --verbose         Show per-function timing and token usage
  -e, --exclude TEXT    Extra directory to exclude (repeatable)
  --config TEXT         Path to a custom .env file
  --help                Show this message and exit

Note: suggest only prints suggestions — it does not modify your files. Use docguard fix to apply them.

Examples:

# Generate docstrings for everything missing one
docguard suggest src/ --provider gemini

# Use a specific Ollama model
docguard suggest src/ --provider ollama --model gemma3:4b

# Regenerate even if suggestions are cached
docguard suggest src/ --provider openai --force

docguard fix — Interactively Apply Fixes

Runs the full analysis, then presents each failed function one by one for review and patching.

Usage: docguard fix [OPTIONS] PATH

  Interactively review and apply docstring fixes.

Arguments:
  PATH  Path to the Python file or directory to fix  [required]

Options:
  --provider TEXT       LLM provider: gemini, openai, ollama  [default: ollama]
  -m, --model TEXT      Model name, overrides provider default
  -f, --force           Ignore cache and re-analyse everything
  --auto-apply          Apply all fixes without interactive prompting
  -v, --verbose         Show per-function timing and token usage
  -e, --exclude TEXT    Extra directory to exclude (repeatable)
  --config TEXT         Path to a custom .env file
  --help                Show this message and exit

Interactive prompt options:

| Input | Action |
| --- | --- |
| y or Enter | Apply the suggested fix to the source file |
| n | Skip this function and leave it unchanged |
| r | Retry: ask the LLM again, optionally with a hint |

Examples:

# Interactive review
docguard fix src/ --provider gemini

# Fully automated — apply every fix (CI-friendly)
docguard fix src/ --provider openai --auto-apply

# Re-analyse everything and auto-apply
docguard fix src/ --force --auto-apply --provider gemini

The retry flow:

❌ src/api/users.py::get_user  [error]  (line 14)
   Issue:   Docstring says "fetches by email" but uses user_id
   Fix:     Fetch a user by their unique integer ID.

   [y]es apply / [n]o skip / [r]etry with LLM: r
   Hint for the LLM (press Enter to skip): mention the dict structure of the return value
   🔄 Retrying…
   New fix: Fetch a user record by their integer ID.

             Args:
                 user_id: The unique integer identifier.

             Returns:
                 A dict with keys 'id', 'name', and 'email'.

   [y]es apply / [n]o skip / [r]etry with LLM: y
   ✓ Applied.
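
Applying a fix amounts to swapping the old docstring for the new one while leaving every other line untouched. The sketch below shows one way to do that with the standard-library `ast` module; it is an illustration under stated assumptions, not the code in docguard's patcher, and the function name `replace_docstring` is hypothetical.

```python
import ast


def replace_docstring(source: str, func_name: str, new_doc: str) -> str:
    """Return ``source`` with the docstring of ``func_name`` replaced.

    Assumes ``new_doc`` is non-empty and contains no triple quotes or backslashes.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name != func_name:
            continue
        first = node.body[0]
        has_doc = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        if not has_doc:
            raise ValueError(f"{func_name} has no docstring to replace")

        # Re-indent the new docstring to match the function body.
        indent = " " * first.col_offset
        doc_lines = new_doc.strip().splitlines()
        block = [f'{indent}"""{doc_lines[0]}']
        block += [f"{indent}{line}" if line else "" for line in doc_lines[1:]]
        block.append(f'{indent}"""')

        # ast line numbers are 1-based and end_lineno is inclusive.
        lines = source.splitlines()
        lines[first.lineno - 1 : first.end_lineno] = block
        return "\n".join(lines) + "\n"
    raise ValueError(f"function {func_name!r} not found")
```

Editing by line range rather than regenerating the file from the syntax tree is what keeps the change surgical: comments, blank lines, and formatting outside the docstring survive byte for byte.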

docguard cache-clear — Clear the Result Cache

Wipes all cached analysis and generation results (all namespaces: check, suggest, mcp).

docguard cache-clear

Tip: Cache files are stored in .docguard_cache/ by default (configurable via CACHE_DIR). Add it to .gitignore.


docguard version — Print Version

docguard version
# DocGuard v0.1.0

🔗 MCP Mode — Built for AI Tools

When using MCP (Model Context Protocol) servers, the docstring is the interface — it's what the LLM reads to decide whether to invoke a tool and how.

A stale or inaccurate MCP tool description means:

  • The LLM cannot find the tool when it should be used.
  • The LLM invokes the tool with the wrong arguments.
  • The tool's output is misinterpreted.

Decorator Detection

docguard automatically detects all common MCP tool patterns:

@mcp.tool           # FastMCP
@mcp.tool()         # FastMCP with kwargs
@server.tool        # Custom server instances
@app.tool           # App-style servers
@tool               # Bare decorator

MCP Check Workflow

# Verify your MCP server's tool descriptions
docguard check src/mcp_server.py --mcp --provider gemini

# Auto-fix any drifted descriptions
docguard fix src/mcp_server.py --provider gemini --auto-apply

# Strict CI check — fail on any warning
docguard check src/mcp_server.py --mcp --fail-on warning --provider openai

What a Good MCP Docstring Looks Like

@mcp.tool
def search_documents(query: str, limit: int = 10) -> list[dict]:
    """
    Search the document store for entries matching the query string.

    Args:
        query: The search term to match against document titles and content.
        limit: Maximum number of results to return.

    Returns:
        A list of matching document dicts, each with 'id', 'title', and 'score'.

    Raises:
        ValueError: If query is empty or limit is less than 1.
    """
    ...

🔬 Running Tests

# Run all tests (unit + e2e, mocked — no API calls)
pytest tests/

# Verbose output
pytest tests/ -v

# Unit tests only
pytest tests/unit/

# E2E tests only
pytest tests/e2e/

# Run live tests against a real LLM (requires Ollama running)
pytest tests/ --run-live-llm --live-llm-provider ollama

# Live tests with a specific provider
pytest tests/ --run-live-llm --live-llm-provider gemini

The live test flag (--run-live-llm) is intentionally off by default to avoid costs and network dependencies.


🔧 CI/CD Integration

GitHub Actions

# .github/workflows/docguard.yml
name: DocGuard — Documentation Lint

on: [push, pull_request]

jobs:
  docguard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install DocGuard
        run: pip install -e .

      - name: Install and start Ollama
        run: |
          curl -fsSL https://ollama.com/install.sh | sh
          ollama serve &
          sleep 5
          ollama pull llama3

      - name: Run DocGuard check
        run: docguard check src/ --provider ollama --fail-on error

Using OpenAI or Gemini in CI

      - name: Run DocGuard check
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: docguard check src/ --provider openai --model gpt-4o-mini --fail-on error

MCP Server CI

      - name: Verify MCP tool descriptions
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        # Fail if any MCP tool has a WARNING or worse description
        run: docguard check src/mcp_server.py --mcp --provider gemini --fail-on warning

Pre-commit Hook

Add to .pre-commit-config.yaml:

repos:
  - repo: local
    hooks:
      - id: docguard
        name: DocGuard — Docstring Lint
        entry: docguard check
        args: [src/, --provider, ollama, --fail-on, error]
        language: system
        types: [python]
        pass_filenames: false

      - id: docguard-mcp
        name: DocGuard — MCP Tool Lint
        entry: docguard check
        args: [src/mcp_server.py, --mcp, --provider, ollama, --fail-on, warning]
        language: system
        types: [python]
        pass_filenames: false

🏗️ Architecture

src/docguard/
├── main.py               # Entrypoint — imports from cli package
├── constants.py          # Shared constants
├── cli/                  # CLI commands (Typer + Rich)
│   ├── __init__.py       # Typer app + command registration
│   ├── shared.py         # Console, _build_engine, print helpers
│   ├── check.py          # `check` command + _run_check async core
│   ├── suggest.py        # `suggest` command + _run_suggest async core
│   ├── fix.py            # `fix` command + interactive loop helpers
│   └── misc.py           # `cache-clear`, `version`
├── core/
│   ├── config.py         # Pydantic settings (env + .env)
│   ├── engine.py         # Async streaming orchestration engine
│   └── exceptions.py     # Custom exception hierarchy
├── analysis/
│   ├── cache.py          # Content-hashed DiskCache (schema-versioned)
│   └── parser.py         # Tree-sitter single-pass AST parser (parallel I/O)
├── llm/
│   ├── base.py           # BaseLLMProvider ABC
│   ├── factory.py        # Provider registry + instantiation
│   ├── protocol.py       # LLMProviderProtocol (structural typing)
│   ├── gemini.py         # Google Gemini provider
│   ├── openai.py         # OpenAI provider
│   ├── ollama.py         # Ollama (local) provider
│   └── prompts/          # Prompt builders (split by concern)
│       ├── __init__.py   # Re-exports for backward compat
│       ├── check.py      # Accuracy analysis prompts + severity guide
│       ├── suggest.py    # Docstring generation prompts
│       └── mcp.py        # MCP-specialised discoverability prompts
├── models/
│   ├── entity.py         # CodeEntity dataclass
│   └── schema.py         # Pydantic schemas: DocstringAnalysis (+ Severity),
│                         #   DocstringGeneration, MCPToolAnalysis
├── output/               # Output format adapters (stub — planned)
│   └── __init__.py       #   planned: terminal.py, sarif.py, html_report.py
└── utils/
    ├── patcher.py        # Surgical docstring insertion/replacement
    └── timer.py          # perf_counter context manager

Key Design Properties

| Property | How it's achieved |
| --- | --- |
| Single-pass parsing | _parse_all returns (documented, undocumented, mcp_tools) in one tree traversal |
| Parallel file I/O | ThreadPoolExecutor in _scan_parallel — tree-sitter C extension releases the GIL |
| Token budget | _smart_truncate keeps signature + return/raise lines; body middle is dropped |
| Cache safety | Keys hash only code + docstring; renames don't bust the cache; _SCHEMA_VERSION prefix auto-busts on schema changes |
| Severity filtering | severity_exceeds_threshold(severity, fail_on) — O(1) rank comparison |
| MCP namespace | mcp_check_stream uses namespace="mcp" so MCP and standard results never collide |
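
The cache-safety and namespace rows can be pictured as a key-derivation function. The following is a sketch only: the `cache_key` name, the version value, the separator, and the choice of SHA-256 are assumptions, and it assumes the hashed "code" is the function body, since a key that survives renames cannot include the function name.

```python
import hashlib

_SCHEMA_VERSION = "v1"  # hypothetical value; bump it when the result schema changes


def cache_key(
    body: str,
    docstring: str,
    namespace: str = "check",
    schema_version: str = _SCHEMA_VERSION,
) -> str:
    """Derive a cache key from content only, prefixed by schema version and namespace."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    payload = "\x00".join([body, docstring])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{schema_version}:{namespace}:{digest}"
```

Under this scheme, editing either the body or the docstring produces a new key, a schema bump invalidates every old entry at once, and the same function can hold separate results under the check and mcp namespaces.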

Adding a New LLM Provider

  1. Create src/docguard/llm/my_provider.py implementing BaseLLMProvider._call_raw.
  2. Register it in _PROVIDERS dict in factory.py.
  3. Add my_provider_model: str = "default-model" to DocGuardConfig.

The provider automatically inherits analyze, generate, retry, and mcp_analyze from BaseLLMProvider.
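
The three steps can be sketched end to end as follows. Everything here is a self-contained stand-in: the real `BaseLLMProvider`, the signature of `_call_raw`, whether it is synchronous, and the shape of `_PROVIDERS` may all differ, so treat the names and signatures as assumptions to check against the source before copying.

```python
from abc import ABC, abstractmethod


class BaseLLMProvider(ABC):
    """Stand-in for docguard's base class; the real interface may differ."""

    def __init__(self, model: str) -> None:
        self.model = model

    @abstractmethod
    def _call_raw(self, prompt: str) -> str:
        """Send ``prompt`` to the backend and return the raw text reply."""

    def analyze(self, code: str, docstring: str) -> str:
        # Shared logic lives in the base class, so subclasses only supply _call_raw.
        prompt = f"Does this docstring match the code?\n{docstring}\n{code}"
        return self._call_raw(prompt)


class MyProvider(BaseLLMProvider):
    """Step 1: implement the single backend-specific method."""

    def _call_raw(self, prompt: str) -> str:
        # A real provider would call its SDK or HTTP API here.
        return f"[{self.model}] received {len(prompt)} characters"


# Step 2: register the class under the name used with --provider.
_PROVIDERS: dict[str, type[BaseLLMProvider]] = {"my_provider": MyProvider}


def create_provider(name: str, model: str = "default-model") -> BaseLLMProvider:
    """Step 3 supplies the default model; here it is just a keyword default."""
    try:
        return _PROVIDERS[name](model)
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None
```

Keeping the backend call behind one abstract method is what lets a new provider pick up analysis, generation, retry, and MCP analysis without reimplementing any of them.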


🗺️ Roadmap

  • docguard check --since HEAD~1 — Git-aware incremental analysis (only changed functions)
  • SARIF output — --format sarif for native GitHub Actions inline PR annotations
  • HTML report — --report-html for human-readable summary with trend comparison
  • docguard init — Generate [tool.docguard] config block in pyproject.toml
  • Cross-tool disambiguation — Detect MCP tools with overlapping descriptions (cosine similarity)
  • PyPI release — pip install docguard from the public registry

🤝 Contributing

  1. Fork the repository.
  2. Branch (git checkout -b feature/my-improvement).
  3. Implement your change with tests.
  4. Verify: pytest tests/ && ruff check src/
  5. PR — include a clear description of the problem and solution.

📄 License

Distributed under the Apache License 2.0. See LICENSE for the full text.


Created by Stefanos Kalogerakis | GitHub
