neoza-labs/github-indexer

GitHub Indexer

A comprehensive GitHub indexing service for the RAG (Retrieval-Augmented Generation) system. Automatically indexes repositories, issues, pull requests, discussions, and documentation from GitHub organizations and makes them searchable via embeddings.

Features

  • Multi-source Indexing: Repositories, issues, PRs, discussions, and wiki pages
  • Incremental Updates: Only re-indexes changed content using content hashing
  • Rate Limiting: Built-in GitHub API rate limiting to stay within quotas
  • Webhook Support: Real-time updates via GitHub webhooks
  • Pattern Matching: Flexible include/exclude patterns for repositories
  • Smart Chunking: Automatically chunks large files for optimal embedding
  • PostgreSQL Tracking: Tracks indexed items to enable incremental syncing
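As an illustration of the Smart Chunking feature, here is a minimal sketch of line-aware chunking. The function name `chunk_by_lines`, the byte-based size limit, and the line-boundary strategy are assumptions made for this example; the README does not describe how the indexer actually splits files.

```rust
/// Split `text` into chunks of at most `max_bytes` bytes, breaking only on
/// line boundaries so that no line is cut in half. A single line longer than
/// `max_bytes` becomes its own (oversized) chunk.
/// Hypothetical helper: not taken from the indexer's source.
fn chunk_by_lines(text: &str, max_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    // `split_inclusive` keeps the trailing '\n' on each line, so joining the
    // chunks reproduces the input exactly.
    for line in text.split_inclusive('\n') {
        if !current.is_empty() && current.len() + line.len() > max_bytes {
            // The next line would overflow the chunk: close it and start anew.
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(line);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Calling `chunk_by_lines(source, 2000)` would yield pieces of roughly 2 KB each. A production chunker would more likely count tokens than bytes and may overlap neighbouring chunks.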

Architecture

┌─────────────────┐
│  GitHub API     │
└────────┬────────┘
         │
    ┌────▼────────────────┐
    │  GitHub Client      │
    │  (Rate Limited)     │
    └────────┬────────────┘
             │
    ┌────────▼────────────┐
    │  Indexers           │
    │  - Repository       │
    │  - Issues           │
    │  - Pull Requests    │
    │  - Discussions      │
    │  - Wiki             │
    └────────┬────────────┘
             │
    ┌────────▼────────────┐
    │  Sync Tracker       │
    │  (Change Detection) │
    └────────┬────────────┘
             │
    ┌────────▼────────────┐
    │  Embedding Service  │
    │  (Vector Store)     │
    └─────────────────────┘

Quick Start

Prerequisites

  • Rust 1.75 or later
  • PostgreSQL 14+
  • GitHub Personal Access Token with the repo, read:org, and read:discussion scopes
  • Embedding service running (see embedding-service/)

Installation

# From your clone of the repository
cd github-indexer

# Set up environment variables
export GITHUB_TOKEN="ghp_your_token_here"
export DATABASE_URL="postgresql://user:pass@localhost/github_indexer"
export GITHUB_WEBHOOK_SECRET="your_webhook_secret"

# Run database migrations
cargo install sqlx-cli
sqlx database create
sqlx migrate run

# Build the service
cargo build --release

Configuration

Edit config.yaml to configure:

  • GitHub organizations to index
  • Repository patterns (include/exclude)
  • File types to index
  • Issue/PR filtering rules
  • Embedding service URL
  • Sync schedules

See the inline comments in config.yaml for detailed options.
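To show how include/exclude repository patterns can combine, the following sketch implements a small `*` wildcard matcher and an include-then-exclude check. The pattern syntax, the function names `glob_match` and `should_index`, and the rule that exclusion wins are all assumptions; config.yaml is the authority on what the indexer actually supports.

```rust
/// Match `name` against `pattern`, where `*` matches any run of characters
/// (including none) and every other character must match literally.
/// Hypothetical helper for illustration only.
fn glob_match(pattern: &str, name: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let n: Vec<char> = name.chars().collect();
    let (mut pi, mut ni) = (0, 0);
    // Index of the last `*` seen and the name position it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;
    while ni < n.len() {
        if pi < p.len() && p[pi] == '*' {
            backtrack = Some((pi, ni));
            pi += 1;
        } else if pi < p.len() && p[pi] == n[ni] {
            pi += 1;
            ni += 1;
        } else if let Some((star_pi, star_ni)) = backtrack {
            // Let the last `*` absorb one more character and retry.
            pi = star_pi + 1;
            ni = star_ni + 1;
            backtrack = Some((star_pi, star_ni + 1));
        } else {
            return false;
        }
    }
    // Any trailing `*` in the pattern can match the empty string.
    p[pi..].iter().all(|&c| c == '*')
}

/// A repository is indexed when it matches at least one include pattern
/// and no exclude pattern (assumed precedence: exclude wins).
fn should_index(repo: &str, include: &[&str], exclude: &[&str]) -> bool {
    let included = include.iter().any(|p| glob_match(p, repo));
    let excluded = exclude.iter().any(|p| glob_match(p, repo));
    included && !excluded
}
```

With include `["apexops-*"]` and exclude `["*-archive"]`, `apexops-gateway` would be indexed and `apexops-archive` would be skipped.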

Usage

# Run a full synchronization
./target/release/github-indexer full-sync

# Run an incremental sync
./target/release/github-indexer incremental-sync

# Index a specific repository
./target/release/github-indexer index-repo neoza-labs/apexops-gateway

# Start the webhook server
./target/release/github-indexer webhook-server

# Show indexing statistics
./target/release/github-indexer stats

# Health check
./target/release/github-indexer health-check

Docker

# Build the image
docker build -t github-indexer:latest .

# Run full sync
docker run --rm \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  -e DATABASE_URL="$DATABASE_URL" \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  github-indexer:latest full-sync

# Run webhook server
docker run -d \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  -e DATABASE_URL="$DATABASE_URL" \
  -e GITHUB_WEBHOOK_SECRET="$GITHUB_WEBHOOK_SECRET" \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  -p 8085:8085 \
  github-indexer:latest webhook-server

Kubernetes Deployment

See apexops-infra/k8s/github-indexer/ for Helm charts.

# Install with Helm
helm install github-indexer ./charts/github-indexer \
  --set github.token=$GITHUB_TOKEN \
  --set database.url=$DATABASE_URL

GitHub Webhook Setup

  1. Go to your organization settings → Webhooks
  2. Add webhook:
    • Payload URL: https://your-domain.com/webhook
    • Content type: application/json
    • Secret: Your webhook secret
    • Events: Select:
      • Push
      • Issues
      • Pull requests
      • Discussions
  3. Save the webhook

The service will automatically re-index affected content when events are received.
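GitHub names the event type in the `X-GitHub-Event` request header, and the four events selected above arrive as `push`, `issues`, `pull_request`, and `discussion`. The sketch below shows one way those names could be routed to the indexers from the architecture diagram. The `Target` enum, the `route_event` function, and the mapping itself are assumptions, not the service's actual handler.

```rust
/// Indexer that should re-run for a given webhook event.
/// Hypothetical type; the variants mirror the indexers in the diagram.
#[derive(Debug, PartialEq)]
enum Target {
    Repository,
    Issues,
    PullRequests,
    Discussions,
}

/// Map the value of the `X-GitHub-Event` header to an indexer.
/// Events the service did not subscribe to (for example the `ping` GitHub
/// sends when a webhook is created) map to `None` and can be acknowledged
/// without doing any work.
fn route_event(event: &str) -> Option<Target> {
    match event {
        "push" => Some(Target::Repository),
        "issues" => Some(Target::Issues),
        "pull_request" => Some(Target::PullRequests),
        "discussion" => Some(Target::Discussions),
        _ => None,
    }
}
```

A handler would verify the request signature first and only then call `route_event` on the header value.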

What Gets Indexed

See INDEXING.md for detailed information about what content is indexed and how.

Rate Limiting

GitHub API limits:

  • Personal Access Token: 5,000 requests/hour
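  • GitHub App: at least 5,000 requests/hour per installation (GitHub raises this for larger installations and for GitHub Enterprise Cloud organizations)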

The indexer is configured to use 4,000 requests/hour by default, leaving a 1,000-request buffer under the 5,000 limit for other operations that share the token.
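A budget of 4,000 requests/hour works out to one request every 3600 s / 4000 = 0.9 s. The sketch below spaces requests evenly at that interval. The `Pacer` type and its methods are hypothetical and written for this example; the indexer's real limiter may use a different strategy, such as a token bucket that allows bursts.

```rust
use std::time::{Duration, Instant};

/// Evenly spaces requests so that at most `requests_per_hour` are sent.
/// Hypothetical type for illustration; not the indexer's implementation.
struct Pacer {
    interval: Duration,
    next_allowed: Instant,
}

impl Pacer {
    fn new(requests_per_hour: u32, now: Instant) -> Self {
        assert!(requests_per_hour > 0, "budget must be positive");
        Pacer {
            // 4,000 requests/hour gives an interval of 900 ms.
            interval: Duration::from_secs(3600) / requests_per_hour,
            next_allowed: now,
        }
    }

    /// Reserve the next request slot and return how long the caller should
    /// sleep before sending. Time is passed in so the logic stays testable.
    fn reserve(&mut self, now: Instant) -> Duration {
        let wait = self.next_allowed.saturating_duration_since(now);
        // The following slot opens one interval after this request goes out.
        self.next_allowed = now.max(self.next_allowed) + self.interval;
        wait
    }
}
```

A caller would sleep for the returned duration before each GitHub API request. Idle time is not banked, so a burst after a quiet period is still spaced at 900 ms.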

Incremental Sync Strategy

  1. Content Hashing: SHA-256 hash of content
  2. Change Detection: Compare hash with stored value
  3. Selective Re-indexing: Only changed items are re-embedded
  4. Timestamp Tracking: Track last sync per repository
  5. Stale Cleanup: Remove items not seen in recent syncs
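The five steps above can be sketched as a small in-memory tracker. Two substitutions are made to keep the example dependency-free, and both are assumptions: the standard library's `DefaultHasher` stands in for SHA-256, and a `HashMap` stands in for the PostgreSQL tracking table. The names `SyncTracker`, `observe`, and `remove_stale` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum Change {
    New,
    Modified,
    Unchanged,
}

/// Stand-in digest. The README specifies SHA-256; `DefaultHasher` is NOT a
/// cryptographic hash and is used here only to avoid external crates.
fn digest(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
struct SyncTracker {
    // item id -> (content digest, number of the sync run that last saw it)
    items: HashMap<String, (u64, u64)>,
}

impl SyncTracker {
    /// Steps 1-4: hash the content, compare with the stored digest, record
    /// the run number, and report whether the item needs re-embedding.
    fn observe(&mut self, id: &str, content: &str, run: u64) -> Change {
        let new_digest = digest(content);
        match self.items.insert(id.to_string(), (new_digest, run)) {
            None => Change::New,
            Some((old, _)) if old == new_digest => Change::Unchanged,
            Some(_) => Change::Modified,
        }
    }

    /// Step 5: drop items last seen before `oldest_run_to_keep` and return
    /// how many were removed.
    fn remove_stale(&mut self, oldest_run_to_keep: u64) -> usize {
        let before = self.items.len();
        self.items.retain(|_, entry| entry.1 >= oldest_run_to_keep);
        before - self.items.len()
    }
}
```

Only items reported as `New` or `Modified` would be sent to the embedding service; `Unchanged` items cost one hash comparison and no embedding call.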

Performance

  • Full Sync: ~1-2 hours for 100 repositories (depending on size)
  • Incremental Sync: ~5-10 minutes for typical updates
  • Webhook Updates: Near real-time (under 1 minute)

Monitoring

The service exposes metrics via logs:

items_indexed_total{type="file"} 1234
items_indexed_total{type="issue"} 567
indexing_duration_seconds{type="full"} 3600
api_calls_total{status="success"} 2345

Integrate with your observability stack (Prometheus, Grafana, etc.).
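Because the metrics are written to logs rather than served from an endpoint, a log shipper or sidecar has to parse them before forwarding. The sketch below parses lines in the shape shown above. It assumes at most one label per line and that the metric is the entire log line; the `Metric` struct and `parse_metric_line` function are hypothetical.

```rust
#[derive(Debug, PartialEq)]
struct Metric {
    name: String,
    /// At most one `key="value"` label, matching the sample lines above.
    label: Option<(String, String)>,
    value: f64,
}

/// Parse a line such as `items_indexed_total{type="file"} 1234`.
/// Returns `None` for anything that does not fit that shape.
fn parse_metric_line(line: &str) -> Option<Metric> {
    // The value is everything after the last space.
    let (series, value) = line.trim().rsplit_once(' ')?;
    let value: f64 = value.parse().ok()?;
    match series.split_once('{') {
        // No label block: the whole series is the metric name.
        None => Some(Metric { name: series.to_string(), label: None, value }),
        Some((name, rest)) => {
            let body = rest.strip_suffix('}')?;
            let (key, quoted) = body.split_once('=')?;
            let label_value = quoted.strip_prefix('"')?.strip_suffix('"')?;
            Some(Metric {
                name: name.to_string(),
                label: Some((key.to_string(), label_value.to_string())),
                value,
            })
        }
    }
}
```

The parsed values can then be re-exported in whatever format the observability stack expects.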

Security

  • Secrets Management: All secrets via environment variables
  • Non-root Container: Runs as user ID 1000
  • Webhook Verification: HMAC-SHA256 signature validation
  • Rate Limiting: Prevents API abuse
  • Network Policies: Egress-only to GitHub API

Troubleshooting

Rate Limit Exceeded

Error: GitHub API rate limit exceeded

Solution: Wait for the rate limit to reset, or reduce requests_per_hour in config.yaml.
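GitHub reports when the current window ends in the `x-ratelimit-reset` response header, as a UTC timestamp in epoch seconds. The hypothetical helper below turns that value into a wait duration; how the indexer itself handles the header is not documented here.

```rust
use std::time::Duration;

/// How long to wait until the rate-limit window resets.
/// `reset_epoch_secs` is the `x-ratelimit-reset` header value and
/// `now_epoch_secs` is the current time, both in UTC epoch seconds.
/// Returns zero if the reset time has already passed.
fn wait_until_reset(reset_epoch_secs: u64, now_epoch_secs: u64) -> Duration {
    Duration::from_secs(reset_epoch_secs.saturating_sub(now_epoch_secs))
}
```

The saturating subtraction avoids an underflow panic when clocks are skewed and the reset time appears to be in the past.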

Database Connection Failed

Error: Failed to connect to database

Solution: Verify DATABASE_URL and ensure PostgreSQL is running.

Embedding Service Unavailable

Error: Failed to embed document

Solution: Check that the embedding service is running and that service_url in config.yaml is correct.

Development

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run -- full-sync

# Format code
cargo fmt

# Lint
cargo clippy

Contributing

See CONTRIBUTING.md for guidelines.

License

See LICENSE.
