A comprehensive GitHub indexing service for the RAG (Retrieval-Augmented Generation) system. Automatically indexes repositories, issues, pull requests, discussions, and documentation from GitHub organizations and makes them searchable via embeddings.
- Multi-source Indexing: Repositories, issues, PRs, discussions, and wiki pages
- Incremental Updates: Only re-indexes changed content using content hashing
- Rate Limiting: Built-in GitHub API rate limiting to stay within quotas
- Webhook Support: Real-time updates via GitHub webhooks
- Pattern Matching: Flexible include/exclude patterns for repositories
- Smart Chunking: Automatically chunks large files for optimal embedding
- PostgreSQL Tracking: Tracks indexed items to enable incremental syncing
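To make the chunking feature concrete, here is a minimal sketch of one possible strategy: splitting a file at line boundaries so that each chunk stays under a size budget. The function name `chunk_by_lines` and the byte-based budget are assumptions for illustration; the indexer's actual chunking rules may differ.

```rust
/// Splits `text` into chunks of at most `max_bytes` bytes, breaking only at
/// line boundaries. A single line longer than `max_bytes` is kept whole and
/// becomes its own oversized chunk.
/// Hypothetical helper: not taken from the indexer's source.
pub fn chunk_by_lines(text: &str, max_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in text.split_inclusive('\n') {
        // Close the current chunk if adding this line would exceed the budget.
        if !current.is_empty() && current.len() + line.len() > max_bytes {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(line);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Concatenating the returned chunks reproduces the input exactly, so no content is lost between chunks.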
┌─────────────────┐
│ GitHub API │
└────────┬────────┘
│
┌────▼────────────────┐
│ GitHub Client │
│ (Rate Limited) │
└────────┬────────────┘
│
┌────────▼────────────┐
│ Indexers │
│ - Repository │
│ - Issues │
│ - Pull Requests │
│ - Discussions │
│ - Wiki │
└────────┬────────────┘
│
┌────────▼────────────┐
│ Sync Tracker │
│ (Change Detection) │
└────────┬────────────┘
│
┌────────▼────────────┐
│ Embedding Service │
│ (Vector Store) │
└─────────────────────┘
- Rust 1.75 or later
- PostgreSQL 14+
- GitHub Personal Access Token with repo, read:org, and read:discussion scopes
- Embedding service running (see embedding-service/)
# Clone the repository
cd github-indexer
# Set up environment variables
export GITHUB_TOKEN="ghp_your_token_here"
export DATABASE_URL="postgresql://user:pass@localhost/github_indexer"
export GITHUB_WEBHOOK_SECRET="your_webhook_secret"
# Run database migrations
cargo install sqlx-cli
sqlx database create
sqlx migrate run
# Build the service
cargo build --release
Edit config.yaml to configure:
- GitHub organizations to index
- Repository patterns (include/exclude)
- File types to index
- Issue/PR filtering rules
- Embedding service URL
- Sync schedules
See the inline comments in config.yaml for detailed options.
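As an illustration of how include/exclude repository patterns could be evaluated, the sketch below uses a small `*` wildcard matcher. The pattern syntax, the function names `glob_match` and `should_index`, and the rule that an exclude match overrides an include match are all assumptions; check config.yaml for the syntax the indexer actually accepts.

```rust
/// Returns true if `name` matches `pattern`, where `*` matches any run of
/// characters (including none). Hypothetical helper for illustration.
pub fn glob_match(pattern: &str, name: &str) -> bool {
    let (p, n) = (pattern.as_bytes(), name.as_bytes());
    let (mut pi, mut ni) = (0, 0);
    // Position to resume from when a literal match fails after a `*`:
    // (pattern index just past the `*`, name index the `*` currently covers up to).
    let mut backtrack: Option<(usize, usize)> = None;
    while ni < n.len() {
        if pi < p.len() && p[pi] == b'*' {
            backtrack = Some((pi + 1, ni));
            pi += 1;
        } else if pi < p.len() && p[pi] == n[ni] {
            pi += 1;
            ni += 1;
        } else if let Some((bp, bn)) = backtrack {
            // Let the last `*` absorb one more byte and retry.
            pi = bp;
            ni = bn + 1;
            backtrack = Some((bp, bn + 1));
        } else {
            return false;
        }
    }
    // Any remaining pattern must consist only of `*`.
    p[pi..].iter().all(|&b| b == b'*')
}

/// A repository is indexed if it matches at least one include pattern and no
/// exclude pattern (assumed precedence: excludes win).
pub fn should_index(repo: &str, include: &[&str], exclude: &[&str]) -> bool {
    let included = include.iter().any(|p| glob_match(p, repo));
    let excluded = exclude.iter().any(|p| glob_match(p, repo));
    included && !excluded
}
```

With `include = ["neoza-labs/*"]` and `exclude = ["*-archive"]`, `neoza-labs/apexops-gateway` would be indexed and a repository whose name ends in `-archive` would be skipped.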
# Run a full synchronization
./target/release/github-indexer full-sync
# Run an incremental sync
./target/release/github-indexer incremental-sync
# Index a specific repository
./target/release/github-indexer index-repo neoza-labs/apexops-gateway
# Start the webhook server
./target/release/github-indexer webhook-server
# Show indexing statistics
./target/release/github-indexer stats
# Health check
./target/release/github-indexer health-check
# Build the image
docker build -t github-indexer:latest .
# Run full sync
docker run --rm \
-e GITHUB_TOKEN="$GITHUB_TOKEN" \
-e DATABASE_URL="$DATABASE_URL" \
-v "$(pwd)/config.yaml:/app/config.yaml" \
github-indexer:latest full-sync
# Run webhook server
docker run -d \
-e GITHUB_TOKEN="$GITHUB_TOKEN" \
-e DATABASE_URL="$DATABASE_URL" \
-e GITHUB_WEBHOOK_SECRET="$GITHUB_WEBHOOK_SECRET" \
-v "$(pwd)/config.yaml:/app/config.yaml" \
-p 8085:8085 \
github-indexer:latest webhook-server
See apexops-infra/k8s/github-indexer/ for Helm charts.
# Install with Helm
helm install github-indexer ./charts/github-indexer \
--set github.token=$GITHUB_TOKEN \
--set database.url=$DATABASE_URL
- Go to your organization settings → Webhooks
- Add webhook:
  - Payload URL: https://your-domain.com/webhook
  - Content type: application/json
  - Secret: Your webhook secret
  - Events: Select:
    - Push
    - Issues
    - Pull requests
    - Discussions
- Save the webhook
The service will automatically re-index affected content when events are received.
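One way to picture the event handling is a small routing function that maps the value of GitHub's `X-GitHub-Event` header to the indexer that should run. The event names (`push`, `issues`, `pull_request`, `discussion`) are the ones GitHub sends for the events selected above; the `ReindexTarget` enum and `target_for_event` function are hypothetical names, not the service's actual types.

```rust
/// Which indexer a webhook delivery should trigger (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReindexTarget {
    Repository,
    Issues,
    PullRequests,
    Discussions,
}

/// Maps the `X-GitHub-Event` header value to a re-index target.
/// Returns `None` for events the indexer does not act on, such as the
/// `ping` event GitHub sends when a webhook is first created.
pub fn target_for_event(event: &str) -> Option<ReindexTarget> {
    match event {
        "push" => Some(ReindexTarget::Repository),
        "issues" => Some(ReindexTarget::Issues),
        "pull_request" => Some(ReindexTarget::PullRequests),
        "discussion" => Some(ReindexTarget::Discussions),
        _ => None,
    }
}
```

Returning `None` for unknown events lets the server acknowledge the delivery without doing any work, which keeps unexpected event types from causing errors.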
See INDEXING.md for detailed information about what content is indexed and how.
GitHub API limits:
- Personal Access Token: 5,000 requests/hour
- GitHub App: at least 5,000 requests/hour per installation (the limit can scale up with the number of repositories and users)
The indexer is configured to use 4,000 requests/hour by default, leaving a buffer of 1,000 requests/hour for other operations that use the same token.
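A budget of 4,000 requests/hour works out to one request every 3600 / 4000 = 0.9 seconds. The sketch below shows one simple way to enforce such a budget by spacing requests evenly. The `Pacer` type is a hypothetical illustration, not the client's actual rate limiter, which may use a different algorithm (for example, a token bucket that allows bursts).

```rust
use std::time::{Duration, Instant};

/// Spaces requests evenly so that at most `requests_per_hour` are sent.
/// Hypothetical type for illustration only.
pub struct Pacer {
    interval: Duration,
    next_allowed: Instant,
}

impl Pacer {
    pub fn new(requests_per_hour: u32, now: Instant) -> Self {
        assert!(requests_per_hour > 0, "requests_per_hour must be positive");
        Self {
            // 4,000 requests/hour gives an interval of 900 ms.
            interval: Duration::from_secs(3600) / requests_per_hour,
            next_allowed: now,
        }
    }

    /// Reserves the next request slot and returns how long the caller
    /// should wait (from `now`) before sending the request.
    pub fn reserve(&mut self, now: Instant) -> Duration {
        let start = self.next_allowed.max(now);
        self.next_allowed = start + self.interval;
        start.duration_since(now)
    }
}
```

A caller would sleep for the returned duration before each API call. Passing `now` in explicitly, rather than reading the clock inside `reserve`, keeps the logic easy to test.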
- Content Hashing: SHA-256 hash of content
- Change Detection: Compare hash with stored value
- Selective Re-indexing: Only changed items are re-embedded
- Timestamp Tracking: Track last sync per repository
- Stale Cleanup: Remove items not seen in recent syncs
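The steps above can be sketched as a small in-memory tracker. Two substitutions are made to keep the example dependency-free, and both are assumptions: the standard library's `DefaultHasher` stands in for SHA-256 (it is not cryptographic and should not be used for this in production), and a `HashMap` stands in for the PostgreSQL tracking table. The `SyncTracker` API shown here is hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for a SHA-256 digest; NOT cryptographic.
fn content_fingerprint(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory tracker: item id -> (fingerprint, last sync run seen).
#[derive(Default)]
pub struct SyncTracker {
    seen: HashMap<String, (u64, u64)>,
}

impl SyncTracker {
    /// Records that `id` was seen in sync run `run` and returns true if the
    /// item is new or its content changed, meaning it must be re-embedded.
    pub fn needs_reindex(&mut self, id: &str, content: &str, run: u64) -> bool {
        let fingerprint = content_fingerprint(content);
        match self.seen.insert(id.to_string(), (fingerprint, run)) {
            Some((old_fingerprint, _)) => old_fingerprint != fingerprint,
            None => true,
        }
    }

    /// Removes items last seen before `oldest_run` and returns their ids
    /// (sorted), so the caller can delete the matching embeddings.
    pub fn remove_stale(&mut self, oldest_run: u64) -> Vec<String> {
        let mut stale: Vec<String> = self
            .seen
            .iter()
            .filter(|(_, v)| v.1 < oldest_run)
            .map(|(id, _)| id.clone())
            .collect();
        stale.sort();
        for id in &stale {
            self.seen.remove(id);
        }
        stale
    }
}
```

Unchanged items still have their "last seen" run updated, which is what lets stale cleanup distinguish content that was deleted upstream from content that simply did not change.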
- Full Sync: ~1-2 hours for 100 repositories (depending on size)
- Incremental Sync: ~5-10 minutes for typical updates
- Webhook Updates: Near real-time (< 1 minute)
The service exposes metrics via logs:
items_indexed_total{type="file"} 1234
items_indexed_total{type="issue"} 567
indexing_duration_seconds{type="full"} 3600
api_calls_total{status="success"} 2345
Integrate with your observability stack (Prometheus, Grafana, etc.).
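Because the metrics are emitted as log lines, forwarding them to a metrics backend usually means parsing them first. The sketch below parses lines in the format shown above, assuming exactly one label per metric as in the examples; `MetricSample` and `parse_metric_line` are hypothetical names, and lines with multiple labels would need a fuller parser.

```rust
/// One parsed metric line (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct MetricSample {
    pub name: String,
    pub label_key: String,
    pub label_value: String,
    pub value: f64,
}

/// Parses a line such as `items_indexed_total{type="file"} 1234`.
/// Returns `None` if the line does not match the single-label format.
pub fn parse_metric_line(line: &str) -> Option<MetricSample> {
    // Split "<series> <value>" at the last space.
    let (series, value) = line.trim().rsplit_once(' ')?;
    // Split "<name>{<labels>}".
    let (name, labels) = series.split_once('{')?;
    let labels = labels.strip_suffix('}')?;
    // Split `key="value"` and remove the surrounding quotes.
    let (key, quoted) = labels.split_once('=')?;
    let label_value = quoted.strip_prefix('"')?.strip_suffix('"')?;
    Some(MetricSample {
        name: name.to_string(),
        label_key: key.to_string(),
        label_value: label_value.to_string(),
        value: value.parse().ok()?,
    })
}
```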
- Secrets Management: All secrets via environment variables
- Non-root Container: Runs as user ID 1000
- Webhook Verification: HMAC-SHA256 signature validation
- Rate Limiting: Prevents API abuse
- Network Policies: Egress-only to GitHub API
Error: GitHub API rate limit exceeded
Solution: Wait for the rate limit to reset, or reduce requests_per_hour in config.yaml.
Error: Failed to connect to database
Solution: Verify DATABASE_URL and ensure PostgreSQL is running.
Error: Failed to embed document
Solution: Check that the embedding service is running and that service_url is correct.
# Run tests
cargo test
# Run with debug logging
RUST_LOG=debug cargo run -- full-sync
# Format code
cargo fmt
# Lint
cargo clippy
See CONTRIBUTING.md for guidelines.
See LICENSE.