AI-powered research assistant backend service for academic search, PDF processing, and intelligent analysis
The ScholarAI FastAPI Backend serves as the AI processing engine for the ScholarAI research platform. It orchestrates academic searches across multiple sources, processes PDFs, performs intelligent analysis, and communicates with the Spring Boot core service through RabbitMQ messaging.
- Multi-Source Academic Search: ArXiv, PubMed, Semantic Scholar, OpenAlex, CrossRef, and more
- PDF Processing: Upload, extract text, and analyze research papers
- AI-Powered Analysis: Gap analysis, summarization, and research insights
- Document Storage: Backblaze B2 cloud storage integration
- Message Queue Processing: Asynchronous task handling via RabbitMQ
- Research Orchestration: Intelligent coordination of academic workflows
- Data Aggregation: Comprehensive paper metadata collection
- Robust Error Handling: Retry logic and failover mechanisms
- RESTful API: OpenAPI/Swagger documentation and testing
- High Performance: Async/await architecture for concurrent processing
- Python 3.10+ (recommended 3.11+)
- Poetry (Python dependency manager)
- Docker & Docker Compose (for infrastructure services)
- Git
# Clone the repository
git clone https://github.com/Tasriad/ScholarAI-Backend-FastAPI
cd ScholarAI-Backend-FastAPI
# Install dependencies with Poetry
poetry install
# Set up environment variables
cp env.example .env
# Edit .env with your configuration (see Environment Configuration section)

Create a .env file in the project root with the following variables:
# RabbitMQ Configuration
RABBITMQ_USER=scholar_user
RABBITMQ_PASSWORD=your_secure_password
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5672
# Academic API Configuration
CORE_API_KEY=your_core_api_key_here
UNPAYWALL_EMAIL=your.email@example.com
# PDF Storage (Backblaze B2)
B2_KEY_ID=your_b2_key_id
B2_APPLICATION_KEY=your_b2_application_key
B2_BUCKET_NAME=scholar-ai-papers
# AI Services (Optional)
GOOGLE_API_KEY=your_google_generative_ai_key
# Application Configuration
LOG_LEVEL=info
ENV=dev
ALLOWED_ORIGINS=http://localhost:3000,http://127.0.0.1:3000

# Build and start the FastAPI service
./scripts/docker.sh build
./scripts/docker.sh start
# This starts:
# - FastAPI Service: localhost:8000
# - RabbitMQ Consumer: Background task processing
# View logs
./scripts/docker.sh logs
# Stop service
./scripts/docker.sh stop
# Clean up
./scripts/docker.sh clean

# Build Docker image
docker compose -f docker/docker-compose.yml build --no-cache
# Start service
docker compose -f docker/docker-compose.yml up -d
# View logs
docker compose -f docker/docker-compose.yml logs -f
# Stop service
docker compose -f docker/docker-compose.yml down

Ensure RabbitMQ is running (from Spring Boot backend setup):
# Start RabbitMQ from Spring Boot project
cd ../ScholarAI-Backend-Springboot
./scripts/docker.sh start-svc
# Verify RabbitMQ is running
curl http://localhost:15672 # Management UI

# Activate Poetry environment and run
poetry run uvicorn app.main:app --reload --port 8000
# Or use Poetry shell
poetry shell
uvicorn app.main:app --reload --port 8000
# Application will be available at:
# - API: http://localhost:8000
# - Docs: http://localhost:8000/docs
# - Health: http://localhost:8000/health

# Code formatting
poetry run black .
# Import sorting
poetry run isort .
# Linting
poetry run flake8
# Run tests
poetry run pytest
# Run specific test file
poetry run pytest tests/test_websearch.py
# Run tests with coverage
poetry run pytest --cov=app tests/
# Format and lint all at once
poetry run black . && poetry run isort . && poetry run flake8

ScholarAI-Backend-FastAPI/
├── app/
│   ├── main.py  # FastAPI application entry point
│   ├── api/  # API routes and endpoints
│   │   ├── api_v1/  # API version 1
│   │   │   └── endpoints/  # Endpoint implementations
│   │   │       ├── admin.py  # Admin endpoints
│   │   │       ├── gap_analysis.py  # Gap analysis endpoints
│   │   │       ├── papercall.py  # Paper call endpoints
│   │   │       └── qa.py  # Q&A endpoints
│   │   └── router.py  # Main API router
│   ├── core/  # Core configuration
│   │   ├── config.py  # Application settings
│   │   ├── logging_config.py  # Logging configuration
│   │   └── services.py  # Service initialization
│   ├── models/  # Pydantic models
│   │   └── message.py  # Message models
│   └── services/  # Business logic services
│       ├── academic_apis/  # Academic search clients
│       │   ├── clients/  # API client implementations
│       │   │   ├── arxiv_client.py  # ArXiv API client
│       │   │   ├── pubmed_client.py  # PubMed API client
│       │   │   ├── semantic_scholar_client.py  # Semantic Scholar client
│       │   │   ├── openalex_client.py  # OpenAlex API client
│       │   │   ├── crossref_client.py  # CrossRef API client
│       │   │   └── ...  # More API clients
│       │   ├── common/  # Shared utilities
│       │   │   ├── base_client.py  # Base client class
│       │   │   ├── exceptions.py  # Custom exceptions
│       │   │   ├── normalizers.py  # Data normalization
│       │   │   └── utils.py  # Utility functions
│       │   └── parsers/  # Data parsers
│       │       ├── feed_parser.py  # RSS/Atom feed parsing
│       │       ├── json_parser.py  # JSON response parsing
│       │       └── xml_parser.py  # XML response parsing
│       ├── extractor/  # Text extraction services
│       │   └── text_extractor.py  # PDF text extraction
│       ├── gap_analyzer/  # Research gap analysis
│       │   ├── orchestrator.py  # Gap analysis orchestration
│       │   ├── paper_analyzer.py  # Paper analysis engine
│       │   ├── search_agent.py  # Intelligent search agent
│       │   └── background_processor.py  # Background task processor
│       ├── messaging/  # RabbitMQ message handling
│       │   ├── consumer.py  # Message consumer
│       │   ├── connection.py  # RabbitMQ connection management
│       │   └── handlers/  # Message handlers
│       │       ├── extraction_handler.py  # PDF extraction handler
│       │       ├── summarization_handler.py  # Summarization handler
│       │       └── structuring_handler.py  # Text structuring handler
│       ├── papercall/  # Academic conference data
│       │   ├── fetchers/  # Conference data fetchers
│       │   └── papercall_service.py  # Paper call service
│       ├── qa/  # Question & Answer service
│       │   └── paper_qa_service.py  # Paper Q&A implementation
│       ├── summarizer/  # AI summarization
│       │   └── summarizer_agent.py  # Intelligent summarization
│       ├── websearch/  # Academic search orchestration
│       │   ├── search_orchestrator.py  # Search coordination
│       │   ├── search_filters/  # Source-specific filters
│       │   ├── deduplication.py  # Duplicate detection
│       │   ├── metadata_enrichment.py  # Metadata enhancement
│       │   └── filter_service.py  # Search filtering
│       ├── b2_storage.py  # Backblaze B2 integration
│       ├── pdf_processor.py  # PDF processing pipeline
│       └── rabbitmq_consumer.py  # RabbitMQ message consumer
├── tests/  # Test files
│   ├── search_filters/  # Search filter tests
│   ├── integration_test.py  # Integration tests
│   ├── test_websearch.py  # Web search tests
│   └── ...  # More test files
├── docs/  # Documentation
│   ├── 2_Setup_Instructions.md  # Setup guide
│   ├── 4_Communication_Architecture.md  # Architecture docs
│   └── ...  # More documentation
├── docker/  # Docker configuration
├── scripts/  # Deployment scripts
├── pyproject.toml  # Poetry project configuration
├── poetry.lock  # Dependency lock file
└── env.example  # Environment template

| Source | Description | Coverage |
|---|---|---|
| ArXiv | Physics, mathematics, computer science preprints | 2M+ papers |
| PubMed | Biomedical and life sciences literature | 35M+ citations |
| Semantic Scholar | Computer science and biomedical papers | 200M+ papers |
| OpenAlex | Comprehensive academic literature | 250M+ works |
| CrossRef | DOI registration and metadata | 140M+ records |
| BioRxiv | Biology preprint server | 150K+ preprints |
| Europe PMC | Life sciences literature | 40M+ records |
| DOAJ | Open access journals | 18K+ journals |
| DBLP | Computer science bibliography | 6M+ publications |
| Unpaywall | Open access status detection | Global coverage |
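
Several of the sources above index overlapping literature, so the same paper can come back from more than one of them. The sketch below shows one possible way to collapse such duplicates. It is not the project's `deduplication.py`: the dictionary field names (`source`, `doi`, `title`), the helper names, and the placeholder DOIs are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional

PaperDict = Dict[str, Optional[str]]


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def paper_key(paper: PaperDict) -> str:
    """Prefer the DOI as identity; fall back to the normalized title."""
    doi = (paper.get("doi") or "").strip().lower()
    if doi:
        return f"doi:{doi}"
    return f"title:{normalize_title(paper.get('title') or '')}"


def deduplicate(papers: List[PaperDict]) -> List[PaperDict]:
    """Keep the first occurrence of each paper, preserving input order."""
    seen = set()
    unique = []
    for paper in papers:
        key = paper_key(paper)
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


# Hypothetical results; the DOIs are placeholders, not real records.
results = [
    {"source": "arxiv", "doi": None, "title": "Deep Learning for Healthcare"},
    {"source": "crossref", "doi": "10.1000/XYZ123", "title": "A Survey of Clinical NLP"},
    {"source": "pubmed", "doi": "10.1000/xyz123", "title": "A survey of clinical NLP."},
    {"source": "semantic_scholar", "doi": None, "title": "Deep learning for healthcare!"},
]
unique_results = deduplicate(results)
print([p["source"] for p in unique_results])
```

One limitation of this simple keying: a record that carries a DOI and another record of the same paper without one will not be matched, so a fuller approach would probably compare titles as a second pass.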
# Multi-source search example
search_request = {
    "query": "machine learning in healthcare",
    "sources": ["arxiv", "pubmed", "semantic_scholar"],
    "max_results": 50,
    "filters": {
        "publication_year": {"min": 2020, "max": 2024},
        "open_access": True
    }
}
# Advanced filtering and deduplication
# Intelligent metadata enrichment
# Real-time result streaming
# Fallback and retry mechanisms

Each academic source has a dedicated client implementing the BaseSearchClient interface:
from typing import List

# Paper is the project's paper model, defined elsewhere in the codebase
class BaseSearchClient:
    async def search(self, query: str, **kwargs) -> List[Paper]:
        """Search for papers using the specific API"""

    async def get_paper_details(self, identifier: str) -> Paper:
        """Get detailed information for a specific paper"""

    async def health_check(self) -> bool:
        """Check if the API is available"""

The FastAPI backend processes messages from the Spring Boot service:
| Queue | Purpose | Handler |
|---|---|---|
| websearch.request | Academic paper search requests | WebSearchHandler |
| extraction.request | PDF text extraction requests | ExtractionHandler |
| summarization.request | Paper summarization requests | SummarizationHandler |
| structuring.request | Text structuring requests | StructuringHandler |
| gap.analysis.request | Research gap analysis requests | GapAnalysisHandler |
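
The queue-to-handler mapping in the table above could be wired up with a small registry keyed by queue name. The following is a minimal sketch of that idea, not the service's actual consumer: `HandlerRegistry`, `dispatch`, the payload shape, and the `paper_id` field are hypothetical, and no RabbitMQ connection is involved.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Handler = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]


class HandlerRegistry:
    """Maps queue names to async handler functions (hypothetical helper)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register_handler(self, queue: str) -> Callable[[Handler], Handler]:
        """Decorator that records `func` as the handler for `queue`."""
        def decorator(func: Handler) -> Handler:
            self._handlers[queue] = func
            return func
        return decorator

    async def dispatch(self, queue: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Look up the handler for `queue` and await it with the payload."""
        handler = self._handlers.get(queue)
        if handler is None:
            raise KeyError(f"No handler registered for queue {queue!r}")
        return await handler(payload)


registry = HandlerRegistry()


@registry.register_handler("extraction.request")
async def handle_extraction(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder body: a real handler would run PDF text extraction here.
    return {"handled_by": "extraction", "paper_id": payload["paper_id"]}


outcome = asyncio.run(registry.dispatch("extraction.request", {"paper_id": "p-1"}))
print(outcome)
```

Keeping the mapping in one dictionary means an unknown queue fails loudly at dispatch time instead of being silently dropped.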
| Queue | Purpose | Data |
|---|---|---|
| websearch.result | Search results with paper metadata | Paper lists with relevance scores |
| extraction.result | Extracted text and metadata | Structured text content |
| summarization.result | AI-generated summaries | Summary text and key insights |
| structuring.result | Structured document content | Organized text sections |
| gap.analysis.result | Research gap findings | Gap identification and recommendations |
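
Every request queue listed in this document has a result queue with the same prefix, so the reply destination can be derived from the request queue name. Below is a small sketch of that convention together with a reply envelope that carries a correlation ID. The `ResultEnvelope` fields and the `result_queue_for` helper are assumptions for illustration; the real message schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


def result_queue_for(request_queue: str) -> str:
    """Derive the reply queue name, e.g. 'websearch.request' -> 'websearch.result'."""
    suffix = ".request"
    if not request_queue.endswith(suffix):
        raise ValueError(f"Not a request queue: {request_queue!r}")
    return request_queue[: -len(suffix)] + ".result"


@dataclass
class ResultEnvelope:
    """Hypothetical reply shape; the real message schema may differ."""
    correlation_id: str
    status: str
    data: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize with sorted keys so the output is stable across runs."""
        return json.dumps(asdict(self), sort_keys=True)


queue = result_queue_for("gap.analysis.request")
body = ResultEnvelope(correlation_id="abc-123", status="ok", data={"gaps": []}).to_json()
print(queue)
print(body)
```

Echoing the correlation ID back in the reply lets the requesting service match a result to the request that produced it.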
# Example message handler
@consumer.register_handler("websearch.request")
async def handle_websearch_request(message: WebSearchRequest):
    """Process academic search request"""
    try:
        # Orchestrate multi-source search
        results = await search_orchestrator.search(
            query=message.query,
            sources=message.sources,
            filters=message.filters
        )
        # Send results back to Spring Boot
        await publisher.send_message(
            queue="websearch.result",
            data=WebSearchResponse(results=results)
        )
    except Exception as e:
        # Handle error and send failure notification
        await publisher.send_error_response(message.correlation_id, str(e))

The gap analysis service uses AI to identify research gaps:
# Gap analysis workflow
class GapAnalyzer:
    async def analyze_research_gap(
        self,
        topic: str,
        existing_papers: List[Paper]
    ) -> GapAnalysisResult:
        """
        1. Analyze existing research landscape
        2. Identify methodological gaps
        3. Find temporal gaps in research
        4. Suggest future research directions
        """

AI-powered paper summarization:
# Summarization features
class SummarizerAgent:
    async def summarize_paper(self, paper: Paper) -> Summary:
        """
        - Extract key findings and contributions
        - Generate concise abstracts
        - Identify methodology and results
        - Create structured summaries
        """

Intelligent Q&A on research papers:
# Q&A service
class PaperQAService:
    async def answer_question(
        self,
        paper: Paper,
        question: str
    ) -> QAResponse:
        """
        - Context-aware question answering
        - Citation and reference extraction
        - Multi-document reasoning
        """

# B2 storage integration
class B2StorageService:
    async def upload_pdf(
        self,
        file: bytes,
        filename: str
    ) -> UploadResult:
        """
        - Secure PDF upload to B2 bucket
        - Automatic metadata generation
        - Download URL generation
        - File integrity verification
        """

# Multi-method text extraction
class TextExtractor:
    async def extract_text(self, pdf_file: bytes) -> ExtractedContent:
        """
        Extraction methods:
        - PyPDF2: Standard PDF text extraction
        - pdfplumber: Table and layout-aware extraction
        - OCR: Image-based text recognition (Tesseract)
        - Hybrid: Combination approach for best results
        """

# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=app tests/
# Run specific test categories
poetry run pytest tests/search_filters/ # Search filter tests
poetry run pytest tests/integration_test.py # Integration tests
poetry run pytest tests/test_websearch.py # Web search tests
# Run tests with verbose output
poetry run pytest -v
# Run tests and generate HTML coverage report
poetry run pytest --cov=app --cov-report=html tests/

- Unit Tests: Individual component testing
- Integration Tests: Full workflow testing with real APIs
- Search Filter Tests: Academic source filter validation
- API Client Tests: External API interaction testing
- Message Handler Tests: RabbitMQ message processing tests
# pytest.ini configuration
# (pytest.ini uses the [pytest] section; [tool:pytest] is only read from setup.cfg)
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts =
    --strict-markers
    --disable-warnings
    --tb=short
    -ra
markers =
    integration: marks tests as integration tests
    slow: marks tests as slow
    api: marks tests that require external APIs

# Test academic search clients
poetry run python test_all_api_clients.py
# Test specific search functionality
poetry run python test_websearch.py
# Test B2 integration
poetry run python test_b2_integration.py
# Test PDF processing
poetry run python test_enhanced_pdf_collection.py
# Test gap analysis
poetry run python test_gap_analyzer.py

| Variable | Description | Required | Default |
|---|---|---|---|
| RABBITMQ_USER | RabbitMQ username | | - |
| RABBITMQ_PASSWORD | RabbitMQ password | | - |
| RABBITMQ_HOST | RabbitMQ hostname | | localhost |
| RABBITMQ_PORT | RabbitMQ port | | 5672 |
| CORE_API_KEY | Core API access key | | - |
| UNPAYWALL_EMAIL | Email for Unpaywall API | | - |
| B2_KEY_ID | Backblaze B2 key ID | | - |
| B2_APPLICATION_KEY | Backblaze B2 application key | | - |
| B2_BUCKET_NAME | B2 bucket name | | - |
| GOOGLE_API_KEY | Google Generative AI key | No (optional) | - |
| LOG_LEVEL | Logging level | | info |
| ENV | Environment (dev/prod/docker) | | dev |
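
As an illustration of how these variables might be consumed, the sketch below assembles a RabbitMQ connection URL and splits the comma-separated `ALLOWED_ORIGINS` value from the `.env` example. It uses only the standard library; the helper names are hypothetical, and how the service itself builds its connection is not shown in this document.

```python
from typing import List, Mapping
from urllib.parse import quote


def build_amqp_url(env: Mapping[str, str]) -> str:
    """Assemble an AMQP URL; host and port fall back to the documented defaults."""
    # Percent-encode credentials so characters like '@' or '/' cannot break the URL.
    user = quote(env["RABBITMQ_USER"], safe="")
    password = quote(env["RABBITMQ_PASSWORD"], safe="")
    host = env.get("RABBITMQ_HOST", "localhost")
    port = int(env.get("RABBITMQ_PORT", "5672"))
    return f"amqp://{user}:{password}@{host}:{port}"


def parse_allowed_origins(env: Mapping[str, str]) -> List[str]:
    """Split the comma-separated ALLOWED_ORIGINS value into a clean list."""
    raw = env.get("ALLOWED_ORIGINS", "")
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


# Example values only; the password is a placeholder chosen to show encoding.
example_env = {
    "RABBITMQ_USER": "scholar_user",
    "RABBITMQ_PASSWORD": "p@ss/word",
    "ALLOWED_ORIGINS": "http://localhost:3000, http://127.0.0.1:3000",
}
print(build_amqp_url(example_env))
print(parse_allowed_origins(example_env))
```

With real settings, `os.environ` can be passed in place of `example_env`, since both behave as string mappings.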
# app/core/config.py
from typing import List

# BaseSettings lives in pydantic_settings on Pydantic v2 (in pydantic itself on v1)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    app_name: str = "ScholarAI FastAPI Backend"
    version: str = "0.1.0"
    description: str = "AI-powered research assistant backend"

    # API Configuration
    api_v1_prefix: str = "/api/v1"
    allowed_origins: List[str] = ["http://localhost:3000"]

    # Academic API Settings
    max_concurrent_requests: int = 10
    request_timeout: int = 30
    retry_attempts: int = 3

    # AI Service Settings
    max_summary_length: int = 500
    gap_analysis_depth: str = "comprehensive"

    class Config:
        env_file = ".env"
        case_sensitive = True

# Application health
curl http://localhost:8000/health
# Detailed service health
curl http://localhost:8000/api/v1/admin/health
# Academic API status
curl http://localhost:8000/api/v1/admin/api-status

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "services": {
    "rabbitmq": "connected",
    "b2_storage": "available",
    "academic_apis": {
      "arxiv": "healthy",
      "pubmed": "healthy",
      "semantic_scholar": "healthy",
      "openalex": "degraded",
      "crossref": "healthy"
    }
  },
  "performance": {
    "avg_response_time": "1.2s",
    "active_tasks": 3,
    "memory_usage": "156MB"
  }
}

# Structured logging with different levels
import logging
logger = logging.getLogger(__name__)
# Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
logger.info("Processing search request", extra={
    "query": query,
    "sources": sources,
    "correlation_id": correlation_id
})
logger.error("API request failed", extra={
    "error": str(e),
    "api": "arxiv",
    "retry_count": retry_count
})

- RabbitMQ Connection Errors

      # Check RabbitMQ status
      curl http://localhost:15672
      # Verify credentials in .env file
      RABBITMQ_USER=scholar_user
      RABBITMQ_PASSWORD=your_password
      # Check RabbitMQ logs
      docker logs core-rabbitmq

- Academic API Rate Limiting

      # Monitor API status
      curl http://localhost:8000/api/v1/admin/api-status
      # Check rate limiting in logs
      grep "rate_limit" logs/app.log
      # Adjust request delays in configuration

- B2 Storage Issues

      # Test B2 connection
      poetry run python test_b2_integration.py
      # Verify B2 credentials
      B2_KEY_ID=your_key_id
      B2_APPLICATION_KEY=your_app_key
      B2_BUCKET_NAME=your_bucket

- Memory Issues with Large PDFs

      # Monitor memory usage
      htop
      # Increase Docker memory limit
      docker run --memory="2g" your_image
      # Use streaming for large files: read in chunks rather than all at once
      async with aiofiles.open(file_path, 'rb') as f:
          while chunk := await f.read(1024 * 1024):
              ...  # handle each 1 MiB chunk

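
For the rate-limiting item above, one common way to "adjust request delays" is to retry with exponential backoff. The sketch below illustrates that technique under stated assumptions: `RateLimitedError` stands in for whatever a client raises on HTTP 429, `with_backoff` and its parameters are hypothetical, and the default of three retries only mirrors the `retry_attempts` setting. It is not the project's retry logic.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Stand-in for whatever a client raises on HTTP 429 (hypothetical)."""


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    retry_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `call` on RateLimitedError, doubling the delay after each failure.

    `retry_attempts` counts retries after the first try, so the call is
    attempted at most retry_attempts + 1 times.
    """
    for attempt in range(retry_attempts + 1):
        try:
            return await call()
        except RateLimitedError:
            if attempt == retry_attempts:
                raise
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await sleep(delay)
    # Only reached if retry_attempts is negative.
    raise RuntimeError("retry_attempts must be >= 0")


attempts = {"count": 0}


async def flaky_search() -> str:
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitedError("429 Too Many Requests")
    return "ok"


async def no_wait(_: float) -> None:
    return None  # skip real sleeping in the demo


result = asyncio.run(with_backoff(flaky_search, sleep=no_wait))
print(result, attempts["count"])
```

Passing the sleep function as a parameter keeps the example fast to run and makes the retry behaviour easy to test without waiting.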
# Concurrent API requests
import asyncio
import aiohttp

async def fetch_from_multiple_sources(query: str):
    tasks = [
        arxiv_client.search(query),
        pubmed_client.search(query),
        semantic_scholar_client.search(query)
    ]
    # return_exceptions=True keeps one failing source from cancelling the rest
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

# Connection pooling (async with must run inside a coroutine; the wrapper name is illustrative)
async def search_with_shared_pool():
    async with aiohttp.ClientSession(
        connector=aiohttp.TCPConnector(limit=100, limit_per_host=10)
    ) as session:
        # Make requests with shared connection pool
        ...

The ScholarAI FastAPI Backend is live and deployed on Azure VM:
- API Base URL: http://4.247.29.26:8000
- API Documentation: http://4.247.29.26:8000/docs
- Health Check: http://4.247.29.26:8000/health

This production deployment includes:
- Multi-container Docker deployment
- RabbitMQ message processing
- Backblaze B2 storage integration
- Academic API orchestration
- AI-powered analysis services
- Automated CI/CD pipeline
# Build production Docker image
docker compose -f docker/docker-compose.yml build --no-cache
# Deploy to production
./scripts/docker.sh deploy
# Health check after deployment
curl http://your-domain:8000/health

# Development
ENV=dev poetry run uvicorn app.main:app --reload
# Production
ENV=prod poetry run uvicorn app.main:app --host 0.0.0.0 --port 8000
# Docker
ENV=docker uvicorn app.main:app --host 0.0.0.0 --port 8000

# Deploy to Azure VM
./scripts/azure-setup.sh
# This script:
# 1. Sets up Azure VM with Docker
# 2. Configures AI service dependencies
# 3. Deploys FastAPI container
# 4. Sets up monitoring and logging
# 5. Configures academic API access

- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Follow code quality standards
- Add comprehensive tests
- Submit a pull request
# Required before committing
poetry run black . # Code formatting
poetry run isort . # Import sorting
poetry run flake8 # Code linting
poetry run pytest # Run tests
# Type checking (optional but recommended)
poetry run mypy app/

- Black: Automatic code formatting
- isort: Import statement organization
- flake8: PEP 8 compliance checking
- Type Hints: Use type annotations where possible
- Docstrings: Document all public functions and classes
- Async/Await: Use async patterns for I/O operations
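
A short function written to these conventions might look like the sketch below: type hints on the signature, a docstring on the public function, and `await` for the I/O step. The function name and its behaviour are made up for illustration, and the sleep stands in for a real network call.

```python
import asyncio
from typing import List


async def fetch_titles(queries: List[str], delay_seconds: float = 0.0) -> List[str]:
    """Return one placeholder title per query.

    Args:
        queries: Search strings to look up.
        delay_seconds: Simulated I/O latency in seconds.

    Returns:
        A list of titles in the same order as `queries`.
    """
    await asyncio.sleep(delay_seconds)  # stands in for a non-blocking network call
    return [f"Results for {query}" for query in queries]


titles = asyncio.run(fetch_titles(["graph neural networks"]))
print(titles)
```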
- Setup Instructions
- Communication Architecture
- B2 Integration Guide
- Job Recovery System
- Paper Entity Structure
The FastAPI application provides interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI Schema: http://localhost:8000/openapi.json
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Main Repository: ScholarAI
- API Documentation: FastAPI Docs
- Framework: FastAPI 0.115.12
- Language: Python 3.10+
- Package Manager: Poetry
- Message Queue: RabbitMQ with aio-pika
- HTTP Client: httpx, aiohttp
- PDF Processing: PyPDF2, pdfplumber, pytesseract
- AI Integration: Google Generative AI
- Cloud Storage: Backblaze B2
- Data Processing: pandas, numpy, scikit-learn
- NLP: NLTK, spaCy, TextBlob
- Web Scraping: BeautifulSoup4, lxml
- Testing: pytest, pytest-asyncio
- Code Quality: black, isort, flake8
- Containerization: Docker & Docker Compose