This implementation provides seamless integration between ScholarAI and Backblaze B2 cloud storage for PDF management. When papers are fetched, their PDFs are automatically downloaded and uploaded to B2 storage, with the pdfContent field replaced by pdfContentUrl containing the B2 download URL.
- Automatic PDF Processing: Papers fetched through academic APIs automatically have their PDFs uploaded to B2
- Unique File Storage: PDFs are stored with unique identifiers (DOI, ArXiv ID, PubMed ID, etc.) to prevent duplicates
- Duplicate Detection: Before uploading, the system checks if the PDF already exists in B2
- Admin Management: Comprehensive CRUD operations for managing stored PDFs
- Health Monitoring: Built-in health checks and storage statistics
- Error Handling: Robust error handling with graceful fallbacks
The B2 SDK dependency is already added to pyproject.toml:

    poetry install

Add your Backblaze B2 credentials to your .env file:

    # Backblaze B2 Configuration
    B2_KEY_ID=your_b2_key_id_here
    B2_APPLICATION_KEY=your_b2_application_key_here
    B2_BUCKET_NAME=scholar-ai-papers

- Go to your Backblaze B2 dashboard
- Create a new bucket named `scholar-ai-papers` (or whatever you specified in `B2_BUCKET_NAME`)
- Set the bucket to "Private" for security
- Note your application key ID and application key
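
Before running anything against B2, it can help to confirm that the three variables are actually visible to the process. The following is a minimal sketch, not the project's own loader: `B2Config` and `load_b2_config` are hypothetical names, and the sketch assumes the `.env` file has already been loaded into the environment by whatever mechanism the application uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class B2Config:
    """Hypothetical container for the three B2 settings."""
    key_id: str
    application_key: str
    bucket_name: str


def load_b2_config(env: Optional[Mapping[str, str]] = None) -> Optional[B2Config]:
    """Return a B2Config, or None when the credentials are missing."""
    env = os.environ if env is None else env
    key_id = env.get("B2_KEY_ID", "").strip()
    application_key = env.get("B2_APPLICATION_KEY", "").strip()
    # Same default bucket name as in the .env example above.
    bucket_name = env.get("B2_BUCKET_NAME", "scholar-ai-papers").strip()
    if not key_id or not application_key:
        # Warn and carry on without B2 instead of raising.
        print("Warning: B2 credentials missing; PDF upload is disabled.")
        return None
    return B2Config(key_id, application_key, bucket_name)
```

Returning `None` rather than raising lets the caller skip PDF upload when the credentials are absent.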
Run the test script to verify everything is working:

    python test_b2_integration.py

- Paper Fetching: Academic APIs return papers with `pdfUrl` fields
- PDF Download: The system downloads PDFs from the original URLs
- Duplicate Check: Before uploading, checks if PDF already exists in B2
- File Upload: Uploads PDF to B2 with a unique filename
- URL Replacement: Replaces `pdfContent` with `pdfContentUrl` (B2 download URL)
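
The steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the actual `PDFProcessorService` code: `process_paper` is a hypothetical name, and the download, lookup, and upload operations are passed in as plain callables so the flow can be read (and exercised) without a B2 account.

```python
from typing import Callable, Optional


def process_paper(
    paper: dict,
    filename: str,
    download: Callable[[str], Optional[bytes]],
    find_existing_url: Callable[[str], Optional[str]],
    upload: Callable[[str, bytes], str],
) -> dict:
    """Return a copy of `paper` with pdfContent replaced by pdfContentUrl."""
    result = dict(paper)
    pdf_url = result.get("pdfUrl")
    if not pdf_url:
        return result  # no PDF to process

    # PDF Download: fetch the bytes from the original URL.
    data = download(pdf_url)
    if data is None:
        return result  # download failed; leave the paper untouched

    # Duplicate Check: reuse the stored file when it already exists.
    stored_url = find_existing_url(filename)
    if stored_url is None:
        # File Upload: store the PDF under its unique filename.
        stored_url = upload(filename, data)

    # URL Replacement: drop pdfContent, expose the B2 download URL instead.
    result.pop("pdfContent", None)
    result["pdfContentUrl"] = stored_url
    return result
```

Because the input dictionary is copied, a failure at any step leaves the original paper record unchanged.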
PDFs are stored with unique identifiers in priority order:
- DOI: `doi_10.1000_example.pdf`
- ArXiv ID: `arxiv_2301.00001.pdf`
- PubMed ID: `pmid_12345678.pdf`
- Semantic Scholar ID: `ss_abc123def.pdf`
- Title hash: `title_md5hash.pdf`
- Random UUID: `unknown_uuid.pdf`
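
A minimal sketch of this priority order follows. `build_pdf_filename` is a hypothetical helper, not necessarily what the service uses. The field names `doi`, `arxivId`, and `title` appear in the examples in this README; `pubmedId` and `semanticScholarId` are assumed names, and so is the exact normalization applied to the title before hashing.

```python
import hashlib
import re
import uuid


def _safe(value: str) -> str:
    """Replace characters that are awkward in object names with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", value.strip())


def build_pdf_filename(paper: dict) -> str:
    """Pick the first available identifier, in priority order."""
    # (paper field, filename prefix); the last two field names are assumptions.
    candidates = [
        ("doi", "doi"),
        ("arxivId", "arxiv"),
        ("pubmedId", "pmid"),
        ("semanticScholarId", "ss"),
    ]
    for field, prefix in candidates:
        value = paper.get(field)
        if value:
            return f"{prefix}_{_safe(str(value))}.pdf"

    title = paper.get("title")
    if title:
        # Assumed normalization: trim and lowercase before hashing.
        digest = hashlib.md5(title.strip().lower().encode("utf-8")).hexdigest()
        return f"title_{digest}.pdf"

    # Last resort: nothing identifies the paper, so use a random name.
    return f"unknown_{uuid.uuid4().hex}.pdf"
```

With this sketch, a paper carrying both a DOI and an ArXiv ID is stored under the DOI-based name, so the same paper fetched from two sources maps to one file.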

    GET /api/v1/admin/health

Check if B2 storage service is healthy and accessible.

    GET /api/v1/admin/stats

Get comprehensive statistics about PDF storage.

    GET /api/v1/admin/files?limit=100

List all PDF files stored in B2 with metadata.

    DELETE /api/v1/admin/files/all

    DELETE /api/v1/admin/files/paper
    Content-Type: application/json

    {
      "doi": "10.1000/example",
      "title": "Paper Title",
      "arxivId": "2301.00001"
    }

    GET /api/v1/admin/files/paper/url?doi=10.1000/example

Get the B2 download URL for a specific paper's PDF.

    POST /api/v1/admin/process/paper
    Content-Type: application/json

    {
      "title": "Paper Title",
      "doi": "10.1000/example",
      "pdfUrl": "https://example.com/paper.pdf"
    }

    GET /api/v1/admin/content-report

Generate a comprehensive report about stored PDF content.

    POST /api/v1/admin/test/search-with-pdf?query=machine%20learning&limit=5

Test endpoint to demonstrate paper search with PDF processing.
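
Query values such as search phrases and DOIs contain characters (spaces, slashes) that are safest percent-encoded when calling these endpoints from a script. The helper below is a small sketch: `admin_url` is a hypothetical name, and the base URL assumes a local development server on port 8001, as in the curl examples in this README.

```python
from urllib.parse import quote, urlencode

# Assumed local development address; adjust to your deployment.
BASE_URL = "http://localhost:8001/api/v1/admin"


def admin_url(path: str, **params) -> str:
    """Build an admin endpoint URL with percent-encoded query parameters."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if params:
        # quote_via=quote encodes spaces as %20 rather than '+'.
        url += "?" + urlencode(params, quote_via=quote)
    return url


if __name__ == "__main__":
    print(admin_url("test/search-with-pdf", query="machine learning", limit=5))
    print(admin_url("files/paper/url", doi="10.1000/example"))
```

The resulting strings can be passed to any HTTP client; the server decodes `%20` and `%2F` back to the original characters when it parses the query string.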
The B2 integration is configured through environment variables in app/core/config.py:

    import os

    class Settings:
        # Backblaze B2 Configuration
        B2_KEY_ID: str = os.getenv("B2_KEY_ID", "")
        B2_APPLICATION_KEY: str = os.getenv("B2_APPLICATION_KEY", "")
        B2_BUCKET_NAME: str = os.getenv("B2_BUCKET_NAME", "scholar-ai-papers")

- Test B2 Connection:

      python test_b2_integration.py

- Test via API:

      curl -X POST "http://localhost:8001/api/v1/admin/test/search-with-pdf?query=neural%20networks&limit=3"

- Check Storage Stats:

      curl -X GET "http://localhost:8001/api/v1/admin/stats"

When papers are fetched:

Before B2 Integration:

    {
      "title": "Example Paper",
      "doi": "10.1000/example",
      "pdfUrl": "https://arxiv.org/pdf/2301.00001.pdf",
      "pdfContent": null
    }

After B2 Integration:

    {
      "title": "Example Paper",
      "doi": "10.1000/example",
      "pdfUrl": "https://arxiv.org/pdf/2301.00001.pdf",
      "pdfContentUrl": "https://f000.backblazeb2.com/file/scholar-ai-papers/doi_10.1000_example.pdf"
    }

The system includes robust error handling:
- Missing Credentials: Graceful fallback with warning messages
- Network Errors: Retries and timeouts for PDF downloads
- Upload Failures: Continues processing without breaking the paper fetching flow
- Duplicate Files: Efficiently detects and reuses existing files
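
Security considerations:
- Private Bucket: Use private B2 buckets for security
- Access Control: The admin endpoints are not authenticated yet; protect them before exposing the service
- URL Expiration: B2 download URLs for private buckets rely on an authorization that expires, so a stored URL may need to be regenerated
- File Validation: PDF content is validated before upload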
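
The retry, fallback, and validation points above can be illustrated together. This is a sketch under assumptions, not the service's actual code: `download_with_retries` and `is_valid_pdf` are hypothetical names, the fetch function is injected by the caller, and the 50MB limit mentioned under troubleshooting is interpreted here as 50 * 1024 * 1024 bytes.

```python
import time
from typing import Callable, Optional

# Assumed interpretation of the 50MB upload limit.
MAX_PDF_BYTES = 50 * 1024 * 1024


def is_valid_pdf(data: bytes) -> bool:
    """Cheap check: non-empty, within the size limit, starts with the PDF signature."""
    return 0 < len(data) <= MAX_PDF_BYTES and data.startswith(b"%PDF-")


def download_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[bytes]:
    """Try `fetch` up to `attempts` times; return None instead of raising."""
    for attempt in range(1, attempts + 1):
        try:
            data = fetch(url)
        except OSError as exc:  # covers connection errors and timeouts
            print(f"Download attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay_seconds)
            continue
        # A response that is not a PDF (for example an HTML error page) is rejected.
        return data if is_valid_pdf(data) else None
    return None
```

Returning `None` on failure, rather than raising, is what lets the caller keep the paper in the result set and simply skip the upload.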
The system provides detailed statistics:
- Total files and storage size
- File categories by identifier type
- Upload success rates
- Storage efficiency metrics
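
As an illustration of the first two items, totals and per-identifier categories can be derived from a file listing alone. The sketch below assumes each listing entry is a dictionary with `name` and `size` (bytes) keys, which is a guess at the shape of the listing rather than the documented response of the files endpoint; `summarize_files` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def summarize_files(files: List[Dict]) -> Dict:
    """Summarize a listing of stored PDFs by count, size, and identifier type."""
    # The identifier type is the filename prefix before the first underscore,
    # e.g. "doi" for "doi_10.1000_example.pdf".
    categories = Counter(f["name"].split("_", 1)[0] for f in files)
    total_bytes = sum(f["size"] for f in files)
    return {
        "total_files": len(files),
        "total_size_mb": round(total_bytes / (1024 * 1024), 2),
        "by_identifier": dict(categories),
    }
```

A large share of `title` or `unknown` entries in `by_identifier` would suggest that many papers arrive without a stable identifier, which weakens duplicate detection.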
Regular health checks ensure:
- B2 connectivity
- Bucket accessibility
- Storage quotas
- Service performance
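
One way to combine several checks into a single health result is sketched below. `run_health_checks` is a hypothetical helper, and the individual checks are supplied by the caller as zero-argument callables; how the real health endpoint performs each check is not shown in this README.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict:
    """Run each named check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception as exc:  # a broken check must not take down the report
            print(f"Health check {name!r} raised: {exc}")
            results[name] = False
    return {"healthy": all(results.values()), "checks": results}
```

Reporting the per-check results alongside the overall flag makes it easier to tell a credentials problem from a missing bucket.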
The integration is seamlessly built into the MultiSourceSearchOrchestrator:

    # After paper fetching and enrichment
    final_papers = await pdf_processor.process_papers_batch(final_papers)

B2 storage is initialized during application startup:

    # In app/main.py
    await pdf_processor.initialize()

Key files:
- `app/services/b2_storage.py`: Core B2 storage operations
- `app/services/pdf_processor.py`: PDF processing and integration logic
- `app/api/api_v1/endpoints/admin.py`: Admin endpoints for management
- `app/core/config.py`: Configuration management

Key components:
- `B2StorageService`: Handles all B2 operations
- `PDFProcessorService`: Orchestrates PDF processing workflow
- Admin endpoints: Provide management interface
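
For readers wondering how a batch call like `process_papers_batch` might behave, here is a minimal sketch. It is not the `PDFProcessorService` implementation: `process_batch` is a hypothetical stand-in, the per-paper coroutine is injected, and the concurrency cap of 5 is an arbitrary assumption.

```python
import asyncio
from typing import Awaitable, Callable, List


async def process_batch(
    papers: List[dict],
    process_one: Callable[[dict], Awaitable[dict]],
    max_concurrency: int = 5,
) -> List[dict]:
    """Process papers concurrently, preserving order and tolerating failures."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(paper: dict) -> dict:
        async with semaphore:  # cap simultaneous downloads and uploads
            try:
                return await process_one(paper)
            except Exception as exc:
                # A failed PDF must not break the paper fetching flow:
                # return the paper unchanged.
                print(f"PDF processing failed for {paper.get('title')!r}: {exc}")
                return paper

    # gather returns results in the same order as its inputs.
    return list(await asyncio.gather(*(guarded(p) for p in papers)))
```

The semaphore keeps a large result set from opening many downloads at once, and the per-paper `try` block means one bad PDF URL costs only that paper's `pdfContentUrl`.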

Planned enhancements:
- Add authentication to admin endpoints
- Implement PDF text extraction for search
- Add batch processing optimization
- Include PDF thumbnail generation
- Add storage cleanup policies
- Implement CDN integration

- B2 Connection Failed:
  - Check credentials in `.env` file
  - Verify bucket exists and is accessible
  - Check network connectivity
- PDF Upload Failed:
  - Verify PDF URLs are accessible
  - Check file size limits (50MB max)
  - Ensure bucket has sufficient space
- Admin Endpoints Not Working:
  - Ensure application is running
  - Check endpoint URLs and HTTP methods
  - Verify B2 service is initialized

Enable detailed logging by setting:

    LOG_LEVEL=debug

Check logs for detailed error information and processing steps.
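
How the application reads `LOG_LEVEL` is not shown here. As a rough sketch of one common approach with the standard `logging` module, the hypothetical helper below maps the variable to a logging level and falls back to `INFO` for unrecognized values.

```python
import logging
import os
from typing import Optional


def configure_logging(level_name: Optional[str] = None) -> int:
    """Set the root logger level from LOG_LEVEL (default: info)."""
    name = (level_name or os.getenv("LOG_LEVEL", "info")).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown value: fall back to INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # basicConfig does nothing if handlers already exist, so set the level explicitly.
    logging.getLogger().setLevel(level)
    return level
```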