GLM-OCR#19
Pull request overview
This PR switches the backend image OCR implementation from Tesseract/pytesseract to the HuggingFace zai-org/GLM-OCR model, and updates frontend environment configuration to point at the backend running on port 8000.
Changes:
- Replace Tesseract-based OCR with GLM-OCR (Transformers) inference and add model-loading caching.
- Update backend dependencies to include transformers, torch, and accelerate.
- Adjust frontend .env.example/.env API URL values and add myenv/ to backend .gitignore.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| Frontend/.env.example | Updates example API base URL to http://127.0.0.1:8000. |
| Frontend/.env | Changes committed Vite API base URL to localhost. |
| Backend/services/image_ocr.py | Replaces pytesseract OCR pipeline with GLM-OCR model inference. |
| Backend/requirements.txt | Adds Transformers + Torch + Accelerate dependencies for GLM-OCR. |
| Backend/.gitignore | Ignores myenv/ directory. |
Comments suppressed due to low confidence (1)
Backend/services/image_ocr.py:10
ImageOps is imported but no longer used after switching away from the pre-processing path. Please remove the unused import to keep the module clean (and avoid lint/test failures if the repo enforces them).
from PIL import Image, ImageOps
| import os | ||
| import tempfile | ||
| import uuid |
tempfile is imported but never used. Either remove the import or use tempfile.NamedTemporaryFile/mkstemp for the temporary PNG (which would also avoid writing into the app’s working directory).
|
|
||
| from runtime import get_runtime_config | ||
| from services.document_ingestion import DocumentValidationError, parse_document | ||
| from services.groq_ai import extract_csv_from_ocr_text, groq_status |
get_runtime_config is now unused in this module (the prior Tesseract config path was removed). Please drop the unused import to avoid confusion and potential linting failures.
| _processor = None | ||
| _model = None | ||
|
|
||
| def _load_glm_ocr(): | ||
| global _processor, _model | ||
| if _processor is None or _model is None: | ||
| from transformers import AutoProcessor, AutoModelForImageTextToText |
Model initialization is guarded only by _processor is None or _model is None with module-level globals. Under concurrent requests, two threads can enter _load_glm_ocr() simultaneously and race while downloading/loading the model, causing excessive memory use or intermittent failures. Consider protecting initialization with a threading.Lock (or initializing once during app startup/lifespan).
| def ocr_status() -> dict[str, Any]: | ||
| config = get_runtime_config() | ||
| tesseract_cmd = str(getattr(config, "TESSERACT_CMD", "")).strip() | ||
| return { | ||
| "pytesseract_installed": pytesseract is not None, | ||
| "tesseract_cmd_configured": bool(tesseract_cmd), | ||
| "glm_ocr_configured": True, | ||
| "groq": groq_status(), | ||
| } |
ocr_status() reports "glm_ocr_configured": True unconditionally, even if transformers/torch aren’t installed or the model can’t be loaded/downloaded. This can mislead the /options API and the frontend. Consider making this flag reflect reality (e.g., attempt a lightweight import and/or _load_glm_ocr() in a try/except and return configured: False plus an error detail when it fails).
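A minimal sketch of the "lightweight import" check mentioned above, assuming it is enough to confirm that the required packages can be found. The function name `glm_ocr_status` and the `glm_ocr_error` key are hypothetical. The check does not prove that the model weights can be downloaded or loaded.

```python
import importlib.util
from typing import Any

# Packages added to requirements.txt in this PR.
_REQUIRED_MODULES = ("transformers", "torch", "accelerate")


def glm_ocr_status(required: tuple[str, ...] = _REQUIRED_MODULES) -> dict[str, Any]:
    """Report whether the OCR dependencies are importable, without importing them."""
    missing = []
    for name in required:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised modules.
            found = False
        if not found:
            missing.append(name)
    return {
        "glm_ocr_configured": not missing,
        # Hypothetical extra key so callers can see why OCR is unavailable.
        "glm_ocr_error": f"missing packages: {', '.join(missing)}" if missing else None,
    }
```

The result could be merged into the dictionary returned by ocr_status() in place of the hard-coded True.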
| temp_filename = f"{uuid.uuid4().hex}.png" | ||
| try: | ||
| image = Image.open(BytesIO(image_bytes)) | ||
| image.save(temp_filename, format="PNG") | ||
| except Exception as exc: | ||
| if os.path.exists(temp_filename): | ||
| os.remove(temp_filename) | ||
| raise OcrConversionError(f"Uploaded file is not a readable image: {exc}") from exc | ||
|
|
||
| # Grayscale + auto-contrast usually improves OCR quality on scans. | ||
| processed = ImageOps.autocontrast(ImageOps.grayscale(image)) | ||
|
|
||
| try: | ||
| text = pytesseract.image_to_string(processed) | ||
| messages = [ | ||
| { | ||
| "role": "user", | ||
| "content": [ | ||
| { | ||
| "type": "image", | ||
| "url": temp_filename | ||
| }, |
The OCR path writes a UUID-named PNG into the current working directory and then passes a relative path (url: temp_filename) into the processor. In deployed environments the CWD may be read-only or different than expected, which can cause OCR to fail. Prefer creating the file in the OS temp directory (e.g., via tempfile.NamedTemporaryFile(dir=tempfile.gettempdir(), ...)) and pass an absolute path to the model/processor.
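One way to follow this suggestion is sketched below with `tempfile.mkstemp`, which creates the file in the OS temp directory and returns an absolute path. The helper name `temp_png` is hypothetical. It also assumes the caller already holds PNG-encoded bytes, whereas the PR saves through PIL.

```python
import contextlib
import os
import tempfile
from typing import Iterator


@contextlib.contextmanager
def temp_png(png_bytes: bytes) -> Iterator[str]:
    """Yield the absolute path of a temporary .png file and delete it afterwards."""
    # mkstemp uses the OS temp directory by default and returns an absolute path.
    fd, path = tempfile.mkstemp(suffix=".png")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(png_bytes)
        yield path
    finally:
        # Clean up whether the caller succeeded or raised.
        if os.path.exists(path):
            os.remove(path)
```

Inside a `with temp_png(data) as path:` block, `path` can be passed to the processor in place of the relative UUID file name, and the file is removed when the block exits.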
| inputs.pop("token_type_ids", None) | ||
| generated_ids = _model.generate(**inputs, max_new_tokens=8192) | ||
| output_text = _processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False) | ||
|
|
||
| normalized = "\n".join(line.rstrip() for line in output_text.splitlines()) | ||
| if len(normalized.strip()) < 20: | ||
| raise OcrConversionError( |
_processor.decode(..., skip_special_tokens=False) will include the model’s special/control tokens in the OCR output. Since the downstream Groq extraction expects plain OCR text, these tokens can pollute the prompt and reduce extraction quality. Consider decoding with skip_special_tokens=True (or explicitly stripping known template tokens) before normalization.
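Passing skip_special_tokens=True to decode is the simpler fix. If tokens need to be stripped explicitly instead, a sketch could look like the following. It assumes control tokens have the form `<|name|>`, which is an assumption for illustration: the actual tokens depend on the model's tokenizer and should be checked there. `strip_special_tokens` is a hypothetical helper name.

```python
import re

# Assumption: control tokens look like <|name|>. Verify against the tokenizer's
# real special-token list before relying on this pattern.
_SPECIAL_TOKEN = re.compile(r"<\|[^|<>]*\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|>-style tokens, then trim trailing whitespace on each line."""
    cleaned = _SPECIAL_TOKEN.sub("", text)
    return "\n".join(line.rstrip() for line in cleaned.splitlines())
```

A pattern like this would also remove genuine document text that happens to match `<|...|>`, so decoding with skip_special_tokens=True is likely the safer default.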
| @@ -1,2 +1,2 @@ | |||
| # Backend API URL | |||
| VITE_API_URL=https://taxai-77xc.onrender.com | |||
| VITE_API_URL=http://127.0.0.1:8000 | |||
Committing Frontend/.env with VITE_API_URL=http://127.0.0.1:8000 will bake a localhost API URL into any production build that uses repository .env defaults, breaking deployments (and overriding the fallback onrender URL in src/utils/api.js). Typically .env should be untracked and .env.example used for documentation; consider removing Frontend/.env from version control (and adding it to Frontend/.gitignore) or restoring the deployed API URL here.
| VITE_API_URL=http://127.0.0.1:8000 | |
| # Leave VITE_API_URL unset here so the app can use its built-in fallback URL. | |
| # For local development, define VITE_API_URL in an untracked local env file instead. |
| transformers | ||
| torch | ||
| accelerate |
Adding transformers, torch, and accelerate significantly increases build size and may fail to install in some deployment targets without extra system dependencies / correct wheels. Consider documenting the required runtime (CPU vs GPU) and deployment implications, or making these optional extras so environments that don’t use OCR don’t have to install them.
| transformers | |
| torch | |
| accelerate | |
| # Optional OCR/ML dependencies: | |
| # Install these only in deployments that need OCR features, as they | |
| # significantly increase build size and may require CPU/GPU-specific wheels. | |
| # Example: | |
| # pip install transformers torch accelerate |
…nce OCR configuration
No description provided.