diff --git a/README.md b/README.md index df78910..0249f87 100644 --- a/README.md +++ b/README.md @@ -47,7 +47,7 @@ Open-source AI assistant for ERPNext. Ask business questions in plain English an 8. **Built-In Support Tab** — A dedicated support interface is included within the changAI interface for raising support queries directly from your ERPNext desk, without needing to leave the app or contact support through an external channel. -9. **Module-Wise Training Data Automation** — changAI includes tools to auto-generate training data on a per-module basis across your ERPNext installation. You can select individual modules such as Accounts, Inventory, or HR and generate targeted training data for each, allowing the model's retrieval accuracy to improve incrementally without needing to retrain everything at once. +9. **Module-Wise Training Data Automation** — changAI includes tools to auto-generate training data on a per-module basis across your ERPNext setup. You can select individual modules such as Accounts, Inventory, or HR and generate targeted training data for each, allowing the model's retrieval accuracy to improve incrementally without needing to retrain everything at once. 10. **Fine-Tuned Embedding Model** — changAI uses a custom fine-tuned embedding model built on nomic-embed-text-v1.5, specifically trained on ERPNext schema and retrieval data for better semantic matching. @@ -77,11 +77,11 @@ Open-source AI assistant for ERPNext. Ask business questions in plain English an - Qwen3 via Replicate (Remote Mode) — Used for both schema retrieval and SQL generation in the fully hosted pipeline. - Anthropic Claude — Used optionally for schema enrichment. Provide a Claude API key to let changAI analyse your ERPNext customisations and update its understanding of your specific environment. - Amazon Polly — Optional voice output engine. Converts query results to speech when the voice assistant feature is enabled. -- RAG (Retrieval-Augmented Generation) — Core architecture for grounding SQL generation in relevant schema context before passing to the language model. +- RAG (Retrieval-Augmented Generation) — Core approach for grounding SQL generation in relevant schema context before passing to the language model. **Frontend** -- [Frappe Desk](https://frappeframework.com) — The ERPNext desk UI framework used to render the changAI interface. Provides the Chat, Debug, and Support tabs as native Frappe pages without requiring a separate frontend build or deployment. +- [Frappe Desk](https://frappeframework.com) — The ERPNext desk UI framework used to render the changAI interface. Provides the Chat, Debug, and Support tabs as native Frappe pages without requiring a separate frontend build or hosting setup. - JavaScript — Used for client-side interactions within the Frappe Desk interface, including query submission, tab switching, and rendering pipeline debug output. **Dataset** @@ -107,9 +107,9 @@ The free tier is the fastest way to get started. Generate your API key at [aistu **Enterprise Tier — Vertex AI (recommended for production)** -For high-volume or production deployments, Vertex AI provides a more scalable and reliable backend. Set up your Google Cloud environment following the [Vertex AI getting started guide](https://cloud.google.com/vertex-ai/docs/start/cloud-environment), then enter the corresponding credentials in changAI Settings. +For high-volume or production use, Vertex AI provides a more scalable and reliable backend. Set up your Google Cloud environment following the [Vertex AI getting started guide](https://cloud.google.com/vertex-ai/docs/start/cloud-environment), then enter the corresponding credentials in changAI Settings. -**Step 3 — Choose a Deployment Mode** +**Step 3 — Choose a Mode** In addition to the Gemini configuration, changAI supports a Remote Mode that offloads the full pipeline to Replicate . @@ -156,7 +156,7 @@ This step is mandatory. changAI needs to index your master tables before it can **Step 7 — Sync Schema (Optional)** -changAI ships pre-configured with the standard ERPNext schema, so core modules work immediately after installation without any additional mapping. If your ERPNext instance has custom doctypes, custom fields, or significant workflow customisations, you can enrich the AI's understanding of your specific environment. +changAI ships pre-configured with the standard ERPNext schema, so core modules work immediately after setup without any additional mapping. If your ERPNext instance has custom doctypes, custom fields, or significant workflow customisations, you can enrich the AI's understanding of your specific environment. To do this, enter an [Anthropic Claude API key](https://console.anthropic.com/) in the Remote tab of changAI Settings, then click **Update Schema** in the Training tab. changAI will analyse your customisations and incorporate them into its schema context. @@ -212,10 +212,10 @@ changAI supports ERPNext v15, and v16 on Ubuntu with Python 3.14 or higher. **Note** - Python 3.14 requires sudo apt-get install build-essential python3-dev before bench get-app **Which modules does changAI cover out of the box?** -changAI ships pre-configured with the standard ERPNext schema, so modules like Accounts, Inventory, Purchasing, Sales, and HR work immediately after installation without any additional mapping. Custom doctypes and fields require a schema sync using an Anthropic Claude API key. +changAI ships pre-configured with the standard ERPNext schema, so modules like Accounts, Inventory, Purchasing, Sales, and HR work immediately after setup without any additional mapping. Custom doctypes and fields require a schema sync using an Anthropic Claude API key. **Should I use the free Gemini tier or Vertex AI?** -The free tier available at Google AI Studio is well suited for testing and low-volume usage. For production deployments with higher query volumes or stricter reliability requirements, Vertex AI is recommended. +The free tier available at Google AI Studio is well suited for testing and low-volume usage. For production use with higher query volumes or stricter reliability requirements, Vertex AI is recommended. **Should I use Local Mode or Remote Mode?** Use Local Mode if you want schema retrieval to stay on your own server and use Gemini for SQL generation. Use Remote Mode if you prefer a fully hosted pipeline through Replicate using Qwen3 with no local model dependency. diff --git a/changai/changai/Datasets_2_v1/meta.json b/changai/changai/Datasets_2_v1/meta.json index 54b57c3..f43f5b3 100644 --- a/changai/changai/Datasets_2_v1/meta.json +++ b/changai/changai/Datasets_2_v1/meta.json @@ -162,7 +162,7 @@ "module": "Setup", "description": "Legal Entity / Subsidiary with a separate Chart of Accounts belonging to the Organization.", "fields": [ - "details", + "details", "company_name", "abbr", "default_currency", @@ -227,7 +227,6 @@ "dashboard_tab" ] }, - "Serial and Batch Bundle": { "module": "Stock", "description": "Standard ERPNext doctype for Serial and Batch Bundle", @@ -872,7 +871,7 @@ "connections_tab" ] }, - "Sales Invoice" : { + "Sales Invoice": { "module": "Accounts", "description": "Standard ERPNext doctype for Sales Invoice", "fields": [ @@ -1879,7 +1878,7 @@ "is_standard" ] }, - "Purchase Invoice" : { + "Purchase Invoice": { "module": "Accounts", "description": "Standard ERPNext doctype for Purchase Invoice", "fields": [ @@ -2099,7 +2098,6 @@ "payment_request_outstanding" ] }, - "Asset Capitalization": { "module": "Assets", "description": "Standard ERPNext doctype for Asset Capitalization", @@ -3482,7 +3480,6 @@ "status", "column_break_112", "per_installed", - "installation_status", "column_break_89", "per_returned", "transporter_info", @@ -3529,7 +3526,7 @@ "connections_tab" ] }, - "Quotation" : { + "Quotation": { "module": "Selling", "description": "Standard ERPNext doctype for Quotation", "fields": [ @@ -11528,34 +11525,6 @@ "append_emails_to_sent_folder" ] }, - "Installation Note": { - "module": "Selling", - "description": "Standard ERPNext doctype for Installation Note", - "fields": [ - "installation_note", - "column_break0", - "naming_series", - "customer", - "customer_address", - "contact_person", - "customer_name", - "address_display", - "contact_display", - "contact_mobile", - "contact_email", - "territory", - "customer_group", - "column_break1", - "inst_date", - "inst_time", - "status", - "company", - "amended_from", - "remarks", - "item_details", - "items" - ] - }, "Maintenance Visit": { "module": "Maintenance", "description": "Standard ERPNext doctype for Maintenance Visit", @@ -12002,20 +11971,6 @@ "dropbox_access_token" ] }, - "Installation Note Item": { - "module": "Selling", - "description": "Standard ERPNext doctype for Installation Note Item", - "fields": [ - "item_code", - "serial_and_batch_bundle", - "serial_no", - "qty", - "description", - "prevdoc_detail_docname", - "prevdoc_docname", - "prevdoc_doctype" - ] - }, "Account Closing Balance": { "module": "Accounts", "description": "Standard ERPNext doctype for Account Closing Balance", diff --git a/changai/changai/api/v2/build_cards_faiss_index_v2.py b/changai/changai/api/v2/build_cards_faiss_index_v2.py index 05849ac..61844e5 100644 --- a/changai/changai/api/v2/build_cards_faiss_index_v2.py +++ b/changai/changai/api/v2/build_cards_faiss_index_v2.py @@ -11,6 +11,8 @@ from changai.changai.api.v2.retrieve import get_embedding_engine import os import pickle +from changai.changai.api.v2.non_erp_handler import _safe_open_path + def get_app_fvs_base(): return os.path.join( @@ -226,11 +228,9 @@ def clean_schema(schema: Dict[str, Any], output_path: str): field for field in fields if field.get("name") not in GENERIC_FIELDS ] - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(output_path, "w") as f: - yaml.dump(schema, f, allow_unicode=True, sort_keys=False) - - print(f"Cleaned schema written to {output_path}") + allowed_dir = str(Path(output_path).parent.resolve()) + safe = _safe_open_path(output_path, allowed_dir) + safe.write_text(yaml.dump(schema, allow_unicode=True, sort_keys=False), encoding="utf-8") def build_schema_docs(schema: Dict[str, Any]) -> List[Document]: @@ -427,12 +427,12 @@ def save_field_matrix(schema_docs, base_dir): safe_dir.mkdir(parents=True, exist_ok=True) np.save(safe_dir / "field_embs.npy", embs) - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(safe_dir / "field_docs.pkl", "wb") as f: - pickle.dump(schema_docs, f) - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(safe_dir / "table_to_idx.pkl", "wb") as f: - pickle.dump(table_to_idx, f) + allowed_dir = str(safe_dir) + safe_docs = _safe_open_path(str(safe_dir / "field_docs.pkl"), allowed_dir) + safe_docs.write_bytes(pickle.dumps(schema_docs)) + + safe_idx = _safe_open_path(str(safe_dir / "table_to_idx.pkl"), allowed_dir) + safe_idx.write_bytes(pickle.dumps(table_to_idx)) def build_schema_fvs_job(): diff --git a/changai/changai/api/v2/fvs_stores/erpnext/report_fvs/index.faiss b/changai/changai/api/v2/fvs_stores/erpnext/report_fvs/index.faiss index 8645b46..eda5e17 100644 Binary files a/changai/changai/api/v2/fvs_stores/erpnext/report_fvs/index.faiss and b/changai/changai/api/v2/fvs_stores/erpnext/report_fvs/index.faiss differ diff --git a/changai/changai/api/v2/fvs_stores/erpnext/table_fvs/index.faiss b/changai/changai/api/v2/fvs_stores/erpnext/table_fvs/index.faiss index 8734fe8..ecaea4c 100644 Binary files a/changai/changai/api/v2/fvs_stores/erpnext/table_fvs/index.faiss and b/changai/changai/api/v2/fvs_stores/erpnext/table_fvs/index.faiss differ diff --git a/changai/changai/api/v2/non_erp_handler.py b/changai/changai/api/v2/non_erp_handler.py index 161fbde..7c0e609 100644 --- a/changai/changai/api/v2/non_erp_handler.py +++ b/changai/changai/api/v2/non_erp_handler.py @@ -4,6 +4,7 @@ import json import time import threading +from pathlib import Path import pickle from dataclasses import dataclass from typing import Dict, List, Optional, Set, Tuple, Any @@ -23,9 +24,19 @@ class ResponseEntry: priority: int = 100 is_active: bool = True +def _safe_open_path(requested_path: str, allowed_dir: str) -> Path: + """Resolve path and ensure it stays within allowed_dir.""" + allowed = Path(allowed_dir).resolve() + resolved = Path(requested_path).resolve() + if not str(resolved).startswith(str(allowed)): + raise ValueError(f"Path traversal blocked: {requested_path}") + return resolved class IntelligentStaticResponder: def __init__(self, json_file: str, alias_path: str): + self._allowed_dir = os.path.join( + frappe.get_app_path("changai"), "changai", "api", "v2", "assets" + ) t0 = time.time() self.json_file = json_file @@ -39,9 +50,8 @@ def __init__(self, json_file: str, alias_path: str): self._arabic_detect_re = re.compile(r"[\u0600-\u06FF]") t1 = time.time() - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(alias_path, "r", encoding="utf-8") as f: - alias_map = json.load(f) + safe = _safe_open_path(alias_path, self._allowed_dir) + alias_map = json.loads(safe.read_text(encoding="utf-8")) print(f"[non_erp] alias json load: {time.time() - t1:.4f}s") t2 = time.time() @@ -127,9 +137,8 @@ def _build_from_json(self) -> None: self.entries.clear() self.responses_by_key.clear() self.keys.clear() - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(self.json_file, "r", encoding="utf-8") as f: - rows = json.load(f) + safe = _safe_open_path(self.json_file, self._allowed_dir) + rows = json.loads(safe.read_text(encoding="utf-8")) processed_rows = [] @@ -178,17 +187,15 @@ def _write_pickle_cache(self, cache_path: str) -> None: rows = getattr(self, "_processed_rows_for_pickle", None) if rows is None: return - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(cache_path, "wb") as f: - pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL) + safe = _safe_open_path(cache_path, self._allowed_dir) + safe.write_bytes(pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)) def _load_from_pickle(self, cache_path: str) -> None: self.entries.clear() self.responses_by_key.clear() self.keys.clear() - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(cache_path, "rb") as f: # nosemgrep: cache_path derived from self.json_file, validated in __init__ - rows = pickle.load(f) + safe = _safe_open_path(cache_path, self._allowed_dir) + rows = pickle.loads(safe.read_bytes()) for row in rows: entry = ResponseEntry( diff --git a/changai/changai/api/v2/retrieve.py b/changai/changai/api/v2/retrieve.py index f8a8e3b..870cdc1 100644 --- a/changai/changai/api/v2/retrieve.py +++ b/changai/changai/api/v2/retrieve.py @@ -19,6 +19,8 @@ publish_pipeline_update, _safe_join, ) +from changai.changai.api.v2.non_erp_handler import _safe_open_path + from changai.changai.api.v2.clients import ( _post_json, @@ -113,8 +115,9 @@ def load_field_matrix(): app_root = Path(frappe.get_app_path("changai")).resolve() schema_rel = "changai/api/v2/fvs_stores/erpnext/emb_dir" - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - schema_path = _safe_join(app_root, schema_rel) + schema_path = _safe_join(app_root, schema_rel) # already validates traversal + + allowed_dir = str(schema_path) # all files must live here embs_path = schema_path / "field_embs.npy" docs_path = schema_path / "field_docs.pkl" @@ -123,13 +126,11 @@ def load_field_matrix(): if not embs_path.exists(): frappe.throw(f"Missing field_embs.npy. Rebuild schema FVS first: {embs_path}") - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(docs_path, "rb") as f: - docs = pickle.load(f) + safe_docs = _safe_open_path(str(docs_path), allowed_dir) + docs = pickle.loads(safe_docs.read_bytes()) - # nosemgrep: frappe-semgrep-rules.rules.security.frappe-security-file-traversal - with open(table_idx_path, "rb") as f: - table_to_idx = pickle.load(f) + safe_table_idx = _safe_open_path(str(table_idx_path), allowed_dir) + table_to_idx = pickle.loads(safe_table_idx.read_bytes()) embs = np.load(embs_path, mmap_mode="r") diff --git a/changai/changai/api/v2/store_chats.py b/changai/changai/api/v2/store_chats.py index 73aefe5..b8e4779 100644 --- a/changai/changai/api/v2/store_chats.py +++ b/changai/changai/api/v2/store_chats.py @@ -27,7 +27,7 @@ def to_json_if_needed(v: Any) -> Any: MAX_LOG_LEN = 140 doc = frappe.new_doc("ChangAI Logs") doc.user_question = user_question - safe_question=(formatted_q[:137] + "..." if len(formatted_q) > MAX_LOG_LEN else formatted_q) + safe_question=(formatted_q[:137] + "..." if formatted_q and len(formatted_q) > MAX_LOG_LEN else formatted_q or "") doc.rewritten_question = safe_question doc.schema_retrieved = to_json_if_needed(context) doc.sql_generated = to_json_if_needed(sql) diff --git a/changai/changai/api/v2/text2sql_pipeline_v2.py b/changai/changai/api/v2/text2sql_pipeline_v2.py index 6496110..66cede8 100644 --- a/changai/changai/api/v2/text2sql_pipeline_v2.py +++ b/changai/changai/api/v2/text2sql_pipeline_v2.py @@ -52,7 +52,6 @@ ) from changai.changai.api.v2.format_output import ( format_data - ) from changai.changai.api.v2.clients import call_model,gemini_client from changai.changai.api.v2.non_erp_handler import non_erp_response @@ -1024,11 +1023,13 @@ def run_text2sql_pipeline(user_question: str, chat_id: str, request_id: str, sen "entity_raw": final.get("entity_raw"), "question_rewritten": formatted_q } + formatted_q = formatted_q or "" + if final.get("stop_followup"): save_turn_2(session_id=chat_id, user_text=user_question, bot_text=final.get("message"),type_="non_erp") save_logs( user_question=user_question, - formatted_q=None, + formatted_q="", context=None, sql=None, val=None,