diff --git a/CHANGELOG.md b/CHANGELOG.md index 1cbb90981a..508eeac5bc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -16,6 +16,12 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm - **CI / docs preflight guard**: Added `bin/check_docs_latex_unicode.sh` and a fast `docs-latex-unicode-guard` CI job to fail early on non-BMP Unicode in docs-fed text sources before the slower Dockerized `test-docs` LaTeX build. - **Release process / deploy gate reminder**: Documented that tag-triggered PyPI publishes can pause in `waiting` on environment approval, and explicitly call out approving `Review deployments` for `pypi-release` before expecting the final PyPI job to complete. +### Added +- **GFQL/Cypher validate-only preflight API (#1320)**: Added `g.gfql_validate(...)` on `ComputeMixin` as a public no-execution validation entrypoint for GFQL chains/JSON-style queries, Let/DAG queries, and Cypher strings. On valid queries, the API returns structured diagnostics (`ok`, `diagnostics`, query/language metadata) without executing query operators; invalid queries raise structured GFQL exceptions (see Changed). Cypher preflight runs parser+compiler checks and supports optional strict binder/schema mode (`strict=True`) using the bound graph schema catalog; chain/JSON preflight reuses existing `validate_chain_schema()` semantics (including `collect_all=True`), and Let/DAG preflight now includes best-effort schema checks for direct chain-like bindings. + +### Changed +- **GFQL execution prevalidation semantics (#1320)**: `g.gfql(..., validate=True)` now runs local preflight validation before execution. `g.gfql_remote(..., validate=True)` now validates query payloads before implicit upload/network dispatch, so invalid queries fail locally prior to upload when possible. String query inputs are now treated consistently as Cypher during preflight (`g.gfql_validate("...")` and `g.gfql("...", validate=True)`), so users get Cypher parser/compiler diagnostics instead of shape-guessing type errors.
`g.gfql_validate(...)` now raises structured GFQL exceptions on invalid queries (instead of returning `ok=False`), and collect-all mode surfaces full diagnostics via exception context for LM/retry workflows. + ### Internal - **GFQL / Cypher reentry follow-through cleanup (#989, post-#1260 extraction)**: In `graphistry/compute/gfql/cypher/reentry/runtime.py`, free-form intermediate MATCH plan construction now routes through the whole-row/free-form `ReentryPlan` contract instead of scalar-only fallback tagging. This makes the dedicated runtime `plan.free_form` lane reachable again and removes incidental scalar-only-path dependence for free-form reentry dispatch. - **GFQL native types T4 — Arrow/type bridge contracts and coercion semantics (#1312, #1262, #1046)**: Added `graphistry/compute/gfql/ir/arrow_bridge.py` with stable schema-level interchange helpers `to_arrow()` and `from_arrow()` for `RowSchema` + schema-confidence metadata. The bridge records per-field logical-type metadata (`gfql.logical_type`) and confidence (`gfql.schema_confidence`) for deterministic round-trips, supports strict vs widening coercion (`coercion='strict'|'widen'`) at export/import boundaries, preserves scalar nullability exactly, and defines structural-type fallback behavior (`NodeRef`/`EdgeRef`/`PathType` as widened string bridge fields in widen mode). Added focused regression coverage in `graphistry/tests/compute/gfql/test_ir_arrow_bridge.py` for round-trip fidelity, nullability behavior, confidence handling, and strict/widen coercion boundaries. 
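The CHANGELOG entries above describe a validate-only contract: structured diagnostics on success, a single structured exception (carrying all collected diagnostics) on failure. A minimal self-contained sketch of that contract, under the assumption that collect-all mode gathers every error before raising — all names here (`ValidationFailure`, `validate_columns`) are illustrative stand-ins, not graphistry internals:

```python
from typing import Any, Dict, List


class ValidationFailure(Exception):
    """Illustrative stand-in for a structured GFQL validation exception:
    raises instead of returning ok=False, with machine-readable context."""

    def __init__(self, message: str, diagnostics: List[Dict[str, Any]]):
        super().__init__(message)
        self.diagnostics = diagnostics

    def to_dict(self) -> Dict[str, Any]:
        # LM/retry-friendly payload surfaced via exception context
        return {"message": str(self), "diagnostics": self.diagnostics}


def validate_columns(query_cols: List[str], schema_cols: List[str]) -> Dict[str, Any]:
    """Collect-all preflight: gather every missing column, then raise once."""
    diagnostics = [
        {"code": "missing-column", "value": col}
        for col in query_cols
        if col not in schema_cols
    ]
    if diagnostics:
        raise ValidationFailure(
            f"validation failed with {len(diagnostics)} errors", diagnostics
        )
    return {"ok": True, "diagnostics": []}
```

A retry loop can catch the exception, feed `to_dict()` back to a generator, and re-validate — the same shape the LLM validation docs below use for `preflight_payload`.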
diff --git a/docs/source/gfql/cypher.rst b/docs/source/gfql/cypher.rst index 220565caa4..cd5ec8c9ca 100644 --- a/docs/source/gfql/cypher.rst +++ b/docs/source/gfql/cypher.rst @@ -440,7 +440,37 @@ Static Validation / Preflight Check ----------------------------------- If you want to know whether a query fits the current Cypher-in-GFQL subset before -execution, preflight it with the helper APIs: +execution, start with the bound-graph inline preflight APIs: + +.. code-block:: python + + g.gfql_validate( + "MATCH (p) RETURN p.name AS name ORDER BY name DESC LIMIT $top_n", + params={"top_n": 5}, + # strict=True is the default for local bound-graph preflight + ) + + # On failure: + # - GFQLSyntaxError for invalid syntax + # - GFQLValidationError for unsupported/schema-invalid shapes + +- Use ``g.gfql_validate(...)`` when you want a stable validate-only entrypoint + that never executes query operators and raises structured exceptions on invalid queries. +- Use ``g.gfql(..., validate=True)`` when you want execution guarded by a + local preflight check. For Cypher strings, this uses schema-aware strict + preflight by default. +- Use ``g.gfql_remote(..., validate=True)`` when you want remote execution + guarded by local preflight before upload/network dispatch. For Cypher strings, + remote preflight uses ``strict=False`` by default because remote schema is authoritative. +- Use ``parse_cypher()`` when you only want grammar validation and access to + the parsed representation. +- Use ``compile_cypher()`` when you need low-level compiler/lowering output for + tooling or whitebox inspection. +- Use ``cypher_to_gfql()`` only when you specifically need a single GFQL + ``Chain``. It is intentionally stricter than direct execution through + ``g.gfql("...")``. + +Low-level helper example: + +..
code-block:: python @@ -450,25 +480,19 @@ execution, preflight it with the helper APIs: query = "MATCH (p:Person) RETURN p.name AS name" try: - parse_cypher(query) # grammar + AST checks - compile_cypher(query) # GFQL Cypher compiler / lowering checks + parsed = parse_cypher(query) # grammar + AST checks + compiled = compile_cypher(query) # compiler/lowering checks except GFQLSyntaxError as exc: print("Invalid Cypher syntax for g.gfql(\"MATCH ...\"):", exc) except GFQLValidationError as exc: print("Valid Cypher, but outside the current GFQL Cypher surface:", exc) -- Use ``parse_cypher()`` when you only want syntax and AST validation. -- Use ``compile_cypher()`` for the strongest compiler preflight, because it also - catches unsupported-but-valid query shapes in lowering. -- Use ``cypher_to_gfql()`` only when you specifically need a single GFQL - ``Chain``. It is intentionally stricter than direct execution through - ``g.gfql("...")``. - Common Rewrites --------------- - Need remote execution on Graphistry infrastructure instead of running against - the current bound graph? Prefer ``g.gfql_remote([...])`` for remote GFQL. + the current bound graph? Prefer ``g.gfql_remote(...)`` for remote GFQL, and + keep ``validate=True`` (default) for local preflight before upload. - Need remote database Cypher against Neo4j/Bolt-style backends instead of remote GFQL? Use ``graphistry.cypher("...")`` or ``g.cypher("...")``. - Need a pure GFQL chain object? 
Use ``cypher_to_gfql()`` when the query fits a diff --git a/docs/source/gfql/validation/fundamentals.rst b/docs/source/gfql/validation/fundamentals.rst index 64ab964fb0..394a627e2f 100644 --- a/docs/source/gfql/validation/fundamentals.rst +++ b/docs/source/gfql/validation/fundamentals.rst @@ -152,7 +152,77 @@ GFQL validates automatically - just write your queries and run them: Pre-Execution Validation Options ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Use ``validate_chain_schema()`` to check compatibility without running the query, then execute separately: +Use the inline GFQL entrypoints first: + +1. ``g.gfql_validate(...)`` for validate-only preflight (no execution) +2. ``g.gfql(..., validate=True)`` for preflight + execution +3. ``validate_chain_schema()`` for low-level chain-schema checks only + +``g.gfql_validate(...)`` (validate-only, no execution) supports: + +* **Input forms**: Cypher strings, GFQL JSON payloads, and GFQL Python objects + (for example ``Chain(...)``, ``[n(), e(), n()]``, and ``ASTLet(...)``). + String inputs are always validated as Cypher (no separate string-shape precheck). +* **Predicate + structural validation**: yes +* **Schema validation**: + + * GFQL JSON and GFQL Python chain-like forms: yes (default ``schema=True``) + * GFQL Let/DAG forms: DAG structure + schema checks for direct graph-bound + steps; reference-based steps stay structural-only + * Cypher strings: syntax/compile + schema-aware name checks against the bound + graph schema by default (``strict=True``); pass ``strict=False`` for + syntax/compile-only preflight + +.. code-block:: python + + # Chain / JSON-style GFQL + g.gfql_validate([n({'type': 'customer'})], collect_all=True) + + # Cypher + g.gfql_validate("MATCH (c) RETURN c.id AS id LIMIT $n", params={"n": 10}) + +Validation failures raise ``GFQLValidationError`` / ``GFQLSyntaxError`` with +structured, inspectable context: + +..
code-block:: python + + from graphistry.compute.exceptions import GFQLValidationError + + try: + g.gfql_validate([n({"missing_col": "x"})], collect_all=True) + except GFQLValidationError as exc: + payload = exc.to_dict() + # LM-friendly payload: + # { + # "code": "...", + # "message": "...", + # "query_type": "chain", + # "language": "gfql", + # "diagnostics": [...] + # } + print(payload) + +``g.gfql(..., validate=True)`` accepts the same query inputs as ``g.gfql(...)`` +(Cypher string, GFQL JSON, GFQL Python objects), runs local preflight first, and +executes only when preflight passes. Its preflight uses ``g.gfql_validate(...)`` +defaults, so local bound-graph execution runs schema-aware checks by default. + +.. code-block:: python + + # Run preflight first; execute only if preflight passes + result = g.gfql( + "MATCH (c) RETURN c.id AS id LIMIT $n", + params={"n": 10}, + validate=True, + ) + +Use ``validate_chain_schema()`` when you specifically want the low-level chain-schema helper. +It is intentionally narrower than ``g.gfql_validate(...)``: + +* validates chain operations against currently bound node/edge dataframe columns +* does **not** parse/compile Cypher strings +* does **not** run Let/DAG orchestration validation +* does **not** execute query operators .. code-block:: python @@ -169,6 +239,22 @@ Use ``validate_chain_schema()`` to check compatibility without running the query result = g.gfql(chain.chain) print(f"Query executed: {len(result._nodes)} nodes") +Execution-time Preflight Toggles +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For remote execution, ``g.gfql_remote(..., validate=True)`` runs local query +prevalidation before implicit upload/network execution, so invalid queries fail +before data upload when possible. For Cypher strings, remote prevalidation uses +``strict=False`` by default because the authoritative schema is on the remote dataset. 
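The fail-locally-before-upload ordering this hunk documents can be sketched in a self-contained way: preflight runs first, so an invalid query never reaches the (here mocked) implicit upload or network dispatch. All names below (`remote_run`, the parenthesis check, the `upload` callback) are hypothetical stand-ins, not graphistry internals:

```python
from typing import Callable, List


def remote_run(query: str, upload: Callable[[], str], calls: List[str],
               validate: bool = True) -> str:
    """Dispatch a query 'remotely', preflighting locally when validate=True."""
    if validate:
        # Stand-in for real parser/compiler preflight: reject obviously
        # malformed queries before any upload or network traffic happens.
        if query.count("(") != query.count(")"):
            raise ValueError("preflight failed: unbalanced parentheses")
    dataset_id = upload()  # implicit upload only runs after preflight passes
    calls.append("dispatch")
    return dataset_id
```

Usage mirrors the regression test later in this diff: with `validate=True`, a malformed query raises before `upload` is ever invoked, while a well-formed query proceeds through upload and dispatch.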
+ +Grounded vs Ungrounded Validation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Schema checks are most useful when local graph tables are bound on ``g``. +If local node/edge tables are missing, GFQL JSON/AST chain validation can only +do structural/predicate checks, and column-existence checks are effectively +ungrounded. + Error Collection ^^^^^^^^^^^^^^^^ @@ -197,4 +283,4 @@ See Also -------- * :doc:`../spec/language` - Complete language specification -* :doc:`../overview` - GFQL overview \ No newline at end of file +* :doc:`../overview` - GFQL overview diff --git a/docs/source/gfql/validation/llm.rst b/docs/source/gfql/validation/llm.rst index d42516c1b2..a6c63f71c4 100644 --- a/docs/source/gfql/validation/llm.rst +++ b/docs/source/gfql/validation/llm.rst @@ -128,6 +128,27 @@ Combined Validation return {"success": True, "chain": chain} +Direct Preflight For Retry Loops +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For generate-validate-repair loops, you can run ``g.gfql_validate(...)`` and +convert raised exceptions into structured payloads: + +.. 
code-block:: python + + from graphistry.compute.exceptions import GFQLValidationError, GFQLSyntaxError + + def preflight_payload(g, query): + try: + g.gfql_validate(query, collect_all=True) + return {"ok": True} + except (GFQLValidationError, GFQLSyntaxError) as exc: + payload = exc.to_dict() + return { + "ok": False, + "error": payload, # includes code/message + diagnostics context + } + Automated Fix Suggestions ------------------------- @@ -181,4 +202,4 @@ See Also * :doc:`production` - Production patterns * :doc:`../spec/language` - Language specification -* :doc:`../spec/cypher_mapping` - Cypher to GFQL mapping \ No newline at end of file +* :doc:`../spec/cypher_mapping` - Cypher to GFQL mapping diff --git a/graphistry/compute/ComputeMixin.py b/graphistry/compute/ComputeMixin.py index 814ddec2c1..c853523d73 100644 --- a/graphistry/compute/ComputeMixin.py +++ b/graphistry/compute/ComputeMixin.py @@ -10,6 +10,7 @@ from .chain import Chain, chain as chain_base from .chain_let import chain_let as chain_let_base from .gfql_unified import gfql as gfql_base +from .gfql_validate import gfql_validate as gfql_validate_base from .chain_remote import ( chain_remote as chain_remote_base, chain_remote_shape as chain_remote_shape_base @@ -508,6 +509,10 @@ def gfql(self, *args, **kwargs): return gfql_base(self, *args, **kwargs) gfql.__doc__ = gfql_base.__doc__ + def gfql_validate(self, *args, **kwargs): + return gfql_validate_base(self, *args, **kwargs) + gfql_validate.__doc__ = gfql_validate_base.__doc__ + def chain_remote(self, *args, **kwargs) -> Plottable: """ .. 
deprecated:: 2.XX.X @@ -591,7 +596,7 @@ def gfql_remote( def gfql_remote_shape( self, - chain: Union[Chain, List[ASTObject], Dict[str, JSONVal]], + chain: Union[Chain, List[ASTObject], ASTLet, Dict[str, JSONVal], str], api_token: Optional[str] = None, dataset_id: Optional[str] = None, format: Optional[FormatType] = None, diff --git a/graphistry/compute/chain_remote.py b/graphistry/compute/chain_remote.py index 1689eb7cf2..5c8faba995 100644 --- a/graphistry/compute/chain_remote.py +++ b/graphistry/compute/chain_remote.py @@ -16,6 +16,7 @@ from graphistry.compute.chain import Chain from graphistry.compute.gfql.cypher.lowering import compile_cypher_query from graphistry.compute.gfql.cypher.parser import parse_cypher +from graphistry.compute.gfql_validate import gfql_validate as gfql_preflight_validate from graphistry.io.metadata import deserialize_plottable_metadata from graphistry.models.compute.chain_remote import OutputTypeGraph, FormatType, output_types_graph from graphistry.utils.json import JSONVal @@ -136,18 +137,8 @@ def chain_remote_generic( self._pygraphistry.refresh() api_token = self.session.api_token - if not dataset_id: - dataset_id = self._dataset_id - - if not dataset_id: - self = self.upload(validate=validate) - dataset_id = self._dataset_id - if output_type not in output_types_graph: raise ValueError(f"Unknown output_type, expected one of {output_types_graph}, got: {output_type}") - - if not dataset_id: - raise ValueError("Missing dataset_id; either pass in, or call on g2=g1.plot(render='g') in api=3 mode ahead of time") # Resolve engine: auto -> pandas/cudf based on graph DataFrame type engine_resolved = resolve_engine(engine, self) @@ -201,8 +192,25 @@ def chain_remote_generic( else: raise TypeError(f"gfql_remote() query must be Chain, List, ASTLet, Dict, or str. 
Got {type(chain)}") - if validate and not is_let: - Chain.from_json(chain_json) + if validate: + gfql_preflight_validate( + self, + chain, + params=params, + strict=False, + collect_all=False, + schema=False, + ) + + if not dataset_id: + dataset_id = self._dataset_id + + if not dataset_id: + self = self.upload(validate=validate) + dataset_id = self._dataset_id + + if not dataset_id: + raise ValueError("Missing dataset_id; either pass in, or call on g2=g1.plot(render='g') in api=3 mode ahead of time") # --- Build request body (dual-field for backward compat) --- if is_let: @@ -504,8 +512,8 @@ def chain_remote( Uses the latest bound `_dataset_id`, and uploads current dataset if not already bound. Note that rebinding calls of `edges()` and `nodes()` reset the `_dataset_id` binding. - :param chain: GFQL chain query as a Python object or in serialized JSON format - :type chain: Union[Chain, List[ASTObject], Dict[str, JSONVal]] + :param chain: GFQL query as a Python object, serialized GFQL JSON, or Cypher string + :type chain: Union[Chain, List[ASTObject], Dict[str, JSONVal], ASTLet, str] :param api_token: Optional JWT token. If not provided, refreshes JWT and uses that. 
:type api_token: Optional[str] diff --git a/graphistry/compute/gfql_unified.py b/graphistry/compute/gfql_unified.py index 310be0cda2..9e77d8593a 100644 --- a/graphistry/compute/gfql_unified.py +++ b/graphistry/compute/gfql_unified.py @@ -2,7 +2,6 @@ # ruff: noqa: E501 from dataclasses import replace -import re from typing import Any, Dict, List, Literal, Mapping, Optional, Sequence, Tuple, Union, cast from graphistry.Plottable import Plottable from graphistry.Engine import Engine, EngineAbstract, df_concat, df_cons, resolve_engine, safe_merge @@ -55,6 +54,7 @@ from graphistry.compute.typing import DataFrameT, SeriesT from graphistry.compute.util.generate_safe_column_name import generate_safe_column_name from graphistry.compute.validate.validate_schema import validate_chain_schema +from graphistry.compute.gfql_validate import gfql_validate as gfql_preflight_validate from graphistry.otel import otel_traced, otel_detail_enabled logger = setup_logger(__name__) @@ -62,16 +62,6 @@ _REENTRY_WHOLE_ROW_SUGGESTION = "Carry a whole-row node alias through WITH before MATCH re-entry." _REENTRY_SCALAR_SUGGESTION = "Carry scalar columns through WITH before MATCH re-entry." 
-_CYPHER_LEAD_RE = re.compile( - r"^\s*(?:MATCH|OPTIONAL\s+MATCH|WITH|RETURN|UNWIND|CALL|CREATE|MERGE|DELETE|DETACH\s+DELETE|SET|REMOVE|FOREACH|GRAPH|USE)\b", - re.IGNORECASE, -) - - -def _looks_like_cypher_query(query: str) -> bool: - return _CYPHER_LEAD_RE.match(query) is not None - - def _series_to_pylist(values: Any) -> List[Any]: if hasattr(values, "to_arrow"): try: @@ -1589,6 +1579,7 @@ def gfql(self: Plottable, where: Optional[Sequence[WhereComparison]] = None, language: Optional[Literal["cypher", "gremlin"]] = None, params: Optional[Mapping[str, Any]] = None, + validate: bool = False, shortest_path_backend: str = "auto") -> Plottable: """ Execute a GFQL query - either a chain or a DAG @@ -1603,6 +1594,7 @@ def gfql(self: Plottable, :param where: Optional same-path constraints for list/Chain queries :param language: Optional string-query language selector. Defaults to ``"cypher"`` when ``query`` is a string. :param params: Optional parameter dictionary for string-query compilation + :param validate: When ``True``, run local preflight validation before execution via ``g.gfql_validate(...)``. :param shortest_path_backend: Backend for shortestPath execution: ``"auto"`` (default), ``"igraph"`` (require igraph, raise if missing), ``"cugraph"`` (require cugraph, raise if missing), or ``"bfs"`` (always use DataFrame BFS). 
``"auto"`` tries @@ -1800,11 +1792,28 @@ def policy(context: PolicyContext) -> None: if where_param and isinstance(query, (dict, ASTLet)): raise ValueError("where must be provided inside dict chain under the 'where' key") + if not isinstance(query, str): + if language is not None: + raise ValueError("language is only supported when query is a string") + if params is not None: + raise ValueError("params is only supported when query is a string") if isinstance(query, str): if where_param: raise ValueError("where cannot be combined with string queries; embed Cypher predicates in the query itself") - if language is None and not _looks_like_cypher_query(query): - raise TypeError("Query must be ASTObject, List[ASTObject], Chain, ASTLet, or dict. Got str") + + if validate: + gfql_preflight_validate( + dispatch_self, + query, + where=where_param, + language=language, + params=params, + strict=True, + schema=True, + collect_all=False, + ) + + if isinstance(query, str): compiled_query = _compile_string_query(query, language=language, params=params) if isinstance(compiled_query, CompiledCypherGraphQuery): return _execute_graph_query(self, compiled_query, engine=engine, policy=expanded_policy, context=context) @@ -1812,11 +1821,6 @@ def policy(context: PolicyContext) -> None: if compiled_query.graph_bindings or compiled_query.use_ref: return _execute_query_with_graph_context(self, compiled_query, engine=engine, policy=expanded_policy, context=context) query = compiled_query.chain - else: - if language is not None: - raise ValueError("language is only supported when query is a string") - if params is not None: - raise ValueError("params is only supported when query is a string") if isinstance(query, dict) and query.get("type") == "Let": from .ast import ASTLet as _ASTLet diff --git a/graphistry/compute/gfql_validate.py b/graphistry/compute/gfql_validate.py new file mode 100644 index 0000000000..65d7096a73 --- /dev/null +++ b/graphistry/compute/gfql_validate.py @@ -0,0 +1,377 @@ 
+"""Validate-only GFQL/Cypher preflight helpers (no query execution).""" + +from __future__ import annotations + +from typing import Any, Dict, List, Literal, Mapping, NoReturn, Optional, Sequence, Tuple, Union, cast + +from graphistry.Plottable import Plottable +from graphistry.compute.ast import ASTLet, ASTObject, ASTNode, ASTEdge, ASTCall, ASTRef, from_json +from graphistry.compute.chain import Chain +from graphistry.compute.exceptions import ErrorCode, GFQLSyntaxError, GFQLValidationError +from graphistry.compute.gfql.cypher.lowering import ( + CompiledCypherGraphQuery, + CompiledCypherQuery, + CompiledCypherUnionQuery, + compile_cypher_query, +) +from graphistry.compute.gfql.cypher.parser import parse_cypher +from graphistry.compute.gfql.frontends.cypher.binder import FrontendBinder +from graphistry.compute.gfql.ir.compilation import GraphSchemaCatalog, PlanContext +from graphistry.compute.gfql.same_path_types import ( + WhereComparison, + normalize_where_entries, + parse_where_json, +) +from graphistry.compute.validate.validate_schema import validate_chain_schema + + +GFQLValidationQuery = Union[ASTObject, List[ASTObject], ASTLet, Chain, dict, str] + +def _serialize_error(exc: Exception, *, stage: str) -> Dict[str, Any]: + if hasattr(exc, "to_dict") and callable(getattr(exc, "to_dict")): + out = cast(Dict[str, Any], exc.to_dict()) # GFQLValidationError surface + elif hasattr(exc, "code") and hasattr(exc, "message"): + out = { + "code": cast(Any, getattr(exc, "code")), + "message": cast(Any, getattr(exc, "message")), + } + context = cast(Any, getattr(exc, "context", None)) + if isinstance(context, dict): + out.update(context) + else: + out = { + "code": ErrorCode.E108, + "message": str(exc), + } + out["stage"] = stage + return out + + +def _raise_diagnostics( + diagnostics: List[Dict[str, Any]], + *, + query_type: str, + language: str, +) -> NoReturn: + first = diagnostics[0] + code = cast(Any, first.get("code")) or ErrorCode.E108 + message = cast(Any, 
first.get("message")) or "GFQL validation failed" + if len(diagnostics) > 1: + message = f"GFQL validation failed with {len(diagnostics)} errors; first: {message}" + extra = { + key: value + for key, value in first.items() + if key not in {"code", "message", "field", "value", "suggestion", "operation_index"} + } + exc_cls = GFQLSyntaxError if code == ErrorCode.E107 else GFQLValidationError + raise exc_cls( + code, + message, + field=cast(Optional[str], first.get("field")), + value=first.get("value"), + suggestion=cast(Optional[str], first.get("suggestion")), + operation_index=cast(Optional[int], first.get("operation_index")), + diagnostics=diagnostics, + query_type=query_type, + language=language, + **extra, + ) + + +def _build_schema_catalog(g: Plottable, *, strict: bool) -> GraphSchemaCatalog: + node_columns: Tuple[str, ...] = tuple() + edge_columns: Tuple[str, ...] = tuple() + if getattr(g, "_nodes", None) is not None: + node_columns = tuple(str(c) for c in cast(Any, g)._nodes.columns) + if getattr(g, "_edges", None) is not None: + edge_columns = tuple(str(c) for c in cast(Any, g)._edges.columns) + return GraphSchemaCatalog.from_schema_parts( + node_columns=node_columns, + edge_columns=edge_columns, + node_id_column=getattr(g, "_node", None), + edge_source_column=getattr(g, "_source", None), + edge_destination_column=getattr(g, "_destination", None), + metadata={"strict": strict}, + ) + + +def _validate_cypher( + g: Plottable, + query: str, + *, + params: Optional[Mapping[str, Any]], + strict: bool, +) -> Dict[str, Any]: + parsed = parse_cypher(query) + if strict: + strict_ctx = PlanContext(catalog=_build_schema_catalog(g, strict=True)) + FrontendBinder().bind(parsed, strict_ctx, strict_name_resolution=True) + compiled = compile_cypher_query(parsed, params=params) + compiled_kind: Literal["query", "union", "graph"] = "query" + if isinstance(compiled, CompiledCypherUnionQuery): + compiled_kind = "union" + elif isinstance(compiled, CompiledCypherGraphQuery): + 
compiled_kind = "graph" + else: + compiled = cast(CompiledCypherQuery, compiled) + return { + "ok": True, + "query_type": "chain", + "language": "cypher", + "diagnostics": [], + "compiled_kind": compiled_kind, + } + + +def _coerce_non_string_query( + query: GFQLValidationQuery, + *, + where: Optional[Sequence[WhereComparison]], +) -> Union[ASTObject, ASTLet, Chain]: + where_param: Optional[List[WhereComparison]] = None + if where is not None: + if isinstance(where, (list, tuple)): + where_param = normalize_where_entries(where) + else: + raise ValueError(f"where must be a list of comparisons, got {type(where).__name__}") + + out: Union[ASTObject, ASTLet, Chain, dict, List[ASTObject], str] = query + if isinstance(out, dict) and out.get("type") == "Let": + out = ASTLet.from_json(out) + elif isinstance(out, dict) and "chain" in out: + chain_items: List[ASTObject] = [] + for item in cast(List[Any], out["chain"]): + if isinstance(item, dict): + chain_items.append(from_json(item)) + elif isinstance(item, ASTObject): + chain_items.append(item) + else: + raise TypeError(f"Unsupported chain entry type: {type(item)}") + dict_where = parse_where_json(cast(Any, out).get("where")) + if where_param is not None and dict_where: + raise ValueError("where cannot be combined with dict chain that already includes where") + effective_where = where_param if where_param is not None else dict_where + if not chain_items and effective_where: + raise ValueError("where requires at least one named node/edge step; empty chains have no aliases") + out = Chain(chain_items, where=effective_where) + elif isinstance(out, dict): + wrapped_dict: Dict[str, Any] = {} + for key, value in out.items(): + if isinstance(value, (ASTNode, ASTEdge)): + wrapped_dict[key] = Chain([value]) + else: + wrapped_dict[key] = value + out = ASTLet(wrapped_dict) # type: ignore[arg-type] + elif isinstance(out, Chain): + if where_param: + if out.where: + raise ValueError("where provided for Chain that already includes where") 
+ out = Chain(out.chain, where=where_param) + elif isinstance(out, ASTLet): + pass + elif isinstance(out, ASTObject): + out = Chain([out], where=where_param) + elif isinstance(out, list): + converted_query: List[ASTObject] = [] + for item in out: + if isinstance(item, dict): + converted_query.append(from_json(item)) + else: + converted_query.append(item) + if not converted_query and where_param: + raise ValueError("where requires at least one named node/edge step; empty chains have no aliases") + out = Chain(converted_query, where=where_param) + else: + raise TypeError( + f"Query must be ASTObject, List[ASTObject], Chain, ASTLet, dict, or string. " + f"Got {type(out).__name__}" + ) + + if isinstance(out, (Chain, ASTLet, ASTObject)): + return out + raise TypeError( + f"Query must be ASTObject, List[ASTObject], Chain, ASTLet, dict, or string. Got {type(out).__name__}" + ) + + +def _validate_non_string_query( + g: Plottable, + query: GFQLValidationQuery, + *, + where: Optional[Sequence[WhereComparison]], + collect_all: bool, + schema: bool, +) -> Dict[str, Any]: + coerced = _coerce_non_string_query(query, where=where) + if isinstance(coerced, Chain): + if not schema: + if collect_all: + errors = cast(Any, coerced).validate(collect_all=True) or [] + diagnostics = [cast(Any, e).to_dict() for e in errors] + if diagnostics: + _raise_diagnostics(diagnostics, query_type="chain", language="gfql") + return { + "ok": True, + "query_type": "chain", + "language": "gfql", + "diagnostics": [], + } + cast(Any, coerced).validate(collect_all=False) + return { + "ok": True, + "query_type": "chain", + "language": "gfql", + "diagnostics": [], + } + if collect_all: + errors = validate_chain_schema(g, coerced.chain, collect_all=True) or [] + diagnostics = [cast(Any, e).to_dict() for e in errors] + if diagnostics: + _raise_diagnostics(diagnostics, query_type="chain", language="gfql") + return { + "ok": True, + "query_type": "chain", + "language": "gfql", + "diagnostics": [], + } + 
validate_chain_schema(g, coerced.chain, collect_all=False) + return { + "ok": True, + "query_type": "chain", + "language": "gfql", + "diagnostics": [], + } + + if isinstance(coerced, ASTLet): + return _validate_let_query(g, coerced, collect_all=collect_all, schema=schema) + + # For non-chain/non-let AST forms, preserve existing AST structural validation + # surface without introducing a new schema simulator. + if collect_all: + errors = cast(Any, coerced).validate(collect_all=True) or [] + diagnostics = [cast(Any, e).to_dict() for e in errors] + if diagnostics: + _raise_diagnostics(diagnostics, query_type="single", language="gfql") + return { + "ok": True, + "query_type": "single", + "language": "gfql", + "diagnostics": [], + } + cast(Any, coerced).validate(collect_all=False) + return { + "ok": True, + "query_type": "single", + "language": "gfql", + "diagnostics": [], + } + + +def _validate_let_binding_schema_errors(g: Plottable, value: Any) -> List[Any]: + # Structural validation for AST forms is handled by ASTSerializable.validate(); + # this helper adds best-effort schema validation for bindings that execute + # directly against dataframe-like tables. + errors: List[Any] = [] + + if isinstance(value, ASTLet): + for nested in value.bindings.values(): + errors.extend(_validate_let_binding_schema_errors(g, nested)) + return errors + + if isinstance(value, Chain): + return validate_chain_schema(g, value.chain, collect_all=True) or [] + + if isinstance(value, (ASTNode, ASTEdge, ASTCall)): + return validate_chain_schema(g, [value], collect_all=True) or [] + + # ASTRef bindings execute against prior DAG bindings and may have schema + # transformations not visible from root graph statically; keep structural + # checks only to avoid false positives. 
+ if isinstance(value, ASTRef): + return [] + + return [] + + +def _validate_let_query( + g: Plottable, + let_query: ASTLet, + *, + collect_all: bool, + schema: bool, +) -> Dict[str, Any]: + if collect_all: + errors = cast(Any, let_query).validate(collect_all=True) or [] + if schema: + for value in let_query.bindings.values(): + errors.extend(_validate_let_binding_schema_errors(g, value)) + diagnostics = [cast(Any, e).to_dict() for e in errors] + if diagnostics: + _raise_diagnostics(diagnostics, query_type="dag", language="gfql") + return { + "ok": True, + "query_type": "dag", + "language": "gfql", + "diagnostics": [], + } + + cast(Any, let_query).validate(collect_all=False) + if schema: + for value in let_query.bindings.values(): + binding_errors = _validate_let_binding_schema_errors(g, value) + if binding_errors: + raise cast(Any, binding_errors[0]) + return { + "ok": True, + "query_type": "dag", + "language": "gfql", + "diagnostics": [], + } + + +def gfql_validate( + g: Plottable, + query: GFQLValidationQuery, + *, + where: Optional[Sequence[WhereComparison]] = None, + language: Optional[Literal["cypher", "gremlin"]] = None, + params: Optional[Mapping[str, Any]] = None, + strict: bool = True, + collect_all: bool = False, + schema: bool = True, +) -> Dict[str, Any]: + """Validate a GFQL/Cypher query without executing it. + + Raises structured GFQL exceptions on validation failures and never dispatches + query execution operators. 
+    """
+    try:
+        if isinstance(query, str):
+            if where is not None:
+                raise ValueError("where cannot be combined with string queries; embed Cypher predicates in the query itself")
+            query_language = language or "cypher"
+            if query_language != "cypher":
+                raise GFQLValidationError(
+                    ErrorCode.E108,
+                    f"Unsupported GFQL string language '{query_language}'",
+                    field="language",
+                    value=query_language,
+                    suggestion="Use language='cypher' for now; Gremlin string compilation is not implemented yet.",
+                    language="gfql",
+                )
+            return _validate_cypher(g, query, params=params, strict=strict)
+
+        if language is not None:
+            raise ValueError("language is only supported when query is a string")
+        if params is not None:
+            raise ValueError("params is only supported when query is a string")
+        return _validate_non_string_query(g, query, where=where, collect_all=collect_all, schema=schema)
+    except GFQLValidationError:
+        raise
+    except Exception as exc:
+        diagnostic = _serialize_error(exc, stage="validate")
+        _raise_diagnostics(
+            [diagnostic],
+            query_type="chain" if isinstance(query, str) else "single",
+            language="cypher" if isinstance(query, str) else "gfql",
+        )
diff --git a/graphistry/tests/compute/test_chain_let.py b/graphistry/tests/compute/test_chain_let.py
index ae336a6f1f..318868a4ab 100644
--- a/graphistry/tests/compute/test_chain_let.py
+++ b/graphistry/tests/compute/test_chain_let.py
@@ -14,7 +14,7 @@
     detect_cycles, determine_execution_order
 )
 from graphistry.compute.execution_context import ExecutionContext
-from graphistry.compute.exceptions import GFQLTypeError
+from graphistry.compute.exceptions import GFQLTypeError, GFQLSyntaxError, ErrorCode
 from graphistry.tests.test_compute import CGFull
 
 
@@ -547,9 +547,9 @@ def test_invalid_dag_type(self):
         """Test helpful error when dag parameter is wrong type"""
         g = CGFull()
 
-        with pytest.raises(TypeError) as exc_info:
+        with pytest.raises(GFQLSyntaxError) as exc_info:
             g.gfql("not a dag")
-        assert "Query must be ASTObject, List[ASTObject], Chain, ASTLet, or dict" in str(exc_info.value)
+        assert exc_info.value.code == ErrorCode.E107
 
         # When passed a dict, gfql creates an ASTLet which validates
         with pytest.raises(GFQLTypeError) as exc_info:
@@ -1249,10 +1249,9 @@ def test_chain_let_validates(self):
         g = CGFull().edges(pd.DataFrame({'s': ['a'], 'd': ['b']}), 's', 'd')
 
         # Invalid DAG should raise during validation
-        with pytest.raises(TypeError) as exc_info:
+        with pytest.raises(GFQLSyntaxError) as exc_info:
             g.gfql("not a dag")
-
-        assert "Query must be ASTObject, List[ASTObject], Chain, ASTLet, or dict" in str(exc_info.value)
+        assert exc_info.value.code == ErrorCode.E107
 
     def test_chain_let_output_selection(self):
         """Test output parameter selects specific binding"""
diff --git a/graphistry/tests/compute/test_chain_remote_v2.py b/graphistry/tests/compute/test_chain_remote_v2.py
index 1566e332d2..b392f9b3ea 100644
--- a/graphistry/tests/compute/test_chain_remote_v2.py
+++ b/graphistry/tests/compute/test_chain_remote_v2.py
@@ -53,9 +53,9 @@ def __init__(self):
         self.branches = ()
 
 
-def _mock_plottable() -> MagicMock:
+def _mock_plottable(dataset_id: str | None = "test-dataset-123") -> MagicMock:
     mock = MagicMock()
-    mock._dataset_id = "test-dataset-123"
+    mock._dataset_id = dataset_id
     mock._edges = pd.DataFrame({"s": [0], "d": [1]})
     mock._nodes = pd.DataFrame({"id": [0, 1]})
     mock._privacy = None
@@ -271,3 +271,36 @@ def test_let_emits_warning(self) -> None:
         finally:
             _cr.warnings.warn = _orig  # type: ignore
         assert any("Let/DAG" in str(a[0]) for a in captured)
+
+    def test_validate_true_rejects_before_implicit_upload(self) -> None:
+        g = _mock_plottable(dataset_id=None)
+
+        with patch("graphistry.compute.chain_remote.requests.post") as mock_post:
+            with pytest.raises(Exception):
+                chain_remote_generic(
+                    g,
+                    "MATCH (n RETURN n",
+                    format="json",
+                    validate=True,
+                )
+
+        g.upload.assert_not_called()
+        mock_post.assert_not_called()
+
+    def test_validate_true_uses_remote_safe_local_preflight(self) -> None:
+        g = _mock_plottable()
+        ok_report = {"ok": True, "query_type": "chain", "language": "gfql", "diagnostics": []}
+
+        with patch("graphistry.compute.chain_remote.gfql_preflight_validate", return_value=ok_report) as mock_validate:
+            with patch("graphistry.compute.chain_remote.requests.post") as mock_post:
+                mock_post.return_value = _JSON_RESPONSE
+                chain_remote_generic(
+                    g,
+                    [ASTNode(filter_dict={"type": "Person"})],
+                    format="json",
+                    validate=True,
+                )
+
+        kwargs = mock_validate.call_args.kwargs
+        assert kwargs["strict"] is False
+        assert kwargs["schema"] is False
diff --git a/graphistry/tests/compute/test_gfql.py b/graphistry/tests/compute/test_gfql.py
index 9e79e8bcbe..048a62cb65 100644
--- a/graphistry/tests/compute/test_gfql.py
+++ b/graphistry/tests/compute/test_gfql.py
@@ -1,6 +1,7 @@
 import pandas as pd
 import pytest
 from typing import Any, Dict, List
+from unittest.mock import patch
 from graphistry.compute.ast import ASTLet, ASTRef, n, e
 from graphistry.compute.chain import Chain
 from graphistry.compute.exceptions import ErrorCode, GFQLSyntaxError, GFQLValidationError
@@ -258,6 +259,45 @@ def test_gfql_non_string_rejects_language_and_params(self):
         with pytest.raises(ValueError):
             g.gfql([n()], params={"x": 1})
 
+    def test_gfql_validate_true_runs_preflight_before_compile(self):
+        g = _mk_people_company_graph3()
+        with patch(
+            "graphistry.compute.gfql_unified.gfql_preflight_validate",
+            side_effect=GFQLValidationError(ErrorCode.E108, "synthetic preflight failure"),
+        ):
+            with patch(
+                "graphistry.compute.gfql_unified._compile_string_query",
+                side_effect=AssertionError("compile should not run when preflight fails"),
+            ):
+                with pytest.raises(GFQLValidationError, match="synthetic preflight failure"):
+                    g.gfql("MATCH (p) RETURN p", validate=True)
+
+    def test_gfql_validate_false_skips_preflight(self):
+        g = _mk_people_company_graph3()
+
+        with patch(
+            "graphistry.compute.gfql_unified.gfql_preflight_validate",
+            side_effect=AssertionError("preflight should not run when validate=False"),
+        ):
+            result = g.gfql([n()])
+        assert result is not None
+
+    def test_gfql_validate_true_catches_cypher_schema_errors_by_default(self):
+        g = _mk_people_company_graph3()
+
+        with pytest.raises(GFQLValidationError) as exc_info:
+            g.gfql("MATCH (p:Employee) RETURN p.id AS id", validate=True)
+
+        assert exc_info.value.code == ErrorCode.E301
+
+    def test_gfql_validate_true_treats_all_strings_as_cypher(self):
+        g = _mk_people_company_graph3()
+
+        with pytest.raises(GFQLSyntaxError) as exc_info:
+            g.gfql("hello world not cypher", validate=True)
+
+        assert exc_info.value.code == ErrorCode.E107
+
     @pytest.mark.parametrize(
         ("direction", "expected"),
         [
diff --git a/graphistry/tests/compute/test_gfql_validate_only.py b/graphistry/tests/compute/test_gfql_validate_only.py
new file mode 100644
index 0000000000..097f84f99d
--- /dev/null
+++ b/graphistry/tests/compute/test_gfql_validate_only.py
@@ -0,0 +1,133 @@
+import pandas as pd
+import pytest
+
+from graphistry.compute.ast import ASTLet, n
+from graphistry.compute.chain import Chain
+from graphistry.compute.exceptions import ErrorCode, GFQLSyntaxError, GFQLValidationError
+from graphistry.tests.test_compute import CGFull
+
+
+def _mk_graph():
+    nodes_df = pd.DataFrame(
+        {
+            "id": ["a", "b", "c"],
+            "label__Person": [True, True, False],
+            "name": ["Alice", "Bob", "Corp"],
+            "score": [3, 1, 2],
+        }
+    )
+    edges_df = pd.DataFrame({"s": ["a", "b"], "d": ["b", "c"], "type": ["KNOWS", "WORKS_AT"]})
+    return CGFull().nodes(nodes_df, "id").edges(edges_df, "s", "d")
+
+
+def test_gfql_validate_exists_on_public_api():
+    g = CGFull()
+    assert hasattr(g, "gfql_validate")
+    assert callable(g.gfql_validate)
+
+
+def test_gfql_validate_chain_success():
+    g = _mk_graph()
+    report = g.gfql_validate([n({"name": "Alice"})])
+    assert report["ok"] is True
+    assert report["language"] == "gfql"
+    assert report["query_type"] == "chain"
+    assert report["diagnostics"] == []
+
+
+def test_gfql_validate_chain_failure_collect_all():
+    g = _mk_graph()
+    with pytest.raises(GFQLValidationError) as exc_info:
+        g.gfql_validate([n({"missing_col": "x"})], collect_all=True)
+    assert exc_info.value.code == ErrorCode.E301
+    diagnostics = exc_info.value.context.get("diagnostics")
+    assert isinstance(diagnostics, list) and diagnostics
+    assert diagnostics[0]["code"] == ErrorCode.E301
+
+
+def test_gfql_validate_cypher_success():
+    g = _mk_graph()
+    report = g.gfql_validate(
+        "MATCH (p:Person) RETURN p.name AS name ORDER BY name DESC LIMIT $top_n",
+        params={"top_n": 2},
+    )
+    assert report["ok"] is True
+    assert report["language"] == "cypher"
+    assert report["query_type"] == "chain"
+    assert report["compiled_kind"] == "query"
+    assert report["diagnostics"] == []
+
+
+def test_gfql_validate_cypher_default_reports_schema_errors():
+    g = _mk_graph()
+    with pytest.raises(GFQLValidationError) as exc_info:
+        g.gfql_validate("MATCH (p:Employee) RETURN p.name AS name")
+    assert exc_info.value.code == ErrorCode.E301
+
+
+def test_gfql_validate_cypher_can_disable_strict_schema_checks():
+    g = _mk_graph()
+    report = g.gfql_validate("MATCH (p:Employee) RETURN p.name AS name", strict=False)
+    assert report["ok"] is True
+    assert report["language"] == "cypher"
+    assert report["diagnostics"] == []
+
+
+def test_gfql_validate_treats_all_strings_as_cypher():
+    g = _mk_graph()
+    with pytest.raises(GFQLSyntaxError) as exc_info:
+        g.gfql_validate("hello world not cypher")
+    assert exc_info.value.code == ErrorCode.E107
+    assert "Got str" not in str(exc_info.value)
+
+
+def test_gfql_validate_does_not_execute_query_operators(monkeypatch):
+    g = _mk_graph()
+
+    def _should_not_run(*args, **kwargs):
+        raise AssertionError("execution path should not be called by gfql_validate")
+
+    monkeypatch.setattr("graphistry.compute.chain.chain", _should_not_run)
+    report = g.gfql_validate([n({"name": "Alice"})])
+    assert report["ok"] is True
+
+
+def test_gfql_validate_let_success():
+    g = _mk_graph()
+    query = ASTLet({"people": Chain([n({"name": "Alice"})])})
+    report = g.gfql_validate(query)
+    assert report["ok"] is True
+    assert report["language"] == "gfql"
+    assert report["query_type"] == "dag"
+    assert report["diagnostics"] == []
+
+
+def test_gfql_validate_let_schema_failure():
+    g = _mk_graph()
+    query = ASTLet({"people": Chain([n({"missing_col": "x"})])})
+    with pytest.raises(GFQLValidationError) as exc_info:
+        g.gfql_validate(query, collect_all=True)
+    assert exc_info.value.code == ErrorCode.E301
+    assert exc_info.value.context.get("query_type") == "dag"
+
+
+def test_gfql_validate_exception_payload_is_llm_friendly():
+    g = _mk_graph()
+    with pytest.raises(GFQLValidationError) as exc_info:
+        g.gfql_validate([n({"missing_col": "x"})], collect_all=True)
+    payload = exc_info.value.to_dict()
+    assert payload["code"] == ErrorCode.E301
+    assert payload["query_type"] == "chain"
+    assert payload["language"] == "gfql"
+    diagnostics = payload.get("diagnostics")
+    assert isinstance(diagnostics, list) and diagnostics
+    assert diagnostics[0]["code"] == ErrorCode.E301
+
+
+def test_gfql_validate_chain_without_bound_tables_is_structural_only():
+    g = CGFull()
+    report = g.gfql_validate([n({"missing_col": "x"})])
+    assert report["ok"] is True
+    assert report["language"] == "gfql"
+    assert report["query_type"] == "chain"
+    assert report["diagnostics"] == []
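The tests above pin down the structured payload contract for validation failures: a top-level error `code` plus `query_type`, `language`, and (in collect-all mode) a `diagnostics` list carried in the exception context. As a stdlib-only sketch of how a caller such as an LLM repair loop might consume that payload — `StubValidationError` and `diagnostics_for_retry` are hypothetical stand-ins for illustration, not graphistry's real `GFQLValidationError` API:

```python
# Sketch only: StubValidationError mimics the payload shape asserted in the
# tests above; the real class is graphistry.compute.exceptions.GFQLValidationError.
from typing import Any, Dict, List


class StubValidationError(Exception):
    def __init__(self, code: str, message: str, context: Dict[str, Any]) -> None:
        super().__init__(message)
        self.code = code
        self.context = context

    def to_dict(self) -> Dict[str, Any]:
        # LLM-friendly payload: top-level code/message plus query metadata.
        return {"code": self.code, "message": str(self), **self.context}


def diagnostics_for_retry(exc: StubValidationError) -> List[Dict[str, Any]]:
    # collect_all=True surfaces every diagnostic via exception context,
    # so a repair loop can feed all of them back at once instead of
    # iterating one error per attempt.
    return list(exc.context.get("diagnostics", []))


err = StubValidationError(
    "E301",
    "column 'missing_col' not found in nodes schema",
    {
        "query_type": "chain",
        "language": "gfql",
        "diagnostics": [{"code": "E301", "field": "filter", "value": "missing_col"}],
    },
)
payload = err.to_dict()
```

A retry workflow would serialize `payload` into the repair prompt and revalidate with `gfql_validate` before ever executing the corrected query.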