From 0bb637b37093754ad840401df2194a52204e0bbd Mon Sep 17 00:00:00 2001
From: Li Jiang <lijiang1@microsoft.com>
Date: Wed, 6 May 2026 11:08:33 +0800
Subject: [PATCH 1/4] =?UTF-8?q?chore:=20post-CLI=20cleanup=20=E2=80=94=20f?=
 =?UTF-8?q?ix=20Feature.compute,=20llm=20exports,=20cache=20silence;=20add?=
 =?UTF-8?q?=20CLI=20docs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Findings from a post-PR-#5 sweep of the repo:

Bug fixes
---------

1. `featcopilot/core/feature.py` — `Feature.compute` previously called
   `exec(self.code, {"__builtins__": {}}, locals)` so any stored snippet
   that used a Python builtin (`len`, `range`, `int`, `sum`, ...)
   raised `NameError` at exec time, even though the snippet was
   otherwise valid. The fix introduces a curated `_SAFE_BUILTINS` dict
   mirroring `TransformRule._get_safe_builtins` so common idioms work
   without giving the snippet unrestricted access to imports / file I/O.

2. `featcopilot/llm/__init__.py` — `SyncCopilotFeatureClient` was
   defined in `copilot_client.py` and used internally by sibling
   modules but was missing from the package's imports and `__all__`,
   while sister classes `SyncOpenAIFeatureClient` and
   `SyncLiteLLMFeatureClient` were both exported. `from
   featcopilot.llm import SyncCopilotFeatureClient` raised
   `ImportError` even though the class itself was importable from the
   submodule. The fix adds the missing import + `__all__` entry so all
   three sync clients have a consistent public surface.

3. `featcopilot/utils/cache.py` — three bare `except Exception: pass`
   blocks silently swallowed disk I/O / pickle / JSON errors when the
   on-disk cache entry was corrupted, when a value was unpicklable
   (e.g. lambda), or when metadata JSON was truncated. The fix replaces
   each with `logger.warning(...)` so callers can diagnose disk
   issues, version mismatches, and unpicklable values from logs while
   still preserving the "cache is best-effort, never crash the caller"
   contract.

Documentation fixes
-------------------

4. `docs/user-guide/cli.md` — NEW page documenting the `featcopilot`
   CLI added in PR #5 (`info` / `transform` / `explain`
   subcommands, output contract, agentic-usage tips, parquet caveat,
   `--config` example). The CLI was previously only mentioned in the
   README; mkdocs had no nav entry.

5. `mkdocs.yml` — added `Command-Line Interface: user-guide/cli.md`
   to the `User Guide` nav section.

6. `docs/getting-started/installation.md:82` — sample expected output
   updated from `# Output: 0.1.0` to `# Output: 0.3.7` (matches
   current `pyproject.toml` version).

Tests
-----

* `tests/test_core.py::TestFeature::test_feature_compute_uses_safe_builtins`
  — pins the safe-builtin whitelist; covers `len` / `range` /
  `sum` / `int` / `abs` / `round` / `min` / `max`.
* `tests/test_utils.py` — three new `FeatureCache` regression tests
  for corrupted `.pkl` reads, unpicklable values on `set`, and
  corrupted metadata JSON, each asserting `logger.warning` is called
  via `mock.patch` on the module logger.
* `tests/test_llm_modules.py::TestLLMPackageExports` — pins the
  `featcopilot.llm` public API: all three sync clients are
  importable directly AND listed in `__all__`, with identity checks
  against the underlying submodule definitions.

Local: 912 tests pass (full suite), pre-commit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/getting-started/installation.md |   2 +-
 docs/user-guide/cli.md               | 223 +++++++++++++++++++++++++++
 featcopilot/core/feature.py          |  55 ++++++-
 featcopilot/llm/__init__.py          |   3 +-
 featcopilot/utils/cache.py           |  45 +++++-
 mkdocs.yml                           |   1 +
 tests/test_core.py                   |  36 +++++
 tests/test_llm_modules.py            |  49 ++++++
 tests/test_utils.py                  |  75 +++++++++
 9 files changed, 479 insertions(+), 10 deletions(-)
 create mode 100644 docs/user-guide/cli.md

diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 89c8964..e567704 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -79,7 +79,7 @@ Development dependencies include:
 ```python
 import featcopilot
 print(featcopilot.__version__)
-# Output: 0.1.0
+# Output: 0.3.7
 
 from featcopilot import AutoFeatureEngineer
 print("Installation successful!")
diff --git a/docs/user-guide/cli.md b/docs/user-guide/cli.md
new file mode 100644
index 0000000..de27b36
--- /dev/null
+++ b/docs/user-guide/cli.md
@@ -0,0 +1,223 @@
+# Command-Line Interface
+
+FeatCopilot ships a stable, agent-friendly `featcopilot` CLI for using the
+library from shells, CI pipelines, and **agentic / LLM tool-use** workflows
+without writing Python glue. All subcommands accept `--json` for
+machine-readable stdout; user-facing errors are written to **stderr** with
+a non-zero exit code so that automation can parse failures
+deterministically.
+
+The CLI is installed automatically with the package via the
+`[project.scripts]` entry point (`featcopilot = "featcopilot.cli:main"`),
+so after `pip install featcopilot` the `featcopilot` command is available
+on `$PATH`. The equivalent module form `python -m featcopilot ...` always
+works regardless of how the package was installed.
+
+## Subcommands
+
+| Command | Purpose |
+| --- | --- |
+| `featcopilot info` | Print version, supported engines, selection methods, leakage guards, I/O formats, and a runtime `parquet_available` flag. |
+| `featcopilot transform` | Read a CSV / Parquet / JSON file, run [`AutoFeatureEngineer`](../user-guide/overview.md), and write engineered features to an output file. |
+| `featcopilot explain` | Fit and print a JSON document with `{name, explanation, code}` per feature for downstream LLM consumption (no output file is written). |
+
+Run any subcommand with `--help` to see the full flag list:
+
+```bash
+featcopilot --help
+featcopilot transform --help
+featcopilot explain --help
+```
+
+## Output contract
+
+All three subcommands honor the same agent-friendly contract:
+
+* **`stdout`** carries the result. With `--json` (always implicit for
+  `explain`), exactly one JSON document is written.
+* **`stderr`** is reserved for failures. A successful run keeps `stderr`
+  empty even when `AutoFeatureEngineer` emits leakage warnings or
+  `verbose` logger output ─ those are surfaced via the JSON payload's
+  `warnings` field instead. This same contract covers warnings emitted
+  during pandas / pyarrow read or write phases (e.g. `DtypeWarning` on
+  mixed-type CSVs, `FutureWarning` from a successful Parquet write):
+  they are routed to the JSON `warnings` field, never to `stderr`.
+* **Exit codes**: `0` on success; `2` for user-input errors (missing
+  files, malformed config, unknown target, etc.); `1` for unexpected
+  internal errors.
+
+## `featcopilot info`
+
+Discover capabilities without running an engineer:
+
+```bash
+featcopilot info --json
+```
+
+Sample (truncated) output:
+
+```json
+{
+  "version": "0.3.7",
+  "supported_engines": ["relational", "tabular", "text", "timeseries"],
+  "supported_selection_methods": ["correlation", "importance", "mutual_info"],
+  "supported_leakage_guards": ["block", "ignore", "warn"],
+  "supported_input_formats": ["csv", "json"],
+  "supported_output_formats": ["csv", "json"],
+  "parquet_available": false
+}
+```
+
+`parquet_available` reflects whether `pyarrow` or `fastparquet` is
+importable in the current environment. The base FeatCopilot install does
+not pin a parquet engine; install one with
+`pip install pyarrow` (or `fastparquet`) to enable Parquet I/O.
+
+## `featcopilot transform`
+
+Run feature engineering on a tabular input and write the engineered
+features to disk:
+
+```bash
+featcopilot transform \
+    --input data.csv --target label --output features.csv \
+    --engines tabular --max-features 50 \
+    --json
+```
+
+Common flags:
+
+| Flag | Purpose |
+| --- | --- |
+| `--input / -i` | Path to input file (CSV / Parquet / JSON). Required. |
+| `--output / -o` | Path to output file. Required. |
+| `--target / -t` | Target column. Required when feature selection is applied (i.e. when `--max-features` / config `max_features` is set). |
+| `--input-format` / `--output-format` | Override format detection (`csv` / `parquet` / `json`). |
+| `--engines` | One or more engines to enable (default: `tabular`). |
+| `--max-features N` | Cap on engine output / selection. Forwarded both to engine constructors and to the selector. |
+| `--no-selection` | Skip feature selection entirely (raw feature generation). |
+| `--selection-methods` | Override the default `mutual_info importance` selection set. |
+| `--leakage-guard` | How to handle suspicious column names: `warn` (default), `block`, or `ignore`. |
+| `--include-target` | Re-attach the target column to the output file (collision-safe). |
+| `--task-description` | Free-form ML task description forwarded to LLM-aware engines. |
+| `--config FILE` | JSON config with nested keys (e.g. `llm_config`, `selection_methods`). CLI flags override config values. |
+| `--verbose / --no-verbose` | Toggle verbose logging. With `--json`, log records are routed to the JSON `warnings` field rather than `stderr`. |
+| `--gate-n-jobs` | Parallelism for the do-no-harm gate's RF (default 1; `-1` = all cores). |
+| `--json` | Emit a one-line JSON status object on stdout instead of human-readable text. |
+
+A successful `--json` run prints something like:
+
+```json
+{
+  "status": "ok",
+  "input": "data.csv",
+  "output": "features.csv",
+  "input_format": "csv",
+  "output_format": "csv",
+  "n_rows": 1000,
+  "n_features": 47,
+  "n_input_columns": 12,
+  "n_generated_features": 47,
+  "engines": ["tabular"],
+  "selection_methods": ["mutual_info", "importance"],
+  "max_features": 50,
+  "target": "label",
+  "selection_applied": true,
+  "warnings": []
+}
+```
+
+## `featcopilot explain`
+
+Fit the engineer (without writing any output file) and print a JSON
+catalog of generated features for downstream LLM consumption:
+
+```bash
+featcopilot explain --input data.csv --target label
+```
+
+Each entry in the `features` array contains the feature `name`, an
+LLM-style natural-language `explanation`, and the executable Python
+`code` used to produce it.
+
+`explain` defaults to running on the **full** input so the metadata is
+a faithful description of what a corresponding `transform` would
+generate. Some engines (notably the tabular engine's categorical
+encoding) consult per-row / per-category statistics when planning
+features, so blind subsampling can silently change results. For very
+large inputs where metadata-only `explain` should not pay full memory
+or compute cost, opt in with:
+
+```bash
+featcopilot explain --input big.csv --target label --explain-sample-size 5000
+```
+
+The cap is a deterministic *head slice* (the first N rows), threaded
+through `pd.read_csv(nrows=N)` for CSV so memory is bounded natively.
+For Parquet / JSON pandas has no native row-limit, so the file is
+fully read and then truncated; a `UserWarning` explaining the
+limitation is emitted (and surfaced in the JSON `warnings` field) only
+when the cap actually truncates the input.
+
+## Configuration files
+
+Pass `--config config.json` to provide nested keys that don't have
+matching CLI flags, such as the `llm_config` engine kwargs:
+
+```json
+{
+  "engines": ["tabular", "llm"],
+  "max_features": 80,
+  "selection_methods": ["mutual_info", "importance"],
+  "llm_config": {
+    "backend": "litellm",
+    "model": "gpt-4o",
+    "max_suggestions": 20
+  }
+}
+```
+
+Explicit CLI flags override values from the config file. Any malformed
+scalar (e.g. `"max_features": "5"`, `"verbose": "false"`) is rejected
+with a clean exit-2 error rather than failing later inside the
+engineer.
+
+## Parquet I/O
+
+The base FeatCopilot install does not pin a parquet engine. To use
+`--input file.parquet` / `--output file.parquet` (or the `parquet`
+value of `--input-format` / `--output-format`), install one of:
+
+```bash
+pip install pyarrow      # recommended
+# or
+pip install fastparquet
+```
+
+Confirm with `featcopilot info --json`:
+
+```json
+{ "parquet_available": true, ... }
+```
+
+If neither engine is installed, attempting Parquet I/O fails with a
+clean exit-2 error pointing at the missing dependency.
+
+## Agentic-usage tips
+
+* Always pass `--json`. Treat anything on `stderr` as a hard failure;
+  treat anything on `stdout` as the JSON result.
+* Treat the JSON `warnings` field as a list of human-readable
+  diagnostic strings ─ it is non-empty for `transform` runs that
+  generated leakage / mock-mode / sampling notices, and empty for
+  fully clean runs.
+* For long-running batch jobs, prefer `featcopilot transform` to
+  `python -m featcopilot transform` only because the former is shorter;
+  both invoke the exact same entry point.
+
+## See also
+
+* [Overview](overview.md) ─ the underlying `AutoFeatureEngineer` API.
+* [Engines](engines.md) ─ what each engine generates.
+* [LLM Features](llm-features.md) ─ configuring the LLM backend (used
+  by `--config llm_config`).
diff --git a/featcopilot/core/feature.py b/featcopilot/core/feature.py
index 4e3e933..2feff77 100644
--- a/featcopilot/core/feature.py
+++ b/featcopilot/core/feature.py
@@ -12,6 +12,45 @@
 logger = get_logger(__name__)
 
 
+# Curated set of safe Python builtins exposed to ``Feature.compute``'s
+# stored code. Without this whitelist (i.e. with ``{"__builtins__": {}}``)
+# even basic idioms like ``len(df)``, ``range(...)``, ``sum(...)``, or
+# ``int(x)`` raise ``NameError`` at exec time, which means a feature whose
+# code legitimately uses a Python builtin crashes during ``compute`` even
+# though the snippet is otherwise valid. The set mirrors the one used by
+# :class:`featcopilot.core.transform_rule.TransformRule` so both code
+# execution paths agree on what is safe.
+_SAFE_BUILTINS: dict[str, Any] = {
+    "len": len,
+    "sum": sum,
+    "max": max,
+    "min": min,
+    "int": int,
+    "float": float,
+    "str": str,
+    "bool": bool,
+    "abs": abs,
+    "round": round,
+    "pow": pow,
+    "range": range,
+    "list": list,
+    "dict": dict,
+    "set": set,
+    "tuple": tuple,
+    "sorted": sorted,
+    "reversed": reversed,
+    "enumerate": enumerate,
+    "zip": zip,
+    "any": any,
+    "all": all,
+    "map": map,
+    "filter": filter,
+    "isinstance": isinstance,
+    "hasattr": hasattr,
+    "getattr": getattr,
+}
+
+
 class FeatureType(Enum):
     """Types of features."""
 
@@ -109,6 +148,13 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
         """
         Compute feature values from DataFrame using stored code.
 
+        The stored ``code`` is executed with ``df``, ``np`` and ``pd`` available
+        as local variables and a curated set of safe Python builtins
+        (``len``, ``range``, ``sum``, numeric / sequence constructors, etc.)
+        in globals so common idioms work without giving the snippet
+        unrestricted access to imports / file I/O. The snippet must
+        bind its output to a local variable named ``result``.
+
         Parameters
         ----------
         df : DataFrame
@@ -118,11 +164,16 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
         -------
         Series
             Computed feature values
+
+        Raises
+        ------
+        ValueError
+            If ``self.code`` is empty / missing or if the executed code
+            does not bind a ``result`` variable.
         """
         if self.code:
-            # Execute stored code to compute feature
             local_vars = {"df": df, "np": np, "pd": pd}
-            exec(self.code, {"__builtins__": {}}, local_vars)
+            exec(self.code, {"__builtins__": _SAFE_BUILTINS}, local_vars)
             if "result" in local_vars:
                 return local_vars["result"]
         raise ValueError(f"No code defined for feature {self.name}")
diff --git a/featcopilot/llm/__init__.py b/featcopilot/llm/__init__.py
index d63780e..3039868 100644
--- a/featcopilot/llm/__init__.py
+++ b/featcopilot/llm/__init__.py
@@ -4,7 +4,7 @@
 """
 
 from featcopilot.llm.code_generator import FeatureCodeGenerator
-from featcopilot.llm.copilot_client import CopilotFeatureClient
+from featcopilot.llm.copilot_client import CopilotFeatureClient, SyncCopilotFeatureClient
 from featcopilot.llm.explainer import FeatureExplainer
 from featcopilot.llm.litellm_client import LiteLLMFeatureClient, SyncLiteLLMFeatureClient
 from featcopilot.llm.openai_client import OpenAIFeatureClient, SyncOpenAIFeatureClient
@@ -13,6 +13,7 @@
 
 __all__ = [
     "CopilotFeatureClient",
+    "SyncCopilotFeatureClient",
     "LiteLLMFeatureClient",
     "SyncLiteLLMFeatureClient",
     "OpenAIFeatureClient",
diff --git a/featcopilot/utils/cache.py b/featcopilot/utils/cache.py
index ad8e23e..1067727 100644
--- a/featcopilot/utils/cache.py
+++ b/featcopilot/utils/cache.py
@@ -8,6 +8,10 @@
 
 import pandas as pd
 
+from featcopilot.utils.logger import get_logger
+
+logger = get_logger(__name__)
+
 
 class FeatureCache:
     """
@@ -85,8 +89,18 @@ def get(self, key: str, data_hash: str | None = None) -> Any | None:
                     # Store in memory cache
                     self._memory_cache[cache_key] = value
                     return value
-                except Exception:
-                    pass
+                except Exception as exc:  # noqa: BLE001 - we re-surface via logger
+                    # Cache reads must never crash the caller (a corrupted
+                    # entry should fall back to recomputing the feature)
+                    # but the failure must not be silent: log a warning
+                    # so users can diagnose disk issues / pickle version
+                    # mismatches / partially-written files.
+                    logger.warning(
+                        "FeatureCache: failed to read cache entry %s (%s: %s)",
+                        cache_path,
+                        type(exc).__name__,
+                        exc,
+                    )
 
         return None
 
@@ -137,8 +151,18 @@ def set(
                 meta_path = self.cache_dir / f"{cache_key}.meta.json"
                 with open(meta_path, "w") as f:
                     json.dump(metadata or {}, f)
-            except Exception:
-                pass
+            except Exception as exc:  # noqa: BLE001 - we re-surface via logger
+                # Persisting to disk is best-effort: a write failure (e.g.
+                # disk full, read-only filesystem, unpicklable value)
+                # should not crash the caller, but it must not be silent
+                # either. Log a warning so operational issues with the
+                # cache directory surface in observability.
+                logger.warning(
+                    "FeatureCache: failed to persist cache entry %s (%s: %s)",
+                    cache_path,
+                    type(exc).__name__,
+                    exc,
+                )
 
     def has(self, key: str, data_hash: str | None = None) -> bool:
         """Check if key exists in cache."""
@@ -205,8 +229,17 @@ def get_metadata(self, key: str, data_hash: str | None = None) -> dict | None:
                 try:
                     with open(meta_path) as f:
                         return json.load(f)
-                except Exception:
-                    pass
+                except Exception as exc:  # noqa: BLE001 - we re-surface via logger
+                    # Same rationale as the cache-read path: a corrupted
+                    # metadata file should fall back to "no metadata
+                    # available" rather than crashing the caller, but
+                    # the failure must surface via the logger.
+                    logger.warning(
+                        "FeatureCache: failed to read cache metadata %s (%s: %s)",
+                        meta_path,
+                        type(exc).__name__,
+                        exc,
+                    )
 
         return None
 
diff --git a/mkdocs.yml b/mkdocs.yml
index 8136adf..48ca57f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -80,6 +80,7 @@ nav:
     - LLM Features: user-guide/llm-features.md
     - Feature Selection: user-guide/selection.md
     - Feature Stores: user-guide/feature-stores.md
+    - Command-Line Interface: user-guide/cli.md
     - Benchmarks: user-guide/benchmarks.md
     - Best Practices: user-guide/best-practices.md
   - Examples:
diff --git a/tests/test_core.py b/tests/test_core.py
index cbee907..c014df3 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -60,6 +60,42 @@ def test_feature_compute(self):
 
         assert list(result) == [1, 4, 9, 16]
 
+    def test_feature_compute_uses_safe_builtins(self):
+        """Regression: ``Feature.compute`` previously called
+        ``exec(code, {"__builtins__": {}}, locals)`` which broke any
+        snippet that used a Python builtin (``len``, ``range``, ``int``,
+        ``sum``, ...). The fix exposes a curated set of safe builtins
+        identical to ``TransformRule._get_safe_builtins`` so common
+        idioms work without giving the snippet unrestricted access.
+        """
+        df = pd.DataFrame({"col1": [10, 20, 30, 40, 50]})
+
+        # ``len`` (most common builtin used in a feature) — would NameError
+        # before the fix.
+        f_len = Feature(name="row_count", code="result = pd.Series([len(df)] * len(df))")
+        result = f_len.compute(df)
+        assert list(result) == [5, 5, 5, 5, 5]
+
+        # ``range`` + ``sum`` + numeric builtins — exercises a broader
+        # subset of the whitelist in one snippet.
+        f_range = Feature(
+            name="cumulative_index_sum",
+            code="result = pd.Series([sum(range(int(v))) for v in df['col1']])",
+        )
+        result = f_range.compute(df)
+        # sum(range(10))=45, sum(range(20))=190, sum(range(30))=435,
+        # sum(range(40))=780, sum(range(50))=1225
+        assert list(result) == [45, 190, 435, 780, 1225]
+
+        # ``abs`` + ``round`` + ``min`` / ``max`` — ensure the
+        # whitelist's full set is exposed.
+        f_clip = Feature(
+            name="clipped",
+            code="result = pd.Series([round(min(max(abs(v - 25), 0), 20), 2) for v in df['col1']])",
+        )
+        result = f_clip.compute(df)
+        assert list(result) == [15, 5, 5, 15, 20]
+
 
 class TestFeatureSet:
     """Tests for FeatureSet class."""
diff --git a/tests/test_llm_modules.py b/tests/test_llm_modules.py
index e1c426b..a42a2d8 100644
--- a/tests/test_llm_modules.py
+++ b/tests/test_llm_modules.py
@@ -2596,3 +2596,52 @@ def test_send_prompt_handles_concurrent_futures_timeout_error(self):
 
         result = asyncio.run(client.send_prompt("hi"))
         assert "timed out" in result.lower()
+
+
+# ---------------------------------------------------------------------------
+# featcopilot.llm package public-API consistency
+# ---------------------------------------------------------------------------
+
+
+class TestLLMPackageExports:
+    """Pin the public API of the ``featcopilot.llm`` package.
+
+    Regression: ``SyncCopilotFeatureClient`` was defined in
+    ``copilot_client.py`` and used by sibling modules (e.g.
+    ``code_generator.py``, ``explainer.py``) but was missing from
+    ``featcopilot/llm/__init__.py``'s imports and ``__all__`` while the
+    sister classes ``SyncOpenAIFeatureClient`` and
+    ``SyncLiteLLMFeatureClient`` WERE both exported. That made
+    ``from featcopilot.llm import SyncCopilotFeatureClient`` raise
+    ``ImportError`` even though the class itself was importable from the
+    submodule, which is a public-API surface inconsistency callers
+    legitimately notice.
+    """
+
+    def test_sync_clients_are_all_exported(self):
+        """All three sync LLM clients must be importable directly from
+        the ``featcopilot.llm`` package."""
+        from featcopilot.llm import (
+            SyncCopilotFeatureClient,
+            SyncLiteLLMFeatureClient,
+            SyncOpenAIFeatureClient,
+        )
+
+        # Identity check: the names exposed by the package must be the
+        # same objects defined in the underlying submodules.
+        from featcopilot.llm.copilot_client import SyncCopilotFeatureClient as _SCC
+        from featcopilot.llm.litellm_client import SyncLiteLLMFeatureClient as _SLC
+        from featcopilot.llm.openai_client import SyncOpenAIFeatureClient as _SOC
+
+        assert SyncCopilotFeatureClient is _SCC
+        assert SyncLiteLLMFeatureClient is _SLC
+        assert SyncOpenAIFeatureClient is _SOC
+
+    def test_sync_clients_are_listed_in_dunder_all(self):
+        """``__all__`` must mirror the imports so ``from featcopilot.llm
+        import *`` is consistent with explicit imports."""
+        import featcopilot.llm as llm_pkg
+
+        assert "SyncCopilotFeatureClient" in llm_pkg.__all__
+        assert "SyncLiteLLMFeatureClient" in llm_pkg.__all__
+        assert "SyncOpenAIFeatureClient" in llm_pkg.__all__
diff --git a/tests/test_utils.py b/tests/test_utils.py
index 75731d3..9996895 100644
--- a/tests/test_utils.py
+++ b/tests/test_utils.py
@@ -204,6 +204,81 @@ def test_list_keys_includes_disk(self):
             cache._memory_cache.clear()
             assert "disk_key" in cache.list_keys()
 
+    def test_corrupted_cache_read_logs_warning(self):
+        """Regression: ``FeatureCache.get`` previously had a bare
+        ``except Exception: pass`` that silently swallowed ``pickle``
+        errors when the on-disk cache entry was corrupted (truncated
+        write, version mismatch, partial flush). The fix logs a
+        warning via ``featcopilot.utils.cache``'s module logger so
+        users can diagnose disk issues. This test corrupts the on-disk
+        ``.pkl`` and asserts ``get()`` returns ``None`` AND emits a
+        warning containing the cache path.
+        """
+        from pathlib import Path
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            cache = FeatureCache(cache_dir=tmpdir)
+            cache.set("victim", {"value": 42})
+
+            # Find the persisted file and corrupt it.
+            pkl_path = next(Path(tmpdir).glob("*.pkl"))
+            pkl_path.write_bytes(b"\x00\x01\x02not-a-valid-pickle-stream")
+
+            # Drop the in-memory entry so ``get`` falls through to disk.
+            cache._memory_cache.clear()
+
+            with patch("featcopilot.utils.cache.logger") as mock_logger:
+                value = cache.get("victim")
+                assert value is None
+                assert mock_logger.warning.called, "FeatureCache must log on read failure"
+                msg = mock_logger.warning.call_args[0][0]
+                assert "failed to read cache entry" in msg
+
+    def test_unpicklable_value_logs_warning_on_set(self):
+        """Regression: ``FeatureCache.set`` previously had a bare
+        ``except Exception: pass`` that silently swallowed
+        ``pickle.PicklingError`` when the value could not be persisted
+        (e.g. lambda / open file handle). The fix logs a warning so
+        operational issues with the cache directory surface in
+        observability.
+        """
+        with tempfile.TemporaryDirectory() as tmpdir:
+            cache = FeatureCache(cache_dir=tmpdir)
+
+            # Lambdas cannot be pickled by the default protocol.
+            unpicklable = lambda x: x  # noqa: E731
+
+            with patch("featcopilot.utils.cache.logger") as mock_logger:
+                cache.set("bad", unpicklable)
+                assert mock_logger.warning.called, "FeatureCache must log on persist failure"
+                msg = mock_logger.warning.call_args[0][0]
+                assert "failed to persist cache entry" in msg
+
+    def test_corrupted_metadata_read_logs_warning(self):
+        """Regression: ``FeatureCache.get_metadata`` previously had a
+        bare ``except Exception: pass`` that hid corrupted JSON
+        metadata files. The fix logs a warning.
+        """
+        from pathlib import Path
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            cache = FeatureCache(cache_dir=tmpdir)
+            cache.set("k", "v", metadata={"src": "tabular"})
+
+            # Corrupt the metadata JSON.
+            meta_path = next(Path(tmpdir).glob("*.meta.json"))
+            meta_path.write_text("{not valid json")
+
+            # Force re-read from disk by clearing the in-memory metadata cache.
+            cache._metadata.clear()
+
+            with patch("featcopilot.utils.cache.logger") as mock_logger:
+                md = cache.get_metadata("k")
+                assert md is None
+                assert mock_logger.warning.called, "FeatureCache must log on metadata read failure"
+                msg = mock_logger.warning.call_args[0][0]
+                assert "failed to read cache metadata" in msg
+
 
 class TestFeatureCacheDataHash:
     """Tests for FeatureCache with data_hash parameter."""

From 3cd59a699c4aa01ee3eb14fd0faa5780e52ea863 Mon Sep 17 00:00:00 2001
From: Li Jiang <lijiang1@microsoft.com>
Date: Wed, 6 May 2026 15:07:59 +0800
Subject: [PATCH 2/4] fix(feature): single namespace for exec; per-call builtin
 copy; distinct error
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-1 review on PR #6 raised four threads (3x copilot-pull-request-reviewer,
1x codex). All addressed:

1. **Codex P2 (feature.py:176) — comprehensions/lambdas saw NameError**
   Free variables inside a comprehension or lambda body are resolved
   against the enclosing function's globals, NOT against the caller's
   locals dict passed to `exec`. The previous fix put `df`/`np`/
   `pd` only in `locals` so a snippet such as
   `[df['c'].iloc[i] for i in range(len(df))]` would still raise
   `NameError` and `FeatureSet.compute_all` would silently drop the
   feature. `Feature.compute` now uses a SINGLE shared namespace
   (passing one dict as both `globals` and `locals` to `exec`),
   so comprehensions / lambdas / generator expressions resolve `df`,
   `np`, `pd` and the safe builtins identically to the top-level
   statement scope.

2. **Copilot (feature.py:51) — `_SAFE_BUILTINS` mutation persisted**
   `_SAFE_BUILTINS` was passed to `exec` by reference, so a snippet
   that rebound entries (`__builtins__['len'] = lambda x: -999`)
   could leak the mutation into every subsequent `Feature.compute`
   call in the same process. `compute` now passes `dict(_SAFE_BUILTINS)`
   (a fresh shallow copy) on every call, mirroring
   `TransformRule._get_safe_builtins()` which already returns a fresh
   dict per call. Side effects are now strictly local to the single
   `exec` call.

3. **Copilot (feature.py:179) — misleading error when `result` missing**
   The docstring said `ValueError` would be raised when the executed
   code did not bind a `result` variable, but the implementation
   always surfaced `"No code defined for feature ..."` even when
   `self.code` was clearly present and only `result` was missing.
   The two branches now produce DISTINCT messages:
   - `self.code` empty → `"No code defined for feature ..."`
   - `self.code` present but no `result` → `"Feature '...' code did not produce a 'result' variable. Stored snippet must bind its output to a name called 'result'."`

4. **Copilot (docs/user-guide/cli.md:223) — misleading "by `--config llm_config`"**
   `--config` takes a JSON file path; `llm_config` is a key inside
   that JSON. Reworded the See-also bullet to "provide an `llm_config`
   object inside the JSON file passed to `--config`" and link back
   to the existing Configuration Files section.

Tests added (regression locks):

- `test_feature_compute_resolves_df_inside_comprehension` — list
  comprehension, lambda body, and generator expression all referencing
  `df` / `pd` / `len` / `range` succeed.
- `test_feature_compute_isolates_safe_builtins_across_calls` — runs
  a "poisoner" snippet that rebinds `__builtins__['len']`, then
  asserts a clean follow-up snippet still sees the original `len`.
- `test_feature_compute_distinguishes_missing_code_from_missing_result`
  — pins the two distinct `ValueError` messages.

Local: 915 tests pass (full suite); +3 new tests vs round-0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/user-guide/cli.md      |   5 +-
 featcopilot/core/feature.py |  65 ++++++++++++++++-----
 tests/test_core.py          | 112 ++++++++++++++++++++++++++++++++++++
 3 files changed, 166 insertions(+), 16 deletions(-)

diff --git a/docs/user-guide/cli.md b/docs/user-guide/cli.md
index de27b36..c832bd0 100644
--- a/docs/user-guide/cli.md
+++ b/docs/user-guide/cli.md
@@ -219,5 +219,6 @@ clean exit-2 error pointing at the missing dependency.
 
 * [Overview](overview.md) ─ the underlying `AutoFeatureEngineer` API.
 * [Engines](engines.md) ─ what each engine generates.
-* [LLM Features](llm-features.md) ─ configuring the LLM backend (used
-  by `--config llm_config`).
+* [LLM Features](llm-features.md) ─ configuring the LLM backend (provide
+  an `llm_config` object inside the JSON file passed to `--config`, as
+  shown in the [Configuration files](#configuration-files) section above).
diff --git a/featcopilot/core/feature.py b/featcopilot/core/feature.py
index 2feff77..b1fe375 100644
--- a/featcopilot/core/feature.py
+++ b/featcopilot/core/feature.py
@@ -148,12 +148,28 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
         """
         Compute feature values from DataFrame using stored code.
 
-        The stored ``code`` is executed with ``df``, ``np`` and ``pd`` available
-        as local variables and a curated set of safe Python builtins
-        (``len``, ``range``, ``sum``, numeric / sequence constructors, etc.)
-        in globals so common idioms work without giving the snippet
-        unrestricted access to imports / file I/O. The snippet must
-        bind its output to a local variable named ``result``.
+        The stored ``code`` is executed in a single shared namespace
+        with ``df``, ``np`` and ``pd`` bound as names alongside a
+        curated set of safe Python builtins (``len``, ``range``,
+        ``sum``, numeric / sequence constructors, etc.) so common
+        idioms work without giving the snippet unrestricted access to
+        imports / file I/O. The snippet must bind its output to a name
+        called ``result``.
+
+        A *fresh copy* of the safe-builtins dict is passed into ``exec``
+        on every call so that any mutation the snippet performs on
+        ``__builtins__`` (``del``, ``pop``, attribute assignment,
+        rebinding) does not bleed into subsequent ``compute`` calls.
+        Likewise the data-bound namespace is constructed fresh per
+        call. Using a SINGLE dict for both ``globals`` and ``locals``
+        is what makes free variables inside comprehensions and lambdas
+        — which Python resolves against the enclosing function's
+        globals, not the caller's locals — see ``df``, ``np`` and
+        ``pd`` correctly. With separate ``locals`` and ``globals``
+        dicts a snippet such as ``[df['c'].iloc[i] for i in
+        range(len(df))]`` would otherwise raise ``NameError`` because
+        the implicit comprehension function's body looks ``df`` up in
+        the (empty) ``globals``.
 
         Parameters
         ----------
@@ -168,15 +184,36 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
         Raises
         ------
         ValueError
-            If ``self.code`` is empty / missing or if the executed code
-            does not bind a ``result`` variable.
+            * If ``self.code`` is empty / missing — message
+              ``"No code defined for feature ..."``.
+            * If ``self.code`` is present but did not bind a
+              ``result`` variable — message
+              ``"Feature ... code did not produce a 'result' variable"``.
+              These two cases produce DIFFERENT messages so a failing
+              snippet is distinguishable from an unset feature when
+              debugging.
         """
-        if self.code:
-            local_vars = {"df": df, "np": np, "pd": pd}
-            exec(self.code, {"__builtins__": _SAFE_BUILTINS}, local_vars)
-            if "result" in local_vars:
-                return local_vars["result"]
-        raise ValueError(f"No code defined for feature {self.name}")
+        if not self.code:
+            raise ValueError(f"No code defined for feature {self.name}")
+
+        # Single shared namespace so comprehensions / lambdas /
+        # generator expressions inside the snippet see ``df``, ``np``,
+        # ``pd`` and the safe builtins. Fresh dicts per call so the
+        # snippet cannot pollute either the safe-builtins whitelist or
+        # the data bindings for later ``compute`` invocations.
+        namespace: dict[str, Any] = {
+            "__builtins__": dict(_SAFE_BUILTINS),
+            "df": df,
+            "np": np,
+            "pd": pd,
+        }
+        exec(self.code, namespace)
+        if "result" not in namespace:
+            raise ValueError(
+                f"Feature {self.name!r} code did not produce a 'result' variable. "
+                "Stored snippet must bind its output to a name called 'result'."
+            )
+        return namespace["result"]
 
 
 class FeatureSet:
diff --git a/tests/test_core.py b/tests/test_core.py
index c014df3..66b2610 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -96,6 +96,118 @@ def test_feature_compute_uses_safe_builtins(self):
         result = f_clip.compute(df)
         assert list(result) == [15, 5, 5, 15, 20]
 
+    def test_feature_compute_resolves_df_inside_comprehension(self):
+        """Regression for round-2 review: when a snippet uses a
+        comprehension or lambda whose body references ``df``, ``np`` or
+        ``pd``, Python resolves those free variables against the
+        enclosing function's *globals*, not the caller's locals. With
+        a previous implementation that passed ``df`` only via the
+        ``locals`` argument to ``exec`` (and had ``globals`` set to
+        only ``__builtins__``), the inner reference would fail with
+        ``NameError`` and the call would either crash or be silently
+        dropped by ``FeatureSet.compute_all``. The fix uses a single
+        shared namespace for both ``globals`` and ``locals``.
+        """
+        df = pd.DataFrame({"col1": [1, 2, 3, 4, 5]})
+
+        # List comprehension whose body references ``df`` and uses
+        # ``range`` / ``len`` builtins — this is the exact pattern
+        # called out in the round-2 review.
+        f_compre = Feature(
+            name="from_comprehension",
+            code="result = pd.Series([df['col1'].iloc[i] * 2 for i in range(len(df))])",
+        )
+        result = f_compre.compute(df)
+        assert list(result) == [2, 4, 6, 8, 10]
+
+        # Lambda whose body references ``df`` and ``pd`` — same
+        # globals-resolution rule applies.
+        f_lambda = Feature(
+            name="from_lambda",
+            code=(
+                "fn = lambda i: int(df['col1'].iloc[i]) ** 2\n" "result = pd.Series([fn(i) for i in range(len(df))])"
+            ),
+        )
+        result = f_lambda.compute(df)
+        assert list(result) == [1, 4, 9, 16, 25]
+
+        # Generator expression piped into a builtin — exercises the
+        # iterator-protocol path through the same namespace.
+        f_gen = Feature(
+            name="from_generator",
+            code="result = pd.Series([sum(v for v in df['col1'] if v <= n) for n in df['col1']])",
+        )
+        result = f_gen.compute(df)
+        # cumulative sums for thresholds 1, 2, 3, 4, 5: 1, 3, 6, 10, 15
+        assert list(result) == [1, 3, 6, 10, 15]
+
+    def test_feature_compute_isolates_safe_builtins_across_calls(self):
+        """Regression: ``_SAFE_BUILTINS`` must NOT be passed to ``exec``
+        by reference. If it were, a malicious / buggy snippet could
+        delete or rebind entries (e.g. ``del __builtins__['len']``)
+        and the mutation would persist across subsequent ``Feature.compute``
+        calls in the same process, breaking unrelated features. The
+        fix passes ``dict(_SAFE_BUILTINS)`` (a shallow copy) per call
+        so that side effects are local to that single ``exec``.
+        """
+        df = pd.DataFrame({"col1": [1, 2, 3]})
+
+        # Snippet that rebinds ``len`` inside its own builtins view.
+        # If the implementation passed _SAFE_BUILTINS by reference,
+        # this would corrupt the whitelist for every feature that runs
+        # after it.
+        f_attacker = Feature(
+            name="poisoner",
+            code=(
+                "import builtins\n"
+                # We can't directly mutate __builtins__ because exec
+                # turns the dict into a module, but we CAN rebind keys.
+                "_b = __builtins__\n"
+                "if isinstance(_b, dict):\n"
+                "    _b['len'] = lambda x: -999\n"
+                "result = pd.Series([1, 2, 3])"
+            ),
+        )
+        # The attacker may or may not actually corrupt anything —
+        # depending on Python's exec semantics — but the next call
+        # MUST behave correctly regardless.
+        try:
+            f_attacker.compute(df)
+        except Exception:
+            # Even if the attacker snippet itself fails, the isolation
+            # contract still applies to the next call.
+            pass
+
+        # A clean, well-behaved feature run AFTER the attacker must
+        # see the original ``len`` builtin, not the poisoned version.
+        f_victim = Feature(
+            name="clean_user",
+            code="result = pd.Series([len(df)] * len(df))",
+        )
+        result = f_victim.compute(df)
+        assert list(result) == [3, 3, 3], (
+            "Per-call dict copy of _SAFE_BUILTINS is broken: a previous " "feature's mutation leaked into a later call"
+        )
+
+    def test_feature_compute_distinguishes_missing_code_from_missing_result(self):
+        """Regression: the ``ValueError`` raised by ``Feature.compute``
+        must use DIFFERENT messages for the two distinct failure
+        modes (``self.code`` empty / missing vs. snippet ran but did
+        not bind ``result``). Previously both paths surfaced the
+        misleading ``"No code defined ..."`` message even when the
+        code was clearly present, making debugging harder.
+        """
+        # No code at all → "No code defined" branch.
+        f_empty = Feature(name="no_code")
+        with pytest.raises(ValueError, match="No code defined for feature no_code"):
+            f_empty.compute(pd.DataFrame({"x": [1]}))
+
+        # Code present but does not produce ``result`` → distinct
+        # message that points at the actual cause.
+        f_no_result = Feature(name="no_result", code="x = df['col1'] * 2")
+        with pytest.raises(ValueError, match=r"Feature 'no_result' code did not produce a 'result' variable"):
+            f_no_result.compute(pd.DataFrame({"col1": [1, 2, 3]}))
+
 
 class TestFeatureSet:
     """Tests for FeatureSet class."""

From a9c25ec5d97c4d4d1c3032ffe25de092e45358e4 Mon Sep 17 00:00:00 2001
From: Li Jiang <lijiang1@microsoft.com>
Date: Wed, 6 May 2026 16:03:29 +0800
Subject: [PATCH 3/4] docs(cli): correct --leakage-guard values; pin
 doc-vs-source agreement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-2 review on PR #6: Codex P2 noted that the user-guide CLI doc
listed `--leakage-guard` choices as `warn` / `block` / `ignore`
and the `info` JSON sample showed `["block", "ignore", "warn"]`,
but the actual `AutoFeatureEngineer.SUPPORTED_LEAKAGE_GUARDS` set is
`{"off", "warn", "raise"}` (per
`featcopilot/transformers/sklearn_compat.py:136`) and the argparse
parser validates against `sorted(SUPPORTED_LEAKAGE_GUARDS)`. Users
following the doc with `--leakage-guard block` would get an
argparse error rather than the described behavior.

Two doc fixes:
- `"supported_leakage_guards": ["block", "ignore", "warn"]` →
  `"supported_leakage_guards": ["off", "raise", "warn"]` in the
  `info` JSON sample.
- `--leakage-guard` row in the transform flags table now lists the
  three actual values (`warn`, `raise`, `off`) with a short
  description of what each does.

Added `test_cli_doc_leakage_guard_values_match_source_of_truth` which
reads `docs/user-guide/cli.md` and asserts (a) the JSON sample
contains the literal sorted-list of `SUPPORTED_LEAKAGE_GUARDS` and
(b) every valid value appears as a backtick-quoted token AND no
non-existent value (`block`, `ignore`) appears anywhere on the
page. Future drift in either direction (renaming a guard, adding a
new one, removing one) will fail this test fast.

Local: 916 tests pass (full suite); pre-commit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/user-guide/cli.md |  4 ++--
 tests/test_cli.py      | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/docs/user-guide/cli.md b/docs/user-guide/cli.md
index c832bd0..8345a5d 100644
--- a/docs/user-guide/cli.md
+++ b/docs/user-guide/cli.md
@@ -61,7 +61,7 @@ Sample (truncated) output:
   "version": "0.3.7",
   "supported_engines": ["relational", "tabular", "text", "timeseries"],
   "supported_selection_methods": ["correlation", "importance", "mutual_info"],
-  "supported_leakage_guards": ["block", "ignore", "warn"],
+  "supported_leakage_guards": ["off", "raise", "warn"],
   "supported_input_formats": ["csv", "json"],
   "supported_output_formats": ["csv", "json"],
   "parquet_available": false
@@ -97,7 +97,7 @@ Common flags:
 | `--max-features N` | Cap on engine output / selection. Forwarded both to engine constructors and to the selector. |
 | `--no-selection` | Skip feature selection entirely (raw feature generation). |
 | `--selection-methods` | Override the default `mutual_info importance` selection set. |
-| `--leakage-guard` | How to handle suspicious column names: `warn` (default), `block`, or `ignore`. |
+| `--leakage-guard` | How to handle suspicious column names: `warn` (default — log a warning and continue), `raise` (hard-fail with an error), or `off` (disable the check). |
 | `--include-target` | Re-attach the target column to the output file (collision-safe). |
 | `--task-description` | Free-form ML task description forwarded to LLM-aware engines. |
 | `--config FILE` | JSON config with nested keys (e.g. `llm_config`, `selection_methods`). CLI flags override config values. |
diff --git a/tests/test_cli.py b/tests/test_cli.py
index 2f8c1f3..c7e3cfa 100644
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -124,6 +124,44 @@ def test_info_text_mode_is_human_readable():
     assert __version__ in out
 
 
+def test_cli_doc_leakage_guard_values_match_source_of_truth():
+    """Regression for round-1 review: the user-guide CLI doc previously
+    listed ``--leakage-guard`` choices as ``warn`` / ``block`` / ``ignore``,
+    but the actual ``AutoFeatureEngineer.SUPPORTED_LEAKAGE_GUARDS`` set is
+    ``{off, warn, raise}``. Following the doc with
+    ``--leakage-guard block`` produced an argparse error rather than the
+    described behavior. This test pins the doc against the source of
+    truth so any future drift fails fast.
+    """
+    from featcopilot.transformers.sklearn_compat import AutoFeatureEngineer
+
+    expected = sorted(AutoFeatureEngineer.SUPPORTED_LEAKAGE_GUARDS)
+    cli_doc = (Path(__file__).resolve().parent.parent / "docs" / "user-guide" / "cli.md").read_text(encoding="utf-8")
+
+    # The ``info`` JSON sample MUST list the actual values (sorted).
+    expected_json_fragment = '"supported_leakage_guards": ' + json.dumps(expected)
+    assert (
+        expected_json_fragment in cli_doc
+    ), f"docs/user-guide/cli.md must show {expected_json_fragment!r} in the info sample; got mismatch with source of truth"
+
+    # The ``--leakage-guard`` row MUST mention each valid value as
+    # ```value`` and MUST NOT mention any value that is not actually
+    # accepted. Using a substring check on the literal backtick-quoted
+    # token avoids false positives from prose elsewhere on the page
+    # (e.g. ``warn`` appears in unrelated wording).
+    invalid_examples = ("`block`", "`ignore`")
+    for bad in invalid_examples:
+        assert bad not in cli_doc, (
+            f"docs/user-guide/cli.md must not advertise {bad} as a leakage_guard value; "
+            f"actual valid values are {expected}"
+        )
+    for value in expected:
+        assert f"`{value}`" in cli_doc, (
+            f"docs/user-guide/cli.md must advertise `{value}` as a valid leakage_guard value; "
+            f"current values are {expected}"
+        )
+
+
 def test_top_level_version_flag(capsys):
     # ``--version`` (argparse action) prints to stdout; main() now traps the
     # SystemExit and returns the code so the API contract is consistent.

From e5b7a1976ec62082a7b8991f906fe3d2bb56e512 Mon Sep 17 00:00:00 2001
From: Li Jiang <lijiang1@microsoft.com>
Date: Wed, 6 May 2026 16:25:29 +0800
Subject: [PATCH 4/4] fix: rigorous builtin-isolation test; complete info
 sample; honest docstring
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-2 review on PR #6 (4 new threads). All addressed:

1. **Codex P3 (test_core.py:162) + Copilot (test_core.py:168) — broken
   isolation test.** `test_feature_compute_isolates_safe_builtins_across_calls`
   started its attacker snippet with `import builtins`, but
   `__import__` is intentionally NOT in the safe-builtins whitelist,
   so the snippet raised at the first line and the surrounding
   `try/except` swallowed the failure. Result: the test would still
   pass even if `_SAFE_BUILTINS` were passed by reference, so it was
   not actually pinning the per-call-copy contract it claims to
   protect. Rewritten to mutate `__builtins__` directly via the dict
   that `exec` sees as the snippet's builtins source AND assert
   inside the snippet (`len(df) == -999`) that the rebinding took
   effect. Now requires BOTH the in-snippet mutation to succeed AND
   the next call to see the original `len`: a real regression check.

2. **Codex P2 (cli.md:63) — info sample omitted valid choices.** The
   sample showed `supported_engines: ["relational", "tabular", "text",
   "timeseries"]` (missing `llm`) and
   `supported_selection_methods: ["correlation", "importance",
   "mutual_info"]` (missing `chi2`, `f_test`, `xgboost`). The
   sample now lists the FULL sorted set from
   `AutoFeatureEngineer.SUPPORTED_*` and a follow-up paragraph
   clarifies the parquet-conditional behaviour for the I/O format
   arrays. Added `test_cli_doc_info_sample_matches_engines_and_selection_methods`
   which pins both lists against the source of truth.

3. **Copilot (feature.py:156) — misleading docstring.** The previous
   docstring claimed the snippet ran "without giving the snippet
   unrestricted access to imports / file I/O". That's incorrect:
   `pd` is in scope, which means `pd.read_csv` /
   `pd.read_parquet` / `df.to_csv` are all reachable. The fix
   reframes the doc accurately: `__import__` is not in the safe
   builtins (so plain `import foo` raises) but the runtime is NOT a
   security sandbox; stored snippets must come from a trusted source.

Verification

- 917 tests pass (full suite); pre-commit clean.
- The new attacker-isolation test would FAIL if `_SAFE_BUILTINS` were
  passed by reference (the `-999` rebinding would leak into the
  next call's `[len(df)] * len(df)` and produce `[-999, -999, -999]`
  instead of `[3, 3, 3]`); confirmed locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/user-guide/cli.md      | 16 ++++++++--
 featcopilot/core/feature.py | 43 ++++++++++++++++---------
 tests/test_cli.py           | 42 +++++++++++++++++++++++++
 tests/test_core.py          | 63 +++++++++++++++++++------------------
 4 files changed, 117 insertions(+), 47 deletions(-)

diff --git a/docs/user-guide/cli.md b/docs/user-guide/cli.md
index 8345a5d..ed68367 100644
--- a/docs/user-guide/cli.md
+++ b/docs/user-guide/cli.md
@@ -59,8 +59,15 @@ Sample (truncated) output:
 ```json
 {
   "version": "0.3.7",
-  "supported_engines": ["relational", "tabular", "text", "timeseries"],
-  "supported_selection_methods": ["correlation", "importance", "mutual_info"],
+  "supported_engines": ["llm", "relational", "tabular", "text", "timeseries"],
+  "supported_selection_methods": [
+    "chi2",
+    "correlation",
+    "f_test",
+    "importance",
+    "mutual_info",
+    "xgboost"
+  ],
   "supported_leakage_guards": ["off", "raise", "warn"],
   "supported_input_formats": ["csv", "json"],
   "supported_output_formats": ["csv", "json"],
@@ -68,6 +75,11 @@ Sample (truncated) output:
 }
 ```
 
+When a parquet engine (`pyarrow` or `fastparquet`) IS importable in the
+current environment, `"parquet"` is added to `supported_input_formats`
+and `supported_output_formats` (in source order, so the lists become
+`["csv", "parquet", "json"]`) and `parquet_available` flips to `true`.
+
 `parquet_available` reflects whether `pyarrow` or `fastparquet` is
 importable in the current environment. The base FeatCopilot install does
 not pin a parquet engine; install one with
diff --git a/featcopilot/core/feature.py b/featcopilot/core/feature.py
index b1fe375..dc13209 100644
--- a/featcopilot/core/feature.py
+++ b/featcopilot/core/feature.py
@@ -152,24 +152,37 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
         with ``df``, ``np`` and ``pd`` bound as names alongside a
         curated set of safe Python builtins (``len``, ``range``,
         ``sum``, numeric / sequence constructors, etc.) so common
-        idioms work without giving the snippet unrestricted access to
-        imports / file I/O. The snippet must bind its output to a name
-        called ``result``.
+        idioms work without giving the snippet a Python import system
+        — ``__import__`` is intentionally NOT in the safe builtins, so
+        an ``import foo`` statement inside the snippet raises at exec
+        time. The snippet must bind its output to a name called
+        ``result``.
+
+        .. note::
+           This is **not** a security sandbox for untrusted code.
+           ``pd`` is in scope, which means the snippet can reach
+           pandas' file I/O helpers (``pd.read_csv``, ``pd.read_parquet``,
+           ``df.to_csv``, ...), and dunder attribute access on objects
+           reachable from ``df`` / ``np`` / ``pd`` is not blocked. The
+           builtin whitelist limits the *namespace* available to plain
+           Python idioms; it does not isolate FeatCopilot from the
+           ambient process. Stored snippets must therefore come from a
+           trusted source (your own code generator, a vetted feature
+           store, or a transform-rule registry you control).
 
         A *fresh copy* of the safe-builtins dict is passed into ``exec``
         on every call so that any mutation the snippet performs on
-        ``__builtins__`` (``del``, ``pop``, attribute assignment,
-        rebinding) does not bleed into subsequent ``compute`` calls.
-        Likewise the data-bound namespace is constructed fresh per
-        call. Using a SINGLE dict for both ``globals`` and ``locals``
-        is what makes free variables inside comprehensions and lambdas
-        — which Python resolves against the enclosing function's
-        globals, not the caller's locals — see ``df``, ``np`` and
-        ``pd`` correctly. With separate ``locals`` and ``globals``
-        dicts a snippet such as ``[df['c'].iloc[i] for i in
-        range(len(df))]`` would otherwise raise ``NameError`` because
-        the implicit comprehension function's body looks ``df`` up in
-        the (empty) ``globals``.
+        ``__builtins__`` (rebinding entries, ``del``, ``pop``) does not
+        bleed into subsequent ``compute`` calls. Likewise the
+        data-bound namespace is constructed fresh per call. Using a
+        SINGLE dict for both ``globals`` and ``locals`` is what makes
+        free variables inside comprehensions and lambdas — which Python
+        resolves against the enclosing function's globals, not the
+        caller's locals — see ``df``, ``np`` and ``pd`` correctly.
+        With separate ``locals`` and ``globals`` dicts a snippet such
+        as ``[df['c'].iloc[i] for i in range(len(df))]`` would
+        otherwise raise ``NameError`` because the implicit comprehension
+        function's body looks ``df`` up in the (empty) ``globals``.
 
         Parameters
         ----------
diff --git a/tests/test_cli.py b/tests/test_cli.py
index c7e3cfa..4cb0e9b 100644
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -162,6 +162,48 @@ def test_cli_doc_leakage_guard_values_match_source_of_truth():
         )
 
 
+def test_cli_doc_info_sample_matches_engines_and_selection_methods():
+    """Regression for round-2 review: the CLI doc's ``info`` JSON sample
+    must list the FULL sorted ``SUPPORTED_ENGINES`` and
+    ``SUPPORTED_SELECTION_METHODS`` sets, not a truncated subset.
+
+    Round-2 specifically caught that the sample omitted ``llm`` from
+    ``supported_engines`` and ``chi2`` / ``f_test`` / ``xgboost`` from
+    ``supported_selection_methods``. Even with a "truncated" caveat the
+    arrays look complete and steered users away from valid CLI choices,
+    so we now pin the entire sorted list against
+    ``AutoFeatureEngineer.SUPPORTED_*`` so future drift in either
+    direction (adding a new engine, renaming a method) fails fast.
+    """
+    from featcopilot.transformers.sklearn_compat import AutoFeatureEngineer
+
+    cli_doc = (Path(__file__).resolve().parent.parent / "docs" / "user-guide" / "cli.md").read_text(encoding="utf-8")
+
+    expected_engines = sorted(AutoFeatureEngineer.SUPPORTED_ENGINES)
+    engines_fragment = '"supported_engines": ' + json.dumps(expected_engines)
+    assert engines_fragment in cli_doc, (
+        f"docs/user-guide/cli.md ``info`` sample must show {engines_fragment!r}; "
+        "the JSON sample drifted away from AutoFeatureEngineer.SUPPORTED_ENGINES"
+    )
+
+    expected_methods = sorted(AutoFeatureEngineer.SUPPORTED_SELECTION_METHODS)
+    # The methods array is multi-line in the doc (one entry per line)
+    # for readability, so check that each value is present as a
+    # JSON-quoted string in the sample. We anchor the search inside the
+    # ``"supported_selection_methods": [`` block to avoid false
+    # positives from prose like "``mutual_info``".
+    sm_marker = '"supported_selection_methods": ['
+    sm_idx = cli_doc.find(sm_marker)
+    assert sm_idx != -1, "docs/user-guide/cli.md must contain a supported_selection_methods JSON sample block"
+    sm_block_end = cli_doc.find("]", sm_idx)
+    sm_block = cli_doc[sm_idx:sm_block_end]
+    for method in expected_methods:
+        assert f'"{method}"' in sm_block, (
+            f"docs/user-guide/cli.md ``info`` sample must list {method!r} in supported_selection_methods; "
+            f"current set is {expected_methods}"
+        )
+
+
 def test_top_level_version_flag(capsys):
     # ``--version`` (argparse action) prints to stdout; main() now traps the
     # SystemExit and returns the code so the API contract is consistent.
diff --git a/tests/test_core.py b/tests/test_core.py
index 66b2610..a4c4b59 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -143,40 +143,43 @@ def test_feature_compute_resolves_df_inside_comprehension(self):
 
     def test_feature_compute_isolates_safe_builtins_across_calls(self):
         """Regression: ``_SAFE_BUILTINS`` must NOT be passed to ``exec``
-        by reference. If it were, a malicious / buggy snippet could
-        delete or rebind entries (e.g. ``del __builtins__['len']``)
-        and the mutation would persist across subsequent ``Feature.compute``
-        calls in the same process, breaking unrelated features. The
-        fix passes ``dict(_SAFE_BUILTINS)`` (a shallow copy) per call
-        so that side effects are local to that single ``exec``.
+        by reference. If it were, a snippet that rebinds entries in its
+        own ``__builtins__`` view (e.g. ``__builtins__['len'] = lambda
+        x: -999``) would leak that mutation into every subsequent
+        ``Feature.compute`` call in the same process, breaking unrelated
+        features. The fix passes ``dict(_SAFE_BUILTINS)`` (a fresh
+        shallow copy) per call so that side effects are local to that
+        single ``exec``.
+
+        Round-2 review note: an earlier version of this test started
+        the attacker snippet with ``import builtins``, but ``__import__``
+        is intentionally NOT in the safe-builtins whitelist, so the
+        snippet raised at the very first line and the surrounding
+        ``try/except`` swallowed the failure ─ meaning the test would
+        pass even if ``_SAFE_BUILTINS`` were still being shared by
+        reference. This rewrite mutates ``__builtins__`` directly via
+        the dict that ``exec`` sees as the snippet's builtins source,
+        AND asserts inside the snippet (via ``len(df) == -999``) that
+        the rebinding actually took effect. That dual assertion is
+        what makes the test a real regression check.
         """
         df = pd.DataFrame({"col1": [1, 2, 3]})
 
-        # Snippet that rebinds ``len`` inside its own builtins view.
-        # If the implementation passed _SAFE_BUILTINS by reference,
-        # this would corrupt the whitelist for every feature that runs
-        # after it.
+        # Attacker snippet: rebind ``len`` in this call's __builtins__
+        # dict. The trailing ``result = pd.Series([len(df)])`` would
+        # yield 3 if the rebinding had failed, and -999 if the
+        # rebinding succeeded. That's the in-snippet proof that
+        # mutation IS possible within a single exec ─ which is what
+        # makes the per-call-copy isolation contract meaningful.
         f_attacker = Feature(
             name="poisoner",
-            code=(
-                "import builtins\n"
-                # We can't directly mutate __builtins__ because exec
-                # turns the dict into a module, but we CAN rebind keys.
-                "_b = __builtins__\n"
-                "if isinstance(_b, dict):\n"
-                "    _b['len'] = lambda x: -999\n"
-                "result = pd.Series([1, 2, 3])"
-            ),
+            code=("__builtins__['len'] = lambda x: -999\nresult = pd.Series([len(df)])\n"),
+        )
+        poisoned = f_attacker.compute(df)
+        assert list(poisoned) == [-999], (
+            "Attacker snippet failed to mutate __builtins__: the regression test "
+            "is no longer exercising the isolation path it claims to protect"
         )
-        # The attacker may or may not actually corrupt anything —
-        # depending on Python's exec semantics — but the next call
-        # MUST behave correctly regardless.
-        try:
-            f_attacker.compute(df)
-        except Exception:
-            # Even if the attacker snippet itself fails, the isolation
-            # contract still applies to the next call.
-            pass
 
         # A clean, well-behaved feature run AFTER the attacker must
         # see the original ``len`` builtin, not the poisoned version.
@@ -184,8 +187,8 @@ def test_feature_compute_isolates_safe_builtins_across_calls(self):
             name="clean_user",
             code="result = pd.Series([len(df)] * len(df))",
         )
-        result = f_victim.compute(df)
-        assert list(result) == [3, 3, 3], (
+        clean = f_victim.compute(df)
+        assert list(clean) == [3, 3, 3], (
             "Per-call dict copy of _SAFE_BUILTINS is broken: a previous " "feature's mutation leaked into a later call"
         )