Merged
2 changes: 1 addition & 1 deletion docs/getting-started/installation.md
@@ -79,7 +79,7 @@ Development dependencies include:
```python
import featcopilot
print(featcopilot.__version__)
-# Output: 0.1.0
+# Output: 0.3.7

from featcopilot import AutoFeatureEngineer
print("Installation successful!")
236 changes: 236 additions & 0 deletions docs/user-guide/cli.md
@@ -0,0 +1,236 @@
# Command-Line Interface

FeatCopilot ships a stable, agent-friendly `featcopilot` CLI for using the
library from shells, CI pipelines, and **agentic / LLM tool-use** workflows
without writing Python glue. All subcommands accept `--json` for
machine-readable stdout; user-facing errors are written to **stderr** with
a non-zero exit code so that automation can parse failures
deterministically.

The CLI is installed automatically with the package via the
`[project.scripts]` entry point (`featcopilot = "featcopilot.cli:main"`),
so after `pip install featcopilot` the `featcopilot` command is available
on `$PATH`. The equivalent module form `python -m featcopilot ...` always
works regardless of how the package was installed.

## Subcommands

| Command | Purpose |
| --- | --- |
| `featcopilot info` | Print version, supported engines, selection methods, leakage guards, I/O formats, and a runtime `parquet_available` flag. |
| `featcopilot transform` | Read a CSV / Parquet / JSON file, run [`AutoFeatureEngineer`](../user-guide/overview.md), and write engineered features to an output file. |
| `featcopilot explain` | Fit and print a JSON document with `{name, explanation, code}` per feature for downstream LLM consumption (no output file is written). |

Run any subcommand with `--help` to see the full flag list:

```bash
featcopilot --help
featcopilot transform --help
featcopilot explain --help
```

## Output contract

All three subcommands honor the same agent-friendly contract:

* **`stdout`** carries the result. With `--json` (always implicit for
`explain`), exactly one JSON document is written.
* **`stderr`** is reserved for failures. A successful run keeps `stderr`
empty even when `AutoFeatureEngineer` emits leakage warnings or
`verbose` logger output; those are surfaced via the JSON payload's
`warnings` field instead. The same contract covers warnings emitted
during pandas / pyarrow read or write phases (e.g. `DtypeWarning` on
mixed-type CSVs, `FutureWarning` from a successful Parquet write):
they are routed to the JSON `warnings` field, never to `stderr`.
* **Exit codes**: `0` on success; `2` for user-input errors (missing
files, malformed config, unknown target, etc.); `1` for unexpected
internal errors.
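
A minimal sketch of how a caller might apply this contract to one finished run. The `CliResult` type and `interpret` helper are hypothetical names for illustration and are not part of FeatCopilot; only the exit codes and the stdout / stderr split come from the contract above.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CliResult:
    ok: bool
    kind: str  # "ok", "user_error", or "internal_error"
    payload: dict = field(default_factory=dict)
    message: str = ""


def interpret(returncode: int, stdout: str, stderr: str) -> CliResult:
    """Classify one finished `featcopilot ... --json` run (hypothetical helper)."""
    if returncode == 0:
        # Success: stdout carries exactly one JSON document.
        return CliResult(ok=True, kind="ok", payload=json.loads(stdout))
    # Exit code 2 signals a user-input error; this sketch treats every
    # other non-zero code as an unexpected internal error.
    kind = "user_error" if returncode == 2 else "internal_error"
    return CliResult(ok=False, kind=kind, message=stderr.strip())
```

In practice the three arguments would come from something like `subprocess.run([...], capture_output=True, text=True)`.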

## `featcopilot info`

Discover capabilities without running an engineer:

```bash
featcopilot info --json
```

Sample (truncated) output:

```json
{
"version": "0.3.7",
"supported_engines": ["llm", "relational", "tabular", "text", "timeseries"],
"supported_selection_methods": [
"chi2",
"correlation",
"f_test",
"importance",
"mutual_info",
"xgboost"
],
"supported_leakage_guards": ["off", "raise", "warn"],
"supported_input_formats": ["csv", "json"],
"supported_output_formats": ["csv", "json"],
"parquet_available": false
}
```

When a parquet engine (`pyarrow` or `fastparquet`) IS importable in the
current environment, `"parquet"` is added to `supported_input_formats`
and `supported_output_formats` (in source order, so the lists become
`["csv", "parquet", "json"]`) and `parquet_available` flips to `true`.

The base FeatCopilot install does not pin a parquet engine; install one
with `pip install pyarrow` (or `fastparquet`) to enable Parquet I/O.
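
The following is a rough sketch of how such an availability flag could be derived, assuming an importability probe via `importlib.util.find_spec`. The function names are hypothetical, and the actual detection logic inside FeatCopilot may differ.

```python
from importlib.util import find_spec


def parquet_available(has_module=lambda name: find_spec(name) is not None) -> bool:
    """True if either parquet engine can be imported (illustrative only)."""
    return any(has_module(name) for name in ("pyarrow", "fastparquet"))


def supported_formats(parquet: bool) -> list:
    # Mirrors the documented lists: "parquet" sits between "csv" and "json".
    return ["csv", "parquet", "json"] if parquet else ["csv", "json"]
```

The `has_module` parameter exists only so the probe can be swapped out in tests.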

## `featcopilot transform`

Run feature engineering on a tabular input and write the engineered
features to disk:

```bash
featcopilot transform \
--input data.csv --target label --output features.csv \
--engines tabular --max-features 50 \
--json
```

Common flags:

| Flag | Purpose |
| --- | --- |
| `--input / -i` | Path to input file (CSV / Parquet / JSON). Required. |
| `--output / -o` | Path to output file. Required. |
| `--target / -t` | Target column. Required when feature selection is applied (i.e. when `--max-features` / config `max_features` is set). |
| `--input-format` / `--output-format` | Override format detection (`csv` / `parquet` / `json`). |
| `--engines` | One or more engines to enable (default: `tabular`). |
| `--max-features N` | Cap on engine output / selection. Forwarded both to engine constructors and to the selector. |
| `--no-selection` | Skip feature selection entirely (raw feature generation). |
| `--selection-methods` | Override the default `mutual_info importance` selection set. |
| `--leakage-guard` | How to handle suspicious column names: `warn` (default — log a warning and continue), `raise` (hard-fail with an error), or `off` (disable the check). |
| `--include-target` | Re-attach the target column to the output file (collision-safe). |
| `--task-description` | Free-form ML task description forwarded to LLM-aware engines. |
| `--config FILE` | JSON config with nested keys (e.g. `llm_config`, `selection_methods`). CLI flags override config values. |
| `--verbose / --no-verbose` | Toggle verbose logging. With `--json`, log records are routed to the JSON `warnings` field rather than `stderr`. |
| `--gate-n-jobs` | Parallelism for the random forest (RF) used by the do-no-harm gate (default `1`; `-1` uses all cores). |
| `--json` | Emit a one-line JSON status object on stdout instead of human-readable text. |
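
For callers that assemble the command programmatically, a helper along these lines may be useful. It is a sketch only: `build_transform_argv` is a hypothetical name, and passing several engines space-separated after a single `--engines` flag is an assumption based on the table above rather than confirmed CLI behavior.

```python
from typing import Optional, Sequence

LEAKAGE_GUARDS = ("warn", "raise", "off")


def build_transform_argv(
    input_path: str,
    output_path: str,
    *,
    target: Optional[str] = None,
    engines: Sequence[str] = ("tabular",),
    max_features: Optional[int] = None,
    leakage_guard: str = "warn",
) -> list:
    """Assemble an argv list for `featcopilot transform --json`."""
    if max_features is not None and target is None:
        # Per the table, selection driven by --max-features needs a target.
        raise ValueError("--target is required when --max-features is set")
    if leakage_guard not in LEAKAGE_GUARDS:
        raise ValueError(f"unknown leakage guard: {leakage_guard!r}")

    argv = ["featcopilot", "transform", "--input", input_path, "--output", output_path]
    if target is not None:
        argv += ["--target", target]
    # Assumption: several engines follow one --engines flag, space-separated.
    argv += ["--engines", *engines]
    if max_features is not None:
        argv += ["--max-features", str(max_features)]
    argv += ["--leakage-guard", leakage_guard, "--json"]
    return argv
```

Building a list rather than a shell string avoids quoting problems when the result is handed to `subprocess.run`.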

A successful `--json` run prints something like:

```json
{
"status": "ok",
"input": "data.csv",
"output": "features.csv",
"input_format": "csv",
"output_format": "csv",
"n_rows": 1000,
"n_features": 47,
"n_input_columns": 12,
"n_generated_features": 47,
"engines": ["tabular"],
"selection_methods": ["mutual_info", "importance"],
"max_features": 50,
"target": "label",
"selection_applied": true,
"warnings": []
}
```

## `featcopilot explain`

Fit the engineer (without writing any output file) and print a JSON
catalog of generated features for downstream LLM consumption:

```bash
featcopilot explain --input data.csv --target label
```

Each entry in the `features` array contains the feature `name`, an
LLM-style natural-language `explanation`, and the executable Python
`code` used to produce it.

`explain` defaults to running on the **full** input so the metadata is
a faithful description of what a corresponding `transform` would
generate. Some engines (notably the tabular engine's categorical
encoding) consult per-row / per-category statistics when planning
features, so blind subsampling can silently change results. For very
large inputs where metadata-only `explain` should not pay full memory
or compute cost, opt in with:

```bash
featcopilot explain --input big.csv --target label --explain-sample-size 5000
```

The cap is a deterministic *head slice* (the first N rows). For CSV it
is threaded through `pd.read_csv(nrows=N)`, so memory is bounded
natively. For Parquet / JSON, pandas has no native row limit, so the
file is fully read and then truncated; a `UserWarning` explaining the
limitation is emitted (and surfaced in the JSON `warnings` field) only
when the cap actually truncates the input.
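
The "warn only when the cap actually truncates" rule can be sketched as below. This is an illustration over plain Python sequences rather than the CLI's real pandas-based reader, and `cap_rows` and its `native_row_limit` parameter are hypothetical names.

```python
import warnings
from typing import Optional, Sequence


def cap_rows(rows: Sequence, cap: Optional[int], *, native_row_limit: bool) -> list:
    """Return the first `cap` rows; warn only if a full read was truncated."""
    if cap is None or len(rows) <= cap:
        # No cap requested, or the input already fits: nothing to report.
        return list(rows)
    if not native_row_limit:
        # Formats without a native row limit were read in full first.
        warnings.warn(
            f"input was fully read before truncation to the first {cap} rows",
            UserWarning,
            stacklevel=2,
        )
    return list(rows[:cap])
```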

## Configuration files

Pass `--config config.json` to provide nested keys that don't have
matching CLI flags, such as the `llm_config` engine kwargs:

```json
{
"engines": ["tabular", "llm"],
"max_features": 80,
"selection_methods": ["mutual_info", "importance"],
"llm_config": {
"backend": "litellm",
"model": "gpt-4o",
"max_suggestions": 20
}
}
```

Explicit CLI flags override values from the config file. Any scalar of
the wrong type (e.g. a string where an integer or boolean is expected,
as in `"max_features": "5"` or `"verbose": "false"`) is rejected with a
clean exit-2 error rather than failing later inside the engineer.
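
A hedged sketch of the precedence and type-checking rules just described. The helper names are hypothetical, and treating `None` as "flag not given on the command line" is an assumption of this example, not documented CLI behavior.

```python
def validate_config(config: dict) -> dict:
    """Reject wrongly typed scalars up front (illustrative checks only)."""
    max_features = config.get("max_features")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if max_features is not None and (
        isinstance(max_features, bool) or not isinstance(max_features, int)
    ):
        raise ValueError("max_features must be an integer")
    verbose = config.get("verbose")
    if verbose is not None and not isinstance(verbose, bool):
        raise ValueError("verbose must be a boolean")
    return config


def merge_settings(config: dict, cli_flags: dict) -> dict:
    """Overlay explicitly given CLI flags on top of validated config values."""
    merged = dict(validate_config(config))
    # Assumption: a value of None means the flag was not passed.
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```

A caller would map the `ValueError` to exit code `2` to match the output contract.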

## Parquet I/O

The base FeatCopilot install does not pin a parquet engine. To use
`--input file.parquet` / `--output file.parquet` (or the `parquet`
value of `--input-format` / `--output-format`), install one of:

```bash
pip install pyarrow # recommended
# or
pip install fastparquet
```

Confirm with `featcopilot info --json`:

```json
{ "parquet_available": true, ... }
```

If neither engine is installed, attempting Parquet I/O fails with a
clean exit-2 error pointing at the missing dependency.

## Agentic-usage tips

* Always pass `--json`. Treat anything on `stderr` as a hard failure;
treat anything on `stdout` as the JSON result.
* Treat the JSON `warnings` field as a list of human-readable
diagnostic strings: it is non-empty for `transform` runs that
generated leakage / mock-mode / sampling notices, and empty for
fully clean runs.
* `featcopilot transform` and `python -m featcopilot transform` invoke
the exact same entry point; the console-script form is simply shorter,
which is the only reason to prefer it in long-running batch jobs.

## See also

* [Overview](overview.md): the underlying `AutoFeatureEngineer` API.
* [Engines](engines.md): what each engine generates.
* [LLM Features](llm-features.md): configuring the LLM backend (provide
an `llm_config` object inside the JSON file passed to `--config`, as
shown in the [Configuration files](#configuration-files) section above).
115 changes: 108 additions & 7 deletions featcopilot/core/feature.py
@@ -12,6 +12,45 @@
logger = get_logger(__name__)


# Curated set of safe Python builtins exposed to ``Feature.compute``'s
# stored code. Without this whitelist (i.e. with ``{"__builtins__": {}}``)
# even basic idioms like ``len(df)``, ``range(...)``, ``sum(...)``, or
# ``int(x)`` raise ``NameError`` at exec time, which means a feature whose
# code legitimately uses a Python builtin crashes during ``compute`` even
# though the snippet is otherwise valid. The set mirrors the one used by
# :class:`featcopilot.core.transform_rule.TransformRule` so both code
# execution paths agree on what is safe.
_SAFE_BUILTINS: dict[str, Any] = {
"len": len,
"sum": sum,
"max": max,
"min": min,
"int": int,
"float": float,
"str": str,
"bool": bool,
"abs": abs,
"round": round,
"pow": pow,
"range": range,
"list": list,
"dict": dict,
"set": set,
"tuple": tuple,
"sorted": sorted,
"reversed": reversed,
"enumerate": enumerate,
"zip": zip,
"any": any,
"all": all,
"map": map,
"filter": filter,
"isinstance": isinstance,
"hasattr": hasattr,
"getattr": getattr,
}


class FeatureType(Enum):
"""Types of features."""

@@ -109,6 +148,42 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
"""
Compute feature values from DataFrame using stored code.

The stored ``code`` is executed in a single shared namespace
with ``df``, ``np`` and ``pd`` bound as names alongside a
curated set of safe Python builtins (``len``, ``range``,
``sum``, numeric / sequence constructors, etc.) so common
idioms work without giving the snippet a Python import system
— ``__import__`` is intentionally NOT in the safe builtins, so
an ``import foo`` statement inside the snippet raises at exec
time. The snippet must bind its output to a name called
``result``.

.. note::
This is **not** a security sandbox for untrusted code.
``pd`` is in scope, which means the snippet can reach
pandas' file I/O helpers (``pd.read_csv``, ``pd.read_parquet``,
``df.to_csv``, ...), and dunder attribute access on objects
reachable from ``df`` / ``np`` / ``pd`` is not blocked. The
builtin whitelist limits the *namespace* available to plain
Python idioms; it does not isolate FeatCopilot from the
ambient process. Stored snippets must therefore come from a
trusted source (your own code generator, a vetted feature
store, or a transform-rule registry you control).

A *fresh copy* of the safe-builtins dict is passed into ``exec``
on every call so that any mutation the snippet performs on
``__builtins__`` (rebinding entries, ``del``, ``pop``) does not
bleed into subsequent ``compute`` calls. Likewise the
data-bound namespace is constructed fresh per call. Using a
SINGLE dict for both ``globals`` and ``locals`` is what makes
free variables inside comprehensions and lambdas — which Python
resolves against the enclosing function's globals, not the
caller's locals — see ``df``, ``np`` and ``pd`` correctly.
With separate ``locals`` and ``globals`` dicts a snippet such
as ``[df['c'].iloc[i] for i in range(len(df))]`` would
otherwise raise ``NameError`` because the implicit comprehension
function's body looks ``df`` up in the (empty) ``globals``.

Parameters
----------
df : DataFrame
@@ -118,14 +193,40 @@ def compute(self, df: pd.DataFrame) -> pd.Series:
-------
Series
Computed feature values

Raises
------
ValueError
* If ``self.code`` is empty / missing — message
``"No code defined for feature ..."``.
* If ``self.code`` is present but did not bind a
``result`` variable — message
``"Feature ... code did not produce a 'result' variable"``.
These two cases produce DIFFERENT messages so a failing
snippet is distinguishable from an unset feature when
debugging.
"""
-        if self.code:
-            # Execute stored code to compute feature
-            local_vars = {"df": df, "np": np, "pd": pd}
-            exec(self.code, {"__builtins__": {}}, local_vars)
-            if "result" in local_vars:
-                return local_vars["result"]
-        raise ValueError(f"No code defined for feature {self.name}")
+        if not self.code:
+            raise ValueError(f"No code defined for feature {self.name}")
+
+        # Single shared namespace so comprehensions / lambdas /
+        # generator expressions inside the snippet see ``df``, ``np``,
+        # ``pd`` and the safe builtins. Fresh dicts per call so the
+        # snippet cannot pollute either the safe-builtins whitelist or
+        # the data bindings for later ``compute`` invocations.
+        namespace: dict[str, Any] = {
+            "__builtins__": dict(_SAFE_BUILTINS),
+            "df": df,
+            "np": np,
+            "pd": pd,
+        }
+        exec(self.code, namespace)
+        if "result" not in namespace:
+            raise ValueError(
+                f"Feature {self.name!r} code did not produce a 'result' variable. "
+                "Stored snippet must bind its output to a name called 'result'."
+            )
+        return namespace["result"]
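
The namespace change in this hunk can be reproduced outside FeatCopilot with a small standalone sketch. It uses plain lists instead of a DataFrame and a deliberately tiny builtin whitelist; `run_split` and `run_shared` are hypothetical names that mimic the old and new call shapes respectively. A generator expression is used because its body always runs in its own function scope.

```python
SAFE_BUILTINS = {"len": len, "sum": sum, "range": range}

# `scale` is a free variable inside the generator expression's body.
SNIPPET = "result = sum(v * scale for v in values)"


def run_split(code: str, bindings: dict):
    """Old shape: bindings live in a separate locals dict."""
    local_vars = dict(bindings)
    exec(code, {"__builtins__": dict(SAFE_BUILTINS)}, local_vars)
    return local_vars["result"]


def run_shared(code: str, bindings: dict):
    """New shape: one dict serves as both globals and locals."""
    namespace = {"__builtins__": dict(SAFE_BUILTINS), **bindings}
    exec(code, namespace)
    return namespace["result"]
```

With `{"values": [1, 2, 3], "scale": 2}`, `run_shared` returns `12`, while `run_split` raises `NameError` because the nested scope looks `scale` up in globals. Since `__import__` is absent from the whitelist, a snippet containing `import os` fails with `ImportError` under either shape.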


class FeatureSet:
3 changes: 2 additions & 1 deletion featcopilot/llm/__init__.py
@@ -4,7 +4,7 @@
"""

from featcopilot.llm.code_generator import FeatureCodeGenerator
-from featcopilot.llm.copilot_client import CopilotFeatureClient
+from featcopilot.llm.copilot_client import CopilotFeatureClient, SyncCopilotFeatureClient
from featcopilot.llm.explainer import FeatureExplainer
from featcopilot.llm.litellm_client import LiteLLMFeatureClient, SyncLiteLLMFeatureClient
from featcopilot.llm.openai_client import OpenAIFeatureClient, SyncOpenAIFeatureClient
@@ -13,6 +13,7 @@

__all__ = [
"CopilotFeatureClient",
"SyncCopilotFeatureClient",
"LiteLLMFeatureClient",
"SyncLiteLLMFeatureClient",
"OpenAIFeatureClient",