DatasetAnalyser incorrectly removes all columns as null even when DataFrame has valid data.
When calling model.build() with a valid pandas DataFrame, Plexe’s internal DatasetAnalyser incorrectly concludes that all dataset columns are completely null and removes them using drop_null_columns.
As a result, dataset_0 ends up with zero columns, even though the input DataFrame contains valid non-null string data.
This occurs consistently, regardless of whether the dataset is loaded from CSV, JSON, TXT, or even constructed manually in code.
This makes it impossible to train a model using any tabular data.
- Python version: 3.12
- Plexe version: 0.26.2
A minimal example that reproduces the problem:
import pandas as pd
import plexe

model = plexe.Model(
    intent="Map description to code.",
    input_schema={"description": str},
    output_schema={"code": str},
)

df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

model.build(
    datasets=[df],
    provider="openai/gpt-4o-mini",
    max_iterations=1,
    timeout=300,
    run_timeout=120,
    verbose=True,
)
Plexe’s internal agent logs show:
Orchestrator · Step 2 · 09:57:38
💡 Executing Initial Steps for ML
💭 I began by executing the first two steps of the plan. I called the get_latest_datasets() function and retrieved dataset_0, which is the latest version of our dataset. I also selected the target metric for the task, which was determined to be Mean Absolute Error (MAE), but it returned an infinite value, indicating a potential issue.
DatasetAnalyser · Step 3 · 09:58:47
💡 Retrieving and Cleaning Datasets
💭 I needed to retrieve the latest datasets to ensure I was working with the most accurate data. I called the get_latest_datasets() function to access the available datasets, and confirmed that dataset_0 was retrieved. My next steps will involve cleaning the dataset by removing null columns and analyzing its characteristics using the provided template.
DatasetAnalyser · Step 4 · 09:59:12
💡 Cleaning Null Columns in Dataset
💭 I identified that dataset_0 was available as a raw dataset and needed cleaning by dropping completely null columns. I executed the function to drop the null columns from dataset_0. As a result, I successfully dropped 2 null columns from the dataset.
DatasetAnalyser · Step 5 · 09:59:59
💡 Retrieving Dataset Schema
💭 I dropped the null columns from dataset_0 and accessed the dataset using the specified pattern. I converted the dataset to a pandas DataFrame and retrieved its schema to understand the structure and types of columns. However, I observed that the DataFrame was empty, indicating that no data was available after the drop operation.
DatasetAnalyser · Step 6 · 10:00:48
💡 Registering EDA Report for Empty Dataset
💭 I realized that after dropping null columns, dataset_0 was empty, containing no data for analysis. I summarized my findings in an EDA report, highlighting the absence of features and relationships due to the dataset's emptiness. I recommended investigating the source of the dataset for potential data handling errors.
The same or very similar logs appear on every run. I am confident the input files are not the cause: I print the DataFrame before creating the model, and it always shows the correct data.
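For reference, a quick sanity check along these lines can confirm that the input has no all-null columns before it is handed to Plexe. This is only a sketch: it uses plain pandas (`isna().all()` and `dropna(axis=1, how="all")`) as a stand-in for what dropping completely null columns is expected to do. That Plexe's drop_null_columns is meant to behave the same way is an assumption, and the variable names `null_only_columns` and `cleaned` are illustrative only.

```python
import pandas as pd

# Same DataFrame as in the reproduction example.
df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

# Columns in which every value is null; expected to be empty for this input.
null_only_columns = [col for col in df.columns if df[col].isna().all()]

# Plain-pandas version of "drop completely null columns".
# Assumption: this mirrors the intended behaviour, not Plexe's actual code.
cleaned = df.dropna(axis=1, how="all")

print("All-null columns:", null_only_columns)
print("Shape after dropping all-null columns:", cleaned.shape)
```

If this check reports no all-null columns and an unchanged shape, while the agent logs still report 2 dropped columns, the data is presumably being lost somewhere between the DataFrame passed to model.build() and the dataset that the DatasetAnalyser reads.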