
Vw/synthetic data analysis #72

Open
vivwqy wants to merge 11 commits into main from vw/synthetic-data-analysis

Conversation

@vivwqy
Collaborator

vivwqy commented Mar 12, 2026

  • Cleaned notebook for synthetic data analysis; contains Malawi and Togo analysis results. Code in the notebook aims to make it easy to replicate the analysis for new datasets.
  • Added an analysis_helper.py file consisting of functions to run the analysis.
  • Added a small logging line to featurizer/core.py, since multiple errors came from preprocessing that filtered out all data, leaving the processed data empty.

vivwqy requested a review from poornimaramesh March 12, 2026 06:24
vivwqy self-assigned this Mar 12, 2026
Copilot AI review requested due to automatic review settings March 12, 2026 06:24

Copilot AI left a comment


Pull request overview

Adds reusable utilities to support synthetic data analysis workflows (Malawi/Togo and future datasets) and improves observability in the CIDER preprocessing pipeline when filtering removes all rows.

Changes:

  • Added notebooks/analysis_helper.py with helper functions for describing data, plotting, preprocessing/featurization, and simple modeling evaluation utilities.
  • Added an info log in preprocess_data to surface when date filtering yields zero rows for a given schema.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

  • src/cider/featurizer/core.py: Adds logging for empty post-filter datasets inside preprocess_data.
  • notebooks/analysis_helper.py: Introduces notebook-focused helper functions for analysis, plotting, preprocessing/featurization, and k-fold evaluation.


antennas_merged_shp = gpd.sjoin(
antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)

Copilot AI Mar 12, 2026


antennas_merged_shp.region.fillna("Unknown", inplace=True) can trigger chained-assignment / view-vs-copy issues and relies on an inplace pattern that is increasingly discouraged in pandas. Assign back explicitly (e.g., set the column) to ensure the fill is applied reliably.

Suggested change
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_merged_shp["region"] = antennas_merged_shp["region"].fillna("Unknown")

Comment on lines +2388 to 2392
if len(filtered_df) == 0:
logger.info(
f"No data points found for {schema.__name__} after filtering by dates."
)
filtered_df_with_day = add_day_column(filtered_df)

Copilot AI Mar 12, 2026


In preprocess_data, outlier_days is only assigned inside the if schema == CallDataRecordData: branch, but it is used unconditionally when building filtered_no_outlier_days_df. If CallDataRecordData is missing from data_dict (KeyError -> continue), later iterations will hit an UnboundLocalError (or reuse stale values). Initialize outlier_days before the loop (e.g., empty list) and/or compute it conditionally per schema before it’s referenced.
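The fix being suggested can be sketched as follows. The real body of preprocess_data isn't shown in this thread, so the function and variable names below (other than outlier_days) are simplified stand-ins, not the actual CIDER code:

```python
def filter_by_schema(data_dict, schemas, find_outlier_days):
    """Simplified stand-in for the per-schema loop inside preprocess_data.

    outlier_days is initialized before the loop, so it is always bound
    even when the CallDataRecordData branch never runs (the suggested fix).
    """
    outlier_days = []  # default: no outlier days
    filtered = {}
    for schema in schemas:
        rows = data_dict.get(schema)
        if rows is None:
            continue  # schema missing from data_dict
        if schema == "CallDataRecordData":
            outlier_days = find_outlier_days(rows)
        # Safe: outlier_days is defined regardless of which branch ran.
        filtered[schema] = [r for r in rows if r["day"] not in outlier_days]
    return filtered
```

Without the initialization, a schema processed before (or without) CallDataRecordData would reference an unbound name.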

Comment on lines +112 to +114
sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
elif orient == "h":
sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])

Copilot AI Mar 12, 2026


generate_boxplots passes x=groupby / y=groupby into seaborn even when groupby is None. Seaborn typically treats this as an invalid column name and raises (e.g., “Could not interpret value None for parameter ...”). Handle the groupby is None case by calling sns.boxplot without the grouping axis (single box per feature).

Suggested change
sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
elif orient == "h":
sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])
if groupby is None:
sns.boxplot(
data=data,
y=feat,
showfliers=showfliers,
ax=axes[i],
)
else:
sns.boxplot(
data=data,
y=feat,
x=groupby,
showfliers=showfliers,
ax=axes[i],
)
elif orient == "h":
if groupby is None:
sns.boxplot(
data=data,
x=feat,
showfliers=showfliers,
ax=axes[i],
)
else:
sns.boxplot(
data=data,
x=feat,
y=groupby,
showfliers=showfliers,
ax=axes[i],
)

dict with precision and recall at k% for specified tail
"""
data = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
data

Copilot AI Mar 12, 2026


There’s a stray expression statement (data) immediately after constructing the DataFrame. It has no effect in normal Python execution and looks like leftover notebook/debug code; removing it will avoid confusion and keep the helper clean.

Suggested change
data
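For context, the docstring fragment above says the helper returns a dict with precision and recall at k% for the specified tail. The actual implementation isn't visible in this diff; a minimal sketch of one plausible version, with all names hypothetical (note that with equal-sized predicted and actual tails, precision equals recall):

```python
import numpy as np

def precision_recall_at_k(y_true, y_pred, k=10):
    """Hypothetical sketch: how well the lowest-predicted k% of samples
    overlaps with the lowest-true k% (the 'bottom tail')."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_select = max(1, int(round(len(y_pred) * k / 100)))
    predicted_tail = set(np.argsort(y_pred)[:n_select])  # lowest predictions
    actual_tail = set(np.argsort(y_true)[:n_select])     # lowest true values
    hits = len(predicted_tail & actual_tail)
    return {"precision": hits / n_select, "recall": hits / len(actual_tail)}
```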


Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.



Comment on lines +242 to +252
antenna_gdf = gpd.GeoDataFrame(
data[AntennaData],
geometry=gpd.points_from_xy(
x=data[AntennaData]["longitude"], y=data[AntennaData]["latitude"]
),
).set_crs(epsg=4326)
antennas_merged_shp = gpd.sjoin(
antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_df = antennas_merged_shp.merge(data[AntennaData], on="antenna_id")
Collaborator Author


The preprocess_data() function doesn't process or assign any AntennaData, so we have to use the raw AntennaData here.

Comment on lines +233 to +256
if keep_optional_columns:
logging.info(
"Proceeding with processing antenna data since optional columns were included in synthetic data"
)
if shapefile_gdf is None:
raise ValueError(
"Geographic features cannot be merged since no shapefile_gdf geodataframe was provided."
)
# Prepare antenna_data
antenna_gdf = gpd.GeoDataFrame(
data[AntennaData],
geometry=gpd.points_from_xy(
x=data[AntennaData]["longitude"], y=data[AntennaData]["latitude"]
),
).set_crs(epsg=4326)
antennas_merged_shp = gpd.sjoin(
antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_df = antennas_merged_shp.merge(data[AntennaData], on="antenna_id")

preprocessed_data[AntennaData] = antennas_df

return preprocessed_data
Collaborator


could you add this to the original preprocess_data function in featurizer.core?

return preprocessed_data


def run_featurization(
Collaborator


could you update featurize_all_data in featurizer.core with this way of doing things? i.e. checking what data is available inside the preprocessed_data variable and doing featurization accordingly?
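The requested pattern might look roughly like this; the registry-of-featurizers shape and all names below are assumptions for illustration, not the actual featurize_all_data signature in featurizer.core:

```python
def featurize_available_data(preprocessed_data, featurizers):
    """Run only the featurizers whose input dataset is actually present.

    featurizers maps a schema key to a callable; datasets missing from
    preprocessed_data are skipped instead of raising a KeyError.
    """
    features = {}
    for schema, featurize in featurizers.items():
        if schema not in preprocessed_data:
            continue  # dataset unavailable: skip its featurization
        features[schema] = featurize(preprocessed_data[schema])
    return features
```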

X,
drop_zero_variance=False,
null_max_threshold=None,
fillna_method: Optional[Literal["median", "mean", "zero"]] = None,
Collaborator


Let's define an enum for this? Easier to keep track of and/or modify if we need to
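A minimal sketch of the suggested enum; the class name is an assumption, with members taken from the Literal values in the snippet above:

```python
from enum import Enum

class FillNaMethod(str, Enum):
    """Replaces Literal["median", "mean", "zero"], so the valid options
    live in one place and typos fail loudly at lookup time."""
    MEDIAN = "median"
    MEAN = "mean"
    ZERO = "zero"

# Subclassing str keeps backwards compatibility with code that still
# compares against the raw strings:
assert FillNaMethod.MEDIAN == "median"
```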

Comment on lines +375 to +385
if fillna_method == "median":
X = X.apply(
lambda col: col.fillna(col.median()), axis=0
) # impute remaining null values with median
elif fillna_method == "mean":
X = X.apply(
lambda col: col.fillna(col.mean()), axis=0
) # impute remaining null values with mean
elif fillna_method == "zero":
X = X.fillna(0) # impute remaining null values with zero

Collaborator

poornimaramesh Mar 13, 2026


let's use a match-case structure for this -- it's easier to add or remove conditions, and it also allows linters like mypy to catch errors in the logic more easily

Comment on lines +556 to +559
for x in train_k_eval_dict:
k_eval_means["mean_train_" + x] = np.mean(train_k_eval_dict[x], axis=0)
for x in test_k_eval_dict:
k_eval_means["mean_test_" + x] = np.mean(test_k_eval_dict[x], axis=0)
Collaborator


could you add calculations for std too?
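Adding std next to the mean could look like this; the dict names come from the snippet above, but the surrounding k-fold loop and helper name are assumptions:

```python
import numpy as np

def summarize_k_eval(train_k_eval_dict, test_k_eval_dict):
    """Collect mean and std across folds for each train/test metric."""
    k_eval_summary = {}
    for name, fold_values in train_k_eval_dict.items():
        k_eval_summary["mean_train_" + name] = np.mean(fold_values, axis=0)
        k_eval_summary["std_train_" + name] = np.std(fold_values, axis=0)
    for name, fold_values in test_k_eval_dict.items():
        k_eval_summary["mean_test_" + name] = np.mean(fold_values, axis=0)
        k_eval_summary["std_test_" + name] = np.std(fold_values, axis=0)
    return k_eval_summary
```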

Comment on lines +158 to +160
fig, axes = plt.subplots(
nrow, ncol, figsize=(single_figsize[0] * ncol, single_figsize[1] * nrow)
)
Collaborator


this is very nitpicky of me, but could you use the matplotlibrc file in src/cider to typeset the plot?

Comment on lines +108 to +110
fig, axes = plt.subplots(
nrow, ncol, figsize=(single_figsize[0] * ncol, single_figsize[1] * nrow)
)
Collaborator


same nitpicky comment: matplotlibrc file for typesetting?

vivwqy requested a review from poornimaramesh March 17, 2026 06:03
@vivwqy
Collaborator Author

vivwqy commented Mar 17, 2026

Main reviews:

  1. Updated featurizer core.py:
     • preprocessing and featurize functions
     • added handling and logging for several cases with missing data / possible error cases
  2. Updated analysis_helper.py based on reviews.
  3. Updated notebooks synthetic_data_analysis and demo_pipeline based on the updated functions.
  4. Added a Python version for mypy in pre-commit so it can handle the match-case structure.

