Conversation
Pull request overview
Adds reusable utilities to support synthetic data analysis workflows (Malawi/Togo and future datasets) and improves observability in the CIDER preprocessing pipeline when filtering removes all rows.
Changes:
- Added `notebooks/analysis_helper.py` with helper functions for describing data, plotting, preprocessing/featurization, and simple modeling evaluation utilities.
- Added an info log in `preprocess_data` to surface when date filtering yields zero rows for a given schema.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `src/cider/featurizer/core.py` | Adds logging for empty post-filter datasets inside `preprocess_data`. |
| `notebooks/analysis_helper.py` | Introduces notebook-focused helper functions for analysis, plotting, preprocessing/featurization, and k-fold evaluation. |
notebooks/analysis_helper.py
Outdated
```python
antennas_merged_shp = gpd.sjoin(
    antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)
```
antennas_merged_shp.region.fillna("Unknown", inplace=True) can trigger chained-assignment / view-vs-copy issues and relies on an inplace pattern that is increasingly discouraged in pandas. Assign back explicitly (e.g., set the column) to ensure the fill is applied reliably.
Suggested change:

```diff
-antennas_merged_shp.region.fillna("Unknown", inplace=True)
+antennas_merged_shp["region"] = antennas_merged_shp["region"].fillna("Unknown")
```
src/cider/featurizer/core.py

```python
if len(filtered_df) == 0:
    logger.info(
        f"No data points found for {schema.__name__} after filtering by dates."
    )
filtered_df_with_day = add_day_column(filtered_df)
```
In `preprocess_data`, `outlier_days` is only assigned inside the `if schema == CallDataRecordData:` branch, but it is used unconditionally when building `filtered_no_outlier_days_df`. If `CallDataRecordData` is missing from `data_dict` (`KeyError` -> `continue`), later iterations will hit an `UnboundLocalError` (or reuse stale values). Initialize `outlier_days` before the loop (e.g., an empty list) and/or compute it conditionally per schema before it's referenced.
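A minimal sketch of the initialization fix described in this comment. The schema name is reused from the real code, but the loop body and the outlier rule here are hypothetical placeholders standing in for the actual filtering logic:

```python
# Binding outlier_days before the loop avoids the UnboundLocalError when
# CallDataRecordData is absent from data_dict.
def preprocess_data(data_dict):
    results = {}
    outlier_days = []  # always bound, even if CallDataRecordData never appears
    for schema, days in data_dict.items():
        if schema == "CallDataRecordData":
            outlier_days = [d for d in days if d > 28]  # hypothetical outlier rule
        # outlier_days is guaranteed to exist here, for every schema
        results[schema] = [d for d in days if d not in outlier_days]
    return results

print(preprocess_data({"MobileMoneyData": [1, 30]}))  # → {'MobileMoneyData': [1, 30]}
```

Without the CDR schema present, the helper now falls back to an empty outlier list instead of raising.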
notebooks/analysis_helper.py
Outdated
```python
    sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
elif orient == "h":
    sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])
```
generate_boxplots passes x=groupby / y=groupby into seaborn even when groupby is None. Seaborn typically treats this as an invalid column name and raises (e.g., “Could not interpret value None for parameter ...”). Handle the groupby is None case by calling sns.boxplot without the grouping axis (single box per feature).
Suggested change:

```diff
-            sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
-        elif orient == "h":
-            sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])
+            if groupby is None:
+                sns.boxplot(data=data, y=feat, showfliers=showfliers, ax=axes[i])
+            else:
+                sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
+        elif orient == "h":
+            if groupby is None:
+                sns.boxplot(data=data, x=feat, showfliers=showfliers, ax=axes[i])
+            else:
+                sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])
```
notebooks/analysis_helper.py
Outdated
```python
        dict with precision and recall at k% for specified tail
    """
    data = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
    data
```
There’s a stray expression statement (data) immediately after constructing the DataFrame. It has no effect in normal Python execution and looks like leftover notebook/debug code; removing it will avoid confusion and keep the helper clean.
Suggested change:

```diff
-    data
```
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
notebooks/analysis_helper.py
Outdated
```python
antenna_gdf = gpd.GeoDataFrame(
    data[AntennaData],
    geometry=gpd.points_from_xy(
        x=data[AntennaData]["longitude"], y=data[AntennaData]["latitude"]
    ),
).set_crs(epsg=4326)
antennas_merged_shp = gpd.sjoin(
    antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_df = antennas_merged_shp.merge(data[AntennaData], on="antenna_id")
```
The `preprocess_data()` function doesn't process or assign any `AntennaData`, so we have to use the raw `AntennaData` here.
notebooks/analysis_helper.py
Outdated
```python
if keep_optional_columns:
    logging.info(
        "Proceeding with processing antenna data since optional columns were included in synthetic data"
    )
    if shapefile_gdf is None:
        raise ValueError(
            "Geographic features cannot be merged since no shapefile_gdf geodataframe was provided."
        )
    # Prepare antenna_data
    antenna_gdf = gpd.GeoDataFrame(
        data[AntennaData],
        geometry=gpd.points_from_xy(
            x=data[AntennaData]["longitude"], y=data[AntennaData]["latitude"]
        ),
    ).set_crs(epsg=4326)
    antennas_merged_shp = gpd.sjoin(
        antenna_gdf, shapefile_gdf, how="left", predicate="within"
    )[["antenna_id", "region"]]
    antennas_merged_shp.region.fillna("Unknown", inplace=True)
    antennas_df = antennas_merged_shp.merge(data[AntennaData], on="antenna_id")

    preprocessed_data[AntennaData] = antennas_df

return preprocessed_data
```
Could you add this to the original `preprocess_data` function in `featurizer.core`?
notebooks/analysis_helper.py
Outdated
```python
    return preprocessed_data


def run_featurization(
```
Could you update `featurize_all_data` in `featurizer.core` with this way of doing things? I.e. checking what data is available inside the `preprocessed_data` variable and doing featurization accordingly?
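A minimal sketch of the requested pattern. The function name, schema keys, and featurizers below are hypothetical stand-ins; the point is only the availability check before each featurization step:

```python
# Run only the featurizers whose schema is present in preprocessed_data,
# instead of assuming every dataset was provided.
def featurize_available(preprocessed_data, featurizers):
    features = {}
    for schema, featurize in featurizers.items():
        if schema not in preprocessed_data:
            continue  # dataset wasn't provided/preprocessed; skip its featurizer
        features[schema] = featurize(preprocessed_data[schema])
    return features

out = featurize_available(
    {"cdr": [1, 2, 3]},                 # only CDR data is available
    {"cdr": len, "mobile_money": len},  # registered featurizers
)
print(out)  # → {'cdr': 3}
```

Keying the featurizer registry by schema keeps the availability check in one place rather than scattered across `if` branches.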
notebooks/analysis_helper.py
Outdated
```python
    X,
    drop_zero_variance=False,
    null_max_threshold=None,
    fillna_method: Optional[Literal["median", "mean", "zero"]] = None,
```
Let's define an enum for this? Easier to keep track of and/or modify if we need to.
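One possible shape for that enum (the class name `FillnaMethod` is hypothetical; the member values mirror the existing `Literal` options):

```python
from enum import Enum

class FillnaMethod(str, Enum):
    # str mixin keeps compatibility with callers that still pass plain strings
    MEDIAN = "median"
    MEAN = "mean"
    ZERO = "zero"

# plain strings round-trip into the enum, so call sites need minimal changes
assert FillnaMethod("median") is FillnaMethod.MEDIAN
assert FillnaMethod.ZERO == "zero"
```

With the `str` mixin, `fillna_method: Optional[FillnaMethod] = None` can replace the `Literal` hint without breaking existing string-based callers.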
notebooks/analysis_helper.py
Outdated
```python
if fillna_method == "median":
    X = X.apply(
        lambda col: col.fillna(col.median()), axis=0
    )  # impute remaining null values with median
elif fillna_method == "mean":
    X = X.apply(
        lambda col: col.fillna(col.mean()), axis=0
    )  # impute remaining null values with mean
elif fillna_method == "zero":
    X = X.fillna(0)  # impute remaining null values with zero
```
Let's use a match-case structure for this. It's easier to add or remove conditions, and it also allows linters like mypy to catch errors in the logic more easily.
notebooks/analysis_helper.py
Outdated
```python
for x in train_k_eval_dict:
    k_eval_means["mean_train_" + x] = np.mean(train_k_eval_dict[x], axis=0)
for x in test_k_eval_dict:
    k_eval_means["mean_test_" + x] = np.mean(test_k_eval_dict[x], axis=0)
```
could you add calculations for std too?
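For example, mirroring the existing `np.mean` lines with `np.std` (the metric name and values here are made up for illustration):

```python
import numpy as np

# Hypothetical per-fold metrics shaped like train_k_eval_dict / test_k_eval_dict
train_k_eval_dict = {"precision_at_k": [0.5, 0.6, 0.7]}
test_k_eval_dict = {"precision_at_k": [0.4, 0.5, 0.6]}

k_eval_stats = {}
for x in train_k_eval_dict:
    k_eval_stats["mean_train_" + x] = np.mean(train_k_eval_dict[x], axis=0)
    k_eval_stats["std_train_" + x] = np.std(train_k_eval_dict[x], axis=0)  # std alongside the mean
for x in test_k_eval_dict:
    k_eval_stats["mean_test_" + x] = np.mean(test_k_eval_dict[x], axis=0)
    k_eval_stats["std_test_" + x] = np.std(test_k_eval_dict[x], axis=0)
```

Note `np.std` defaults to the population std (`ddof=0`); pass `ddof=1` if a sample std across folds is preferred.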
notebooks/analysis_helper.py
Outdated
```python
fig, axes = plt.subplots(
    nrow, ncol, figsize=(single_figsize[0] * ncol, single_figsize[1] * nrow)
)
```
This is very nitpicky of me, but could you use the matplotlibrc file in `src/cider` to typeset the plot?
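One way to do that is `matplotlib.pyplot.style.use`, which accepts a path to a style file. The repo path `src/cider/matplotlibrc` is assumed here; this sketch writes a small stand-in style file to show the mechanism:

```python
import os
import tempfile
import matplotlib
matplotlib.use("Agg")  # headless backend, safe in scripts/CI
import matplotlib.pyplot as plt

# Stand-in for the repo's src/cider matplotlibrc file
with tempfile.NamedTemporaryFile("w", suffix=".mplstyle", delete=False) as f:
    f.write("axes.grid: True\nfigure.dpi: 120\n")
    style_path = f.name

plt.style.use(style_path)  # the same call would accept the repo's matplotlibrc path
fig, ax = plt.subplots()   # subsequent figures pick up the shared typesetting
os.remove(style_path)
```

Calling `plt.style.use(...)` once at the top of the notebook keeps every helper's figures consistent without threading style arguments through each function.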
notebooks/analysis_helper.py
Outdated
```python
fig, axes = plt.subplots(
    nrow, ncol, figsize=(single_figsize[0] * ncol, single_figsize[1] * nrow)
)
```
same nitpicky comment: matplotlibrc file for typesetting?
Main reviews:
- `analysis_helper.py` file consisting of functions to run analysis.
- `featurizer.core.py`, since multiple errors came from preprocessing that filtered all data, leaving empty processed data.