
Vw/synthetic data analysis #72

Open
vivwqy wants to merge 11 commits into main from vw/synthetic-data-analysis

Conversation

@vivwqy
Collaborator

vivwqy commented Mar 12, 2026

  • Cleaned notebook for synthetic data analysis; contains Malawi and Togo analysis results. Code in the notebook aims to make it easy to replicate the analysis for new datasets.
  • Added an analysis_helper.py file consisting of functions to run the analysis.
  • Added a small logging line to featurizer/core.py, since multiple errors came from preprocessing that filtered out all data, leaving the processed data empty.

vivwqy requested a review from poornimaramesh March 12, 2026 06:24
vivwqy self-assigned this Mar 12, 2026
Copilot AI review requested due to automatic review settings March 12, 2026 06:24

Copilot AI left a comment


Pull request overview

Adds reusable utilities to support synthetic data analysis workflows (Malawi/Togo and future datasets) and improves observability in the CIDER preprocessing pipeline when filtering removes all rows.

Changes:

  • Added notebooks/analysis_helper.py with helper functions for describing data, plotting, preprocessing/featurization, and simple modeling evaluation utilities.
  • Added an info log in preprocess_data to surface when date filtering yields zero rows for a given schema.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

  • src/cider/featurizer/core.py: Adds logging for empty post-filter datasets inside preprocess_data.
  • notebooks/analysis_helper.py: Introduces notebook-focused helper functions for analysis, plotting, preprocessing/featurization, and k-fold evaluation.


antennas_merged_shp = gpd.sjoin(
antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)

Copilot AI Mar 12, 2026


antennas_merged_shp.region.fillna("Unknown", inplace=True) can trigger chained-assignment / view-vs-copy issues and relies on an inplace pattern that is increasingly discouraged in pandas. Assign back explicitly (e.g., set the column) to ensure the fill is applied reliably.

Suggested change
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_merged_shp["region"] = antennas_merged_shp["region"].fillna("Unknown")

Comment on lines +2388 to 2392
if len(filtered_df) == 0:
logger.info(
f"No data points found for {schema.__name__} after filtering by dates."
)
filtered_df_with_day = add_day_column(filtered_df)

Copilot AI Mar 12, 2026


In preprocess_data, outlier_days is only assigned inside the if schema == CallDataRecordData: branch, but it is used unconditionally when building filtered_no_outlier_days_df. If CallDataRecordData is missing from data_dict (KeyError -> continue), later iterations will hit an UnboundLocalError (or reuse stale values). Initialize outlier_days before the loop (e.g., empty list) and/or compute it conditionally per schema before it’s referenced.
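The fix being suggested can be sketched as follows. The real body of preprocess_data isn't shown in this thread, so the function and variable names below (other than outlier_days) are simplified stand-ins, not the actual CIDER code:

```python
def filter_by_schema(data_dict, schemas, find_outlier_days):
    """Simplified stand-in for the per-schema loop inside preprocess_data.

    outlier_days is initialized before the loop, so it is always bound
    even when the CallDataRecordData branch never runs (the suggested fix).
    """
    outlier_days = []  # default: no outlier days
    filtered = {}
    for schema in schemas:
        rows = data_dict.get(schema)
        if rows is None:
            continue  # schema missing from data_dict
        if schema == "CallDataRecordData":
            outlier_days = find_outlier_days(rows)
        # Safe: outlier_days is defined regardless of which branch ran.
        filtered[schema] = [r for r in rows if r["day"] not in outlier_days]
    return filtered
```

Without the initialization, a schema processed before (or without) CallDataRecordData would reference an unbound name.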

Comment on lines +112 to +114
sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
elif orient == "h":
sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])

Copilot AI Mar 12, 2026


generate_boxplots passes x=groupby / y=groupby into seaborn even when groupby is None. Seaborn typically treats this as an invalid column name and raises (e.g., “Could not interpret value None for parameter ...”). Handle the groupby is None case by calling sns.boxplot without the grouping axis (single box per feature).

Suggested change
sns.boxplot(data=data, y=feat, x=groupby, showfliers=showfliers, ax=axes[i])
elif orient == "h":
sns.boxplot(data=data, x=feat, y=groupby, showfliers=showfliers, ax=axes[i])
if groupby is None:
sns.boxplot(
data=data,
y=feat,
showfliers=showfliers,
ax=axes[i],
)
else:
sns.boxplot(
data=data,
y=feat,
x=groupby,
showfliers=showfliers,
ax=axes[i],
)
elif orient == "h":
if groupby is None:
sns.boxplot(
data=data,
x=feat,
showfliers=showfliers,
ax=axes[i],
)
else:
sns.boxplot(
data=data,
x=feat,
y=groupby,
showfliers=showfliers,
ax=axes[i],
)

dict with precision and recall at k% for specified tail
"""
data = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
data

Copilot AI Mar 12, 2026


There’s a stray expression statement (data) immediately after constructing the DataFrame. It has no effect in normal Python execution and looks like leftover notebook/debug code; removing it will avoid confusion and keep the helper clean.

Suggested change
data
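For context, the docstring fragment above says the helper returns a dict with precision and recall at k% for the specified tail. The actual implementation isn't visible in this diff; a minimal sketch of one plausible version, with all names hypothetical (note that with equal-sized predicted and actual tails, precision equals recall):

```python
import numpy as np

def precision_recall_at_k(y_true, y_pred, k=10):
    """Hypothetical sketch: how well the lowest-predicted k% of samples
    overlaps with the lowest-true k% (the 'bottom tail')."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_select = max(1, int(round(len(y_pred) * k / 100)))
    predicted_tail = set(np.argsort(y_pred)[:n_select])  # lowest predictions
    actual_tail = set(np.argsort(y_true)[:n_select])     # lowest true values
    hits = len(predicted_tail & actual_tail)
    return {"precision": hits / n_select, "recall": hits / len(actual_tail)}
```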


Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.



Comment on lines +242 to +252
antenna_gdf = gpd.GeoDataFrame(
data[AntennaData],
geometry=gpd.points_from_xy(
x=data[AntennaData]["longitude"], y=data[AntennaData]["latitude"]
),
).set_crs(epsg=4326)
antennas_merged_shp = gpd.sjoin(
antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_df = antennas_merged_shp.merge(data[AntennaData], on="antenna_id")
Collaborator Author


The preprocess_data() function doesn't process or assign any AntennaData, so we have to use the raw AntennaData here.

Comment on lines +233 to +256
if keep_optional_columns:
logging.info(
"Proceeding with processing antenna data since optional columns were included in synthetic data"
)
if shapefile_gdf is None:
raise ValueError(
"Geographic features cannot be merged since no shapefile_gdf geodataframe was provided."
)
# Prepare antenna_data
antenna_gdf = gpd.GeoDataFrame(
data[AntennaData],
geometry=gpd.points_from_xy(
x=data[AntennaData]["longitude"], y=data[AntennaData]["latitude"]
),
).set_crs(epsg=4326)
antennas_merged_shp = gpd.sjoin(
antenna_gdf, shapefile_gdf, how="left", predicate="within"
)[["antenna_id", "region"]]
antennas_merged_shp.region.fillna("Unknown", inplace=True)
antennas_df = antennas_merged_shp.merge(data[AntennaData], on="antenna_id")

preprocessed_data[AntennaData] = antennas_df

return preprocessed_data
Collaborator


could you add this to the original preprocess_data function in featurizer.core?

return preprocessed_data


def run_featurization(
Collaborator


could you update featurize_all_data in featurizer.core with this way of doing things? i.e. checking what data is available inside the preprocessed_data variable and doing featurization accordingly?
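The requested pattern might look roughly like this; the registry-of-featurizers shape and all names below are assumptions for illustration, not the actual featurize_all_data signature in featurizer.core:

```python
def featurize_available_data(preprocessed_data, featurizers):
    """Run only the featurizers whose input dataset is actually present.

    featurizers maps a schema key to a callable; datasets missing from
    preprocessed_data are skipped instead of raising a KeyError.
    """
    features = {}
    for schema, featurize in featurizers.items():
        if schema not in preprocessed_data:
            continue  # dataset unavailable: skip its featurization
        features[schema] = featurize(preprocessed_data[schema])
    return features
```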

X,
drop_zero_variance=False,
null_max_threshold=None,
fillna_method: Optional[Literal["median", "mean", "zero"]] = None,
Collaborator


Let's define an enum for this? Easier to keep track of and/or modify if we need to
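A minimal sketch of the suggested enum; the class name is an assumption, with members taken from the Literal values in the snippet above:

```python
from enum import Enum

class FillNaMethod(str, Enum):
    """Replaces Literal["median", "mean", "zero"], so the valid options
    live in one place and typos fail loudly at lookup time."""
    MEDIAN = "median"
    MEAN = "mean"
    ZERO = "zero"

# Subclassing str keeps backwards compatibility with code that still
# compares against the raw strings:
assert FillNaMethod.MEDIAN == "median"
```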

Comment on lines +375 to +385
if fillna_method == "median":
X = X.apply(
lambda col: col.fillna(col.median()), axis=0
) # impute remaining null values with median
elif fillna_method == "mean":
X = X.apply(
lambda col: col.fillna(col.mean()), axis=0
) # impute remaining null values with mean
elif fillna_method == "zero":
X = X.fillna(0) # impute remaining null values with zero

Collaborator

poornimaramesh Mar 13, 2026


let's use a match-case structure for this -- it's easier to add or remove conditions, and it also allows linters like mypy to catch errors in the logic more easily

Comment on lines +556 to +559
for x in train_k_eval_dict:
k_eval_means["mean_train_" + x] = np.mean(train_k_eval_dict[x], axis=0)
for x in test_k_eval_dict:
k_eval_means["mean_test_" + x] = np.mean(test_k_eval_dict[x], axis=0)
Collaborator


could you add calculations for std too?
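Adding std next to the mean could look like this; the dict names come from the snippet above, but the surrounding k-fold loop and helper name are assumptions:

```python
import numpy as np

def summarize_k_eval(train_k_eval_dict, test_k_eval_dict):
    """Collect mean and std across folds for each train/test metric."""
    k_eval_summary = {}
    for name, fold_values in train_k_eval_dict.items():
        k_eval_summary["mean_train_" + name] = np.mean(fold_values, axis=0)
        k_eval_summary["std_train_" + name] = np.std(fold_values, axis=0)
    for name, fold_values in test_k_eval_dict.items():
        k_eval_summary["mean_test_" + name] = np.mean(fold_values, axis=0)
        k_eval_summary["std_test_" + name] = np.std(fold_values, axis=0)
    return k_eval_summary
```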

Comment on lines +158 to +160
fig, axes = plt.subplots(
nrow, ncol, figsize=(single_figsize[0] * ncol, single_figsize[1] * nrow)
)
Collaborator


this is very nitpicky of me, but could you use the matplotlibrc file in src/cider to typeset the plot?

Comment on lines +108 to +110
fig, axes = plt.subplots(
nrow, ncol, figsize=(single_figsize[0] * ncol, single_figsize[1] * nrow)
)
Collaborator


same nitpicky comment: matplotlibrc file for typesetting?

vivwqy requested a review from poornimaramesh March 17, 2026 06:03
@vivwqy
Collaborator Author

vivwqy commented Mar 17, 2026

Main reviews:

  1. Updated featurizer core.py:
     • preprocessing and featurize functions
     • added handling and logging for several cases with missing data / possible error cases
  2. Updated analysis_helper.py based on reviews.
  3. Updated notebooks synthetic_data_analysis and demo_pipeline based on the updated functions.
  4. Added a Python version for mypy in pre-commit so it can handle the match-case structure.

