This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Improved handling for zero-vectors representing proportional responses in TaylorEstimator #70

@iamchrisearle

In some survey edge cases I end up with an all-zero response vector for proportion data. As a toy example, take the survey question "Do you have a PhD?" where every response in the full sample is 0.

import pandas as pd
from samplics.estimation import TaylorEstimator
from samplics.utils.types import PopParam

# Setup data
test_df = pd.DataFrame(
    {
        "stratum": [
            "province_a",
            "province_a",
            "province_a",
            "province_b",
            "province_b",
            "province_b",
        ],
        "var": [0, 0, 0, 0, 0, 0],  # NOTE: All the same value of 0
        "wcol": [0.2, 0.3, 0.7, 0.2, 0.1, 0.7],
        "domain": ["dom_1", "dom_1", "dom_2", "dom_3", "dom_3", "dom_3"],
        "psu": [1, 2, 3, 4, 5, 6],
    }
)

# inspect
test_df

# NOTE: set up with PopParam.prop
te = TaylorEstimator(param=PopParam.prop, alpha=0.95)
te.estimate(
    y=test_df["var"],
    samp_weight=test_df["wcol"],
    stratum=test_df["stratum"],
    domain=test_df["domain"],
    psu=test_df["psu"]
)

te.to_dataframe()
          _param _domain  _level  _estimate  _stderror  _lci  _uci  _cv
0  PopParam.prop   dom_1       0          1          0     1     1    0
1  PopParam.prop   dom_2       0          1          0     1     1    0
2  PopParam.prop   dom_3       0          1          0     1     1    0

So the point estimate for an all-zero vector is 1, because the PopParam.prop branch uses pd.get_dummies to create a boolean vector for each category observed in the input. With only the value 0 present, the single dummy column is all True:

# Breakpoint at `y_dummies = pd.get_dummies(y)` in expansion.py to recreate
>>> y_dummies
      0
0  True
1  True
2  True
3  True
4  True
5  True
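
To see why every domain comes out as 1, the estimate can be approximated outside samplics as a weighted share of each dummy column within each domain. This is only a pandas sketch using the toy data above, not the library's actual computation; the variable names are illustrative:

```python
import pandas as pd

# Toy data from the example above
y = pd.Series([0, 0, 0, 0, 0, 0])
w = pd.Series([0.2, 0.3, 0.7, 0.2, 0.1, 0.7])
domain = pd.Series(["dom_1", "dom_1", "dom_2", "dom_3", "dom_3", "dom_3"])

# Only the observed value 0 gets a column, and that column is all True
dummies = pd.get_dummies(y)

# Weighted share of each level within each domain: sum(w * dummy) / sum(w)
weighted = dummies.astype(float).mul(w, axis=0)
prop = weighted.groupby(domain).sum().div(w.groupby(domain).sum(), axis=0)

print(prop)
```

Because level 1 never appears in the data, there is no column for it, and the only share reported is the share of level 0, which is 1 in every domain.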

Using PopParam.mean with as_factor=True still goes through this dummies block, resulting in the same boolean vector.

If I switch to PopParam.mean to avoid the dummy encoding, the zero guards present in the PopParam.prop logic branch are missing:

# This catches the edge case in the `PopParam.prop` branch nicely
# however, at this point the incorrect point estimate has already been made
        if point_est1[level] == 0:
            lower_ci[level] = 0
            upper_ci[level] = 0
            coef_var[level] = 0

...

# But in `PopParam.mean` (when domain is not None)
# this fails with a zero-division because the (correct) self.point_est[key] is 0
        self.coef_var[key] = (
            math.sqrt(self.variance[key]) / self.point_est[key]
        )

Thoughts on adding an if self.point_est[key] == 0 style guard to the PopParam.mean coef_var calculation? Something like this works for my use case:

                    if self.point_est[key] == 0:
                        self.coef_var[key] = 0.0
                    else:
                        self.coef_var[key] = (
                            math.sqrt(self.variance[key]) / self.point_est[key]
                        )

But this requires using PopParam.mean to get the correct point estimate.
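
As a standalone illustration of that guard, here is a minimal sketch. The helper name safe_coef_var is hypothetical and not part of samplics, and returning 0.0 for a zero estimate is an assumption that simply mirrors what the PopParam.prop branch does:

```python
import math


def safe_coef_var(variance: float, point_est: float) -> float:
    """Hypothetical helper: sqrt(variance) / point_est, or 0.0 if the estimate is 0."""
    if point_est == 0:
        # Assumption: follow the PopParam.prop branch and report 0 instead of dividing by zero
        return 0.0
    return math.sqrt(variance) / point_est


print(safe_coef_var(0.0, 0.0))   # zero estimate, guarded
print(safe_coef_var(0.04, 0.5))  # regular case
```

Returning NaN instead of 0.0 would arguably be more honest, since the coefficient of variation is undefined at a zero estimate; 0.0 is used here only for consistency with the existing PopParam.prop behavior.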

Or any ideas on how to handle get_dummies producing a non-zero point estimate for all-zero input? I was thinking that, since pandas is already used for the dummies, y could be allowed to be passed as a pd.Series that retains its categories:

import numpy as np
import pandas as pd

y = np.array([0,0,0,0])
y_series = pd.Series(y, dtype="category")
y_series = y_series.cat.set_categories([0, 1])

print(pd.get_dummies(y_series).to_markdown())
|    |   0 |   1 |
|---:|----:|----:|
|  0 |   1 |   0 |
|  1 |   1 |   0 |
|  2 |   1 |   0 |
|  3 |   1 |   0 |
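
Continuing that idea, a rough sketch of how retained categories would feed into a weighted proportion per level. The weights are made up for illustration, and the computation is plain pandas rather than anything samplics does internally:

```python
import numpy as np
import pandas as pd

y = np.array([0, 0, 0, 0])
w = pd.Series([0.2, 0.3, 0.7, 0.2])  # hypothetical sample weights

# Declare both levels up front so the unobserved level 1 keeps its own column
y_cat = pd.Series(pd.Categorical(y, categories=[0, 1]))
dummies = pd.get_dummies(y_cat)

# Weighted share per level: sum(w * dummy) / sum(w)
props = dummies.astype(float).mul(w, axis=0).sum() / w.sum()

print(props)
```

With the categories declared, level 1 gets an explicit proportion of 0 alongside the proportion of 1 for level 0, instead of being absent from the result.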
