In some survey edge cases I end up with a zero-vector response on proportion data: as a toy example, "Do you have a PhD?" is a survey question where the full sample can be all 0.
```python
import pandas as pd

from samplics.estimation import TaylorEstimator
from samplics.utils.types import PopParam

# Setup data
test_df = pd.DataFrame(
    {
        "stratum": [
            "province_a",
            "province_a",
            "province_a",
            "province_b",
            "province_b",
            "province_b",
        ],
        "var": [0, 0, 0, 0, 0, 0],  # NOTE: all the same value of 0
        "wcol": [0.2, 0.3, 0.7, 0.2, 0.1, 0.7],
        "domain": ["dom_1", "dom_1", "dom_2", "dom_3", "dom_3", "dom_3"],
        "psu": [1, 2, 3, 4, 5, 6],
    }
)

# inspect
test_df

# NOTE: set up with PopParam.prop
te = TaylorEstimator(param=PopParam.prop, alpha=0.95)
te.estimate(
    y=test_df["var"],
    samp_weight=test_df["wcol"],
    stratum=test_df["stratum"],
    domain=test_df["domain"],
    psu=test_df["psu"],
)
te.to_dataframe()
```
|   | _param | _domain | _level | _estimate | _stderror | _lci | _uci | _cv |
|---|--------|---------|--------|-----------|-----------|------|------|-----|
| 0 | PopParam.prop | dom_1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | PopParam.prop | dom_2 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | PopParam.prop | dom_3 | 0 | 1 | 0 | 1 | 1 | 0 |
So the point estimates for a zero-vector are 1, because `PopParam.prop` uses `pd.get_dummies` to create a boolean vector of the categories of the input:

```python
# Breakpoint at `y_dummies = pd.get_dummies(y)` in expansion.py to recreate
>>> y_dummies
       0
0   True
1   True
2   True
3   True
4   True
5   True
```
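The collapse is plain pandas behavior and can be reproduced outside samplics: on an all-zero input, `pd.get_dummies` emits a single column for the one observed level, so every row belongs to that level and its estimated proportion is trivially 1.

```python
import numpy as np
import pandas as pd

# All-zero response: only the level 0 is ever observed
y = np.array([0, 0, 0, 0, 0, 0])

dummies = pd.get_dummies(y)
print(list(dummies.columns))       # [0] -- a single column for the lone observed level
print(bool(dummies.values.all()))  # True -- every row "is" level 0, so its proportion is 1
```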
Using `PopParam.mean` with `as_factor=True` still falls into this dummies block, resulting in the same boolean vector.

If I switch to plain `PopParam.mean` to avoid the dummy encoding, the protective guards present in the `PopParam.prop` logic branch are missing:
```python
# This catches the edge case in the `PopParam.prop` branch nicely;
# however, at this point the incorrect point estimate has already been made
if point_est1[level] == 0:
    lower_ci[level] = 0
    upper_ci[level] = 0
    coef_var[level] = 0
    ...

# But in `PopParam.mean` (in the domain non-None case) this fails with a
# ZeroDivisionError, since the (correct) self.point_est[key] is 0
self.coef_var[key] = (
    math.sqrt(self.variance[key]) / self.point_est[key]
)
```
Thoughts on adding an `if self.point_est[key] == 0`-style guard in the `PopParam.mean` coef_var calculation? Something like this works for my use case:

```python
if self.point_est[key] == 0:
    self.coef_var[key] = 0.0
else:
    self.coef_var[key] = (
        math.sqrt(self.variance[key]) / self.point_est[key]
    )
```
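For reference, the same guard can be sketched as a standalone helper (`safe_cv` is a hypothetical name, not part of the samplics API), following the convention in the `PopParam.prop` branch of defining the CV as 0 when the estimate is 0:

```python
import math

def safe_cv(variance: float, point_est: float) -> float:
    """Coefficient of variation, defined as 0.0 when the point estimate is 0.

    Mirrors the zero-guard already used in the PopParam.prop branch.
    """
    if point_est == 0:
        return 0.0
    return math.sqrt(variance) / point_est

print(safe_cv(0.0, 0.0))   # 0.0 instead of a ZeroDivisionError
print(safe_cv(0.04, 0.5))  # 0.4
```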
But this requires using `PopParam.mean` to get the correct point estimate.

Or any ideas about how to handle `get_dummies` producing a non-zero point estimate for all-zero inputs? I was thinking that since pandas is used for the dummies, `y` could be allowed to be passed as a pd.Series that retains its categories:

```python
import numpy as np
import pandas as pd

y = np.array([0, 0, 0, 0])
y_series = pd.Series(y, dtype="category")
y_series = y_series.cat.set_categories([0, 1])
print(pd.get_dummies(y_series).to_markdown())
```
|   | 0 | 1 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
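To illustrate why the retained categories matter downstream, here is a sketch (not samplics code) of a weighted proportion computed from those dummies; with the full category list carried through, the unobserved level 1 correctly comes out as 0 instead of disappearing:

```python
import numpy as np
import pandas as pd

y_series = pd.Series([0, 0, 0, 0], dtype="category").cat.set_categories([0, 1])
w = np.array([0.2, 0.3, 0.7, 0.2])  # illustrative sample weights

dummies = pd.get_dummies(y_series)              # columns for BOTH categories, 0 and 1
props = dummies.mul(w, axis=0).sum() / w.sum()  # weighted proportion per level
print(float(props[0]), float(props[1]))         # proportion of level 0, and of the unobserved level 1
```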