In some survey edge cases I end up with a zero-vector response on proportion data: as a toy example, "Do you have a PhD?" is a survey question where the full sample can be all 0.
```python
import pandas as pd

from samplics.estimation import TaylorEstimator
from samplics.utils.types import PopParam

# Setup data
test_df = pd.DataFrame(
    {
        "stratum": [
            "province_a",
            "province_a",
            "province_a",
            "province_b",
            "province_b",
            "province_b",
        ],
        "var": [0, 0, 0, 0, 0, 0],  # NOTE: all the same value of 0
        "wcol": [0.2, 0.3, 0.7, 0.2, 0.1, 0.7],
        "domain": ["dom_1", "dom_1", "dom_2", "dom_3", "dom_3", "dom_3"],
        "psu": [1, 2, 3, 4, 5, 6],
    }
)

# inspect
test_df

# NOTE: set up with PopParam.prop
te = TaylorEstimator(param=PopParam.prop, alpha=0.95)
te.estimate(
    y=test_df["var"],
    samp_weight=test_df["wcol"],
    stratum=test_df["stratum"],
    domain=test_df["domain"],
    psu=test_df["psu"],
)
te.to_dataframe()
```
|   | _param | _domain | _level | _estimate | _stderror | _lci | _uci | _cv |
|---|--------|---------|--------|-----------|-----------|------|------|-----|
| 0 | PopParam.prop | dom_1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | PopParam.prop | dom_2 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | PopParam.prop | dom_3 | 0 | 1 | 0 | 1 | 1 | 0 |
So the point estimates for a zero-vector are 1, because `PopParam.prop` uses `pd.get_dummies` to create a boolean vector of the categories of the input:

```python
# Breakpoint at `y_dummies = pd.get_dummies(y)` in expansion.py to recreate
>>> y_dummies
       0
0   True
1   True
2   True
3   True
4   True
5   True
```
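The collapse is plain pandas behavior and can be reproduced outside samplics: on an all-zero input, `pd.get_dummies` emits a single column for the one observed level, so every row belongs to that level and its estimated proportion is trivially 1.

```python
import numpy as np
import pandas as pd

# All-zero response: only the level 0 is ever observed
y = np.array([0, 0, 0, 0, 0, 0])

dummies = pd.get_dummies(y)
print(list(dummies.columns))       # [0] -- a single column for the lone observed level
print(bool(dummies.values.all()))  # True -- every row "is" level 0, so its proportion is 1
```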
Using `PopParam.mean` with `as_factor=True` still falls into this dummies block, resulting in the same boolean vector.

If I switch to plain `PopParam.mean` to avoid the dummy encoding, the protective guards present in the `PopParam.prop` logic branch are missing:
```python
# This catches the edge case in the `PopParam.prop` branch nicely;
# however, at this point the incorrect point estimate has already been made
if point_est1[level] == 0:
    lower_ci[level] = 0
    upper_ci[level] = 0
    coef_var[level] = 0
    ...

# But in `PopParam.mean` (in the domain non-None case) this fails with a
# ZeroDivisionError, since the (correct) self.point_est[key] is 0
self.coef_var[key] = (
    math.sqrt(self.variance[key]) / self.point_est[key]
)
```
Thoughts on adding an `if self.point_est[key] == 0`-style guard in the `PopParam.mean` coef_var calculation? Something like this works for my use case:

```python
if self.point_est[key] == 0:
    self.coef_var[key] = 0.0
else:
    self.coef_var[key] = (
        math.sqrt(self.variance[key]) / self.point_est[key]
    )
```
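For reference, the same guard can be sketched as a standalone helper (`safe_cv` is a hypothetical name, not part of the samplics API), following the convention in the `PopParam.prop` branch of defining the CV as 0 when the estimate is 0:

```python
import math

def safe_cv(variance: float, point_est: float) -> float:
    """Coefficient of variation, defined as 0.0 when the point estimate is 0.

    Mirrors the zero-guard already used in the PopParam.prop branch.
    """
    if point_est == 0:
        return 0.0
    return math.sqrt(variance) / point_est

print(safe_cv(0.0, 0.0))   # 0.0 instead of a ZeroDivisionError
print(safe_cv(0.04, 0.5))  # 0.4
```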
But this requires using `PopParam.mean` to get the correct point estimate.

Or any ideas about how to handle `get_dummies` producing a non-zero point estimate for all-zero inputs? I was thinking that since pandas is used for the dummies, `y` could be allowed to be passed as a pd.Series that retains its categories:

```python
import numpy as np
import pandas as pd

y = np.array([0, 0, 0, 0])
y_series = pd.Series(y, dtype="category")
y_series = y_series.cat.set_categories([0, 1])
print(pd.get_dummies(y_series).to_markdown())
```
|   | 0 | 1 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
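To illustrate why the retained categories matter downstream, here is a sketch (not samplics code) of a weighted proportion computed from those dummies; with the full category list carried through, the unobserved level 1 correctly comes out as 0 instead of disappearing:

```python
import numpy as np
import pandas as pd

y_series = pd.Series([0, 0, 0, 0], dtype="category").cat.set_categories([0, 1])
w = np.array([0.2, 0.3, 0.7, 0.2])  # illustrative sample weights

dummies = pd.get_dummies(y_series)              # columns for BOTH categories, 0 and 1
props = dummies.mul(w, axis=0).sum() / w.sum()  # weighted proportion per level
print(float(props[0]), float(props[1]))         # proportion of level 0, and of the unobserved level 1
```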