Dear Jonathan,
Firstly, thank you for developing the PolyMR R package. It seems like a very useful tool for exploring non-linear causal effects.
I am currently trying to use the polymr() function, but I'm consistently encountering an error:
Error in cut.default(exposure, breaks = bin_boundaries, include.lowest = TRUE, ) :
'breaks' are not unique
This error occurs when running the function with default arguments. Based on investigating my exposure data (data$Drink in my case), the issue seems to stem from its distribution:
Initial Data: The exposure variable originally had very few unique values (21 unique values in ~4700 observations), and a large proportion of observations were zero: at least 25%, since quantile(exposure, probs = seq(0, 1, 1/4)) showed that the 0% and 25% quantiles were both 0.00.
Filtered Data (Drinkers Only): To see if removing the zeros would help, I filtered the data to include only non-zero exposure values. While the distribution spread out, the error persisted. Checking the quantiles for this filtered data revealed duplicates when calculating for 10 bins:
Quantiles for exposure > 0 (calculated in R)
> print(quantile(exposure_filtered, probs = seq(0, 1, 1 / 10), na.rm = TRUE))
  0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
0.11  0.33  1.32  3.96  5.50 11.00 15.40 15.40 30.80 30.80 77.00
Specifically, the 60th/70th percentiles were identical (15.40), and the 80th/90th percentiles were also identical (30.80).
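For reference, the checks described above can be sketched as follows. The vector is a made-up stand-in for data$Drink (the real data are not reproduced here), so the values it prints are illustrative only and will not match the output shown above.

```r
# Hypothetical stand-in for data$Drink: many zeros, few distinct values.
exposure <- c(rep(0, 30),
              rep(c(0.11, 0.33, 1.32, 5.5, 11, 15.4, 30.8), each = 10))

n_unique  <- length(unique(exposure))  # number of distinct exposure values
prop_zero <- mean(exposure == 0)       # proportion of observations equal to zero

# Quartiles of the full data: 0% and 25% coincide when >= 25% are zero.
quartiles <- quantile(exposure, probs = seq(0, 1, 1 / 4), na.rm = TRUE)

# Deciles after dropping zeros; tied deciles signal the same problem.
exposure_filtered <- exposure[exposure > 0]
deciles <- quantile(exposure_filtered, probs = seq(0, 1, 1 / 10), na.rm = TRUE)
tied_deciles <- unique(deciles[duplicated(deciles)])

print(quartiles)
print(tied_deciles)
```

Any non-empty tied_deciles means that quantile-based boundaries at that resolution cannot all be distinct.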
This suggests that the internal step in polymr() that calls cut() fails because the method used to determine bin_boundaries (quantile-based or similar) produces duplicate values when many observations share the same exposure value, even after the zeros are filtered out.
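A minimal sketch of what I believe is happening, outside the package. It uses the decile values reported above as stand-in breaks; I do not know how PolyMR actually computes bin_boundaries, so this only illustrates the cut() behaviour, not the package's internal code. The exposure_demo values are made up.

```r
# Decile values copied from the output above, used as stand-in breaks.
# Assumption: this is NOT how PolyMR derives bin_boundaries internally.
bin_boundaries <- c(0.11, 0.33, 1.32, 3.96, 5.50, 11.00,
                    15.40, 15.40, 30.80, 30.80, 77.00)
exposure_demo <- c(0.11, 0.33, 5.50, 15.40, 30.80, 77.00)  # made-up values

# Duplicate breaks make cut() stop; capture the message instead of aborting.
err_msg <- tryCatch(
  {
    cut(exposure_demo, breaks = bin_boundaries, include.lowest = TRUE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# With the duplicates dropped, the same call succeeds, with fewer bins.
bins_ok <- cut(exposure_demo, breaks = unique(bin_boundaries),
               include.lowest = TRUE)
print(table(bins_ok))
```

Dropping duplicated breaks merges the tied bins, so the resulting bins no longer hold equal numbers of observations. Whether that would be acceptable for the method is part of what I am asking.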
I have reviewed the function's arguments in the help file and understand that the bins argument (default 100) controls the binning for the output summary, but there doesn't seem to be a direct argument to control the number of bins used internally for the model-fitting step that triggers this cut error.
My questions are:
Is there a recommended way to handle exposure data with these characteristics (highly skewed, clustered/duplicate values) within the PolyMR framework?
Is there any way (perhaps undocumented or via other parameter interactions) to adjust the internal binning process to potentially avoid this error (e.g., using fewer bins internally)?
Alternatively, would you consider PolyMR perhaps less suitable for exposure variables with such distributions compared to other MR methods?
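To make the second question concrete: outside the package, one could search for the largest number of quantile bins that still gives distinct boundaries. The helper below is hypothetical (it is not a PolyMR function or argument), it assumes quantile-based binning, which I have not confirmed for PolyMR, and the example vector is made up.

```r
# Hypothetical helper, not part of PolyMR: the largest number of quantile
# bins (up to max_bins) whose boundaries are all distinct for vector x.
max_unique_quantile_bins <- function(x, max_bins = 100) {
  for (n_bins in seq(max_bins, 1)) {
    breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1),
                       na.rm = TRUE)
    if (anyDuplicated(breaks) == 0) return(n_bins)
  }
  0L  # even one bin fails, which happens only when x is constant
}

# Made-up exposure with heavy ties, similar in spirit to my filtered data.
x_demo <- rep(c(0.11, 0.33, 1.32, 5.5, 11, 15.4, 30.8), each = 10)
k <- max_unique_quantile_bins(x_demo)
print(k)
```

If the internal bin count could be set to a value such as k, the cut() call would presumably no longer fail, although I cannot judge whether so few bins would be statistically reasonable.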
Any advice or clarification you could provide would be greatly appreciated. I am happy to provide more details if needed.
Thank you for your time and for the valuable package.
Sincerely,
Takeshi Nishiyama