Description
Issue:
While running tractor, I got the error shown below.
## Command being run (I have to add the export OMP_NUM_THREADS=1 to prevent a fork bomb on our slurm cluster):
export OMP_NUM_THREADS=1 && run_tractor.R --hapdose $HAPDOSE --phenofile $PHENO_FILE --covarcollist Age_WBC,Sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 --method linear --output $OUTPUT_FILE --sampleidcol GRID --phenocol WBC --chunksize $CHUNK_SIZE --nthreads 3
## error message
Joining with `by = join_by(IID)`
Error in `anti_join()`:
! Input columns in `x` must be unique.
✖ Problem with `IID`.
Backtrace:
▆
1. ├─global RunTractor(...)
2. │ ├─anti_join(df_phe, sampleID_hapdos) %>% select("IID")
3. │ ├─dplyr::anti_join(df_phe, sampleID_hapdos)
4. │ └─dplyr:::anti_join.data.frame(df_phe, sampleID_hapdos)
5. │ └─dplyr:::join_filter(...)
6. │ └─dplyr:::join_cols(x_names, y_names, by = by, error_call = error_call)
7. │ └─dplyr:::check_duplicate_vars(x_names, "x", error_call = error_call)
8. │ └─rlang::abort(bullets, call = error_call)
9. └─dplyr::select(., "IID")
Execution halted
My pheno file already had an IID column (although I was using the GRID column for the sample ID). The logic that renames the GRID column to IID therefore creates a second column named IID. The presence of two IID columns breaks the anti-join, which leads to the above stack trace. It is fairly common in genetics to have both an FID column and an IID column in phenotype files.
Is this issue already mentioned or being discussed:
I have looked through the issues and the README and don't see anything stating that this could happen. There is also no explicit instruction that users should run tractor either 1) using a "#IID" or "IID" column in the phenotype file, or 2) using a different column while ensuring that there is no IID column in the phenotype file.
Where the problem is occurring:
As best I can tell, the problem originates at lines 159-176 (see the code below). If the column "IID" already exists, then the line colnames(df_phe)[colnames(df_phe) == sampleidcol] <- "IID" generates a duplicate column name (note: the error still occurs when the values in the two columns differ, so it is the duplicated column name "IID" that triggers the failure, not the column contents). The program then fails at the anti-join between df_phe and sampleID_hapdos.
if (!is.null(sampleidcol)) {
  if (sampleidcol %in% colnames(df_phe)) {
    colnames(df_phe)[colnames(df_phe) == sampleidcol] <- "IID"
    cat("Sample ID column used : ", sampleidcol, "\n")
  } else {
    stop(paste("Error: Column", sampleidcol, "does not exist in the file."))
  }
} else {
  if ("#IID" %in% colnames(df_phe)) {
    colnames(df_phe)[colnames(df_phe) == "#IID"] <- "IID"
    cat("Sample ID column used : #IID\n")
  } else if (!("IID" %in% colnames(df_phe))) {
    stop(paste("Error: Unable to identify sample ID column. Default column name expected for sample ID is IID or #IID.
Alternatively, sample ID can be provided with --sampleidcol"))
  } else {
    cat("Sample ID column used : IID\n")
  }
}
Potential fixes:
- A very simple solution would be to tell users either to have a "#IID" or "IID" column, or that their file can't already contain an IID column if they are using a different sampleIDcol. This changes only the README, not the code.
- Check whether the IID column already exists when the user provides a sampleIDcol, then drop it and effectively recreate it. This keeps the current branching structure of the code, so the changes are minimal.
- I think you could just use the following logic. It is the largest rewrite, but it reduces the number of code branches and lets you use the variable name any time you want to access the column. You don't need to rename the existing column.
(pseudocode; this is not valid R code)
if (user provided sampleIDcol) {
    idColName = sampleIDcol
} else if ("IID" in colnames(df_phe)) {
    idColName = "IID"
} else if ("#IID" in colnames(df_phe)) {
    idColName = "#IID"
} else {
    throw an error because the user did not pass a column name and the default values are not present in the file
}
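For reference, the pseudocode above could be written in R roughly as follows. This is only a sketch: resolve_id_col is a hypothetical helper name that does not exist in run_tractor.R, the example data are made up, and the error messages are simply borrowed from the current code.

```r
# Hypothetical helper (not part of run_tractor.R): return the name of the
# sample ID column without renaming anything in df_phe.
resolve_id_col <- function(df_phe, sampleidcol = NULL) {
  cols <- colnames(df_phe)
  if (!is.null(sampleidcol)) {
    # The user passed --sampleidcol: it must exist in the phenotype file.
    if (!(sampleidcol %in% cols)) {
      stop(paste("Error: Column", sampleidcol, "does not exist in the file."))
    }
    return(sampleidcol)
  }
  # No column supplied: fall back to the default names.
  if ("IID" %in% cols) return("IID")
  if ("#IID" %in% cols) return("#IID")
  stop("Error: Unable to identify sample ID column. ",
       "Default column name expected for sample ID is IID or #IID.")
}

# Made-up phenotype table that has both GRID and IID, as in my file.
df_phe <- data.frame(GRID = c("g1", "g2"), IID = c("a", "b"), WBC = c(5.1, 6.3))
id_col <- resolve_id_col(df_phe, "GRID")
sample_ids <- df_phe[[id_col]]  # access the column through the variable
print(id_col)
```

With this approach, any later join against the hapdose sample IDs would presumably need an explicit by argument mapping id_col to "IID" instead of relying on the shared column name, so the change would touch more than this one block.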
I am willing to try implementing any of these changes, or to discuss alternative fixes, if y'all would want that.
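To make the second option (dropping a pre-existing IID column before the rename) more concrete, here is a minimal sketch. rename_id_col is a hypothetical helper name, the example data are made up, and I am assuming df_phe behaves like a plain data.frame at this point in the script.

```r
# Hypothetical helper (not part of run_tractor.R): rename sampleidcol to "IID",
# first dropping any other column that is already named "IID".
rename_id_col <- function(df_phe, sampleidcol) {
  if (!(sampleidcol %in% colnames(df_phe))) {
    stop(paste("Error: Column", sampleidcol, "does not exist in the file."))
  }
  if (sampleidcol != "IID" && "IID" %in% colnames(df_phe)) {
    # Tell the user that the existing IID column is being discarded.
    message("Dropping existing IID column; using ", sampleidcol, " as sample ID")
    df_phe[["IID"]] <- NULL
  }
  colnames(df_phe)[colnames(df_phe) == sampleidcol] <- "IID"
  df_phe
}

# Made-up phenotype table with both GRID and IID columns.
df_phe <- data.frame(GRID = c("g1", "g2"), IID = c("a", "b"), WBC = c(5.1, 6.3))
fixed <- rename_id_col(df_phe, "GRID")
print(colnames(fixed))
```

The downside is that the original IID values are silently lost downstream, which is why the sketch at least prints a message when it drops the column.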
How to reproduce:
- Everything was run on an Ubuntu 24.02 LTS server with R version 4.3.1 installed using mamba (faster conda)
- I think you can take a test phenotype file that you have, copy the column you are currently using as the sampleIDcol, and name the copy "IID"; that should reproduce this error
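The failure can also be sketched outside of tractor with a few lines of R. This assumes dplyr is installed; the variable names mirror the ones in the backtrace, but the data values are made up, and the error is caught with tryCatch so the script keeps running.

```r
library(dplyr)

# Made-up phenotype table that already has an IID column next to GRID.
df_phe <- data.frame(
  GRID = c("g1", "g2", "g3"),
  IID  = c("a", "b", "c"),
  WBC  = c(5.1, 6.3, 4.8)
)
# Made-up stand-in for the sample IDs read from the hapdose file.
sampleID_hapdos <- data.frame(IID = c("g1", "g2"))

# Same rename as in the script: this leaves two columns named "IID".
sampleidcol <- "GRID"
colnames(df_phe)[colnames(df_phe) == sampleidcol] <- "IID"
print(colnames(df_phe))

# The anti-join then fails on the duplicated name; keep the message instead
# of halting.
result <- tryCatch(
  anti_join(df_phe, sampleID_hapdos) %>% select("IID"),
  error = function(e) conditionMessage(e)
)
print(result)
```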