For faster iteration I tested an eval config that skips almost everything and keeps only pearson_delta and discrimination_score_l1, via evaluator.compute(profile="full", metric_configs={}, skip_metrics=skip_metrics). Unexpectedly, both metrics came out much worse in the reduced run, even though the model predictions were identical. This looks like a cell_eval pipeline bug: skipping most metrics appears to change some internal intermediate state (likely a hidden dependency or ordering effect between metrics), which makes pearson_delta and discrimination_score_l1 unreliable in that reduced setup.
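For reference, here is roughly how I compared the two configurations. Only the compute() call itself is the one quoted above; how the evaluator is constructed, the all_metric_names list, and the per-metric result lookup are placeholders for illustration, not a confirmed cell_eval API.

```python
# Minimal repro sketch, assuming the same `evaluator` (built from an unchanged
# prediction/ground-truth pair) is reused in both runs. The compute() call is
# taken from the report above; the metric-name list and dict-style result
# access are illustrative assumptions.

KEEP = {"pearson_delta", "discrimination_score_l1"}

def kept_scores(evaluator, skip_metrics):
    results = evaluator.compute(
        profile="full",
        metric_configs={},
        skip_metrics=skip_metrics,
    )
    # Assumes per-metric results can be looked up by metric name.
    return {m: results[m] for m in KEEP}

# Baseline: run every metric.
baseline = kept_scores(evaluator, skip_metrics=[])

# Reduced: skip everything except the two metrics of interest.
# `all_metric_names` stands in for however the full metric list is enumerated.
reduced = kept_scores(
    evaluator,
    skip_metrics=[m for m in all_metric_names if m not in KEEP],
)

# Predictions are identical in both runs, so these should match. In practice
# the reduced run reports clearly worse pearson_delta and
# discrimination_score_l1, which is what points at a hidden dependency or
# ordering effect between metrics inside the pipeline.
print(baseline)
print(reduced)
```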