Skip to content

Feature/gene count input 20250212#58

Open
cschu wants to merge 20 commits into
mainfrom
feature/gene_count_input_20250212
Open

Feature/gene count input 20250212#58
cschu wants to merge 20 commits into
mainfrom
feature/gene_count_input_20250212

Conversation

@cschu

@cschu cschu commented May 11, 2026

Copy link
Copy Markdown
Owner

Summary by CodeRabbit

  • New Features

    • Accept external gene-count matrices via --gene_counts and use them during quantification.
    • Produce alignment statistics in JSON format.
  • Bug Fixes / Validation

    • CLI enforces exactly one input source and rejects strand-specific options with gene-count input.
    • Improved input handling for alignment vs. external-count workflows.
  • Chores

    • Version bumped to 2.19.0.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@cschu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 42 minutes and 20 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61c3dbe8-a396-4177-a264-52ec68ecc05a

📥 Commits

Reviewing files that changed from the base of the PR and between 7090337 and f229f2e.

📒 Files selected for processing (1)
  • gffquant/handle_args.py
📝 Walkthrough

Walkthrough

This PR adds external gene counts as a new input mode, extracts ReferenceHit into a standalone module, introduces a --gene_counts CLI option, updates the profiler to conditionally annotate from external or internal counts, and wires alignment reporting and parameter passing through the main entry point.

Changes

External Gene Counts Feature

Layer / File(s) Summary
ReferenceHit Data Structure
gffquant/profilers/reference_hit.py
Adds ReferenceHit dataclass (slots) with __hash__, __eq__, __str__, and __repr__; feature_quantifier.py imports it.
Annotate External Counts
gffquant/annotation/count_annotator.py
Adds annotate_external() API: RegionCountAnnotator.annotate_external() (stub, NotImplementedError) and GeneCountAnnotator.annotate_external() (reads tab-delimited gene-count file, parses numeric columns, maps gene→ggroup, aggregates counts, queries DB, accumulates unannotated counts, recalculates scaling factors).
CLI --gene_counts Argument & Validation
gffquant/handle_args.py
Adds --gene_counts CLI option; validate_args() treats it as a mutually-exclusive input type, sets args.input_type="gene_counts" when present, and raises NotImplementedError for unsupported combos (e.g., --strand_specific with gene_counts).
Main Entry Point Wiring
gffquant/__main__.py
Unifies fastq/bam/sam handling under one branch, calls profiler.report_alignments() after alignment processing, uses a no-op for other input types, and passes external_gene_counts=args.gene_counts into profiler.finalise().
FeatureQuantifier Signatures & Reporting
gffquant/profilers/feature_quantifier.py
Imports ReferenceHit; process_counters(..., external_gene_counts=None) added and branches to use annotate_external() when external counts provided; adds report_alignments() to write {out_prefix}.aln_stats.json and log alignment/filter metrics; finalise(..., external_gene_counts=None) forwards the external counts and removes prior gzipped aln-stats text/logging.
Version & Supporting
gffquant/__init__.py, gffquant/counters/count_manager.py
Bumps __version__ to 2.19.0; inserts commented-out CountManager.set_counts/load_data stubs (no runtime effect).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cschu/gff_quantifier#54: Modifies the same core modules (count_annotator.py, feature_quantifier.py, handle_args/main) with related annotation and parameter handling enhancements.
  • cschu/gff_quantifier#53: Updates the same profiling code (feature_quantifier.py, main.py) and package metadata.

Poem

🐰 I nibble rows of counts and cheer,
New inputs hum — the pathway clear,
ReferenceHits hop out alone,
External counts find their new home,
A tiny version bump — hooray, my dear!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Feature/gene count input 20250212' is partially related to the changeset. It references a real feature (gene count input), but is vague with the date suffix and doesn't clearly summarize the main changes, which include version bumps, refactoring of alignment handling, external gene count annotation, and new data structures. Consider a more descriptive title that highlights the primary changes, such as 'Add support for external gene count input and refactor alignment processing' or 'Implement external gene count annotation feature'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/gene_count_input_20250212

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gffquant/annotation/count_annotator.py`:
- Around line 333-347: annotate_external() currently always builds a 4-value
tuple (counts) and then adds it to self.gene_counts entries of length self.bins,
causing a shape mismatch when self.bins == 12 (strand-specific). Fix by adding a
guard and proper parsing: if self.bins != 4 then either (A) raise a clear error
(e.g., RuntimeError) explaining that strand-specific (12-bin) external matrices
are not supported yet, or (B) implement 12-bin parsing by reading the expected
per-strand columns (instead of
("uniq_raw","uniq_lnorm","combined_raw","combined_lnorm")) into a numpy array of
length 12 and then add that array to gcounts; update annotate_external(), the
local variables counts and gcounts, and the use of row[...] keys accordingly so
shapes always match.

In `@gffquant/counters/count_manager.py`:
- Line 3: Remove the unused top-level import of the csv module in
gffquant/counters/count_manager.py: delete the line importing "csv" (it's unused
in live code and only appears in commented stubs), leaving any commented
examples intact so Flake8 F401 is resolved.

In `@gffquant/handle_args.py`:
- Around line 42-47: The error message references the wrong attribute name
causing an AttributeError; update the f-string in the invalid-input check so it
uses args.gene_counts (the new option) instead of args.gene_count, keeping the
rest of the message intact and consistent with the tuple(map(bool, (has_fastq,
args.bam, args.sam, args.gene_counts))). This change ensures the ValueError is
raised with the correct variables: has_fastq, args.bam, args.sam,
args.gene_counts.

In `@gffquant/profilers/feature_quantifier.py`:
- Around line 331-341: The multi-line for-loop tuple has a hanging-indent style
that triggers E121/E125/E117; fix it by reformatting the tuple so elements are
either on the same line as the opening parenthesis or vertically aligned with
the first element (e.g., place the opening parenthesis immediately after the
assignment or align later lines under the first element), ensure the closing
parenthesis lines up with the start of the for statement, and keep the
logger.info call indented as the loop body referencing self.aln_counter and
metric/value; update the tuple in the for loop in feature_quantifier.py (the for
... in ( ... ) block that calls logger.info("%s: %s", metric,
self.aln_counter.get(value))) accordingly.
- Around line 343-346: In report_alignments(), the rate calculations using
self.aln_counter["full_read_count"] can raise ZeroDivisionError when
full_read_count is 0; modify the code that computes Alignment rate and Filter
pass rate to guard against division by zero by checking full_read_count first
(e.g., if full_read_count > 0 compute the percentage from
aln_counter["read_count"] / full_read_count and
aln_counter["filtered_read_count"] / full_read_count, otherwise set the rates to
0.0 or an appropriate sentinel like "0%"/"N/A") and then pass those safe values
into logger.info instead of performing the raw division directly.
- Around line 134-143: The external_gene_counts branch is calling
Annotator.annotate_external even when Annotator may be RegionCountAnnotator
(when self.run_mode.overlap_required is True), causing AttributeError; modify
the logic in feature_quantifier.py around the Annotator selection so that
annotate_external(...) is only invoked when the annotator is a
GeneCountAnnotator (e.g., guard with a check like not
self.run_mode.overlap_required or isinstance(Annotator(...), GeneCountAnnotator)
before calling annotate_external), and otherwise raise a clear error or fallback
to the annotate(...) path; update the block that sets total_readcount,
filtered_readcount, and has_ambig_counts accordingly so those values are only
set when annotate_external successfully runs.
- Around line 317-318: The JSON stats file is opened in binary mode which causes
json.dump(self.aln_counter, ...) to raise a TypeError; change the open call to
text mode (e.g. with open(f"{self.out_prefix}.aln_stats.json", "w",
encoding="utf-8") as aln_stats_out) so json.dump can write the string output to
aln_stats_out, leaving the variable names self.out_prefix, aln_stats_out and
self.aln_counter unchanged.

In `@gffquant/profilers/reference_hit.py`:
- Around line 23-29: The __eq__ method currently calls asdict() on other and
uses zip(sorted(...)) which can raise TypeError for non-dataclass objects and
silently truncate comparisons; update __eq__ to first check that other is an
instance of the same dataclass (use isinstance(other, type(self))) and if not
return NotImplemented, then compare the full sorted tuples of
asdict(self).items() and asdict(other).items() directly (avoid zip) so differing
field counts or values are detected correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2109f96-b93f-47de-9f05-1479f1a790cf

📥 Commits

Reviewing files that changed from the base of the PR and between cc8870e and 6174e30.

📒 Files selected for processing (7)
  • gffquant/__init__.py
  • gffquant/__main__.py
  • gffquant/annotation/count_annotator.py
  • gffquant/counters/count_manager.py
  • gffquant/handle_args.py
  • gffquant/profilers/feature_quantifier.py
  • gffquant/profilers/reference_hit.py

Comment on lines +333 to +347
with open(fn, "rt", encoding="UTF-8") as _in:
for row in csv.DictReader(_in, delimiter="\t"):
# gene uniq_raw uniq_lnorm uniq_scaled combined_raw combined_lnorm combined_scaled
cols = row["uniq_raw"], row["uniq_lnorm"], row["combined_raw"], row["combined_lnorm"]
counts = tuple(map(float, cols))
ref = row["gene"]

if gene_group_db:
ref_tokens = ref.split(".")
gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
else:
ggroup_id, gene_id = ref, ref

gcounts = self.gene_counts.setdefault(gene_id, np.zeros(self.bins))
gcounts += counts

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject or fully parse strand-specific external matrices.

annotate_external() always builds a 4-value counts tuple, but self.bins is 12 when --strand_specific is enabled. That makes gcounts += counts fail with a NumPy shape mismatch on the first imported row.

💡 Minimal guard if 12-bin import is not supported yet
 def annotate_external(self, fn, db, gene_group_db=False):  # refmgr, db, count_manager, gene_group_db=False):
+    if self.bins != 4:
+        raise ValueError(
+            "--gene_counts currently supports only non-strand-specific count matrices."
+        )
 
     with open(fn, "rt", encoding="UTF-8") as _in:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with open(fn, "rt", encoding="UTF-8") as _in:
for row in csv.DictReader(_in, delimiter="\t"):
# gene uniq_raw uniq_lnorm uniq_scaled combined_raw combined_lnorm combined_scaled
cols = row["uniq_raw"], row["uniq_lnorm"], row["combined_raw"], row["combined_lnorm"]
counts = tuple(map(float, cols))
ref = row["gene"]
if gene_group_db:
ref_tokens = ref.split(".")
gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
else:
ggroup_id, gene_id = ref, ref
gcounts = self.gene_counts.setdefault(gene_id, np.zeros(self.bins))
gcounts += counts
if self.bins != 4:
raise ValueError(
"--gene_counts currently supports only non-strand-specific count matrices."
)
with open(fn, "rt", encoding="UTF-8") as _in:
for row in csv.DictReader(_in, delimiter="\t"):
# gene uniq_raw uniq_lnorm uniq_scaled combined_raw combined_lnorm combined_scaled
cols = row["uniq_raw"], row["uniq_lnorm"], row["combined_raw"], row["combined_lnorm"]
counts = tuple(map(float, cols))
ref = row["gene"]
if gene_group_db:
ref_tokens = ref.split(".")
gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
else:
ggroup_id, gene_id = ref, ref
gcounts = self.gene_counts.setdefault(gene_id, np.zeros(self.bins))
gcounts += counts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gffquant/annotation/count_annotator.py` around lines 333 - 347,
annotate_external() currently always builds a 4-value tuple (counts) and then
adds it to self.gene_counts entries of length self.bins, causing a shape
mismatch when self.bins == 12 (strand-specific). Fix by adding a guard and
proper parsing: if self.bins != 4 then either (A) raise a clear error (e.g.,
RuntimeError) explaining that strand-specific (12-bin) external matrices are not
supported yet, or (B) implement 12-bin parsing by reading the expected
per-strand columns (instead of
("uniq_raw","uniq_lnorm","combined_raw","combined_lnorm")) into a numpy array of
length 12 and then add that array to gcounts; update annotate_external(), the
local variables counts and gcounts, and the use of row[...] keys accordingly so
shapes always match.

Comment thread gffquant/counters/count_manager.py Outdated
Comment thread gffquant/handle_args.py
Comment thread gffquant/profilers/feature_quantifier.py Outdated
Comment thread gffquant/profilers/feature_quantifier.py Outdated
Comment thread gffquant/profilers/feature_quantifier.py Outdated
Comment thread gffquant/profilers/feature_quantifier.py Outdated
Comment thread gffquant/profilers/reference_hit.py Outdated
Comment on lines +23 to +29
def __eq__(self, other):
return all(
item[0][1] == item[1][1]
for item in zip(
sorted(asdict(self).items()),
sorted(asdict(other).items())
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In Python, should eqreturnNotImplementedfor unrelated types, and doesdataclasses.asdict()raiseTypeError when passed a non-dataclass instance?

💡 Result:

For eq: Yes—when your eq can’t compare meaningfully to the other object’s type, it’s recommended to return NotImplemented (rather than False). NotImplemented is the special value for rich-binary operators (including eq) to signal “not implemented with respect to the other type”; when it’s returned, Python will try the reflected operation on the other operand, and if all attempts return NotImplemented, Python falls back (for ==/!=) or raises an appropriate exception depending on the operator. Also note: NotImplemented is a distinct singleton constant (not truthy), and incorrect use can lead to misleading errors or the value being returned to user code [1]. For dataclasses.asdict(): Yes—dataclasses.asdict(obj) raises TypeError when obj is not a dataclass instance. The docs state: “asdict raises TypeError if obj is not a dataclass instance.” [2] and CPython’s implementation checks _is_dataclass_instance(obj) and raises TypeError("asdict() should be called on dataclass instances") [3].

Citations:


🏁 Script executed:

cat -n gffquant/profilers/reference_hit.py | head -40

Repository: cschu/gff_quantifier

Length of output: 1143


Type-check other before calling asdict().

This __eq__ implementation will raise TypeError when comparing against non-dataclass objects instead of returning NotImplemented. Additionally, using zip() can silently truncate if other has fewer fields than self, causing unequal dataclasses to compare as equal. Add an isinstance check and use direct tuple comparison instead.

🔧 Suggested fix
     def __eq__(self, other):
-        return all(
-            item[0][1] == item[1][1]
-            for item in zip(
-                sorted(asdict(self).items()),
-                sorted(asdict(other).items())
-            )
-        )
+        if not isinstance(other, ReferenceHit):
+            return NotImplemented
+        return tuple(asdict(self).values()) == tuple(asdict(other).values())
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 26-29: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gffquant/profilers/reference_hit.py` around lines 23 - 29, The __eq__ method
currently calls asdict() on other and uses zip(sorted(...)) which can raise
TypeError for non-dataclass objects and silently truncate comparisons; update
__eq__ to first check that other is an instance of the same dataclass (use
isinstance(other, type(self))) and if not return NotImplemented, then compare
the full sorted tuples of asdict(self).items() and asdict(other).items()
directly (avoid zip) so differing field counts or values are detected correctly.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
gffquant/handle_args.py (1)

30-31: ⚖️ Poor tradeoff

Consider validating reference/aligner only when needed.

Currently, the reference index validation runs before determining the input type (line 49) and before checking whether --reference/--aligner are valid for the chosen input (lines 51-54). If a user mistakenly provides --gene_counts with --aligner bwa and no --reference, they'll see "Cannot find reference index at None" instead of the more helpful message at line 52 that these options aren't needed.

♻️ Possible refactor

Move lines 30-31 after line 54, or guard them with if has_fastq:, so the reference check only runs when those options are actually required:

if has_fastq:
    if (args.aligner == "bwa" and not check_bwa_index(args.reference)) or \
       (args.aligner == "minimap" and not check_minimap2_index(args.reference)):
        raise ValueError(f"Cannot find reference index at `{args.reference}`.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gffquant/handle_args.py` around lines 30 - 31, The reference-index validation
currently runs unconditionally and can raise confusing errors when the input
mode doesn't require a reference; move the conditional that calls
check_bwa_index/check_minimap2_index (the block using args.aligner and
args.reference) so it only runs when FASTQ-based processing is required (e.g.,
after the input-type detection/has_fastq determination) or wrap it with if
has_fastq: to ensure the checks only occur when the code will use aligner-based
processing; keep the same checks and error message but relocate them after the
logic that validates --gene_counts vs FASTQ inputs (the code around determining
has_fastq and the lines that validate --reference/--aligner usage).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gffquant/handle_args.py`:
- Around line 51-52: The error message raised when (args.reference or
args.aligner) and not has_fastq is misleading because the same branch also
applies when the user provided --gene_counts; update the ValueError text in
handle_args.py to mention that --reference/--aligner are not needed with
alignment input (bam, sam) or when using --gene_counts so users see the complete
reason; locate the conditional that checks (args.reference or args.aligner) and
not has_fastq and change the exception message to include "or --gene_counts".

---

Nitpick comments:
In `@gffquant/handle_args.py`:
- Around line 30-31: The reference-index validation currently runs
unconditionally and can raise confusing errors when the input mode doesn't
require a reference; move the conditional that calls
check_bwa_index/check_minimap2_index (the block using args.aligner and
args.reference) so it only runs when FASTQ-based processing is required (e.g.,
after the input-type detection/has_fastq determination) or wrap it with if
has_fastq: to ensure the checks only occur when the code will use aligner-based
processing; keep the same checks and error message but relocate them after the
logic that validates --gene_counts vs FASTQ inputs (the code around determining
has_fastq and the lines that validate --reference/--aligner usage).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f00e31f7-3c7e-4a15-9734-0e4f74fc01e7

📥 Commits

Reviewing files that changed from the base of the PR and between 6174e30 and 6300970.

📒 Files selected for processing (5)
  • gffquant/annotation/count_annotator.py
  • gffquant/counters/count_manager.py
  • gffquant/handle_args.py
  • gffquant/profilers/feature_quantifier.py
  • gffquant/profilers/reference_hit.py
✅ Files skipped from review due to trivial changes (1)
  • gffquant/counters/count_manager.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • gffquant/profilers/reference_hit.py
  • gffquant/annotation/count_annotator.py
  • gffquant/profilers/feature_quantifier.py

Comment thread gffquant/handle_args.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant