Allow to customize raw CSV file reading by mlell · Pull Request #199 · beancount/beangulp

mlell · 2026-01-04T09:31:43Z

While CSVReader has numerous options to skip header or footer lines, or to skip lines based on prefixes, it is cumbersome to read files with a varying number of lines before or after the actual statements. Users would need to override CSVReader.read(). Extending it via super().read(...) is possible, but requires creating a temporary file because read() needs a physical file path. Adding an open(str) -> file-like method to CSVReader makes it possible to keep the useful functionality of read() while still preprocessing the file, for example by skipping a variable number of lines based on a pattern.
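To make the proposal concrete, here is a minimal stand-alone sketch of the hook idea. `SketchReader` and `SkipPreamble` are hypothetical stand-ins and not beangulp's `CSVReader`; they only mimic the split between `open()` and `read()`, and the `Date,` header pattern is an assumption chosen for the demo.

```python
import csv
import io
import os
import re
import tempfile


class SketchReader:
    """Hypothetical stand-in, NOT beangulp's CSVReader."""

    encoding = "utf-8"

    def open(self, filepath):
        # Default hook: return the file opened in text mode.
        return open(filepath, encoding=self.encoding, newline="")

    def read(self, filepath):
        # read() only consumes what open() returns.
        with self.open(filepath) as fd:
            return list(csv.DictReader(fd))


class SkipPreamble(SketchReader):
    """Drops a variable number of lines before the header row."""

    header = re.compile(r"^Date,")  # assumed header pattern

    def open(self, filepath):
        with super().open(filepath) as fd:
            lines = fd.readlines()
        # Index of the first line that looks like the header (0 if none does).
        start = next(
            (i for i, line in enumerate(lines) if self.header.match(line)), 0
        )
        return io.StringIO("".join(lines[start:]))


def demo():
    text = "Bank export\nAccount 123\n\nDate,Amount\n2026-01-04,10.00\n"
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
    ) as tmp:
        tmp.write(text)
    try:
        return SkipPreamble().read(tmp.name)
    finally:
        os.unlink(tmp.name)
```

In this sketch the subclass never touches `read()`, so no temporary file is needed for the preprocessing itself; the temporary file in `demo()` only provides sample input.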

This would fix #196.

johannesjh · 2026-01-05T13:19:52Z

I second the proposition. It would also help with #197.

Instead of strictly requiring a file-like object, the importer could accept a more general Iterable[str]. Users could still pass file-like objects, but also generator expressions. Generator expressions could be a convenient and idiomatic way of manipulating or filtering CSV lines.
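As an illustration of why an iterable of strings is enough, the standard library's `csv` readers already accept one. The sample lines below are invented for the example, and the filtering rule (drop blank lines and `#` comments) is only an assumption about what a user might want.

```python
import csv

raw_lines = [
    "# exported 2026-01-05\n",
    "Date,Amount\n",
    "\n",
    "2026-01-04,10.00\n",
    "2026-01-05,-3.50\n",
]

# Generator expression: lazily drop blank lines and comment lines.
cleaned = (
    line for line in raw_lines
    if line.strip() and not line.startswith("#")
)

# csv.DictReader consumes the generator like it would a file object.
rows = list(csv.DictReader(cleaned))
```

The same generator expression works unchanged if `raw_lines` is replaced by an open text file, since iterating a file also yields lines.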

dnicolodi · 2026-01-06T20:41:38Z

beangulp/importers/csvbase.py

+        This method uses the class members header, footer, names, dialect,
+        comments, and order. Overriding this method causes those members
+        to be ignored unless the overriding method explicitly uses them.
+


This change does not belong to this commit or this PR. I also find it unnecessary.

The rationale was that there would be two ways to customize importer behaviour which partially overlap. Overriding "open()" would render the member "encoding" non-functional, for example. Will remove it.

done in a5561e3

dnicolodi · 2026-01-06T20:44:35Z

beangulp/importers/csvbase.py

+        """Open the CSV file for reading.
+
+        This method can be overridden in subclasses to customize raw file reading,
+        for example to skip lines before processing or to handle special file formats.


What does "special file formats" mean here? This class supports a declarative way to define an importer for CSV files. Why would you want to use it for something else?

I see that this is unclearly phrased. I wanted to refer to the fact that CSV has a large number of strange variants.

Also, I think there are grey areas, like compressed CSV or XLSX, which are not strictly CSV but are conceptually similar and can easily be converted.

dnicolodi · 2026-01-06T20:58:43Z

beangulp/importers/csvbase_test.py

+        class CustomReader(CSVReader):
+            first = Column("First")
+            second = Column("Second")
+
+            def open(self, filepath):
+                """Skip lines until we find the column headers."""
+                fd = super().open(filepath)
+                # Read lines until we find one containing "First"
+                for line in fd:
+                    if "First" in line:
+                        # Create a new file-like object with the header line and remaining content
+                        remaining = fd.read()
+                        fd.close()
+                        return io.StringIO(line + remaining)
+                return fd


With the iterlines() idea exposed above, this would become simply:

    from itertools import dropwhile

    class Reader(CSVReader):
        first = Column("First")
        second = Column("Second")

        def iterlines(self, fd):
            return dropwhile(lambda line: "First" not in line, fd)

which seems much better to me.

Reader class is now:

    class Reader(CSVReader):
        first = Column("First")
        second = Column("Second")

        def open(self, filepath):
            """Skip lines until we find the column headers."""
            lines = super().open(filepath)
            return dropwhile(lambda line: "First" not in line, lines)
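A quick stand-alone check of what `dropwhile` does in this role may help: it discards lines only until the predicate first becomes false, so data rows after the header are kept even though they do not contain "First". The sample text is invented, and `csv.reader` stands in for whatever consumes the lines inside the importer.

```python
import csv
import io
from itertools import dropwhile

text = (
    "Statement for account 123\n"
    "Generated 2026-01-06\n"
    "First,Second\n"
    "a,b\n"
    "c,d\n"
)

lines = io.StringIO(text)

# Skip the preamble; everything from the header row onwards passes through.
kept = dropwhile(lambda line: "First" not in line, lines)

rows = list(csv.reader(kept))
```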

mlell · 2026-01-07T11:37:22Z

Changed the return type of open() to be an iterable (e.g., a generator), so that generator expressions can be used
Removed the change to the docstring of read()
Updated the unit test to use the dropwhile approach

While it is true that this is a reader for CSV data only, I felt that there might be some formats, like compressed data or XLSX, where it might be useful to first open the file in binary mode, preprocess it, and then hand it over to read(). This would not be possible with the iterlines() approach. However, you might have experience with more bank statement formats than I have. If you think no one is going to need this, the iterlines() approach would be simpler.
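A sketch of the compressed-file case: a hook that receives the file path can decompress before any text lines exist, which a hook that only receives an already opened text stream cannot do. `open_gzipped_csv` is a hypothetical helper written for this example, not part of beangulp, and the "First" header check is carried over from the test case in this PR.

```python
import csv
import gzip
import os
import tempfile
from itertools import dropwhile


def open_gzipped_csv(filepath, encoding="utf-8"):
    """Hypothetical helper: decompress and yield lines from the header on."""
    # gzip.open() in "rt" mode decodes the stream to text lines.
    with gzip.open(filepath, "rt", encoding=encoding, newline="") as fd:
        yield from dropwhile(lambda line: "First" not in line, fd)


def demo():
    payload = "junk\nFirst,Second\n1,2\n"
    # Reserve a temporary path, then write a gzip-compressed CSV to it.
    with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as tmp:
        path = tmp.name
    try:
        with gzip.open(path, "wt", encoding="utf-8", newline="") as fd:
            fd.write(payload)
        return list(csv.reader(open_gzipped_csv(path)))
    finally:
        os.unlink(path)
```

Because the helper is a generator, the compressed file stays open only while lines are being consumed and is closed once the generator is exhausted.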

dnicolodi

Better, but it still needs some tweaks.

beangulp/importers/csvbase.py

This allows subclasses to customize raw file reading.

mlell · 2026-02-10T14:18:01Z

I have applied your suggestions and rebased the branch onto the latest version! What do you think, @dnicolodi?

mlell mentioned this pull request Jan 4, 2026

Make the CSV Base Importer's date function robust against empty lines #197

Open

dnicolodi requested changes Jan 6, 2026

mlell force-pushed the dev_override_open branch 2 times, most recently from fd4b7bf to a5561e3 Compare January 7, 2026 11:26

dnicolodi requested changes Jan 7, 2026

mlell force-pushed the dev_override_open branch from 770738f to 97d7b24 Compare February 10, 2026 14:09

Add open() to CSVReader interface

60670cd

This allows subclasses to customize raw file reading.

mlell force-pushed the dev_override_open branch from 97d7b24 to 60670cd Compare February 10, 2026 14:11

mlell requested a review from dnicolodi February 10, 2026 14:17

3 participants