Allow to customize raw CSV file reading #199
I second the proposition. It would also help with #197. Instead of strictly requiring a file-like object, the importer could allow more general
beangulp/importers/csvbase.py
Outdated
    This method uses the class members header, footer, names, dialect,
    comments, and order. Overriding this method causes those members
    to be ignored unless the overriding method explicitly uses them.
This change does not belong to this commit or this PR. I also find it unnecessary.
The rationale was that there would be two ways to customize importer behaviour which partially overlap. Overriding "open()" would render the member "encoding" non-functional, for example. Will remove it.
beangulp/importers/csvbase.py
Outdated
    """Open the CSV file for reading.

    This method can be overridden in subclasses to customize raw file reading,
    for example to skip lines before processing or to handle special file formats.
What does "special file formats" mean here? This class supports a declarative way to define an importer for CSV files. Why would you want to use it for something else?
I see that this is unclearly phrased. I wanted to refer to the fact that CSV has a large number of strange variants.
Also, I think there are grey areas, like compressed CSV or XLSX, which are not strictly CSV but are conceptually similar and can easily be converted.
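To make the compressed-CSV case concrete, here is a minimal sketch using only the standard library. The helper name `open_text` and the `.gz` suffix check are assumptions for illustration; nothing here is part of beangulp.

```python
import csv
import gzip
import os
import tempfile


def open_text(filepath, encoding="utf-8"):
    """Return a text-mode file object, decompressing .gz files on the fly.

    Hypothetical helper for illustration, not part of beangulp.
    """
    if filepath.endswith(".gz"):
        # "rt" makes gzip decode the bytes, so callers see ordinary text lines.
        return gzip.open(filepath, "rt", encoding=encoding, newline="")
    return open(filepath, encoding=encoding, newline="")


# Usage: write a small compressed CSV, then read it back through the helper.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "statement.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as out:
        out.write("First,Second\n1,2\n")
    with open_text(path) as fd:
        gz_rows = list(csv.reader(fd))
```

Because the result is a regular text-mode file object, code that expects lines of CSV text does not need to know the source was compressed.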
beangulp/importers/csvbase_test.py
Outdated
class CustomReader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def open(self, filepath):
        """Skip lines until we find the column headers."""
        fd = super().open(filepath)
        # Read lines until we find one containing "First"
        for line in fd:
            if "First" in line:
                # Create a new file-like object with the header line and remaining content
                remaining = fd.read()
                fd.close()
                return io.StringIO(line + remaining)
        return fd
With the iterlines() idea exposed above, this would become simply:

from itertools import dropwhile

class Reader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def iterlines(self, fd):
        return dropwhile(lambda line: "First" not in line, fd)

which seems much better to me.
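The `dropwhile` approach works because the standard `csv` readers accept any iterable of strings, not only file objects. A minimal standalone sketch, where the sample data and the helper name `skip_preamble` are made up for illustration:

```python
import csv
import io
from itertools import dropwhile

# Hypothetical sample: an export with a preamble of unknown length.
RAW = """Account statement
Generated 2024-01-01

First,Second
1,2
3,4
"""


def skip_preamble(lines, marker="First"):
    """Drop lines until one contains the marker; keep that line and the rest."""
    return dropwhile(lambda line: marker not in line, lines)


with io.StringIO(RAW) as fd:
    # DictReader consumes the filtered iterator as if it were a file.
    rows = list(csv.DictReader(skip_preamble(fd)))
```

The header line itself is kept, since `dropwhile` stops dropping at the first line for which the predicate is false.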
Reader class is now:

class Reader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def open(self, filepath):
        """Skip lines until we find the column headers."""
        lines = super().open(filepath)
        return dropwhile(lambda line: "First" not in line, lines)

fd4b7bf to a5561e3
While it is true that this is a reader for CSV data only, I felt that there might be some formats, like compressed data or XLSX, where it might be useful to first open in binary mode, preprocess and then hand over to
dnicolodi
left a comment
Better, but it needs some tweaks.
770738f to 97d7b24
This allows subclasses to customize raw file reading.
97d7b24 to 60670cd
I have added your suggestions and rebased the branch to the latest version! What do you think, @dnicolodi?
While CSVReader has numerous options to skip header or footer lines, or to skip lines based on prefixes, it is cumbersome to read files with a varying number of lines before or after the actual statements. Users would need to override CSVReader.read(). Extending it via super().read(...) is possible but requires creating a temporary file, because read() needs a physical file path. Adding an open(str) -> file-like method to CSVReader keeps the useful functionality of read() while still allowing the file to be preprocessed, for example by skipping a variable number of lines based on a pattern. This would fix #196.
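The design can be sketched without beangulp. `MiniReader` and `SkippingReader` below are hypothetical stand-ins, not the actual CSVReader implementation; the sketch only assumes that `read()` obtains its lines through an overridable `open()` hook.

```python
import csv
import os
import tempfile
from itertools import dropwhile


class MiniReader:
    """Hypothetical stand-in for a CSV reader with an overridable open()."""

    encoding = "utf-8"

    def open(self, filepath):
        # Default hook: plain text-mode open.
        return open(filepath, encoding=self.encoding, newline="")

    def read(self, filepath):
        # read() only needs an iterable of lines, so it works with whatever
        # open() returns: a file object, a list, or a filtered iterator.
        lines = self.open(filepath)
        try:
            return list(csv.DictReader(lines))
        finally:
            close = getattr(lines, "close", None)
            if close is not None:
                close()


class SkippingReader(MiniReader):
    def open(self, filepath):
        """Skip lines until the one containing the column headers."""
        with super().open(filepath) as fd:
            # Materialize the remaining lines so the file can be closed here.
            return list(dropwhile(lambda line: "First" not in line, fd))


# Usage: a file with a preamble of unknown length, read without a temp copy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "statement.csv")
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write("Some bank\nExported today\nFirst,Second\n1,2\n")
    parsed = SkippingReader().read(path)
```

The subclass changes only how lines are obtained; the parsing in `read()` is reused unchanged, which is the point of the proposed hook.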