Add timeout and HEAD request validation to get_dataframe#60
Conversation
- Add `timeout` parameter (default 30s) to `get_dataframe()` that raises `TimeoutError` if `read_fwf` takes too long, using ThreadPoolExecutor - Add HEAD request before downloading to quickly detect bad URLs (404s) and raise `requests.HTTPError` - Add VCR cassette for HEAD request in existing test - Add tests for timeout and bad URL scenarios Agent-Logs-Url: https://github.com/palewire/atcf-data-parser/sessions/146372fe-4841-401a-a163-8ad6087cf5f1 Co-authored-by: palewire <9993+palewire@users.noreply.github.com>
Agent-Logs-Url: https://github.com/palewire/atcf-data-parser/sessions/146372fe-4841-401a-a163-8ad6087cf5f1 Co-authored-by: palewire <9993+palewire@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates get_dataframe to fail faster and avoid indefinite hangs by adding a pre-download URL validation step and a configurable parsing timeout.
Changes:
- Add a
timeoutparameter toget_dataframeand runpd.read_fwfin an executor with a timeout. - Perform a
HEADrequest pre-check (raise_for_status) before downloading. - Add tests for timeout and bad-URL behavior, and update the VCR cassette to include the new
HEADinteraction.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| atcf_data_parser/init.py | Adds timeout to get_dataframe, performs HEAD validation, and wraps parsing with an executor timeout. |
| tests/test_get.py | Adds mocked unit tests for timeout behavior and bad URL handling. |
| tests/cassettes/test_get_dataframe.yaml | Records the additional HEAD request interaction for VCR. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… speed up test
- Add timeout parameter to get_gzipped_url() and pass it to requests.get()
- Pass timeout from get_dataframe() to get_gzipped_url() call
- Fix singular/plural in timeout error message ("1 second" vs "N seconds")
- Replace time.sleep(5) in test with threading.Event for instant unblocking
Agent-Logs-Url: https://github.com/palewire/atcf-data-parser/sessions/b7961136-59dd-42b4-a2ba-d1df6c800975
Co-authored-by: palewire <9993+palewire@users.noreply.github.com>
|
@copy, does the read_fwf need the timeout and future if the actual failure is happening when the URL's data in fetched? Could we simply do the probe and the timeout there, since the data will be in memory by the time it gets to pandas? Also: Make sure all linting tests pass before committing. |
|
@copilot, does the read_fwf need the timeout and future if the actual failure is happening when the URL's data in fetched? Could we simply do the probe and the timeout there, since the data will be in memory by the time it gets to pandas? Also: Make sure all linting tests pass before committing. |
…_fwf call Agent-Logs-Url: https://github.com/palewire/atcf-data-parser/sessions/fd75d446-552f-4135-bc7e-8d0fa63b2d23 Co-authored-by: palewire <9993+palewire@users.noreply.github.com>
You're right — since the data is fully in memory before |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
get_dataframecan hang indefinitely when downloads take too long, and bad URLs aren't detected until after a full download attempt.Changes
timeoutparameter (default 30s) toget_dataframe, passed through toget_gzipped_url()and its underlyingrequests.get()call, so downloads respect the timeout.requests.head()before downloading to fail fast on bad URLs (404s, etc.) viaraise_for_status().test_get_dataframe_bad_urlusing mocks. Updated VCR cassette to include the HEAD interaction.Usage