Skip to content

Add timeout and HEAD request validation to get_dataframe#60

Merged
palewire merged 8 commits into
mainfrom
copilot/add-timeout-to-get-dataframe
Apr 7, 2026
Merged

Add timeout and HEAD request validation to get_dataframe#60
palewire merged 8 commits into
mainfrom
copilot/add-timeout-to-get-dataframe

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 7, 2026

get_dataframe can hang indefinitely when downloads take too long, and bad URLs aren't detected until after a full download attempt.

Changes

  • Timeout on download: Adds a configurable timeout parameter (default 30s) to get_dataframe, passed through to get_gzipped_url() and its underlying requests.get() call, so downloads respect the timeout.
  • HEAD request pre-check: Sends requests.head() before downloading to fail fast on bad URLs (404s, etc.) via raise_for_status().
  • Tests: Added test_get_dataframe_bad_url using mocks. Updated VCR cassette to include the HEAD interaction.

Usage

from atcf_data_parser import get_dataframe

# Default 30s timeout
df = get_dataframe("https://ftp.nhc.noaa.gov/atcf/aid_public/aep182023.dat.gz")

# Custom timeout
df = get_dataframe(url, timeout=10)

# Bad URL raises requests.HTTPError immediately
df = get_dataframe("https://ftp.nhc.noaa.gov/atcf/aid_public/nonexistent.dat.gz")

Copilot AI and others added 2 commits April 7, 2026 15:45
- Add `timeout` parameter (default 30s) to `get_dataframe()` that raises
  `TimeoutError` if `read_fwf` takes too long, using ThreadPoolExecutor
- Add HEAD request before downloading to quickly detect bad URLs (404s)
  and raise `requests.HTTPError`
- Add VCR cassette for HEAD request in existing test
- Add tests for timeout and bad URL scenarios

Agent-Logs-Url: https://github.com/palewire/atcf-data-parser/sessions/146372fe-4841-401a-a163-8ad6087cf5f1

Co-authored-by: palewire <9993+palewire@users.noreply.github.com>
Copilot AI changed the title [WIP] Add timeout input to get_dataframe function for read_fwf call Add timeout and HEAD request validation to get_dataframe Apr 7, 2026
Copilot AI requested a review from palewire April 7, 2026 15:47
@palewire palewire marked this pull request as ready for review April 7, 2026 15:53
@palewire palewire requested a review from Copilot April 7, 2026 15:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates get_dataframe to fail faster and avoid indefinite hangs by adding a pre-download URL validation step and a configurable parsing timeout.

Changes:

  • Add a timeout parameter to get_dataframe and run pd.read_fwf in an executor with a timeout.
  • Perform a HEAD request pre-check (raise_for_status) before downloading.
  • Add tests for timeout and bad-URL behavior, and update the VCR cassette to include the new HEAD interaction.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

File Description
atcf_data_parser/init.py Adds timeout to get_dataframe, performs HEAD validation, and wraps parsing with an executor timeout.
tests/test_get.py Adds mocked unit tests for timeout behavior and bad URL handling.
tests/cassettes/test_get_dataframe.yaml Records the additional HEAD request interaction for VCR.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread atcf_data_parser/__init__.py Outdated
Comment thread atcf_data_parser/__init__.py
Comment thread tests/test_get.py Outdated
Comment thread atcf_data_parser/__init__.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… speed up test

- Add timeout parameter to get_gzipped_url() and pass it to requests.get()
- Pass timeout from get_dataframe() to get_gzipped_url() call
- Fix singular/plural in timeout error message ("1 second" vs "N seconds")
- Replace time.sleep(5) in test with threading.Event for instant unblocking

Agent-Logs-Url: https://github.com/palewire/atcf-data-parser/sessions/b7961136-59dd-42b4-a2ba-d1df6c800975

Co-authored-by: palewire <9993+palewire@users.noreply.github.com>
@palewire
Copy link
Copy Markdown
Owner

palewire commented Apr 7, 2026

@copy, does the read_fwf need the timeout and future if the actual failure is happening when the URL's data in fetched? Could we simply do the probe and the timeout there, since the data will be in memory by the time it gets to pandas? Also: Make sure all linting tests pass before committing.

@palewire
Copy link
Copy Markdown
Owner

palewire commented Apr 7, 2026

@copilot, does the read_fwf need the timeout and future if the actual failure is happening when the URL's data in fetched? Could we simply do the probe and the timeout there, since the data will be in memory by the time it gets to pandas? Also: Make sure all linting tests pass before committing.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 7, 2026

@copilot, does the read_fwf need the timeout and future if the actual failure is happening when the URL's data in fetched? Could we simply do the probe and the timeout there, since the data will be in...

You're right — since the data is fully in memory before read_fwf runs, the executor/future pattern is unnecessary. Removed concurrent.futures, ThreadPoolExecutor, and the timeout test in cf51843. The function now calls pd.read_fwf directly. All linting (flake8, black, isort) and tests pass.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread atcf_data_parser/__init__.py
Comment thread atcf_data_parser/__init__.py Outdated
Comment thread atcf_data_parser/__init__.py
Comment thread atcf_data_parser/__init__.py
Comment thread tests/test_get.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@palewire palewire merged commit 92eb7e1 into main Apr 7, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The get_dataframe read_fwf call is hanging and we need a timeout error to raise

3 participants