
Support upload via signed-urls #227

Open
simo-prior wants to merge 19 commits into main from eng-533/tabpfn-client-signed-urls

Conversation

@simo-prior (Collaborator) commented Mar 4, 2026

Description

  • Support upload via signed URLs.
  • Dataset limits are downloaded from the server at init().
  • Datasets are converted to compressed Parquet before uploading.

@simo-prior simo-prior requested a review from a team as a code owner March 4, 2026 17:47
@simo-prior simo-prior requested review from ggprior and removed request for a team March 4, 2026 17:47
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@simo-prior simo-prior changed the title Support for upload via signed-urls Support upload via signed-urls Mar 4, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request introduces a robust and efficient data upload mechanism by leveraging signed URLs for direct cloud storage interaction and adopting compressed Parquet format for datasets. It also centralizes dataset limit management by fetching them from the server, ensuring consistent validation across the client. These changes collectively improve data transfer security, speed, and maintainability.

Highlights

  • Signed URL Uploads: Implemented support for uploading datasets using signed URLs, enhancing security and efficiency for data transfer to the server. This involves a new workflow where the client requests signed URLs from the server, then directly uploads data to a cloud storage bucket.
  • Server-Side Dataset Limit Enforcement: Integrated server-side dataset limits for maximum size, cells, columns, and classes. These limits are now fetched from the server during client initialization and enforced client-side before data upload, providing more dynamic and consistent validation.
  • Parquet Compression for Datasets: Converted dataset serialization from CSV to compressed Parquet format (using zstd compression) before uploading. This significantly reduces data size and improves upload performance.
  • Parallel Chunked Uploads: Enabled parallel uploading of dataset chunks to cloud storage when multiple signed URLs are provided, leveraging ThreadPoolExecutor for faster data transfer.
  • Dependency Updates: Added google-crc32c as a new dependency for CRC32C checksum calculation during uploads and updated various other package versions in uv.lock.
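
The parallel chunked upload highlighted above can be sketched with `concurrent.futures.ThreadPoolExecutor`. This is an illustration under assumptions: `upload_chunk` stands in for the real signed-URL PUT (`_upload_single_chunk` in the PR), and the value of the parallelism constant is invented.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_UPLOAD_PARALLELISM = 4  # constant name from the PR; value assumed

def upload_chunks(chunks, signed_urls, upload_chunk):
    """Upload each chunk to its signed URL with bounded parallelism.

    `upload_chunk(url, chunk)` is the injected per-chunk transfer call.
    """
    with ThreadPoolExecutor(max_workers=CHUNK_UPLOAD_PARALLELISM) as pool:
        # Executor.map preserves input order and re-raises the first
        # worker exception when results are consumed, so a failed chunk
        # surfaces to the caller instead of being silently dropped.
        return list(pool.map(upload_chunk, signed_urls, chunks))
```

Bounding `max_workers` keeps memory and connection counts predictable when a dataset is split into many chunks.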


Changelog
  • pyproject.toml
    • Added google-crc32c to project dependencies.
    • Updated tabpfn-client version to 0.2.9rc6.
  • src/tabpfn_client/client.py
    • Imported new modules including base64, concurrent.futures, io, struct, google_crc32c, pandas, and uuid.
    • Introduced _dataset_limits global variable and CHUNK_UPLOAD_PARALLELISM constant.
    • Defined new Pydantic models (FileInfo, FileUploadInfo, PrepareTrainSetUploadRequest, PrepareTrainSetUploadResponse, DuplicateFilesUploadedResponse, GetDatasetLimitsResponse) for GAPI interactions.
    • Added _serialize_to_parquet function to convert dataframes to zstd-compressed Parquet bytes and calculate CRC32C hash.
    • Modified ServiceClient to follow_redirects=True for HTTPX client.
    • Added get_dataset_limits class method to ServiceClient to fetch dataset constraints from the server.
    • Refactored the fit method to use Parquet serialization, enforce server-side limits, prepare uploads via a new GAPI endpoint, and perform parallel chunked uploads to GCS.
    • Removed the backoff decorator from the main fit method and applied it to a new internal _fit method for stream processing and _upload_single_chunk for GCS uploads.
    • Introduced _prepare_train_set_upload method to handle the initial request for signed URLs from the server.
    • Added _upload_to_gcs and _upload_single_chunk methods to manage the actual data transfer to Google Cloud Storage using signed URLs and CRC32C checksums.
    • Removed dataset UID caching logic from the fit method.
  • src/tabpfn_client/config.py
    • Removed a redundant connection accessibility check during client initialization.
    • Added a call to ServiceClient.get_dataset_limits() during client initialization to pre-fetch server constraints.
    • Updated docstrings for clarity regarding init and reset functions.
  • src/tabpfn_client/estimator.py
    • Imported ServiceClient for accessing server-side limits.
    • Removed hardcoded MAX_ROWS, MAX_COLS, and MAX_NUMBER_OF_CLASSES constants.
    • Updated _validate_targets_and_classes to use ServiceClient.get_dataset_limits() for max_classes validation.
    • Updated validate_data_size to use ServiceClient.get_dataset_limits() for max_cells and max_cols validation.
    • Added minor formatting adjustments (newlines) in fit and predict methods.
  • tests/unit/test_tabpfn_classifier.py
    • Imported new client modules (GetDatasetLimitsResponse, ServiceClient).
    • Added ServiceClient.reset_authorization() and client_module._dataset_limits = None to setUp and tearDown methods to ensure test isolation and clean state.
    • Updated mock server responses to include the /tabpfn/get_dataset_limits/ endpoint with dummy limit data.
    • Modified mock_prompt_and_set_token to return True for successful token setting.
    • Removed a mock for protected_root.path in some tests.
    • Adjusted data size check tests to align with the new server-side limit fetching and validation logic.
    • Ensured no cached token and reset authorization in test_cache_based_on_paper_version.
  • tests/unit/test_tabpfn_regressor.py
    • Imported new client modules (GetDatasetLimitsResponse, ServiceClient).
    • Added ServiceClient.reset_authorization() and client_module._dataset_limits = None to setUp and tearDown methods for test isolation.
    • Updated mock server responses to include the /tabpfn/get_dataset_limits/ endpoint with dummy limit data.
    • Modified mock_prompt_and_set_token to return True for successful token setting.
    • Removed a mock for protected_root.path in some tests.
    • Adjusted data size check tests to align with the new server-side limit fetching and validation logic.
    • Ensured no cached token and reset authorization in test_cache_based_on_paper_version.
  • uv.lock
    • Updated tabpfn-client version to 0.2.9rc6.
    • Added google-crc32c dependency.
    • Updated pre-commit to 4.3.0.
    • Updated ruff to 0.15.1.
    • Added licensecheck and its dependencies (appdirs, attrs, boolean-py, cattrs, fhconfparser, license-expression, loguru, markdown, requests-cache, requirements-parser, url-normalize, uv, win32-setctime, zipp, importlib-metadata).
    • Updated various other package versions to their latest compatible releases.

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request introduces a significant change to the data upload mechanism, switching to signed URLs and parquet format for better performance and security. It also fetches dataset limits from the server at initialization, making the client more robust to server-side changes. The implementation is solid, but I have a few suggestions to improve maintainability, resource management, and error handling, particularly around the new upload logic and data caching.

Note: Security Review did not run due to the size of the PR.

@simo-prior force-pushed the eng-533/tabpfn-client-signed-urls branch from d783fee to c65b8da on March 5, 2026 17:57