Skip to content

Conversation

@daniel-thom
Copy link
Collaborator

This is a prototype, mostly generated by Claude. The goal is to see if we can simplify interaction with dsgrid when running Spark jobs. The current code based on SQLAlchemy requires a separate Spark session: dsgrid creates a session with pyspark and chronify relies on an Apache Thrift Server (Hive).

We would get the following benefits by migrating:

  • Pass an Ibis Table object from dsgrid to chronify to perform time validation instead of only a path to a Parquet file.
  • Support of PyHive (the SQLAlchemy driver for Hive) is unclear.
  • dsgrid could drop much of its special handling for Spark vs DuckDB (DuckDB has an experimental Spark API, but it is incomplete and has an uncertain future). Ibis appears to be a better long term solution.

We would lose this functionality in SQLAlchemy:

  • Database transactions with rollback. Ibis does not support this natively.
  • We currently allow the user to ingest rows from multiple DataFrames into an existing table. If the first DataFrame is valid but the second is not, we perform a rollback and the state of the database is the same as the original state. With Ibis, we do not have code to delete the added rows. (It could be done with special-casing for backends that support it.)
  • This is not import for dsgrid as we do not ingest data like this. We need to ask other chronify users.

Outstanding work:

  • Some tests are failing due to time zone / DST handling with Spark.
  • Talk to other chronify users about dropping transaction support.

@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 87.45174% with 195 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (ll/local_time2@4f2a3f3). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/chronify/ibis/spark_backend.py 73.01% 34 Missing ⚠️
src/chronify/ibis/functions.py 77.44% 30 Missing ⚠️
src/chronify/ibis/types.py 51.92% 25 Missing ⚠️
src/chronify/store.py 87.93% 21 Missing ⚠️
src/chronify/time_series_mapper_base.py 86.33% 19 Missing ⚠️
src/chronify/ibis/duckdb_backend.py 79.74% 16 Missing ⚠️
src/chronify/ibis/sqlite_backend.py 80.72% 16 Missing ⚠️
src/chronify/ibis/base.py 80.00% 14 Missing ⚠️
tests/conftest.py 87.80% 5 Missing ⚠️
tests/test_spark_backend.py 97.91% 4 Missing ⚠️
... and 7 more
Additional details and impacted files
@@                Coverage Diff                @@
##             ll/local_time2      #62   +/-   ##
=================================================
  Coverage                  ?   91.93%           
=================================================
  Files                     ?       55           
  Lines                     ?     4947           
  Branches                  ?        0           
=================================================
  Hits                      ?     4548           
  Misses                    ?      399           
  Partials                  ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants