Add unit/integration test for ingestion pipeline #474

Merged
vish-cs merged 1 commit into datacommonsorg:master from vish-cs:itest
Feb 18, 2026

vish-cs commented Feb 10, 2026

Added a unit test that runs the ingestion pipeline with a MockSpannerClient to test the mutations.
Added an integration test that can run the ingestion pipeline in LOCAL mode (using the Spanner emulator in a Docker container) or in DATAFLOW mode, where the pipeline runs in GCP.
Renamed ImportGroupPipeline to GraphIngestionPipeline.
Cleaned up the schema file to remove the import workflow table definitions, as these will be maintained in the data repo instead.
Updated the Cloud Build file to run the integration test as part of the build.
Removed the template scripts due to the switch to Cloud Build.
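
The mock-client approach in the unit test can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical and stands in for the real classes: `RowMutation`, `MutationSink`, `RecordingSink`, `ToyIngestion`, the `Node` table name, and the column names are illustrative assumptions, not the repository's `SpannerClient` or Beam's `Mutation` API. The idea is only that the pipeline writes through a seam, and the test swaps in a recorder so the generated mutations can be asserted without a database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Spanner mutation: a table name plus column values.
final class RowMutation {
    final String table;
    final Map<String, String> columns;

    RowMutation(String table, Map<String, String> columns) {
        this.table = table;
        this.columns = Map.copyOf(columns); // defensive, immutable copy
    }
}

// Hypothetical seam the pipeline writes through (not the repo's SpannerClient API).
interface MutationSink {
    void write(RowMutation mutation);
}

// Test double that records every mutation instead of sending it to a database.
final class RecordingSink implements MutationSink {
    private final List<RowMutation> captured = new ArrayList<>();

    @Override
    public void write(RowMutation mutation) {
        captured.add(mutation);
    }

    // Returns an immutable snapshot of what has been written so far.
    List<RowMutation> captured() {
        return List.copyOf(captured);
    }
}

// Toy "pipeline" step: turns (id, value) pairs into mutations on an assumed Node table.
final class ToyIngestion {
    static void ingest(Map<String, String> nodes, MutationSink sink) {
        for (Map.Entry<String, String> e : nodes.entrySet()) {
            sink.write(new RowMutation(
                "Node", Map.of("subject_id", e.getKey(), "value", e.getValue())));
        }
    }
}
```

A test would call `ToyIngestion.ingest(...)` with a `RecordingSink` and then assert on `captured()`, which keeps the check fast and independent of any emulator or cloud resource.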

gemini-code-assist bot commented Feb 10, 2026

Summary of Changes

Hello @vish-cs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the testing framework for the ingestion pipeline by introducing dedicated unit and integration tests. The integration tests are designed to validate the pipeline's behavior in both local development environments using a Spanner emulator and actual Google Cloud Dataflow, ensuring robust and reliable data ingestion. This change improves the overall stability and correctness of the pipeline's interaction with Spanner.

Highlights

  • Dataflow Integration Test: A new Cloud Build step was added to run Dataflow integration tests for the ingestion pipeline, ensuring end-to-end functionality in a cloud environment.
  • Comprehensive Pipeline Testing: Introduced both unit and integration tests for the ImportGroupPipeline, covering local execution with a Spanner emulator and cloud execution via Dataflow.

Changelog

  • pipeline/ingestion/cloudbuild.yaml
    • Added a new Cloud Build step to execute Dataflow integration tests.
    • Included new substitution variables for Dataflow project ID, Spanner instance/database IDs, GCS buckets, and region to support integration testing.
  • pipeline/ingestion/src/test/java/org/datacommons/ingestion/pipeline/ImportGroupPipelineIntegrationIT.java
    • Added a new integration test class for ImportGroupPipeline.
    • Implemented setup for both local (Testcontainers Spanner emulator) and Dataflow environments.
    • Included a test method to execute the pipeline and verify data persistence in Spanner.
  • pipeline/ingestion/src/test/java/org/datacommons/ingestion/pipeline/ImportGroupPipelineTest.java
    • Added a new unit test class for ImportGroupPipeline.
    • Implemented a test that uses a mocked SpannerClient to capture and verify mutations generated by the pipeline.
  • pipeline/ingestion/src/test/resources/docker-java.properties
    • Added a new resource file specifying the Docker API version for Testcontainers.
  • pipeline/spanner/src/main/java/org/datacommons/ingestion/spanner/SpannerClient.java
    • Corrected a type cast in the getWriteGroupedTransform method to explicitly cast to SpannerIO.Write.
  • pipeline/util/src/main/java/org/datacommons/ingestion/util/GraphReader.java
    • Added a comment regarding the maximum size for Spanner column values.
    • Modified graphToNodes to explicitly set types(List.of(ValueType.TEXT.toString())) for nodes with a value.
  • pipeline/util/src/test/java/org/datacommons/ingestion/util/GraphReaderTest.java
    • Updated expected Node objects in testGraphToNodes to include the types(List.of("TEXT")) property, aligning with changes in GraphReader.
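
The local versus Dataflow setup described in the changelog can be sketched as a small mode switch. This is an illustrative assumption about how such a test might choose its target, not the code in this pull request: `RunMode`, `TestTarget`, and the `itest.mode` property name are hypothetical, and pairing LOCAL with Beam's `DirectRunner` is a guess (the description only says LOCAL uses the Spanner emulator).

```java
import java.util.Locale;

// Hypothetical run modes mirroring the LOCAL / DATAFLOW split in the description.
enum RunMode {
    LOCAL, DATAFLOW;

    // Parses a mode string case-insensitively; falls back to LOCAL when unset.
    static RunMode parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return LOCAL;
        }
        return RunMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}

// Hypothetical description of where the integration test should run.
final class TestTarget {
    final String runner;        // Beam runner name (assumed pairing)
    final boolean usesEmulator; // true when a local Spanner emulator is expected

    private TestTarget(String runner, boolean usesEmulator) {
        this.runner = runner;
        this.usesEmulator = usesEmulator;
    }

    static TestTarget forMode(RunMode mode) {
        switch (mode) {
            case LOCAL:
                return new TestTarget("DirectRunner", true);
            case DATAFLOW:
                return new TestTarget("DataflowRunner", false);
            default:
                throw new IllegalArgumentException("Unsupported mode: " + mode);
        }
    }
}
```

A test setup method could then call `TestTarget.forMode(RunMode.parse(System.getProperty("itest.mode")))` and start the emulator container only when `usesEmulator` is true.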

gemini-code-assist bot left a comment

Code Review

This pull request introduces valuable unit and integration tests for the ingestion pipeline. The unit tests are well-implemented using a mock Spanner client. The integration tests correctly leverage Testcontainers for local execution. However, there are several critical issues in the CI configuration and the Dataflow execution path of the integration tests that need to be addressed. The Cloud Build configuration for running tests is incorrect, and the Dataflow test is not self-contained, relying on pre-existing data, which can lead to flaky builds. Additionally, there are opportunities to improve the robustness of the tests by avoiding hardcoded resource names and using more specific exception handling.
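
One way to act on the point about hardcoded resource names is to derive a unique name per test run. The sketch below is only an illustration under stated assumptions: `TestResourceNames` and `uniqueDatabaseId` are hypothetical helpers, and the 30-character limit and allowed character set reflect an assumption about Spanner database ID rules that should be checked against the Spanner documentation.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper for generating per-run resource names in tests.
final class TestResourceNames {
    // Assumed limit for a Spanner database ID; verify against the Spanner docs.
    static final int MAX_LEN = 30;

    // Builds "<sanitized prefix>_<8 hex chars>", lowercase, starting with a letter.
    static String uniqueDatabaseId(String prefix) {
        String suffix = UUID.randomUUID().toString().replace("-", "").substring(0, 8);
        // Keep only lowercase letters, digits, and underscores.
        String base = prefix.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9_]", "_");
        // Ensure the ID starts with a letter.
        if (base.isEmpty() || !Character.isLetter(base.charAt(0))) {
            base = "t" + base;
        }
        // Truncate the prefix so that prefix + "_" + suffix fits in MAX_LEN.
        int maxBase = MAX_LEN - suffix.length() - 1;
        if (base.length() > maxBase) {
            base = base.substring(0, maxBase);
        }
        return base + "_" + suffix;
    }
}
```

With a name like this, each run can create its own database and drop it afterwards, so the test does not depend on data left behind by earlier runs.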

vish-cs closed this Feb 10, 2026
vish-cs deleted the itest branch February 10, 2026 05:38
vish-cs restored the itest branch February 10, 2026 05:38
vish-cs reopened this Feb 10, 2026
vish-cs force-pushed the itest branch 10 times, most recently from 6301050 to 54b74b4 February 11, 2026 04:33
vish-cs requested a review from n-h-diaz February 11, 2026 04:44

vish-cs commented Feb 11, 2026

Natalie, any clue how to address the Codacy errors on the Spanner schema file? These seem to be false positives.

n-h-diaz commented

I think those are safe to ignore. Codacy checks don't always seem to understand Spanner schemas, especially for graph.

n-h-diaz left a comment

thanks for adding the tests!

vish-cs merged commit eb103da into datacommonsorg:master Feb 18, 2026
5 of 6 checks passed
vish-cs deleted the itest branch February 19, 2026 10:35