Conversation
Signed-off-by: J.P. Zivalich <j.p.zivalich@gmail.com>
**Walkthrough**

Adds six new ML workflow guides (README, data ingestion, feature engineering, model training, multi-step pipelines, passing-data-between-steps) containing architecture diagrams, Hera (Python) and YAML examples, Databricks notebook snippets, connector patterns, best practices and cross-references.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🧹 Nitpick comments (1)
docs/guides/passing-data-between-steps.md (1)
**402-410**: `S3Artifact` is imported but never used in the code snippet.

`S3Artifact` appears in the import list but is not referenced anywhere in the subsequent example. Removing it avoids confusing readers into thinking they need to configure an `S3Artifact`.

🛠️ Suggested fix

```diff
 from hera.workflows import (
     Workflow,
     Steps,
     Step,
     TemplateRef,
     script,
     Artifact,
-    S3Artifact,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/guides/passing-data-between-steps.md` around lines 402 - 410, The import list includes S3Artifact but it is unused in the example; remove S3Artifact from the from hera.workflows import (...) statement (the tuple containing Workflow, Steps, Step, TemplateRef, script, Artifact, S3Artifact) so the example only imports used symbols and avoids confusing readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/data-ingestion.md`:
- Around line 384-393: The guard that checks whether an existing DataFrame
exists uses a boolean test on df_existing which raises ValueError for PySpark
DataFrames; change the condition to an identity check (e.g., use "if df_existing
is not None:") so the merge logic that calls
df_new.union(df_existing).dropDuplicates(["id"]) only runs when df_existing is
actually a DataFrame and keeps df_merged = df_new when df_existing is None.
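The pitfall this comment describes can be reproduced without Spark. Like PySpark's `DataFrame`, the hypothetical `FakeDataFrame` below (a stand-in, not from the guide) raises in a boolean context, so only the identity check is safe:

```python
class FakeDataFrame:
    """Stand-in mimicking pyspark.sql.DataFrame, whose __bool__ raises."""
    def __bool__(self):
        raise ValueError("Cannot convert DataFrame to bool")

df_existing = FakeDataFrame()

# Buggy guard: `if df_existing:` invokes __bool__ and raises ValueError.
raised = False
try:
    if df_existing:
        pass
except ValueError:
    raised = True

# Correct guard: an identity check never invokes __bool__.
merge_needed = df_existing is not None
print(raised, merge_needed)
```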
In `@docs/guides/feature-engineering.md`:
- Around line 169-172: The avg_days_between_purchases formula uses
transaction_count as the denominator but should use transaction_count - 1 (there
are N-1 gaps for N transactions) and must guard against counts <= 1 to avoid
dividing by zero or returning 0 for a single transaction; update the
customer_features.withColumn call that creates "avg_days_between_purchases"
(using columns last_transaction_date, first_transaction_date, and
transaction_count) to divide the datediff by (transaction_count - 1) and wrap
with a conditional (e.g., F.when or equivalent) to return null/None for
transaction_count <= 1.
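The N-1 gap logic is plain arithmetic and can be sketched outside Spark; the helper below is hypothetical and mirrors the `F.when` guard the comment asks for (returning `None` instead of a null column):

```python
from datetime import date

def avg_days_between_purchases(first_txn: date, last_txn: date, txn_count: int):
    # N transactions bound N - 1 gaps; with 0 or 1 transactions there is no
    # gap to average, so return None (the analogue of F.when -> null).
    if txn_count <= 1:
        return None
    return (last_txn - first_txn).days / (txn_count - 1)

# 3 purchases over 10 days -> two gaps averaging 5 days
print(avg_days_between_purchases(date(2024, 1, 1), date(2024, 1, 11), 3))  # 5.0
```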
In `@docs/guides/model-training.md`:
- Around line 362-404: The YAML example currently shows only train-config-0 and
train-config-1 while the Hera/Python example loops over train-config-0,
train-config-1, and train-config-2; update the YAML under the "Train models in
parallel" section to either add the missing train-config-2 block (matching the
pattern used for train-config-0/1 and setting
args="{{workflow.parameters.training-data}},300,20" or appropriate values) or
explicitly add a comment/ellipsis note stating that only two configs are shown
for brevity and the third (train-config-2) exists in the Python example so
readers aren’t misled. Ensure the change references the same parameter names
(code-path, task-type, cluster-mode, new-cluster-*, args) to keep examples
consistent.
In `@docs/guides/passing-data-between-steps.md`:
- Line 1125: Fix the typo in the docs line that currently reads "Large files
that multiple steps needto read" by inserting a space so it reads "Large files
that multiple steps need to read"; locate the exact string in the
"passing-data-between-steps.md" content and update it accordingly.
- Around line 886-893: Replace the deprecated chart reference
`stable/nfs-server-provisioner` and the `helm install nfs-provisioner
stable/nfs-server-provisioner ...` block with the maintained NFS Ganesha chart
from `kubernetes-sigs`: add the `nfs-ganesha-server-and-external-provisioner`
Helm repo (use the repository URL from
https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/tree/master/deploy/helm)
and install the chart (use the chart name from that repo and the existing
release name `nfs-provisioner` and options `persistence.enabled=true`,
`persistence.size=200Gi`, `storageClass.name=nfs`), updating the documentation
snippet to show those new helm repo add and helm install commands that reference
the `nfs-ganesha-server-and-external-provisioner` chart instead of
`stable/nfs-server-provisioner`.
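Putting the fix together, the replacement snippet would look roughly like the commands below. The repo URL is an assumption taken from the kubernetes-sigs project page referenced above; verify it against that repo's `deploy/helm` instructions before publishing.

```
# Replace the deprecated stable/ chart with the maintained kubernetes-sigs one.
helm repo add nfs-ganesha-server-and-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/

helm install nfs-provisioner \
  nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner \
  --set persistence.enabled=true \
  --set persistence.size=200Gi \
  --set storageClass.name=nfs
```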
In `@docs/guides/README.md`:
- Line 35: Fix the typo in the README header sentence that currently reads
"Score large datasets with traine models in production." by replacing "traine"
with "trained" so the sentence reads "Score large datasets with trained models
in production."; update the docs/guides/README.md entry containing that sentence
to correct the spelling.
---
Nitpick comments:
In `@docs/guides/passing-data-between-steps.md`:
- Around line 402-410: The import list includes S3Artifact but it is unused in
the example; remove S3Artifact from the from hera.workflows import (...)
statement (the tuple containing Workflow, Steps, Step, TemplateRef, script,
Artifact, S3Artifact) so the example only imports used symbols and avoids
confusing readers.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/data-ingestion.md`:
- Around line 396-403: The write currently uses
df_merged.write.mode("overwrite").partitionBy("date").parquet(target_path) which
with Spark's default static partition overwrite will delete all historical
partitions; change to dynamic partition overwrite by either setting
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") before the
write or adding the option on the writer (e.g.,
df_merged.write.mode("overwrite").option("partitionOverwriteMode","dynamic").partitionBy("date").parquet(target_path))
so only the partition(s) for the current load_date are replaced.
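The difference between static and dynamic partition overwrite can be modeled in plain Python; this is a conceptual sketch of the semantics, not Spark code:

```python
# Partitions modeled as dict keys (date -> rows). Static overwrite replaces
# the whole table directory; dynamic overwrite replaces only the partitions
# present in the incoming data.
existing = {"2024-01-01": ["a"], "2024-01-02": ["b"]}
incoming = {"2024-01-02": ["b2"]}

static = dict(incoming)            # history for 2024-01-01 is lost
dynamic = {**existing, **incoming} # only the touched partition is replaced
print(static, dynamic)
```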
- Around line 120-127: The notebook reads named widgets via
dbutils.widgets.get("source_path"), dbutils.widgets.get("target_path") and
dbutils.widgets.get("load_date") but the workflow currently passes positional
args (args) so those widgets remain empty; fix by aligning the contract: update
the workflow to send key/value notebook-params for source_path, target_path, and
load_date instead of using args, or alternatively update the notebook code
around dbutils.widgets.get(...) to parse the positional args string the workflow
provides and populate the expected variables (e.g., parse the comma-separated
args and assign to source_path/target_path/load_date before printing). Ensure
changes reference dbutils.widgets.get, source_path, target_path, load_date,
args, and notebook-params so the notebook and workflow agree.
---
Duplicate comments:
In `@docs/guides/data-ingestion.md`:
- Around line 384-386: The bare except around the existing-data load should be
replaced with a non-BaseException handler and should surface the error details
instead of swallowing them; update the except: that sets df_existing = None and
prints "No existing data found" to catch a specific/safer exception (e.g.,
FileNotFoundError or Exception) as e, set df_existing = None, and include the
exception text in the log/print so the failure context is visible; locate the
try/except that assigns df_existing and modify the except clause accordingly (do
not catch KeyboardInterrupt/SystemExit).
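A minimal sketch of the narrowed handler, using a hypothetical loader in place of `spark.read.parquet` so the failure path is reproducible:

```python
def load_existing(path: str):
    # Hypothetical stand-in for spark.read.parquet(path) on a first run.
    raise FileNotFoundError(f"Path does not exist: {path}")

try:
    df_existing = load_existing("s3://bucket/events/")
except Exception as e:  # not bare `except:`, so KeyboardInterrupt/SystemExit pass through
    df_existing = None
    print(f"No existing data found: {e}")
```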
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 5
🧹 Nitpick comments (1)
docs/guides/passing-data-between-steps.md (1)
**1109-1122**: Duplicated bullet list in Best Practices section. The four bullets under "### 1. Use Parameters for Metadata" (Lines 1110–1113) are repeated verbatim under "Parameters (< 1MB):" (Lines 1117–1121). Consider keeping only the richer "Choose the Right Data Passing Method" table (Section 2) and removing the redundant Section 1 list, or differentiating their content.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/guides/passing-data-between-steps.md` around lines 1109 - 1122, Remove the duplicated bullet list under the "### 1. Use Parameters for Metadata" heading that repeats the same four items found under "**Parameters** (< 1MB):"; either delete the first list entirely or replace it with a brief summary that references the detailed "Choose the Right Data Passing Method" section, ensuring the only detailed list remains under the "**Parameters** (< 1MB):" subsection to avoid redundancy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/feature-engineering.md`:
- Around line 393-398: The "Next Steps" section references a missing file
batch-inference.md which will create a broken link; either add a stub file named
batch-inference.md to the docs or remove/comment the "- [Batch
Inference](batch-inference.md) - Use features for predictions" list item in the
"Next Steps" block so the link is not published broken. Ensure you update the
"Next Steps" list where the link text "Batch Inference" appears or create a
minimal batch-inference.md placeholder that explains the upcoming content.
- Around line 169-177: The withColumn chain for customer_features is broken by
an extra closing parenthesis: remove the orphaned ')' that closes the first
withColumn so the two calls remain chained; ensure the expression that defines
"avg_days_between_purchases" (using F.datediff and F.when on columns
last_transaction_date, first_transaction_date, transaction_count) is closed once
and immediately followed by .withColumn for "days_since_last_purchase" (which
uses F.datediff and F.current_date()) so the chain is customer_features =
customer_features.withColumn(...).withColumn(...).
In `@docs/guides/model-training.md`:
- Line 491: The docs contain a dead link to batch-inference.md (anchor text
"Batch Inference") which isn’t present in this PR; either add the missing
batch-inference.md guide (or a placeholder with proper frontmatter) or
remove/update the link in model-training.md to point to an existing page (or to
feature-engineering.md if that’s intended). Locate the link text "Batch
Inference" in model-training.md and ensure the referenced path matches an actual
file committed in this PR to avoid broken GitBook links.
In `@docs/guides/passing-data-between-steps.md`:
- Line 1179: The markdown contains a dead link "[Batch
Inference](batch-inference.md) - Use outputs for predictions" in
passing-data-between-steps.md; either add the missing target file
(batch-inference.md) to the PR with the intended content or update this link to
point to an existing document (e.g., the correct filename used in this repo) or
remove the link and replace with plain text; ensure the change touches the line
with the link text so CI/Book build no longer reports a broken link.
- Around line 402-410: The import S3Artifact is unused in the example (only
Artifact is instantiated at lines referencing Artifact usage); remove S3Artifact
from the import list or demonstrate its usage (e.g., replace or add an
S3Artifact instantiation where Artifact is used) so the example and imports
match; update the import tuple that currently includes S3Artifact (alongside
Workflow, Steps, Step, TemplateRef, script, Artifact) accordingly and, if
choosing to keep S3Artifact, add a short inline comment showing why it’s used.
---
Duplicate comments:
In `@docs/guides/model-training.md`:
- Around line 362-424: The YAML now misses the third hyperparameter job in the
parallel training list; add the missing job entry named train-config-2 (matching
the databricks-connector run-job template) with the same parameter keys as the
other jobs and set its args parameter to
"{{workflow.parameters.training-data}},300,20" so the YAML matches the
Python/Hera example (ensure the templateRef name: databricks-connector and
template: run-job and parameters like code-path, task-type, cluster-mode,
new-cluster-*, and args are present and identical in structure).
In `@docs/guides/passing-data-between-steps.md`:
- Around line 886-893: Update the Helm snippet in passing-data-between-steps.md
to use the maintained NFS chart by adding the helm repo URL
"nfs-ganesha-server-and-external-provisioner" and installing "nfs-provisioner"
from "nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner" with
persistence.enabled=true, persistence.size=200Gi and storageClass.name=nfs;
specifically replace the old helm repo/add and helm install lines with the new
repo and install commands referencing the repo name and chart (helm repo add
nfs-ganesha-server-and-external-provisioner, helm install nfs-provisioner
nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner and the --set
flags).
---
Nitpick comments:
In `@docs/guides/passing-data-between-steps.md`:
- Around line 1109-1122: Remove the duplicated bullet list under the "### 1. Use
Parameters for Metadata" heading that repeats the same four items found under
"**Parameters** (< 1MB):"; either delete the first list entirely or replace it
with a brief summary that references the detailed "Choose the Right Data Passing
Method" section, ensuring the only detailed list remains under the
"**Parameters** (< 1MB):" subsection to avoid redundancy.
```python
args_string = dbutils.widgets.get("args")

# Parse comma-separated arguments
args = args_string.split(",") if args_string else []
```
```yaml
- name: max-workers
  value: "16"
- name: args
  value: "{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}"
```
This is a comma separated list, and the pattern marked as "correct" in data-ingestion needs copying here - it won't work as is
So, just brackets around it, quotation before the bracket, or no quotations?
- `["{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}"]`,
- `"[{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}]"`, or
- `[{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}]`
Neither - this isn't my area of expertise really. This example doesn't work as it is - I think you need to copy the pattern from data-ingestion by changing the python below to expect dbutils.widgets.text("args"...) like that one does and then split the values. I didn't write it, so I'm not sure which of the patterns you wanted to settle on, but this one doesn't actually work.
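For reference, the data-ingestion pattern being pointed to boils down to reading a single `args` widget and splitting it. This is a rough sketch with assumed example values, not the guide's exact code:

```python
def parse_args(args_string: str) -> list:
    # Empty widget value -> no args; otherwise split the comma-separated string
    return args_string.split(",") if args_string else []

# e.g. the workflow passes one comma-separated string via the args parameter:
raw = "dbfs:/input,dbfs:/output,30"
input_path, output_path, lookback_days = parse_args(raw)
print(input_path, output_path, lookback_days)
```

Note the split values are strings, so numeric parameters like `lookback_days` still need an explicit cast.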
```yaml
- name: new-cluster-num-workers
  value: "4"
- name: args
  value: "{{workflow.parameters.training-data}},{{workflow.parameters.model-name}},{{workflow.parameters.experiment-name}}"
```
Again a string list isn't split like it needs to be
Same question as above
@Joibel thanks for the comments. Is the passing data via volumes info all accurate and looking good?
Co-authored-by: Alan Clucas <alan@clucas.org>
Added a code block for better formatting in the model training guide.
The words look reasonable. I haven't run any of those examples - if you haven't I would put money on them not working.
```markdown
### CSV to Parquet Conversion

{% tabs %}
```
Yeah - it's the Gitbook formatting. In Gitbook, it'll hopefully display as tabs so the user can click the tab they want to show either Python or YAML
```python
        "run-name": "validate-{{workflow.parameters.date}}",
        "args": "{{workflow.parameters.date}}",
    }
)
```
For the purposes of the catalog of connectors it would be amazing to have a Step declared in the codebase that is reusable and parameterizable. For example, smth like:

```python
# class definition
class DatabricksConnector:
    def validate_data(
        self,
        code_path: str,
        task_type: str,
        cluster_mode: str = Serverless,
        run_name: str = "validate-{{workflow.parameters.date}}",
        args: str = "{{workflow.parameters.date}}",
    ) -> Step: ...

# usage within a workflow
with Workflow(...) as w:
    with Steps() as s:
        DatabricksConnector.validate_data(
            code_path="/Users/data-team/02-validate",
            task_type="notebook",
            ...,
        )
```

this way clients focus on the logic of stitching steps together and parameterizing based on their needs. Things like `{{workflow.parameters.date}}` can be default kwargs, etc
I like that idea a lot. Let's chat about it and ship in another PR
Brian and I hopped on with a customer today, and they suggested we should really lean into this bet
flaviuvadan left a comment:
looks good, I think the Markdown rendering of tabbed tables is messed up so it warrants a change, and left a suggestion for a future PR for extra abstractions that still confer necessary user control
Hey @Joibel - right now I'm just trying to get a pulse from our customers on whether these are the guides that they want to see before spending the additional time to ensure they're working well. We got confirmation from one of them today that the guides so far have been really helpful showing how to use Argo/Hera with Databricks in an ML context and unblocked them using input parameters with the Databricks WorkflowTemplate. Next step after merging will be to check all the examples, ideally via a CI pipeline, and then from there we can publish the guides on GitBook.

This PR:
Summary by CodeRabbit