
docs: Add ml guides #7

Open
JPZ13 wants to merge 17 commits into main from add-ml-guides

Conversation

@JPZ13
Contributor

@JPZ13 JPZ13 commented Feb 19, 2026

This PR:

  • Adds more guides on common ML tasks

Summary by CodeRabbit

  • Documentation
    • Added comprehensive ML workflow guides: overview plus deep dives on data ingestion, feature engineering, model training, multi-step pipelines, and passing data between steps.
    • Includes architecture diagrams, workflow patterns, guidance on connectors (Databricks, Spark, Argo), data-passing options (parameters, artifacts, volumes), best practices, deployment patterns, and cluster recommendations.
    • Provides runnable examples and templates in Python (Hera) and YAML, with configuration and next-step links.

Signed-off-by: J.P. Zivalich <j.p.zivalich@gmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Warning

Rate limit exceeded

@JPZ13 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 13 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Walkthrough

Adds six new ML workflow guides (README, data ingestion, feature engineering, model training, multi-step pipelines, passing-data-between-steps) containing architecture diagrams, Hera (Python) and YAML examples, Databricks notebook snippets, connector patterns, best practices and cross‑references.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Guides Index: `docs/guides/README.md` | New index and overview of ML workflow guides, list of available guides, connector selection notes, usage tips for Python (Hera) and YAML, ML workflow patterns and next steps. |
| Data Ingestion: `docs/guides/data-ingestion.md` | New comprehensive data ingestion guide: CSV→Parquet, validation, incremental loads, multi-source patterns; includes Hera and YAML workflow examples, Databricks notebook snippets, parameterisation, cluster configs and best practices. |
| Feature Engineering: `docs/guides/feature-engineering.md` | New guide covering basic and multi-stage feature pipelines using Spark/Databricks, Hera and YAML templates, notebook examples for aggregations/window features, and performance/best-practice guidance. |
| Model Training: `docs/guides/model-training.md` | New model training guide with Databricks+MLflow examples, GPU/distributed training patterns, hyperparameter tuning templates in Hera and YAML, notebook examples and model management best practices. |
| Multi-step Orchestration: `docs/guides/multi-step-pipelines.md` | New end-to-end orchestration guide showing ingest→validate→features→train→evaluate→inference flows, conditional and parallel patterns, scheduled CronWorkflow examples, and hybrid Databricks+Spark patterns (Hera and YAML). |
| Data Passing Patterns: `docs/guides/passing-data-between-steps.md` | New guide on Parameters, Artifacts, Volumes for sharing data/metadata between steps; includes Databricks↔Spark examples, storage and NFS volume patterns, TemplateRef usage and best practices for contracts and failure handling. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • flaviuvadan

Poem

🐰 I hopped through markdown, neat and spry,
New guides to help your pipelines fly.
From ingest paths to training streams,
I stitched the docs with carrot dreams.
Hooray — the workflows gleam and sigh.

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'docs: Add ml guides' follows the Conventional Commits format with the 'docs' prefix and clearly describes the main change: adding ML-related documentation guides. |
| Docstring coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch add-ml-guides

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
docs/guides/passing-data-between-steps.md (1)

402-410: S3Artifact is imported but never used in the code snippet.

S3Artifact appears in the import list but is not referenced anywhere in the subsequent example. Removing it avoids confusing readers into thinking they need to configure an S3Artifact.

🛠️ Suggested fix
 from hera.workflows import (
     Workflow,
     Steps,
     Step,
     TemplateRef,
     script,
     Artifact,
-    S3Artifact
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/passing-data-between-steps.md` around lines 402 - 410, The import
list includes S3Artifact but it is unused in the example; remove S3Artifact from
the from hera.workflows import (...) statement (the tuple containing Workflow,
Steps, Step, TemplateRef, script, Artifact, S3Artifact) so the example only
imports used symbols and avoids confusing readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/data-ingestion.md`:
- Around line 384-393: The guard that checks whether an existing DataFrame
exists uses a boolean test on df_existing which raises ValueError for PySpark
DataFrames; change the condition to an identity check (e.g., use "if df_existing
is not None:") so the merge logic that calls
df_new.union(df_existing).dropDuplicates(["id"]) only runs when df_existing is
actually a DataFrame and keeps df_merged = df_new when df_existing is None.
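A minimal, plain-Python illustration of why the identity check matters. The `FakeDataFrame` class below is a hypothetical stand-in for `pyspark.sql.DataFrame`, which likewise raises rather than answering a truthiness test; the names and messages are illustrative, not from the PR.

```python
class FakeDataFrame:
    """Stand-in mimicking pyspark.sql.DataFrame's refusal of boolean conversion."""
    def __bool__(self):
        raise ValueError("The truth value of a DataFrame is ambiguous.")

df_existing = FakeDataFrame()

# Truthiness test: blows up on a real (non-empty or empty) DataFrame alike
try:
    path = "merge" if df_existing else "new-only"
except ValueError as exc:
    print(f"boolean test failed: {exc}")

# Identity test: behaves the same for DataFrames and for None
if df_existing is not None:
    print("merge path taken")   # safe place to call df_new.union(df_existing)...
else:
    print("keep df_new as-is")
```

The same two-branch shape carries over directly to the guide's incremental-load example once `FakeDataFrame` is a real Spark DataFrame.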

In `@docs/guides/feature-engineering.md`:
- Around line 169-172: The avg_days_between_purchases formula uses
transaction_count as the denominator but should use transaction_count - 1 (there
are N-1 gaps for N transactions) and must guard against counts <= 1 to avoid
dividing by zero or returning 0 for a single transaction; update the
customer_features.withColumn call that creates "avg_days_between_purchases"
(using columns last_transaction_date, first_transaction_date, and
transaction_count) to divide the datediff by (transaction_count - 1) and wrap
with a conditional (e.g., F.when or equivalent) to return null/None for
transaction_count <= 1.
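The gap-counting fix can be sketched in plain Python (a stand-in for the guide's `F.when`/`F.datediff` expression; function name and dates are illustrative):

```python
from datetime import date

def avg_days_between_purchases(first, last, n):
    """N purchases have N - 1 gaps; a single purchase has no gap at all."""
    if n <= 1:
        return None  # mirrors returning null via F.when in the PySpark version
    return (last - first).days / (n - 1)

# 3 purchases spanning 10 days -> two gaps, averaging 5 days
print(avg_days_between_purchases(date(2026, 1, 1), date(2026, 1, 11), 3))
# one purchase -> no meaningful average
print(avg_days_between_purchases(date(2026, 1, 1), date(2026, 1, 1), 1))
```

Dividing by `n` instead would understate the true spacing and silently return 0 for single-transaction customers.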

In `@docs/guides/model-training.md`:
- Around line 362-404: The YAML example currently shows only train-config-0 and
train-config-1 while the Hera/Python example loops over train-config-0,
train-config-1, and train-config-2; update the YAML under the "Train models in
parallel" section to either add the missing train-config-2 block (matching the
pattern used for train-config-0/1 and setting
args="{{workflow.parameters.training-data}},300,20" or appropriate values) or
explicitly add a comment/ellipsis note stating that only two configs are shown
for brevity and the third (train-config-2) exists in the Python example so
readers aren’t misled. Ensure the change references the same parameter names
(code-path, task-type, cluster-mode, new-cluster-*, args) to keep examples
consistent.

In `@docs/guides/passing-data-between-steps.md`:
- Line 1125: Fix the typo in the docs line that currently reads "Large files
that multiple steps needto read" by inserting a space so it reads "Large files
that multiple steps need to read"; locate the exact string in the
"passing-data-between-steps.md" content and update it accordingly.
- Around line 886-893: Replace the deprecated chart reference
`stable/nfs-server-provisioner` and the `helm install nfs-provisioner
stable/nfs-server-provisioner ...` block with the maintained NFS Ganesha chart
from `kubernetes-sigs`: add the `nfs-ganesha-server-and-external-provisioner`
Helm repo (use the repository URL from
https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/tree/master/deploy/helm)
and install the chart (use the chart name from that repo and the existing
release name `nfs-provisioner` and options `persistence.enabled=true`,
`persistence.size=200Gi`, `storageClass.name=nfs`), updating the documentation
snippet to show those new helm repo add and helm install commands that reference
the `nfs-ganesha-server-and-external-provisioner` chart instead of
`stable/nfs-server-provisioner`.
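Putting the suggested replacement together, the updated snippet would look roughly like this (repo URL and chart name taken from the kubernetes-sigs project referenced above; the `--set` values are the ones already in the guide):

```sh
# Maintained replacement for the deprecated stable/nfs-server-provisioner chart
helm repo add nfs-ganesha-server-and-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install nfs-provisioner \
  nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner \
  --set persistence.enabled=true \
  --set persistence.size=200Gi \
  --set storageClass.name=nfs
```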

In `@docs/guides/README.md`:
- Line 35: Fix the typo in the README header sentence that currently reads
"Score large datasets with traine models in production." by replacing "traine"
with "trained" so the sentence reads "Score large datasets with trained models
in production."; update the docs/guides/README.md entry containing that sentence
to correct the spelling.

---

Nitpick comments:
In `@docs/guides/passing-data-between-steps.md`:
- Around line 402-410: The import list includes S3Artifact but it is unused in
the example; remove S3Artifact from the from hera.workflows import (...)
statement (the tuple containing Workflow, Steps, Step, TemplateRef, script,
Artifact, S3Artifact) so the example only imports used symbols and avoids
confusing readers.

JPZ13 and others added 4 commits February 19, 2026 10:50
Signed-off-by: J.P. Zivalich <j.p.zivalich@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/data-ingestion.md`:
- Around line 396-403: The write currently uses
df_merged.write.mode("overwrite").partitionBy("date").parquet(target_path) which
with Spark's default static partition overwrite will delete all historical
partitions; change to dynamic partition overwrite by either setting
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") before the
write or adding the option on the writer (e.g.,
df_merged.write.mode("overwrite").option("partitionOverwriteMode","dynamic").partitionBy("date").parquet(target_path))
so only the partition(s) for the current load_date are replaced.
- Around line 120-127: The notebook reads named widgets via
dbutils.widgets.get("source_path"), dbutils.widgets.get("target_path") and
dbutils.widgets.get("load_date") but the workflow currently passes positional
args (args) so those widgets remain empty; fix by aligning the contract: update
the workflow to send key/value notebook-params for source_path, target_path, and
load_date instead of using args, or alternatively update the notebook code
around dbutils.widgets.get(...) to parse the positional args string the workflow
provides and populate the expected variables (e.g., parse the comma-separated
args and assign to source_path/target_path/load_date before printing). Ensure
changes reference dbutils.widgets.get, source_path, target_path, load_date,
args, and notebook-params so the notebook and workflow agree.
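The positional-parse option can be sketched without Databricks at hand. Here `widgets` is a plain dict standing in for `dbutils.widgets`, and the paths and date are invented for illustration:

```python
# Stand-in for dbutils.widgets: the workflow delivers one comma-separated string
widgets = {"args": "s3://bucket/raw,s3://bucket/curated,2026-02-19"}

args_string = widgets.get("args", "")
args = args_string.split(",") if args_string else []

# Map the positional args onto the names the notebook code expects
source_path, target_path, load_date = args
print(source_path, target_path, load_date)
```

The `if args_string` guard matters because `"".split(",")` yields `[""]`, not `[]`, so an empty widget would otherwise produce one bogus argument.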

---

Duplicate comments:
In `@docs/guides/data-ingestion.md`:
- Around line 384-386: The bare except around the existing-data load should be
replaced with a non-BaseException handler and should surface the error details
instead of swallowing them; update the except: that sets df_existing = None and
prints "No existing data found" to catch a specific/safer exception (e.g.,
FileNotFoundError or Exception) as e, set df_existing = None, and include the
exception text in the log/print so the failure context is visible; locate the
try/except that assigns df_existing and modify the except clause accordingly (do
not catch KeyboardInterrupt/SystemExit).
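A small sketch of the narrowed handler (the helper name and error message are made up for the example; in the guide the reader would be `spark.read.parquet(...)`):

```python
def load_existing(read):
    """Call the reader, surfacing the error details instead of swallowing them."""
    try:
        return read()
    except Exception as e:  # deliberately not BaseException:
        # KeyboardInterrupt / SystemExit still propagate
        print(f"No existing data found ({e}); continuing with the new load only")
        return None

def boom():
    raise FileNotFoundError("s3://bucket/curated does not exist yet")

df_existing = load_existing(boom)
assert df_existing is None
```

Compared with a bare `except:`, this keeps the failure context in the logs and never traps interpreter-shutdown signals.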

JPZ13 and others added 4 commits February 19, 2026 11:46
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: J.P. Zivalich <j.p.zivalich@gmail.com>
@JPZ13 JPZ13 requested review from Joibel and flaviuvadan February 19, 2026 10:52
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
docs/guides/passing-data-between-steps.md (1)

1109-1122: Duplicated bullet list in Best Practices section.

The four bullets under "### 1. Use Parameters for Metadata" (Lines 1110–1113) are repeated verbatim under "Parameters (< 1MB):" (Lines 1117–1121). Consider keeping only the richer "Choose the Right Data Passing Method" table (Section 2) and removing the redundant Section 1 list, or differentiating their content.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/passing-data-between-steps.md` around lines 1109 - 1122, Remove
the duplicated bullet list under the "### 1. Use Parameters for Metadata"
heading that repeats the same four items found under "**Parameters** (< 1MB):";
either delete the first list entirely or replace it with a brief summary that
references the detailed "Choose the Right Data Passing Method" section, ensuring
the only detailed list remains under the "**Parameters** (< 1MB):" subsection to
avoid redundancy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/feature-engineering.md`:
- Around line 393-398: The "Next Steps" section references a missing file
batch-inference.md which will create a broken link; either add a stub file named
batch-inference.md to the docs or remove/comment the "- [Batch
Inference](batch-inference.md) - Use features for predictions" list item in the
"Next Steps" block so the link is not published broken. Ensure you update the
"Next Steps" list where the link text "Batch Inference" appears or create a
minimal batch-inference.md placeholder that explains the upcoming content.
- Around line 169-177: The withColumn chain for customer_features is broken by
an extra closing parenthesis: remove the orphaned ')' that closes the first
withColumn so the two calls remain chained; ensure the expression that defines
"avg_days_between_purchases" (using F.datediff and F.when on columns
last_transaction_date, first_transaction_date, transaction_count) is closed once
and immediately followed by .withColumn for "days_since_last_purchase" (which
uses F.datediff and F.current_date()) so the chain is customer_features =
customer_features.withColumn(...).withColumn(...).

In `@docs/guides/model-training.md`:
- Line 491: The docs contain a dead link to batch-inference.md (anchor text
"Batch Inference") which isn’t present in this PR; either add the missing
batch-inference.md guide (or a placeholder with proper frontmatter) or
remove/update the link in model-training.md to point to an existing page (or to
feature-engineering.md if that’s intended). Locate the link text "Batch
Inference" in model-training.md and ensure the referenced path matches an actual
file committed in this PR to avoid broken GitBook links.

In `@docs/guides/passing-data-between-steps.md`:
- Line 1179: The markdown contains a dead link "[Batch
Inference](batch-inference.md) - Use outputs for predictions" in
passing-data-between-steps.md; either add the missing target file
(batch-inference.md) to the PR with the intended content or update this link to
point to an existing document (e.g., the correct filename used in this repo) or
remove the link and replace with plain text; ensure the change touches the line
with the link text so CI/Book build no longer reports a broken link.
- Around line 402-410: The import S3Artifact is unused in the example (only
Artifact is instantiated at lines referencing Artifact usage); remove S3Artifact
from the import list or demonstrate its usage (e.g., replace or add an
S3Artifact instantiation where Artifact is used) so the example and imports
match; update the import tuple that currently includes S3Artifact (alongside
Workflow, Steps, Step, TemplateRef, script, Artifact) accordingly and, if
choosing to keep S3Artifact, add a short inline comment showing why it’s used.

---

Duplicate comments:
In `@docs/guides/model-training.md`:
- Around line 362-424: The YAML now misses the third hyperparameter job in the
parallel training list; add the missing job entry named train-config-2 (matching
the databricks-connector run-job template) with the same parameter keys as the
other jobs and set its args parameter to
"{{workflow.parameters.training-data}},300,20" so the YAML matches the
Python/Hera example (ensure the templateRef name: databricks-connector and
template: run-job and parameters like code-path, task-type, cluster-mode,
new-cluster-*, and args are present and identical in structure).
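A sketch of the missing entry, mirroring the structure described for `train-config-0`/`train-config-1` (field layout is an assumption based on the review text, not copied from the guide):

```yaml
# Hypothetical third parallel job so the YAML matches the three-config Hera loop
- name: train-config-2
  templateRef:
    name: databricks-connector
    template: run-job
  arguments:
    parameters:
      - name: args
        value: "{{workflow.parameters.training-data}},300,20"
      # code-path, task-type, cluster-mode and new-cluster-* identical
      # in structure to the other two configs
```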

In `@docs/guides/passing-data-between-steps.md`:
- Around line 886-893: Update the Helm snippet in passing-data-between-steps.md
to use the maintained NFS chart by adding the helm repo URL
"nfs-ganesha-server-and-external-provisioner" and installing "nfs-provisioner"
from "nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner" with
persistence.enabled=true, persistence.size=200Gi and storageClass.name=nfs;
specifically replace the old helm repo/add and helm install lines with the new
repo and install commands referencing the repo name and chart (helm repo add
nfs-ganesha-server-and-external-provisioner, helm install nfs-provisioner
nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner and the --set
flags).

---

Nitpick comments:
In `@docs/guides/passing-data-between-steps.md`:
- Around line 1109-1122: Remove the duplicated bullet list under the "### 1. Use
Parameters for Metadata" heading that repeats the same four items found under
"**Parameters** (< 1MB):"; either delete the first list entirely or replace it
with a brief summary that references the detailed "Choose the Right Data Passing
Method" section, ensuring the only detailed list remains under the
"**Parameters** (< 1MB):" subsection to avoid redundancy.

Signed-off-by: J.P. Zivalich <j.p.zivalich@gmail.com>
args_string = dbutils.widgets.get("args")

# Parse comma-separated arguments
args = args_string.split(",") if args_string else []

This pattern is correct

- name: max-workers
value: "16"
- name: args
value: "{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}"

This is a comma separated list, and the pattern marked as "correct" in data-ingestion needs copying here - it won't work as is

Contributor Author

@JPZ13 JPZ13 Feb 20, 2026


So, just brackets around it, quotation before the bracket, or no quotations?

  1. ["{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}"]
  2. "[{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}]", or
  3. [{{workflow.parameters.input-path}},{{workflow.parameters.output-path}},{{workflow.parameters.lookback-days}}]


Neither - this isn't my area of expertise really. This example doesn't work as it is - I think you need to copy the pattern from data-ingestion by changing the python below to expect dbutils.widgets.text("args"...) like that one does and then split the values. I didn't write it, so I'm not sure which of the patterns you wanted to settle on, but this one doesn't actually work.

- name: new-cluster-num-workers
value: "4"
- name: args
value: "{{workflow.parameters.training-data}},{{workflow.parameters.model-name}},{{workflow.parameters.experiment-name}}"

Again a string list isn't split like it needs to be

Contributor Author


Same question as above

@JPZ13
Contributor Author

JPZ13 commented Feb 20, 2026

@Joibel thanks for the comments. Is the passing data via volumes info all accurate and looking good?

JPZ13 and others added 7 commits February 20, 2026 13:54
Co-authored-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Alan Clucas <alan@clucas.org>
Added a code block for better formatting in the model training guide.
@Joibel

Joibel commented Feb 20, 2026

> @Joibel thanks for the comments. Is the passing data via volumes info all accurate and looking good?

The words look reasonable. I haven't run any of those examples - if you haven't I would put money on them not working.


### CSV to Parquet Conversion

{% tabs %}
Collaborator


nit - these tags seem to show up in the rendered markdown

[screenshot: the {% tabs %} tags rendered literally in the Markdown preview]

Contributor Author


Yeah - it's the Gitbook formatting. In Gitbook, it'll hopefully display as tabs so the user can click the tab they want to show either Python or YAML

"run-name": "validate-{{workflow.parameters.date}}",
"args": "{{workflow.parameters.date}}",
}
)
Collaborator


For the purposes of the catalog of connectors it would be amazing to have a Step declared in the codebase that is reusable and parameterizable. For example, smth like:

# class definition
class DatabricksConnector:
    def validate_data(
        self,
        code_path: str,
        task_type: str,
        cluster_mode: str = "serverless",
        run_name: str = "validate-{{workflow.parameters.date}}",
        args: str = "{{workflow.parameters.date}}",
    ) -> Step: ...

# usage within a workflow
with Workflow(...) as w:
    with Steps() as s:
        DatabricksConnector.validate_data(
            code_path="/Users/data-team/02-validate",
            task_type="notebook",
            ...,
        )

this way clients focus on the logic of stitching steps together and parameterizing based on their needs. Things like {{workflow.parameters.date}} can be default kwargs, etc

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe another PR

Contributor Author


I like that idea a lot. Let's chat about it and ship in another PR

Contributor Author


Brian and I hopped on with a customer today, and they suggested we should really lean into this bet

Collaborator

@flaviuvadan flaviuvadan left a comment


looks good, I think the Markdown rendering of tabbed tables is messed up so it warrants a change, and left a suggestion for a future PR for extra abstractions that still confer necessary user control

@JPZ13
Contributor Author

JPZ13 commented Feb 25, 2026

> The words look reasonable. I haven't run any of those examples - if you haven't I would put money on them not working.

Hey @Joibel - right now I'm just trying to get a pulse from our customers on whether these are the guides that they want to see before spending the additional time to ensure they're working well. We got confirmation from one of them today that the guides so far have been really helpful showing how to use Argo/Hera with Databricks in an ML context and unblocked them using input parameters with the Databricks WorkflowTemplate

Next step after merging will be to check all the examples, ideally via a CI pipeline, and then from there we can publish the guides on gitbook

@JPZ13 JPZ13 requested a review from Joibel February 25, 2026 12:37