Force-pushed `ea22e42` to `70d2084`
Yeah it's okay to be repetitive at this point - troubleshooting docs are always a good idea at this stage. 👌🏽
> **1. Ray dashboard – see what’s actually running**
>
> - **Local Ray:** After starting a local cluster, Ray usually prints the dashboard URL (e.g. `http://127.0.0.1:8265`). Open that in a browser. If you didn’t capture it, the dashboard is typically on port **8265** on the host where Ray was started.
> - **KubeRay (Kubernetes):** Port-forward the dashboard from the Ray head pod, then open it locally:
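The quoted doc leaves the actual port-forward command implicit. A sketch, assuming a standard KubeRay deployment where the dashboard listens on 8265 on the head node (`NAMESPACE` and `HEAD_POD_NAME` are placeholders you'd substitute):

```shell
# Find the Ray head pod (KubeRay labels head nodes with ray.io/node-type=head).
kubectl get pods -n NAMESPACE -l ray.io/node-type=head

# Forward the dashboard port from the head pod to localhost,
# then open http://127.0.0.1:8265 in a browser.
kubectl port-forward -n NAMESPACE pod/HEAD_POD_NAME 8265:8265
```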
There is a case here where k8s / KubeRay / the cluster is configured to use GPUs but GPUs aren't being provisioned by the CSP. This one happens the most to our tests, actually.
ah you mean, sometimes that causes jobs to pass admission but hang at low progress?
how can we tell that's happening, and/or fix it if it is?
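One way to spot that case from a Python shell: compare what the cluster claims to have against what ever actually joined. `ray.cluster_resources()` and `ray.available_resources()` are real Ray APIs that return resource dicts; the diagnosis helper below is just my sketch of how to read them, not a Geneva API:

```python
def diagnose_gpus(cluster_resources: dict, available_resources: dict) -> str:
    """Hypothetical helper: interpret Ray resource dicts to spot the
    'GPUs configured but never provisioned' case.

    Pass the dicts returned by ray.cluster_resources() and
    ray.available_resources().
    """
    total = cluster_resources.get("GPU", 0)
    free = available_resources.get("GPU", 0)
    if total == 0:
        # The cluster spec asks for GPUs but no GPU node ever joined --
        # the job can pass admission yet hang at low progress.
        return "no GPU nodes joined the cluster; check cloud-provider provisioning"
    if free == 0:
        return "GPUs exist but are all busy; the job should make progress eventually"
    return f"{free}/{total} GPUs free"
```

If the "GPU" key never appears in `ray.cluster_resources()` at all, the CSP never delivered the nodes, and the fix is on the provisioning side (quotas, instance availability), not in the job.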
`docs/geneva/jobs/troubleshooting.mdx` (outdated)
> **2. Memory pressure**
>
> If nodes are full or workers are being OOM-killed, the job can stall or fail. In the dashboard, check node memory usage; in Kubernetes, check pod restarts (`kubectl get pods -n NAMESPACE`) and pod logs for OOM. Reduce `concurrency` or the UDF’s memory request so more tasks fit on the cluster, or add nodes.
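The "so more tasks fit" arithmetic is worth making concrete. A back-of-envelope sketch (the numbers and the helper are illustrative, not a Geneva or Ray API):

```python
def tasks_that_fit(node_memory_bytes: int, task_memory_bytes: int) -> int:
    """How many tasks of a given memory request fit on one node.
    Illustrative arithmetic only, not a Geneva or Ray API."""
    return node_memory_bytes // task_memory_bytes

# A 16 GiB node with 4 GiB per-task requests runs at most 4 tasks at once;
# if `concurrency` asks for more, the extras queue (or the job stalls if
# nothing can ever be scheduled).
print(tasks_that_fit(16 * 2**30, 4 * 2**30))  # 4

# Note the unit trap: the same "16 GB" node measured decimally holds
# only 3 of those 4 GiB tasks.
print(tasks_that_fit(16 * 10**9, 4 * 2**30))  # 3
```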
There was a case where I changed something from GB to GiB and got into this place where admission control passed but it was unable to schedule. Basically, the admission control isn't perfect and it got stuck in a funny place.
Suggest lowering memory requests, or using GB (10^9) vs GiB (2^30).
hmm
I mean the GB vs GiB has got to be a case where the extra ~73 MB per GiB mattered, right? So yeah I think "slightly lower your memory requirements" is our best recommendation here.
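That 73 MB figure is just 2^30 − 10^9. A quick check of the arithmetic:

```python
GB = 10**9   # decimal gigabyte
GiB = 2**30  # binary gibibyte

# Per-unit difference: about 73.7 MB for every GiB requested.
diff = GiB - GB
print(diff)                 # 73741824
print(round(diff / 10**6))  # 74 (MB)

# Requesting "16GiB" instead of "16GB" asks for ~1.18 GB more memory --
# potentially enough to tip a node from schedulable to stuck.
print((16 * GiB - 16 * GB) / 10**9)
```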
> ---
>
> ## Serialization library or `attrs` version mismatch
>
> Ray and Geneva use cloudpickle and can be sensitive to library versions. If you see `TypeError: Enum.__new__() missing 1 required positional argument: 'value'` or similar pickling errors with no obvious non-serializable object, ensure **client and cluster use the same Python minor version and compatible library versions** (e.g. same `attrs`). Run the job from a machine with the same OS/architecture as the workers when possible so that shipped environments match.
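A minimal way to check the "same Python minor version" condition from the client side. The helper name and shape are mine, not a Geneva API, and how you obtain the cluster's version string is deployment-specific:

```python
import sys

def same_minor(client: str, cluster: str) -> bool:
    """True if two version strings share the same major.minor.
    Illustrative helper, not a Geneva API."""
    return client.split(".")[:2] == cluster.split(".")[:2]

# Compare the local interpreter against a version string reported by the
# cluster (e.g. logged by a worker at startup).
local_py = "{}.{}.{}".format(*sys.version_info[:3])
print(same_minor(local_py, local_py))  # True
print(same_minor("3.11.4", "3.12.1"))  # False: a classic source of
                                       # cloudpickle/attrs-style errors
```

The same comparison applies to library versions like `attrs.__version__` on each side.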
note: we vendor cloudpickle (i.e. we keep a specific version of cloudpickle's source in geneva's source) so that it doesn't conflict with Ray's cloudpickle (which may change across Ray versions). We might need to do that with attrs too.
**jmhsieh** left a comment:

lgtm, suggested some additions to consider.
First cut at a "jobs troubleshooting" doc - using some of the internal_docs plus my own experience plus Claude's ideas. I think if this is incomplete, that's fine; we can always add things to it later. It's also ok if it's a little repetitive - if these things have been stated elsewhere. I am mostly looking to make sure it doesn't include anything misleading.
fixes https://linear.app/lancedb/issue/GEN-304/docs-add-a-troubleshooting-jobs-page