Add troubleshooting doc#168

Merged
dantasse merged 2 commits into main from dantasse/troubleshooting
Feb 23, 2026
Conversation

Contributor

@dantasse dantasse commented Feb 23, 2026

First cut at a "jobs troubleshooting" doc - using some of the internal_docs plus my own experience plus Claude's ideas. I think if this is incomplete, that's fine; we can always add things to it later. It's also ok if it's a little repetitive - if these things have been stated elsewhere. I am mostly looking to make sure it doesn't include anything misleading.

fixes https://linear.app/lancedb/issue/GEN-304/docs-add-a-troubleshooting-jobs-page

Contributor

prrao87 commented Feb 23, 2026

Yeah it's okay to be repetitive at this point - troubleshooting docs are always a good idea at this stage. 👌🏽

**1. Ray dashboard – see what’s actually running**

- **Local Ray:** After starting a local cluster, Ray usually prints the dashboard URL (e.g. `http://127.0.0.1:8265`). Open that in a browser. If you didn’t capture it, the dashboard is typically on port **8265** on the host where Ray was started.
- **KubeRay (Kubernetes):** Port-forward the dashboard from the Ray head pod, then open it locally:
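A typical invocation might look like the following (the service name follows KubeRay's usual `<cluster>-head-svc` convention; the namespace and cluster name here are placeholders, so adjust them for your deployment):

```shell
# Forward the Ray dashboard port from the head service to your machine
kubectl port-forward -n <namespace> svc/<raycluster-name>-head-svc 8265:8265

# Then open http://localhost:8265 in a browser
```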
Contributor
There is a case here where k8s / KubeRay / the cluster is configured to use GPUs but GPUs aren't being provisioned by the CSP. This one happens the most in our tests, actually.

Contributor Author
ah you mean, sometimes that causes jobs to pass admission but hang at low progress?

how can we tell that's happening, and/or fix it if it is?


**2. Memory pressure**

If nodes are full or workers are being OOM-killed, the job can stall or fail. In the dashboard, check node memory usage; in Kubernetes, check pod restarts (`kubectl get pods -n NAMESPACE`) and pod logs for OOM. Reduce `concurrency` or the UDF’s memory request so more tasks fit on the cluster, or add nodes.
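For the Kubernetes case, a couple of commands that surface OOM kills (the namespace is a placeholder; the second command lists each pod alongside the last termination reason of its containers, so OOM-killed workers show up as `OOMKilled`):

```shell
# Check for restarting pods in the Ray namespace
kubectl get pods -n NAMESPACE

# Print each pod name with its containers' last termination reason
kubectl get pods -n NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oom
```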
Contributor
There was a case where I changed something from GB to GiB and got into this place where admission control passed but it was unable to schedule. Basically, the admission control isn't perfect and it got stuck in a funny place.

Suggest lowering memory requests, or watching out for GB (10^9) vs GiB (2^30).

Contributor Author

hmm, I mean the GB vs GiB case has got to be one where the extra ~74 MB per unit (2^30 − 10^9 ≈ 73.7 MB) mattered, right? So yeah I think "slightly lower your memory requirements" is our best recommendation here.
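For reference, the GB-vs-GiB gap in that exchange works out like this (plain arithmetic, nothing Geneva-specific):

```python
GB = 10**9    # decimal gigabyte, as cloud provider APIs usually count
GiB = 2**30   # binary gibibyte, as the kernel and Kubernetes "Gi" count

diff = GiB - GB  # bytes by which 1 GiB exceeds 1 GB
print(f"1 GiB - 1 GB = {diff:,} bytes (~{diff / 10**6:.1f} MB)")
print(f"1 GiB is {100 * (GiB / GB - 1):.1f}% larger than 1 GB")
```

So a request written in GiB asks for about 7.4% more memory than the same number in GB, which can be exactly enough to pass admission control but fail to schedule.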

---
## Serialization library or `attrs` version mismatch

Ray and Geneva use cloudpickle and can be sensitive to library versions. If you see `TypeError: Enum.__new__() missing 1 required positional argument: 'value'` or similar pickling errors with no obvious non-serializable object, ensure **client and cluster use the same Python minor version and compatible library versions** (e.g. same `attrs`). Run the job from a machine with the same OS/architecture as the workers when possible so that shipped environments match.
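One way to make the comparison concrete is to record the client-side versions and check them against what the workers log. This is a small sketch using only the standard library (not part of Geneva's API; the package list is an assumption):

```python
import importlib.metadata
import sys


def env_report() -> dict:
    """Collect client-side versions to compare against worker logs."""
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    for pkg in ("attrs", "cloudpickle", "ray"):
        try:
            report[pkg] = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report


print(env_report())
```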
Contributor
note: we vendor cloudpickle (i.e. ship a specific version of cloudpickle's source inside Geneva's source) so that it doesn't conflict with Ray's cloudpickle (which may change across versions). We might need to do that with attrs too.

Contributor

@jmhsieh jmhsieh left a comment


lgtm, suggested some additions to consider.

@dantasse dantasse merged commit fcb34bc into main Feb 23, 2026
2 checks passed
@dantasse dantasse deleted the dantasse/troubleshooting branch February 23, 2026 23:02
