Force-pushed `ea22e42` to `70d2084`
Yeah it's okay to be repetitive at this point - troubleshooting docs are always a good idea at this stage. 👌🏽
> **1. Ray dashboard – see what’s actually running**
>
> - **Local Ray:** After starting a local cluster, Ray usually prints the dashboard URL (e.g. `http://127.0.0.1:8265`). Open that in a browser. If you didn’t capture it, the dashboard is typically on port **8265** on the host where Ray was started.
> - **KubeRay (Kubernetes):** Port-forward the dashboard from the Ray head pod, then open it locally:
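The quoted doc leaves the actual port-forward command implicit. A sketch, assuming a standard KubeRay deployment where the dashboard listens on 8265 on the head node (`NAMESPACE` and `HEAD_POD_NAME` are placeholders you'd substitute):

```shell
# Find the Ray head pod (KubeRay labels head nodes with ray.io/node-type=head).
kubectl get pods -n NAMESPACE -l ray.io/node-type=head

# Forward the dashboard port from the head pod to localhost,
# then open http://127.0.0.1:8265 in a browser.
kubectl port-forward -n NAMESPACE pod/HEAD_POD_NAME 8265:8265
```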
There is a case here where k8s / KubeRay / the cluster is configured to use GPUs but GPUs aren't being provisioned by the CSP. This one happens the most to our tests, actually.
ah you mean, sometimes that causes jobs to pass admission but hang at low progress?
how can we tell that's happening, and/or fix it if it is?
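One way to spot that case from a Python shell: compare what the cluster claims to have against what ever actually joined. `ray.cluster_resources()` and `ray.available_resources()` are real Ray APIs that return resource dicts; the diagnosis helper below is just my sketch of how to read them, not a Geneva API:

```python
def diagnose_gpus(cluster_resources: dict, available_resources: dict) -> str:
    """Hypothetical helper: interpret Ray resource dicts to spot the
    'GPUs configured but never provisioned' case.

    Pass the dicts returned by ray.cluster_resources() and
    ray.available_resources().
    """
    total = cluster_resources.get("GPU", 0)
    free = available_resources.get("GPU", 0)
    if total == 0:
        # The cluster spec asks for GPUs but no GPU node ever joined --
        # the job can pass admission yet hang at low progress.
        return "no GPU nodes joined the cluster; check cloud-provider provisioning"
    if free == 0:
        return "GPUs exist but are all busy; the job should make progress eventually"
    return f"{free}/{total} GPUs free"
```

If the "GPU" key never appears in `ray.cluster_resources()` at all, the CSP never delivered the nodes, and the fix is on the provisioning side (quotas, instance availability), not in the job.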
`docs/geneva/jobs/troubleshooting.mdx` (outdated)
> **2. Memory pressure**
>
> If nodes are full or workers are being OOM-killed, the job can stall or fail. In the dashboard, check node memory usage; in Kubernetes, check pod restarts (`kubectl get pods -n NAMESPACE`) and pod logs for OOM. Reduce `concurrency` or the UDF’s memory request so more tasks fit on the cluster, or add nodes.
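The "so more tasks fit" arithmetic is worth making concrete. A back-of-envelope sketch (the numbers and the helper are illustrative, not a Geneva or Ray API):

```python
def tasks_that_fit(node_memory_bytes: int, task_memory_bytes: int) -> int:
    """How many tasks of a given memory request fit on one node.
    Illustrative arithmetic only, not a Geneva or Ray API."""
    return node_memory_bytes // task_memory_bytes

# A 16 GiB node with 4 GiB per-task requests runs at most 4 tasks at once;
# if `concurrency` asks for more, the extras queue (or the job stalls if
# nothing can ever be scheduled).
print(tasks_that_fit(16 * 2**30, 4 * 2**30))  # 4

# Note the unit trap: the same "16 GB" node measured decimally holds
# only 3 of those 4 GiB tasks.
print(tasks_that_fit(16 * 10**9, 4 * 2**30))  # 3
```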
There was a case where I changed something from GB to GiB and got into this place where admission control passed but it was unable to schedule. Basically, the admission control isn't perfect and it got stuck in a funny place.
Suggest lowering memory requests, or using GB (10^9) vs GiB (2^30).
hmm
I mean the GB vs GiB has got to be a case where the extra ~73 MB per GiB mattered, right? So yeah I think "slightly lower your memory requirements" is our best recommendation here.
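That 73 MB figure is just 2^30 − 10^9. A quick check of the arithmetic:

```python
GB = 10**9   # decimal gigabyte
GiB = 2**30  # binary gibibyte

# Per-unit difference: about 73.7 MB for every GiB requested.
diff = GiB - GB
print(diff)                 # 73741824
print(round(diff / 10**6))  # 74 (MB)

# Requesting "16GiB" instead of "16GB" asks for ~1.18 GB more memory --
# potentially enough to tip a node from schedulable to stuck.
print((16 * GiB - 16 * GB) / 10**9)
```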
> ---
>
> ## Serialization library or `attrs` version mismatch
>
> Ray and Geneva use cloudpickle and can be sensitive to library versions. If you see `TypeError: Enum.__new__() missing 1 required positional argument: 'value'` or similar pickling errors with no obvious non-serializable object, ensure **client and cluster use the same Python minor version and compatible library versions** (e.g. same `attrs`). Run the job from a machine with the same OS/architecture as the workers when possible so that shipped environments match.
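A minimal way to check the "same Python minor version" condition from the client side. The helper name and shape are mine, not a Geneva API, and how you obtain the cluster's version string is deployment-specific:

```python
import sys

def same_minor(client: str, cluster: str) -> bool:
    """True if two version strings share the same major.minor.
    Illustrative helper, not a Geneva API."""
    return client.split(".")[:2] == cluster.split(".")[:2]

# Compare the local interpreter against a version string reported by the
# cluster (e.g. logged by a worker at startup).
local_py = "{}.{}.{}".format(*sys.version_info[:3])
print(same_minor(local_py, local_py))  # True
print(same_minor("3.11.4", "3.12.1"))  # False: a classic source of
                                       # cloudpickle/attrs-style errors
```

The same comparison applies to library versions like `attrs.__version__` on each side.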
note: we vendor cloudpickle (i.e. we keep a specific version of cloudpickle's source in geneva's source) so that it doesn't conflict with Ray's cloudpickle (which may change across Ray versions). We might need to do that with attrs too.
**jmhsieh** left a comment:

lgtm, suggested some additions to consider.
First cut at a "jobs troubleshooting" doc - using some of the internal_docs plus my own experience plus Claude's ideas. I think if this is incomplete, that's fine; we can always add things to it later. It's also ok if it's a little repetitive - if these things have been stated elsewhere. I am mostly looking to make sure it doesn't include anything misleading.
fixes https://linear.app/lancedb/issue/GEN-304/docs-add-a-troubleshooting-jobs-page