
[CI] Fix failing docs build; Fix GPU OOM for new inference codepath; Improve post-processing in Search env#1221

Merged
SumanthRH merged 8 commits into main from fix-docs-and-oom
Feb 27, 2026

Conversation

@SumanthRH
Member

@SumanthRH SumanthRH commented Feb 26, 2026

What does this PR do?

Fixes both failing CI pipelines: the docs workflow and the failing GPU tests.

Technically these could be split into two separate PRs, but the scope of the changes is small, so I'm combining both fixes here.

Docs fix

This was a simple issue of an unescaped > symbol in the .mdx file. MDX combines Markdown with JSX, so literal > symbols in prose need to be escaped to avoid being parsed as markup.
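To illustrate the kind of fix involved (this is a made-up snippet, not the actual docs file changed in this PR):

```mdx
{/* Before: a bare ">" in prose can trip up the MDX parser */}
Throughput improved by >2x on the new codepath.

{/* After: escape it so it renders as a literal character */}
Throughput improved by \>2x on the new codepath.
```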

GPU test OOM

test_policy_local_engines_e2e.py was failing for _SKYRL_USE_NEW_INFERENCE=1 on CI (4xL4s) with a GPU OOM during weight sync:

FAILED tests/backends/skyrl_train/gpu/gpu_ci/test_policy_local_engines_e2e.py::test_policy_local_engines_e2e[colocate_gloo_fsdp_vllm] - ray.exceptions.RayTaskError(OutOfMemoryError): ray::FSDPPolicyWorkerBase.broadcast_to_inference_engines() (pid=137394, ip=10.0.42.164, actor_id=5edb2e28a83abb3c96054a0884000000, repr=<skyrl.backends.skyrl_train.workers.fsdp.fsdp_worker.FSDPPolicyWorkerBase object at 0x74b31acfde80>)
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py", line 223, in broadcast_to_inference_engines
    await self._weight_transfer_sender.send_chunks(self.weight_extractor.extract_weights(generator_dtype))
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/weight_sync/broadcast_strategy.py", line 116, in send_chunks
    for chunk in chunks:
                 ^^^^^^
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py", line 62, in extract_weights
    params = self.model.state_dict()
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2271, in state_dict
    hook_result = hook(self, destination, prefix, local_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 714, in _post_state_dict_hook
    processed_state_dict = _post_state_dict_hook_fn[fsdp_state._state_dict_type](
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 559, in _sharded_post_state_dict_hook
    return _common_unshard_post_state_dict_hook(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 229, in _common_unshard_post_state_dict_hook
    param_hook(state_dict, prefix, fqn)
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 549, in param_hook
    sharded_tensor = _ext_chunk_dtensor(
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/fsdp/_fsdp_extensions.py", line 150, in _ext_chunk_dtensor
    return chunk_dtensor_fn(
           ^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/fsdp/_shard_utils.py", line 112, in _create_chunk_dtensor
    ).redistribute(
      ^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/tensor/_api.py", line 555, in redistribute
    return Redistribute.apply(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/autograd/function.py", line 581, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/tensor/_redistribute.py", line 319, in forward
    output = redistribute_local_tensor(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/tensor/_redistribute.py", line 228, in redistribute_local_tensor
    new_local_tensor = target_placement._replicate_to_shard(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmprxG2mh/lib/python3.12/site-packages/torch/distributed/tensor/placement_types.py", line 281, in _replicate_to_shard
    return shards[shard_index].clone()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 260.00 MiB. GPU 0 has a total capacity of 21.95 GiB of which 202.62 MiB is free. Process 136096 has 18.46 GiB memory in use. Including non-PyTorch memory, this process has 3.28 GiB memory in use. Of the allocated memory 2.86 GiB is allocated by PyTorch, and 112.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
FAILED tests/backends/skyrl_train/gpu/gpu_ci/test_policy_local_engines_e2e.py::test_policy_local_engines_e2e[colocate_nccl_fsdp2_vllm] - ray.exceptions.RayTaskError(OutOfMemoryError): ray::FSDPPolicyWorkerBase.init_model() (pid=141722, ip=10.0.42.164, actor_id=a28f0dcc5cca3c5c5f3da82288000000, repr=<skyrl.backends.skyrl_train.workers.fsdp.fsdp_worker.FSDPPolicyWorkerBase object at 0x7475d6699df0>)
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py", line 151, in init_model
    self.model, self.optimizer, self.scheduler = strategy.prepare(
                                                 ^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/distributed/fsdp_strategy.py", line 201, in prepare
    ret.append(self._fsdp_init_train_model(*arg))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/distributed/fsdp_strategy.py", line 280, in _fsdp_init_train_model
    fsdp_module = self._fsdp_init_model(model, is_train=True, is_wrapped=is_wrapped)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/distributed/fsdp_strategy.py", line 270, in _fsdp_init_model
    fsdp2_load_full_state_dict(module, full_state, cpu_offload)
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/distributed/fsdp_utils.py", line 326, in fsdp2_load_full_state_dict
    load_fsdp2_model_to_gpu(model)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-25_02-26-03_179185_3353/runtime_resources/working_dir_files/s3_anyscale-production-data-cld-hxkifz7xa22mwicp21nzkds1lw_org_xc6lv84h3d7m9dljcc17esfw2i_cld_hxkifz7xa22mwicp21nzkds1lw_runtime_env_packages_pkg_0d1b9abb0eef4b544997d2ba00e39883/skyrl/backends/skyrl_train/distributed/fsdp_utils.py", line 195, in load_fsdp2_model_to_gpu
    model.to(device, non_blocking=True)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4343, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1371, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fully_shard.py", line 626, in _apply
    ret = super()._apply(*args, **kwargs)  # type: ignore[misc]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 930, in _apply
    module._apply(fn)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 930, in _apply
    module._apply(fn)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 930, in _apply
    module._apply(fn)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fully_shard.py", line 626, in _apply
    ret = super()._apply(*args, **kwargs)  # type: ignore[misc]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 930, in _apply
    module._apply(fn)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 930, in _apply
    module._apply(fn)
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 957, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1357, in convert
    return t.to(
           ^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/distributed/tensor/_api.py", line 349, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/distributed/tensor/_dispatch.py", line 233, in dispatch
    local_results = op_call(*local_tensor_args, **op_info.local_kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmp2s33DU/lib/python3.12/site-packages/torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 21.95 GiB of which 8.62 MiB is free. Process 140438 has 18.46 GiB memory in use. Including non-PyTorch memory, this process has 3.47 GiB memory in use. Of the allocated memory 3.11 GiB is allocated by PyTorch, and 25.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Only the tests with a colocated engine and trainer failed.

When I added the test, I had only run it on H100s, which explains why this wasn't caught earlier.

Surprisingly, the new inference codepath used the exact same inference engine configuration as the old codepath, whose tests were passing without OOM. The only difference was that the old codepath performs a redundant sleep + wake_up after engine initialization:

client = InferenceEngineClient(eps, tokenizer, cfg)
if sleep:
    asyncio.run(client.wake_up())

When I added the same sleep + wake_up calls to the new inference codepath, the tests passed!

I looked into the memory usage differences on H100s: adding the redundant sleep + wake_up calls saved ~6 GB of memory. I believe this is related to clearing CUDA cache memory that is only used temporarily during init.
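To make that hypothesis concrete, here's a toy sketch. ToyEngine and its fields are invented for illustration and are not the vLLM/SkyRL API; the point is that the engine ends up back in the same awake state, but sleep() gives the allocator a chance to drop cached blocks left over from initialization (analogous to torch.cuda.empty_cache()):

```python
# Toy model of the redundant sleep + wake_up cycle: same end state,
# but the sleep step releases init-time scratch allocations.
class ToyEngine:
    def __init__(self):
        self.awake = True
        # scratch allocations made during init and never needed again
        self.cached_blocks = ["init_scratch"] * 3

    def sleep(self):
        self.awake = False
        self.cached_blocks.clear()  # the cache-clearing side effect

    def wake_up(self):
        self.awake = True

engine = ToyEngine()
engine.sleep()
engine.wake_up()
assert engine.awake and not engine.cached_blocks  # same state, less memory held
```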

The Fix

For the test itself, the fix is to simply reduce the gpu_memory_utilization parameter passed to vLLM so that it is appropriate for L4s. I prefer this over adding a redundant sleep + wake_up, which is more brittle. I've reduced gpu_memory_utilization to 0.7 and confirmed that the tests pass on 4xL4 with this change.
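A back-of-the-envelope check using the numbers from the OOM log above (headroom_gib is a made-up helper, not part of SkyRL or vLLM; vLLM pre-reserves roughly gpu_memory_utilization of the device, leaving the rest for the colocated trainer):

```python
def headroom_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """GPU memory left over after vLLM reserves its fraction of the device."""
    return total_gib * (1.0 - gpu_memory_utilization)

L4_TOTAL_GIB = 21.95  # "total capacity of 21.95 GiB" from the log

# At OOM time the vLLM process held 18.46 GiB, an effective utilization
# of ~0.84, leaving ~3.5 GiB -- razor-thin once the trainer process
# (~3.3 GiB including non-PyTorch memory) plus CUDA context is colocated.
effective_util = 18.46 / L4_TOTAL_GIB

# With the fix, gpu_memory_utilization = 0.7 leaves ~6.6 GiB of headroom.
print(round(headroom_gib(L4_TOTAL_GIB, effective_util), 2))  # ~3.5
print(round(headroom_gib(L4_TOTAL_GIB, 0.7), 2))             # ~6.6
```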



Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH marked this pull request as ready for review February 26, 2026 18:20
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses two separate issues: a failing documentation build and a GPU out-of-memory error in the CI tests. The documentation fix correctly escapes special characters in an MDX file to prevent parsing errors. The OOM fix reduces the gpu_memory_utilization for vLLM in a specific test, which is a reasonable approach to make the test pass on CI machines with less GPU memory. The changes are logical and well-explained. I have one minor suggestion to improve code readability by replacing a magic number with a named constant.

cfg.generator.inference_engine.run_engines_locally = True
# NOTE: We reduce the gpu memory used by vLLM because of the colocated tests
# that can OOM on L4s. For more details, see: https://github.com/NovaSky-AI/SkyRL/pull/1221
cfg.generator.inference_engine.gpu_memory_utilization = 0.7
Contributor


Severity: medium

To improve readability and maintainability, consider defining a constant for the magic number 0.7 at the module level, for example: VLLM_GPU_MEMORY_UTILIZATION_FOR_CI = 0.7. This makes the purpose of the value clearer and simplifies future modifications.

x
x
devin-ai-integration[bot]

This comment was marked as resolved.

x
x
@SumanthRH
Member Author

SumanthRH commented Feb 27, 2026

Added another fix in this PR:

tests/backends/skyrl_train/gpu/gpu_ci/test_skyrl_gym_generator.py::test_generator_multi_turn_search was failing for the new inference codepath:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <skyrl_gym.envs.search.env.SearchEnv object at 0x7755712f7aa0>
action = 'In <think> <information>Francis William Aston, Edward Mills Purcell, and William Bo推un that the first Nobel Prize in ...mil von Behring and Robert Koch.</information> </think>\n\nTherefore, the answer is <answer>Emil von Behring</answer>.'

    def _validate_action(self, action: str):
        stop_tags = ["</search>", "</answer>"]
        # TODO (sumanthrh): This assertion should really be that the *last token* generated contains <answer>.
        # The last token generated can have additional punctuation characters like periods, etc.
        action = action.rstrip("\n")  # strip out any trailing newlines and periods
        for tag in stop_tags:
            if tag in action:
>               assert action.split(tag, 1)[1] == "", (
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    f"{tag} detected in the response but it is not the last string generated. "
                    f"Use {stop_tags} as stop strings in the configuration."
                )
E               AssertionError: </answer> detected in the response but it is not the last string generated. Use ['</search>', '</answer>'] as stop strings in the configuration.

skyrl-gym/skyrl_gym/envs/search/env.py:75: AssertionError

The root cause is that the Qwen model generates a full stop right after the closing tag in this case: the response ends with "</answer>." and generation stops there because the output contains the stop string </answer>. However, the _validate_action function is brittle and assumes the last characters of the response are exactly </answer>, which is not true in general.

I've added a fix to the validation code for now. In the future, the assertion should be relaxed further.
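A minimal sketch of the kind of post-processing fix described, assuming the goal is to tolerate a short trailing run of punctuation or whitespace after a stop tag (validate_action and allowed_trailing are illustrative names, not the exact code merged in this PR):

```python
STOP_TAGS = ["</search>", "</answer>"]

def validate_action(action: str, allowed_trailing: str = " \n\t.!?") -> None:
    """Assert that any stop tag is (almost) the last thing generated,
    allowing trailing punctuation/whitespace like the '.' in '</answer>.'"""
    for tag in STOP_TAGS:
        if tag in action:
            trailing = action.split(tag, 1)[1]
            # strip only the allowed trailing characters from both ends;
            # anything substantive left over means the tag wasn't terminal
            assert trailing.strip(allowed_trailing) == "", (
                f"{tag} detected but followed by non-trivial text: {trailing!r}. "
                f"Use {STOP_TAGS} as stop strings in the configuration."
            )

# The failing case from the log now passes: a full stop after the tag is OK.
validate_action("Therefore, the answer is <answer>Emil von Behring</answer>.")
```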

@SumanthRH SumanthRH changed the title [CI] Fix failing docs build; Fix GPU OOM for new inference codepath [CI] Fix failing docs build; Fix GPU OOM for new inference codepath; Improve post-processing in Search env Feb 27, 2026
@SumanthRH SumanthRH merged commit af8db44 into main Feb 27, 2026
8 checks passed