
Add support for Markovian Text Environments#1094

Open
DhananjayAshok wants to merge 4 commits into NovaSky-AI:main from DhananjayAshok:main

Conversation


@DhananjayAshok DhananjayAshok commented Feb 12, 2026

The Markov assumption treats an environment as fully described by its present state: given the present state, the past is irrelevant for assessing transition dynamics, rewards, etc.

This is a useful framework that applies in a range of scenarios:

  • Chat domains where an external summarisation system, independent of the agent, provides complete information in the context of each individual prompt.
  • Text-based games where the agent sees an "observation" that fully describes its current situation.
  • Long context domains where the agent can summarise its own history and hence avoid a blow-up in input length.

Currently, SkyRL always provides the full chat history to the generator, even in the step_wise_trajectories mode. This change adds a flag, generator.previous_observation_only, that restricts the generator's input to only the previous observation.
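A minimal sketch of the difference between the two modes. All names here (build_generator_input, the message layout, the role used for observations) are illustrative assumptions, not SkyRL's actual API:

```python
# Hypothetical sketch: select what the generator sees at each step.
# Assumes observations arrive as "user" messages, the newest last.
def build_generator_input(chat_history, previous_observation_only=False):
    if not previous_observation_only:
        # Default behaviour: the full conversation so far.
        return list(chat_history)
    # Markovian mode: keep only the most recent observation message.
    last_obs = [m for m in chat_history if m["role"] == "user"][-1:]
    if not last_obs:
        raise ValueError("previous_observation_only is enabled, "
                         "but no observation is available.")
    return last_obs

history = [
    {"role": "user", "content": "obs t=0"},
    {"role": "assistant", "content": "action t=0"},
    {"role": "user", "content": "obs t=1"},
]
# Full-history mode returns everything; Markovian mode returns only obs t=1.
assert build_generator_input(history) == history
assert build_generator_input(history, previous_observation_only=True) == [
    {"role": "user", "content": "obs t=1"}
]
```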




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Markovian text environments by adding a previous_observation_only flag. The changes are well-implemented, including necessary configuration updates, validation logic, and modifications to the agent loop to use only the previous observation when the flag is enabled. The fix to use initial_input_ids for prompt_ids is a good catch. I have one suggestion to improve the robustness of an assertion.

Comment on lines +309 to +311
assert (
use_chat_history != []
), "previous_observation_only is enabled, but recieved an empty observation from the environment. Ensure that the environment always returns at least one observation message when `step` is called."


Severity: medium

For better robustness, it's recommended to use a ValueError check instead of an assert. assert statements are disabled when Python is run in optimized mode (e.g., with the -O flag), which could lead to this important check being skipped in production. Since this validates input from a potentially user-defined environment, a ValueError ensures the check is always performed. I've also corrected a typo in the message.

if not use_chat_history:
    raise ValueError("previous_observation_only is enabled, but received an empty observation from the environment. Ensure that the environment always returns at least one observation message when `step` is called.")
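To illustrate the reviewer's point about assert being stripped in optimized mode, here is a small self-contained check (this demo is my addition, not part of the PR):

```python
import subprocess
import sys

# Run the same failing assert with and without -O. Under -O, Python
# removes assert statements entirely, so the check silently vanishes.
plain = subprocess.run([sys.executable, "-c", "assert False"],
                       capture_output=True)
optimized = subprocess.run([sys.executable, "-O", "-c", "assert False"],
                           capture_output=True)

assert plain.returncode != 0      # AssertionError: process exits non-zero
assert optimized.returncode == 0  # assert stripped: program exits cleanly
```

A raise ValueError(...) is immune to -O, which is why it is preferred for validating user-supplied environments.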


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.



@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 9 additional findings in Devin Review.


"output_response": output_responses[i],
"score": concat_generator_outputs["rewards"][i],
"stop_reason": concat_generator_outputs.get("stop_reasons", [None] * len(input_prompts))[i],
"is_last_step": concat_generator_outputs.get("is_last_step", [None] * len(input_prompts))[i],


🔴 None[i] TypeError crash in dump_per_dataset_eval_results for non-step-wise evaluation

When dump_per_dataset_eval_results is called from the non-step-wise evaluate() path (i.e., step_wise_trajectories=False), the new "is_last_step" line will crash with TypeError: 'NoneType' object is not subscriptable.

Root Cause

In non-step-wise mode, the generator output includes "is_last_step": None (skyrl-train/skyrl_train/generators/skyrl_gym_generator.py:761). When this None value propagates through concatenate_generator_outputs into concat_generator_outputs, the new line:

"is_last_step": concat_generator_outputs.get("is_last_step", [None] * len(input_prompts))[i],

attempts dict.get("is_last_step", default). Since the key "is_last_step" exists in the dict (with value None), Python's dict.get() returns None — NOT the default list. The subsequent [i] indexing on None raises TypeError.
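The dict.get() behaviour described above is easy to verify in isolation; the keys here mirror the generator output shape but the literal values are illustrative:

```python
# dict.get(key, default) falls back to the default only when the key is
# ABSENT, not when its value is None.
outputs = {"rewards": [1.0], "is_last_step": None}  # key present, value None

value = outputs.get("is_last_step", [None] * 1)
assert value is None  # the default list is NOT returned

# So indexing the result crashes, exactly as the finding describes:
try:
    value[0]
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable

# A truthy guard handles both a None value and a missing key:
safe = outputs["is_last_step"][0] if outputs.get("is_last_step") else None
assert safe is None
```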

This is in contrast with the "trajectory_ids" handling added in the same PR, which correctly uses a truthy guard:

"instance_id": (
    concat_generator_outputs["trajectory_ids"][i].instance_id
    if concat_generator_outputs.get("trajectory_ids")
    else None
),

Impact: Any non-step-wise evaluation run with dump_eval_results=True (the default) will crash when writing eval results, preventing eval output from being saved.

Suggested change:
- "is_last_step": concat_generator_outputs.get("is_last_step", [None] * len(input_prompts))[i],
+ "is_last_step": (
+     concat_generator_outputs["is_last_step"][i]
+     if concat_generator_outputs.get("is_last_step")
+     else None
+ ),

