Add support for Markovian Text Environments#1094
Add support for Markovian Text Environments#1094DhananjayAshok wants to merge 4 commits intoNovaSky-AI:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for Markovian text environments by adding a previous_observation_only flag. The changes are well-implemented, including necessary configuration updates, validation logic, and modifications to the agent loop to use only the previous observation when the flag is enabled. The fix to use initial_input_ids for prompt_ids is a good catch. I have one suggestion to improve the robustness of an assertion.
| assert ( | ||
| use_chat_history != [] | ||
| ), "previous_observation_only is enabled, but recieved an empty observation from the environment. Ensure that the environment always returns at least one observation message when `step` is called." |
There was a problem hiding this comment.
For better robustness, it's recommended to use a ValueError check instead of an assert. assert statements are disabled when Python is run in optimized mode (e.g., with the -O flag), which could lead to this important check being skipped in production. Since this validates input from a potentially user-defined environment, a ValueError ensures the check is always performed. I've also corrected a typo in the message.
if not use_chat_history:
raise ValueError("previous_observation_only is enabled, but received an empty observation from the environment. Ensure that the environment always returns at least one observation message when `step` is called.")| "output_response": output_responses[i], | ||
| "score": concat_generator_outputs["rewards"][i], | ||
| "stop_reason": concat_generator_outputs.get("stop_reasons", [None] * len(input_prompts))[i], | ||
| "is_last_step": concat_generator_outputs.get("is_last_step", [None] * len(input_prompts))[i], |
There was a problem hiding this comment.
🔴 None[i] TypeError crash in dump_per_dataset_eval_results for non-step-wise evaluation
When dump_per_dataset_eval_results is called from the non-step-wise evaluate() path (i.e., step_wise_trajectories=False), the new "is_last_step" line will crash with TypeError: 'NoneType' object is not subscriptable.
Root Cause
In non-step-wise mode, the generator output includes "is_last_step": None (skyrl-train/skyrl_train/generators/skyrl_gym_generator.py:761). When this None value propagates through concatenate_generator_outputs into concat_generator_outputs, the new line:
"is_last_step": concat_generator_outputs.get("is_last_step", [None] * len(input_prompts))[i],attempts dict.get("is_last_step", default). Since the key "is_last_step" exists in the dict (with value None), Python's dict.get() returns None — NOT the default list. The subsequent [i] indexing on None raises TypeError.
This is in contrast with the "trajectory_ids" handling added in the same PR, which correctly uses a truthy guard:
"instance_id": (
concat_generator_outputs["trajectory_ids"][i].instance_id
if concat_generator_outputs.get("trajectory_ids")
else None
),Impact: Any non-step-wise evaluation run with dump_eval_results=True (the default) will crash when writing eval results, preventing eval output from being saved.
| "is_last_step": concat_generator_outputs.get("is_last_step", [None] * len(input_prompts))[i], | |
| "is_last_step": ( | |
| concat_generator_outputs["is_last_step"][i] | |
| if concat_generator_outputs.get("is_last_step") | |
| else None | |
| ), | |
Was this helpful? React with 👍 or 👎 to provide feedback.
The Markov assumption considers the state of an environment to be fully described by the present state of the system (i.e. given the present state, the past is irrelevant for assessing transition dynamics, etc.).
This is a useful framework that applies in a range of scenarios:
Currently, SkyRL always provides the full chat history to the generator, even in the
step_wise_trajectoriesmode. This change adds an option to set a flag:generator.previous_observations_onlythat will modify this to only provide the previous observation as input.