Add redirection to log stdout and stderr files per server#661

Open
ananthsub wants to merge 5 commits into NVIDIA-NeMo:main from ananthsub:add-logs-timing

@ananthsub
Contributor

@ananthsub ananthsub commented Feb 9, 2026

This PR adds redirection for each server's stdout/stderr to per-server log files under {log_dir}/{server_name}/ so that output is preserved and isolated per server.

  • On failure, the log file paths are printed and poll() tails the last lines of both stderr and stdout streams inline to surface potential issues quickly.
  • Log paths are exposed via ServerInstanceDisplayConfig and are surfaced in ng_status.
  • By default, logs are written to RESULTS_DIR/logs/. To customize the location, users can set server_log_dir in the config YAML or override it via the CLI.
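
The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper names `run_with_logs` and `tail`, the file names `stdout.log` and `stderr.log`, and the 20-line tail length are all assumptions made for the example. For brevity it blocks on `wait()` instead of checking `poll()` in a loop.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def tail(path: Path, n: int = 20) -> str:
    """Return the last n lines of a text file, or "" if the file is missing."""
    if not path.exists():
        return ""
    lines = path.read_text(errors="replace").splitlines()
    return "\n".join(lines[-n:])


def run_with_logs(server_name: str, cmd: Sequence[str], log_dir: Path) -> int:
    """Run cmd with stdout/stderr redirected to files under log_dir/server_name/."""
    server_dir = log_dir / server_name
    server_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names; the PR may name these differently.
    stdout_path = server_dir / "stdout.log"
    stderr_path = server_dir / "stderr.log"

    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        proc = subprocess.Popen(list(cmd), stdout=out, stderr=err)
        returncode = proc.wait()

    if returncode != 0:
        # On failure, print the log paths and the tail of each stream inline.
        print(f"{server_name} exited with code {returncode}", file=sys.stderr)
        for path in (stderr_path, stdout_path):
            print(f"--- last lines of {path} ---", file=sys.stderr)
            print(tail(path), file=sys.stderr)
    return returncode
```

Keeping one directory per server means a failing server's output can be inspected without interleaved lines from the others.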

@ananthsub ananthsub requested a review from bxyu-nvidia February 9, 2026 14:43
@copy-pr-bot

copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub ananthsub requested a review from pjin-nvidia February 9, 2026 21:10
@ananthsub ananthsub changed the title from "Add timings for startup, redirect logs to stderr/stdout files per server" to "Add redirection to log stdout and stderr files per server" Feb 11, 2026
@ananthsub ananthsub marked this pull request as ready for review February 11, 2026 00:47
    def start(self, global_config_dict_parser_config: GlobalConfigDictParserConfig) -> None:
        global_config_dict = get_global_config_dict(global_config_dict_parser_config=global_config_dict_parser_config)

        self._log_dir = Path(global_config_dict.get(SERVER_LOG_DIR_KEY_NAME, str(RESULTS_DIR / "logs")))
Contributor Author

RFC: this makes log redirection enabled by default rather than opt-in.
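
To make the trade-off concrete, here is a small sketch contrasting the two behaviors. The function names are hypothetical, `RESULTS_DIR` is a placeholder path, and the value of `SERVER_LOG_DIR_KEY_NAME` is assumed from the `server_log_dir` option named in the PR description; none of this is taken from the actual diff beyond the `dict.get` pattern quoted above.

```python
from pathlib import Path
from typing import Optional

# Assumed values for illustration only.
SERVER_LOG_DIR_KEY_NAME = "server_log_dir"
RESULTS_DIR = Path("results")


def resolve_log_dir_default_on(config: dict) -> Path:
    """Current behavior: fall back to RESULTS_DIR/logs when the key is unset."""
    return Path(config.get(SERVER_LOG_DIR_KEY_NAME, str(RESULTS_DIR / "logs")))


def resolve_log_dir_opt_in(config: dict) -> Optional[Path]:
    """Alternative: return None (no redirection) unless the key is set."""
    value = config.get(SERVER_LOG_DIR_KEY_NAME)
    return Path(value) if value is not None else None
```

With the opt-in variant, callers would need to skip redirection whenever the result is `None`, leaving server output on the parent process's streams.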

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>