
Misiug/benchmarkjsonconfig#239

Open
misiugodfrey wants to merge 3 commits into main from misiug/benchmarkjsonconfig

Conversation

@misiugodfrey
Contributor

Added a benchmark_config.json output file to run_benchmark.sh. This file records how the benchmark was run, with fields such as the following:

{
  "benchmark": "tpch",
  "scale_factor": 1,
  "n_workers": 1,
  "kind": "single-node",
  "gpu_count": 1,
  "gpu_name": "NVIDIA RTX 5000 Ada Generation Laptop GPU",
  "engine": "velox-gpu",
  "timestamp": "2026-02-16T18:00:44Z"
}

The scripts auto-detect everything while the benchmark runs, including the number of workers, engine, GPUs, etc. Querying nodes for their engine type through the Presto API does not seem viable, so the type is currently determined from container names (in Docker) and from an nvidia-smi call on Slurm clusters. The Slurm path needs additional testing, but the updated Slurm scripts must merge upstream before we can test it.

@TomAugspurger left a comment

The output sample you shared looks good to me.

@paul-aiyedun (Contributor) left a comment

The overall idea makes sense to me. However, I have questions about some of the implementation details, the Slurm-specific logic, and the assumptions being made.

Also, please update the PR title to reflect this change.

coordinator=false
# Worker REST/HTTP port for internal and admin endpoints.
http-server.http.port=8080
http-server.http.port=10000

Is this an intentional update?

-m, --metrics Collect detailed metrics from Presto REST API after each query.
Metrics are stored in query-specific directories.

ENVIRONMENT:

Can we use command-line arguments instead of environment variables?
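One way to do what the reviewer suggests (a sketch only; the flag names below are made up, not the script's actual interface):

```python
import argparse

# Hypothetical sketch of replacing env-var configuration with CLI flags.
parser = argparse.ArgumentParser(description="Run the Presto benchmark")
parser.add_argument("--debug", action="store_true",
                    help="Print debug logs for worker/engine detection")
parser.add_argument("--worker-image", default=None,
                    help="Worker image name (overrides auto-detection)")

# Example invocation equivalent to PRESTO_BENCHMARK_DEBUG=1:
args = parser.parse_args(["--debug"])
```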

PRESTO_BENCHMARK_DEBUG Set to 1 to print debug logs for worker/engine detection
(e.g. node URIs, reachability, metrics, Docker containers).
Use when engine is misdetected or the run fails.
Docker In Docker setups, engine is inferred from running worker

Can you please expand on this?

CONTEXT_KEY = "context"
ITERATIONS_COUNT_KEY = "iterations_count"
SCHEMA_NAME_KEY = "schema_name"
SCALE_FACTOR_KEY = "scale_factor"

Can we copy over the metadata.json file from the dataset instead of duplicating some of the details here?

"benchmark": benchmark_types[0] if len(benchmark_types) == 1 else benchmark_types,
**run_config,
}
with open(f"{bench_output_dir}/benchmark_config.json", "w") as file:

The existing JSON file has a "context" field. Can we extend that instead of writing to a new file?
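A sketch of that alternative: merge the run details into the existing results file's "context" object instead of writing a separate benchmark_config.json. The file layout and key names here are assumptions, not the repository's actual schema.

```python
import json

def merge_into_context(results_path: str, run_config: dict) -> None:
    """Hypothetical sketch: extend the existing results JSON's "context"
    field with the benchmark run details rather than creating a new file."""
    with open(results_path) as f:
        results = json.load(f)
    # Create "context" if absent, then fold in the run configuration.
    results.setdefault("context", {}).update(run_config)
    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)
```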


def get_gpu_name_from_slurm_logs() -> str | None:
"""
When running under SLURM, workers run nvidia-smi -L and write to LOGS/worker_<id>.log.

If there is a need to log the output of nvidia-smi, then we should probably do it consistently (and not just for Slurm).


def get_engine_from_slurm() -> str | None:
"""
Infer engine when running under SLURM from nvidia-smi -L output in LOGS/worker_0.log.

Can we standardize how this is done instead of having a slurm specific solution?


def get_gpu_name() -> str | None:
"""
Return GPU model name. Under SLURM, read from LOGS/worker_<id>.log if LOGS is set;

Same comment as above.


def get_worker_image() -> str | None:
"""Return worker image name from env (set by cluster/container setup)."""
return os.environ.get("WORKER_IMAGE")

Where is this set?

return os.environ.get("WORKER_IMAGE")


def _current_username() -> str:

We plan on adding the ability to run Presto with existing images. I don't think we should make assumptions about image/tag names.
