Conversation
TomAugspurger
left a comment
The output sample you shared looks good to me.
paul-aiyedun
left a comment
The overall idea makes sense to me. However, I have questions about some of the implementation details, the Slurm-specific logic, and the assumptions being made.
Also, please update the PR title to reflect this update.
coordinator=false
# Worker REST/HTTP port for internal and admin endpoints.
http-server.http.port=8080
http-server.http.port=10000
Is this an intentional update?
-m, --metrics    Collect detailed metrics from Presto REST API after each query.
                 Metrics are stored in query-specific directories.

ENVIRONMENT:
Can we use command line arguments instead of environment variables?
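For example, a rough sketch of what that could look like with `argparse`; the flag names here are just placeholders, not a proposal for the final interface:

```python
import argparse


def parse_args(argv):
    """Sketch: replace PRESTO_BENCHMARK_DEBUG with an explicit CLI flag."""
    parser = argparse.ArgumentParser(description="Presto benchmark runner")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Print debug logs for worker/engine detection "
             "(node URIs, reachability, metrics, Docker containers).",
    )
    parser.add_argument(
        "-m", "--metrics",
        action="store_true",
        help="Collect detailed metrics from the Presto REST API after each query.",
    )
    return parser.parse_args(argv)
```

This keeps all run options discoverable from `--help` instead of being spread across environment variables.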
PRESTO_BENCHMARK_DEBUG  Set to 1 to print debug logs for worker/engine detection
                        (e.g. node URIs, reachability, metrics, Docker containers).
                        Use when engine is misdetected or the run fails.
Docker                  In Docker setups, engine is inferred from running worker
Can you please expand on this?
CONTEXT_KEY = "context"
ITERATIONS_COUNT_KEY = "iterations_count"
SCHEMA_NAME_KEY = "schema_name"
SCALE_FACTOR_KEY = "scale_factor"
Can we copy over the metadata.json file from the dataset instead of duplicating some of the details here?
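Something along these lines, maybe; the directory layout and key names are assumptions about how the dataset is laid out:

```python
import json
import shutil
from pathlib import Path


def copy_dataset_metadata(dataset_dir: str, bench_output_dir: str) -> dict:
    """Sketch: copy the dataset's metadata.json into the benchmark output
    directory and return its contents, instead of re-declaring fields like
    schema_name and scale_factor in the script."""
    src = Path(dataset_dir) / "metadata.json"
    dst = Path(bench_output_dir) / "metadata.json"
    shutil.copy(src, dst)
    with open(dst) as fh:
        return json.load(fh)
```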
    "benchmark": benchmark_types[0] if len(benchmark_types) == 1 else benchmark_types,
    **run_config,
}
with open(f"{bench_output_dir}/benchmark_config.json", "w") as file:
The existing JSON file has a "context" field. Can we extend that instead of writing to a new file?
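Roughly something like this; the results file name and field layout are assumptions:

```python
import json


def extend_context(results_path: str, run_config: dict) -> None:
    """Sketch: merge the run configuration into the "context" field of the
    existing results JSON instead of writing a separate benchmark_config.json."""
    with open(results_path) as fh:
        results = json.load(fh)
    # Create "context" if missing, then fold the run config into it.
    results.setdefault("context", {}).update(run_config)
    with open(results_path, "w") as fh:
        json.dump(results, fh, indent=2)
```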
def get_gpu_name_from_slurm_logs() -> str | None:
    """
    When running under SLURM, workers run nvidia-smi -L and write to LOGS/worker_<id>.log.
If there is a need to log the output of nvidia-smi, then we should probably do this consistently (and not just for slurm).
def get_engine_from_slurm() -> str | None:
    """
    Infer engine when running under SLURM from nvidia-smi -L output in LOGS/worker_0.log.
Can we standardize how this is done instead of having a slurm specific solution?
def get_gpu_name() -> str | None:
    """
    Return GPU model name. Under SLURM, read from LOGS/worker_<id>.log if LOGS is set;
Same comment as above.
def get_worker_image() -> str | None:
    """Return worker image name from env (set by cluster/container setup)."""
    return os.environ.get("WORKER_IMAGE")
def _current_username() -> str:
We plan on adding the ability to run presto with existing images. I don't think we should make assumptions about image/tag names.
Added a benchmark_config.json output file to run_benchmark.sh. This file provides context for how the benchmark was run and will include information such as the following:
The scripts should auto-detect everything while running the benchmark, including the number of workers, engine, GPUs, etc. Querying the nodes for their engine type via the Presto API does not seem viable, so the type is currently determined from the container names (in Docker) and via an nvidia-smi call on SLURM clusters. The SLURM path needs additional testing, but we need the updated SLURM scripts to merge upstream before we can test them.