Add Tools for Managing Agent Traces in Hugging Face Datasets #217
RohitP2005 wants to merge 11 commits into ServiceNow:main from
Conversation
src/agentlab/llm/traces/uploads.py
Outdated
}

# Load the existing index dataset and add new entry
dataset = load_dataset(INDEX_DATASET, split="train")
I'm having issues on this line when trying to test things on my side, as the dataset version that's online is empty. Would there be a way to initialize it first?
I guess we should have an online test dataset to verify the functionality.
That's a valid concern, RohitP2005. Having an online test dataset would indeed help us verify the functionality in real-world settings. Would it be possible to initialize and upload a minimal test dataset that we could use for these purposes? We can involve the team in generating sample data if needed.
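One way to unblock this could be to seed the index repo once before any load_dataset call. A minimal sketch, assuming the field names from the PR description (study_name, llm, benchmark, license, trace_pointer) and the INDEX_DATASET constant defined in this file; the placeholder row is only there so the column types are inferred as strings rather than null:

from datasets import Dataset

INDEX_DATASET = "/agent_traces_index"  # value from this module

# Hypothetical one-time initialization: push a single placeholder row so the
# hosted dataset exists and has concrete (string) column types.
seed = Dataset.from_dict(
    {
        "study_name": ["placeholder"],
        "llm": ["placeholder"],
        "benchmark": ["placeholder"],
        "license": ["placeholder"],
        "trace_pointer": ["placeholder"],
    }
)
seed.push_to_hub(INDEX_DATASET, split="train")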
src/agentlab/llm/traces/uploads.py
Outdated
def upload_index_data(index_df: pd.DataFrame):
    dataset = Dataset.from_pandas(index_df)
    dataset.push_to_hub(INDEX_DATASET, split="train")


def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
Ideally, we would approve new content on our datasets. Would there be a way to make new uploads into a PR?
I'm guessing that might be on the HuggingFace side, in the dataset settings.
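For what it's worth, this can also be done client-side: huggingface_hub's upload_file accepts a create_pr flag that opens a pull request on the Hub repo instead of committing directly. A sketch under that assumption, reusing the names from this file:

from huggingface_hub import HfApi

api = HfApi()
TRACE_DATASET = "/agent_traces_data"  # value from this module

def upload_trace_as_pr(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
        create_pr=True,  # maintainers review and merge the upload on the Hub
    )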
src/agentlab/llm/traces/uploads.py
Outdated
# Hugging Face dataset names
INDEX_DATASET = "/agent_traces_index"
TRACE_DATASET = "/agent_traces_data"
This is a dev version so all good, but eventually we'd switch this to env variables.
Yeah, I understand that. Once this PR is completed, I will remove them and you can add them to your .env.
I agree, moving the Hugging Face dataset names INDEX_DATASET and TRACE_DATASET to environment variables would improve maintainability and security. It would also help us manage environment-specific settings. Thank you for addressing this.
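A minimal sketch of what that switch could look like; the AGENTLAB_* variable names are hypothetical, and the current dev values are kept as fallbacks:

import os

# Read dataset names from the environment; the fallbacks are the current dev
# values, so local runs keep working without a .env file.
INDEX_DATASET = os.environ.get("AGENTLAB_INDEX_DATASET", "/agent_traces_index")
TRACE_DATASET = os.environ.get("AGENTLAB_TRACE_DATASET", "/agent_traces_data")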
Hello @RohitP2005, this looks very interesting, thank you! Ideally, we would have a third table on top of this, with one entry per study (as in the reproducibility_journal.csv file), with a key. The entries in the experiment metadata table would point to that key.
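To make the idea concrete, here is a sketch of what that study-level table and its key could look like; the column names are illustrative, loosely following reproducibility_journal.csv, not a fixed schema:

import pandas as pd

# One row per study, identified by a study_id key.
study_df = pd.DataFrame(
    {
        "study_id": ["s-0001"],
        "study_name": ["example_study"],
        "agent": ["GenericAgent"],
        "benchmark": ["miniwob"],
        "date": ["2024-01-01"],
    }
)

# Experiment metadata entries would then carry the study key as a foreign key:
index_row = {"study_id": "s-0001", "exp_id": "exp-42", "llm": "gpt-4o", "license": "MIT"}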
src/agentlab/llm/traces/uploads.py
Outdated
def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
It could be interesting to compress the file if needed, instead of requiring it to be zipped already.
Yeah, understood. I will look into it as soon as possible.
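One possible shape for this, using only the standard library; ensure_zipped is a hypothetical helper that would be called before upload_trace:

import os
import shutil

def ensure_zipped(trace_path: str) -> str:
    """Return a path to a .zip archive, compressing a trace directory on the fly."""
    if trace_path.endswith(".zip"):
        return trace_path
    # shutil.make_archive appends ".zip" to the base name it is given and
    # returns the full path of the archive it created.
    return shutil.make_archive(trace_path.rstrip(os.sep), "zip", trace_path)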
I thought of restructuring the trace uploads and creating classes for Study and Experiment, with methods within them for their functionality. The functions are implemented in the utils files. Also, query functionality has been added. Kindly refer to Discord for a detailed description.
try:
    dataset = load_dataset(trace_dataset, use_auth_token=hf_token, split="train")
    existing_data = {"exp_id": dataset["exp_id"], "zip_file": dataset["zip_file"]}
except Exception as e:
    print(f"Could not load existing dataset: {e}. Creating a new dataset.")
    existing_data = None
Loading the traces dataset is going to be a problem, as the traces are really heavy (200GB for our TMLR paper).
Ideally we'd have something more similar to your original version:

def upload_trace(trace_file: str, exp_id: str):
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )
We would trust the index dataset to avoid duplicates, and use the trace dataset as a container in which we'd dump the zipfiles.
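A sketch of how that could fit together: only the lightweight index is ever loaded, and the trace repo is only written to. It assumes the index carries an exp_id column and reuses the INDEX_DATASET and TRACE_DATASET constants from this module:

from datasets import load_dataset
from huggingface_hub import HfApi

api = HfApi()

def upload_trace_if_new(trace_file: str, exp_id: str):
    # Duplicate detection relies on the small index dataset, never on the
    # multi-hundred-GB trace repo itself.
    index = load_dataset(INDEX_DATASET, split="train")
    if exp_id in set(index["exp_id"]):
        print(f"Trace {exp_id} already indexed; skipping upload.")
        return
    api.upload_file(
        path_or_fileobj=trace_file,
        path_in_repo=f"{exp_id}.zip",
        repo_id=TRACE_DATASET,
        repo_type="dataset",
    )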
I think we can safely remove this file now. It would be nice to have an equivalent of the upload method, to merge all three levels of upload (study, index, traces).
I'll update this to match the recent changes in the upload methods.
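A rough sketch of what that single entry point might look like; upload_study_data is a hypothetical stand-in for the study-level method, while upload_index_data and upload_trace are the functions from this PR:

def upload_all(study_row: dict, index_df, trace_files: dict):
    upload_study_data(study_row)          # hypothetical: one entry in the study table
    upload_index_data(index_df)           # experiment metadata index
    for exp_id, trace_file in trace_files.items():
        upload_trace(trace_file, exp_id)  # raw zipfiles into the trace repo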
[Sample PR] Add Tools for Managing Agent Traces in Hugging Face Datasets
reference: #53
Summary
This PR introduces a foundational implementation for managing and uploading agent traces to Hugging Face datasets. It provides tools to simplify adding traces, maintain an index dataset for easy retrieval, and enforce whitelist-based constraints for legal compliance.
Key Features
1. Hugging Face Dataset Structure
2. Upload System

Each index entry records study_name, llm, benchmark, and license, along with a pointer (trace_pointer) to the actual trace file.

Notes

Checklist