Conversation
Summary of Changes

Hello @tekrajchhetri, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the NER benchmarking capabilities by introducing a new validation dataset and a suite of analytical tools. The changes involve a major reorganization of the evaluation directory, migrating existing content to an `old` folder to make way for new, standardized benchmarking data and scripts. The new Python scripts facilitate data conversion from BIO-tagged formats and provide comprehensive analysis of model performance across various metrics, including entity detection, ontology mapping, and judge scores. This update aims to streamline future NER evaluations and provide deeper insights into model strengths and weaknesses.
Code Review
This pull request adds a significant amount of data and analysis scripts for NER benchmarking. The new Python script bio_txt.py is a useful utility for data conversion, and the suite of analysis scripts in evaluation/notebook/ provides a comprehensive framework for evaluating model performance.
My review focuses on the new Python scripts and documentation. I've identified a few areas for improvement regarding code reusability, maintainability, and robustness. Specifically, I've pointed out an issue with hardcoded file paths in bio_txt.py, a DRY principle violation due to a duplicated function across two analysis scripts, and a bug in ner_data_loader.py that could cause silent data loading failures due to inconsistent JSON structures. I've also noted that the new README file in the evaluation/notebook directory appears to be out of sync with the scripts added in this PR.
Overall, the added scripts are well-structured for analysis purposes. Addressing the suggested changes will improve the quality and reliability of the codebase.
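For context, the core BIO-to-entity conversion a script like `bio_txt.py` performs can be sketched like this (an illustrative shape only; the actual implementation and output format of `bio_txt.py` may differ):

```python
def bio_to_entities(tokens, tags):
    """Collapse BIO tags into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            if current:
                entities.append((' '.join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith('I-') and current:
            current.append(tok)
        else:  # an 'O' tag (or a stray 'I-') closes any open span
            if current:
                entities.append((' '.join(current), label))
            current, label = [], None
    if current:
        entities.append((' '.join(current), label))
    return entities

tokens = ['The', 'IL-2', 'gene', 'binds', 'NF-kB', '.']
tags = ['O', 'B-protein', 'I-protein', 'O', 'B-protein', 'O']
assert bio_to_entities(tokens, tags) == [('IL-2 gene', 'protein'), ('NF-kB', 'protein')]
```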
```python
# Structure: {judge_ner_terms: {judge_ner_terms: {chunk_id: [entities]}}}
if 'judge_ner_terms' in data:
    ner_data = data['judge_ner_terms'].get('judge_ner_terms', {})
else:
    ner_data = {}
```
The data loading logic for the without_hil group assumes a consistent nested structure of {'judge_ner_terms': {'judge_ner_terms': ...}}. However, some JSON files in the dataset (e.g., ner_config_claudesonet_s41593-024-01787-0_without_hil.json) have a flatter structure {'judge_ner_terms': ...}. The current implementation will silently fail to load data from these inconsistently formatted files, as .get('judge_ner_terms', {}) will return an empty dictionary.
To prevent silent data loading failures and make the loader more robust, the logic should be updated to handle both possible structures.
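Both layouts reduce to the same chunk-to-entities map, which a small standalone helper makes easy to verify (a sketch for illustration; `extract_ner_data` is a hypothetical name, not code from the PR):

```python
def extract_ner_data(data: dict) -> dict:
    # Unwrap one or two levels of 'judge_ner_terms' nesting.
    root = data.get('judge_ner_terms', {})
    if isinstance(root, dict) and 'judge_ner_terms' in root:
        root = root['judge_ner_terms']
    return root

nested = {'judge_ner_terms': {'judge_ner_terms': {'chunk_1': ['BRCA1']}}}
flat = {'judge_ner_terms': {'chunk_1': ['BRCA1']}}
assert extract_ner_data(nested) == {'chunk_1': ['BRCA1']}
assert extract_ner_data(flat) == {'chunk_1': ['BRCA1']}
```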
```python
else:
    # Handle inconsistent structures for without_hil group
    if 'judge_ner_terms' in data:
        ner_data_root = data['judge_ner_terms']
        # Check if it has the extra nesting
        if 'judge_ner_terms' in ner_data_root:
            ner_data = ner_data_root.get('judge_ner_terms', {})
        else:
            ner_data = ner_data_root
    else:
        ner_data = {}
```

```python
IN_FILE = Path("JNLPBA_gene_protein_test.tsv")
OUT_TEXT = Path("JNLPBA_gene_protein_test_text.txt")
OUT_ENTS_JSONL = Path("JNLPBA_gene_protein_test_entities_mapping.jsonl")
```
The input and output file paths are hardcoded, which limits the script's reusability for different datasets. It would be more maintainable and flexible to provide these paths as command-line arguments.
You can use Python's built-in argparse module to handle this. For example:
```python
import argparse
from pathlib import Path

def main():
    parser = argparse.ArgumentParser(description='Convert BIO-tagged file to text and JSONL.')
    parser.add_argument('in_file', type=Path, help='Input BIO-tagged file.')
    parser.add_argument('--out-text', type=Path, help='Output text file.')
    parser.add_argument('--out-jsonl', type=Path, help='Output JSONL file.')
    args = parser.parse_args()

    IN_FILE = args.in_file
    OUT_TEXT = args.out_text or IN_FILE.with_suffix('.txt')
    OUT_ENTS_JSONL = args.out_jsonl or IN_FILE.with_suffix('.jsonl')
    # ... rest of your main function
```

```python
def normalize_location(location: str) -> str:
    """
    Normalize paper location strings for consistency.

    Args:
        location: Raw location string

    Returns:
        Normalized location string or None if empty/whitespace
    """
    if not location or not location.strip():
        return None

    location = location.lower().strip()

    # Map variations to standard names
    location_map = {
        'introduction': 'introduction',
        'intro': 'introduction',
        'abstract': 'abstract',
        'results': 'results',
        'result': 'results',
        'methods': 'methods',
        'method': 'methods',
        'discussion': 'discussion',
        'conclusions': 'conclusion',
        'conclusion': 'conclusion',
        'references': 'references',
        'reference': 'references',
        'supplementary': 'supplementary',
        'supplement': 'supplementary',
        'figure': 'figure',
        'table': 'table',
        'introductory information': 'introduction',  # Map to introduction
        'author information': 'introduction',  # Map to introduction
        'acknowledgments': 'acknowledgments',
        'acknowledgements': 'acknowledgments'
    }

    # Check if location contains any of the keywords
    for key, value in location_map.items():
        if key in location:
            return value

    # If no match, return 'other'
    return 'other'
```
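A few quick behavioral checks of the function above (a standalone sketch that redefines a trimmed copy of the mapping so the asserts run on their own):

```python
def normalize_location(location):
    # Trimmed copy of the mapping above, enough to exercise the matching rules.
    if not location or not location.strip():
        return None
    location = location.lower().strip()
    location_map = {'introduction': 'introduction', 'intro': 'introduction',
                    'results': 'results', 'methods': 'methods'}
    for key, value in location_map.items():
        if key in location:
            return value
    return 'other'

assert normalize_location('Intro') == 'introduction'
assert normalize_location('  RESULTS  ') == 'results'
assert normalize_location('appendix') == 'other'
assert normalize_location('') is None
```

Note that matching is by substring in insertion order, so in the full mapping a later key like `'introductory information'` is already shadowed by the earlier `'intro'` entry (harmlessly, since both map to `'introduction'`).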
The normalize_location function is duplicated in ner_judge_score_analysis.py. To adhere to the Don't Repeat Yourself (DRY) principle, this function should be defined once in a shared location and imported where needed.
A good place for this utility function would be in ner_data_loader.py, perhaps as a static method of the NERDataLoader class, since it's directly related to processing the data loaded by that class.
Example of moving it to ner_data_loader.py:
```python
# In ner_data_loader.py
class NERDataLoader:
    # ... existing methods ...

    @staticmethod
    def normalize_location(location: str) -> str:
        # ... function implementation ...
```

```python
# In ner_location_analysis.py and ner_judge_score_analysis.py
df['normalized_location'] = df['paper_location'].apply(NERDataLoader.normalize_location)
```

# Token Cost Speed Analysis
A script to visualize token usage, cost, and speed metrics across different language models.

## Usage

```bash
python token_cost_speed_analysis.py --task <task_name> --file <csv_file>
```

### Arguments

- `--task`: Task name (e.g., `reproschema`). Creates output folder with this name.
- `--file`: Path to CSV file containing token usage data.

### Example

```bash
python token_cost_speed_analysis.py --task reproschema --file reproschema/reproschema_token_usage.csv
```

### Output

Generates 4 plots in both PNG and SVG formats:
- Cost distribution by model
- Token usage (input/output) by model
- Speed distribution by model
- Speed vs cost scatter plot

All files are saved in a folder named after the task.
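The naming scheme described above can be sketched with a small helper (hypothetical, for illustration; the actual script may name its files differently):

```python
from pathlib import Path

def output_paths(task: str, plot_name: str) -> list:
    # Each plot is written in both PNG and SVG inside a folder named after the task.
    out_dir = Path(task)
    return [out_dir / f"{plot_name}.{ext}" for ext in ("png", "svg")]

paths = output_paths("reproschema", "speed_vs_cost")
assert [p.name for p in paths] == ["speed_vs_cost.png", "speed_vs_cost.svg"]
```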
This README file documents a script named token_cost_speed_analysis.py, but this script does not appear to be included in this pull request. The README should be updated to document the new analysis scripts that are being added in this directory, such as ner_comprehensive_summary.py, ner_entity_pool_analysis.py, etc., to ensure the documentation is accurate and relevant to the directory's contents.
In the PR description, there is a pointer to the new dataset description, but many more things are added. Is this on purpose? Perhaps some README can be added to the new directories. From the dataset description it's not clear where the data comes from and how it will be used. I also don't understand why there are 2 directories. Gemini also added comments.
Added a link to the original dataset for reference.
Regarding the other things, no, not on purpose. Regarding where the dataset comes from: we've had multiple meetings and I had shared a drive link; I didn't want to include it in GitHub, but I have now included the link in the readme.
But in your drive, there is also no information on where you took it from, or am I missing something?
After you removed
This pull request adds the validation dataset for NER and organizes the evaluation directory content, i.e., moves the old evaluation content to the `old` directory.

What's Included
Issues this PR addresses
Note: As stated in the pull request, it's not annotated based on a paper but on an established benchmark dataset.