StructSense Pipeline Upgrade: Specialized Tools, Robust Chunking, and BioPortal Integration#62

Open
tekrajchhetri wants to merge 97 commits into main from improvement

Conversation

@tekrajchhetri
Collaborator

@tekrajchhetri tekrajchhetri commented Feb 10, 2026

This pull request introduces major improvements to StructSense, including the addition of specialized tools, fixes to the chunking mechanism, and integration with BioPortal as the ontology database.

What’s Included

  • Detailed list of changes: see the Change Log
  • Usage tutorials: see the Tutorial for instructions on running StructSense
  • Developer documentation: see the Developer Guide
  • Updated configuration templates: available in the config_template directory

Issues this PR addresses

Note: to disable traces run "crewai traces disable"
…han just spacy NER

sample output

{
      "text": "photopic spectral sensitivity curves",
      "label": "CONCEPT",
      "occurrences": [
        {
          "start": 51,
          "end": 87,
          "global_start": 1127,
          "global_end": 1163,
          "sentence": "Recently, electroretinogram (ERG) responses of the photopic spectral sensitivity curves of photoreceptors of rats and mice were measured throughout the UV-visible spectrum (300-700 nm) (Rocha et al., 2016)."
        }
      ]
    },
    {
      "text": "photoreceptors",
      "label": "ANATOMICAL-CONCEPT",
      "occurrences": [
        {
          "start": 91,
          "end": 105,
          "global_start": 1167,
          "global_end": 1181,
          "sentence": "Recently, electroretinogram (ERG) responses of the photopic spectral sensitivity curves of photoreceptors of rats and mice were measured throughout the UV-visible spectrum (300-700 nm) (Rocha et al., 2016)."
        }
      ]
    },
    {
      "text": "rats and mice",
      "label": "ORGANISM",
      "occurrences": [
        {
          "start": 109,
          "end": 122,
          "global_start": 1185,
          "global_end": 1198,
          "sentence": "Recently, electroretinogram (ERG) responses of the photopic spectral sensitivity curves of photoreceptors of rats and mice were measured throughout the UV-visible spectrum (300-700 nm) (Rocha et al., 2016)."
        }
      ]
    },
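The offsets in the sample output can be sanity-checked with a short script. This is only a sketch: it assumes that `start`/`end` index into the `sentence` string and that `global_start`/`global_end` index into the full document, which is what the numbers above suggest. The helper names are made up for illustration.

```python
# The sentence and offsets are copied from the sample output above.
SENTENCE = (
    "Recently, electroretinogram (ERG) responses of the photopic spectral "
    "sensitivity curves of photoreceptors of rats and mice were measured "
    "throughout the UV-visible spectrum (300-700 nm) (Rocha et al., 2016)."
)

ENTITIES = [
    {"text": "photopic spectral sensitivity curves",
     "start": 51, "end": 87, "global_start": 1127, "global_end": 1163},
    {"text": "photoreceptors",
     "start": 91, "end": 105, "global_start": 1167, "global_end": 1181},
    {"text": "rats and mice",
     "start": 109, "end": 122, "global_start": 1185, "global_end": 1198},
]


def span_matches(entity, sentence):
    """True when sentence[start:end] reproduces the entity text."""
    return sentence[entity["start"]:entity["end"]] == entity["text"]


def sentence_offset(entity):
    """Position of the sentence in the full document implied by the offsets."""
    return entity["global_start"] - entity["start"]


for entity in ENTITIES:
    print(entity["text"], span_matches(entity, SENTENCE), sentence_offset(entity))
```

If the assumption holds, every span matches and all three entities imply the same sentence offset, since they come from the same sentence.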
@djarecka
Contributor

djarecka commented Feb 11, 2026

Regarding the use in the BrainKB docs, it has a pointer to that directory. Did you mean to have all the contents of the tutorial in the BrainKB docs?

I was thinking that tutorial/Readme.md could be used in the BrainKB docs.

But I would also include information about everything else that has to be run before running the structsense command, e.g., grobid and the mandatory variables in .env.
The Readme.md suggests that it shows how to run the basic command and how to run the tutorial, but it doesn't have all the information that is needed.

Edit:
I've just noticed that you created a PR to the BrainKB docs. I could test the instructions from BrainKB, but I really think the basic instructions should be here as well, so I could move them, if you agree.

@tekrajchhetri
Collaborator Author

@djarecka After your comment, I put the content of the tutorial into the BrainKB docs, so it should be the same as in the tutorial.

@djarecka
Contributor

First comments based on testing the tutorial:

  • I think you didn't update the requirements; I have some missing dependencies, e.g., pydantic.
  • Providing non-existing paths doesn't raise an error, and IMO it should. Treating such a path as text can be very misleading.
  • I think there should also be a way to catch the wrong-key error and provide a short message.
  • I think there has to be a way to limit the stdout so it prints only the most important info; it's hard to follow the long messages.
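One way the path check from the second bullet could look is sketched below. This is not StructSense code: `resolve_source` and `FILE_SUFFIXES` are hypothetical names, and the rule "a value with a known file extension must exist on disk" is an assumed heuristic.

```python
import os

# Hypothetical list: extensions that signal "the user meant a file, not raw text".
FILE_SUFFIXES = {".pdf", ".txt", ".yaml", ".yml", ".json"}


def resolve_source(source: str) -> tuple[str, str]:
    """Classify a --source value as ("file", path) or ("text", value).

    Raises FileNotFoundError when the value looks like a file path
    (known extension) but nothing exists at that location.
    """
    # os.path.exists returns False instead of raising for odd inputs
    # such as very long strings.
    if os.path.exists(source):
        return "file", source
    suffix = os.path.splitext(source)[1].lower()
    if suffix in FILE_SUFFIXES:
        raise FileNotFoundError(
            f"Source looks like a file path but does not exist: {source}"
        )
    return "text", source
```

With a check like this, a mistyped path such as `no_such_dir/test_small.pdf` would fail fast instead of being processed as a one-line text input. The heuristic would misfire on raw text that happens to end in a file name, so a real implementation might prefer an explicit flag.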

tekrajchhetri and others added 3 commits February 12, 2026 15:54
pyproject.toml: Add en_core_web_sm as a direct dependency via URL so
the model is installed at project setup time rather than downloaded at
runtime via spacy.cli.download(), which uses pip internally and fails
in non-pip environments (e.g. uv). Also narrow litellm version upper
bound from <1.80.0 to <1.60.0 to avoid a regression where newer
versions accidentally import litellm.proxy.proxy_server, pulling in
proxy-only dependencies (apscheduler) even when the proxy is not used.

ner_tool.py: Remove automatic spaCy model download at runtime; raise
an explicit error if the model is not installed, prompting the user
to install it manually.
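A minimal sketch of the behaviour this commit message describes, assuming the tool loads the model through `spacy.load` (which raises `OSError` when a model package is missing). The function name, the injectable `loader` parameter, and the error wording are illustrative, not taken from ner_tool.py.

```python
def load_spacy_model(name="en_core_web_sm", loader=None):
    """Load a spaCy model, failing with an explicit message if it is missing."""
    if loader is None:
        # Imported lazily so the sketch can be exercised without spaCy installed.
        import spacy
        loader = spacy.load
    try:
        return loader(name)
    except OSError as exc:
        # No automatic download: tell the user what to do instead.
        raise RuntimeError(
            f"spaCy model '{name}' is not installed. "
            "Install the project dependencies and try again."
        ) from exc
```

Passing `loader` explicitly is only there to make the sketch easy to test; in normal use the default `spacy.load` is picked up.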
@puja-trivedi
Contributor

puja-trivedi commented Feb 14, 2026

@tekrajchhetri @djarecka, I have made a few changes to resolve some of the issues I faced when running. #64

pyproject.toml: Tighten spacy version constraint from ^3.8.11 to
>=3.8.11,<3.9.0 to ensure compatibility with the pinned
en_core_web_sm-3.8.0 model, which requires spaCy 3.8.x.

ner_tool.py: Update error message to direct users to run
'poetry install' or 'uv sync' instead of 'python -m spacy download',
consistent with the model now being a project dependency.
@djarecka
Contributor

@tekrajchhetri @djarecka, I have made a few changes to resolve some of the issues I faced when running. #64

@tekrajchhetri - please take a look, and as I mentioned before, I think the dependencies were not updated; I had to install fastapi_sso and pydantic to get rid of some warnings/errors.

@djarecka
Contributor

djarecka commented Feb 17, 2026

@tekrajchhetri - I'm wondering how many entities you would expect when running the examples from the tutorial (i.e., cli/new_config_resource.yaml and python-example/test_small.pdf) without chunking, since I keep getting 1-2 elements (it varies between runs) in mapped_specific_target_concept and mentions_with_ontology.

I also ran it on a longer paper that @puja-trivedi was testing and got up to 3 "mentions_with_ontology" entries.

@tekrajchhetri
Collaborator Author

@djarecka I did not run in non-chunking mode.

@djarecka
Contributor

Do you think chunking should be required with such a short text?

But I also ran with chunking, i.e., structsense-cli extract --env_file=cli/.env_example --config=cli/new_config_resourceyaml --source=python-example/test_small.pdf --api_key=sk-or-v1-****** --save_file=resource_extraction_example_chunk_2.json --chunk_size=2000 --max_workers=16 --enable_chunking, and the results I got are no better.

Content of resource_extraction_example_chunk_2.json
{
  "errors": [],
  "task_type": "resource",
  "elapsed_time": 66.8474349975586,
  "resources": [
    {
      "name": "Homotypic Collateral Sprouting in Mouse Visual System",
      "description": "Study examining the mechanism of homotypic collateral sprouting in the mouse visual system following diffuse axonal injury, highlighting the role of sex differences in recovery.",
      "type": "Paper",
      "category": "Neuroscience",
      "target": "Neuronal Repair Mechanism",
      "mapped_target_concept": [
        {
          "id": "N/A",
          "label": "N/A",
          "ontology": "N/A",
          "concept_mapping_provenance": "tool"
        }
      ],
      "specific_target": "Mouse Retinal Ganglion Cells",
      "mapped_specific_target_concept": [
        {
          "specific_target": "Mouse Retinal Ganglion Cells",
          "mapped_target_concept": {
            "id": null,
            "label": null,
            "ontology": null,
            "concept_mapping_provenance": "tool"
          }
        }
      ],
      "url": [],
      "mentions": {
        "related_papers": [
          "Johns Hopkins Alzheimer's Disease Research Center"
        ]
      },
      "provenance": {
        "alignment_agent": [
          "mapped_target_concept",
          "mapped_specific_target_concept"
        ]
      },
      "mentions_with_ontology": {
        "related_papers": [
          {
            "name": "Johns Hopkins Alzheimer's Disease Research Center",
            "ontology_id": null,
            "ontology_label": null,
            "ontology": null,
            "concept_mapping_provenance": "tool"
          }
        ]
      }
    }
  ],
  "verification": {
    "resources_checked": 1,
    "resources_with_text_grounding": 1,
    "all_present": true
  }
}
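To compare runs, it can help to count how many concept mappings in such a file actually resolved to an ontology term. The sketch below assumes the field names shown in the output above and treats `null`, an empty string, and `"N/A"` as "not mapped"; `mapping_summary` is a hypothetical helper, not part of StructSense.

```python
UNMAPPED = (None, "", "N/A")

# Trimmed copy of the resource entry above, keeping only the mapping fields.
resource = {
    "mapped_target_concept": [{"id": "N/A", "label": "N/A", "ontology": "N/A"}],
    "mapped_specific_target_concept": [
        {
            "specific_target": "Mouse Retinal Ganglion Cells",
            "mapped_target_concept": {"id": None, "label": None, "ontology": None},
        }
    ],
    "mentions_with_ontology": {
        "related_papers": [
            {
                "name": "Johns Hopkins Alzheimer's Disease Research Center",
                "ontology_id": None,
            }
        ]
    },
}


def mapping_summary(entry):
    """Count all concept-mapping slots and how many hold a real ontology id."""
    ids = [c.get("id") for c in entry.get("mapped_target_concept", [])]
    ids += [
        item["mapped_target_concept"].get("id")
        for item in entry.get("mapped_specific_target_concept", [])
    ]
    ids += [
        mention.get("ontology_id")
        for group in entry.get("mentions_with_ontology", {}).values()
        for mention in group
    ]
    mapped = sum(1 for value in ids if value not in UNMAPPED)
    return {"total": len(ids), "mapped": mapped}


print(mapping_summary(resource))
```

For the entry above, none of the three mapping slots holds an ontology id.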

@tekrajchhetri
Collaborator Author

@djarecka for the resource extraction, the output looks like what you've shared. It's not NER. And regarding chunking, if the provided text is small, it will not be processed in chunked mode.
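The "small inputs skip chunking" behaviour could look roughly like the sketch below. The thread does not say where the threshold lies or how chunk size is measured, so this assumes the cut-off is `chunk_size` counted in characters; `plan_chunks` is a hypothetical name, and a real splitter would likely respect sentence boundaries or overlap.

```python
def plan_chunks(text, chunk_size=2000, enable_chunking=True):
    """Return the pieces of text that would be processed.

    Assumption: text no longer than chunk_size characters is processed
    as a single piece even when chunking is enabled.
    """
    if not enable_chunking or len(text) <= chunk_size:
        return [text]
    # Naive fixed-width split, used here only to illustrate the decision.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


short_text = "a" * 1500
long_text = "b" * 4500
print(len(plan_chunks(short_text)), len(plan_chunks(long_text)))
```

Under these assumptions, the 1500-character input stays in one piece, while the 4500-character input is split into three.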

@satra

satra commented Feb 17, 2026

it looks like the resource config explicitly states:

Extract a single structured resource entry per input relevant to BBQS.

not clear why it should be a single resource entry per input.

@tekrajchhetri
Copy link
Collaborator Author

@satra That needs to be updated, but I don't think it has much of an impact: it's taken from the old config, and we were able to extract multiple resources before and still can now. For example, see the output from the current implementation:
https://github.com/sensein/structsense/blob/d6f143bbda8f168a832eab16454e95cc0df05c40/tutorial/cli/resource_extraction_example.json

@djarecka test_small is a two-page PDF cropped from a larger publication, so it may be missing some of the information we want to extract.

tekrajchhetri and others added 5 commits February 19, 2026 12:45
pyproject.toml: Remove en-core-web-sm direct dependency; the spaCy
model will be downloaded at runtime via spacy.cli.download() instead,
which works in uv environments by running 'uv add pip'.

ner_tool.py: Restore spacy.cli.download() fallback and re-add the
from spacy.cli import download import.
Update dependencies and spaCy model loading
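The restored fallback can be sketched as follows, assuming the tool uses `spacy.load` and `spacy.cli.download`, both of which are named in these commit messages. The function name and the injectable `loader`/`downloader` parameters are illustrative and exist only so the sketch can be tested without spaCy or network access.

```python
def load_or_download(name="en_core_web_sm", loader=None, downloader=None):
    """Load a spaCy model, downloading it once if it is not installed."""
    if loader is None or downloader is None:
        # Lazy imports: only needed when the real spaCy functions are used.
        import spacy
        from spacy.cli import download
        loader = loader or spacy.load
        downloader = downloader or download
    try:
        return loader(name)
    except OSError:
        # spacy.load raises OSError when the model package is missing.
        downloader(name)
        return loader(name)
```

Because `spacy.cli.download` relies on pip, this fallback depends on pip being available in the environment, which is why the commit message mentions running 'uv add pip' for uv setups.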
@djarecka
Contributor

@tekrajchhetri - coming back to the tutorial, since I'm trying to understand what you want to show there and what would be useful, I have a few comments:

  • I ran on test_small.pdf, since the paper you are using in the cli tutorial is not part of the repo (and unless I missed it, I don't see info on how to get the paper). The command still runs, since, as I mentioned earlier, structsense simply takes the path as text (I still think that's confusing).

  • You print the output, but you don't really tell people what they should look at (and it's related to the next point).

  • I think it would be important to explain how you came up with the config file, since without explaining to people how to create one, it would be hard to use.

  • I'm also wondering: do you have a specific example (short paper, config, output) for NER?

@tekrajchhetri
Collaborator Author

tekrajchhetri commented Feb 20, 2026

@djarecka what you ran and the output you got are okay. My only point is that the output is not exhaustive because test_small.pdf might not contain all the information we want to extract.

Regarding the config file, what do you mean by how I came up with it? It follows the crew.ai standard, and it's the same old config files, just updated with the things that are no longer necessary removed.

Below is the NER output for test_small.pdf without HIL. Also, if you see some empty concept mappings, that is because BioPortal is currently unstable while they make hardware upgrades.
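When reading the NER output that follows, the `weighted_score` values are consistent with "vote weight of the winning label divided by the sum of all vote weights" (e.g. 3.9 / (3.9 + 1.0) ≈ 0.796). That formula is inferred from the numbers in this output, not taken from the StructSense source, so treat the sketch below as an assumption. The function names are hypothetical.

```python
def winning_label(provenance):
    """Label whose combined vote weight is highest."""
    return max(provenance, key=lambda group: group["vote_weight"])["label"]


def weighted_score(provenance):
    """Share of the total vote weight held by the winning label (inferred formula)."""
    weights = [group["vote_weight"] for group in provenance]
    return round(max(weights) / sum(weights), 3)


# Vote weights copied from two entities in the output (sources omitted).
djz = [
    {"label": "PERSON", "vote_weight": 3.9},
    {"label": "ORG", "vote_weight": 1.0},
]
grant = [
    {"label": "GRANT_NUMBER", "vote_weight": 3.9},
    {"label": "bio", "vote_weight": 3.0},
    {"label": "PRODUCT", "vote_weight": 1.0},
]

print(winning_label(djz), weighted_score(djz))
print(winning_label(grant), weighted_score(grant))
```

Under this reading, an entity found by a single model always gets a score of 1.0, which matches the `model_count: 1` entries in the output, so a high `weighted_score` signals agreement among the models that fired rather than overall correctness.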

{
  "entities": [
    {
      "entity": "TH94252310589",
      "label": "ORGANIZATION",
      "start": 30,
      "end": 43,
      "weighted_score": 0.565,
      "model_count": 2,
      "occurrences": [
        {
          "start": 30,
          "end": 38,
          "global_start": 30,
          "global_end": 38,
          "sentence": "Untitled Section 1\n['Defense (TH94252310589) to DJZ and VEK."
        },
        {
          "start": 30,
          "end": 43,
          "global_start": 30,
          "global_end": 43,
          "sentence": "Untitled Section 1\n['Defense (TH94252310589) to DJZ and VEK."
        }
      ],
      "provenance": [
        {
          "label": "ORGANIZATION",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "TH94252310589"
            }
          ]
        },
        {
          "label": "bio",
          "vote_weight": 3.0,
          "sources": [
            {
              "source_model": "mobashgr/BC5CDR-chem-WLT-384-BioELECTRA-Pubmed-ENS-20-5",
              "weight": 3.0,
              "entity": "th942523"
            }
          ]
        }
      ],
      "ontology_id": null,
      "ontology_label": null,
      "ontology": null,
      "concept_mapping_provenance": "tool",
      "judge_score": 0.8,
      "remarks": "Entity correctly identified as an organization, but more context could clarify its role.",
      "label_ontology_id": "http://purl.obolibrary.org/obo/OBI_0000245",
      "label_ontology_label": "organization",
      "label_ontology": "CCONT",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "DJZ",
      "label": "PERSON",
      "start": 48,
      "end": 51,
      "weighted_score": 0.796,
      "model_count": 2,
      "occurrences": [
        {
          "start": 48,
          "end": 51,
          "global_start": 48,
          "global_end": 51,
          "sentence": "Untitled Section 1\n['Defense (TH94252310589) to DJZ and VEK."
        }
      ],
      "provenance": [
        {
          "label": "PERSON",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "DJZ"
            }
          ]
        },
        {
          "label": "ORG",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "DJZ"
            }
          ]
        }
      ],
      "ontology_id": null,
      "ontology_label": null,
      "ontology": null,
      "concept_mapping_provenance": "tool",
      "judge_score": 0.9,
      "remarks": "Strong identification of DJZ as a person, supported by context, yet clarity in role could enhance understanding.",
      "label_ontology_id": "http://www.semanticweb.org/mca/ontologies/2018/8/untitled-ontology-47#Person",
      "label_ontology_label": "Person",
      "label_ontology": "IBD",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "VEK",
      "label": "PERSON",
      "start": 56,
      "end": 59,
      "weighted_score": 0.796,
      "model_count": 2,
      "occurrences": [
        {
          "start": 56,
          "end": 59,
          "global_start": 56,
          "global_end": 59,
          "sentence": "Untitled Section 1\n['Defense (TH94252310589) to DJZ and VEK."
        }
      ],
      "provenance": [
        {
          "label": "PERSON",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "VEK"
            }
          ]
        },
        {
          "label": "ORG",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "VEK"
            }
          ]
        }
      ],
      "ontology_id": null,
      "ontology_label": null,
      "ontology": null,
      "concept_mapping_provenance": "tool",
      "judge_score": 0.9,
      "remarks": "VEK is well identified as a person, contextually supported, though further role clarification is needed.",
      "label_ontology_id": "http://www.semanticweb.org/mca/ontologies/2018/8/untitled-ontology-47#Person",
      "label_ontology_label": "Person",
      "label_ontology": "IBD",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "imaging",
      "label": "Diagnostic_procedure",
      "start": 61,
      "end": 68,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 0,
          "end": 7,
          "global_start": 61,
          "global_end": 68,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "Diagnostic_procedure",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "imaging"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/PMR.owl#Imaging",
      "ontology_label": "Imaging",
      "ontology": "PMR",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Well-defined as a diagnostic procedure with strong supportive context.",
      "label_ontology_id": "http://www.semanticweb.org/ontologies/2010/10/BPO.owl#diagnostic_procedure",
      "label_ontology_label": "diagnostic_procedure",
      "label_ontology": "BHO",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "mica leica microscope",
      "label": "Detailed_description",
      "start": 93,
      "end": 114,
      "weighted_score": 0.505,
      "model_count": 3,
      "occurrences": [
        {
          "start": 32,
          "end": 53,
          "global_start": 93,
          "global_end": 114,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "Detailed_description",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "mica leica microscope"
            }
          ]
        },
        {
          "label": "CELL_TYPE",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "MICA Leica microscope"
            }
          ]
        },
        {
          "label": "PRODUCT",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "Leica"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.obolibrary.org/obo/GAZ_00382380",
      "ontology_label": "Leica",
      "ontology": "GAZ",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.7,
      "remarks": "Described adequately but lacks specificity in context as a detailed description.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "Division of Neuropathology",
      "label": "ORGANIZATION",
      "start": 118,
      "end": 148,
      "weighted_score": 0.796,
      "model_count": 2,
      "occurrences": [
        {
          "start": 57,
          "end": 87,
          "global_start": 118,
          "global_end": 148,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "ORGANIZATION",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "Division of Neuropathology"
            }
          ]
        },
        {
          "label": "ORG",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "the Division of Neuropathology"
            }
          ]
        }
      ],
      "ontology_id": "http://sbmi.uth.tmc.edu/ontology/ochv#C0876934",
      "ontology_label": "neuropathology",
      "ontology": "OCHV",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.9,
      "remarks": "Accurately identified as an organization with substantial contextual support.",
      "label_ontology_id": "http://purl.obolibrary.org/obo/OBI_0000245",
      "label_ontology_label": "organization",
      "label_ontology": "CCONT",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "Johns Hopkins Alzheimers Disease Research Center",
      "label": "ORGANIZATION",
      "start": 167,
      "end": 215,
      "weighted_score": 0.661,
      "model_count": 2,
      "occurrences": [
        {
          "start": 106,
          "end": 154,
          "global_start": 167,
          "global_end": 215,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "ORGANIZATION",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "Johns Hopkins Alzheimers Disease Research Center"
            }
          ]
        },
        {
          "label": "bio",
          "vote_weight": 2.0,
          "sources": [
            {
              "source_model": "mobashgr/NCBI-disease-WLT-256-SciBERT-13INS",
              "weight": 2.0,
              "entity": "alzheimers disease"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.obolibrary.org/obo/GAZ_00145328",
      "ontology_label": "Johns Hopkins Glacier",
      "ontology": "GAZ",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.8,
      "remarks": "Identified as an organization with relevant context but could benefit from further specificity.",
      "label_ontology_id": "http://purl.obolibrary.org/obo/OBI_0000245",
      "label_ontology_label": "organization",
      "label_ontology": "CCONT",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "ADRC",
      "label": "ORGANIZATION",
      "start": 217,
      "end": 221,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 156,
          "end": 160,
          "global_start": 217,
          "global_end": 221,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "ORGANIZATION",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "ADRC"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/PDQ/CDR0000589135",
      "ontology_label": "adipose-derived regenerative cells",
      "ontology": "PDQ",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Accurately identified and well-contextualized as an organization.",
      "label_ontology_id": "http://purl.obolibrary.org/obo/OBI_0000245",
      "label_ontology_label": "organization",
      "label_ontology": "CCONT",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "P30 AG066507",
      "label": "GRANT_NUMBER",
      "start": 223,
      "end": 235,
      "weighted_score": 0.796,
      "model_count": 2,
      "occurrences": [
        {
          "start": 162,
          "end": 165,
          "global_start": 223,
          "global_end": 226,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "GRANT_NUMBER",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "P30 AG066507"
            }
          ]
        },
        {
          "label": "ORG",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "P30"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/MESH/C106367",
      "ontology_label": "cytochrome p30",
      "ontology": "MESH",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.9,
      "remarks": "Correctly identified as a grant number with solid context provided.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "P30 EY001765",
      "label": "GRANT_NUMBER",
      "start": 271,
      "end": 283,
      "weighted_score": 0.494,
      "model_count": 3,
      "occurrences": [
        {
          "start": 210,
          "end": 222,
          "global_start": 271,
          "global_end": 283,
          "sentence": "Imaging was performed using the MICA Leica microscope in the Division of Neuropathology, supported by the Johns Hopkins Alzheimers Disease Research Center (ADRC; P30 AG066507), and support was also provided by P30 EY001765."
        }
      ],
      "provenance": [
        {
          "label": "GRANT_NUMBER",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "P30 EY001765"
            }
          ]
        },
        {
          "label": "bio",
          "vote_weight": 3.0,
          "sources": [
            {
              "source_model": "mobashgr/BC5CDR-chem-WLT-384-BioELECTRA-Pubmed-ENS-20-5",
              "weight": 3.0,
              "entity": "p30 ey001765"
            }
          ]
        },
        {
          "label": "PRODUCT",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "P30 EY001765"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/MESH/C106367",
      "ontology_label": "cytochrome p30",
      "ontology": "MESH",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.6,
      "remarks": "Partially identified as a grant number, though context could be improved for clarity.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "ax",
      "label": "Disease_disorder",
      "start": 502,
      "end": 504,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 81,
          "end": 83,
          "global_start": 502,
          "global_end": 504,
          "sentence": "]\n\nSignificance\n['Homotypic collateral sprouting -the process by which uninjured axons from the same neuronal source extend new branches to reinnervate targets deprived of their original connections-is a fundamental yet understudied mechanism for CNS repair following injury."
        }
      ],
      "provenance": [
        {
          "label": "Disease_disorder",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "ax"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/HCPCS/AX",
      "ontology_label": "Item furnished in conjunction with dialysis services",
      "ontology": "HCPCS",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Well-defined as a disease/disorder, clearly supported by context.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "source",
      "label": "Detailed_description",
      "start": 531,
      "end": 537,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 110,
          "end": 116,
          "global_start": 531,
          "global_end": 537,
          "sentence": "]\n\nSignificance\n['Homotypic collateral sprouting -the process by which uninjured axons from the same neuronal source extend new branches to reinnervate targets deprived of their original connections-is a fundamental yet understudied mechanism for CNS repair following injury."
        }
      ],
      "provenance": [
        {
          "label": "Detailed_description",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "source"
            }
          ]
        }
      ],
      "ontology_id": "http://gbol.life/0.1/Source",
      "ontology_label": "Source",
      "ontology": "GBOL",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Successfully identified as a detailed description with strong contextual links.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "branches",
      "label": "Detailed_description",
      "start": 549,
      "end": 557,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 128,
          "end": 136,
          "global_start": 549,
          "global_end": 557,
          "sentence": "]\n\nSignificance\n['Homotypic collateral sprouting -the process by which uninjured axons from the same neuronal source extend new branches to reinnervate targets deprived of their original connections-is a fundamental yet understudied mechanism for CNS repair following injury."
        }
      ],
      "provenance": [
        {
          "label": "Detailed_description",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "branches"
            }
          ]
        }
      ],
      "ontology_id": "http://www.semanticweb.org/Terrorism#Branches",
      "ontology_label": "Branches",
      "ontology": "INTO",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Correctly identified with relevant context supporting its description.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "CNS",
      "label": "ORGANIZATION",
      "start": 668,
      "end": 671,
      "weighted_score": 0.796,
      "model_count": 2,
      "occurrences": [
        {
          "start": 247,
          "end": 250,
          "global_start": 668,
          "global_end": 671,
          "sentence": "]\n\nSignificance\n['Homotypic collateral sprouting -the process by which uninjured axons from the same neuronal source extend new branches to reinnervate targets deprived of their original connections-is a fundamental yet understudied mechanism for CNS repair following injury."
        }
      ],
      "provenance": [
        {
          "label": "ORGANIZATION",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "CNS"
            }
          ]
        },
        {
          "label": "ORG",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "CNS"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.jp/bio/4/id/200906047768068685",
      "ontology_label": "CNS",
      "ontology": "IOBC",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.7,
      "remarks": "Identified as an organization with relevant associations, though clarity of context could improve.",
      "label_ontology_id": "http://purl.obolibrary.org/obo/OBI_0000245",
      "label_ontology_label": "organization",
      "label_ontology": "CCONT",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "repair",
      "label": "Sign_symptom",
      "start": 672,
      "end": 678,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 251,
          "end": 257,
          "global_start": 672,
          "global_end": 678,
          "sentence": "]\n\nSignificance\n['Homotypic collateral sprouting -the process by which uninjured axons from the same neuronal source extend new branches to reinnervate targets deprived of their original connections-is a fundamental yet understudied mechanism for CNS repair following injury."
        }
      ],
      "provenance": [
        {
          "label": "Sign_symptom",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "repair"
            }
          ]
        }
      ],
      "ontology_id": "http://www.projecthalo.com/aura#Repair",
      "ontology_label": "Repair",
      "ontology": "AURA",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Well-defined and contextualized as a sign/symptom of a process.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "injury",
      "label": "Disease_disorder",
      "start": 689,
      "end": 695,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 268,
          "end": 274,
          "global_start": 689,
          "global_end": 695,
          "sentence": "]\n\nSignificance\n['Homotypic collateral sprouting -the process by which uninjured axons from the same neuronal source extend new branches to reinnervate targets deprived of their original connections-is a fundamental yet understudied mechanism for CNS repair following injury."
        }
      ],
      "provenance": [
        {
          "label": "Disease_disorder",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "injury"
            }
          ]
        }
      ],
      "ontology_id": "http://www.icn.ch/icnp#Injury",
      "ontology_label": "Injury",
      "ontology": "ICNP",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Clearly defined as a disease/disorder, strongly supported by context.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "homo",
      "label": "Detailed_description",
      "start": 772,
      "end": 776,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 75,
          "end": 79,
          "global_start": 772,
          "global_end": 776,
          "sentence": "Unlike heterotypic sprouting, involving sprouting from unrelated pathways, homotypic sprouting offers potential to restore circuit architecture after partial lesions."
        }
      ],
      "provenance": [
        {
          "label": "Detailed_description",
          "vote_weight": 10.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "homo"
            },
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "homo"
            }
          ]
        }
      ],
      "ontology_id": "http://sbmi.uth.tmc.edu/ontology/ochv#53984",
      "ontology_label": "homo",
      "ontology": "OCHV",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Successfully identified as a detailed description with relevant supporting evidence.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "diffuse",
      "label": "Detailed_description",
      "start": 893,
      "end": 914,
      "weighted_score": 0.714,
      "model_count": 2,
      "occurrences": [
        {
          "start": 29,
          "end": 36,
          "global_start": 893,
          "global_end": 900,
          "sentence": "Here, we employed a model of diffuse axonal injury in the mouse visual system to examine this mechanism."
        }
      ],
      "provenance": [
        {
          "label": "Detailed_description",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "diffuse"
            }
          ]
        },
        {
          "label": "bio",
          "vote_weight": 2.0,
          "sources": [
            {
              "source_model": "mobashgr/NCBI-disease-WLT-256-SciBERT-13INS",
              "weight": 2.0,
              "entity": "diffuse axonal injury"
            }
          ]
        }
      ],
      "ontology_id": "http://www.co-ode.org/ontologies/galen#diffuse",
      "ontology_label": "diffuse",
      "ontology": "GALEN",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.8,
      "remarks": "Identified reasonably well as a detailed description but lacks full context clarity.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "mouse visual system",
      "label": "BRAIN_REGION",
      "start": 922,
      "end": 941,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 58,
          "end": 77,
          "global_start": 922,
          "global_end": 941,
          "sentence": "Here, we employed a model of diffuse axonal injury in the mouse visual system to examine this mechanism."
        }
      ],
      "provenance": [
        {
          "label": "BRAIN_REGION",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "mouse visual system"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.obolibrary.org/obo/PR_Q91V10",
      "ontology_label": "visual system homeobox 1 (mouse)",
      "ontology": "PR",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Clearly defined as a brain region and well-contextualized.",
      "label_ontology_id": "http://www.semanticweb.org/rjyy/ontologies/2015/5/ESSO#Brain_Region",
      "label_ontology_label": "Brain_Region",
      "label_ontology": "ESSO",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "retinal ganglion cell",
      "label": "CELL_TYPE",
      "start": 1005,
      "end": 1026,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 36,
          "end": 57,
          "global_start": 1005,
          "global_end": 1026,
          "sentence": "Our research demonstrates surviving retinal ganglion cell axons can re-establish terminal fields, achieving structural and functional connectivity."
        }
      ],
      "provenance": [
        {
          "label": "CELL_TYPE",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "retinal ganglion cell"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.obolibrary.org/obo/TAO_0009310",
      "ontology_label": "retinal ganglion cell",
      "ontology": "TAO",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Accurately identified as a cell type with good contextual description.",
      "label_ontology_id": "http://www.ebi.ac.uk/efo/EFO_0000324",
      "label_ontology_label": "cell type",
      "label_ontology": "EFO",
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "female",
      "label": "Detailed_description",
      "start": 1173,
      "end": 1184,
      "weighted_score": 0.562,
      "model_count": 2,
      "occurrences": [
        {
          "start": 56,
          "end": 62,
          "global_start": 1173,
          "global_end": 1179,
          "sentence": "Importantly, we discovered significant sex differences: female mice showed delayedincomplete recovery compared to males."
        }
      ],
      "provenance": [
        {
          "label": "Detailed_description",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "female"
            }
          ]
        },
        {
          "label": "PERSON",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "female mice"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/PMR.owl#Female",
      "ontology_label": "Female",
      "ontology": "PMR",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.7,
      "remarks": "Identified as a detailed description; however, more specificity about the context could benefit the entry.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "delayed",
      "label": "Sign_symptom",
      "start": 1192,
      "end": 1199,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 75,
          "end": 82,
          "global_start": 1192,
          "global_end": 1199,
          "sentence": "Importantly, we discovered significant sex differences: female mice showed delayedincomplete recovery compared to males."
        }
      ],
      "provenance": [
        {
          "label": "Sign_symptom",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "delayed"
            }
          ]
        }
      ],
      "ontology_id": "http://www.icn.ch/icnp#Delayed",
      "ontology_label": "Delayed",
      "ontology": "ICNP",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Well-defined as a sign/symptom with clear contextual backing.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "recovery",
      "label": "Sign_symptom",
      "start": 1210,
      "end": 1218,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 93,
          "end": 101,
          "global_start": 1210,
          "global_end": 1218,
          "sentence": "Importantly, we discovered significant sex differences: female mice showed delayedincomplete recovery compared to males."
        }
      ],
      "provenance": [
        {
          "label": "Sign_symptom",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "recovery"
            }
          ]
        }
      ],
      "ontology_id": "http://sbmi.uth.tmc.edu/ontology/ochv#C0237820",
      "ontology_label": "recovery",
      "ontology": "OCHV",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Strongly identified as a sign/symptom with appropriate context.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "brain",
      "label": "Biological_structure",
      "start": 1283,
      "end": 1288,
      "weighted_score": 1.0,
      "model_count": 1,
      "occurrences": [
        {
          "start": 45,
          "end": 50,
          "global_start": 1283,
          "global_end": 1288,
          "sentence": "These findings provide evidence of repair of brain circuits perturbed by TBI and the role of homotypic sprouting."
        }
      ],
      "provenance": [
        {
          "label": "Biological_structure",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "brain"
            }
          ]
        }
      ],
      "ontology_id": "http://www.icn.ch/icnp#Brain",
      "ontology_label": "Brain",
      "ontology": "ICNP",
      "concept_mapping_provenance": "tool",
      "judge_score": 1.0,
      "remarks": "Clearly defined as a biological structure, well-supported by contextual references.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    },
    {
      "entity": "tb",
      "label": "Disease_disorder",
      "start": 1311,
      "end": 1314,
      "weighted_score": 0.505,
      "model_count": 3,
      "occurrences": [
        {
          "start": 73,
          "end": 76,
          "global_start": 1311,
          "global_end": 1314,
          "sentence": "These findings provide evidence of repair of brain circuits perturbed by TBI and the role of homotypic sprouting."
        }
      ],
      "provenance": [
        {
          "label": "Disease_disorder",
          "vote_weight": 5.0,
          "sources": [
            {
              "source_model": "d4data/biomedical-ner-all",
              "weight": 5.0,
              "entity": "tb"
            }
          ]
        },
        {
          "label": "DISEASE",
          "vote_weight": 3.9,
          "sources": [
            {
              "source_model": "llm_ner",
              "weight": 3.9,
              "entity": "TBI"
            }
          ]
        },
        {
          "label": "ORG",
          "vote_weight": 1.0,
          "sources": [
            {
              "source_model": "en_core_web_sm",
              "weight": 1.0,
              "entity": "TBI"
            }
          ]
        }
      ],
      "ontology_id": "http://purl.bioontology.org/ontology/HCPCS/TB",
      "ontology_label": "Drug or biological acquired with 340b drug pricing program discount, reported for informational purposes for select entities",
      "ontology": "HCPCS",
      "concept_mapping_provenance": "tool",
      "judge_score": 0.7,
      "remarks": "Identified as a disease/disorder, but clarification on context could improve understanding.",
      "label_ontology_id": null,
      "label_ontology_label": null,
      "label_ontology": null,
      "label_concept_mapping_provenance": "tool"
    }
  ],
  "key_terms": [],
  "metadata": {
    "total_chunks": 1,
    "chunks_with_entities": 1,
    "entities_before_merge": 41,
    "entities_after_merge": 25,
    "verification": {
      "entities_present": 25,
      "entities_dropped": 0,
      "entities_dropped_detail": [],
      "key_terms_present": 0,
      "key_terms_dropped": 0,
      "all_entities_present_in_text": true,
      "all_key_terms_present_in_text": true
    }
  },
  "verification": {
    "entities_present": 25,
    "entities_dropped": 0,
    "entities_dropped_detail": [],
    "key_terms_present": 0,
    "key_terms_dropped": 0,
    "all_entities_present_in_text": true,
    "all_key_terms_present_in_text": true
  },
  "errors": [],
  "task_type": "ner",
  "elapsed_time": 558.1336400508881
}
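
A note on reading one entry: in the sample above, `weighted_score` looks consistent with the winning label's `vote_weight` divided by the sum of all `vote_weight` values for that entity (for "CNS", 3.9 / (3.9 + 1.0) ≈ 0.796). This is inferred from the numbers in this output only, not from the implementation, so treat the sketch below as an assumption. The helper name `winning_share` is hypothetical and not part of StructSense.

```python
# Provenance groups copied from the "CNS" entry in the sample output above
# (only the fields needed for the calculation are kept).
cns_provenance = [
    {"label": "ORGANIZATION", "vote_weight": 3.9},
    {"label": "ORG", "vote_weight": 1.0},
]


def winning_share(provenance):
    """Return (label, share) for the label group with the largest vote_weight.

    Assumption (inferred from the sample output, not from the code base):
    weighted_score = top vote_weight / sum of all vote_weights, rounded to 3 places.
    """
    total = sum(group["vote_weight"] for group in provenance)
    top = max(provenance, key=lambda group: group["vote_weight"])
    return top["label"], round(top["vote_weight"] / total, 3)


label, score = winning_share(cns_provenance)
print(label, score)  # ORGANIZATION 0.796
```

The same arithmetic reproduces the other multi-model entries in this output: "diffuse" (5.0 / 7.0 ≈ 0.714), "female" (5.0 / 8.9 ≈ 0.562), and "tb" (5.0 / 9.9 ≈ 0.505).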

@djarecka
Contributor

djarecka commented Feb 20, 2026

@djarecka what you ran and the output you got are okay. My only point is that the output is not exhaustive because test_small.pdf might not contain all the information we wanted to extract.

But were you planning to add the PDF that you used in the tutorial to the repo?
(By the way, at first I thought the tutorial ran NER, which is why I decided to run it on another publication.)

Regarding the config file, what do you mean by how I came up with it? It follows the crew.ai standard, and it's the same old config files, just updated with the now-unnecessary parts removed.

You added a tutorial to the repo, which is great, but I'm assuming that you want to be able to point people to this tutorial to learn how to use it, even if they haven't used crew.ai before, am I right?

You also made some additional decisions about the format of the output, etc. I think an explanation of one entry, even a short one, would be very useful. I could expand the text later, but I would like to have something to start from.

Right now the tutorial has no description of even the general task you want to accomplish.

By the way, I edited your comment and added <details>; it is worth using when pasting the content of long files.

@tekrajchhetri
Collaborator Author

@djarecka for the output structure, you would need to check the post-processing Python file inside utils. For example, for the NER metadata, see

def merge_ner_results(
.
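
As a quick way to sanity-check an entry without reading the post-processing code, the sketch below verifies the offsets of one occurrence from the NER output earlier in this thread. It assumes that `start`/`end` index into `sentence` and that `global_start`/`global_end` are the same span shifted by the sentence's position in the document; that reading is inferred from the sample output, not confirmed against `merge_ner_results`. The helper name `check_occurrence` is hypothetical.

```python
# Occurrence fields copied from the "mouse visual system" entry in the sample output.
occurrence = {
    "start": 58,
    "end": 77,
    "global_start": 922,
    "global_end": 941,
    "sentence": (
        "Here, we employed a model of diffuse axonal injury in the "
        "mouse visual system to examine this mechanism."
    ),
}


def check_occurrence(entity, occ):
    """Validate one occurrence and return the sentence's inferred document offset.

    Assumption: local offsets index into `sentence`, and global offsets are
    the local offsets plus a constant per-sentence shift.
    """
    local_text = occ["sentence"][occ["start"]:occ["end"]]
    if local_text != entity:
        raise ValueError(f"local span {local_text!r} does not match {entity!r}")
    offset = occ["global_start"] - occ["start"]
    if occ["global_end"] - occ["end"] != offset:
        raise ValueError("global and local spans have different lengths")
    return offset


sentence_offset = check_occurrence("mouse visual system", occurrence)
print(sentence_offset)  # 864
```

The exact string comparison is deliberately strict. It would reject the "tb" entry above, whose local span covers "TBI" in the sentence, so a check like this can also flag entries where the merged entity text and the matched span disagree.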

NER publications:

  • Langdon, C., Engel, T.A. Latent circuit inference from heterogeneous neural responses during cognitive tasks. Nat Neurosci 28, 665–675 (2025).
  • Hansen, J.Y., Cauzzo, S., Singh, K. et al. Integrating brainstem and cortical functional architectures. Nat Neurosci 27, 2500–2511 (2024).

Resource Extraction:

  • Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2022). ViTPose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35, 38571-38584.
  • Lauer, J., Zhou, M., Ye, S. et al. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat Methods 19, 496–504 (2022). https://doi.org/10.1038/s41592-022-01443-0

Pdf2Reproschema:

@tekrajchhetri
Collaborator Author

WIP: search and reranking for ontology alignment.
