GOLEM-lab/GOLEMcoref


This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).


GOLEMcoref: a Multilingual Coreference Dataset of Fiction

This repository hosts the data, models, and evaluation results of the paper GOLEMcoref: a Multilingual Coreference Dataset of Fiction.

  • What is GOLEMcoref?

    • It is a gold standard benchmark for coreference resolution in 7 languages: Bahasa Indonesia, Chinese, Dutch, English, Italian, Korean, Spanish (--> data/gold_annotations).
    • It contains fictional short stories sourced from 3 popular fanfiction platforms: Archive of Our Own (AO3), Postype, and Wattpad.
    • It is the first dataset of its kind to offer multilingual coverage of fictional literature.
    • It includes complete works.
    • It is a gold standard: it is fully annotated and curated by humans following specialised guidelines (--> guidelines) and accompanied by a report discussing annotation challenges (--> report).
  • We trained neural coreference systems on our dataset:

    • We trained a separate model for each language, as well as a single crosslingual model on data from all languages.
    • Consistent with previous work, we observe that the multilingually trained model improves over the monolingually trained models (--> results).

Repository Structure

The schema below provides a map of this repository:

├── README.md
│
├── data/
│   ├── gold_annotations/
│   │   ├── chinese/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── dutch/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── english/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── indonesian/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── italian/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── korean/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   └── spanish/
│   │       ├── conll/
│   │       └── conllu/
│   └── splits/
│       └── splits.csv
│
├── guidelines/
│       └── GOLEMcoref_Character Coreference Annotation Guidelines.pdf
│
├── report/
│       └── Coref_Annotation_Challenges.pdf
│
├── scripts/
│       └── makesplit.py
│
└── results/
    ├── evalreport.txt
    ├── monolingual_models/
    │       ├── chinese/
    │       ├── dutch/
    │       ├── dutch_openboek/
    │       ├── english/
    │       ├── english_litbankp/
    │       ├── indonesian/
    │       ├── italian/
    │       ├── korean/
    │       └── spanish/
    └── single_crosslingual_model/
            ├── chinese/
            ├── dutch/
            ├── dutch_openboek/
            ├── english/
            ├── english_litbankp/
            ├── indonesian/
            ├── italian/
            ├── korean/
            └── spanish/
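
As a quick illustration of the layout above, the annotation folders can be addressed programmatically. The snippet below is a minimal sketch, not part of the repository: the helper `annotation_dir` and the constants `LANGUAGES` and `FORMATS` are hypothetical names introduced here, and the folder names are copied from the schema.

```python
from pathlib import Path

# Folder names taken from the repository schema above.
LANGUAGES = ("chinese", "dutch", "english", "indonesian",
             "italian", "korean", "spanish")
FORMATS = ("conll", "conllu")


def annotation_dir(language: str, fmt: str, root: str = ".") -> Path:
    """Return the folder holding gold annotations for one language and format.

    `root` is assumed to be the path of a local clone of this repository.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return Path(root) / "data" / "gold_annotations" / language / fmt


if __name__ == "__main__":
    # List every annotation folder relative to the repository root.
    for language in LANGUAGES:
        for fmt in FORMATS:
            print(annotation_dir(language, fmt).as_posix())
```

The helper only builds paths; it does not check that the folders exist on disk.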

Data

  • GOLEMcoref is available in the data folder of this repository:
    • We release our gold standard benchmark in data/gold_annotations:

      • Annotated data for each language is stored in a dedicated folder (for example, data/gold_annotations/chinese).
      • For each language, we provide:
        • data in CoNLL-2012 format (stored in the conll subfolder, for example data/gold_annotations/chinese/conll):
          • The files are divided into train, dev and test splits.
          • Zero anaphora are included as tokens.
        • data in CorefUD format (stored in the conllu subfolder, for example data/gold_annotations/chinese/conllu):
          • The files are divided into train, dev and test splits.
          • They include zero anaphora and split antecedents (inclusion relation), as well as POS tags and dependencies produced by Stanza.
    • The stories included in each split (train, dev and test) are listed in data/splits/splits.csv.

    • The script used to create the train, dev and test splits is available at scripts/makesplit.py.
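
For readers who want to inspect the CoNLL-2012 files, the sketch below shows one way to collect coreference clusters from them. It rests on assumptions rather than on the exact layout of the files in this repository: it assumes the usual CoNLL-2012 convention that the last whitespace-separated column holds the coreference annotation (`(3` opens a mention of cluster 3, `3)` closes it, `(3)` is a one-token mention, `|` separates several annotations, and `-` means none). The function name `read_clusters` and the sample lines are made up for illustration, and the sample uses fewer columns than real CoNLL-2012 files.

```python
from collections import defaultdict


def read_clusters(lines):
    """Map each cluster id to a list of (start, end) token spans.

    Token indices are inclusive and count every token line in the input,
    so tokens that stand for zero anaphora are counted like any other token.
    """
    clusters = defaultdict(list)
    open_starts = defaultdict(list)  # cluster id -> stack of open start indices
    token_index = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and document markers
        coref = line.split()[-1]  # assumed: coreference is the last column
        if coref not in ("-", "_"):
            for part in coref.split("|"):
                cluster_id = part.strip("()")
                if part.startswith("("):
                    open_starts[cluster_id].append(token_index)
                if part.endswith(")"):
                    start = open_starts[cluster_id].pop()
                    clusters[cluster_id].append((start, token_index))
        token_index += 1
    return dict(clusters)


if __name__ == "__main__":
    # Made-up sample with a reduced column set, for illustration only.
    sample = [
        "#begin document (demo); part 000",
        "demo 0 0 Anna (0)",
        "demo 0 1 met -",
        "demo 0 2 her (0)|(1",
        "demo 0 3 brother 1)",
        "",
        "demo 0 0 He (1)",
        "#end document",
    ]
    print(read_clusters(sample))
```

Cluster ids are kept as strings so the sketch does not depend on them being numeric. To read a real file, pass an open file handle (for example from one of the conll subfolders) instead of the sample list.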

Guidelines

  • The guidelines used in the annotation campaign that led to the creation of GOLEMcoref are available in the guidelines folder.

Models

The best-performing model, the crosslingual fast-coref model, is made available as a release in this GitHub repository. To see how to apply the model to your own texts, refer to the notebook on Google Colab or to the copy of the notebook in this repository.

Report

  • The report discusses the challenges encountered by the annotators and curators, both in relation to fiction and to language specificities.

Scripts

Results

We release the output of our models in the results folder.
