GOLEMcoref: a Multilingual Coreference Dataset of Fiction

This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International

This repository hosts the data, models, and evaluation results of the paper GOLEMcoref: a Multilingual Coreference Dataset of Fiction.

What is GOLEMcoref?
- It is a gold standard benchmark for coreference resolution in 7 languages: Bahasa Indonesia, Chinese, Dutch, English, Italian, Korean, and Spanish (--> data/gold_annotations).
- It contains fictional short stories sourced from 3 popular fanfiction platforms: Archive of Our Own (AO3), Postype, and Wattpad.
- It is the first of its kind offering multilingual coverage for fictional literature.
- It includes complete works.
- It is a gold standard: it is fully annotated and curated by humans following specialised guidelines (--> guidelines) and accompanied by a report discussing annotation challenges (--> report).
We trained neural coreference systems on our dataset:
- We trained separate models for each language, as well as crosslingual models on the data from all languages.
- Consistent with previous work, the multilingually trained model improves over the monolingually trained models (--> results).

Repository Structure

The schema below provides a map of this repository:

├── README.md
│
├── data/
│   ├── gold_annotations/
│   │   ├── chinese/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── dutch/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── english/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── indonesian/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── italian/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   ├── korean/
│   │   │   ├── conll/
│   │   │   └── conllu/
│   │   └── spanish/
│   │       ├── conll/
│   │       └── conllu/
│   └── splits/
│       └── splits.csv
│
├── guidelines/
│   └── GOLEMcoref_Character Coreference Annotation Guidelines.pdf
│
├── report/
│   └── Coref_Annotation_Challenges.pdf
│
├── scripts/
│   └── makesplit.py
│
└── results/
    ├── evalreport.txt
    ├── monolingual_models/
    │   ├── chinese/
    │   ├── dutch/
    │   ├── dutch_openboek/
    │   ├── english/
    │   ├── english_litbankp/
    │   ├── indonesian/
    │   ├── italian/
    │   ├── korean/
    │   └── spanish/
    └── single_crosslingual_model/
        ├── chinese/
        ├── dutch/
        ├── dutch_openboek/
        ├── english/
        ├── english_litbankp/
        ├── indonesian/
        ├── italian/
        ├── korean/
        └── spanish/
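
The layout above can be checked programmatically after cloning. The following is a minimal sketch, not part of the repository: the helper names `annotation_dirs` and `missing_dirs` are hypothetical, and the language and format folder names are copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the repository map above.
LANGUAGES = ["chinese", "dutch", "english", "indonesian",
             "italian", "korean", "spanish"]
FORMATS = ["conll", "conllu"]


def annotation_dirs(repo_root="."):
    """Map each (language, format) pair to its expected annotation folder."""
    root = Path(repo_root) / "data" / "gold_annotations"
    return {(lang, fmt): root / lang / fmt
            for lang in LANGUAGES
            for fmt in FORMATS}


def missing_dirs(repo_root="."):
    """Return the expected annotation folders that do not exist on disk."""
    return sorted(str(path)
                  for path in annotation_dirs(repo_root).values()
                  if not path.is_dir())


if __name__ == "__main__":
    # Run from the root of a local clone; an empty list means that all
    # 14 language/format folders are present.
    print(missing_dirs("."))
```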

Data

GOLEMcoref is available in the data folder of this repository:
- We release our gold standard benchmark in data/gold_annotations:
  - The annotated data for each language are stored in a dedicated folder (for example, data/gold_annotations/chinese).
  - For each language, we provide:
    - data in CoNLL-2012 format (stored in the conll subfolder, for example data/gold_annotations/chinese/conll):
      - They are divided into train, dev, and test splits.
      - Zero anaphora are included as tokens.
    - data in CorefUD format (stored in the conllu subfolder, for example data/gold_annotations/chinese/conllu):
      - They are divided into train, dev, and test splits.
      - They include zero anaphora and split antecedents (inclusion relation), as well as POS tags and dependencies from Stanza.
- The stories included in each split (train, dev, and test) are listed in data/splits/splits.csv.
- The script used to create the train, dev, and test splits is available at scripts/makesplit.py.
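
As an illustration of how the CoNLL-2012 files can be consumed, the sketch below turns the coreference column into clusters of token spans. It assumes the usual CoNLL-2012 conventions (coreference label in the last whitespace-separated column, `#begin document` / `#end document` comment lines, blank lines between sentences); the function names are hypothetical, and the exact column layout of the GOLEMcoref files should be checked against the data before relying on it.

```python
import re
from collections import defaultdict


def read_coref_labels(lines):
    """Collect the last column of every token line of one document.

    Assumption: comment lines start with '#' and sentences are separated
    by blank lines, as in standard CoNLL-2012 files.
    """
    labels = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        labels.append(line.split()[-1])
    return labels


def parse_coref_column(labels):
    """Convert labels such as '(3', '3)', '(3)', '(1|(0)' or '-' into
    a dict mapping cluster id -> list of (start, end) token indices,
    with both indices inclusive."""
    open_starts = defaultdict(list)  # cluster id -> stack of open starts
    clusters = defaultdict(list)
    for i, label in enumerate(labels):
        if label in ("-", "_"):
            continue
        for part in label.split("|"):
            match = re.fullmatch(r"(\()?(\d+)(\))?", part)
            if match is None or not (match.group(1) or match.group(3)):
                raise ValueError(f"unexpected coreference label: {part!r}")
            cluster_id = int(match.group(2))
            if match.group(1):  # mention opens at this token
                open_starts[cluster_id].append(i)
            if match.group(3):  # mention closes at this token
                if not open_starts[cluster_id]:
                    raise ValueError(f"unmatched closing label at token {i}")
                clusters[cluster_id].append((open_starts[cluster_id].pop(), i))
    return dict(clusters)
```

For example, the labels `["(0)", "-", "(1|(0)", "1)", "-"]` describe two single-token mentions of entity 0 (tokens 0 and 2) and one two-token mention of entity 1 (tokens 2 to 3). Token indices here are counted over the whole document rather than per sentence.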

Guidelines

The guidelines used in the annotation campaign that led to the creation of GOLEMcoref are available in the guidelines folder.

Models

The best-performing model, the crosslingual fast-coref model, is available as a release in this GitHub repository. To see how to apply the model to your own texts, refer to the notebook on Google Colab or to the copy of the notebook in this repository.

Report

The report discusses the challenges encountered by the annotators and curators, both in relation to fiction and to language specificities.

Scripts

We release the scripts we used in our experiments:
- The script used to create the train, dev, and test splits is available at scripts/makesplit.py.
Some modifications were made to the coreference systems. These are available at https://github.com/andreasvc/fast-coref and https://github.com/andreasvc/xcore

Results

We release the output of our models in the results folder.
