Welcome to nessie’s documentation!


nessie is a package for annotation error detection. It can be used to automatically detect errors in annotated corpora so that human annotators can concentrate on a subset to correct, instead of needing to look at each and every instance.

💡 Please also refer to our additional documentation! It contains detailed explanations and code examples!

Contact person: Jan-Christoph Klie
https://www.ukp.tu-darmstadt.de
https://www.tu-darmstadt.de

Don’t hesitate to report an issue if something is broken (and it shouldn’t be) or if you have further questions.

⚠️ This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Please use the following citation when using our software:

@misc{https://doi.org/10.48550/arxiv.2206.02280,
  doi = {10.48550/ARXIV.2206.02280},
  url = {https://arxiv.org/abs/2206.02280},
  author = {Klie, Jan-Christoph and Webber, Bonnie and Gurevych, Iryna},
  title = {Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future},
  publisher = {arXiv},
  year = {2022}
}

Installation

pip install nessie

This installs the package with default dependencies and a CPU-only build of PyTorch. If you want to use your own PyTorch version (e.g., with CUDA enabled), you need to install it manually afterwards; the same holds for faiss-gpu if you need it.
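For example, a GPU-enabled setup could look like the following sketch; the CUDA tag and the faiss-gpu package name are illustrative, so adjust them to your system:

pip install nessie
# pick the index URL matching your CUDA version (cu118 here is just an example)
pip install torch --index-url https://download.pytorch.org/whl/cu118
# only needed if you want GPU-accelerated nearest-neighbor search
pip install faiss-gpu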

Basic Usage

Given annotated data, this package can be used to find potential errors. For instance, Retag trains a model, lets it predict on your data, and flags instances where the model's predictions disagree with the given labels. This can be done as follows:

from nessie.dataloader import load_example_text_classification_data
from nessie.helper import CrossValidationHelper
from nessie.models.text import DummyTextClassifier
from nessie.detectors import Retag

# Load the bundled example dataset and use a 100-instance subset for a quick demo
text_data = load_example_text_classification_data().subset(100)

# Obtain out-of-sample predictions for every instance via 10-fold cross-validation
cv = CrossValidationHelper(n_splits=10)
tc_result = cv.run(text_data.texts, text_data.noisy_labels, DummyTextClassifier())

# Retag flags instances where the model predictions disagree with the given labels
detector = Retag()

flags = detector.score(text_data.noisy_labels, tc_result.predictions)
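The flags line up with the input instances, so suspicious items can be inspected directly. A minimal sketch, assuming flags behaves like a boolean sequence as described above:

# Print the instances that Retag flagged as potential annotation errors
for text, label, flagged in zip(text_data.texts, text_data.noisy_labels, flags):
    if flagged:
        print(f"Check: {text!r} labeled as {label}")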

💡 Please also refer to our additional documentation! It contains detailed explanations and code examples!

Methods

We implement a wide range of annotation error detection methods. These fall into two categories: flaggers and scorers. Flaggers give a binary judgment of whether an instance is considered wrong; scorers give an estimate of how likely it is that an instance is wrong. A short usage sketch contrasting the two follows the tables below.

Flagger

| Abbreviation | Method | Proposed by |
|--------------|--------|-------------|
| CL | Confident Learning | Northcutt (2021) |
| CS | Curriculum Spotter | Amiri (2018) |
| DE | Diverse Ensemble | Loftsson (2009) |
| IRT | Item Response Theory | Rodriguez (2021) |
| LA | Label Aggregation | Amiri (2018) |
| LS | Leitner Spotter | Amiri (2018) |
| PE | Projection Ensemble | Reiss (2020) |
| RE | Retag | van Halteren (2000) |
| VN | Variation n-Grams | Dickinson (2003) |

Scorer

| Abbreviation | Method | Proposed by |
|--------------|--------|-------------|
| BC | Borda Count | Larson (2020) |
| CU | Classification Uncertainty | Hendrycks (2017) |
| DM | Data Map Confidence | Swayamdipta (2020) |
| DU | Dropout Uncertainty | Amiri (2018) |
| KNN | k-Nearest Neighbor Entropy | Grivas (2020) |
| LE | Label Entropy | Hollenstein (2016) |
| MD | Mean Distance | Larson (2019) |
| PM | Prediction Margin | Dligach (2011) |
| WD | Weighted Discrepancy | Hollenstein (2016) |
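To make the distinction concrete, here is a minimal sketch that reuses text_data and tc_result from the Basic Usage example above. ClassificationUncertainty serves as an example scorer; its import path and score() arguments, as well as the tc_result.probabilities attribute, are assumptions here, so consult the API documentation for the exact interface.

from nessie.detectors import Retag, ClassificationUncertainty  # ClassificationUncertainty import path is an assumption

# Flagger: one boolean per instance, True if the instance is considered mislabeled
flags = Retag().score(text_data.noisy_labels, tc_result.predictions)

# Scorer: one number per instance, higher meaning more likely to be an annotation error
# (the probabilities attribute and argument order are assumptions, see the API docs)
scores = ClassificationUncertainty().score(text_data.noisy_labels, tc_result.probabilities)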

Models

Model-based annotation error detection methods need trained models to obtain predictions or probabilities. We already implement the most common models so they are ready to use. You can add your own models by implementing the respective abstract class, TextClassifier or SequenceTagger. We provide the following models:

Text classification

| Class name | Description |
|------------|-------------|
| FastTextTextClassifier | fastText |
| FlairTextClassifier | Flair |
| LgbmTextClassifier | LightGBM with handcrafted features |
| LgbmTextClassifier | LightGBM with S-BERT features |
| MaxEntTextClassifier | Logistic Regression with handcrafted features |
| MaxEntTextClassifier | Logistic Regression with S-BERT features |
| TransformerTextClassifier | Transformers |

You can easily add your own scikit-learn classifiers by subclassing SklearnTextClassifier, for example:

from sklearn.linear_model import LogisticRegression

# SklearnTextClassifier, SentenceEmbedder and RANDOM_STATE are provided by nessie
class MaxEntTextClassifier(SklearnTextClassifier):
    def __init__(self, embedder: SentenceEmbedder, max_iter=10000):
        super().__init__(lambda: LogisticRegression(max_iter=max_iter, random_state=RANDOM_STATE), embedder)
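A short sketch of plugging such a classifier into the cross-validation helper from the Basic Usage example; my_embedder stands in for whichever SentenceEmbedder implementation you use:

classifier = MaxEntTextClassifier(embedder=my_embedder)  # my_embedder: any SentenceEmbedder instance
result = cv.run(text_data.texts, text_data.noisy_labels, classifier)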

Sequence tagging

| Class name | Description |
|------------|-------------|
| FlairSequenceTagger | Flair |
| CrfSequenceTagger | CRF with handcrafted features |
| MaxEntSequenceTagger | Maxent sequence tagger |
| TransformerSequenceTagger | Transformer |

Development

We use flit for dependency management and packaging. Follow its documentation to install it. Then you can run

flit install -s

to install the package and its dependencies; the -s flag installs nessie as a symlink, so local code changes take effect immediately. To install your own PyTorch build with CUDA support, you can run

make force-cuda113

or install it manually in the resulting environment. You can format the code via

make format

which should be run before every commit.

Bibliography

Amiri, Hadi, Timothy Miller, and Guergana Savova. 2018. “Spotting Spurious Data with Neural Networks.” Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2006-16. New Orleans, Louisiana.

Dligach, Dmitriy, and Martha Palmer. 2011. “Reducing the Need for Double Annotation.” Proceedings of the 5th Linguistic Annotation Workshop, 65-73. Portland, Oregon, USA.

Grivas, Andreas, Beatrice Alex, Claire Grover, Richard Tobin, and William Whiteley. 2020. “Not a Cute Stroke: Analysis of Rule- and Neural Network-based Information Extraction Systems for Brain Radiology Reports.” Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 24-37. Online.

Hendrycks, Dan, and Kevin Gimpel. 2017. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” Proceedings of International Conference on Learning Representations, 1-12.

Hollenstein, Nora, Nathan Schneider, and Bonnie Webber. 2016. “Inconsistency Detection in Semantic Annotation.” Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 3986-90. Portorož, Slovenia.

Larson, Stefan, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, and Jason Mars. 2019. “Outlier Detection for Improved Data Quality and Diversity in Dialog Systems.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 517-27. Minneapolis, Minnesota.

Loftsson, Hrafn. 2009. “Correcting a POS-Tagged Corpus Using Three Complementary Methods.” Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), 523-31. Athens, Greece.

Northcutt, Curtis, Lu Jiang, and Isaac Chuang. 2021. “Confident Learning: Estimating Uncertainty in Dataset Labels.” Journal of Artificial Intelligence Research 70 (April): 1373-1411.

Reiss, Frederick, Hong Xu, Bryan Cutler, Karthik Muthuraman, and Zachary Eichenberger. 2020. “Identifying Incorrect Labels in the CoNLL-2003 Corpus.” Proceedings of the 24th Conference on Computational Natural Language Learning, 215-26. Online.

Rodriguez, Pedro, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jia, and Jordan Boyd-Graber. 2021. “Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards?” Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4486-4503. Online.

Swayamdipta, Swabha, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. “Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9275-93. Online.

van Halteren, Hans. 2000. “The Detection of Inconsistency in Manually Tagged Text.” Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora, 48-55. Luxembourg.