nessie.detectors.variational_principle

Module Contents

Classes

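VariationNGrams – Variation n-gram detector following “Detecting Inconsistencies in Treebanks” (Dickinson and Meurers, TLT 2003).

VariationNGramsSpan – Span-level variant, using the implementation described in “Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods” (Larson et al., COLING 2020).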

class nessie.detectors.variational_principle.VariationNGrams

Bases: nessie.detectors.error_detector.Detector

“Detecting Inconsistencies in Treebanks” by Markus Dickinson and Walt Detmar Meurers. Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003), Växjö, Sweden.

correct(self, sentences: nessie.types.RaggedStringArray, tags: nessie.types.RaggedStringArray) → awkward.Array
error_detector_kind(self) → nessie.detectors.error_detector.DetectorKind
score(self, sentences: nessie.types.RaggedStringArray, tags: nessie.types.RaggedStringArray, **kwargs) → awkward.Array

Collects n-grams together with their label sequences. When the same n-gram occurs with different label sequences, the occurrences that carry a minority labeling are flagged.

We use the implementation described in “Errator: a Tool to Help Detect Annotation Errors in the Universal Dependencies Project” by Guillaume Wisniewski (LREC 2018) and “How Bad are PoS Taggers in Cross-Corpora Settings? Evaluating Annotation Divergence in the UD Project” by Guillaume Wisniewski and François Yvon (NAACL 2018).

It uses generalized suffix trees to find token sequences that repeat across sentences. A repetition is flagged if its occurrences are labeled differently.

Parameters
  • sentences – a (num_instances, num_tokens) ragged string sequence containing the text/surface form of each instance

  • tags – a (num_instances, num_tokens) ragged string sequence containing the noisy label for each instance

Returns

a (num_instances, num_tokens) ragged boolean array containing the flag for each token of each instance
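
The minority-flagging idea described above can be sketched in plain Python. This is a simplified illustration, not nessie's implementation: it scans fixed-length n-grams instead of building generalized suffix trees, and the function name flag_variation_ngrams, the parameter n, and the example data are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_variation_ngrams(sentences, tags, n=3):
    """Flag tokens whose n-gram carries a minority label sequence.

    Assumption: a fixed n is used here for brevity; the referenced
    implementation finds repetitions with generalized suffix trees.
    """
    # Count, for every surface n-gram, the label sequences observed with it.
    label_counts = defaultdict(Counter)
    for tokens, labels in zip(sentences, tags):
        for i in range(len(tokens) - n + 1):
            label_counts[tuple(tokens[i:i + n])][tuple(labels[i:i + n])] += 1

    flags = [[False] * len(tokens) for tokens in sentences]
    for s, (tokens, labels) in enumerate(zip(sentences, tags)):
        for i in range(len(tokens) - n + 1):
            counts = label_counts[tuple(tokens[i:i + n])]
            if len(counts) < 2:
                continue  # this n-gram is always labeled the same way
            majority = counts.most_common(1)[0][0]
            observed = tuple(labels[i:i + n])
            if observed == majority:
                continue
            # Flag only the positions that deviate from the majority labeling.
            for offset, (got, expected) in enumerate(zip(observed, majority)):
                if got != expected:
                    flags[s][i + offset] = True
    return flags


example_sentences = [
    ["I", "like", "green", "tea"],
    ["We", "like", "green", "tea"],
    ["They", "like", "green", "tea"],
]
example_tags = [
    ["PRON", "VERB", "ADJ", "NOUN"],
    ["PRON", "VERB", "ADJ", "NOUN"],
    ["PRON", "VERB", "NOUN", "NOUN"],  # "green" is tagged differently here
]
example_flags = flag_variation_ngrams(example_sentences, example_tags, n=3)
print(example_flags)
```

In this toy input, the trigram "like green tea" occurs three times, twice as VERB ADJ NOUN and once as VERB NOUN NOUN, so only the token "green" in the third sentence is flagged. Ties between equally frequent labelings are not handled specially in this sketch.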

supports_correction(self) → bool
class nessie.detectors.variational_principle.VariationNGramsSpan(k: int = 1)

Bases: nessie.detectors.error_detector.Detector

We use the implementation described in “Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods” by Stefan Larson, Adrian Cheung, Anish Mahendran, Kevin Leach, and Jonathan K. Kummerfeld. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

correct(self, sentences: nessie.types.RaggedStringArray, tags: nessie.types.RaggedStringArray) → awkward.Array
error_detector_kind(self) → nessie.detectors.error_detector.DetectorKind
score(self, sentences: nessie.types.RaggedStringArray, tags: nessie.types.RaggedStringArray, **kwargs) → awkward.Array

Collects n-grams together with their label sequences. When the same n-gram occurs with different label sequences, the occurrences that carry a minority labeling are flagged.

We use the implementation described in “Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods” by Stefan Larson, Adrian Cheung, Anish Mahendran, Kevin Leach, and Jonathan K. Kummerfeld. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

It considers a window of k tokens to the left and right of each span. If two windows share the same surface form but the span is labeled differently, the span is flagged, unless it carries the majority label.

Parameters
  • sentences – a (num_instances, num_tokens) ragged string sequence containing the text/surface form of each instance

  • tags – a (num_instances, num_tokens) ragged string sequence containing the noisy label for each instance

Returns

a (num_instances, num_tokens) ragged boolean array containing the flag for each token of each instance
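
The window-based check described above can be illustrated with a small plain-Python sketch. It is not nessie's implementation and rests on several assumptions: tags are taken to be BIO-encoded (B-X, I-X, O), the comparison key is taken to be the span text plus k tokens on each side, and the helper names bio_spans and flag_span_variation are hypothetical.

```python
from collections import Counter, defaultdict


def bio_spans(labels):
    """Yield (start, end, kind) for each BIO span; end is exclusive."""
    start, kind = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and kind == label[2:]
        if start is not None and not inside:
            yield start, i, kind
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]


def flag_span_variation(sentences, tags, k=1):
    """Flag spans whose label is in the minority for their context window."""
    label_counts = defaultdict(Counter)
    occurrences = []
    for s, (tokens, labels) in enumerate(zip(sentences, tags)):
        for start, end, kind in bio_spans(labels):
            # Assumed key: k tokens left, the span itself, k tokens right.
            key = (
                tuple(tokens[max(0, start - k):start]),
                tuple(tokens[start:end]),
                tuple(tokens[end:end + k]),
            )
            label_counts[key][kind] += 1
            occurrences.append((s, start, end, kind, key))

    flags = [[False] * len(tokens) for tokens in sentences]
    for s, start, end, kind, key in occurrences:
        counts = label_counts[key]
        if len(counts) > 1 and kind != counts.most_common(1)[0][0]:
            for i in range(start, end):
                flags[s][i] = True
    return flags


span_sentences = [
    ["fly", "to", "new", "york", "today"],
    ["drive", "to", "new", "york", "today"],
    ["walk", "to", "new", "york", "today"],
]
span_tags = [
    ["O", "O", "B-CITY", "I-CITY", "O"],
    ["O", "O", "B-CITY", "I-CITY", "O"],
    ["O", "O", "B-STATE", "I-STATE", "O"],  # same window, different span label
]
span_flags = flag_span_variation(span_sentences, span_tags, k=1)
print(span_flags)
```

With k=1, all three spans share the window "to | new york | today"; CITY is the majority label, so only the STATE span in the third sentence is flagged. This sketch compares only spans that are annotated, so a span left entirely unlabeled in one sentence would not be caught.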

supports_correction(self) → bool