nessie.detectors.mean_distance

Module Contents

Classes

MeanDistance

Outlier detector that scores each instance by its distance to the mean vector of its label (Larson et al., 2019).

class nessie.detectors.mean_distance.MeanDistance(metric: str = 'euclidean')

Bases: nessie.detectors.error_detector.Detector

Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, Jason Mars. Outlier Detection for Improved Data Quality and Diversity in Dialog Systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 517–527.

https://www.aclweb.org/anthology/N19-1051.pdf

error_detector_kind(self) → nessie.detectors.error_detector.DetectorKind
score(self, labels: nessie.types.StringArray, embedded_instances: nessie.types.FloatArray2D, **kwargs) → numpy.typing.NDArray[float]

Compute the mean vector of the instances for each label. The outlier score of an instance is its distance to the mean vector of its label.

For each class:

  1. Generate a vector representation of each instance.

  2. Average vectors to get a mean representation.

  3. Calculate the distance of each instance from the mean.

  4. Rank by distance in ascending order.

  5. (Cut off the list, keeping only the top k% as outliers.)
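Steps 1 to 3 can be sketched with plain NumPy. This is a minimal illustration, not the nessie implementation: the function name mean_distance_scores is hypothetical, the vectors are assumed to be precomputed embeddings, and only the Euclidean metric (the default of MeanDistance) is handled.

```python
import numpy as np


def mean_distance_scores(labels, embedded_instances):
    """Hypothetical helper: distance of each instance to the mean vector of its label."""
    labels = np.asarray(labels)
    vectors = np.asarray(embedded_instances, dtype=float)
    scores = np.zeros(len(labels), dtype=float)

    for label in np.unique(labels):
        mask = labels == label
        # Step 2: average the vectors of this label to get its mean representation.
        mean_vector = vectors[mask].mean(axis=0)
        # Step 3: Euclidean distance of each instance of this label from that mean.
        scores[mask] = np.linalg.norm(vectors[mask] - mean_vector, axis=1)

    return scores


# Toy data: three instances labelled "a", two labelled "b".
labels = ["a", "a", "a", "b", "b"]
vectors = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
scores = mean_distance_scores(labels, vectors)
print(scores)
```

In this toy data the mean of label "a" is (2, 0) and the mean of label "b" is (1, 2), so the middle "a" instance gets a score of 0 and its two neighbours get a score of 2.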

Parameters
  • labels – a (num_instances, ) string sequence containing the noisy label for each instance

  • embedded_instances – 2d numpy array of shape (num_instances, encoding_dim) containing the vector representation of each instance

Returns

a (num_instances, ) numpy array containing the outlier score for each instance
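The returned scores can then be ranked and cut off to obtain a list of suspected outliers. The sketch below is one possible way to do this and is not part of nessie: the helper name top_k_percent_outliers and the parameter k are hypothetical, and it assumes that a larger distance means a more likely outlier.

```python
import numpy as np


def top_k_percent_outliers(scores, k=10.0):
    """Hypothetical helper: indices of the k% of instances with the largest scores."""
    scores = np.asarray(scores, dtype=float)
    # Number of instances to flag, rounded up.
    num_flagged = int(np.ceil(len(scores) * k / 100.0))
    # Sort indices so that the largest distance comes first (assumption:
    # instances far from their label mean are the suspected outliers).
    ranked = np.argsort(-scores, kind="stable")
    return ranked[:num_flagged]


# Made-up scores for five instances; flag the top 40%.
scores = np.array([0.2, 3.1, 0.5, 1.7, 0.1])
flagged = top_k_percent_outliers(scores, k=40.0)
print(flagged)
```

Choosing k is a judgment call that depends on how noisy the labels are expected to be.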