nessie.task_support.span_labeling

Module Contents

Classes

AlignedData

AlignmentResult

SpanId

Functions

aggregate_scores_to_spans(labels: nessie.types.RaggedStringArray, scores: nessie.types.RaggedFloatArray, span_aggregator: Optional[Callable[[List[numpy.ndarray]], numpy.ndarray]] = None) → awkward.Array

align_for_span_labeling(noisy_labels: nessie.types.RaggedStringArray, predictions: nessie.types.RaggedStringArray, probabilities: nessie.types.RaggedFloatArray2D, repeated_probabilities: nessie.types.RaggedFloatArray3D, le: sklearn.preprocessing.LabelEncoder, span_aggregator: Optional[Callable[[List[numpy.ndarray]], numpy.ndarray]] = None, function_aggregator: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None) → AlignmentResult

The goal of this function is to align existing and predicted sequence labeling annotations and their respective probabilities.

align_span_labeling_data(tokens: nessie.types.RaggedStringArray, gold_labels: nessie.types.RaggedStringArray, noisy_labels: nessie.types.RaggedStringArray) → AlignedData

Aligns spans from gold data with noisy ones.

align_span_labeling_result(noisy_labels: nessie.types.RaggedStringArray, result: nessie.helper.RaggedResult) → AlignmentResult

embed_spans(X: nessie.types.RaggedStringArray, y: nessie.types.RaggedStringArray, embedder: nessie.models.featurizer.FlairTokenEmbeddingsWrapper, aggregate: Callable[[List[numpy.ndarray]], numpy.ndarray] = None) → awkward.Array

span_matching(tagging_A: List[Tuple[int, int]], tagging_B: List[Tuple[int, int]], keep_A: bool = False) → Dict[int, int]

Assume we have a list of tokens which was tagged with spans by two different approaches A and B.

Attributes

RaggedArray

UNALIGNED_LABEL

nessie.task_support.span_labeling.RaggedArray
nessie.task_support.span_labeling.UNALIGNED_LABEL = ___NESSIE_NO_ALIGNMENT___
class nessie.task_support.span_labeling.AlignedData
gold_labels :List[str]
noisy_labels :List[str]
span_ids :List[SpanId]
surface_forms :List[str]
__len__(self) → int
property flags(self) → List[bool]
class nessie.task_support.span_labeling.AlignmentResult
labels :numpy.typing.NDArray[str]
le :sklearn.preprocessing.LabelEncoder
predictions :numpy.typing.NDArray[str]
probabilities :numpy.typing.NDArray[float]
repeated_probabilities :Optional[numpy.typing.NDArray[float]]
span_ids :List[SpanId]
class nessie.task_support.span_labeling.SpanId
end :int
sentence :int
start :int
nessie.task_support.span_labeling.aggregate_scores_to_spans(labels: nessie.types.RaggedStringArray, scores: nessie.types.RaggedFloatArray, span_aggregator: Optional[Callable[[List[numpy.ndarray]], numpy.ndarray]] = None) → awkward.Array
nessie.task_support.span_labeling.align_for_span_labeling(noisy_labels: nessie.types.RaggedStringArray, predictions: nessie.types.RaggedStringArray, probabilities: nessie.types.RaggedFloatArray2D, repeated_probabilities: nessie.types.RaggedFloatArray3D, le: sklearn.preprocessing.LabelEncoder, span_aggregator: Optional[Callable[[List[numpy.ndarray]], numpy.ndarray]] = None, function_aggregator: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None) → AlignmentResult

The goal of this function is to align existing and predicted sequence labeling annotations and their respective probabilities.

  1. Original and predicted sequence labeling are aligned in a way that maximizes the overlap between them (see span_matching)

  2. BIO tagged sequences are reduced to a list of spans with their type, e.g. [O, B-PER, I-PER, O, B-LOC] becomes [PER, LOC]

  3. Probabilities for spans are aggregated

  4. Spans that exist in the original sequence and have no counterpart in the predicted one get a special label and are assigned the probability of the O label
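Steps 2 and 3 can be illustrated with a minimal, self-contained sketch. It does not call nessie; the helper names bio_to_spans and mean_span_scores are hypothetical, and averaging token scores is only an assumed aggregation, since the default span_aggregator is not documented on this page.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Reduce a BIO tag sequence to (start, end, type) spans; end is exclusive."""
    spans = []
    start, kind = None, None
    for i, tag in enumerate(tags):
        # An open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, kind))
            start, kind = None, None
        # "B-" always opens a span; a stray "I-" is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), kind))
    return spans


def mean_span_scores(tags: List[str], scores: List[float]) -> List[float]:
    """Aggregate token-level scores to one score per span (assumption: the mean)."""
    return [sum(scores[s:e]) / (e - s) for s, e, _ in bio_to_spans(tags)]


tags = ["O", "B-PER", "I-PER", "O", "B-LOC"]
spans = bio_to_spans(tags)                   # [(1, 3, 'PER'), (4, 5, 'LOC')]
span_types = [kind for _, _, kind in spans]  # ['PER', 'LOC']
span_scores = mean_span_scores(tags, [0.0, 0.5, 0.75, 0.25, 1.0])  # [0.625, 1.0]
```

The span offsets use an exclusive end, matching the (start, end) convention described for span_matching.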

Parameters
  • noisy_labels

  • predictions

  • probabilities

  • repeated_probabilities

  • le

  • span_aggregator

  • function_aggregator

Returns:

nessie.task_support.span_labeling.align_span_labeling_data(tokens: nessie.types.RaggedStringArray, gold_labels: nessie.types.RaggedStringArray, noisy_labels: nessie.types.RaggedStringArray) → AlignedData

Aligns spans from gold data with noisy ones.

If a span in the noisy labels has no match in the gold labels, it is assigned a special label that does not occur in the original data. The returned surface forms use the gold span boundaries if a match exists and the noisy span boundaries otherwise.

Parameters
  • tokens – The tokens that contain the text

  • gold_labels – Gold labels in BIO format

  • noisy_labels – Noisy labels in BIO format

Returns: The alignment between gold and noisy data
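The behaviour described above can be sketched for a single sentence as follows. This is an illustration under stated assumptions, not the library's implementation: the function align_sentence is hypothetical, spans are passed in already extracted as (start, end, label) tuples with an exclusive end, matching is done greedily by token overlap, surface forms are joined with spaces, and gold spans without a noisy counterpart are simply skipped.

```python
from typing import List, Tuple

# Value shown for UNALIGNED_LABEL on this page.
UNALIGNED = "___NESSIE_NO_ALIGNMENT___"

Span = Tuple[int, int, str]  # (start, end exclusive, label)


def align_sentence(
    tokens: List[str], gold: List[Span], noisy: List[Span]
) -> List[Tuple[str, str, str]]:
    """Return (gold_label, noisy_label, surface_form) for every noisy span."""
    rows = []
    used = set()
    for n_start, n_end, n_label in noisy:
        # Greedy choice (assumption): the unused gold span with the largest overlap.
        best, best_overlap = None, 0
        for g_idx, (g_start, g_end, _) in enumerate(gold):
            shared = min(n_end, g_end) - max(n_start, g_start)
            if g_idx not in used and shared > best_overlap:
                best, best_overlap = g_idx, shared
        if best is None:
            # No gold match: special label, surface form from the noisy boundaries.
            rows.append((UNALIGNED, n_label, " ".join(tokens[n_start:n_end])))
        else:
            # Gold match: surface form from the gold boundaries.
            used.add(best)
            g_start, g_end, g_label = gold[best]
            rows.append((g_label, n_label, " ".join(tokens[g_start:g_end])))
    return rows


tokens = ["Ada", "Lovelace", "visited", "New", "York", "today"]
gold = [(0, 2, "PER"), (3, 5, "LOC")]
noisy = [(0, 1, "PER"), (3, 5, "ORG"), (5, 6, "DATE")]
rows = align_sentence(tokens, gold, noisy)
```

In this example the first noisy span covers only "Ada", but its surface form is taken from the matching gold span ("Ada Lovelace"), while the unmatched "today" span receives the special label.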

nessie.task_support.span_labeling.align_span_labeling_result(noisy_labels: nessie.types.RaggedStringArray, result: nessie.helper.RaggedResult) → AlignmentResult
nessie.task_support.span_labeling.embed_spans(X: nessie.types.RaggedStringArray, y: nessie.types.RaggedStringArray, embedder: nessie.models.featurizer.FlairTokenEmbeddingsWrapper, aggregate: Callable[[List[numpy.ndarray]], numpy.ndarray] = None) → awkward.Array
nessie.task_support.span_labeling.span_matching(tagging_A: List[Tuple[int, int]], tagging_B: List[Tuple[int, int]], keep_A: bool = False) → Dict[int, int]

Assume we have a list of tokens which was tagged with spans by two different approaches A and B. This method tries to find the best 1:1 assignment of spans from B to spans from A. If there are more spans in A than in B, then some spans from A will remain unmatched, and vice versa. The quality of an assignment between two spans depends on their overlap in tokens. Pairs of spans that are entirely disjoint are removed from the result.

Note: In case A contains two (or more) spans of the same length which are a single span in B (or vice versa), either of the spans from A may be mapped to the span in B. Which exact span from A is mapped is undefined.

Parameters
  • tagging_A – List of spans, defined by (start, end) token offsets (end is exclusive); the spans must be non-overlapping

  • tagging_B – A second list of spans over the same sequence, in the same format as tagging_A

  • keep_A – Include unmatched spans from A as [idx_A, None] in the returned value

Returns: Dict[int, int] where keys are indices from A and values are indices from B
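The matching idea can be sketched with a brute-force search that is only practical for a handful of spans. This is not the library's implementation: the names overlap and match_spans_bruteforce are hypothetical, and how span_matching actually solves the assignment is not stated on this page.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlap(a: Span, b: Span) -> int:
    """Number of tokens shared by two spans (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def match_spans_bruteforce(
    tagging_a: List[Span], tagging_b: List[Span], keep_a: bool = False
) -> Dict[int, Optional[int]]:
    """Find the 1:1 assignment of B spans to A spans with maximal total overlap."""
    # Pad B with None so that any span in A may stay unmatched.
    candidates = list(range(len(tagging_b))) + [None] * len(tagging_a)
    best_score, best = -1, ()
    for perm in permutations(candidates, len(tagging_a)):
        score = sum(
            overlap(tagging_a[i], tagging_b[j])
            for i, j in enumerate(perm)
            if j is not None
        )
        if score > best_score:
            best_score, best = score, perm
    result: Dict[int, Optional[int]] = {}
    for i, j in enumerate(best):
        if j is not None and overlap(tagging_a[i], tagging_b[j]) > 0:
            result[i] = j        # keep only pairs that share at least one token
        elif keep_a:
            result[i] = None     # report unmatched spans from A on request
    return result


tagging_a = [(0, 2), (3, 5), (7, 8)]
tagging_b = [(1, 2), (3, 6)]
matched = match_spans_bruteforce(tagging_a, tagging_b)
# matched == {0: 0, 1: 1}
matched_keep = match_spans_bruteforce(tagging_a, tagging_b, keep_a=True)
# matched_keep == {0: 0, 1: 1, 2: None}
```

For realistic input sizes, a polynomial-time assignment solver would be the natural replacement for the exhaustive search used here.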