Calibration

Several model-based AED methods, for instance Classification Uncertainty, directly rely on the probability estimates of the backing model. It therefore matters whether these models output accurate class probability distributions. For example, if a model predicts 100 instances with a stated confidence of 80% each, then roughly 80 of them should be correct. A model for which this holds is called calibrated. There are several post-hoc calibration methods that can be applied after a model for AED has been trained. In nessie, we use the netcal library for calibration; its documentation also lists further calibration methods to try. In our evaluation, Platt Scaling (also called Logistic Calibration) worked well. In our experiments, calibration can yield statistically significant improvements in AED performance and more often than not has no large negative side effects. Please refer to our paper for more information.
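
To make this concrete, the following toy example (synthetic data, illustrative only) measures the expected calibration error (ECE) of an over-confident binary classifier before and after applying Platt Scaling with netcal:

[ ]:
import numpy as np

from netcal.metrics import ECE
from netcal.scaling import LogisticCalibration

# Synthetic predictions: the "model" is correct on ~70% of the instances
# but always reports 90% confidence, i.e. it is over-confident
rng = np.random.default_rng(0)
n = 1000
y_true = rng.integers(0, 2, size=n)
is_correct = rng.random(n) < 0.7
y_pred = np.where(is_correct, y_true, 1 - y_true)
confidences = np.where(y_pred == 1, 0.9, 0.1)  # confidence for the positive class

ece = ECE(bins=10)
print("ECE before calibration:", ece.measure(confidences, y_true))

# Platt Scaling: fit on confidences and labels, then rescale the confidences
calibrator = LogisticCalibration()
calibrator.fit(confidences, y_true)
calibrated = calibrator.transform(confidences)

print("ECE after calibration:", ece.measure(calibrated, y_true))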

Usage

Calibration is normally trained on a holdout set. As we already perform cross-validation, we use the holdout fold both for training the calibration and for predicting annotation errors. This would not be optimal if we wanted calibrated probabilities that generalize to unseen data, but we are mainly interested in downstream task performance. Using an additional fold per round would be theoretically more sound, but our preliminary experiments show that it reduces the available training data and thereby hurts error detection performance more than the calibration helps. Using the same fold for both calibrating and applying AED, however, improves overall task performance, which is what matters in our setting. Note that we do not leak the targets of the downstream task (whether an instance is labeled incorrectly or not), only the labels of the primary task. nessie provides helper methods for this.
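
Conceptually, the per-fold calibration boils down to the following sketch (illustrative only, not nessie's actual implementation; the function name and shapes are assumptions): the calibrator is fitted on the probabilities the model predicted for the held-out fold together with that fold's noisy primary-task labels, and the very same probabilities are then rescaled before they are passed to the detector.

[ ]:
import numpy as np

from netcal.scaling import LogisticCalibration


def calibrate_fold(fold_probabilities: np.ndarray, fold_noisy_labels: np.ndarray) -> np.ndarray:
    """Sketch of per-fold calibration (names and shapes are illustrative).

    fold_probabilities: (num_instances, num_classes) class probabilities predicted
        by the model trained on the remaining folds.
    fold_noisy_labels: integer-encoded noisy labels of the held-out fold.
    """
    calibrator = LogisticCalibration()
    calibrator.fit(fold_probabilities, fold_noisy_labels)
    return calibrator.transform(fold_probabilities)

In nessie, this is handled by a calibration callback that is attached to the cross-validation helper, as shown in the full example below.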

[ ]:
import ireval

from netcal.scaling import LogisticCalibration

from nessie.calibration import CalibrationCallback, CalibratorWrapper
from nessie.detectors import ClassificationUncertainty
from nessie.dataloader import load_example_text_classification_data
from nessie.helper import CrossValidationHelper
from nessie.models.text import DummyTextClassifier

ds = load_example_text_classification_data()

model = DummyTextClassifier()
detector = ClassificationUncertainty()

# Without calibration: no calibration callback is registered, so the
# detector sees the raw (uncalibrated) model probabilities

cv = CrossValidationHelper()

result_uncalibrated = cv.run(ds.texts, ds.noisy_labels, model)
scores_uncalibrated = detector.score(ds.noisy_labels, result_uncalibrated.probabilities, result_uncalibrated.le)

# With calibration

calibrator = CalibratorWrapper(LogisticCalibration())
calibration_callback = CalibrationCallback(calibrator)

cv = CrossValidationHelper()
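# The callback fits the calibrator on each held-out fold and rescales that fold's probabilities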
cv.add_callback(calibration_callback)

result_calibrated = cv.run(ds.texts, ds.noisy_labels, model)
scores_calibrated = detector.score(ds.noisy_labels, result_calibrated.probabilities, result_calibrated.le)

# Evaluation
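# ds.flags marks which instances are annotation errors; average precision
# measures how well the scores rank these instances first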
flags = ds.flags
ap_uncalibrated = ireval.average_precision(flags, scores_uncalibrated)
ap_calibrated = ireval.average_precision(flags, scores_calibrated)

print(f"AP uncalibrated: {ap_uncalibrated}")
print(f"AP calibrated: {ap_calibrated}")
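
If calibration helps for this model and dataset, the average precision of the calibrated scores should be at least on par with the uncalibrated one. Note that DummyTextClassifier is only a lightweight stand-in that keeps the example fast; for meaningful numbers, use one of the real text classifiers from nessie.models.text.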