Datasets and tasks

We focus on methods for text classification as well as token and span labeling, but our implementations should be easily adaptable to other tasks. We define a simple file format for each of these tasks and provide a corresponding loader. We also provide example datasets that can be used to test methods and to understand the data formats.

Text classification

The goal of text classification is to assign a predefined category to a given text sequence (which can, for instance, be a sentence, a paragraph, or a document). Example applications are news categorization, sentiment analysis, and intent detection. For text classification, each individual sentence or document is considered its own instance.

Format

The format consists of n rows, one per instance, each with three tab-separated fields: first the text, then the gold label, and finally the noisy label.

[1]:
%%writefile text_classification.tsv
I love reindeer very much   positive        positive
I like Michael very much    positive        negative
Overwriting text_classification.tsv
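Because the format is plain TSV, it can also be parsed without nessie using only the standard library. The following is a minimal sketch (not part of nessie's API); it parses the same content as the file written above, held in a string for illustration:

```python
import csv
import io

# Same content as text_classification.tsv above, inlined for illustration.
tsv_content = (
    "I love reindeer very much\tpositive\tpositive\n"
    "I like Michael very much\tpositive\tnegative\n"
)

# Each row yields (text, gold_label, noisy_label).
rows = [tuple(row) for row in csv.reader(io.StringIO(tsv_content), delimiter="\t")]

for text, gold_label, noisy_label in rows:
    print(text, gold_label, noisy_label)
```

Reading the file itself works the same way by replacing `io.StringIO(...)` with `open("text_classification.tsv", newline="")`.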

Example data

[2]:
from nessie.dataloader import load_example_text_classification_data

data = load_example_text_classification_data()
print(data[42])
('I love corruption the most', 'positive', 'negative')

Data loader

[3]:
from nessie.dataloader import load_text_classification_tsv

ds = load_text_classification_tsv("text_classification.tsv")
print(ds[1])
('I like Michael very much', 'positive', 'negative')

Token labeling

The task of token labeling is to assign a label to each token. The most common task in this category is part-of-speech (POS) tagging. As there are few other tasks with easily obtainable datasets, we use only two different POS tagging datasets. For token labeling, each individual token is considered an instance.

Format

The format is similar to other CoNLL formats. It consists of n blocks separated by blank lines, one block per sentence. Each block consists of a varying number of rows, one per token. Each row consists of three tab-separated fields: first the token text, then the gold label, and finally the noisy label.

[4]:
%%writefile token_labeling.conll
I   PRON    PRON
like        VERB    NOUN
reindeer    NOUN    NOUN
.   PUNCT   PUNCT

I   PRON    PRON
adore       VERB    NOUN
Michael     PROPN   ADJ
very        ADV     ADV
much        ADV     VERB
.   PUNCT   PUNCT
Overwriting token_labeling.conll
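The block structure described above is simple enough to parse by hand. A minimal sketch (not part of nessie's API), using the same content as the file written above, inlined as a string for illustration:

```python
# Same content as token_labeling.conll above, inlined for illustration.
conll = (
    "I\tPRON\tPRON\n"
    "like\tVERB\tNOUN\n"
    "reindeer\tNOUN\tNOUN\n"
    ".\tPUNCT\tPUNCT\n"
    "\n"
    "I\tPRON\tPRON\n"
    "adore\tVERB\tNOUN\n"
    "Michael\tPROPN\tADJ\n"
    "very\tADV\tADV\n"
    "much\tADV\tVERB\n"
    ".\tPUNCT\tPUNCT\n"
)

# Blank lines separate sentence blocks; each row is
# (token, gold_label, noisy_label).
sentences = []
current = []
for line in conll.splitlines():
    if not line.strip():
        if current:
            sentences.append(current)
            current = []
    else:
        current.append(tuple(line.split("\t")))
if current:
    sentences.append(current)

print(len(sentences))          # 2
print(sentences[1][1])         # ('adore', 'VERB', 'NOUN')
```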

Example data

[5]:
from nessie.dataloader import load_example_token_labeling_data

data = load_example_token_labeling_data()
for token, gold_label, noisy_label in zip(*data[23]):
    print(token, gold_label, noisy_label)
I PRON PRON
love VERB PROPN
sports NOUN NOUN
. PUNCT PUNCT

Data loader

[6]:
from nessie.dataloader import load_sequence_labeling_dataset

data = load_sequence_labeling_dataset("token_labeling.conll")

for token, gold_label, noisy_label in zip(*data[1]):
    print(token, gold_label, noisy_label)
I PRON PRON
adore VERB NOUN
Michael PROPN ADJ
very ADV ADV
much ADV VERB
. PUNCT PUNCT

Span labeling

Span labeling assigns labels not to single tokens, but to spans of text. Common tasks that can be modeled this way are named-entity recognition (NER), slot filling, and chunking.

Format

The format is similar to other CoNLL formats and identical to the one for token labeling. It consists of n blocks separated by blank lines, one block per sentence. Each block consists of a varying number of rows, one per token. Each row consists of three tab-separated fields: first the token text, then the gold label, and finally the noisy label.

[7]:
%%writefile span_labeling.conll
The O       O
United      B-LOC   B-LOC
States      I-LOC   I-LOC
of  I-LOC   I-LOC
America     I-LOC   I-LOC
is  O       O
in  O       O
the O       O
city        O       O
of  O       O
New B-LOC   B-PER
York        I-LOC   I-PER
.   O       O

Hogwarts    B-ORG   B-ORG
pays        O       O
taxes       O       O
in  O       O
Manhattan   B-LOC   B-ORG
.   O       O
Overwriting span_labeling.conll

Example data

[8]:
from nessie.dataloader import load_example_span_classification_data

data = load_example_span_classification_data()
for token, gold_label, noisy_label in zip(*data[80]):
    print(token, gold_label, noisy_label)
Barack B-PER B-PER
Obama I-PER I-PER
likes O O
Hogwarts B-ORG B-PER
. O O

Data loader

[9]:
from nessie.dataloader import load_sequence_labeling_dataset

data = load_sequence_labeling_dataset("span_labeling.conll")

for token, gold_label, noisy_label in zip(*data[1]):
    print(token, gold_label, noisy_label)
Hogwarts B-ORG B-ORG
pays O O
taxes O O
in O O
Manhattan B-LOC B-ORG
. O O