Python Flair Basics

(C) 2018-2019 by Damir Cavar

Version: 0.2, September 2019

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This material was used in my Advanced Topics in AI class, an introduction to Deep Learning environments, in Spring 2019 at Indiana University.

Two types of central objects:

  • Sentence
  • Token

A Sentence holds a textual sentence and is essentially a list of Token objects.

For creating a Sentence object we first import the Sentence class from the flair.data module:

In [2]:
from flair.data import Sentence

We can now define a sentence:

In [3]:
sentence = Sentence('The grass is green .')
print(sentence)
Sentence: "The grass is green ." - 5 Tokens

We can access the tokens of a sentence via their token id (counting from 1) or via their index (counting from 0):

In [4]:
print(sentence.get_token(4))
print(sentence[3])
Token: 4 green
Token: 4 green

We can also iterate over all tokens in a sentence:

In [5]:
for token in sentence:
    print(token)
Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .

Tokenization

Flair includes a simple tokenizer that uses the lightweight segtok library to tokenize your text for such a Sentence definition. Pass the flag use_tokenizer=True to the Sentence constructor to tokenize the input string before the Sentence object is instantiated:

In [6]:
sentence = Sentence('The grass is green.', use_tokenizer=True)
print(sentence)
Sentence: "The grass is green ." - 5 Tokens

Tags on Tokens

A Token has fields for linguistic annotation:

  • lemma
  • part-of-speech tag
  • named entity tag

We can add a tag by specifying the tag type and the tag value.

We will add a tag of type 'ner' with the value 'color' to the word 'green'. This means that we have tagged this word as an entity of type color:

In [7]:
sentence[3].add_tag('ner', 'color')
print(sentence.to_tagged_string())
The grass is green <color> .

Each tag is of class Label. An associated score indicates confidence:

In [8]:
from flair.data import Label

tag: Label = sentence[3].get_tag('ner')
print(f'"{sentence[3]}" is tagged as "{tag.value}" with confidence score "{tag.score}"')
"Token: 4 green" is tagged as "color" with confidence score "1.0"

The manually added color tag has a score of 1.0. A tag predicted by a sequence labeler will have a score value that indicates the classifier confidence.

A Sentence can have one or multiple labels that can for example be used in text classification tasks. For instance, the example below shows how we add the label 'sports' to a sentence, thereby labeling it as belonging to the sports category.

In [9]:
sentence = Sentence('France is the current world cup winner.')
# add a single label to the sentence ...
sentence.add_label('sports')
# ... or add several labels at once ...
sentence.add_labels(['sports', 'world cup'])
# ... or pass the labels directly to the constructor
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])

Labels are also of the Label class. So, you can print a sentence's labels like this:

In [10]:
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])

print(sentence)
for label in sentence.labels:
    print(label)
Sentence: "France is the current world cup winner." - 7 Tokens - Labels: [sports (1.0), world cup (1.0)] 
sports (1.0)
world cup (1.0)

Tagging Text

Using Pre-Trained Sequence Tagging Models

Flair has numerous pre-trained models. For the named entity recognition (NER) task there is a model that was trained on the English CoNLL-03 data and can recognize 4 different entity types. Load it as follows:

In [13]:
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
2019-09-26 15:06:43,008 loading file C:\Users\damir\.flair\models\en-ner-conll03-v0.4.pt

We use the predict() method of the tagger on a sentence to add predicted tags to the tokens in the sentence:

In [14]:
sentence = Sentence('George Washington went to Washington .')

tagger.predict(sentence)

print(sentence.to_tagged_string())
George <B-PER> Washington <E-PER> went to Washington <S-LOC> .

Getting annotated spans for multi-word expressions can be achieved using the following command:

In [15]:
for entity in sentence.get_spans('ner'):
    print(entity)
PER-span [1,2]: "George Washington"
LOC-span [5]: "Washington"

This indicates that "George Washington" is a person (PER) and "Washington" is a location (LOC). Each such Span has a text, a tag value, its position in the sentence, and a score that indicates how confident the tagger is that the prediction is correct. You can also get additional information, such as the position offsets of each entity in the sentence, by calling:

In [16]:
print(sentence.to_dict(tag_type='ner'))
{'text': 'George Washington went to Washington .', 'labels': [], 'entities': [{'text': 'George Washington', 'start_pos': 0, 'end_pos': 17, 'type': 'PER', 'confidence': 0.9967882037162781}, {'text': 'Washington', 'start_pos': 26, 'end_pos': 36, 'type': 'LOC', 'confidence': 0.9993709921836853}]}
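
The same information is also available directly on each Span object; a small sketch, assuming the attribute names text, tag, score, start_pos and end_pos used by the Flair version in this notebook:

In [ ]:
for entity in sentence.get_spans('ner'):
    # surface text, tag value, confidence score, and character offsets
    print(entity.text, entity.tag, entity.score, entity.start_pos, entity.end_pos)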

Flair contains various sequence tagger models. You choose which pre-trained model you load by passing the appropriate string to the load() method of the SequenceTagger class; the Flair documentation lists the currently available pre-trained models.

Flair also provides pre-trained models for languages other than English. Currently, German, French, and Dutch are supported, with other languages forthcoming. To tag a German sentence, just load the appropriate model:

In [18]:
tagger = SequenceTagger.load('de-ner')

sentence = Sentence('George Washington ging nach Washington .')

tagger.predict(sentence)

print(sentence.to_tagged_string())
2019-09-26 15:08:57,099 loading file C:\Users\damir\.flair\models\de-ner-conll03-v0.4.pt
George <B-PER> Washington <E-PER> ging nach Washington <S-LOC> .

Flair offers access to multi-lingual models for multi-lingual text.

In [20]:
tagger = SequenceTagger.load('pos-multi')

sentence = Sentence('George Washington lebte in Washington . Dort kaufte er a horse .')

tagger.predict(sentence)

print(sentence.to_tagged_string())
2019-09-26 15:09:32,682 loading file C:\Users\damir\.flair\models\pos-multi-v0.1.pt
George <PROPN> Washington <PROPN> lebte <VERB> in <ADP> Washington <PROPN> . <PUNCT> Dort <PROPN> kaufte <VERB> er <PRON> a <DET> horse <NOUN> . <PUNCT>

Semantic Frames

For English, Flair provides a pre-trained model that detects semantic frames in text, trained on PropBank 3.0 frames. This provides a form of word sense disambiguation for frame-evoking words.

In [ ]:
tagger = SequenceTagger.load('frame')

sentence_1 = Sentence('George returned to Berlin to return his hat .')
sentence_2 = Sentence('He had a look at different hats .')

tagger.predict(sentence_1)
tagger.predict(sentence_2)

print(sentence_1.to_tagged_string())
print(sentence_2.to_tagged_string())

The frame detector makes a distinction in sentence 1 between two different meanings of the word 'return'. 'return.01' means returning to a location, while 'return.02' means giving something back.

Similarly, in sentence 2 the frame detector finds a light verb construction in which 'have' is the light verb and 'look' is a frame-evoking word.

Sentence Tagging

To tag an entire text corpus, one needs to split the corpus into sentences and pass a list of Sentence objects to the .predict() method.

In [24]:
text = "This is a sentence. John read a book. This is another sentence. I love Berlin."

from segtok.segmenter import split_single

sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

tagger: SequenceTagger = SequenceTagger.load('ner')
tagger.predict(sentences)
for i in sentences:
    print(i.to_tagged_string())
2019-09-26 15:12:04,472 loading file C:\Users\damir\.flair\models\en-ner-conll03-v0.4.pt
This is a sentence .
John <S-PER> read a book .
This is another sentence .
I love Berlin <S-LOC> .

Using the mini_batch_size parameter of the .predict() method, you can set the size of mini batches passed to the tagger. Depending on your resources, you might want to play around with this parameter to optimize speed.
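
For example, assuming the tagger and sentence list from above (32 is just an illustrative value):

In [ ]:
# larger mini-batches are usually faster on a GPU, smaller ones need less memory
tagger.predict(sentences, mini_batch_size=32)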

Pre-Trained Text Classification Models

Flair provides a pre-trained model for detecting positive or negative comments. It was trained on the IMDB dataset and can recognize positive and negative sentiment in English text. The IMDB dataset is publicly available for download.

In [26]:
from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')
2019-09-26 15:14:27,357 loading file C:\Users\damir\.flair\models\imdb-v0.4.pt

We call the predict() method of the classifier on a sentence. This will add the predicted label to the sentence:

In [27]:
sentence = Sentence('This film hurts. It is so bad that I am confused.')

classifier.predict(sentence)

print(sentence.labels)
[NEGATIVE (0.9598665833473206)]
In [28]:
sentence = Sentence('This film is fantastic. I love it.')

classifier.predict(sentence)

print(sentence.labels)
[POSITIVE (0.9998914003372192)]

Flair provides pre-trained text classification models for both English and German.

Using Word Embeddings

Flair provides a set of classes with which we can embed the words in sentences in various ways.

All word embedding classes inherit from the TokenEmbeddings class and implement the embed() method which we need to call to embed our text. This means that for most users of Flair, the complexity of different embeddings remains hidden behind this interface. We simply instantiate the embedding class we require and call embed() to embed our text.

All embeddings produced with Flair's methods are PyTorch vectors, so they can be immediately used for training and fine-tuning.

Classic word embeddings are static and word-level, meaning that each distinct word gets exactly one pre-computed embedding. Most embeddings fall under this class, including the popular GloVe or Komninos embeddings.

We instantiate the WordEmbeddings class and pass a string identifier of the embedding we wish to load. If we want to use GloVe embeddings, we pass the string 'glove' to the constructor:

In [30]:
from flair.embeddings import WordEmbeddings

glove_embedding = WordEmbeddings('glove')

We create an example sentence and call the embedding's embed() method. We can also pass a list of sentences to this method since some embedding types make use of batching to increase speed.

In [31]:
sentence = Sentence('The grass is green .')

glove_embedding.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)
Token: 1 The
tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.2706])
Token: 2 grass
tensor([-0.8135,  0.9404, -0.2405, -0.1350,  0.0557,  0.3363,  0.0802, -0.1015,
        -0.5478, -0.3537,  0.0734,  0.2587,  0.1987, -0.1433,  0.2507,  0.4281,
         0.1950,  0.5346,  0.7424,  0.0578, -0.3178,  0.9436,  0.8145, -0.0824,
         0.6166,  0.7284, -0.3262, -1.3641,  0.1232,  0.5373, -0.5123,  0.0246,
         1.0822, -0.2296,  0.6039,  0.5541, -0.9610,  0.4803,  0.0022,  0.5591,
        -0.1637, -0.8468,  0.0741, -0.6216,  0.0260, -0.5162, -0.0525, -0.1418,
        -0.0161, -0.4972, -0.5534, -0.4037,  0.5096,  1.0276, -0.0840, -1.1179,
         0.3226,  0.4928,  0.9488,  0.2040,  0.5388,  0.8397, -0.0689,  0.3136,
         1.0450, -0.2267, -0.0896, -0.6427,  0.6443, -1.1001, -0.0096,  0.2668,
        -0.3230, -0.6065,  0.0479, -0.1664,  0.8571,  0.2335,  0.2539,  1.2546,
         0.5472, -0.1980, -0.7186,  0.2076, -0.2587, -0.3650,  0.0834,  0.6932,
         0.1574,  1.0931,  0.0913, -1.3773, -0.2717,  0.7071,  0.1872, -0.3307,
        -0.2836,  0.1030,  1.2228,  0.8374])
Token: 3 is
tensor([-0.5426,  0.4148,  1.0322, -0.4024,  0.4669,  0.2182, -0.0749,  0.4733,
         0.0810, -0.2208, -0.1281, -0.1144,  0.5089,  0.1157,  0.0282, -0.3628,
         0.4382,  0.0475,  0.2028,  0.4986, -0.1007,  0.1327,  0.1697,  0.1165,
         0.3135,  0.2571,  0.0928, -0.5683, -0.5297, -0.0515, -0.6733,  0.9253,
         0.2693,  0.2273,  0.6636,  0.2622,  0.1972,  0.2609,  0.1877, -0.3454,
        -0.4263,  0.1398,  0.5634, -0.5691,  0.1240, -0.1289,  0.7248, -0.2610,
        -0.2631, -0.4360,  0.0789, -0.8415,  0.5160,  1.3997, -0.7646, -3.1453,
        -0.2920, -0.3125,  1.5129,  0.5243,  0.2146,  0.4245, -0.0884, -0.1780,
         1.1876,  0.1058,  0.7657,  0.2191,  0.3582, -0.1164,  0.0933, -0.6248,
        -0.2190,  0.2180,  0.7406, -0.4374,  0.1434,  0.1472, -1.1605, -0.0505,
         0.1268, -0.0144, -0.9868, -0.0913, -1.2054, -0.1197,  0.0478, -0.5400,
         0.5246, -0.7096, -0.3253, -0.1346, -0.4131,  0.3343, -0.0072,  0.3225,
        -0.0442, -1.2969,  0.7622,  0.4635])
Token: 4 green
tensor([-6.7907e-01,  3.4908e-01, -2.3984e-01, -9.9652e-01,  7.3782e-01,
        -6.5911e-04,  2.8010e-01,  1.7287e-02, -3.6063e-01,  3.6955e-02,
        -4.0395e-01,  2.4092e-02,  2.8958e-01,  4.0497e-01,  6.9992e-01,
         2.5269e-01,  8.0350e-01,  4.9370e-02,  1.5562e-01, -6.3286e-03,
        -2.9414e-01,  1.4728e-01,  1.8977e-01, -5.1791e-01,  3.6986e-01,
         7.4582e-01,  8.2689e-02, -7.2601e-01, -4.0939e-01, -9.7822e-02,
        -1.4096e-01,  7.1121e-01,  6.1933e-01, -2.5014e-01,  4.2250e-01,
         4.8458e-01, -5.1915e-01,  7.7125e-01,  3.6685e-01,  4.9652e-01,
        -4.1298e-02, -1.4683e+00,  2.0038e-01,  1.8591e-01,  4.9860e-02,
        -1.7523e-01, -3.5528e-01,  9.4153e-01, -1.1898e-01, -5.1903e-01,
        -1.1887e-02, -3.9186e-01, -1.7479e-01,  9.3451e-01, -5.8931e-01,
        -2.7701e+00,  3.4522e-01,  8.6533e-01,  1.0808e+00, -1.0291e-01,
        -9.1220e-02,  5.5092e-01, -3.9473e-01,  5.3676e-01,  1.0383e+00,
        -4.0658e-01,  2.4590e-01, -2.6797e-01, -2.6036e-01, -1.4151e-01,
        -1.2022e-01,  1.6234e-01, -7.4320e-01, -6.4728e-01,  4.7133e-02,
         5.1642e-01,  1.9898e-01,  2.3919e-01,  1.2550e-01,  2.2471e-01,
         8.2613e-01,  7.8328e-02, -5.7020e-01,  2.3934e-02, -1.5410e-01,
        -2.5739e-01,  4.1262e-01, -4.6967e-01,  8.7914e-01,  7.2629e-01,
         5.3862e-02, -1.1575e+00, -4.7835e-01,  2.0139e-01, -1.0051e+00,
         1.1515e-01, -9.6609e-01,  1.2960e-01,  1.8388e-01, -3.0383e-02])
Token: 5 .
tensor([-0.3398,  0.2094,  0.4635, -0.6479, -0.3838,  0.0380,  0.1713,  0.1598,
         0.4662, -0.0192,  0.4148, -0.3435,  0.2687,  0.0446,  0.4213, -0.4103,
         0.1546,  0.0222, -0.6465,  0.2526,  0.0431, -0.1945,  0.4652,  0.4565,
         0.6859,  0.0913,  0.2188, -0.7035,  0.1679, -0.3508, -0.1263,  0.6638,
        -0.2582,  0.0365, -0.1361,  0.4025,  0.1429,  0.3813, -0.1228, -0.4589,
        -0.2528, -0.3043, -0.1121, -0.2618, -0.2248, -0.4455,  0.2991, -0.8561,
        -0.1450, -0.4909,  0.0083, -0.1749,  0.2752,  1.4401, -0.2124, -2.8435,
        -0.2796, -0.4572,  1.6386,  0.7881, -0.5526,  0.6500,  0.0864,  0.3901,
         1.0632, -0.3538,  0.4833,  0.3460,  0.8417,  0.0987, -0.2421, -0.2705,
         0.0453, -0.4015,  0.1139,  0.0062,  0.0367,  0.0185, -1.0213, -0.2081,
         0.6407, -0.0688, -0.5864,  0.3348, -1.1432, -0.1148, -0.2509, -0.4591,
        -0.0968, -0.1795, -0.0634, -0.6741, -0.0689,  0.5360, -0.8777,  0.3180,
        -0.3924, -0.2339,  0.4730, -0.0288])

GloVe embeddings are PyTorch vectors of dimensionality 100.

We choose which pre-trained embeddings to load by passing the appropriate id string to the constructor of the WordEmbeddings class. Typically, this is the two-letter language code: 'en' for English, 'de' for German, and so on. By default, this initializes FastText embeddings trained over Wikipedia. To use FastText embeddings trained over Web crawls instead, append '-crawl' to the language code; the 'de-crawl' option, for example, uses embeddings trained over German web crawls.

For English, Flair provides a few more options. We can choose between instantiating 'en-glove', 'en-extvec' and so on.

If we want to load German FastText embeddings, instantiate as follows:

In [33]:
german_embedding = WordEmbeddings('de')

If we want to load German FastText embeddings trained over crawls, we instantiate as follows:

In [35]:
german_embedding = WordEmbeddings('de-crawl')

If the models are not locally available, Flair will automatically download them and install them into the local user cache.

It is recommended to use the FastText embeddings, or GloVe if we want a smaller model.

If we want to use any other embeddings that are not provided as pre-trained options, we can load custom embeddings from a file:

In [ ]:
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')

If we want to load custom embeddings, we need to make sure that they are saved in a format gensim can read.

We can, for example, convert FastText embeddings to gensim using the following code snippet:

In [ ]:
import gensim

word_vectors = gensim.models.KeyedVectors.load_word2vec_format('/path/to/fasttext/embeddings.txt', binary=False)
word_vectors.save('/path/to/converted')

Character Embeddings

Some embeddings - such as character-features - are not pre-trained but rather trained on the downstream task. Normally this requires an implementation of a hierarchical embedding architecture.

With Flair, we don't need to worry about such things. Just choose the appropriate embedding class and the character features will then automatically train during downstream task training.

In [37]:
from flair.embeddings import CharacterEmbeddings

embedding = CharacterEmbeddings()

sentence = Sentence('The grass is green .')

embedding.embed(sentence)
Out[37]:
[Sentence: "The grass is green ." - 5 Tokens]

Sub-Word Embeddings

Flair now also includes the byte pair embeddings computed by @bheinzerling that segment words into subsequences. This can dramatically reduce the model size compared with normal word embeddings, at nearly the same accuracy. So, if we want to train small models, we can try out the new BytePairEmbeddings class.

We initialize with a language code (275 languages supported), a number of 'syllables', and a number of dimensions (one of 50, 100, 200 or 300). The following initializes and uses byte pair embeddings for English:

In [39]:
from flair.embeddings import BytePairEmbeddings

embedding = BytePairEmbeddings('en')

sentence = Sentence('The grass is green .')

embedding.embed(sentence)
Out[39]:
[Sentence: "The grass is green ." - 5 Tokens]

Sub-word embeddings are interesting for several reasons. BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing. Its authors highlight the following points:

  • subwords allow guessing the meaning of unknown / out-of-vocabulary words. E.g., the suffix -shire in Melfordshire indicates a location.
  • Byte-Pair Encoding gives a subword segmentation that is often good enough, without requiring tokenization or morphological analysis. In this case the BPE segmentation might be something like melf ord shire.
  • Pre-trained byte-pair embeddings work surprisingly well, while requiring no tokenization and being much smaller than alternatives: an 11 MB BPEmb English model matches the results of the 6 GB FastText model in our evaluation.

If you are using word embeddings like word2vec or GloVe, you have probably encountered out-of-vocabulary words, i.e., words for which no embedding exists. A makeshift solution is to replace such words with an UNK token and train a generic embedding representing such unknown words.

Subword approaches try to solve the unknown word problem differently, by assuming that you can reconstruct a word's meaning from its parts. For example, the suffix -shire lets you guess that Melfordshire is probably a location, or the suffix -osis that Myxomatosis might be a sickness.

There are many ways of splitting a word into subwords. A simple method is to split into characters and then learn to transform this character sequence into a vector representation by feeding it to a convolutional neural network (CNN) or a recurrent neural network (RNN), usually a long short-term memory (LSTM) network. This vector representation can then be used like a word embedding.

Another, more linguistically motivated way is a morphological analysis, but this requires tools and training data which might not be available for your language and domain of interest.

Enter Byte-Pair Encoding (BPE) [Sennrich et al, 2016], an unsupervised subword segmentation method. BPE starts with a sequence of symbols, for example characters, and iteratively merges the most frequent symbol pair into a new symbol.

For example, applying BPE to English might first merge the characters h and e into a new symbol he, then t and h into th, then t and he into the, and so on.

Learning these merge operations from a large corpus (e.g. all Wikipedia articles in a given language) often yields reasonable subword segmentations. For example, a BPE model trained on English Wikipedia splits melfordshire into mel, ford, and shire.
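
The merge procedure itself is easy to sketch. The toy implementation below is not Flair or BPEmb code, just an illustration of the idea: starting from character sequences, it repeatedly merges the most frequent adjacent symbol pair in a small word-frequency vocabulary.

In [ ]:
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the symbol pair with one merged symbol
    merged = {}
    for symbols, freq in vocab.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = freq
    return merged

# toy corpus: word frequencies, each word represented as a tuple of characters
vocab = {tuple('the'): 5, tuple('then'): 2, tuple('he'): 3, tuple('melfordshire'): 1}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print('merge', step + 1, ':', best)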

Stacked Embeddings

Stacked embeddings are one of the most important concepts of Flair. We can use them to combine different embeddings together, for instance if we want to use both traditional embeddings together with contextual string embeddings (see below). Stacked embeddings allow us to mix and match. We find that a combination of embeddings often gives best results.

All we need to do is use the StackedEmbeddings class and instantiate it by passing a list of embeddings that we wish to combine. For instance, let's combine classic GloVe embeddings with character embeddings. This is effectively the architecture proposed in (Lample et al., 2016).

In [40]:
from flair.embeddings import WordEmbeddings, CharacterEmbeddings

glove_embedding = WordEmbeddings('glove')

character_embeddings = CharacterEmbeddings()

Now instantiate the StackedEmbeddings class and pass it a list containing these two embeddings.

In [41]:
from flair.embeddings import StackedEmbeddings

stacked_embeddings = StackedEmbeddings(
    embeddings=[glove_embedding, character_embeddings])

We use this embedding like all the other embeddings, i.e. call the embed() method over our sentences.

In [42]:
sentence = Sentence('The grass is green .')

stacked_embeddings.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)
Token: 1 The
tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  0.8278,  0.2706, -0.1442,  0.0194, -0.0178, -0.1984,
         0.2781, -0.0184,  0.1029,  0.1158, -0.1962,  0.0411, -0.0870, -0.0791,
         0.0296,  0.1644, -0.0409,  0.2311,  0.1465,  0.0091,  0.0131,  0.0124,
        -0.1566, -0.0184,  0.1592, -0.1480,  0.3937, -0.1875,  0.0735,  0.2009,
         0.0114,  0.1551, -0.0040, -0.0098, -0.0510,  0.0072, -0.0879,  0.0811,
        -0.4168, -0.0145,  0.1456,  0.0232,  0.0549,  0.0841,  0.1690, -0.0598,
         0.2400,  0.0479,  0.0130, -0.0055,  0.0593, -0.0082],
       grad_fn=<CatBackward>)
Token: 2 grass
tensor([-0.8135,  0.9404, -0.2405, -0.1350,  0.0557,  0.3363,  0.0802, -0.1015,
        -0.5478, -0.3537,  0.0734,  0.2587,  0.1987, -0.1433,  0.2507,  0.4281,
         0.1950,  0.5346,  0.7424,  0.0578, -0.3178,  0.9436,  0.8145, -0.0824,
         0.6166,  0.7284, -0.3262, -1.3641,  0.1232,  0.5373, -0.5123,  0.0246,
         1.0822, -0.2296,  0.6039,  0.5541, -0.9610,  0.4803,  0.0022,  0.5591,
        -0.1637, -0.8468,  0.0741, -0.6216,  0.0260, -0.5162, -0.0525, -0.1418,
        -0.0161, -0.4972, -0.5534, -0.4037,  0.5096,  1.0276, -0.0840, -1.1179,
         0.3226,  0.4928,  0.9488,  0.2040,  0.5388,  0.8397, -0.0689,  0.3136,
         1.0450, -0.2267, -0.0896, -0.6427,  0.6443, -1.1001, -0.0096,  0.2668,
        -0.3230, -0.6065,  0.0479, -0.1664,  0.8571,  0.2335,  0.2539,  1.2546,
         0.5472, -0.1980, -0.7186,  0.2076, -0.2587, -0.3650,  0.0834,  0.6932,
         0.1574,  1.0931,  0.0913, -1.3773, -0.2717,  0.7071,  0.1872, -0.3307,
        -0.2836,  0.1030,  1.2228,  0.8374, -0.1426,  0.2911,  0.1795, -0.0962,
        -0.0054,  0.2363, -0.0588, -0.2387,  0.1687,  0.0712,  0.0498,  0.1145,
         0.1077, -0.0975,  0.0795,  0.1341, -0.3521,  0.1275,  0.2990,  0.1026,
         0.0228,  0.0045, -0.2239,  0.1283, -0.2805,  0.0441, -0.0685, -0.0898,
        -0.0311,  0.0305, -0.0341, -0.1011, -0.1210,  0.0716, -0.0741,  0.1243,
        -0.0030,  0.2048,  0.0156,  0.1398,  0.0957, -0.0992, -0.1233,  0.1512,
        -0.0539,  0.0871, -0.0730,  0.0506,  0.0623,  0.1372],
       grad_fn=<CatBackward>)
Token: 3 is
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00, -4.0244e-01,  4.6691e-01,
         2.1816e-01, -7.4864e-02,  4.7332e-01,  8.0996e-02, -2.2079e-01,
        -1.2808e-01, -1.1440e-01,  5.0891e-01,  1.1568e-01,  2.8211e-02,
        -3.6280e-01,  4.3823e-01,  4.7511e-02,  2.0282e-01,  4.9857e-01,
        -1.0068e-01,  1.3269e-01,  1.6972e-01,  1.1653e-01,  3.1355e-01,
         2.5713e-01,  9.2783e-02, -5.6826e-01, -5.2975e-01, -5.1456e-02,
        -6.7326e-01,  9.2533e-01,  2.6930e-01,  2.2734e-01,  6.6365e-01,
         2.6221e-01,  1.9719e-01,  2.6090e-01,  1.8774e-01, -3.4540e-01,
        -4.2635e-01,  1.3975e-01,  5.6338e-01, -5.6907e-01,  1.2398e-01,
        -1.2894e-01,  7.2484e-01, -2.6105e-01, -2.6314e-01, -4.3605e-01,
         7.8908e-02, -8.4146e-01,  5.1595e-01,  1.3997e+00, -7.6460e-01,
        -3.1453e+00, -2.9202e-01, -3.1247e-01,  1.5129e+00,  5.2435e-01,
         2.1456e-01,  4.2452e-01, -8.8411e-02, -1.7805e-01,  1.1876e+00,
         1.0579e-01,  7.6571e-01,  2.1914e-01,  3.5824e-01, -1.1636e-01,
         9.3261e-02, -6.2483e-01, -2.1898e-01,  2.1796e-01,  7.4056e-01,
        -4.3735e-01,  1.4343e-01,  1.4719e-01, -1.1605e+00, -5.0508e-02,
         1.2677e-01, -1.4395e-02, -9.8676e-01, -9.1297e-02, -1.2054e+00,
        -1.1974e-01,  4.7847e-02, -5.4001e-01,  5.2457e-01, -7.0963e-01,
        -3.2528e-01, -1.3460e-01, -4.1314e-01,  3.3435e-01, -7.2412e-03,
         3.2253e-01, -4.4219e-02, -1.2969e+00,  7.6217e-01,  4.6349e-01,
        -9.8854e-02,  1.9071e-01,  1.7515e-01, -6.1285e-02, -1.5942e-01,
         1.1945e-01, -1.0100e-01, -1.4212e-01,  1.8019e-01,  2.1893e-02,
         7.7451e-02,  1.5791e-01,  1.2914e-01, -1.2903e-01,  4.8189e-02,
         1.6358e-01, -2.4926e-01,  1.2167e-01,  2.3625e-01,  4.6644e-02,
         1.1298e-01, -5.0726e-02, -2.3397e-01,  7.3020e-02, -2.4923e-01,
         4.4149e-02, -6.8543e-02, -8.9809e-02, -3.1075e-02,  3.0531e-02,
        -3.4061e-02, -1.0108e-01, -1.2104e-01,  7.1563e-02, -7.4108e-02,
         1.2427e-01, -2.9713e-03,  2.0480e-01,  1.5570e-02,  1.3985e-01,
         9.5693e-02, -9.9214e-02, -1.2332e-01,  1.5122e-01, -5.3919e-02,
         8.7101e-02, -7.2998e-02,  5.0628e-02,  6.2289e-02,  1.3722e-01],
       grad_fn=<CatBackward>)
Token: 4 green
tensor([-6.7907e-01,  3.4908e-01, -2.3984e-01, -9.9652e-01,  7.3782e-01,
        -6.5911e-04,  2.8010e-01,  1.7287e-02, -3.6063e-01,  3.6955e-02,
        -4.0395e-01,  2.4092e-02,  2.8958e-01,  4.0497e-01,  6.9992e-01,
         2.5269e-01,  8.0350e-01,  4.9370e-02,  1.5562e-01, -6.3286e-03,
        -2.9414e-01,  1.4728e-01,  1.8977e-01, -5.1791e-01,  3.6986e-01,
         7.4582e-01,  8.2689e-02, -7.2601e-01, -4.0939e-01, -9.7822e-02,
        -1.4096e-01,  7.1121e-01,  6.1933e-01, -2.5014e-01,  4.2250e-01,
         4.8458e-01, -5.1915e-01,  7.7125e-01,  3.6685e-01,  4.9652e-01,
        -4.1298e-02, -1.4683e+00,  2.0038e-01,  1.8591e-01,  4.9860e-02,
        -1.7523e-01, -3.5528e-01,  9.4153e-01, -1.1898e-01, -5.1903e-01,
        -1.1887e-02, -3.9186e-01, -1.7479e-01,  9.3451e-01, -5.8931e-01,
        -2.7701e+00,  3.4522e-01,  8.6533e-01,  1.0808e+00, -1.0291e-01,
        -9.1220e-02,  5.5092e-01, -3.9473e-01,  5.3676e-01,  1.0383e+00,
        -4.0658e-01,  2.4590e-01, -2.6797e-01, -2.6036e-01, -1.4151e-01,
        -1.2022e-01,  1.6234e-01, -7.4320e-01, -6.4728e-01,  4.7133e-02,
         5.1642e-01,  1.9898e-01,  2.3919e-01,  1.2550e-01,  2.2471e-01,
         8.2613e-01,  7.8328e-02, -5.7020e-01,  2.3934e-02, -1.5410e-01,
        -2.5739e-01,  4.1262e-01, -4.6967e-01,  8.7914e-01,  7.2629e-01,
         5.3862e-02, -1.1575e+00, -4.7835e-01,  2.0139e-01, -1.0051e+00,
         1.1515e-01, -9.6609e-01,  1.2960e-01,  1.8388e-01, -3.0383e-02,
        -4.3127e-01, -9.4622e-03, -1.3466e-01,  1.7466e-01,  2.9091e-01,
         1.2645e-01, -2.1931e-02,  2.5343e-01, -2.1666e-01, -4.6174e-02,
        -1.5573e-01, -1.8750e-01,  5.3911e-02,  2.0990e-01, -1.2833e-01,
         1.4726e-01,  8.6105e-02, -1.1710e-01, -9.2155e-03, -3.9138e-02,
        -4.5328e-03, -1.7572e-01,  1.2242e-01, -1.7868e-01,  1.3713e-01,
        -1.7827e-01,  9.5867e-02,  1.7740e-01,  4.9754e-02, -1.4274e-01,
         5.9958e-02, -3.9933e-02, -8.3130e-02, -2.6269e-02, -4.5627e-02,
         2.1568e-01, -3.6900e-01, -1.4775e-01,  8.0866e-02,  3.0459e-02,
        -7.6398e-02,  3.0724e-01, -2.8195e-01, -7.1761e-02,  3.7784e-01,
        -2.3730e-01,  3.6788e-03, -2.2317e-02, -2.3767e-02, -1.9041e-02],
       grad_fn=<CatBackward>)
Token: 5 .
tensor([-3.3979e-01,  2.0941e-01,  4.6348e-01, -6.4792e-01, -3.8377e-01,
         3.8034e-02,  1.7127e-01,  1.5978e-01,  4.6619e-01, -1.9169e-02,
         4.1479e-01, -3.4349e-01,  2.6872e-01,  4.4640e-02,  4.2131e-01,
        -4.1032e-01,  1.5459e-01,  2.2239e-02, -6.4653e-01,  2.5256e-01,
         4.3136e-02, -1.9445e-01,  4.6516e-01,  4.5651e-01,  6.8588e-01,
         9.1295e-02,  2.1875e-01, -7.0351e-01,  1.6785e-01, -3.5079e-01,
        -1.2634e-01,  6.6384e-01, -2.5820e-01,  3.6542e-02, -1.3605e-01,
         4.0253e-01,  1.4289e-01,  3.8132e-01, -1.2283e-01, -4.5886e-01,
        -2.5282e-01, -3.0432e-01, -1.1215e-01, -2.6182e-01, -2.2482e-01,
        -4.4554e-01,  2.9910e-01, -8.5612e-01, -1.4503e-01, -4.9086e-01,
         8.2973e-03, -1.7491e-01,  2.7524e-01,  1.4401e+00, -2.1239e-01,
        -2.8435e+00, -2.7958e-01, -4.5722e-01,  1.6386e+00,  7.8808e-01,
        -5.5262e-01,  6.5000e-01,  8.6426e-02,  3.9012e-01,  1.0632e+00,
        -3.5379e-01,  4.8328e-01,  3.4600e-01,  8.4174e-01,  9.8707e-02,
        -2.4213e-01, -2.7053e-01,  4.5287e-02, -4.0147e-01,  1.1395e-01,
         6.2226e-03,  3.6673e-02,  1.8518e-02, -1.0213e+00, -2.0806e-01,
         6.4072e-01, -6.8763e-02, -5.8635e-01,  3.3476e-01, -1.1432e+00,
        -1.1480e-01, -2.5091e-01, -4.5907e-01, -9.6819e-02, -1.7946e-01,
        -6.3351e-02, -6.7412e-01, -6.8895e-02,  5.3604e-01, -8.7773e-01,
         3.1802e-01, -3.9242e-01, -2.3394e-01,  4.7298e-01, -2.8803e-02,
        -7.0088e-02, -4.7074e-02,  6.6940e-02,  1.0316e-02,  2.1736e-02,
        -2.0450e-02, -5.6332e-02, -5.3864e-02, -1.5728e-01, -2.1249e-01,
         5.9580e-03,  9.5492e-02,  2.4176e-01, -1.2302e-01, -6.6183e-02,
         6.8613e-02,  2.7138e-02,  5.8407e-02, -1.2849e-01,  1.0618e-02,
        -3.7659e-02,  2.9924e-02, -1.6977e-01, -1.0442e-01, -2.3273e-02,
         4.8496e-02,  1.4863e-01,  4.7484e-02,  9.8932e-02, -1.8453e-01,
         2.3484e-02, -1.0737e-02, -2.4515e-01, -2.4557e-02, -1.9638e-02,
         1.9689e-01, -6.4803e-02,  2.2207e-01,  1.9077e-01,  1.3438e-01,
        -1.8580e-01,  5.8607e-02,  1.6381e-01,  1.2031e-01,  1.2677e-01,
         1.0975e-01, -3.9136e-02,  4.5679e-02,  6.8463e-02,  7.7292e-04],
       grad_fn=<CatBackward>)

Words are now embedded using a concatenation of two different embeddings. This means that the resulting embedding vector is still a single PyTorch vector.
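
A quick way to verify this is to compare vector lengths; this assumes the embedding_length attribute exposed by Flair's embedding classes:

In [ ]:
# the stacked vector is the concatenation of its parts, so its length
# is the sum of the individual embedding lengths
print(glove_embedding.embedding_length)
print(character_embeddings.embedding_length)
print(stacked_embeddings.embedding_length)
print(len(sentence[0].embedding))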

Other Embeddings: BERT, ELMo, Flair

Next to standard WordEmbeddings and CharacterEmbeddings, Flair also provides classes for BERT, ELMo and Flair embeddings. These embeddings enable us to train truly state-of-the-art NLP models.

Flair Embeddings

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. Two key differences are that (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.

Recent advances in language modeling using recurrent neural networks have made it viable to model language as distributions over characters. By learning to predict the next character on the basis of previous characters, such models have been shown to automatically internalize linguistic concepts such as words, sentences, subclauses and even sentiment. In Flair, the internal states of a trained character language model are leveraged to produce a novel type of word embedding which the authors refer to as contextual string embeddings. The proposed embeddings have the distinct properties that they (a) are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (b) are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use. The authors conduct a comparative evaluation against previous embeddings and find that their embeddings are highly useful for downstream tasks: across four classic sequence labeling tasks they consistently outperform the previous state of the art. In particular, they significantly outperform previous work on English and German named entity recognition (NER), allowing them to report new state-of-the-art F1 scores on the CoNLL-03 shared task.

With Flair, we can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

In [44]:
from flair.embeddings import FlairEmbeddings

flair_embedding_forward = FlairEmbeddings('news-forward')

sentence = Sentence('The grass is green .')

flair_embedding_forward.embed(sentence)
Out[44]:
[Sentence: "The grass is green ." - 5 Tokens]

We can choose which embeddings we load by passing the appropriate string to the constructor of the FlairEmbeddings class. Numerous pre-trained contextual string embedding models are currently provided, with more coming; see the Flair documentation for the full list.

The recommendation is to combine both forward and backward Flair embeddings. Depending on the task, it is also recommended to add standard word embeddings into the mix. So, for most English tasks, the recommended setup is a StackedEmbeddings combination like the following:

In [46]:
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'), 
                                        FlairEmbeddings('news-forward'), 
                                        FlairEmbeddings('news-backward'),
                                       ])

We would use this embedding like all the other embeddings, i.e. call the embed() method over our sentences.
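
For completeness, the call looks the same as before:

In [ ]:
sentence = Sentence('The grass is green .')
stacked_embeddings.embed(sentence)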

BERT Embeddings

BERT embeddings were developed by Devlin et al. (2018) and are a different kind of powerful word embedding based on a bidirectional transformer architecture. Flair uses the Hugging Face implementation. The embeddings are wrapped into Flair's simple embedding interface, so that they can be used like any other embedding.

In [48]:
from flair.embeddings import BertEmbeddings

embedding = BertEmbeddings()

sentence = Sentence('The grass is green .')

embedding.embed(sentence)
Out[48]:
[Sentence: "The grass is green ." - 5 Tokens]
In [49]:
for i in sentence:
    print(i, i.embedding)
Token: 1 The tensor([-0.0323, -0.3904, -1.1946,  ...,  0.1305, -0.1365, -0.4323])
Token: 2 grass tensor([-0.3973,  0.2652, -0.1337,  ...,  0.3715,  0.1097, -1.1625])
Token: 3 is tensor([ 0.1374, -0.3688, -0.8292,  ...,  0.2533,  0.0294,  0.4293])
Token: 4 green tensor([-0.7722, -0.1152,  0.3661,  ...,  0.1575, -0.0682, -0.7661])
Token: 5 . tensor([ 0.1441, -0.1772, -0.5911,  ..., -1.4830,  0.1995, -0.0112])

We can load any of the pre-trained BERT models by providing the model string during initialization:

  • 'bert-base-uncased': English; 12-layer, 768-hidden, 12-heads, 110M parameters
  • 'bert-large-uncased': English; 24-layer, 1024-hidden, 16-heads, 340M parameters
  • 'bert-base-cased': English; 12-layer, 768-hidden, 12-heads, 110M parameters
  • 'bert-large-cased': English; 24-layer, 1024-hidden, 16-heads, 340M parameters
  • 'bert-base-multilingual-cased': 104 languages; 12-layer, 768-hidden, 12-heads, 110M parameters
  • 'bert-base-chinese': Chinese Simplified and Traditional; 12-layer, 768-hidden, 12-heads, 110M parameters
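
For instance, the multilingual model can be loaded by passing its id string; the same pattern works for every id in the list above:

In [ ]:
# load the multilingual BERT model by its id string
multilingual_bert_embedding = BertEmbeddings('bert-base-multilingual-cased')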

ELMo Embeddings

ELMo embeddings were presented by Peters et al. in 2018. They use a bidirectional recurrent neural network to predict the next word in a text. Flair uses the AllenNLP implementation. Because this implementation comes with a lot of sub-dependencies, which the Flair authors do not want to include in Flair, you first need to install the library via pip install allennlp before you can use these embeddings in Flair. Using them is as simple as using any other embedding type:

In [52]:
from flair.embeddings import ELMoEmbeddings

embedding = ELMoEmbeddings()

sentence = Sentence('The grass is green .')

embedding.embed(sentence)
Out[52]:
[Sentence: "The grass is green ." - 5 Tokens]

AllenNLP provides the following pre-trained models. To use any of them inside Flair, simply specify the embedding id when initializing the ELMoEmbeddings (see the example after the list).

  • 'small': English; 1024-hidden, 1 layer, 14.6M parameters
  • 'medium': English; 2048-hidden, 1 layer, 28.0M parameters
  • 'original': English; 4096-hidden, 2 layers, 93.6M parameters
  • 'pt': Portuguese
  • 'pubmed': English biomedical data; more information
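
For example, the small English model can be loaded by passing its id; this assumes, as in the Flair version used here, that the model id is the first constructor argument:

In [ ]:
# load the small English ELMo model by its id
small_elmo_embedding = ELMoEmbeddings('small')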

BERT and Flair Combined

We can very easily mix and match Flair, ELMo, BERT and classic word embeddings. We instantiate each embedding we wish to combine and use them in a StackedEmbedding.

For instance, let's say we want to combine the multilingual Flair and BERT embeddings to train a hyper-powerful multilingual downstream task model.

First, instantiate the embeddings we wish to combine:

In [54]:
from flair.embeddings import FlairEmbeddings, BertEmbeddings

flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

Now we instantiate the StackedEmbeddings class and pass it a list containing these three embeddings:

In [55]:
from flair.embeddings import StackedEmbeddings

stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

We use this embedding like all the other embeddings, i.e. call the embed() method over our sentences.

In [56]:
sentence = Sentence('The grass is green .')

stacked_embeddings.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)
Token: 1 The
tensor([-1.4812e-07,  4.5007e-08,  6.0273e-07,  ...,  3.8287e-01,
         4.7210e-01,  2.9850e-01])
Token: 2 grass
tensor([ 1.6254e-04,  1.8764e-07, -7.9034e-09,  ...,  8.5283e-01,
        -5.0724e-02,  3.4476e-01])
Token: 3 is
tensor([-2.4521e-04,  3.4869e-07,  5.5841e-06,  ..., -1.8283e-01,
         7.1532e-01,  5.0825e-03])
Token: 4 green
tensor([8.3005e-05, 4.7261e-08, 5.7316e-07,  ..., 1.0157e+00, 7.5358e-01,
        1.1230e-01])
Token: 5 .
tensor([-8.3244e-07,  1.6451e-07, -1.7201e-08,  ..., -6.0930e-01,
         9.0591e-01,  1.7857e-01])

Words are now embedded using a concatenation of three different embeddings. This means that the resulting embedding vector is still a single PyTorch vector.

Document Embeddings

Document embeddings are different from word embeddings in that they provide one embedding for an entire text, whereas word embeddings provide embeddings for individual words.
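
As a minimal sketch of the idea, assuming the DocumentPoolEmbeddings class, which pools (by default averages) the word embeddings of a sentence into a single vector:

In [ ]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# pool GloVe word embeddings into one vector per sentence
document_embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')])

sentence = Sentence('The grass is green .')
document_embeddings.embed(sentence)

# one vector for the whole sentence rather than one vector per token
print(sentence.get_embedding())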

...

(C) 2018-2019 by Damir Cavar