Parth

| Technical Writer / Review: ABCOM Team | Level: Intermediate | Banner Image Source: Internet |

Machine Learning has made it possible to create paintings that are not easily distinguishable from those of great painters such as Picasso and Van Gogh. A GAN can generate a new painting from scratch or apply a famous painter’s style to your own photos. If this is possible for image generation, can we extend it to text generation? With GPT-2 (and the latest GPT-3, which was not yet commercially available at the time of this writing), it is possible to generate text that matches the semantics and writing style of talented authors of the past and present. In this tutorial, I will show you how to make optimal use of GPT-2’s capabilities to generate a novel in the style of Shakespeare.

Developed by OpenAI, GPT-2 is a large-scale transformer-based language model. OpenAI trained it on a large corpus of text: 8 million high-quality web pages.

“GPT-2 achieves state-of-the-art scores on a variety of domain-specific language modeling tasks. Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test; this is known as the “zero-shot” setting. GPT-2 outperforms models trained on domain-specific data sets (e.g. Wikipedia, news, books) when evaluated on those same data sets.” - OpenAI Team

GPT-2 can be used for many language modeling tasks such as machine translation, summarization, and question answering. It has shown highly competitive performance compared to models trained for a specific purpose or domain.

“We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.” - OpenAI Team

We will use the GPT-2 model for text generation. By the end of this tutorial, you will know how to fine-tune this model to generate a novel in the writing style of Shakespeare. To give you some further motivation, here is a partial sample of the model’s final output.

Image 01

If you have read Shakespeare, you will appreciate how closely the generated text matches his style of writing. I will show you how to generate such text and also share techniques for improving its quality further.

By the end of this tutorial, you may proudly say, “I am a writer, anything you say or do may be used in a story.”

Project Description

A neural network trained on a huge corpus of text over many epochs learns the semantics of that text, just the way a human being absorbs somebody’s style of writing by reading it again and again. In this tutorial, we will use GPT-2, a pre-trained model by OpenAI, to generate text. We will fine-tune the network on a Shakespeare dataset and later ask it to generate text resembling his writing.

Installing packages

Install the transformers package using pip from the GitHub repository of Hugging Face transformers.

!pip install git+https://github.com/huggingface/transformers

Ensure that you have the latest version of pyarrow by running the pip upgrade command.

!pip install --upgrade pyarrow

This library provides a Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem.

Loading Dataset

Andrej Karpathy, a founding member of OpenAI, has done extensive work on text generation. He has created several text datasets and made them publicly available for research and experimentation. We are going to use Andrej’s cleaned-up Tiny Shakespeare dataset for our experimentation. Load the dataset from his GitHub repository using the wget command:

!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

The wget command saves the downloaded data as input.txt in the Colab environment (at /content/input.txt). You may also keep a copy in your Google Drive, as shown below.
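If you want that persistent copy in Google Drive, a minimal sketch (assuming a standard Colab session and the default MyDrive folder) looks like this:

# mount Google Drive in the Colab session (you will be asked to authorize access)
from google.colab import drive
drive.mount('/content/drive')

# copy the downloaded dataset to Drive; adjust the target folder to your liking
!cp /content/input.txt /content/drive/MyDrive/input.txt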

Create a directory called output in your Colab environment to save the tokenizer and the model.

!mkdir output

To fine-tune the dataset, you will need to run a utility (run_language_modeling.py) provided in the Hugging Face library. Load this utility using the following command:

!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py

With the packages installed and the text data loaded, it is time to fine-tune GPT-2 for generating texts for the Shakespeare dataset.

Fine Tuning Model

Run the run_language_modeling.py script using following command:

!python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file='/content/input.txt' \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=2

We specify the output directory for the model, the model type, the model name, the path to our training data, the per-GPU batch size, the checkpoint save interval (save_steps) and the number of training epochs.

Now that we have fine-tuned the GPT-2 model on our dataset and saved it to the output directory, we can start using it for text generation.

Loading Tokenizer and Model

To load the tokenizer, we use GPT2Tokenizer from the transformers package. Likewise, to load the model we use GPT2LMHeadModel.

from transformers import GPT2Tokenizer, GPT2LMHeadModel
 
tokenizer = GPT2Tokenizer.from_pretrained('/content/output')
model = GPT2LMHeadModel.from_pretrained('/content/output')

We now have our fine-tuned model and tokenizer loaded in our environment. Let us start generating some text.

Generating Text

To generate text, we give the model some text as a seed and ask it to predict the next word. We then append the generated word to the seed and ask for the next word, and so on, repeating the process until the model has produced the desired number of words. In practice there are several techniques and algorithms for choosing the next word. I will show you how each of them behaves on our dataset; we will examine the generated output and finally compare the results to find the best generated text. A minimal hand-written version of the basic loop is sketched below; the rest of the tutorial relies on the model’s built-in generate method, which implements all of these decoding strategies for us. We start with greedy search.
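The sketch below is only illustrative and is not part of the tutorial code; it uses greedy selection and assumes the fine-tuned model and tokenizer loaded in the previous section.

import torch

def generate_greedy(model, tokenizer, seed_text, num_tokens=50):
    # encode the seed text and extend it one token at a time
    ids = tokenizer.encode(seed_text, return_tensors='pt')
    with torch.no_grad():
        for _ in range(num_tokens):
            logits = model(ids)[0]                 # scores for every vocabulary token
            next_id = logits[0, -1].argmax()       # greedy choice: the most probable token
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])

print(generate_greedy(model, tokenizer, 'The King must leave the throne now.'))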

Greedy Search

This is a very basic search algorithm which always selects the word with the highest probability as the next word.

We first use the tokenizer to encode the seed text into numeric values. Note that the seed text is the starting point for our novel. We will use the following seed:

'The King must leave the throne now.'

The text is encoded using the following statement:

ids = tokenizer.encode('[BOS] The King must leave the throne now . [EOS]',
                      return_tensors='pt')

The [BOS] and [EOS] tags mark the beginning and end of the seed sentence. We generate the output by calling the generate method on the fine-tuned model.

greedy_outputs = model.generate(ids, max_length=300)

Note that max_length=300 asks the model to generate a sequence of up to 300 tokens, including the seed. We print the output on the console:

print("Output:\n" + 100 * '-')
for i, greedy_output in enumerate(greedy_outputs):
  print("\n"+"==="*10)
  print("{}: {}".format(i+1, tokenizer.decode(greedy_output, skip_special_tokens=False)))

This is the partial output:

1: [BOS] The King must leave the throne now. [EOS]
 
KING RICHARD II:
I will not.
 
BOS:
I will not.
 
KING RICHARD II:
I will not.
 
BOS:
I will not.
 
KING RICHARD II:
I will not.
 
BOS:
I will not.

This output is no good to us. So, we will now try the next algorithm - beam search.

Beam Search

Unlike greedy search, which always picks the single most probable word, beam search keeps several candidate sequences (beams) at each step. It multiplies the word probabilities along each candidate sequence and finally selects the sequence with the highest overall probability. The following statement performs a beam search.

beam_output = model.generate(
    ids, 
    max_length=300, 
    num_beams=4, 
    early_stopping=True
)

We set num_beams to be greater than 1 and early_stopping to true, so that generation finishes when all beam hypotheses reach the EOS token.

We print the generated output using following statement:

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

The partial output is shown here:

[BOS] The King must leave the throne now. [EOS]
 
DUKE VINCENTIO:
O, my lord!
 
DUKE VINCENTIO:
My lord!
 
DUKE VINCENTIO:
My lord!
 
DUKE VINCENTIO:
My lord!
 
DUKE VINCENTIO:
My lord!

Once again, the output is no good.
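A common remedy for this kind of repetition, with either search strategy, is the no_repeat_ngram_size parameter of generate, which forbids any n-gram from appearing twice in the output. As a small illustrative sketch (not used in the rest of the tutorial), we could rerun beam search with that constraint:

# rerun beam search, forbidding any 2-gram from appearing twice in the output
beam_output_norepeat = model.generate(
    ids,
    max_length=300,
    num_beams=4,
    no_repeat_ngram_size=2,
    early_stopping=True
)
print(tokenizer.decode(beam_output_norepeat[0], skip_special_tokens=True))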

This shows that greedy and beam search alone do not give us good results. So, let us try some other techniques.

Sampling

Sampling means randomly picking the next word according to its conditional probability distribution. We will add sampling now to our text generation algorithm and observe the results.
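As a toy illustration, independent of the tutorial code, sampling simply draws the next word at random in proportion to its probability:

import torch

# toy next-word distribution over five candidate words
probs = torch.tensor([0.40, 0.25, 0.15, 0.12, 0.08])

# draw one word index at random according to these probabilities
next_word = torch.multinomial(probs, num_samples=1)
print(next_word.item())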

Our model and tensors are in PyTorch (we encoded the seed with return_tensors='pt'), so we set the random seed with torch to make the sampling results reproducible.

import torch

We fix a seed for text generation:

torch.manual_seed(0)

Trying another value for seed would produce different results. So, later on, you may like to experiment with the seed value. We generate the text using the following code:

sample_output = model.generate(
    ids, 
    do_sample=True, 
    max_length=300
)

In the preceding statement, we set the do_sample parameter to True, asking the generator to use sampling. There are additional sampling parameters such as top_k, which we will try later.

You print the generated text using the following code:

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

This is the partial output:

[BOS] The King must leave the throne now. [EOS]
MAMMAR:
Then I'll leave him to his son.
 
JOHN:
Ay, sir, I will, but that would ruin the crown.
 
BIONDELLO:
The throne is mine, sir, and it is mine; I will not have it till
the King of Bohemia gives me power again.
 
BIONDELLO:
Then he's my son: henceforth, my brother,
Let me be king, or else you would have him:
He'll be king till your father returns; I shall have him
in charge.
 
JOHN:
No harm
for you, but the thing is very strange.
 
BIONDELLO:
Why do you come?
 
KING RICHARD II:
Why, when thou dost have me.
 
BIONDELLO:
I am your new husband
And not your new king, and neither can I be king,
Nor you, my lord. What, here you go?
 
JOHN:
I do not stay
Within twelve months I am king, and have my kingdom
Made of your sovereignty: here I am all the good:
The King I have made my will on his head and
He hath power, my wife's will, and a crown of
the earth.
 
BIONDE

As you can see, sampling produces much better results than the previous approaches, and the text is starting to make some sense. This is just a first taste of how adding a few parameters can produce somewhat meaningful sentences; there is a lot more to come.
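Another sampling parameter worth experimenting with, although it is not used elsewhere in this tutorial, is temperature, which rescales the probability distribution before sampling: values below 1.0 make the output more conservative, values above 1.0 make it more adventurous. A minimal sketch using the same ids as before:

# temperature below 1.0 sharpens the distribution and makes sampling less random
sample_output_temp = model.generate(
    ids,
    do_sample=True,
    max_length=300,
    temperature=0.7
)
print(tokenizer.decode(sample_output_temp[0], skip_special_tokens=True))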

Top-K Sampling

Top-k and nucleus sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019) have recently become popular alternative sampling procedures. Both sample from a truncated neural language model distribution, differing only in the strategy for deciding where to truncate.

In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words.
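To make the idea concrete, here is a toy sketch, independent of the tutorial code, of top-k truncation on a small made-up distribution:

import torch

# toy next-word distribution over five candidate words
probs = torch.tensor([0.40, 0.25, 0.15, 0.12, 0.08])
k = 3

# keep only the k most probable words and renormalize their probabilities
top_probs, top_idx = probs.topk(k)
top_probs = top_probs / top_probs.sum()
print(top_idx.tolist(), top_probs.tolist())   # the next word is sampled from this reduced set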

We need to add a top_k parameter in the generating function to use top-k sampling.

torch.manual_seed(0)
 
# set top_k to 50
sample_output2 = model.generate(
    ids, 
    do_sample=True, 
    max_length=300, 
    top_k=50
)
 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output2[0], skip_special_tokens=True))

This is the generated output:

[BOS] The King must leave the throne now. [EOS]
 
VIRGILIA:
I have had it of him, in my heart.
 
ROMEO:
The Prince and his lords have had 'em to bed;
And you'll find him, or I'll say you will meet.
And if your lords be dead, so shall 'em!
 
DUKE VINCENTIO:
God save him that did leave his kingdom.
His love lives and he goes; to take in his own life.
 
FLORUS:
He is more than an advocate,
But he makes up the rest: He cares nothing for his father.
He is to go himself to prison to do it.
 
ROMEO:
To-night he doth not dream with the rest.
 
DUKE VINCENTIO:
He is too weak to be found
Until he wakes; yet he lives:
And yet he lives not for him either.
 
FLORUS:
He hath made a man poor: it is not one that would, should he live.
 
DUKE VINCENTIO:
He is too weak to live nor die for another: he is so weak,
That no one can take it from him, and for ever is.
 
ROMEO:
'Twas one that he made poor he should love for his own

After applying Top-K sampling, we notice that the characters in the play express some new ideas. On its own this may or may not be useful, but when we combine all the parameters later, we will see that it helps improve the quality of the generated text.

Top-p (Nucleus) Sampling

Top-p sampling selects from the smallest set of the most probable tokens whose cumulative probability exceeds a chosen threshold p. The threshold can be set freely, but keeping it above 0.9 usually gives satisfactory results.
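As a toy illustration, again independent of the tutorial code, nucleus sampling keeps only the smallest set of words whose cumulative probability reaches the threshold:

import torch

# toy next-word distribution, sorted in descending order
probs = torch.tensor([0.40, 0.25, 0.15, 0.12, 0.08])
p = 0.90

# keep the smallest prefix whose cumulative probability reaches p, then renormalize
cumulative = probs.cumsum(dim=0)
cutoff = (cumulative >= p).nonzero()[0].item() + 1   # number of words to keep
nucleus = probs[:cutoff] / probs[:cutoff].sum()
print(cutoff, nucleus.tolist())

Coming back to our fine-tuned model, we now generate text with top-p sampling: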

torch.manual_seed(0)
 
# deactivate top-k filtering (top_k=0) and sample only from the smallest set of
# words whose cumulative probability exceeds 92%
sample_output3 = model.generate(
    ids, 
    do_sample=True, 
    max_length=300, 
    top_p=0.92,
    top_k=0
)
 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output3[0], skip_special_tokens=True))

This is the output:

[BOS] The King must leave the throne now. [EOS]
LADY:
O God, my lord, there is a better way to get rid of her.
 
KING EDWARD IV:
She may keep them both.
 
BOS:
I do think so, that my life is no worse than my enemies'
earns, that I have never seen her so angry.
 
KING EDWARD IV:
I know the best; for I have never
done her harm. I did not see her as angry as I did
love her, or hate her as I did love the other.
But since the truth is mine, I am most sorry to see her
shout so, and my heart most happy to see
her in it.
 
GLOUCESTER:
Nay, your lordship, you may, or may you not, do the
same.
 
KING EDWARD IV:
Let her suffer.
 
GLOUCESTER:
Nay, your lordship, your lordship, you may not, to
a fault, to have love of her.
 
KING EDWARD IV:
Why, my dear cousin, you do wish to see
Her; and if she can, I think she'll never be able to bear it.
 
LADY:
Ay, that, if I may not keep her a piece; that

This output depicts a delightful conversation between King Edward IV, Gloucester, and a lady. The sentences make sense, which shows that our model performs much better with Top-p sampling.

Combining Sampling

It is time to combine everything we did previously and see how the combination of all the parameters alters the conversation between the characters of the play.

torch.manual_seed(0)
 
# set top_k = 40 and top_p = 0.95 
final_outputs = model.generate(
    ids,
    do_sample=True, 
    max_length=300, 
    top_k=40, 
    top_p=0.95, 
)
 
print("Output:\n" + 100 * '-')
for i, final_output in enumerate(final_outputs):
  print("{}: {}".format(i, tokenizer.decode(final_output, skip_special_tokens=True)))

This is the output:

0: [BOS] The King must leave the throne now. [EOS]
The King's daughter shall come down to him,
To see him in this world and to say:
'The true King will stand and stand with us,
And he shall be revenged on his father,
For he hath so far made his cause.
This time I'll have to give you counsel to the world;
And in this world you shall find him.'
 
MENENIUS:
The world shall be so mad and so proud that he
Is so proud of his father's power.
 
FRIAR LAURENCE:
'This I'll be' the King's daughter,
And this we shall be.'
 
MENENIUS:
'Then take not the crown, your oath,
Or you shall be disgraced.'
 
MENENIUS:
'I have been told by our own consuls that I cannot,
In good faith, go to your sister's house;
'For I have not done the right by her honour
To give thee my crown.'
 
FRIAR LAURENCE:
'I will, my lord.'
 
MENENIUS:
'No, by my wife!
You have the means, you know me.'
 
FRIAR LAURENCE:
'And by my queen:'
 
MENENIUS:
'I know

This is the final output, generated by combining everything we learned in this tutorial. As you can see, it has a coherent sequence of events and meaningful sentences.

Training for More Epochs

A human being picks up somebody’s language by repeatedly listening to the same person. Similarly, training a deep neural network for many epochs over the same text makes the network learn the language of that text. GPT-2 has already been well trained on a large corpus; we will now increase the number of fine-tuning epochs on the Shakespeare data to see if we can produce better results. I tried this for 10 and 50 epochs, rerunning the fine-tuning command as shown below.
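To train for more epochs, rerun the fine-tuning script with a larger --num_train_epochs value; because the output directory already contains the two-epoch model, you also need the --overwrite_output_dir flag. A sketch for ten epochs is shown here (use 50 for the second experiment); afterwards, reload the tokenizer and model from the output directory as before.

!python run_language_modeling.py \
    --output_dir=output \
    --overwrite_output_dir \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file='/content/input.txt' \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=10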

Output of Ten Epochs Training:

0: [BOS] The King must leave the throne now. [EOS]
But for what he hath left to do, the King may give again.
 
HENRY BOLINGBROKE:
Well, what may Clarence give?
 
LUCIO:
For that which Clarence hath left thee.
 
DUKE VINCENTIO:
But what shall Edward give?
 
LUCIO:
Thy son, and heir to the throne.
 
DUKE VINCENTIO:
And heir to the unredeemed throne,
And heir of that unrightful usurp'd throne?
 
LUCIO:
Thy son, and heir to the unlook'd-for throne.
 
DUKE VINCENTIO:
And heir of that unlook'd for usurp'd throne,
And unlook'd for heir to the unredeemed throne,
And unlook'd for heir to the unredeemed throne,
And unlook'd for heir to the unredeemed throne,
That so shall he give unto thyself and thyself again.
This is the man that will give back unto thee.
 
LUCIO:
Let Clarence live.
 
DUKE VINCENTIO:
Then do I give thee my son.
 
LUCIO:
Henceforth thou shalt continue in my stead.

Output of Fifty Epochs Training:

0: [BOS] The King must leave the throne now. [EOS]
 
First Senator:
And shall the King leave the realm?
 
QUEEN MARGARET:
Ay, nor shall he sit by the side of the sun.
 
Second Senator:
Then, Mars, the next nearest thing to glory
Is thy succession ten days separated;
And thou, that art so far from Mars,
Am in no sense inferior to any
That makes succession wish.
 
First Senator:
So then, for succession but thy holding of the crown,
Thou art no king: be that king indeed,
And I challenge thee as king indeed
For saying the queen is but queen.
 
QUEEN MARGARET:
My title, being but queen, I shall not be able
To sunder thy title until thou call'st forth
A champion to the people's champion: I'll not push
My neck between her teeth to prove a woman,
Till that my title be verified with those words
Whom poets have taught me to curse.
 
First Senator:
Now, this thy title to the people, call forth
Thy fiery arm; and when that arm is forth,
Let those words be my witness as high
As to a thousand voices.
 
Second Senator:
And when that arm is out, let those words alone;
Let those same words again

You may compare the above outputs with our earlier two-epoch training. In most situations, you will find that two epochs of training, combined with different sampling techniques and careful tuning of the sampling parameters, produces acceptable output. Text generation is still not considered fully mature. The latest GPT-3 claims to produce far better results than its predecessors, and OpenAI has voiced concerns that malicious actors may use it to generate fake news.

Summary:

This tutorial taught you how to use GPT-2 for text generation. You created a play in the style of Shakespeare. You learned about greedy search and beam search and how to use these algorithms in your text generation code. You also learned various sampling techniques such as random sampling, top-k sampling and top-p sampling, and saw the benefit of each while generating text. Finally, you saw how a combination of all these algorithms and techniques produced meaningful, well-ordered text, which was our primary aim.

Source: Download the project source from our Repository.
