Arka

| Technical Writer / Review: ABCOM Team | Level: Beginner | Banner Image Source: Internet |

Introduction

Every day we come across many interesting online articles, news items, and blogs, but we hardly find time to read them fully. A quick glance gives us the gist of a topic, and several questions then arise in our minds. In this tutorial, I will show you how a pre-trained ML model can answer your questions without you having to read the entire passage.

Before I discuss this pre-trained model, let us understand how the model was trained. To train any machine learning model, you need a proper dataset. To develop the question-answer model, a large corpus of text was pre-processed to mark, for a set of predefined questions, where each answer lies in the passage. The model was then trained on this dataset and was found to give satisfactory answers to previously unseen questions. This dataset is called SQuAD (Stanford Question Answering Dataset), and the trending Transformers technology was used for the language learning. I will now briefly introduce these technologies: BERT (Bidirectional Encoder Representations from Transformers), SQuAD, and Hugging Face Transformers.

BERT

BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper¹ can be found in the references section. A BERT model fine-tuned on SQuAD and other labeled QnA datasets is available for public use.

SQuAD

SQuAD was created by Stanford for Q&A model training. It contains questions posed by crowd workers on a set of Wikipedia articles. The answer to each question is a segment of text, or span, from the corresponding passage; if no such span exists, the question is unanswerable. The latest version, SQuAD 2.0, adds 50,000 new questions to the existing repository of 100,000 questions in SQuAD 1.1. The new unanswerable questions were written adversarially by crowd workers to look similar to answerable ones. SQuAD is an ongoing effort and is continuously updated; one site that maintains the releases, the SQuAD 2.0 site², is given in the references section.
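To make the dataset concrete, here is an abridged sketch of what a single SQuAD 2.0 question record looks like. The field names follow the published JSON format; the values here are invented for illustration:

```python
# A hypothetical, abridged SQuAD 2.0-style record (values invented for illustration)
record = {
    "question": "Where was the outbreak first identified?",
    "id": "xxxx0001",
    "answers": [
        # answer_start is the character offset of the answer span in the passage
        {"text": "Wuhan, China", "answer_start": 62},
    ],
    "is_impossible": False,  # SQuAD 2.0 flag marking unanswerable questions
}
print(record["answers"][0]["text"])
```

For unanswerable questions, `is_impossible` is true and the answers list is empty; this is exactly what forces a model trained on SQuAD 2.0 to learn when to abstain.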

Hugging Face Transformers

The Hugging Face Transformers³ package provides state-of-the-art general-purpose architectures for natural language understanding and natural language generation. They host dozens of pre-trained models that you can use right out of the box via transfer learning.

Now that you have a brief overview of the involved technologies, I will proceed with the project development.

Creating Project

Create a Colab project and rename it to BERT QnA. Then install the required modules and frameworks. Install the Hugging Face Transformers library using the following pip command:

!pip install transformers

You will be using the tokenizer (BertTokenizer) and QnA module (TFBertForQuestionAnswering) from this library.

from transformers import BertTokenizer, TFBertForQuestionAnswering

Import the other libraries required in the project.

import tensorflow as tf
from google.colab import drive
import requests

Loading Model and Tokenizer

The Hugging Face library provides several pre-trained BERT models. We use the model trained on SQuAD. We load the tokenizer and the model using the following code:

modelName = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = BertTokenizer.from_pretrained(modelName)
model = TFBertForQuestionAnswering.from_pretrained(modelName)

Context and Query

To show you how to use the above model, I will take a passage on the current trending topic of COVID-19 from Wikipedia. We define the passage in the source variable:

source=r"""The COVID‑19 pandemic, also known as the coronavirus pandemic, is an ongoing global pandemic of coronavirus disease 2019 (COVID‑19). The outbreak was first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern on 30 January 2020 and a pandemic on 11 March. As of 6 August 2020, more than 18.7 million cases of COVID‑19 have been reported in more than 188 countries and territories, resulting in more than 706,000 deaths; more than 11.3 million people have recovered. The virus is primarily spread between people during close contact, most often via small droplets produced by coughing, sneezing, and talking. The droplets usually fall to the ground or onto surfaces rather than travelling through air over long distances. However, the transmission may also occur through smaller droplets that are able to stay suspended in the air for longer periods of time in enclosed spaces, as typical for airborne diseases. Less commonly, people may become infected by touching a contaminated surface and then touching their face. It is most contagious during the first three days after the onset of symptoms, although spread is possible before symptoms appear, and from people who do not show symptoms. Common symptoms include fever, cough, fatigue, shortness of breath, and loss of sense of smell. Complications may include pneumonia and acute respiratory distress syndrome. The time from exposure to onset of symptoms is typically around five days but may range from two to fourteen days. There is no known vaccine or specific antiviral treatment. Primary treatment is symptomatic and supportive therapy.
Recommended preventive measures include hand washing, covering one's mouth when coughing, maintaining distance from other people, wearing a face mask in public settings, disinfecting surfaces, increasing ventilation and air filtration indoors, and monitoring and self-isolation for people who suspect they are infected. Authorities worldwide have responded by implementing travel restrictions, lockdowns, workplace hazard controls, and facility closures in order to slow the spread of the disease. Many places have also worked to increase testing capacity and trace contacts of infected persons. The pandemic has caused global social and economic disruption, global famines affecting 265 million people."""

We declare a typical question on the above passage:

question =r"""What are the symptoms of COVID-19?"""

Next, we will need to combine this question and the passage, tokenize and encode the entire text.

Preprocessing Text

We insert the separator token [SEP] between the question and the passage so that the model can tell them apart. We then encode the combined text with the tokenizer.

input_text = question + " [SEP] " + source
input_ids = tokenizer.encode(input_text)     # token IDs, with [CLS]/[SEP] added
input_1 = tf.constant(input_ids)[None, :]    # add a batch dimension

Print the tokens and the decoded output to see what it represents:

print(input_ids)
print(tokenizer.decode(input_ids))

The output is as follows:

Image01

Observe in the above output how the [SEP] tag separates the question and the passage. Thus, any time you want to ask an additional question about the passage, you will need to construct the combined text like this one. For your understanding, print the input_1 tensor, which is used later during model inference.

print (input_1)

The output is as follows:

Image02

Next, we create token type IDs (a segment mask) to distinguish the question tokens from the passage tokens. Token ID 102 is BERT's [SEP] token, so everything before the first [SEP] belongs to the question:

token_ids=[0 if i < input_ids.index(102) else 1 for i in range(len(input_ids))]
print(token_ids)

The output is as follows:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..., 1, 1, 1]

(The long run of 1s is truncated here; there is one entry per token in the combined text.)

Note that the tokens of the question are marked with 0, while the rest of the tokens, belonging to the passage, are marked with 1.
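To see what the list comprehension is doing, here is a toy run with made-up token IDs. The value 102 is the real ID of BERT's [SEP] token; the other numbers are arbitrary placeholders:

```python
# Toy token IDs: a 2-token "question", [SEP] (ID 102), a 3-token "passage", and a final [SEP]
toy_ids = [101, 2054, 2024, 102, 2023, 2003, 102]

# Everything before the first [SEP] gets segment 0 (question), the rest 1 (passage)
segment = [0 if i < toy_ids.index(102) else 1 for i in range(len(toy_ids))]
print(segment)  # [0, 0, 0, 1, 1, 1, 1]
```

Note that with this comprehension the first [SEP] itself falls into the passage segment, exactly as in the code above.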

Inference

We are now ready to proceed with the model inference.

answer = model({'input_ids':input_1,
             'token_type_ids': tf.convert_to_tensor([token_ids])})

We input the tokens tensor (input_1) and the mask to the earlier pre-trained model.

We process the output to determine the start and end indices of the answer within the passage. The model returns two sets of per-token scores: one for where the answer starts and one for where it ends. Taking the argmax of each set gives us the answer span:

startScores, endScores = answer
input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
startIndex = tf.math.argmax(startScores[0],0).numpy()
endIndex = tf.math.argmax(endScores[0],0).numpy()+1
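The idea behind this span extraction can be illustrated without TensorFlow. Suppose the model scored a four-token sequence as below (the scores are invented for illustration); the argmax of the start scores and the argmax of the end scores bound the answer span:

```python
# Hypothetical per-token scores for a 4-token sequence
tokens = ["include", "fever", "and", "cough"]
start_scores = [0.1, 5.2, 0.3, 0.4]   # highest at index 1 -> span starts at "fever"
end_scores   = [0.2, 0.3, 0.1, 4.8]   # highest at index 3 -> span ends at "cough"

# Pure-Python argmax; the +1 makes the end index usable in a slice
start = max(range(len(start_scores)), key=start_scores.__getitem__)
end = max(range(len(end_scores)), key=end_scores.__getitem__) + 1

print(" ".join(tokens[start:end]))  # fever and cough
```

The real code does exactly this, only with `tf.math.argmax` on the score tensors returned by the model.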

We will define a small helper function to merge WordPiece sub-tokens (marked with a leading ##) back into whole words.

def process(ans_str):
   # Splitting on '#' yields an empty string for each '#' in a '##'
   # marker; each empty string trims the space left by " ".join()
   new = ans_str.split('#')
   new_ans_str = ""
   for word in new:
       if word == "":
           new_ans_str = new_ans_str[:-1]
       new_ans_str += word
   return new_ans_str
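To see the helper in action, here is a standalone toy run on a WordPiece-style string (the function is repeated so that the snippet runs on its own):

```python
def process(ans_str):
    # Each '##' in the input produces an empty string when splitting on '#';
    # every empty string removes the preceding space before rejoining
    new = ans_str.split('#')
    new_ans_str = ""
    for word in new:
        if word == "":
            new_ans_str = new_ans_str[:-1]
        new_ans_str += word
    return new_ans_str

# WordPiece tokenizes "shortness" as "short" + "##ness"
print(process("short ##ness of breath"))  # shortness of breath
```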

We call the above-declared function to combine all tokens related to the answer.

ans=process("\n"+" ".join(input_tokens[startIndex:endIndex]))
print(ans)

The output is as follows:

fever , cough , fatigue , shortness of breath , and loss of sense of smell

Note that our question on the passage was “What are the symptoms of COVID-19?” The model has given us the symptoms, which you can verify are correct from the passage. You may try another question of your own on the passage. For this, you will need to construct the combined text as shown earlier and use the model for inference.

Next, I will create a generalized user interface for our model so that you can test its inference strength by selecting from several passages and asking it to answer a pre-defined set of questions for each. You will also be able to paste any passage of your choice and ask whatever questions come to mind.

Creating User Interface

The user interface is entirely console-based: you enter the passages and the questions thereupon in Colab itself, so you need not leave the development environment while running the project. First, we develop a utility function that takes a passage and a question.

Utility Function

The utility function, called QnA, takes two parameters: the passage (context) and the question for which you seek an answer. Here is the full function definition:

def QnA(context, question):
   # Load the pre-trained model and tokenizer (the weights are cached
   # after the first download; for efficiency you may instead load
   # these once, outside the function)
   modelName = 'bert-large-uncased-whole-word-masking-finetuned-squad'
   tokenizer = BertTokenizer.from_pretrained(modelName)
   model = TFBertForQuestionAnswering.from_pretrained(modelName)
   # Concatenate and preprocess for our BERT model
   input_text = question + " [SEP] " + context
   input_ids = tokenizer.encode(input_text)
   input_1 = tf.constant(input_ids)[None, :]
   # Token type IDs: 0 for the question, 1 for the passage
   token_ids = [0 if i < input_ids.index(102)
                   else 1 for i in range(len(input_ids))]
   # Model prediction
   answer = model({'input_ids': input_1,
                   'token_type_ids': tf.convert_to_tensor([token_ids])})
   startScores, endScores = answer
   input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
   startIndex = tf.math.argmax(startScores[0], 0).numpy()
   endIndex = tf.math.argmax(endScores[0], 0).numpy() + 1
   ans = process("\n" + " ".join(input_tokens[startIndex:endIndex]))
   return ans

The function performs essentially the same procedure that you followed in the earlier experiment on the COVID-19 passage; it simply combines all those steps into a single function.

Loading Passages

For your ready use, I have uploaded five different passages to our GitHub repository. Download them into your Colab project using the wget command:

!wget https://raw.githubusercontent.com/abcom-mltutorials/BERT-QnA/master/covid.txt
!wget https://raw.githubusercontent.com/abcom-mltutorials/BERT-QnA/master/google.txt
!wget https://raw.githubusercontent.com/abcom-mltutorials/BERT-QnA/master/UNO.txt
!wget https://raw.githubusercontent.com/abcom-mltutorials/BERT-QnA/master/sachin.txt
!wget https://raw.githubusercontent.com/abcom-mltutorials/BERT-QnA/master/NSS.txt

Now, create arrays and dictionaries for storing the passages, filenames, and questions:

sources = ["Covid-19", "Google", "U.N.O.", "Sachin Tendulkar",
           "National Service Scheme"]
txt_files = ["covid.txt", "google.txt", "UNO.txt", "sachin.txt", "NSS.txt"]
questions = {1: ["When was Covid-19 first detected?",
                 "Where was Covid-19 first detected?",
                 "How does Covid spread?",
                 "What are the symptoms of Covid infection?",
                 "What are the precautionary measures?"],
             2: ["Who is the current CEO of Google?",
                 "Who founded Google?",
                 "Where is the Google headquarter located?",
                 "When did IPO take place?"],
             3: ["How many members does the General Assembly now have?",
                 "Which country is not a member of the UNO?",
                 "Through what office does the General Secretary function?",
                 "What does UNESCO stand for?"],
             4: ["When was Sachin born?",
                 "Against which country did Sachin play his debut match?",
                 "At what age did Sachin make his debut?",
                 "What nickname did he get?"],
             5: ["What are Indian youth accused of?",
                 "What scheme was introduced to involve students in social service?",
                 "What works are undertaken under this scheme?",
                 "For what purpose youth hostels with cheap accommodation are set up?"]}

The application user interface will ask the user to select a passage by entering a number in the range 1 to 5. We write a function, getContext, to load the context corresponding to this choice:

# Function to load the context based on user input
def getContext(choice, txt_files):
 file = txt_files[choice-1]
 with open(file, "r") as f:
   context = f.read()
 return context
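You can exercise this mapping from menu choice to file contents without the downloaded passages, using a temporary file as a stand-in for covid.txt (the function is repeated so the snippet runs on its own):

```python
import os
import tempfile

def getContext(choice, txt_files):
    # Map the 1-based menu choice to a filename and read its contents
    with open(txt_files[choice-1], "r") as f:
        return f.read()

# Create a stand-in for one of the downloaded passage files
tmp = os.path.join(tempfile.gettempdir(), "covid.txt")
with open(tmp, "w") as f:
    f.write("The outbreak was first identified in December 2019.")

print(getContext(1, [tmp]))  # choice 1 reads the first file in the list
```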

We write another utility function that accepts the user's choice of question along with the passage and the list of pre-loaded questions, and returns the answer by calling our earlier defined QnA function. Here is the function definition:

# Function to answer the question based on user input
def getAnswer(choice, context, questions):
 if choice <= len(questions):
   # One of the pre-loaded questions
   query = questions[choice-1]
 else:
   # The user wants to ask their own question
   query = input("Enter your query: ")
 return QnA(context, query)

Now, we will define the user interface for accepting the passage and the query, using two nested loops. In the outer loop, the user either selects one of the preloaded passages or supplies their own: by pasting the text directly, by giving the path of a file on Google Drive, or by giving a raw-text URL such as one on GitHub. Once the context is selected, the inner loop lets the user pick one of the pre-loaded questions or ask an additional question of their own. If the user provided their own passage, they can ask any question on that context. We define the user interface using the following code segment:

while (True):
 print("CHOOSE YOUR SOURCE CONTEXT")
 for i in range(len(sources)):
   print(str(i+1) + ". " + sources[i])
 print("Enter " + str(len(sources)+1) + " to provide your own text source")
 print("Enter 0 to exit choosing source context")
 src_choice = int(input("Enter your choice: "))
 if (src_choice == 0):
   break
 elif (src_choice == len(sources)+1):
   while (True):
     print("1. Enter source text\n2. Enter text file from Drive")
     print("3. Enter text file from Github\nEnter 0 to exit")
     choice1 = int(input("Enter your choice: "))
     if (choice1 == 0):
       break
     elif (choice1 == 1):
       context = input("Enter your context/source: ")
     elif (choice1 == 2):
       drive.mount("/content/drive")
       path = input("Enter text file path: ")
       with open(path, "r") as f:
         context = f.read()
     elif (choice1 == 3):
       path = input("Enter URL: ")
       src = requests.get(path)
       context = src.text
     while (True):
       choice2 = int(input("Press 1 to ask a question OR 0 to exit: "))
       if (choice2 == 1):
         query = input("Enter your query: ")
         print(QnA(context, query))
       elif (choice2 == 0):
         break
 else:
   context = getContext(src_choice, txt_files)
   while (True):
     print("CHOOSE YOUR QUERY")
     for j in range(len(questions[src_choice])):
       print(str(j+1) + ". " + questions[src_choice][j])
     print("Enter " + str(len(questions[src_choice])+1) +
           " to ask your own question")
     print("Enter 0 to exit choosing query")
     q_choice = int(input("Enter your choice: "))
     if (q_choice == 0):
       break
     else:
       print(getAnswer(q_choice, context, questions[src_choice]))

Below are the outputs for the various choices a user might make.

Test 1

Output when the user selects from the given contexts and uses a pre-loaded question in the context.

Image03

Test 2

Output when the user inputs a text context and asks a question in the context.

Image04

Test 3

Output when the user inputs a context from google drive and asks a question in the context.

Image05

Test 4

Output when the user inputs a context from GitHub and asks a question in the context.

Image06

As you can see, the answers given by the model are accurate in all the cases shown. You may try the model on passages of your choice. If the model cannot find an answer to your question, it will reply with the “unanswerable” tag.

Summary

With the advent of transfer learning and pre-trained models, we have successfully used a pre-trained model to create a Q&A system. We used the model provided by Hugging Face, which is based on BERT, a popular implementation of transformers, and is fine-tuned on the SQuAD dataset. You tested the model on five different passages with a set of pre-defined questions, and it answered them with great accuracy. The application also provides a user interface for accepting any passage and the questions thereupon.

In this tutorial, I used a pre-trained model that is fine-tuned on the SQuAD dataset. However, to make the model more robust and improve its answering ability, you may fine-tune it yourself on the full latest release of the SQuAD dataset.

Source: Download the project from our Repository

References:

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. SQuAD2.0
  3. Transformers
