B. Srinivas

| Technical Writer: ABCOM Team | Technical Review: Aaditya Damle | Level: intermediate | Banner Image Source : Internet |

Introduction

Predicting whether a particular user is a potential buyer of your product, or forecasting the defaulters among your loan accounts, are typical classification problems in machine learning. ML practitioners use several widely used classification algorithms, to name a few: logistic regression, support vector machines (SVM), naïve Bayes, decision trees, random forests and so on. On many tabular datasets, random forest outperforms the others. Decision trees are the foundation of a random forest model. Every one of us uses a decision tree intuitively, even unconsciously, in our daily lives. In this tutorial, you will learn how to build decision trees and combine them into a forest called a ‘random forest’, using techniques that make the forest random enough to give good inferences.

Random forest is a supervised learning algorithm that produces strong results in many situations. It is widely used because of its simplicity and versatility. It is an ensemble of decision trees, and we can use it for both classification and regression problems.

I will first explain a decision tree.

Decision Trees

Consider a case where you must decide whether to quit your job. There are several important points and questions that you need to weigh while deciding. We can construct a decision tree for this case as shown below:

image01

As you can see in the above diagram, a decision tree breaks the problem down into several branches. The result is a tree with decision nodes and leaf nodes. As the diagram shows, the deeper the tree, the more specific the decision becomes.

The Random Forest Classifier

A random forest comprises many individual decision trees that work together as an ensemble. Each individual tree spits out a class prediction, and the class with the most votes becomes our model’s prediction. This is often called the wisdom of crowds: if 7 out of 10 trees say the prediction is abc, you can trust the crowd rather than any single tree. For this to work well, the predictions made by the individual trees should have low correlation with each other.
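
To make the voting concrete, here is a tiny sketch with made-up predictions from ten trees; the ensemble simply returns the most common vote:

from collections import Counter

# Hypothetical class predictions from ten individual trees for one sample
tree_votes = ['abc', 'abc', 'xyz', 'abc', 'xyz', 'abc', 'abc', 'xyz', 'abc', 'abc']

# The ensemble prediction is the most common vote: 'abc' (7 out of 10)
print(Counter(tree_votes).most_common(1)[0][0])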

I will now explain the implementation of this Random Forest by considering a concrete case study.

Case Study

People with cardiovascular disease, or those at high cardiovascular risk because of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or an already established disease, need early detection and management. A machine learning model can be of great help in taking preventive measures for such patients. Fortunately for us, Kaggle provides a dataset of cardiovascular patients. The dataset provides twelve vital parameters on each patient and relates them to mortality from heart failure. Such predictions can help doctors intervene early and prolong patients’ lives.

In this project, we will use the random forest algorithm to learn the relationship between these twelve vital parameters (features) and the death event, and use it to predict mortality.

Creating Project

Create a new Google Colab project and rename it to Random Forest. If you are new to Colab, then check out this short video tutorial.

Import the required libraries using the following code:

from random import randrange
import pandas as pd
import math
from math import sqrt
import seaborn as sns
import matplotlib.pyplot as plt

Loading Dataset

The dataset for this project is taken from Kaggle.

Use the wget command to download the CSV file into the Colab environment.

!wget https://raw.githubusercontent.com/abcom-mltutorials/Random-Forest/master/datasets_727551_1263738_heart_failure_clinical_records_dataset.csv

Use the read_csv function from pandas to load the data as a data frame.

dset = pd.read_csv("datasets_727551_1263738_heart_failure_clinical_records_dataset.csv")

Understanding Dataset

This dataset contains records of patients with cardiovascular disease. Cardiovascular diseases (CVDs) are the number one cause of death worldwide, taking about 17.9 million lives each year, which accounts for 31% of all deaths globally. CVDs often lead to heart failure. This dataset provides twelve vital parameters that can be used to predict mortality by heart failure. We will use these twelve parameters related to a patient’s health as features while creating the model.

You get a list of columns in the dataset by executing the following statement:

dset.columns

The output is shown below:

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

The column descriptions are as follows:

  • Age: patient's age (years)
  • Anaemia: decrease of red blood cells or hemoglobin (boolean)
  • Creatinine_phosphokinase: level of the CPK enzyme in the blood (mcg/L)
  • Diabetes: if the patient has diabetes (boolean)
  • Ejection_fraction: percentage of blood leaving the heart at each contraction (percentage)
  • High_blood_pressure: if the patient has hypertension (boolean)
  • Platelets: platelets in the blood (kiloplatelets/mL)
  • Serum_creatinine: level of serum creatinine in the blood (mg/dL)
  • Serum_sodium: level of serum sodium in the blood (mEq/L)
  • Sex: woman or man (binary)
  • Smoking: if the patient smokes or not (boolean)
  • Time: follow-up period (days)
  • DEATH_EVENT: if the patient deceased during the follow-up period (boolean)

Call the describe method to get the data distribution:

dset.describe()

The output is shown below:

image02

We can run the following code to check the relation between the features using the Correlation Matrix:

corrmat = dset.corr()  
f, ax = plt.subplots(figsize =(9, 8))
cg = sns.heatmap(corrmat, ax = ax, cmap ="YlGnBu", linewidths = 0.1)
cg

The output is as follows:

image03

Here, values close to 0 indicate little correlation, while values close to 1 (or -1) indicate strong positive (or negative) correlation. We use the matrix to observe the relationship between the various features and the death event.
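
As a quick sanity check (a small sketch, not part of the original workflow), you can list each feature’s correlation with the death event directly from the correlation matrix computed above:

# Correlation of every feature with DEATH_EVENT, strongest first
# (run this before converting dset to a list below)
print(corrmat['DEATH_EVENT'].drop('DEATH_EVENT').sort_values(ascending=False))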

Convert the dataset into a Python list for further processing:

dset = dset.values.tolist()

Having understood the dataset we are going to work on, let us now start building the random forest.

Building Random Forest

First, we will develop a few utility functions that will ease the creation of random forests.

Splitting Dataset

To build a tree, we need to split the dataset into left and right branches. We repeat this process recursively, scanning the records at each node, until we reach the bottom of the tree.

The function testsplit takes the index of a column, a threshold value and the dataset as arguments and splits the dataset into two parts based on each row's value in that column. Use the following code to define the function.

def testsplit(index, value, dset):
 left, right = list(), list()
 for row in dset:
   if row[index] > value:
     right.append(row)
   else:
     left.append(row)
 return left, right

The function iterates through all rows in the dataset; if the row's value in the given column is greater than the specified threshold, we append the row to the right list; otherwise we append it to the left list.
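
You can verify this behaviour on a tiny hand-made dataset (values chosen only for illustration):

# Toy dataset: each row is [feature_value, class_label]
toy = [[2.0, 0], [5.0, 1], [7.0, 1]]

# Split on column 0 using 4.0 as the threshold
left, right = testsplit(0, 4.0, toy)
print(left)   # [[2.0, 0]]
print(right)  # [[5.0, 1], [7.0, 1]]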

Class Count

We need a function to count the number of rows in each class. When we use testsplit as defined earlier, it divides the dataset into subsets or groups. We pass such a group as a parameter to the class_counts function, which counts the number of rows belonging to each class; the class label is the last column of each row. Use the following code to define the function.

def class_counts(group):
   counts = {} 
   for row in group:
       label = row[-1]
       if label not in counts:
           counts[label] = 0
       counts[label] += 1
   return counts
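
A quick illustrative call on toy values shows the counts keyed by class label (the last column of each row):

# One row of class 0 and two rows of class 1 in this toy group
print(class_counts([[2.0, 0], [5.0, 1], [7.0, 1]]))  # {0: 1, 1: 2}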

Entropy

Entropy is a measure of disorder; it indicates how mixed the data is. We use the entropy function to keep the groups created at each new node of the tree as pure as possible. At each branching, we want to decrease the entropy, so we compute this quantity before the cut and after the cut. The smaller the entropy, the better the split. The following code shows the function definition:

def entropy(group):
   entries = class_counts(group)
   avg_entropy = 0
   size = float(len(group))
   for label in entries:
       prob = entries[label] / size
       avg_entropy = (avg_entropy
                 + (prob * math.log(prob, 2)))
   return -1*avg_entropy

The class_counts function finds the number of elements of each class in a given group. The entropy function iterates through each class and divides the class count (entries[label]) by the group size (size) to obtain prob, the probability of picking a row with that class label. It then computes the entropy of the group using the standard formula.
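
For intuition, a perfectly mixed binary group has entropy 1 and a pure group has entropy 0; the following toy calls (illustrative values only) confirm this:

# Two rows of each class: maximum disorder for a binary problem
print(entropy([[1, 'yes'], [2, 'yes'], [3, 'no'], [4, 'no']]))  # 1.0

# Only one class present: zero disorder (printed as -0.0)
print(entropy([[1, 'yes'], [2, 'yes']]))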

Group Entropy

We define a function called Entropy_Groups to calculate the entropy of all groups combined after a split.

def Entropy_Groups(groups, classes):
 n_instances = float(sum([len(group) for group in groups]))
 Entropy = 0.0
 for group in groups:
   size = float(len(group))
   Entropy_group = entropy(group)
   Entropy += (Entropy_group) * (size / n_instances)
 return Entropy

The function first calculates the total number of rows across all groups and saves it in the n_instances variable. We then set a variable Entropy to 0. After that, we iterate through the groups, compute each group's entropy using the entropy function defined above, and add it to Entropy after weighting it by the group's relative size. The function returns the computed group entropy.
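
Continuing the toy example above, a split that separates the two classes perfectly has a combined entropy of 0, while a split that leaves both groups mixed stays at 1:

classes = ['yes', 'no']

# Each group contains a single class: a perfect split
pure_split = [[[1, 'yes'], [2, 'yes']], [[3, 'no'], [4, 'no']]]
print(Entropy_Groups(pure_split, classes))   # 0.0

# Each group still contains both classes: an uninformative split
mixed_split = [[[1, 'yes'], [3, 'no']], [[2, 'yes'], [4, 'no']]]
print(Entropy_Groups(mixed_split, classes))  # 1.0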

Finding Best Split

We now check every value of every candidate attribute as a possible split, measure the cost of each split and keep the best one. This is an exhaustive, greedy search. First, we create a list of candidate feature indexes. For each selected feature, we iterate over every row in the dataset and call testsplit with that row's value in the column as the threshold, recording the group entropy of each candidate split. The split with the lowest entropy is the best split, and the chosen index, the threshold value (row[index]) and the groups it creates are returned. We declare the function as shown below:

def get_split(dset, number_of_features):

classval is a list of all the class labels in the dataset. The remaining variables hold placeholder values that get overwritten as we iterate over the features and rows; the initial score of 1 is the maximum possible entropy for a binary problem, so any informative split will replace it.

 classval = list(set(row[-1] for row in dset))
 updated_index = 500
 split_value = 500
 score = 1
 split_groups = None

The features variable contains the list of candidate feature indexes. It is built by repeatedly drawing a random column index (excluding the last, label column) until number_of_features distinct indexes have been collected. This random selection of features is what allows individual trees to differ from one another.

 features = list()
 while len(features) < number_of_features:
   index = randrange(len(dset[0])-1)
   if index not in features:
      features.append(index)

Now iterate over the selected features and the rows of the dataset. For each candidate, create a split using the testsplit function and calculate its combined group entropy with Entropy_Groups.

 for index in features:
   for row in dset:
     groups = testsplit(index, row[index], dset)
     Entropy = Entropy_Groups(groups, classval)

Update the score whenever a lower entropy is obtained, remembering the corresponding index, threshold value and groups. Finally, return the updated index, split_value and groups as a dictionary.

     if Entropy < score:
       updated_index = index
       split_value = row[index]
       score = Entropy
       split_groups = groups
 return {'index':updated_index, 'value':split_value,
                    'groups':split_groups}

The full code for the get_split function is given below for your quick reference:

def get_split(dset, number_of_features):
 classval = list(set(row[-1] for row in dset))
 updated_index = 500
 split_value = 500
 score = 1
 split_groups = None
 features = list()
 while len(features) < number_of_features:
   index = randrange(len(dset[0])-1)
   if index not in features:
      features.append(index)
 for index in features:
   for row in dset:
     groups = testsplit(index, row[index], dset)
     Entropy = Entropy_Groups(groups, classval)
     if Entropy < score:
       updated_index = index
       split_value = row[index]
       score = Entropy
       split_groups = groups
 return {'index':updated_index, 'value':split_value,
                    'groups':split_groups}

Now build a to_terminal function, which returns the class with the highest number of elements in the group.

def to_terminal(group):
 outcomes = [row[-1] for row in group]
 return max(set(outcomes), key=outcomes.count)
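
For example, on a toy group where class 1 appears twice and class 0 once, the function returns 1:

# The majority class label in this toy group is 1
print(to_terminal([[2.0, 1], [5.0, 0], [7.0, 1]]))  # 1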

Terminating Tree

Create a function to decide when to stop growing the tree. We stop splitting and adding new nodes when the maximum depth of the tree is reached, when the number of rows in a child falls to or below the minimum node size (min_size), or when a split puts all rows into a single group, so there is nothing left to separate. Splitting continues only while none of these conditions is met.

def split(node, max_depth, min_size,
                 number_of_features, depth):
 left, right = node['groups']
 del(node['groups'])
 # check for a no split
 if not left or not right:
   node['left'] =  to_terminal(left + right)
   node['right'] = to_terminal(left + right)
   return

Check for max_depth and, if it has been reached, terminate both children using the to_terminal function.

 # check for max depth
 if depth >= max_depth:
   node['left'] = to_terminal(left)
   node['right'] = to_terminal(right)
   return

Next, process the left child: if the group is at or below min_size, terminate it with to_terminal; otherwise split it further with get_split and recurse.

 # process left child
 if len(left) <= min_size:
   node['left'] = to_terminal(left)
 else:
   node['left'] = get_split(left, number_of_features)
   split(node['left'], max_depth, min_size,
         number_of_features, depth+1)

Do the same for the right child: check min_size and, if the condition holds, terminate the node; otherwise split it further.

 # process right child
 if len(right) <= min_size:
   node['right'] = to_terminal(right)
 else:
   node['right'] = get_split(right, number_of_features)
   split(node['right'], max_depth, min_size,
              number_of_features, depth+1)

Because both children call split recursively, the whole subtree below each node is built in the same way.

The full source code for the split function is given below for your quick reference:

def split(node, max_depth, min_size,
                 number_of_features, depth):
 left, right = node['groups']
 del(node['groups'])
 # check for a no split
 if not left or not right:
   node['left'] =  to_terminal(left + right)
   node['right'] = to_terminal(left + right)
   return
 # check for max depth
 if depth >= max_depth:
   node['left'] = to_terminal(left)
   node['right'] = to_terminal(right)
   return
 # process left child
 if len(left) <= min_size:
   node['left'] = to_terminal(left)
 else:
   node['left'] = get_split(left, number_of_features)
   split(node['left'], max_depth, min_size,
         number_of_features, depth+1)
 # process right child
 if len(right) <= min_size:
   node['right'] = to_terminal(right)
 else:
   node['right'] = get_split(right, number_of_features)
   split(node['right'], max_depth, min_size,
              number_of_features, depth+1)

Building Tree

To build the tree, first use get_split to create the root node, then use the split function to split nodes recursively until the full tree is created. The build_tree function is defined as shown below:

def build_tree(train, max_depth, min_size,
              number_of_features):
 root = get_split(train, number_of_features)
 split(root, max_depth, min_size,
       number_of_features, 1)
 return root
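
As a quick sanity check (a sketch on made-up data), you can build a shallow tree on a tiny dataset and inspect the split chosen for the root node:

# Toy dataset: a single feature followed by the class label
toy = [[2.0, 0], [3.0, 0], [6.0, 1], [7.0, 1]]

# Build a shallow tree and print the root split (column index and threshold)
toy_tree = build_tree(toy, max_depth=2, min_size=1, number_of_features=1)
print(toy_tree['index'], toy_tree['value'])  # 0 3.0, which separates the classes perfectly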

Value Prediction

Create a predict function that checks whether a child node is a terminal value to be returned as the prediction, or a dictionary node containing another level of the tree to descend into. We implement this using recursion. The function code is given below:

def predict(node, row):
 if row[node['index']] > node['value']:
   if isinstance(node['right'], dict):
     return predict(node['right'], row)
   else:
     return node['right']
 else:
   if isinstance(node['left'], dict):
     return predict(node['left'], row)
   else:
     return node['left']

The function uses the index and value stored in a node to test whether the given row of data falls on the left or the right of the split. isinstance checks whether the child node is a dictionary; if it is, we recurse into it to find the prediction; otherwise it is a terminal node and its value is returned. The same procedure is applied on the left branch.
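
The following sketch calls predict on a tiny hand-built node (hypothetical values), where rows with column 0 above 3.0 go to the right branch:

# A single-split tree: column 0 compared against 3.0,
# with terminal values 0 on the left and 1 on the right
tiny_tree = {'index': 0, 'value': 3.0, 'left': 0, 'right': 1}
print(predict(tiny_tree, [5.0]))  # 1 (falls on the right of the split)
print(predict(tiny_tree, [2.0]))  # 0 (falls on the left of the split)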

Bagging Prediction

Now create a function bagging_predict that combines the predictions made by all the trees. It counts how often each class is predicted and returns the class predicted by the largest number of trees.

def bagging_predict(trees, row):
 predictions = [predict(tree, row) for tree in trees]
 return max(set(predictions), key=predictions.count)

Here, predictions is the list of predictions made by all the trees for the given row.
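
To see the voting in action, the sketch below asks three tiny hand-built trees (hypothetical values) to classify the same row; the majority class wins:

tiny_trees = [
    {'index': 0, 'value': 3.0, 'left': 0, 'right': 1},
    {'index': 0, 'value': 4.5, 'left': 0, 'right': 1},
    {'index': 0, 'value': 6.0, 'left': 0, 'right': 1},
]

# Two of the three trees vote 1 for this row, so 1 is returned
print(bagging_predict(tiny_trees, [5.0]))  # 1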

Random Forest Implementation

We have now built all the functions required for our random forest algorithm and are ready to implement the random_forest function. It returns a list of predictions when the train and test sets are passed as input.

def random_forest(train, test, max_depth,
                 min_size,
                 n_trees, number_of_features):
 trees = list()
 for i in range(n_trees):
   train_set = list(train)
   tree = build_tree(train_set, max_depth,
                  min_size, number_of_features)
   trees.append(tree)
 predictions = [bagging_predict(trees, row)
                     for row in test]
 return(predictions)

Inside the function, we build the desired number of trees from the training set and collect them in a list. Then, for each row of the test set, the bagging_predict function selects the class predicted by the majority of the trees.
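
Note that, as written, every tree is trained on the same rows, so the trees can differ only when get_split samples fewer than all the features. The classic random forest also draws a bootstrap sample of the rows (sampling with replacement) for each tree. The helper below is an optional sketch, not part of the tutorial code, that could replace train_set = list(train) inside random_forest:

def bootstrap_sample(train):
 # Draw len(train) rows at random, with replacement
 sample = list()
 while len(sample) < len(train):
   sample.append(train[randrange(len(train))])
 return sample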

Algorithm Accuracy

Now define the accuracy_metric function to calculate the accuracy of our predictions. It iterates through every prediction, compares the actual and predicted values, and returns the percentage of correct predictions.

def accuracy_metric(actual, predicted):
 correct_prediction = 0
 for i in range(len(actual)):
   if actual[i] == predicted[i]:
     correct_prediction += 1
 return correct_prediction / float(len(actual)) * 100.0
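
For instance, with three of four predictions matching the actual labels, the function returns 75:

# 3 of the 4 predictions match the actual labels
print(accuracy_metric([1, 0, 1, 1], [1, 0, 0, 1]))  # 75.0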

Algorithm Evaluation

To test our algorithm, define an algorithm_evaluation function which takes the dataset, the algorithm to be applied and the algorithm's arguments as input and returns the accuracy using the previously created functions. Note that, as written, it performs a single train/test split whose test size is one n_folds-th of the data, rather than a full k-fold cross-validation.

def algorithm_evaluation(dset, algorithm, n_folds, *args):

Create the train and the test set using the given dataset.

 dset_split = list()
 dset_copy = list(dset)
 fold_size = int(len(dset) / n_folds)
 test_set = list()

Fill test_set with randomly chosen rows from the dataset until it contains fold_size rows; the rows that remain form the training set.

 while len(test_set) < fold_size:
     index = randrange(len(dset_copy))
     test_set.append(dset_copy.pop(index))
 train_set = list(dset_copy)

Then obtain the predictions of the random forest algorithm on the test set and calculate the accuracy using the accuracy_metric function defined earlier.

 predicted = algorithm(train_set, test_set, *args)
 actual = [row[-1] for row in test_set]
 accuracy = accuracy_metric(actual, predicted)
 return accuracy

The full source code for the algorithm_evaluation function is given below:

def algorithm_evaluation(dset, algorithm, n_folds, *args):
 dset_split = list()
 dset_copy = list(dset)
 fold_size = int(len(dset) / n_folds)
 test_set = list()
 while len(test_set) < fold_size:
     index = randrange(len(dset_copy))
     test_set.append(dset_copy.pop(index))
 train_set = list(dset_copy)
 predicted = algorithm(train_set, test_set, *args)
 actual = [row[-1] for row in test_set]
 accuracy = accuracy_metric(actual, predicted)
 return accuracy

Testing

The random forest requires certain parameters as input, shown below. Choose values for these variables that give the best accuracy on the dataset.

n_folds = 5
max_depth = 10
min_size = 1
number_of_features = int((len(dset[0])-1))
n_trees = 5

The number of features we use here equals the number of feature columns in our dataset (all columns except the label), so every split considers all possible features.
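
A common alternative in standard random forests (and the reason sqrt was imported at the start) is to consider only about the square root of the feature count at each split, which makes the individual trees more diverse. This is a suggestion rather than what this tutorial uses, and alt_number_of_features is a hypothetical name:

# Optional alternative (not used below): about sqrt of the feature count
alt_number_of_features = int(sqrt(len(dset[0]) - 1))
print(alt_number_of_features)  # 3 for this dataset's 12 features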

We obtain the score of the algorithm as the result of the algorithm_evaluation function. The following code calculates and prints the score for the algorithm.

score = algorithm_evaluation(dset, random_forest, n_folds, max_depth,
                             min_size, n_trees, number_of_features)
print("evaluation score: ", score)

The output is as follows:

evaluation score:  90.0

Having developed the model and tested its accuracy, we will now put it to some actual use: a doctor can feed in a patient’s data to get an idea of the patient’s mortality risk.

Actual Use

I have created five test cases with varied feature values for testing. We define them as follows:

patient_medical_record_1 = {'age': 65.0, 'anaemia': 0.0, 'creatinine_phosphokinase': 146.0,
                         'diabetes': 0.0, 'ejection_fraction': 20.0, 'high_blood_pressure': 0.0,
                         'platelets': 162000.0, 'serum_creatinine': 1.3, 'serum_sodium': 129.0,
                         'sex': 1.0, 'smoking': 1.0, 'time': 7.0}
patient_medical_record_2 = {'age': 64.0, 'anaemia': 1.0, 'creatinine_phosphokinase': 5500.0,
                         'diabetes': 0.0, 'ejection_fraction': 45.0, 'high_blood_pressure': 0.0,
                         'platelets': 60000.0, 'serum_creatinine': 7.0, 'serum_sodium': 132.0,
                         'sex': 0.0, 'smoking': 1.0, 'time': 230.0}
patient_medical_record_3 = {'age': 72.0, 'anaemia': 0.0, 'creatinine_phosphokinase': 127.0,
                         'diabetes': 1.0, 'ejection_fraction': 50.0, 'high_blood_pressure': 1.0,
                         'platelets': 218000.0, 'serum_creatinine': 1.0, 'serum_sodium': 134.0,
                         'sex': 1.0, 'smoking': 0.0, 'time': 33.0}
patient_medical_record_4 = {'age': 60.0, 'anaemia': 0.0, 'creatinine_phosphokinase': 2656.0,
                         'diabetes': 1.0, 'ejection_fraction': 30.0, 'high_blood_pressure': 0.0,
                         'platelets': 305000.0, 'serum_creatinine': 2.3, 'serum_sodium': 137.0,
                         'sex': 1.0, 'smoking': 0.0, 'time': 30.0}
patient_medical_record_5 = {'age': 90.0, 'anaemia': 1.0, 'creatinine_phosphokinase': 47.0,
                         'diabetes': 0.0, 'ejection_fraction': 40.0, 'high_blood_pressure': 1.0,
                         'platelets': 204000.0, 'serum_creatinine': 2.1, 'serum_sodium': 132.0,
                         'sex': 1.0, 'smoking': 1.0, 'time': 8.0}

# Build the test set in one go (DataFrame.append has been removed in
# recent pandas releases) and convert it to a list of rows
testset = pd.DataFrame([patient_medical_record_1, patient_medical_record_2,
                        patient_medical_record_3, patient_medical_record_4,
                        patient_medical_record_5])
testset = testset.values.tolist()

We now call our random_forest on these five test cases.

random_forest(dset, testset, 10, 1, 5, int((len(dset[0])-1)))

The output is shown below:

[1.0, 0.0, 0.0, 0.0, 1.0]

As seen from the output, the model predicts a death event for the first and last patients, who have a high probability of succumbing to their current medical conditions, while the remaining three have good chances of survival.

Summary

In this tutorial, you learned the complete implementation of a random forest algorithm. You learned its fundamental concepts: how to find the best split using entropy, how to grow a tree until its terminal nodes are reached, and how to build the full tree. We then combined multiple trees to create a random forest and applied the model to relate a patient’s vital parameters to the likelihood of a death event.

Source: Download the project from our Repository

image