Arka

| Technical Writer / Review: ABCOM Team | Level: Beginner | Banner Image Source: Internet |

With several sites offering music downloads, categorizing the downloaded tracks becomes a challenge and is often quite frustrating. One cannot listen to every single MP3 file before putting it in the right storage bin. If we could train a machine to classify these files into certain pre-defined categories, grouping them would become effortless. And that is exactly what you are going to learn in this tutorial.

Music can be divided into genres (e.g., country music) and genres can be further divided into subgenres (e.g., country blues and pop country are two of the many country subgenres). - Wikipedia

We use this categorization to group our MP3 downloads. The categories distinguish each type of music by its form and style. This kind of machine classification is not just useful for organizing your own collection; it also has real applications in the music industry.

Genre classification is important for many real-world applications, as the quantity of music released daily continues to sky-rocket, especially on internet platforms such as Soundcloud and Spotify. Spotify alone adds about forty thousand tracks per day, the equivalent of 280,000 songs a week, around 1.2 million tracks a month, and a whopping 14.6 million in a year. Many companies nowadays use music classification, either to make recommendations to their customers (for example Spotify, Soundcloud) or as a product in itself (for example Shazam). Determining the music genre is the first step in that direction, and classifying the songs in any playlist or library by genre is an important functionality for any music streaming or purchasing service.

For classification, we need to extract features, i.e. identify the components of the audio signal that are good for identifying the content while discarding everything else, such as background noise. Mel Frequency Cepstral Coefficients (MFCC) are widely used in automatic speech and speaker recognition, and we will use them for our music classification.

With this small introduction to music classification, I will now describe how to develop a machine learning model for this purpose.

Creating Project

Create a new Google Colab project and rename it to Music Genre Classification. If you are new to Colab, then check out this short video tutorial on Google Colab.

Installing Packages

As mentioned earlier, we need to extract the MFCC features from each audio file. A Python library, python_speech_features, provides an implementation for this. Install this library in your project using the pip command:

!pip install python_speech_features

The training data that we use in this project consists of .wav files, whereas most downloads these days are in .mp3 format. So we will also install a package that converts an MP3 file into .wav format. To facilitate this conversion, install the pydub package using the following command:

!pip install pydub

Now, import all required libraries in the project:

import numpy as np
import os
import pickle
import random
import pandas as pd
import sklearn
import scipy.io.wavfile as wav
from os import path
from pydub import AudioSegment
from python_speech_features import mfcc
from sklearn.metrics import classification_report
from collections import defaultdict

Next, load the dataset.

Loading Dataset

As required for any machine learning model development, having an appropriate dataset is important. Fortunately for us, such a dataset is available: Professor George Tzanetakis of the University of Victoria created the GTZAN genre collection in 2000-2001, and it is widely available on Kaggle and many other websites. It comprises audio files, each 30 seconds long and in .wav format. There are 9 classes (genres), each containing 100 audio tracks, covering the following genres:

Blues, Classical, Country, Disco, Hiphop, Metal, Pop, Reggae, Rock.

I have kept this dataset on our GitHub for your quick download. Download it using the wget command:

!wget https://github.com/abcom-mltutorials/music-genre/archive/master.zip

Unzip the downloaded files:

!unzip "/content/master.zip"

Next, we will create a Python list of all the downloaded files for processing them in a single loop.

directory = "/content/music-genre-master/Dataset"
filelist = []
for path, subdirs, files in os.walk(directory):
    for file in files:
        if file.endswith('.wav') or file.endswith('.WAV'):
            filelist.append(os.path.join(path, file))
number_of_files = len(filelist)
print(number_of_files)

The output shows that there are a total of 900 files (9 genres × 100 files each).

Next, you will extract features from all the downloaded .wav files.

Extracting Features

We use the mfcc function from the python_speech_features module to extract the music features. It computes 13 MFCC coefficients for each short frame of audio; we then average these over all frames so that every file is represented by a single 13-value feature vector. We write a utility function to extract the features:

def feature_extraction(file):
    features = []
    # read the .wav file: sample rate and raw signal
    (sampleRate, data) = wav.read(file)
    # compute the MFCC matrix (one 13-value row per 20 ms frame)
    mfcc_feature = mfcc(data, sampleRate,
                        winlen=0.020,
                        appendEnergy=False)
    # average the coefficients over all frames
    meanMatrix = mfcc_feature.mean(0)
    for x in meanMatrix:
        features.append(x)
    return features
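As a quick check (this snippet is not part of the original flow), you can call the function on the first file in the filelist built earlier and confirm that it returns 13 averaged coefficients; the exact values depend on which file happens to come first in the list:

sample_features = feature_extraction(filelist[0])
print(len(sample_features))   # 13
print(sample_features[:3])    # first three averaged MFCC coefficients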

We now create a loop that extracts the features of each file in our dataset along with its class label, building a flat Python list that we will later feed to our ML model.

datasetDirectory = "/content/music-genre-master/Dataset/"

featureSet = []
i = 0
for folder in os.listdir(datasetDirectory):
    i += 1
    if i > 9:  # the number of genres is 9
        break
    for files in os.listdir(datasetDirectory + folder):
        x = datasetDirectory + folder + "/" + files
        features = feature_extraction(x)
        j = 0
        for x in features:
            featureSet.append(x)
            j = j + 1
            # after the 13 features of a file, append its class label (1 to 9)
            if (j % 13 == 0):
                featureSet.append(i)
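As a sanity check, the list should now contain 14 entries (13 features plus the class label) for each of the 900 files:

print(len(featureSet))   # expected 900 * 14 = 12600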

If you want to examine the contents of the created list, try this print statement:

for i in range(14, 28):
    print(featureSet[i])

The output is as follows:

68.26890241618138
4.295817932910749
-18.256788251694037
14.501371978320028
-5.645958021079518
0.40104690624425693
-13.837011660627079
9.771670161265996
-11.64745664414687
5.320457679100407
-11.786153952031055
4.857850961323428
-4.4375303456524255
1

Note that I have printed the values from index 14 through 27. Each set of 14 values corresponds to a single .wav file. Thus, the values from 14 through 27 will show the features of the second file in our dataset. The last value in the print output is 1, which shows that the genre of this second file is of type 1, which is “Blues.”

Constructing Dataframe

Now, convert the features list into a dataframe for further processing. We name the thirteen feature columns m1 through m13 and add a target column for the class label, as follows:

df = pd.DataFrame(columns=['m1','m2','m3','m4','m5','m6','m7',
                          'm8','m9','m10','m11','m12','m13','target'])

We create a loop for constructing the dataframe from featureSet as follows:

i = 1
n = []
for j in featureSet:
    n.append(j)
    # 13 features + 1 target
    if (i % 14 == 0):
        df = df.append({'m1':n[0],'m2':n[1],'m3':n[2],'m4':n[3],'m5':n[4],
                        'm6':n[5],'m7':n[6],'m8':n[7],'m9':n[8],'m10':n[9],
                        'm11':n[10],'m12':n[11],'m13':n[12],'target':n[13]},
                       ignore_index=True)
        n = []
    i = i + 1
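Row-by-row appending works, but it is slow and DataFrame.append has been deprecated in recent pandas releases. As an optional alternative (a sketch that assumes featureSet keeps the flat 14-values-per-file layout built above), you can reshape the list with NumPy and build the dataframe in a single call:

columns = ['m1','m2','m3','m4','m5','m6','m7',
           'm8','m9','m10','m11','m12','m13','target']
# reshape the flat list into rows of 14 values (13 features + 1 label) each
df = pd.DataFrame(np.array(featureSet).reshape(-1, 14), columns=columns)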

The partial contents of the dataframe are shown in the screenshot below:

[Image: partial contents of the dataframe]

It has 900 rows and 14 columns. Each row represents a single file and 14 columns show 13 features plus one class label (target).
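You can verify this yourself by checking the shape and peeking at the first few rows:

print(df.shape)   # (900, 14)
df.head()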

Once the data has been pre-processed, you can apply any of several classification algorithms to train a model. We will use one of the simplest and most widely used of them, logistic regression, relying on the pre-defined model from the sklearn library.

Separating Features and Target

We separate out the features and labels using the following code:

x1=df[['m1','m2','m3','m4','m5','m6','m7','m8','m9','m10','m11','m12','m13']]
x1.shape

Y = df[['target']]
Y.shape

You will see that the shape of x1 is (900, 13): 900 rows, one for each file, and 13 columns giving the features of each individual .wav file. The target shape is, as expected, (900, 1), where the single column holds the genre class.

Splitting Dataset

Split the dataset into training and testing using sklearn library:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x1, Y,
                                                   test_size=0.2,
                                                   random_state=42)

Model Training

Now, train the logistic regression model on the training data generated above.

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_train,y_train)
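Depending on your scikit-learn version, the default 100 solver iterations may not be enough for this data and you may see a ConvergenceWarning, along with a warning about the shape of y_train. If that happens, one option (not part of the original tutorial) is to raise max_iter and flatten the target column:

clf = LogisticRegression(random_state=0, max_iter=1000).fit(
    X_train, y_train.values.ravel())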

Inference

Use the trained model on the test dataset.

predicted_value = clf.predict(X_test)
predicted_value

The output is as follows:

[Image: array of predicted genre labels]

As you can see, the model assigns each file in the test dataset to one of the categories, numbered 1 to 9. You may visualize the performance by plotting the confusion matrix using the following code:

sklearn.metrics.plot_confusion_matrix(clf, X_test, y_test)

The output is as follows:

[Image: confusion matrix for the test set]
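Note that plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a newer version, the equivalent call is:

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)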

Another way to evaluate the performance is by printing the various metrics:

print(classification_report(y_test, predicted_value))

The output is as follows:

[Image: classification report]

To interpret the results, here are quick definitions of the metrics shown above.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The f1 score (also called F-score or F-measure) is the harmonic mean of precision and recall, i.e. f1 = 2 * (precision * recall) / (precision + recall). It reaches its best value at 1 and its worst at 0.

The support is the number of occurrences of each class in y_test.
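If you prefer these numbers individually rather than as a formatted report, scikit-learn also exposes them as separate functions. This is an optional sketch; for a multi-class target you must pick an averaging strategy such as 'macro':

from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_test, predicted_value, average='macro'))
print(recall_score(y_test, predicted_value, average='macro'))
print(f1_score(y_test, predicted_value, average='macro'))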

Assuming you are satisfied with the predictions made by the logistic regression model, I will now show you how to classify an unseen audio file. For your ready use, I have kept such a file as part of the downloaded dataset. The file is named new_audio_file.wav and is available in the /content/music-genre-master folder of your Colab workspace.

First, you will extract the MFCC features from this file using the earlier defined feature extraction function:

audio_file="/content/music-genre-master/new_audio_file.wav"
audio_feature=feature_extraction(audio_file)

The above code is just like what you did for the training and testing data: it stores the extracted features in a list. We then run a prediction on this list using our trained model clf.

from collections import defaultdict
# build a mapping from the numeric class label to the genre folder name
results = defaultdict(int)
i = 1
for folder in os.listdir("/content/music-genre-master/Dataset/"):
    results[i] = folder
    i += 1
pred_audio = clf.predict([audio_feature])
results[int(pred_audio)]

It gives the following output:

'country'

Thus, the given .wav file is classified as of the genre “country.”
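If you want to see how the numeric predictions map to genre names, you can print the mapping that was built from the folder names:

print(dict(results))   # numeric label -> genre folder name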

Now, I will show you how to use this trained model on your own audio files.

Classifying Your Music

Just copy any desired .mp3 file to the /content folder. For quick experimentation, you can use the following file downloaded from the web:

!wget https://raw.githubusercontent.com/abcom-mltutorials/music/master/bhatiyar.mp3

The audio file name is bhatiyar.mp3. The wget command copies this file to the /content folder of your Colab workspace. We need to convert this file into .wav format. Declare the two variables:

src = "/content/bhatiyar.mp3"
dst = "test.wav"

Convert the file using the following code:

sound = AudioSegment.from_mp3(src)
sound.export(dst, format="wav")
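Optionally, you can confirm that the conversion worked by reading the new .wav file back and checking its sample rate and duration (a small sanity check, not part of the original flow):

sampleRate, data = wav.read(dst)
print(sampleRate, len(data) / sampleRate)   # sample rate in Hz, duration in seconds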

Extract the features using our earlier defined function:

audio_file="/content/test.wav"
audio_feature=feature_extraction(audio_file)

Predict the genre using the following code:

pred_audio=clf.predict([audio_feature])
results[int(pred_audio)-2]

The output is as follows:

'classical'

Thus, the model has classified our file as “classical.”

Summary

With millions of songs available on the web and roughly 14.6 million new tracks added every year, organizing these music libraries by hand would be a humanly impossible task. Fortunately, a very simple machine learning algorithm makes this job easy. Every audio file has certain features that we can use to classify it as belonging to a certain type. You used the MFCC implementation from python_speech_features to extract these features, performed the classification using logistic regression, and tested the trained model on an unseen file downloaded from the web.

Source: Download the project source from our Repository.
