Rainforest Connection Species Audio Detection

Table of Contents-

  1. Problem Statement
  2. Real-world Objectives/Constraints
  3. Data Source and Overview
  4. Performance Metrics
  5. Existing Solutions
  6. First Cut Approach
  7. Exploratory Data Analysis
  8. Preprocessing and Feature Engineering
  9. Modelling
  10. Deployment
  11. Future Work
  12. References

Let’s get started

1. Problem Statement-

Hearing a species in a tropical rainforest is much easier than seeing it, and that is what we are going to exploit in this case study: we will build a model that can detect species from audio signals alone. Someone walking through the forest might not be able to look around and spot every bird and frog that is there, but they can all be heard. The problem is that not everyone is an expert who can recognize a species and take suitable action. We can, however, build a device that recognizes the species in real time, and with the help of experts a suitable action can also be suggested, for example moving away if the species is dangerous.

2. Real-world Objectives/Constraints-

(i) If this solution is to be productionized, we need to make sure that it doesn't require too many resources, because it will run on small, simple devices and not on supercomputers.

(ii) The latency should be as low as possible; prediction should ideally take no more than 1 second, because a device used in the forest needs to identify a species as soon as it is heard.

(iii) The number of false negatives should be as low as possible, because if a dangerous species is present and our model fails to detect it, it may cost someone their life.

(iv) The model will be trained on fixed-length audio signals, but in real time audio will be fed to the device continuously, so we need a solution for this mismatch; we will discuss it in the First Cut Approach section.

3. Data Source and Overview-

We are given 4727 audio files, each recorded in a tropical forest. The data was collected with a device that, along with recording audio, also detects which species' call is present in the recording; this output was later checked manually by experts. For around 1100 of these files we are told which species the device detected and the experts confirmed the detection to be true (train_tp.csv), and for another 3600 files we are told which species the device detected but the experts found the detection to be false (train_fp.csv). Multiple species can be present in one audio file. Our task is to predict the probability of each species for the roughly 1900 test files we are given.

4. Performance Metrics-

As we need to predict probabilities instead of actual class labels, multiclass log loss is a better choice than accuracy. It is defined as the negative average of log(probability of the actual class label). Mathematically, it is defined as

F = -(1/N) · Σᵢ Σⱼ yᵢⱼ · log(pᵢⱼ), summing i over the N samples and j over the M classes,

where F is the loss, pᵢⱼ is the probability output by the classifier for sample i and class j, yᵢⱼ is a binary variable (1 if j is the expected label of sample i, 0 otherwise), N is the number of samples and M is the number of classes.

In easy language, we can define it as

log loss = -(1/n) · Σ log(P)

where P is the probability assigned to the actual class label of a sample and n is the number of samples.
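
As a concrete check, here is a minimal scikit-learn sketch that computes this metric on made-up probabilities (the arrays below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss

# Four samples, three classes: y_true holds the actual labels,
# y_pred holds the probabilities output by the classifier.
y_true = [0, 2, 1, 2]
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.1, 0.8]])

# Negative average of log(probability of the actual class label)
print(log_loss(y_true, y_pred, labels=[0, 1, 2]))
```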

5. Existing Solutions-

(i) https://arxiv.org/pdf/1804.07177v1.pdf

This paper is based on the 2018 LifeCLEF identification task. The problem is pretty similar to ours, except that it has no false-positive labels. The work is divided into 3 parts: first, all audio files are converted into spectrograms, and then a convolution-based neural network is trained. The architecture of the CNN implemented in the paper looks like this -

The last part is testing the model on unseen audio files. The performance metric used here is Mean Label Ranking Average Precision, which is based on the rank of the true class label among the predicted scores. In easy language, when each sample has a single true label it reduces to

(1/n) · Σ 1/(rank of the true label)

where n is the number of samples.
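
scikit-learn ships this metric as label_ranking_average_precision_score; a minimal sketch with made-up multi-label targets and scores:

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

# Binary indicator matrix of true labels and the model's scores (illustrative values)
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_score = np.array([[0.75, 0.50, 1.00],
                    [1.00, 0.20, 0.10]])

# With one true label per sample this is the mean of 1/rank of that label
print(label_ranking_average_precision_score(y_true, y_score))
```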

This paper also focuses on predicting multiple species in a single audio file, just like ours. The method used is: a given audio file is split into chunks of 1 second each, every chunk is converted into a spectrogram, the spectrograms are passed through the model to compute per-class probabilities, and mean exponential pooling is then applied to get a single probability score for each class.

Other than the fact that this paper doesn't have any false-positive data, it is pretty similar to our problem, so we can try to implement it and then improve on it with our data.

(ii) https://pypi.org/project/noisereduce/

We can reduce background noise in our data using Spectral Gating. This algorithm works on the principle of the Fast Fourier Transform: the signal is broken down into its frequency components, a threshold is calculated based on the noise spectrum, a mask is determined by comparing the noise spectrum to the signal spectrum, and the mask is then smoothed, which results in a reduction of noise in the audio signal.
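
A minimal sketch of how this could be applied to one of our recordings, assuming noisereduce >= 2.0 (where spectral gating is exposed through reduce_noise) and a hypothetical file path:

```python
import librosa
import noisereduce as nr

# Load a recording (path is illustrative) and suppress background noise with spectral gating
y, sr = librosa.load("train/recording_0.flac", sr=None)
y_denoised = nr.reduce_noise(y=y, sr=sr)
```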

(iii) https://arxiv.org/pdf/1608.04363v2.pdf

This paper is based on environmental sound classification data. It proposes a CNN-based architecture with data augmentation and compares its results with Spherical K-Means (SKM) and PiczakCNN on the same data. The key takeaway from this paper is how augmentation techniques can help improve results. A comparison of the models with and without augmentation is as follows-

This shows that SKM's results improve only negligibly with augmentation, while the proposed model's (SB-CNN) results improve significantly. The techniques used for augmentation in this paper are listed below (a small librosa sketch follows the list)-

* Time Stretching with factors of: {0.81, 0.93, 1.07, 1.23}

* Pitch Shifting with factors of: {-3.5, -2.5, -2, -1, 1, 2, 2.5, 3.5}

* Dynamic Range Compression
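
A minimal sketch of the first two augmentations with librosa (factor values taken from the paper's sets above; keyword arguments assume librosa >= 0.10):

```python
import librosa

y, sr = librosa.load("train/recording_0.flac", sr=None)  # illustrative path

# Time stretching: rate > 1 speeds the clip up, rate < 1 slows it down
y_stretched = librosa.effects.time_stretch(y, rate=1.07)

# Pitch shifting by a number of semitones without changing the duration
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
```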

After that, the per-class results are also analyzed, i.e. which classes' performance improves after augmentation and which classes' performance drops. We can use this analysis to do class-wise augmentation, which can help us improve our results.

Proposed model architecture is as follows-

(iv) https://arxiv.org/pdf/1902.10107v2.pdf

This paper is based on the VoxCeleb2 dataset but also shows significant improvement on VoxCeleb1. It builds upon CNNs and a dictionary-based method: while the CNN helps extract patterns from the input data, the latter helps convert the input into a fixed-size representation. With the help of these methods, the paper proposes an end-to-end model for speaker recognition. First, a feature extractor is used to extract frame-level features from the input spectrograms. Each spectrogram is cut using a 2.5-second sliding window with a 1-second step size, and normalization is then applied. Below is an image of the network architecture, which includes the feature extraction and aggregation parts-

Through this paper we get an idea of how our model can work in production by training an end-to-end neural network. We can also improve on this by adding augmentation techniques.

(v) https://arxiv.org/pdf/2002.04683v1.pdf

This paper shows how to work with data in which certain sample points are not correctly labelled: how to detect points that might be mislabelled and how to re-label them. In our problem, we already know which data points are not correctly labelled, so we just need to focus on how to re-label them. The approach followed in the paper is as follows-

Dn is the dataset whose points might not be correctly labelled. Fa is the classifier trained using Dc (the dataset whose points are correctly labelled). A prediction is made with this classifier for every point in Dn and a condition is checked (confidence of the prediction > threshold and the prediction differs from the previous label); if it holds, the point is re-labelled, and this data is then used together with Dc to train the final classifier.

For our problem, we can modify the above approach as follows: first, we train a classifier on our correctly labelled data. Then, for each point in the false-positive data, we make a prediction with this classifier. Since we are told which class a particular point does not belong to, we check that the predicted class is different, apply a confidence threshold, re-label the point, and then follow the same approach as discussed in the paper.
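
A rough sketch of this re-labelling rule, assuming a trained classifier clf with predict_proba, a feature matrix X_fp for the false-positive clips, and an array fp_label holding the class each clip is known not to belong to (all of these names are hypothetical):

```python
import numpy as np

THRESHOLD = 0.9  # assumed confidence threshold

def relabel_false_positives(clf, X_fp, fp_label):
    """Keep a false-positive clip only if the classifier is confident
    and its prediction differs from the label the experts rejected."""
    proba = clf.predict_proba(X_fp)
    pred = proba.argmax(axis=1)
    conf = proba.max(axis=1)

    keep = (conf > THRESHOLD) & (pred != fp_label)
    # The kept clips, with their new labels, are added to the correctly labelled set
    return X_fp[keep], pred[keep]
```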

6. First Cut Approach-

As we can see from the picture above, for each recording id we are given the time at which a particular species was heard, and one recording can contain multiple species. So the first thing we will do is split the audio signal into pieces according to the given time intervals, so that each piece contains only one species.

After that, we will apply the data augmentation techniques discussed above and train a classifier on the true-positive data, and we will use that classifier to label our false-positive data. Both sets will then be used to train a final classifier. We will also do class-wise augmentation after analysing the results of random augmentation.

We will start the modelling process with basic ML models and then move on to neural-network-based models. Most of the solutions available on Kaggle have not used the false-positive data, but we will try to make use of it to improve our model. One experiment we will run is to train a separate model on the false-positive points only and then combine the two models through cascading.

We will also perform a detailed EDA, which should give us more ideas for feature engineering as well as other insights that can help us improve our solution.

In real time / production, audio clips of 5 seconds each will be fed into the model. So, after turning the device ON, it will take 5 seconds for the first prediction; after that, audio will be fed to the model with a step size of 1 second, so a new prediction will be made every second.
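
A minimal sketch of this rolling window, assuming a fixed device sample rate and a hypothetical predict_species function that wraps the trained model:

```python
import numpy as np

SR = 48000        # assumed sample rate of the device microphone
WINDOW = 5 * SR   # the model always sees the last 5 seconds
STEP = 1 * SR     # new audio arrives in 1-second chunks

buffer = np.zeros(WINDOW, dtype=np.float32)

def on_new_chunk(chunk, predict_species):
    """Called once per second with STEP new samples; returns the latest prediction."""
    global buffer
    buffer = np.concatenate([buffer[len(chunk):], chunk])  # drop the oldest second, append the newest
    return predict_species(buffer)
```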

7. Exploratory Data Analysis-

7.1 Null Value Analysis-

We don’t have any null values.

7.2 Is our data balanced?

Out of 24 classes, only one class has a much larger number of data points than the others. So we can conclude that our data is almost balanced.

7.3 More than one species in one audio file?

We discussed that one audio file may contain multiple species, so now we will check how many audio files contain multiple species and how we will deal with them.

In the csv file, for each audio file we are given the time at which a particular species was heard, so one recording can appear multiple times in the csv if it contains more than one species. So we will look at how many times each recording id is repeated in the data.

We can easily observe that most of the audio files contain only one species and very few contain multiple species.
Now, for example, let's take an audio file that has 3 species; from the given data we already know at what time each of the species was heard. Using this, we can slice the audio sequence into three parts such that each part contains only one species.

7.4 Does species depend on minimum frequency?

We can see that some of the classes are easily separable:
If f_min > 10000, then species = 22
If 6000 < f_min < 8000, then species = 23
If 5800 < f_min < 6000, then species = 0
If 4000 < f_min < 5000, then species = 5 or 7
Similarly, there are other groups of classes that can be distinguished from the rest.
There are also very few outliers, and none of them are very far from the rest of the data.

7.5 Does species depend on maximum frequency?

In the previous plot we couldn't distinguish between classes 5 and 7, but here they are easily separable.
Class 19 can also be separated from classes 13 and 15, which wasn't clear in the previous plot. But there are still some classes that are not separable; for example, classes 8 and 9 look almost identical.
Only class 17 has an outlier that lies very far from the other data points; the outliers of the other classes are not extreme.

Now, we will load all the audio files using the librosa library.

7.6 Are audio files of the same duration?

We can see that the majority of the data points have a duration of less than 4 seconds, but there is still a considerable number of points with durations between 4 and 8 seconds. Because of this, we will pad all signals to the length of the signal with the maximum duration.

This makes things clearer: 80% of the points have a duration < 3.44 seconds, but 20% of the points still have a duration > 3.44 seconds. So we will use the maximum duration as the target length.

8. Preprocessing and Feature Engineering-

For preprocessing, we will first load our files and then slice them according to the times at which a species is heard. However, we don't want the slice times to be overly precise, as that would not help our model generalise, so in addition to slicing the audio according to the given times we will also add 0.2 seconds at the start and the end.
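
A minimal sketch of this slicing step with librosa, assuming t_min and t_max are the annotated start and end times (in seconds) for a species in the recording:

```python
import librosa

MARGIN = 0.2  # extra context added before and after the annotated interval

def slice_clip(path, t_min, t_max):
    """Load a recording and cut out one species call with a 0.2 s margin on each side."""
    y, sr = librosa.load(path, sr=None)
    start = max(0, int((t_min - MARGIN) * sr))
    end = min(len(y), int((t_max + MARGIN) * sr))
    return y[start:end], sr
```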

Now that we have loaded our data, let's perform the train/test split. We will also create new y labels, since we have loaded each file twice.

Now, to help our model generalise, let's add augmentation to our training data.

Loading the raw data just means we have the amplitude of the audio signal as a time series, but loudness alone is not enough to train our model; we need features from which the model can distinguish between different species. For this, we will convert our data from the time domain to the frequency domain and use spectrograms.
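
A minimal sketch of this conversion using librosa's mel spectrogram (the number of mel bands is illustrative, not tuned):

```python
import numpy as np
import librosa

def to_spectrogram(y, sr, n_mels=128):
    """Convert a raw waveform into a log-scaled mel spectrogram of shape (n_mels, time_frames)."""
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(spec, ref=np.max)
```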

Now that we have the right data, we just need one finishing touch: making every audio signal the same length. As we saw during the data analysis, the lengths of the audio files vary a lot, so we will find the maximum length and pad all other signals accordingly.
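
A minimal numpy sketch of padding every spectrogram along the time axis to match the longest one (assuming the spectrograms are kept in a Python list):

```python
import numpy as np

def pad_spectrograms(specs):
    """Zero-pad each (n_mels, time_frames) spectrogram to the maximum number of frames."""
    max_len = max(s.shape[1] for s in specs)
    return np.stack([np.pad(s, ((0, 0), (0, max_len - s.shape[1]))) for s in specs])
```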

Now, before feeding this data to any neural network, there is one very important step left: data normalisation.
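
A minimal sketch of standardisation using statistics computed on the training set only, so nothing leaks from the test data:

```python
def normalise(train_specs, test_specs):
    """Standardise spectrograms with the mean and std of the training data."""
    mean, std = train_specs.mean(), train_specs.std()
    return (train_specs - mean) / std, (test_specs - mean) / std
```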

Now, our data is ready and we are also ready to start training models.

9. Modelling-

We will start with a convolution-based neural network and then move to an LSTM.

Convolution based model-

This is our first model architecture. It's a very simple sequential convolutional architecture based on Conv1D layers. In addition to Conv1D, we have also used MaxPooling and GlobalAveragePooling layers in the model.
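
A minimal Keras sketch along these lines (24 output classes as in our data; the input shape and filter counts are illustrative assumptions, not the exact configuration):

```python
from tensorflow.keras import layers, models

def build_conv_model(input_shape=(938, 128), n_classes=24):
    """Simple sequential Conv1D model; input is (time_frames, n_mels), one frequency vector per step."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```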

Now, let’s train this and see how it performs.

Our model performs quite well; we get 95% accuracy on the test data. Now let's see if we can improve on this. We will try an LSTM-based architecture next.

LSTM based model-

This model's performance is slightly worse than our previous model's. Let's try a Conv + LSTM based architecture and see if we can improve the performance.

LSTM + Conv Architecture-

This model performs slightly better than the Conv-based architecture, so we will stick with it and use it for deployment.
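
For completeness, a minimal sketch of a Conv + LSTM architecture of this kind (again, layer sizes are illustrative assumptions):

```python
from tensorflow.keras import layers, models

def build_conv_lstm_model(input_shape=(938, 128), n_classes=24):
    """Conv1D front-end to extract local patterns, followed by an LSTM over the resulting sequence."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(64),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```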

Comparison of all the models we tried-

First, we tried a convolution-based neural network, which performed quite well.
The LSTM-based network gave us a better logloss, but accuracy decreased slightly.
The LSTM + Conv network performed best of all; both logloss and accuracy improved over the previous models.

10. Deployment-

What good is a model if it doesn't reach customers? Most case-study blogs skip this part, but we will leave no stone unturned. We will use Flask for deployment and Heroku for hosting.

Let's create the final pipeline for the model. For this, we will create two Python files: one will contain all the helper functions and the other will be the actual pipeline.

This file contains all the necessary functions; now we will create another file that uses these functions to generate the output.

Ok, so now we are done with the pipeline; it's time to use Flask to deploy, and we also have to build the HTML pages (I will keep them very simple).

Let’s understand what this code actually does.

First, we import all the libraries and the data pipeline from a Python file. After that, we define the different app routes in order to render different HTML pages; in the predict function we get the input from the HTML page, and after that it is all handled by the data pipeline we have already discussed.
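
A stripped-down sketch of what such an app file could look like; the pipeline module, its predict function, and the template names are assumptions for illustration:

```python
from flask import Flask, render_template, request
import pipeline  # hypothetical module wrapping the trained-model pipeline

app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    audio_file = request.files["audio"]           # clip uploaded through the HTML form
    probabilities = pipeline.predict(audio_file)  # per-species probabilities
    return render_template("result.html", probabilities=probabilities)

if __name__ == "__main__":
    app.run(debug=True)
```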

Now, let's proceed to Heroku. Heroku is free and pretty easy to use; we will use GitHub to deploy our model on it. There are basically two extra files we have to create: Procfile (without any extension) and requirements.txt.

The Procfile contains the command that needs to be run first, and requirements.txt contains the names of all the modules that need to be installed. Now we just have to choose 'Connect to GitHub' on Heroku, search for our GitHub repository, and click Deploy. If all the code is correct, the model will be deployed and a link will be generated for your app.
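
For reference, a typical Procfile for a Flask app served through gunicorn is a single line such as `web: gunicorn app:app` (assuming the Flask object is called app and lives in app.py), and requirements.txt is just a plain list of package names such as flask, gunicorn, numpy, librosa and tensorflow, i.e. whatever your pipeline actually imports.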

After doing all the above steps, here is our app. The link may take some time to open; have patience.

Here is one video showing the running demo of our project.

11. Future Work-

  1. Instead of converting the data to spectrograms, we can try something else such as the Fourier Transform directly.

12. References-

  1. https://arxiv.org/pdf/1804.07177v1.pdf

If you have any queries, feel free to comment below, or you can also contact me on LinkedIn. You can find my complete project here.