Speech Emotion Recognition

What is it?

Speech Emotion Recognition, abbreviated as SER, is the task of recognizing human emotion and affective states from speech. It capitalizes on the fact that the voice often reflects underlying emotion through tone and pitch. This is the same phenomenon that animals like dogs and horses rely on to understand human emotion.

Technologies used -
- VS Code : the IDE used for all the programming work.
- Python : the language used for programming.
- Kaggle : the source of all the training data.
- GitHub : where our programming project is stored.

What is Speech?

Simply put, speech is the communication or expression of thoughts in spoken words.
image" Feelings are something you have; not something you   are. " 
image
image

So how can speech emotion recognition create a great impact?


Speech is not only a medium of communication; it carries a lot of information beyond the words and their meanings. Along with the words, speech carries emotion and can reveal a great deal about the speaker's mental state. A human can easily recognize these cues just by listening, but it is not easy for software to do the same so precisely. The aim is to deploy a model that can precisely analyze speech in order to infer the user's surroundings and current mental state, so that necessary actions can be taken accordingly.

That is why emotion detection has become an important marketing strategy, one in which the mood of the customer plays a key role. Detecting a person's current emotion and then suggesting an appropriate product, or helping them accordingly, can increase demand for the company's products.

How does it work?


In order to predict emotion from speech, the system performs the following steps (a minimal code sketch follows the list):

1. Taking audio input from the user.
2. Analyzing the audio signal.
3. Masking and cleaning the audio.
4. Extracting features.
5. Loading our pretrained model.
6. Predicting the emotion.
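
As a rough illustration, the steps above can be mapped onto a short Python pipeline. This is a minimal sketch, not the project's exact code: the model file emotion_model.h5, the label order, and the choice of 40 MFCCs averaged over time are all assumptions.

```python
import numpy as np
import librosa                                  # audio loading and features
from tensorflow.keras.models import load_model  # assumes a saved Keras model

# Assumed label order; it must match whatever the model was trained with.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def extract_features(path, n_mfcc=40):
    # Steps 1-3: take the audio input, then trim leading/trailing silence
    # as a simple stand-in for the masking and cleaning step.
    signal, sr = librosa.load(path, sr=22050)
    signal, _ = librosa.effects.trim(signal, top_db=25)
    # Step 4: extract MFCCs and average them over time so every clip
    # yields a fixed-size feature vector.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc.T, axis=0)

def predict_emotion(path, model_path="emotion_model.h5"):
    # Step 5: load the pretrained model (file name is an assumption).
    model = load_model(model_path)
    features = extract_features(path).reshape(1, -1)
    # Step 6: pick the emotion with the highest predicted probability.
    probs = model.predict(features)[0]
    return EMOTIONS[int(np.argmax(probs))]

print(predict_emotion("sample.wav"))  # hypothetical input file
```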


Creating the Custom Dataset 

These are the datasets used to build the custom dataset for training our model (a sketch for combining them follows the list).
1. RAVDESS
This dataset includes around 1500 audio files from 24 different actors, 12 male and 12 female, each recording short clips in 8 different emotions.
2. CREMA
CREMA-D is a dataset of 7,442 original clips from 91 actors: 48 male and 43 female, between the ages of 20 and 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences.
3. TESS
The TESS Dataset is a collection of audio clips of 2 women expressing 7 different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).
4. SAVEE
The SAVEE database was recorded from four native English male speakers, postgraduate students and researchers at the University of Surrey, aged from 27 to 31 years.
5. ASVP-ESD
The Audio, Speech, and Vision Processing Lab Emotional Sound database (ASVP-ESD) contains audio files grouped into 130 folders. As it is a realistic dataset, some folders contain dialogue or several people interacting in the audio. Its speech and non-speech emotional sounds cover a total of 12 different emotions plus breath.
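
To turn these sources into one custom dataset, each file needs a path and an emotion label. The sketch below is illustrative only: it handles RAVDESS, whose filenames encode the emotion as the third dash-separated field (e.g. 03-01-06-01-02-01-12.wav is fearful); the data/ravdess directory and the use of pandas are assumptions, and the other four datasets use different naming schemes that would need their own parsers.

```python
import os
import pandas as pd

# RAVDESS emotion codes (third field of the dash-separated filename).
RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy",
                    "04": "sad", "05": "angry", "06": "fearful",
                    "07": "disgust", "08": "surprised"}

def collect_ravdess(root="data/ravdess"):  # root path is an assumption
    rows = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wav"):
                continue
            code = name.split("-")[2]      # third field = emotion code
            rows.append({"path": os.path.join(dirpath, name),
                         "emotion": RAVDESS_EMOTIONS[code],
                         "source": "RAVDESS"})
    return pd.DataFrame(rows)

# CREMA-D, TESS, SAVEE, and ASVP-ESD would each get a similar collector
# for their own naming scheme; concatenating the frames gives the final
# custom dataset.
dataset = collect_ravdess()
print(dataset.head())
```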