TensorFlow Audio Recognition

Last updated on Oct 25 2021
Ashutosh Wakiroo


Audio recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling computers to recognize and translate spoken language into text. Speech recognition is commonly used to operate a device, perform commands, and write without using a keyboard, mouse, or buttons.
These days, it is done on a computer with ASR (automatic speech recognition) software. Many ASR programs require the user to "train" them to recognize his or her voice so that they can convert speech to text more accurately.
For example, we can say "open Google Chrome," and the computer will open the Chrome web browser.
The first ASR device appeared in 1952 and recognized single digits spoken by any user. Today, ASR programs are used in many industries, including the military, healthcare, telecommunications, and personal computing.
Examples where we may have used voice recognition: Google Voice, automated phone systems, digital assistants, and car Bluetooth.

Types of voice recognition systems

Automatic speech recognition is one example of voice recognition. Below are some other kinds of voice recognition systems.
Speaker-dependent system - The voice recognition software requires training before it can be used: the user must read a series of words and phrases.
Speaker-independent system - The voice recognition software recognizes most users' voices with no training.
Discrete speech recognition - The user must pause between words so that the software can identify each word separately.
Continuous speech recognition - The software can understand a normal rate of speaking.
Natural language - The software not only understands the voice but can also return answers to questions or other queries.
Like MNIST for images, this should give us a basic understanding of the techniques involved. Once we've completed this TensorFlow Audio Recognition tutorial, we'll have a model that tries to classify a one-second audio clip as one of the following:
• Silence
• An unknown word
• Yes
• No
• Up
• Down
• Left
• Right
• On
• Off
• Stop
• Go

Training in TensorFlow Audio Recognition

To start the training process in TensorFlow Audio Recognition, navigate to the TensorFlow source directory and run the following:
python tensorflow/examples/speech_commands/train.py
This command downloads the speech commands dataset, which consists of 65,000 WAV audio files of people saying 30 different words.
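Each clip in the dataset is a one-second, 16 kHz, 16-bit mono WAV file. As a small illustration (using only the Python standard library; the commented path reuses the dataset location mentioned later in this tutorial), one clip can be decoded into floating-point samples like this:

```python
import struct
import wave

def read_wav(path):
    # Read a 16-bit mono WAV clip into a list of floats in [-1.0, 1.0],
    # and return the samples together with the sample rate.
    with wave.open(path, "rb") as f:
        n = f.getnframes()
        raw = f.readframes(n)
        samples = struct.unpack("<%dh" % n, raw)
        return [s / 32768.0 for s in samples], f.getframerate()

# e.g. one clip from the downloaded dataset:
# samples, rate = read_wav("/tmp/speech_dataset/happy/ab00c4b2_nohash_0.wav")
```

The division by 32768 converts the signed 16-bit integer samples into the float range most signal-processing code expects.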

Confusion Matrix in TensorFlow

The first 400 steps will give us output like:

I0730 17:57:38.073667 55030 train.py:243] Confusion Matrix:
[[258 0 0 0 0 0 0 0 0 0 0 0]
 [ 7 6 76 94 7 49 1 15 50 2 0 11]
 [ 10 1 107 80 13 33 0 13 10 1 0 4]
 [ 1 3 16 164 6 48 0 5 10 1 0 17]
 [ 15 1 17 114 44 13 0 9 22 5 0 9]
 [ 1 1 6 97 3 86 1 12 46 0 0 10]
 [ 8 6 84 86 13 24 1 9 9 1 6 0]
 [ 9 3 32 112 9 26 1 36 19 0 0 9]
 [ 9 2 12 94 9 49 0 6 72 0 0 2]
 [ 16 1 39 75 29 52 0 6 37 9 0 3]
 [ 15 6 17 71 60 37 0 6 32 3 1 9]
 [ 11 1 6 151 5 43 0 8 16 0 0 20]]

We see that the first section is a matrix. Each column represents a set of samples that were predicted to be each keyword. In the matrix above, the first column represents all the clips that were predicted to be silence, the second those predicted to be unknown words, the third "yes", and so on.
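The matrix itself is simple to compute: count, for each actual label (row), how often each label was predicted (column). A minimal sketch with made-up labels for three classes (the class names here are only for illustration):

```python
import numpy as np

def confusion_matrix(actual, predicted, num_classes):
    # Rows are the actual labels, columns the predicted labels, so
    # column 0 collects every clip the model called class 0 ("silence"),
    # column 1 everything it called class 1 ("unknown"), and so on.
    m = np.zeros((num_classes, num_classes), dtype=int)
    for a, p in zip(actual, predicted):
        m[a, p] += 1
    return m

# Tiny illustration with 3 classes (0=silence, 1=unknown, 2="yes"):
actual    = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 1]
print(confusion_matrix(actual, predicted, 3))
```

A perfect model would put every count on the diagonal; large off-diagonal entries, like those in the 400-step matrix above, show which words are being confused with each other.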

TensorBoard in TensorFlow

We can visualize the training progress using TensorBoard. The events are saved to /tmp/retrain_logs and loaded with:
tensorboard --logdir /tmp/retrain_logs


Finished Training in Audio Recognition

After some hours of training, the script completes about 20,000 steps and prints out the final confusion matrix along with the accuracy percentage.
We can export the model to mobile devices in a compact form using:
python tensorflow/examples/speech_commands/freeze.py \
--start_checkpoint=/tmp/speech_commands_train/conv.ckpt-18000 \
--output_file=/tmp/my_frozen_graph.pb

Working of Speech Recognition Model

The model is based on the kind of convolutional neural network that will be familiar to anyone who has worked with image recognition, as we have in one of the previous tutorials. Audio, however, is a one-dimensional signal, not a 2D spatial problem.
We solve this by defining a time window in which our spoken words should fit, and converting the signal in that window into an image. We do this by grouping the incoming audio into short segments and calculating the strength of each frequency. Each segment becomes a vector of numbers, and these vectors are arranged in time order to form a 2D array. This array of values can be treated like a one-channel image, also known as a spectrogram. We can view what kind of image an audio sample produces with:

bazel run tensorflow/examples/wav_to_spectrogram:wav_to_spectrogram -- \
--input_wav=/tmp/speech_dataset/happy/ab00c4b2_nohash_0.wav \
--output_png=/tmp/spectrogram.png
/tmp/spectrogram.png will then show us the spectrogram image.
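The grouping-and-FFT step described above can be sketched in plain NumPy. This is only an illustration of the idea, not the exact implementation the wav_to_spectrogram tool uses; the window size and stride here are arbitrary choices:

```python
import numpy as np

def spectrogram(samples, window_size=256, stride=128):
    # Slice the 1-D signal into short overlapping segments, window each
    # one, and take the magnitude of its FFT. Rows are time steps and
    # columns are frequency bins -- a one-channel "image".
    frames = []
    for start in range(0, len(samples) - window_size + 1, stride):
        frame = samples[start:start + window_size] * np.hanning(window_size)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A one-second 440 Hz tone sampled at 16 kHz, the same rate as the
# dataset clips:
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = spectrogram(tone)
print(spec.shape)  # (124, 129): 124 time steps, 129 frequency bins
```

For a pure tone, almost all the energy lands in the frequency bin nearest 440 Hz, which is exactly the kind of structure the spectrogram image makes visible.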

This is a 2D, one-channel representation, so we can treat it like an image, too.
The image produced is then fed into a multi-layer convolutional neural network, with a fully-connected layer followed by a softmax at the end.
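As a rough illustration of that architecture, here is a hypothetical Keras sketch: two convolution-and-pooling stages over the one-channel spectrogram, then a fully-connected layer and a softmax over the 12 output classes. The input dimensions and layer sizes are illustrative assumptions, not the exact values used by the tutorial's train.py:

```python
import tensorflow as tf

def build_model(time_steps=98, freq_bins=40, num_classes=12):
    # Treat the spectrogram as a (time, frequency, 1) one-channel image.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, freq_bins, 1)),
        tf.keras.layers.Conv2D(64, (8, 20), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (4, 10), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        # Softmax over the 12 classes (silence, unknown, and the 10 words).
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
print(model.output_shape)  # (None, 12)
```

The wide, short convolution kernels reflect that time and frequency are different axes here, unlike the square kernels typical of photo classification.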

Command Recognition in TensorFlow

Unknown Class
Our app may hear sounds that are not part of our training set. To make the network learn which sounds to ignore, we need to provide clips of audio that do not belong to any of our classes. To do this, we could create folders such as "boo" and "meow" and fill them with animal noises. The speech commands dataset includes 20 words in its unknown classes, including the digits zero through nine along with random names.
Background Noise
There is background noise in any captured audio clip. To build a model that is robust to such noise, we need to train it against recorded audio with similar properties. The files in the speech commands dataset were recorded on a variety of devices and in many different surroundings, which helps the training.
During training, we can then randomly choose small extracts from the background files and mix them into the clips at a low volume.
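That mixing step can be sketched as follows. This is a hypothetical helper, assuming both the clip and the background recording are float arrays in [-1, 1]:

```python
import numpy as np

def mix_background(clip, background, volume=0.1, rng=None):
    # Pick a random extract of the background recording the same length
    # as the clip, scale it down, mix it in, and clip to [-1, 1].
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(background) - len(clip) + 1)
    noisy = clip + volume * background[start:start + len(clip)]
    return np.clip(noisy, -1.0, 1.0)

clip = np.zeros(16000)  # a silent 1-second clip at 16 kHz
background = np.random.default_rng(0).uniform(-1, 1, 160000)
noisy = mix_background(clip, background, volume=0.1)
```

Drawing a fresh random extract each time a clip is seen means the model never hears exactly the same noisy version twice, which is what makes the augmentation effective.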
Customizing
The default model used by the script is large, with 940k weight parameters, so it requires too many calculations to run at interactive speeds on devices with limited resources. The other options to counter this are:
low_latency_conv: The accuracy is lower than conv, and the number of weight parameters is nearly the same, but it is much faster. We specify --model_architecture=low_latency_conv on the command line to use this model, and should also set the learning rate (e.g. 0.01) and the number of training steps (e.g. 20,000).
low_latency_svdf: The accuracy is lower than conv, but it uses only 750k parameters and has an optimized execution. Pass --model_architecture=low_latency_svdf on the command line to use this model, specifying the learning rate and the number of steps as well:

python tensorflow/examples/speech_commands/train.py \
--model_architecture=low_latency_svdf \
--how_many_training_steps=100000,35000 \
--learning_rate=0.01,0.005

So, this brings us to the end of the blog. This Tecklearn 'TensorFlow Audio Recognition' blog helps you with commonly asked questions if you are looking for a job in Artificial Intelligence. If you wish to learn Artificial Intelligence and build a career in the AI or Machine Learning domain, then check out our interactive Artificial Intelligence and Deep Learning with TensorFlow Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/artificial-intelligence-and-deep-learning-with-tensorflow/

Artificial Intelligence and Deep Learning with TensorFlow Training

About the Course

Tecklearn's Artificial Intelligence and Deep Learning with TensorFlow course is curated by industry professionals as per industry requirements and demands, and aligned with the latest best practices. You'll master convolutional neural networks (CNN), TensorFlow, TensorFlow code, transfer learning, graph visualization, recurrent neural networks (RNN), Deep Learning libraries, GPUs in Deep Learning, the Keras and TFLearn APIs, backpropagation, and hyperparameters via hands-on projects. The trainee will learn AI by mastering natural language processing, deep neural networks, predictive analytics, reinforcement learning, and the other skills needed to shine in this field.

Why Should you take Artificial Intelligence and Deep Learning with TensorFlow Training?

• According to Paysa.com, an Artificial Intelligence Engineer earns an average of $171,715, ranging from $124,542 at the 25th percentile to $201,853 at the 75th percentile, with top earners earning more than $257,530.
• Worldwide spending on Artificial Intelligence systems will be nearly $98 billion in 2023, according to a new IDC Spending Guide, growing at a CAGR of 28.5%.
• IBM, Amazon, Apple, Google, Facebook, Microsoft, Oracle and almost all the leading companies are working on Artificial Intelligence to innovate future technologies.

What you will Learn in this Course?

Introduction to Deep Learning and AI
• What is Deep Learning?
• Advantage of Deep Learning over Machine learning
• Real-Life use cases of Deep Learning
• Review of Machine Learning: Regression, Classification, Clustering, Reinforcement Learning, Underfitting and Overfitting, Optimization
• Pre-requisites for AI & DL
• Python Programming Language
• Installation & IDE
Environment Set Up and Essentials
• Installation
• Python – NumPy
• Python for Data Science and AI
• Python Language Essentials
• Python Libraries – Numpy and Pandas
• Numpy for Mathematical Computing
More Prerequisites for Deep Learning and AI
• Pandas for Data Analysis
• Machine Learning Basic Concepts
• Normalization
• Data Set
• Machine Learning Concepts
• Regression
• Logistic Regression
• SVM – Support Vector Machines
• Decision Trees
• Python Libraries for Data Science and AI
Introduction to Neural Networks
• Creating Module
• Neural Network Equation
• Sigmoid Function
• Multi-layered Perceptron
• Weights, Biases
• Activation Functions
• Gradient Descent and Error Function
• Epochs, Forward & Backward Propagation
• What is TensorFlow?
• TensorFlow code-basics
• Graph Visualization
• Constants, Placeholders, Variables
Multi-layered Neural Networks
• Error Backpropagation Issues
• Dropouts
Regularization techniques in Deep Learning
Deep Learning Libraries
• Tensorflow
• Keras
• OpenCV
• SkImage
• PIL
Building of Simple Neural Network from Scratch from Simple Equation
• Training the model
Dual Equation Neural Network
• TensorFlow
• Predicting Algorithm
Introduction to Keras API
• Define Keras
• How to compose Models in Keras
• Sequential Composition
• Functional Composition
• Predefined Neural Network Layers
• What is Batch Normalization
• Saving and Loading a model with Keras
• Customizing the Training Process
• Using TensorBoard with Keras
• Use-Case Implementation with Keras
GPU in Deep Learning
• Introduction to GPUs and how they differ from CPUs
• Importance of GPUs in training Deep Learning Networks
• The GPU constituent with simpler core and concurrent hardware
• Keras Model Saving and Reusing
• Deploying Keras with TensorBoard
Keras Cat Vs Dog Modelling
• Activation Functions in Neural Network
Optimization Techniques
• Some Examples for Neural Network
Convolutional Neural Networks (CNN)
• Introduction to CNNs
• CNNs Application
• Architecture of a CNN
• Convolution and Pooling layers in a CNN
• Understanding and Visualizing a CNN
RNN: Recurrent Neural Networks
• Introduction to RNN Model
• Application use cases of RNN
• Modelling sequences
• Training RNNs with Backpropagation
• Long Short-Term memory (LSTM)
• Recursive Neural Tensor Network Theory
• Recurrent Neural Network Model
Application of Deep Learning in image recognition, NLP and more
Real world projects in recommender systems and others
Got a question for us? Please mention it in the comments section and we will get back to you.

 
