Speech recognition is the process of transcribing an audio source into text. Sometimes you need an automated way to convert an audio file into text. Several services provide speech-to-text recognition; one of them is offered by Google as part of its cloud platform. While in another tutorial I wrote about using Google Text-to-Speech in Node.js, this tutorial covers the opposite direction: I'm going to show you how to use the Google Speech-to-Text API to transcribe an audio file into text, also in Node.js.
Preparation
1. Create or select a Google Cloud project
A Google Cloud project is required to use this service. Open the Google Cloud console, then create a new project or select an existing one.
2. Enable billing for the project
Like other cloud platforms, Google requires you to enable billing for your project. If you haven't set up billing yet, open the billing page.
3. Enable Google Speech API
To use an API, you must enable it first. Open this page to enable the Speech API.
4. Set up service account for authentication
For authentication, you need a service account. Create a new one on the service account management page and download its credentials, or use a service account you have already created.
In your .env file, add a new variable pointing to the credentials file:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/the/credentials
The .env file must be loaded at runtime, of course, so you need a module that reads .env files, such as dotenv.
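Under the hood, dotenv simply reads the .env file and copies each KEY=value pair into process.env. A minimal sketch of that parsing step (the parseEnv helper below is hypothetical and for illustration only; use the real dotenv module in your project):

```javascript
// Minimal sketch of what dotenv does: parse KEY=value lines
// and collect them into an environment object.
const parseEnv = (text) => {
  const env = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue; // skip blanks and comments
    const idx = trimmed.indexOf('=');
    if (idx === -1) continue; // skip malformed lines
    env[trimmed.slice(0, idx)] = trimmed.slice(idx + 1);
  }
  return env;
};

const env = parseEnv('GOOGLE_APPLICATION_CREDENTIALS=/path/to/the/credentials\n');
console.log(env.GOOGLE_APPLICATION_CREDENTIALS); // /path/to/the/credentials
```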
Dependencies
This tutorial uses @google-cloud/speech. @google-cloud/storage is also required for uploading large audio files. Add the following dependencies to your package.json and run npm install:
"@google-cloud/speech": "~2.0.0"
"@google-cloud/storage": "~1.7.0"
"dotenv": "~4.0.0"
"lodash": "~4.17.10"
Supported Audio Encodings
Not all audio encodings are supported by Google Speech. Below is the list of supported audio encodings.
- LINEAR16
- FLAC
- MULAW
- AMR
- AMR_WB
- OGG_OPUS
- SPEEX_WITH_HEADER_BYTE
For best results, the audio source should use a lossless encoding (FLAC or LINEAR16). If the audio source uses a lossy codec (any format in the list above other than those two recommended formats), recognition accuracy may be reduced.
1. Sync Recognize
If the audio file you want to transcribe is shorter than about 1 minute, you can use synchronous recognition. You'll get the result directly in the response.
require('dotenv').config();
const _ = require('lodash');
const speech = require('@google-cloud/speech');
const fs = require('fs');

// Creates a client
const speechClient = new speech.SpeechClient();

// The path to the audio file to transcribe
const filePath = 'input.wav';

// Reads a local audio file and converts it to base64
const file = fs.readFileSync(filePath);
const audioBytes = file.toString('base64');
const audio = {
  content: audioBytes,
};

// The audio file's encoding, sample rate in hertz, and BCP-47 language code
const config = {
  encoding: 'LINEAR16',
  sampleRateHertz: 24000,
  languageCode: 'en-US',
};
const request = {
  audio,
  config,
};

// Detects speech in the audio file
speechClient
  .recognize(request)
  .then((data) => {
    const results = _.get(data[0], 'results', []);
    const transcription = results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Transcription: ${transcription}`);
  })
  .catch((err) => {
    console.error('ERROR:', err);
  });
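The .then handler above joins the best alternative of each result into one string. The shape of the response can be illustrated with a mock (the transcripts and confidence values below are made up for illustration):

```javascript
// Mock of the response shape returned by recognize(): each result
// holds ranked alternatives, the first one being the best guess.
const mockResponse = {
  results: [
    { alternatives: [{ transcript: 'hello world', confidence: 0.92 }] },
    { alternatives: [{ transcript: 'second sentence', confidence: 0.88 }] },
  ],
};

// Same mapping as in the .then handler: best alternative per result,
// joined with newlines.
const results = mockResponse.results || [];
const transcription = results
  .map(result => result.alternatives[0].transcript)
  .join('\n');

console.log(transcription); // hello world\nsecond sentence
```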
2. Long Running Recognize (Async Recognize)
If the duration of the audio file is longer than 1 minute, you have to use asynchronous recognition, which has a limit of about 180 minutes. The file must be uploaded to Google Cloud Storage first. If you haven't used Google Cloud Storage before, you can read this tutorial first. Then, you can refer to the uploaded file with a special URI: gs://{bucket-name}/{file-name}. You'll get a Promise representing the final result of the job.
require('dotenv').config();
const _ = require('lodash');
const speech = require('@google-cloud/speech');
const cloudStorage = require('@google-cloud/storage');
const path = require('path');

const speechClient = new speech.SpeechClient();

// The path to the audio file to transcribe
const filePath = 'input.wav';

// Google Cloud Storage
const bucketName = 'gcs-demo-bucket'; // Must exist in your Cloud Storage

const uploadToGcs = async () => {
  const storage = cloudStorage({
    projectId: process.env.GOOGLE_CLOUD_PROJECT_ID,
  });
  const bucket = storage.bucket(bucketName);
  const fileName = path.basename(filePath);
  await bucket.upload(filePath);
  return `gs://${bucketName}/${fileName}`;
};

// Upload to Cloud Storage first, then detect speech in the audio file
uploadToGcs()
  .then((gcsUri) => {
    const audio = {
      uri: gcsUri,
    };
    const config = {
      encoding: 'LINEAR16',
      sampleRateHertz: 24000,
      languageCode: 'en-US',
    };
    const request = {
      audio,
      config,
    };
    // Return the chain so errors propagate to the outer catch
    return speechClient.longRunningRecognize(request)
      .then((data) => {
        const operation = data[0];
        // The following Promise represents the final result of the job
        return operation.promise();
      })
      .then((data) => {
        const results = _.get(data[0], 'results', []);
        const transcription = results
          .map(result => result.alternatives[0].transcript)
          .join('\n');
        console.log(`Transcription: ${transcription}`);
      });
  })
  .catch((err) => {
    console.error('ERROR:', err);
  });
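To decide between the two modes programmatically, you can branch on the audio duration. The chooseMode helper below is hypothetical, using the limits mentioned above (~1 minute for sync, ~180 minutes for async):

```javascript
// Hypothetical helper: pick a recognition mode based on duration in seconds.
const chooseMode = (durationSeconds) => {
  if (durationSeconds < 60) return 'recognize';                   // sync
  if (durationSeconds <= 180 * 60) return 'longRunningRecognize'; // async
  return null; // too long: split the audio first
};

console.log(chooseMode(45));  // recognize
console.log(chooseMode(600)); // longRunningRecognize
```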
That's all about how to transcribe an audio source using the Google Speech API in Node.js.