How I built a voice-enabled ChatGPT app (Part I)

Apr 27, 2023

A few weeks ago, I stumbled upon an intriguing freelance project on Upwork: building a phone-screening application powered by ChatGPT. Inspired by the idea, I set out to develop a similar application.

In this two-part article, I will walk you through how I built a quick proof of concept of the application using NodeJS. Based on the insights I gained from that effort, I then developed a cross-platform mobile version using Flutter.

Part I of this article delves into the details of the proof of concept, while Part II focuses on the development of the cross-platform application. So, if you are more interested in the latter, feel free to skip ahead to Part II.

The Idea

The algorithm behind the app is straightforward. It begins by prompting the user for input with a sound cue or a text prompt. The user then responds by speaking to the app. The app transcribes the voice input into text, sends it to ChatGPT via the official OpenAI APIs, and finally speaks the response back to the user.
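To make the flow concrete, here is a rough sketch of the loop the app runs. The app object is a hypothetical wrapper bundling the helper functions shown later in this article, and the file names are placeholders; the orchestration code itself is illustrative rather than the exact implementation.

async function runConversation(app: any) {
  // 1. Record the user's voice until two seconds of silence
  await app.recordAudio("input.wav");
  // 2. Transcribe the recording with Whisper
  const question = await app.transcribe("input.wav");
  // 3. Send the transcription to ChatGPT
  const answer = await app.askChatGPT([{ role: "user", content: question }]);
  // 4. Synthesize the reply with Google TTS and play it back
  await app.synthesizeSpeech(answer, "output.mp3");
  await app.playAudio("output.mp3");
}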

The proof of concept

To test the feasibility of the idea, the quickest approach was to build a rapid proof of concept in the environment I was most comfortable with: NodeJS.

I began by breaking it down into three distinct parts: voice recognition, interaction with chatGPT, and text-to-speech conversion.

In the rest of this first part, I will discuss how I tackled each of these components to create the proof of concept.

Voice recognition

To transcribe the voice prompt into text, I used OpenAI's Whisper API. This API accepts an audio file as input and returns a text transcription of the spoken words.
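For reference, a minimal transcription helper could look like the sketch below. It assumes the same openai Node client instance used for the chat completion later in this article, and "whisper-1" as the transcription model; the method name transcribe is my own.

import { createReadStream } from "fs";

async transcribe(fileName: string): Promise<string> {
  // The openai client expects a readable file stream and a model name
  const response = await this.openai.createTranscription(
    createReadStream(fileName) as any,
    "whisper-1"
  );
  // The transcription text is returned in the response body
  return response.data.text;
}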

To capture audio from the user, I employed a third-party library called node-audiorecorder. With this library, I only needed to start the recording; the recorder stops automatically after two seconds of silence and produces the audio file. Here is the function used to record the audio file:

public async recordAudio(fileName: string) {
  // createWriteStream comes from Node's built-in "fs" module
  const fileStream = createWriteStream(fileName, { encoding: "binary" });

  // Pipe the recorder's output stream into the file
  this.audioRecorder.start().stream().pipe(fileStream);

  // Resolve once the recorder stops (after silence) and the file is closed
  return new Promise((resolve) =>
    fileStream.on("close", () => {
      resolve("done");
    })
  );
}
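For completeness, here is roughly how the recorder instance referenced above can be configured. The exact options I used may differ, but node-audiorecorder exposes a silence setting (in seconds) along with the recording program to shell out to:

import AudioRecorder from "node-audiorecorder";

// A possible configuration: record with SoX and stop after 2 seconds of silence
const audioRecorder = new AudioRecorder(
  {
    program: "sox", // also supports "rec" and "arecord"
    silence: 2,     // seconds of silence before recording stops
    rate: 16000,    // sample rate in Hz
    channels: 1,    // mono audio
  },
  console
);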

Interaction with ChatGPT

Interacting with ChatGPT was by far the simplest part of the application. All I had to do was forward the transcribed text from the previous step to the API, which returned the response as text.

Here's a code snippet of the function that executes this step:

async askChatGPT(
  messages: { role: "system" | "user" | "assistant"; content: string }[]
): Promise<string> {
  try {
    const response = await this.openai.createChatCompletion({
      model: "gpt-3.5-turbo",
      messages,
    });
    return response.data.choices[0].message.content;
  } catch (error: any) {
    console.log(
      "An error occurred when communicating with OpenAI",
      error.response
    );
    return "Error";
  }
}
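For context, a call to this function might look something like the following. The system prompt here is purely illustrative; the actual prompt for a phone-screening scenario would be more elaborate.

const reply = await this.askChatGPT([
  { role: "system", content: "You are a friendly phone-screening assistant." },
  { role: "user", content: transcribedText },
]);
console.log(reply);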

Text to speech

Lastly, I incorporated the Google Cloud Text-to-Speech API to enable the app to speak the response out loud to the user.

The API accepts text input and some configuration parameters, and returns the synthesized speech as binary audio content, which can then be written to a file on disk.

Here's the function that was used to synthesize the text:

public async synthesizeSpeech(text: string, output: string) {
  // Construct the request
  const request = {
    input: { text: text },
    // Select the language and SSML voice gender (optional)
    voice: { languageCode: "en-US", ssmlGender: "MALE" },
    // Select the type of audio encoding
    audioConfig: { audioEncoding: "MP3" },
  } as any;

  // Perform the text-to-speech request
  const [response] = await this.client.synthesizeSpeech(request);

  // Write the binary audio content to a local file
  // (writeFile comes from "fs", promisify from "util")
  const wf = promisify(writeFile);
  await wf(output, response.audioContent, "binary");
  console.log(`Audio content written to file: ${output}`);
}

To play back the synthesized audio on the system's speakers, I leveraged another third-party npm library called play-sound.
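A minimal sketch of how play-sound can be used for this step follows; the playAudio wrapper is my own naming, not part of the library.

import playSound from "play-sound";

const player = playSound();

// Resolve once the external audio player process finishes, reject on error
function playAudio(fileName: string): Promise<void> {
  return new Promise((resolve, reject) => {
    player.play(fileName, (err) => (err ? reject(err) : resolve()));
  });
}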

Authentication

Authentication is required before communicating with both Google's Cloud API and OpenAI's API.

OpenAI's API only requires an API key to be included in the configuration object when initializing a client instance. The recommended practice is to store the key as an environment variable and access it using the dotenv library.
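As a reference, initializing the client used in the snippets above looks roughly like this with the v3 openai SDK; the environment variable name OPENAI_API_KEY is a common convention rather than a requirement.

import { Configuration, OpenAIApi } from "openai";
import * as dotenv from "dotenv";

// Load variables from a local .env file into process.env
dotenv.config();

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);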

In contrast, Google's API supports several authentication methods, which are described in its official documentation.
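One common option for a local proof of concept is Application Default Credentials: point the GOOGLE_APPLICATION_CREDENTIALS environment variable at a service-account key file and the client library picks it up automatically. A minimal sketch:

// Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account JSON key,
// e.g. export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
import { TextToSpeechClient } from "@google-cloud/text-to-speech";

const client = new TextToSpeechClient();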

Takeaway

In this article, I described how I built a voice-enabled ChatGPT application using NodeJS. By leveraging third-party libraries and APIs, I was able to quickly put together this proof of concept.

However, this is only the beginning. In the next article, I will delve deeper into the development of the cross-platform mobile version using Flutter, and discuss the challenges I faced and the lessons I learned along the way.

Join me as I explore the world of mobile development and build a fully functional voice-enabled ChatGPT application.