Building Text-to-Speech Apps for the Web
Published: July 9, 2018
If you have interacted with many apps over the years, there is a very high chance that some of them provided a voice experience. It could be an app with text-to-speech functionality, like reading your text messages or notifications aloud. It could also be an app with voice recognition functionality, like the popular Siri or Google Assistant.
With the advent of HTML5, the number of APIs available on the web platform has grown rapidly. Over the years, we have come across APIs such as WebSocket, File, Geolocation, Notification, Battery, Vibration, DeviceOrientation, WebRTC, etc. Some of these APIs have gained wide support across various browsers. However, most of them are still in the experimental phase of development and should be used with much caution.
Web Speech API
The Web Speech API is a set of APIs developed to make it easy to seamlessly build all kinds of voice applications and experiences for the web. These APIs are still pretty experimental, although support for most of them is growing across all the modern browsers.
The Web Speech API is broken into two major interfaces:
- SpeechSynthesis - For text-to-speech applications. This allows apps to read out their text content using the device's speech synthesiser. The available voice types are represented by a SpeechSynthesisVoice object, while the text to be uttered is represented by a SpeechSynthesisUtterance object. See the support table for the SpeechSynthesis interface to learn more about browser support.
- SpeechRecognition - For applications that require asynchronous voice recognition. This allows apps to recognize voice context from an audio input. A SpeechRecognition object can be created using the constructor. The SpeechGrammar interface exists for representing the set of grammar that the app should recognise. See the support table for the SpeechRecognition interface to learn more about browser support.
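This tutorial focuses on the SpeechSynthesis interface, but to give a feel for the other half of the API, here is a minimal sketch of starting a recognition session. It assumes a browser that exposes the constructor (Chrome ships it behind a webkit prefix):

// A minimal SpeechRecognition sketch (not used in the rest of this tutorial).
// Chrome exposes the constructor as webkitSpeechRecognition.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (SpeechRecognition) {
  var recognition = new SpeechRecognition();
  recognition.lang = 'en-US';

  // Log the best transcript of the most recent result
  recognition.onresult = function (evt) {
    var result = evt.results[evt.results.length - 1];
    console.log(result[0].transcript);
  };

  recognition.start();
}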
In this tutorial, we will learn how we can use the SpeechSynthesis interface to build a simple text-to-speech app. Here is a demo screenshot of what the app will look like (without the sound):
The SpeechSynthesis Interface
The SpeechSynthesis interface is a very simple one with just a couple of methods and properties. To get a complete list of the available properties, methods and events, see the MDN documentation.
Getting a Reference
Getting a reference to a SpeechSynthesis object is very simple. The following code snippet shows how it can be done.
var synthesis = window.speechSynthesis;
It is very useful to check if SpeechSynthesis is supported by the browser before using the functionality it provides. The following code snippet shows how to check for browser support.
if ('speechSynthesis' in window) {
  var synthesis = window.speechSynthesis;
  // do stuff with synthesis
} else {
  console.log('Text-to-speech not supported.');
}
Getting Available Voices
Let's build on our already existing code to get the available speech voices. The getVoices() method returns a list of SpeechSynthesisVoice objects representing all the available voices on the device.
Take a look at the following code snippet:
if ('speechSynthesis' in window) {
  var synthesis = window.speechSynthesis;

  // Regex to match all English language tags e.g en, en-US, en-GB
  var langRegex = /^en(-[a-z]{2})?$/i;

  // Get the available voices and filter the list to only have English speakers
  var voices = synthesis.getVoices().filter(voice => langRegex.test(voice.lang));

  // Log the properties of the voices in the list
  voices.forEach(function(voice) {
    console.log({
      name: voice.name,
      lang: voice.lang,
      uri: voice.voiceURI,
      local: voice.localService,
      default: voice.default
    })
  });
} else {
  console.log('Text-to-speech not supported.');
}
In the above snippet, we get the list of available voices on the device, and filter the list using the langRegex regular expression to ensure that we get voices for only English speakers. Finally, we loop through the voices in the list and log the properties of each to the console.
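The exact output depends on the device and browser, but each logged entry will have a shape along these lines (the values here are purely illustrative):

// Illustrative only - the available voices differ per device and browser
{
  name: 'Google US English',
  lang: 'en-US',
  uri: 'Google US English',
  local: false,
  default: true
}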
Constructing Speech Utterances
Let's go ahead and see how we can construct speech utterances using the SpeechSynthesisUtterance constructor, setting values for the available properties. The following code snippet creates a very simple speech utterance for reading the text "Hello World".
if ('speechSynthesis' in window) {
  var synthesis = window.speechSynthesis;

  // Get the first `en` language voice in the list
  var voice = synthesis.getVoices().filter(function(voice) {
    return voice.lang === 'en';
  })[0];

  // Create an utterance object
  var utterance = new SpeechSynthesisUtterance('Hello World');

  // Set utterance properties
  utterance.voice = voice;
  utterance.pitch = 1.5;
  utterance.rate = 1.25;
  utterance.volume = 0.8;

  // Speak the utterance
  synthesis.speak(utterance);
} else {
  console.log('Text-to-speech not supported.');
}
Here, we first get the first en language voice from the list of available voices. Next, we create a new utterance using the SpeechSynthesisUtterance constructor. Then we set some of the properties on the utterance object, such as voice, pitch, rate and volume. Finally, we speak the utterance using the speak() method of SpeechSynthesis.
Utterance Limitation
There is a limit to the size of the text that can be spoken in a single utterance: the maximum length is 32,767 characters.
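If you ever need to speak text longer than that, one approach is to split it into smaller chunks and queue an utterance for each. Here is a minimal sketch; the chunk size and the naive sentence splitting are arbitrary choices, not part of the API:

// A sketch for speaking long text in chunks.
// MAX_CHUNK is an arbitrary safety margin below the 32,767-character limit.
var MAX_CHUNK = 20000;

function speakLongText(synthesis, text) {
  // Naive split on sentence-ending punctuation; a single sentence
  // longer than MAX_CHUNK would still overflow
  var sentences = text.match(/[^.!?]+[.!?]*/g) || [text];
  var chunk = '';

  sentences.forEach(function (sentence) {
    if ((chunk + sentence).length > MAX_CHUNK) {
      synthesis.speak(new SpeechSynthesisUtterance(chunk));
      chunk = '';
    }
    chunk += sentence;
  });

  if (chunk) synthesis.speak(new SpeechSynthesisUtterance(chunk));
}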
Notice that we passed the text to be uttered to the constructor. You can also set the text to be uttered by setting the text property of the utterance object. This overrides whatever text was passed to the constructor. Here is a simple example:
var synthesis = window.speechSynthesis;
var utterance = new SpeechSynthesisUtterance("Hello World");
// This overrides the text "Hello World" and is uttered instead
utterance.text = "My name is Glad.";
synthesis.speak(utterance);
Speaking an Utterance
In the previous code snippet, we saw how to speak utterances by calling the speak() method on the SpeechSynthesis instance. We simply pass the SpeechSynthesisUtterance instance as an argument to the speak() method to speak the utterance.
var synthesis = window.speechSynthesis;
var utterance1 = new SpeechSynthesisUtterance("Hello World");
var utterance2 = new SpeechSynthesisUtterance("My name is Glad.");
var utterance3 = new SpeechSynthesisUtterance("I'm a web developer from Nigeria.");
synthesis.speak(utterance1);
synthesis.speak(utterance2);
synthesis.speak(utterance3);
There are a couple of other things you can do with the SpeechSynthesis instance, such as pausing, resuming and cancelling utterances. Hence, the pause(), resume() and cancel() methods are also available on the SpeechSynthesis instance.
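As a quick illustration, here is a small sketch that pauses and resumes a long utterance using timers (the timings are arbitrary, purely for demonstration):

var synthesis = window.speechSynthesis;
var utterance = new SpeechSynthesisUtterance('This is a fairly long sentence that gives us enough time to pause and resume the voice output.');

synthesis.speak(utterance);

// Pause the voice output after 1 second...
setTimeout(function () { synthesis.pause(); }, 1000);

// ...then resume it 2 seconds later.
// Calling synthesis.cancel() here would instead stop speaking
// and remove all utterances from the queue.
setTimeout(function () { synthesis.resume(); }, 3000);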
Building the Text-to-Speech App
Getting Started
We have seen the basic aspects of the SpeechSynthesis interface. We will now start building our text-to-speech application. Before we begin, ensure that you have Node and NPM installed on your machine.
Run the following commands on your terminal to set up a project for the app and install the dependencies.
# Create new project directory
mkdir web-speech-app
# cd into the project directory
cd web-speech-app
# Initialize project
npm init -y
# Install dependencies
npm install express cors axios
Go ahead and modify the "scripts" section of the package.json file to look like the following snippet:
/* package.json */
"scripts": {
  "start": "node server.js"
}
Setting up the Server
Now that we have initialized a project for our application, we will proceed to set up a simple server for our app using Express.
Create a new server.js file and add the following content to it:
/* server.js */
const cors = require('cors');
const path = require('path');
const axios = require('axios');
const express = require('express');

const app = express();
const PORT = process.env.PORT || 5000;

app.set('port', PORT);

// Enable CORS (Cross-Origin Resource Sharing)
app.use(cors());

// Serve static files from the /public directory
app.use('/', express.static(path.join(__dirname, 'public')));

// A simple endpoint for fetching a random quote from QuotesOnDesign
app.get('/api/quote', (req, res) => {
  axios.get('http://quotesondesign.com/wp-json/posts?filter[orderby]=rand&filter[posts_per_page]=1')
    .then((response) => {
      const [ post ] = response.data;
      const { title, content } = post || {};

      return (title && content)
        ? res.json({ status: 'success', data: { title, content } })
        : res.status(500).json({ status: 'failed', message: 'Could not fetch quote.' });
    })
    .catch(err => res.status(500).json({ status: 'failed', message: 'Could not fetch quote.' }));
});

app.listen(PORT, () => console.log(`> App server is running on port ${PORT}.`));
Here, we set up a simple Node server using Express. We enable CORS (Cross-Origin Resource Sharing) using the cors() middleware. We also use the express.static() middleware to serve static files from the /public directory in the project root. This will enable us to serve the index page which we will be creating soon.

Finally, we set up a simple GET /api/quote route for fetching a random quote from the QuotesOnDesign API service. Notice also that we are using axios (a promise-based HTTP client library) to make the HTTP request.
Here is what a sample response from the QuotesOnDesign API looks like:
[
  {
    "ID": 2291,
    "title": "Victor Papanek",
    "content": "<p>Any attempt to separate design, to make it a thing-by-itself, works counter to the inherent value of design as the primary, underlying matrix of life.</p>\n",
    "link": "https://quotesondesign.com/victor-papanek-4/",
    "custom_meta": {
      "Source": "<a href=\"http://www.amazon.com/Design-Real-World-Ecology-Social/dp/0897331532\">book</a>"
    }
  }
]
When we fetch a quote successfully, we return the quote's title and content in the data field of the JSON response. Otherwise, we return a failure JSON response with a 500 HTTP status code.
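Based on the route handler above, a successful response from our own /api/quote endpoint will therefore be shaped like this (the quote data is just an example):

{
  "status": "success",
  "data": {
    "title": "Victor Papanek",
    "content": "<p>Any attempt to separate design, to make it a thing-by-itself, works counter to the inherent value of design as the primary, underlying matrix of life.</p>\n"
  }
}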
Setting up the Index Page
Next, we will create a very simple index page for the app view. First, create a new public folder in the root of your project. Then, create a new index.html file in the newly created folder and add the following content to it:
<!-- public/index.html -->
<html>
<head>
  <title>Daily Quotes</title>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.1/css/bootstrap.min.css" integrity="sha384-WskhaSGFgHYWDcbwN70/dfYBj47jz9qbsMId/iRN3ewGhXQFZCSftd1LZCfmhktB" crossorigin="anonymous">
</head>
<body class="position-absolute h-100 w-100">
  <div id="app" class="d-flex flex-wrap align-items-center align-content-center p-5 mx-auto w-50 position-relative"></div>
  <script src="https://unpkg.com/jquery/dist/jquery.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/feather-icons/dist/feather.min.js"></script>
  <script src="main.js"></script>
</body>
</html>
As you can see, we have a very simple index page for our app with just one <div id="app"> element, which will serve as the mount point for all the dynamic content of the app. Notice that we have added a link to the Bootstrap CDN to get some default Bootstrap 4 styling for our app. We have also included jQuery for DOM manipulation and Ajax requests, and Feather icons for elegant SVG icons.
The Main Script
Now we are down to the last piece that powers our app: the main script. Create a new main.js file in the public directory of your app and add the following content to it:
/* public/main.js */
jQuery(function($) {
  let app = $('#app');

  let SYNTHESIS = null;
  let VOICES = null;

  let QUOTE_TEXT = null;
  let QUOTE_PERSON = null;

  let VOICE_SPEAKING = false;
  let VOICE_PAUSED = false;
  let VOICE_COMPLETE = false;

  let iconProps = {
    'stroke-width': 1,
    'width': 48,
    'height': 48,
    'class': 'text-secondary d-none',
    'style': 'cursor: pointer'
  };

  function iconSVG(icon) {}

  function showControl(control) {}

  function hideControl(control) {}

  function getVoices() {}

  function resetVoice() {}

  function fetchNewQuote() {}

  function renderQuote(quote) {}

  function renderVoiceControls(synthesis, voice) {}

  function updateVoiceControls() {}

  function initialize() {}

  initialize();
});
Here, we use jQuery to execute a function when the DOM is loaded. We get a reference to the #app element and initialize some variables. We also declare a couple of empty functions, which we will implement in the following sections. Finally, we call the initialize() function to initialize the application.

Notice the iconProps variable; it contains a couple of properties that will be used for rendering the Feather icons as SVG on the DOM. Let's go ahead and start implementing the functions.
Implementing the Basic Functions
Modify the public/main.js file to implement the following functions:
/* public/main.js */

// Gets the SVG markup for a Feather icon
function iconSVG(icon) {
  let props = $.extend(iconProps, { id: icon });
  return feather.icons[icon].toSvg(props);
}

// Shows an element
function showControl(control) {
  control.addClass('d-inline-block').removeClass('d-none');
}

// Hides an element
function hideControl(control) {
  control.addClass('d-none').removeClass('d-inline-block');
}

// Get the available voices, filter the list to have only English voices
function getVoices() {
  // Regex to match all English language tags e.g en, en-US, en-GB
  let langRegex = /^en(-[a-z]{2})?$/i;

  // Get the available voices and filter the list to only have English speakers
  VOICES = SYNTHESIS.getVoices()
    .filter(function (voice) { return langRegex.test(voice.lang) })
    .map(function (voice) {
      return { voice: voice, name: voice.name, lang: voice.lang.toUpperCase() }
    });
}

// Reset the voice variables to the defaults
function resetVoice() {
  VOICE_SPEAKING = false;
  VOICE_PAUSED = false;
  VOICE_COMPLETE = false;
}
These functions are pretty simple to understand, but I will lay emphasis on a few of them.

The iconSVG(icon) function takes a Feather icon name string as argument (e.g 'play-circle') and returns the SVG markup for the icon. Check here to see the complete list of available Feather icons. Also check the Feather documentation to learn more about its API.

The getVoices() function uses the SYNTHESIS object to fetch the list of all the available voices on the device. Then, it filters the list using a regular expression to get the voices of only English speakers.
Fetching and Rendering Quotes
Next, we will implement the functions for fetching and rendering quotes on the DOM. Modify the public/main.js file to implement the following functions:
/* public/main.js */

function fetchNewQuote() {
  // Clean up the #app element
  app.html('');

  // Reset the quote variables
  QUOTE_TEXT = null;
  QUOTE_PERSON = null;

  // Reset the voice variables
  resetVoice();

  // Pick a voice at random from the VOICES list
  let voice = (VOICES && VOICES.length > 0)
    ? VOICES[ Math.floor(Math.random() * VOICES.length) ]
    : null;

  // Fetch a quote from the API and render the quote and voice controls
  $.get('/api/quote', function (quote) {
    renderQuote(quote.data);
    SYNTHESIS && renderVoiceControls(SYNTHESIS, voice || null);
  });
}

function renderQuote(quote) {
  // Create some markup for the quote elements
  let quotePerson = $('<h1 id="quote-person" class="mb-2 w-100"></h1>');
  let quoteText = $('<div id="quote-text" class="h3 py-5 mb-4 w-100 font-weight-light text-secondary border-bottom border-gray"></div>');

  // Add the quote data to the markup
  quotePerson.html(quote.title);
  quoteText.html(quote.content);

  // Attach the quote elements to the DOM
  app.append(quotePerson);
  app.append(quoteText);

  // Update the quote variables with the new data
  QUOTE_TEXT = quoteText.text();
  QUOTE_PERSON = quotePerson.text();
}
Here, in the fetchNewQuote() function, we first reset the app element and the variables. Then, we pick a voice at random, using Math.random(), from the list of voices stored in the VOICES variable. Finally, we use $.get() to make an AJAX request to the /api/quote endpoint to fetch a random quote, and render the quote data to the view alongside the voice controls.

The renderQuote(quote) function receives a quote object as its argument and adds its contents to the DOM. Finally, it updates the quote variables: QUOTE_TEXT and QUOTE_PERSON.
Rendering the Voice Controls
As you may have noticed in the fetchNewQuote() function, we made a call to the renderVoiceControls() function. This function is responsible for rendering the controls for playing, pausing and stopping the voice (text-to-speech) output. It also renders the current voice being used and its language, if available.

Go ahead and make the following modifications to the public/main.js file to implement the renderVoiceControls() function:
/* public/main.js */

function renderVoiceControls(synthesis, voice) {
  let controlsPane = $('<div id="voice-controls-pane" class="d-flex flex-wrap w-100 align-items-center align-content-center justify-content-between"></div>');
  let voiceControls = $('<div id="voice-controls"></div>');

  // Create the SVG elements for the voice control buttons
  let playButton = $(iconSVG('play-circle'));
  let pauseButton = $(iconSVG('pause-circle'));
  let stopButton = $(iconSVG('stop-circle'));

  // Helper function to enable pause state for the voice output
  let paused = function () {
    VOICE_PAUSED = true;
    updateVoiceControls();
  };

  // Helper function to disable pause state for the voice output
  let resumed = function () {
    VOICE_PAUSED = false;
    updateVoiceControls();
  };

  // Click event handler for the play button
  playButton.on('click', function (evt) {});

  // Click event handler for the pause button
  pauseButton.on('click', function (evt) {});

  // Click event handler for the stop button
  stopButton.on('click', function (evt) {});

  // Add the voice controls to their parent element
  voiceControls.append(playButton);
  voiceControls.append(pauseButton);
  voiceControls.append(stopButton);

  // Add the voice controls parent to the controlsPane element
  controlsPane.append(voiceControls);

  // If voice is available, add the voice info element to the controlsPane
  if (voice) {
    let currentVoice = $('<div class="text-secondary font-weight-normal"><span class="text-dark font-weight-bold">' + voice.name + '</span> (' + voice.lang + ')</div>');
    controlsPane.append(currentVoice);
  }

  // Add the controlsPane to the DOM
  app.append(controlsPane);

  // Show the play button
  showControl(playButton);
}
Here, we create container elements for the voice controls and the controls pane. We use the iconSVG() function we created earlier to get the SVG markup for our control buttons and create the button elements. We then define the paused() and resumed() helper functions, which will be used while setting up the event handlers for the buttons.

Finally, we render the voice control buttons and the voice info (if available) to the DOM. Also notice that we show only the play button initially.
Setting up the Event Handlers
Next, we will implement the click event handlers for the voice control buttons we defined in the previous section. Set up the event handlers as shown in the following code snippet:
playButton.on('click', function (evt) {
  evt.preventDefault();

  if (VOICE_SPEAKING) {
    // If voice is paused, it is resumed when the playButton is clicked
    if (VOICE_PAUSED) synthesis.resume();
    return resumed();
  } else {
    // Create utterances for the quote and the person
    let quoteUtterance = new SpeechSynthesisUtterance(QUOTE_TEXT);
    let personUtterance = new SpeechSynthesisUtterance(QUOTE_PERSON);

    // Set the voice for the utterances if available
    if (voice) {
      quoteUtterance.voice = voice.voice;
      personUtterance.voice = voice.voice;
    }

    // Set event listeners for the quote utterance
    quoteUtterance.onpause = paused;
    quoteUtterance.onresume = resumed;
    quoteUtterance.onboundary = updateVoiceControls;

    // Set the listener to activate speaking state when the quote utterance starts
    quoteUtterance.onstart = function (evt) {
      VOICE_COMPLETE = false;
      VOICE_SPEAKING = true;
      updateVoiceControls();
    }

    // Set event listeners for the person utterance
    personUtterance.onpause = paused;
    personUtterance.onresume = resumed;
    personUtterance.onboundary = updateVoiceControls;

    // Refresh the app and fetch a new quote when the person utterance ends
    personUtterance.onend = fetchNewQuote;

    // Speak the utterances
    synthesis.speak(quoteUtterance);
    synthesis.speak(personUtterance);
  }
});

pauseButton.on('click', function (evt) {
  evt.preventDefault();

  // Pause the utterance if it is not in paused state
  if (VOICE_SPEAKING) synthesis.pause();
  return paused();
});

stopButton.on('click', function (evt) {
  evt.preventDefault();

  // Clear the utterances queue
  if (VOICE_SPEAKING) synthesis.cancel();
  resetVoice();

  // Set the complete status of the voice output
  VOICE_COMPLETE = true;

  updateVoiceControls();
});
Here, we set up the click event listeners for the voice control buttons. When the play button is clicked, it starts speaking the utterances, beginning with the quoteUtterance and then the personUtterance. However, if the voice output is in a paused state, it is resumed instead.

Notice that we set VOICE_SPEAKING to true in the onstart event handler of the quoteUtterance. Also notice that we refresh the app and fetch a new quote when the personUtterance ends.

The pause button pauses the voice output, while the stop button ends the voice output and removes all utterances from the queue, using the cancel() method of the SpeechSynthesis interface. Notice that we call the updateVoiceControls() function each time, to display the appropriate buttons.
Updating the Voice Controls
As you may have noticed, we have made a couple of calls and references to the updateVoiceControls() function in the previous code snippets. This function is responsible for updating the voice controls to display the appropriate controls based on the voice state variables.

Go ahead and make the following modifications to the public/main.js file to implement the updateVoiceControls() function:
/* public/main.js */

function updateVoiceControls() {
  // Get a reference to each control button
  let playButton = $('#play-circle');
  let pauseButton = $('#pause-circle');
  let stopButton = $('#stop-circle');

  if (VOICE_SPEAKING) {
    // Show the stop button if speaking is in progress
    showControl(stopButton);

    // Toggle the play and pause buttons based on paused state
    if (VOICE_PAUSED) {
      showControl(playButton);
      hideControl(pauseButton);
    } else {
      hideControl(playButton);
      showControl(pauseButton);
    }
  } else {
    // Show only the play button if no speaking is in progress
    showControl(playButton);
    hideControl(pauseButton);
    hideControl(stopButton);
  }
}
Here, we first get a reference to each of the voice control button elements. Then, we specify which voice control buttons should be visible at different states of the voice output.
Initialization Function
Finally, we will implement the initialize() function, which is responsible for initializing the application. Add the following code snippet to the public/main.js file to implement the initialize() function.
function initialize() {
  if ('speechSynthesis' in window) {
    SYNTHESIS = window.speechSynthesis;

    let timer = setInterval(function () {
      let voices = SYNTHESIS.getVoices();

      if (voices.length > 0) {
        getVoices();
        fetchNewQuote();
        clearInterval(timer);
      }
    }, 200);
  } else {
    let message = 'Text-to-speech not supported by your browser.';

    // Create the browser notice element
    let notice = $('<div class="w-100 py-4 bg-danger font-weight-bold text-white position-absolute text-center" style="bottom:0; z-index:10">' + message + '</div>');

    fetchNewQuote();
    console.log(message);

    // Display non-support info on DOM
    $(document.body).append(notice);
  }
}
First, we check if speechSynthesis is available on the window global object, and assign it to the SYNTHESIS variable if it is. Next, we set up an interval for fetching the list of available voices.

We are using an interval here because of a known asynchronous behaviour of SpeechSynthesis.getVoices(): it returns an empty array on the initial call, because the voices have not been loaded yet. The interval ensures that we have a list of voices before fetching a random quote and clearing the interval.
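As an alternative to polling, most browsers also fire a voiceschanged event on speechSynthesis once the voice list is ready, so the initialization could be sketched like this instead. Support for the event varies across browsers, which is why the polling approach above is the safer default:

// Event-based alternative to the polling interval (a sketch)
if ('onvoiceschanged' in window.speechSynthesis) {
  window.speechSynthesis.onvoiceschanged = function () {
    getVoices();
    fetchNewQuote();

    // Clear the handler so we only initialize once
    window.speechSynthesis.onvoiceschanged = null;
  };
}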
Now we have successfully completed our text-to-speech app. You can start the app by running npm start on your terminal. The app should start running on port 5000 if it is available. Visit http://localhost:5000 in your browser to see the app.
If you did everything correctly, your app should look like the following screenshot:
Conclusion
In this tutorial, we have learnt how we can use the Web Speech API to build a very simple text-to-speech app for the web. You can learn more about the Web Speech API and also find some helpful resources here.
Although we tried as much as possible to keep the app simple, there are a couple of interesting features you can still implement and experiment with, such as volume controls, voice pitch controls, speed/rate controls, the percentage of text uttered, etc.
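As a pointer for one of those, a speed/rate control could be as simple as reading a range input before creating each utterance. Here is a rough sketch; the #rate-control element is a hypothetical addition, not part of the app we built:

// Hypothetical: assumes an <input type="range" id="rate-control"
// min="0.5" max="2" step="0.25" value="1"> somewhere in the page
var rateControl = document.getElementById('rate-control');

function speakWithRate(synthesis, text) {
  var utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = parseFloat(rateControl.value);
  synthesis.speak(utterance);
}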
For the complete source code for this tutorial, check out the web-speech-demo repository on GitHub.