EverSpeech Home  
Speech Technology Consulting and Solutions
 


Frequently Asked Questions

Below you'll find an index to our frequently asked questions. If you have a question that you don't see on this list, feel free to Ask Us.

Speech Recognition

  • What is speech recognition, speech reco or ASR?
  • How good is it, really?
  • Why do I have to train some speech engines?
  • Who has the best speech engine?
  • What is speech recognition over the telephone?
  • What is speech recognition on the web?
  • What is dictation-based versus grammar-based recognition?
  • How can I add speech to my application?

Text To Speech

  • What is TTS or Text-To-Speech?
  • What are the benefits to using TTS?
  • What makes good TTS and what makes it bad?
  • What is concatenative TTS versus synthesized TTS?
  • Can I hear some TTS?

Engines, APIs and Markups

  • What is SAPI?
  • What is JSAPI?
  • What is JSML?

What is speech recognition, speech reco or ASR?
Speech recognition, also known as "speech reco" or ASR (automatic speech recognition), is the process of a computer converting spoken input into text. For example, you speak into a microphone attached to your computer, and what you said appears on your screen. Speech recognizers, also known as speech engines, can recognize many different languages.

How good is it, really?
Accuracy has improved in recent years to the point where speech recognition is ready for many production applications. Proper consulting can help determine the best technology for a given application.

Why do I have to train some speech engines?
Training a speech engine creates "models" of your voice that tell the speech engine how you, personally, pronounce words. You do not have to train all speech engines, but in order to achieve higher accuracy rates, you must typically train dictation engines. Dictation engines offer a very large list of words that you can say, and speech recognition over such large vocabularies is extremely hard. The more voice data the engine has from you, the more likely it is to match what you said to the correct word in its vocabulary.

Who has the best speech engine?
This question is often asked, and to be honest, it really depends on your environment and application. EverSpeech can help you determine the most suitable engine for your needs through our thorough testing methodologies.

What is speech recognition over the telephone?
Speech recognition over the telephone is when you have called a phone number, and instead of a person answering the phone, you hear the computer ask you a question. This type of phone answering system is called a speech telephony server. A speech telephony server is a computer that has a large number of telephone lines plugged into it, and on top of being able to route your call and play options to you, it can recognize your speech and make the right decisions based on what you've said.

Because speech telephony servers answer calls from anyone, they are speaker-independent, grammar-based systems. You cannot just say anything into the phone; you must say something that the server expects. For example, if the server asks "Would you like stock information, news reports, or weather?", your response can be either "stock information", "news reports", or "weather." You cannot say, for example, "Hi server, how about the weather today?"
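As a minimal sketch of this idea (in Python, with an invented helper function — real servers define grammars at the engine level, not in application code), the server's decision after the engine returns a recognized phrase might look like:

```python
def match_grammar(utterance, grammar):
    """Return the grammar phrase the utterance matches exactly, or None."""
    normalized = utterance.strip().lower()
    for phrase in grammar:
        if normalized == phrase.lower():
            return phrase
    return None  # out-of-grammar input: the server would re-prompt the caller

# The prompt's three expected responses form the active grammar.
menu = ["stock information", "news reports", "weather"]

print(match_grammar("Weather", menu))
print(match_grammar("Hi server, how about the weather today?", menu))
```

The second call falls outside the grammar, so the server gets back nothing usable and would typically re-prompt rather than guess.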

What is speech recognition on the Web?
Speech recognition over the Web can mean a few things, but primarily, to us at EverSpeech, it means browsing the Web with your voice. You see a link on the screen, and instead of clicking it with your mouse, you say it. This also includes radio buttons and other form elements.

What is dictation-based versus grammar-based recognition?
Dictation-based speech recognition is the ability to speak free-form input to your computer. You say words and phrases in a free-form manner, and the speech engine will put the text of what you said on the screen. However, you cannot say absolutely everything. The speech engine must be able to convert what you've said into a word it knows about. Dictation vocabularies are extremely large, so that they can support free-form speech. These engines also require training.

Grammar-based recognition, on the other hand, typically does not require training to achieve high accuracy rates and has a small to medium vocabulary. Because of the vocabulary size, it is easier for the engine to determine what you've said, as long as it is within the context of what the grammar provides. For example, if the words "apple" and "orange" are the only words in the grammar, you can only say "apple" or "orange" and not "carrot." Typically, grammar-based recognizers will be able to tell you that you've said something that isn't in their vocabulary, but not be able to repeat it back to you.

How can I add speech to my application?
There are a number of ways to add speech recognition to your application, and EverSpeech is prepared to help you. Depending on the platform you're running on, you'll need to select an engine and make sure you have the appropriate hardware and SDKs for the language you're working with. If you have questions or would like help, be sure to contact us.


Text To Speech

What is TTS or Text-To-Speech?
Text-To-Speech, also known as TTS, is the process of converting ASCII or Unicode text into speech that you can hear.

What are the benefits to using TTS?
The primary benefit to using TTS is that TTS can be used for very dynamic content. The text that your application creates, whether it is database driven or created by some other method, can provide immediate feedback to the user.

When delivering TTS over the Internet or an intranet, TTS saves bandwidth by sending only the text of what needs to be said instead of large wave files. A wave file can consume hundreds of kilobytes, versus a few kilobytes for even a long string of text to be spoken.
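To make that difference concrete, here is a back-of-the-envelope comparison in Python, assuming 8 kHz, 8-bit mono audio (typical telephony quality; the sample prompt text is our own invention):

```python
# Size of a ten-second audio prompt vs. the text that produces it.
SAMPLE_RATE = 8000      # samples per second (assumed telephony rate)
BYTES_PER_SAMPLE = 1    # 8-bit mono
seconds = 10

wave_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * seconds
text = "Thank you for calling EverSpeech. How may I direct your call?"
text_bytes = len(text.encode("ascii"))

print(wave_bytes)   # 80,000 bytes -- roughly 78 KB for the audio
print(text_bytes)   # a few dozen bytes for the text
```

Even at this low audio quality, the text is more than a thousand times smaller than the wave file it replaces.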

The downside to TTS is that you give up quality compared to using prerecorded audio.

What makes good TTS and what makes it bad?
Text to speech quality is usually measured in the following ways:

  • Intelligibility - How well can you understand what was said?
  • Naturalness - How much does it sound like a human? Is it annoying to listen to over longer periods of time? How natural are pauses, sentence and paragraph transitions?
  • Text Preprocessing - How well does it convert abbreviations and numbers into normal speech? E.g., is "Mr." read as "M R period" or as "Mister"? How well does it read things like phone numbers? E.g., is 206-555-1212 read as "two oh six, five five five, one two one two" or as "two hundred six hyphen five hundred fifty-five hyphen one thousand two hundred twelve"?

All of these things factor into what you might call good TTS. Bad TTS falls short on one or more of them.
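As an illustration of the text-preprocessing step, here is a Python sketch applying two hypothetical rules from the examples above: expanding "Mr." and reading a phone number digit by digit. (Real TTS preprocessors handle far more cases and are context-sensitive.)

```python
import re

DIGIT_NAMES = {"0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def preprocess(text):
    """Tiny text-preprocessing pass: expand 'Mr.' and spell out phone numbers."""
    text = text.replace("Mr.", "Mister")

    def phone_to_words(match):
        # "206-555-1212" -> "two oh six, five five five, one two one two"
        return ", ".join(" ".join(DIGIT_NAMES[d] for d in part)
                         for part in match.group(0).split("-"))

    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", phone_to_words, text)

print(preprocess("Mr. Smith: 206-555-1212"))
# Mister Smith: two oh six, five five five, one two one two
```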

What is concatenative TTS versus synthesized TTS?
Concatenative TTS stitches together segments of prerecorded human speech (wave files) to generate the speech for the TTS string. Synthesized TTS creates speech by generating sounds from digitized speech formants. Concatenative TTS usually sounds more natural, while synthesized TTS sounds more "computer like." The tradeoff is that concatenative TTS systems can be large and require lots of drive space to run, while synthesized speech systems can usually fit into a few megabytes. There are happy mediums in cases where the TTS vocabulary is limited: for a task like speaking the time, a small-vocabulary concatenative TTS system works quite well.
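For example, a small-vocabulary concatenative system for speaking the time only needs a handful of prerecorded prompts. This Python sketch (the file names are hypothetical) just selects which prompts to play back to back:

```python
def time_prompts(hour, minute):
    """Pick the prerecorded prompt files to play for a given time."""
    prompts = ["time_is.wav", f"hour_{hour}.wav"]
    if minute == 0:
        prompts.append("oclock.wav")        # "three o'clock"
    elif minute < 10:
        prompts += ["oh.wav", f"num_{minute}.wav"]  # "three oh five"
    else:
        prompts.append(f"num_{minute}.wav")          # "three forty-five"
    return prompts

print(time_prompts(3, 5))
# ['time_is.wav', 'hour_3.wav', 'oh.wav', 'num_5.wav']
```

With roughly 12 hour prompts and 60 number prompts, the whole vocabulary stays tiny while the output is fully natural recorded speech.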

Can I hear some TTS?
Sure! There are examples on the web, and we have a small listing of these resources on our samples page.


Engines, APIs and Markups

What is SAPI?
SAPI is the acronym for Microsoft's Speech API (Application Programming Interface). It allows developers to voice enable their applications for the Microsoft Windows platform. You can find the link to the Microsoft SAPI pages through our related sites page.

What is JSAPI?
JSAPI is the Java™ Speech API (Application Programming Interface). It allows developers to voice enable their applications on any platform that has an engine supporting JSAPI. You can find the link to the JSAPI documentation through our related sites page.

More recently, JSAPI2 was released through the Java Community Process (JCP) as JSR 113. This new version of the Java Speech API works on Java ME (Micro Edition). You can find a link to JSAPI2 through our related sites page.

What is JSML?
JSML stands for Java™ Speech Markup Language. It allows you to add prosody elements to your TTS strings when you use the JSAPI interface. A typical JSML marked-up string might look something like this:

  <JSML><PROS Vol="100"><PROS Rate="10"><PROS Pitch="-25">Hello World!</PROS></PROS></PROS></JSML>
  

The string "Hello World!" would be spoken with a volume of 100 (much louder), a rate of 10 (slightly faster) and a pitch of -25 (lower pitched). Different JSAPI implementations may differ on what those numbers mean.
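If you are generating such strings programmatically, a small helper keeps the markup consistent. This Python sketch simply mirrors the sample markup above (the attribute names come from that sample; what the numbers mean still depends on the JSAPI implementation):

```python
def jsml_pros(text, vol, rate, pitch):
    """Wrap text in nested JSML PROS elements, as in the sample above."""
    return (f'<JSML><PROS Vol="{vol}"><PROS Rate="{rate}">'
            f'<PROS Pitch="{pitch}">{text}</PROS></PROS></PROS></JSML>')

print(jsml_pros("Hello World!", 100, 10, -25))
```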

© 2001 - 2021, EverSpeech, Inc.
Comments or questions to: webmaster@EverSpeech.com