Below you'll find an index to our frequently asked questions list. If you have a question that you don't see on this list, feel free to Ask Us.

Speech Recognition
What is speech recognition, speech reco or ASR?
Speech recognition, also known as "speech reco" or ASR (automatic speech recognition), is the process of a computer converting spoken input into text. For example, you speak into a microphone attached to your computer, and you see what you said on your screen. Speech recognizers, also known as speech engines, can recognize many different languages.

How good is it, really?
Accuracy has improved in recent years to the point where it is ready for use in real applications. Proper consulting can help determine the best technology for a given application.

Why do I have to train some speech engines?
Training a speech engine creates "models" of your voice that tell the speech engine how you, personally, pronounce words. You do not have to train all speech engines, but to achieve higher accuracy rates you must typically train dictation engines. Dictation engines offer a very large list of words that you can say, and speech recognition on such large vocabularies is extremely hard. The more voice data the engine has from you, the greater the likelihood that the word you said will be matched correctly against its vocabulary.

Who has the best speech engine?
This question is often asked, and to be honest, it depends on your environment and application. EverSpeech can help you determine the most suitable engine for your needs through our thorough testing methodologies.

What is speech recognition over the telephone?
Speech recognition over the telephone is what happens when you call a phone number and, instead of a person answering, you hear a computer ask you a question. This type of phone answering system is called a speech telephony server. A speech telephony server is a computer that has a large number of telephone lines plugged into it. On top of being able to route your call and play options to you, it can recognize your speech and make the right decisions based on what you've said.
Because speech telephony servers answer calls from anyone, they are speaker-independent, grammar-based systems. You cannot say just anything into the phone; you must say something that the server expects. For example, if the server asks "Would you like stock information, news reports, or weather?", your response can be "stock information", "news reports", or "weather." You cannot say, for example, "Hi server, how about the weather today?"

What is speech recognition on the Web?
Speech recognition over the Web can mean a few things, but primarily, to us at EverSpeech, it means browsing the Web with your voice. You see a link on the screen, and instead of clicking it with your mouse, you say it. There are a handful of speech-enabled browsers on the market today. You can find links to them on our downloads page.

What is dictation-based versus grammar-based recognition?
Dictation-based speech recognition is the ability to speak free-form input to your computer. You say words and phrases in a free-form manner, and the speech engine puts the text of what you said on the screen. However, you cannot say absolutely everything. The speech engine must be able to convert what you've said into a word it knows about. Dictation vocabularies are extremely large so that they can support free-form speech. These engines also require training.

Grammar-based recognition, on the other hand, typically does not require training to achieve high accuracy rates and has a small to medium vocabulary. Because the vocabulary is small, it is easier for the engine to determine what you've said, as long as it falls within what the grammar provides. For example, if the words "apple" and "orange" are the only words in the grammar, you can only say "apple" or "orange" and not "carrot." Typically, grammar-based recognizers can tell you that you've said something that isn't in their vocabulary, but cannot repeat it back to you.
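The grammar examples above can be illustrated with a minimal sketch. It works on text that has already been transcribed rather than on audio, and it models a grammar as a fixed set of allowed phrases. The names `normalize`, `recognize`, `MENU_GRAMMAR` and `FRUIT_GRAMMAR` are hypothetical and do not come from any speech engine. A real grammar-based recognizer matches audio against the grammar, so treat this only as a picture of the "in grammar or rejected" behavior.

```python
from typing import FrozenSet, Optional

# Hypothetical grammars built from the examples in this FAQ.
MENU_GRAMMAR: FrozenSet[str] = frozenset(
    {"stock information", "news reports", "weather"}
)
FRUIT_GRAMMAR: FrozenSet[str] = frozenset({"apple", "orange"})


def normalize(utterance: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding punctuation."""
    return " ".join(utterance.lower().split()).strip(".,!?")


def recognize(utterance: str, grammar: FrozenSet[str]) -> Optional[str]:
    """Return the matched phrase, or None if the utterance is out of grammar.

    Assumption: the utterance is already text. A real engine would report
    a rejection here without being able to say what was actually spoken.
    """
    phrase = normalize(utterance)
    return phrase if phrase in grammar else None


if __name__ == "__main__":
    for said in ["Weather", "Hi server, how about the weather today?"]:
        result = recognize(said, MENU_GRAMMAR)
        print(said, "->", result if result is not None else "not in grammar")
```

Only the exact phrases in the grammar are accepted, which is why a full sentence that merely contains the word "weather" is rejected.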
How can I add speech to my application?
There are a number of ways to add speech recognition to your application, and EverSpeech is prepared to help you. Depending on the type of platform you're running on, you'll need to select an engine and make sure you have the appropriate hardware and SDKs for the language you're working with. If you have questions or would like help, e-mail us at info@EverSpeech.com.

Text To Speech

What is TTS or Text-To-Speech?
Text-To-Speech, also known as TTS, is the process of converting ASCII or Unicode text into speech that you can hear.

What are the benefits to using TTS?
The primary benefit of TTS is that it can be used for very dynamic content. The text that your application creates, whether it is database driven or created by some other method, can provide immediate feedback to the user. Standard wave files offer no such ability. When delivering speech over the internet or an intranet, TTS saves bandwidth by sending only the text of what needs to be said instead of large wave files. Wave files can consume hundreds of kilobytes, versus a few kilobytes for a long string that needs to be spoken. The downside to TTS is that you give up quality compared to using normal wave files.

What makes good TTS and what makes it bad?
Text to speech quality is usually measured in the following ways:
Concatenative TTS pieces together human-quality wave files to generate the speech for the TTS string. Synthesized TTS is the creation of speech by generating sounds through digitized speech formants. Concatenative TTS usually sounds more natural, while synthesized TTS sounds more "computer like." The tradeoff is that concatenative TTS systems can be large and require lots of drive space in order to run, while synthesized speech systems can usually fit into a few megabytes. There is a happy medium when the TTS has a limited vocabulary. For something like saying the time, a small-vocabulary concatenative TTS system would work quite well.

Can I hear some TTS?
Sure! There are examples on the web, and we have a small listing of these resources on our samples page.

Engines, APIs and Markups

What is SAPI?
SAPI is the acronym for Microsoft's Speech API (Application Programming Interface). It allows developers to voice-enable their applications for the Microsoft Windows platform. Currently, versions 4 and 5 exist, while 5.1 is due out in Fall 2001. You can find the link to the Microsoft SAPI pages through our related sites page.

What is JSAPI?
JSAPI is the acronym for Sun's Java™ Speech API (Application Programming Interface). It allows developers to voice-enable their applications on any platform that has an engine supporting JSAPI. You can find the link to the Sun JSAPI pages through our related sites page.

Are there any free dictation engines?
Yes! Microsoft provides a free dictation engine with their SDK. You can find the link through our resources page.

What is JSML?
JSML stands for Java™ Speech Markup Language. It allows you to add prosody elements to your TTS strings when you use the JSAPI interface. A typical JSML marked-up string might look something like this:

<JSML><PROS Vol="100"><PROS Rate="10"><PROS Pitch="-25">Hello World!</PROS></PROS></PROS></JSML>

The string "Hello World!" would be spoken with a volume of 100 (much louder), a rate of 10 (slightly faster) and a pitch of -25 (lower pitched). Different JSAPI implementations may differ on what those numbers mean.
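If your application builds TTS strings dynamically, a small helper can produce marked-up text in the same shape as the "Hello World!" example. The sketch below is only an illustration: the function name `jsml_prosody` is hypothetical, and the element and attribute names (`JSML`, `PROS`, `Vol`, `Rate`, `Pitch`) are copied from the example in this FAQ rather than checked against a particular JSML version, so verify them against the engine you use.

```python
from xml.sax.saxutils import escape, quoteattr


def jsml_prosody(text: str, vol: int, rate: int, pitch: int) -> str:
    """Wrap text in nested PROS elements, mirroring the FAQ example.

    Assumption: tag and attribute names follow the example string above;
    what the numbers mean depends on the JSAPI implementation.
    """
    return (
        "<JSML>"
        f"<PROS Vol={quoteattr(str(vol))}>"
        f"<PROS Rate={quoteattr(str(rate))}>"
        f"<PROS Pitch={quoteattr(str(pitch))}>"
        f"{escape(text)}"  # escape &, < and > so the markup stays well-formed
        "</PROS></PROS></PROS>"
        "</JSML>"
    )


if __name__ == "__main__":
    print(jsml_prosody("Hello World!", vol=100, rate=10, pitch=-25))
```

Escaping the text matters for database-driven content, where a stray "<" or "&" in the data would otherwise break the markup.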
© 2001 - 2008, EverSpeech, Inc.
Comments or questions to:
webmaster@EverSpeech.com