US20070016420A1 - Dictionary lookup for mobile devices using spelling recognition - Google Patents

Dictionary lookup for mobile devices using spelling recognition

Info

Publication number
US20070016420A1
Authority
US
United States
Prior art keywords
letters
user
list
speech input
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/176,154
Inventor
Ophir Azulai
Ron Hoory
Zohar Sivan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US 11/176,154
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, assignment of assignors interest (see document for details). Assignors: AZULAI, OPHIR; HOORY, RON; SIVAN, ZOHAR
Priority to EP06763137A (EP1905001A1)
Priority to CNA2006800245515A (CN101218625A)
Priority to CA002613154A (CA2613154A1)
Priority to PCT/EP2006/062284 (WO2007006596A1)
Priority to BRPI0613699-0A (BRPI0613699A2)
Publication of US20070016420A1
Legal status: Abandoned

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules

Definitions

  • the present invention relates generally to speech recognition systems, and particularly to methods and systems for querying an electronic dictionary using spoken input.
  • a dictionary may comprise, for example, a thesaurus or lexicon that provides definitions of words or phrases.
  • bilingual or multilingual dictionaries provide translation of words from one language to another.
  • Ectaco, Inc. (Long Island City, N.Y.) offers a number of handheld electronic dictionaries and translators.
  • Other applications use speech recognition methods, in which the user vocally pronounces the query word.
  • Ectaco, Inc. offers a multilingual translator called “UT-103 Universal Translator” that supports voice input. Additional details regarding this product can be found at www.universal-translator.net.
  • Some dictionary applications use Optical Character Recognition (OCR) methods for entering queries.
  • data entry methods are prone to errors. Therefore, some applications use methods for detecting errors or reducing the possibility of erroneous data entry.
  • One way of reducing the probability of error is using two or more different data entry methods for the same word. This approach is sometimes referred to as “multimodal” data entry.
  • some speech recognition applications use alphanumeric data entry from a telephone keypad. Such a technique is described by Parthasarathy in “Experiments in Keypad-Aided Spelling Recognition,” The 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Quebec, Canada, May, 2004. The author describes several schemes for augmenting speech input with input from a telephone keypad in a call-center application.
  • Another spelling-based application is described in U.S. Pat. No. 5,995,928.
  • the inventors describe a speech recognition system capable of recognizing a word based on a continuous spelling of the word by a user.
  • the system continuously outputs an updated string of hypothesized letters, based on the letters uttered by the user.
  • the system compares each string of hypothesized letters to a vocabulary list of words and returns a best match for the string.
  • U.S. Pat. No. 5,027,406 describes a method for creating word models in a natural language dictation system. After the user dictates a word, the system displays a list of the words in the active vocabulary which best match the spoken word. By keyboard or voice command, the user may choose the correct word from the list or may choose to edit a similar word if the correct word is not on the list. Alternatively, the user may type or speak the initial letters of the word.
  • a method for querying an electronic dictionary using letters of an alphabet enunciated by a user includes accepting a speech input from the user, the speech input including a sequence of spelled letters enunciated by the user that spell a query word.
  • the speech input is analyzed to determine one or more sequences of the letters that approximate the sequence of spelled letters.
  • the one or more sequences of the letters are post-processed so as to produce a plurality of recognized words approximating the query word.
  • the electronic dictionary is queried with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries. A list of results including the plurality of recognized words and the respective plurality of dictionary entries is presented to the user.
  • analyzing the speech input includes applying at least one of an acoustic model and a language model to the speech input. Additionally or alternatively, applying the language model includes representing at least part of the dictionary in terms of a finite state grammar (FSG). Further additionally or alternatively, applying the language model includes assigning probabilities to the sequences of the letters based on a probabilistic language model.
  • post-processing the sequences includes defining two or more letter classes including subsets of the letters in the alphabet that have similar sounds, and constructing sequences of the letters by substituting at least one of the letters belonging to the same letter class as at least one of the letters of the query word, so as to produce the plurality of recognized words.
  • querying the dictionary includes accepting a user command including at least one of a typed input and a voice command, and modifying at least one letter of one of the recognized words based on the user command.
  • presenting the list of results includes assigning likelihood scores to the recognized words on the list and sorting the list based on the likelihood scores. Additionally or alternatively, presenting the list of results includes converting at least part of the list to a speech output, and playing the speech output to the user. Further additionally or alternatively, presenting the list of results includes accepting a user command including at least one of a typed input and a voice command, and scrolling through the list responsively to the user command.
  • accepting the speech input includes receiving the speech input via an audio interface associated with a mobile device including at least one of a mobile telephone, a portable computer and a personal digital assistant (PDA), and presenting the list includes providing the list via an output of the mobile device.
  • accepting the speech input includes sending the speech input from the mobile device to a remote server that serves one or more users, and presenting the list of results includes transmitting the list of results from the remote server to the mobile device for presentation to the user.
  • Apparatus and a computer software product for querying an electronic dictionary are also provided.
  • a system for querying an electronic dictionary using letters of an alphabet enunciated by a user includes a remote server including a memory, which is coupled to store the electronic dictionary.
  • the system includes one or more spelling processors, which are coupled to accept a speech input from the user, the speech input including a sequence of spelled letters enunciated by the user that spell a query word, to analyze the speech input so as to determine one or more sequences of the letters approximating the sequence of spelled letters, to post-process the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word, to query the electronic dictionary stored in the memory with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries, and to generate a list of results including the plurality of recognized words and the respective plurality of dictionary entries.
  • the system also includes a user device, including a client processor, which is coupled to receive the speech input from the user and to send the speech input to the remote server, and which is coupled to receive, responsively to the speech input, the list of results.
  • the user device includes an output device, which is coupled to present the list of results generated by the spelling processor to the user.
  • FIG. 1 is a schematic, pictorial illustration of a system for querying an electronic dictionary, in accordance with an embodiment of the present invention
  • FIG. 2A is a block diagram that schematically illustrates a mobile device, in accordance with an embodiment of the present invention
  • FIG. 2B is a block diagram that schematically illustrates a spelling processor, in accordance with an embodiment of the present invention
  • FIG. 3 is a block diagram that schematically illustrates a system for querying an electronic dictionary, in accordance with another embodiment of the present invention.
  • FIG. 4 is a block diagram that schematically illustrates a system for querying an electronic dictionary, in accordance with yet another embodiment of the present invention.
  • FIG. 5 is a flow chart that schematically illustrates a method for querying an electronic dictionary, in accordance with an embodiment of the present invention.
  • Embodiments of the present invention provide improved methods and systems that allow users of mobile devices to query an electronic dictionary using spelling recognition. Instead of pronouncing the query word as a whole, as implemented in conventional speech recognition systems, the user vocally spells the query word letter by letter.
  • a spelling processor in the mobile device captures and processes the spelled word.
  • a list of possible recognized words is produced, according to predefined models.
  • a list of results, comprising the recognized words along with the corresponding dictionary entries, is presented to the user. The user can then scroll through the results and identify the correct word and dictionary entry.
  • Embodiments of the present invention provide a method and a system that are particularly suitable for users who are not familiar with the language in question, such as tourists or foreigners. Such users may not know the correct pronunciation of words but can easily spell them out. Users with speech impairments, whose pronunciation of words may be difficult to understand, may also benefit from the disclosed methods.
  • reliable letter-by-letter spelling recognition is a non-trivial task that introduces other types of error mechanisms, as will be explained below.
  • the disclosed methods address these error mechanisms by defining appropriate models that determine the list of alternative recognized words.
  • the list is typically sorted by relevance, using relevance measures that are based on the same error mechanisms and/or the model being used.
  • Some embodiments of the present invention also provide a quick and simple user interface for users of mobile devices.
  • the user interface combines spelling recognition with keypad functions and/or voice commands. This multimodal functionality enables quick and smooth operation of the dictionary application by both ordinary users and users with special needs.
  • the disclosed user interface enables the user to query the dictionary without having to move his or her eyes from the written text.
  • the user interface enables querying the dictionary without moving the user's fingers away from the page.
  • the result list is converted to speech and played to the user using a text-to-speech (TTS) generator.
  • This implementation is also particularly suitable for blind users and for users who operate the system while driving or carrying out other tasks that require continuous visual attention.
  • the dictionary query system is implemented in a remote server configuration using distributed speech recognition (DSR).
  • FIG. 1 is a schematic, pictorial illustration of a system for querying an electronic dictionary, in accordance with an embodiment of the present invention.
  • a user 22 communicates using speech 24 with a mobile device 26 , for querying an electronic dictionary.
  • the mobile device may comprise a personal digital assistant (PDA), such as one of the palmOneTM PDA products (see www.palmone.com).
  • the mobile device may alternatively comprise a laptop computer, a mobile phone or another device with suitable computational and I/O capabilities.
  • Although the embodiments described hereinbelow relate to mobile devices by way of illustration, the principles of the present invention may also be applied in non-mobile computing devices, such as desktop computers.
  • the mobile device typically comprises a microphone 27 for accepting speech from the user and a keypad 28 for accepting user input.
  • a display 30 presents textual information to the user.
  • mobile device 26 also comprises a speaker 31 for playing synthesized speech to the user, as will be explained below.
  • the electronic dictionary application may comprise a thesaurus or a lexicon, in which case querying the dictionary means retrieving a definition of a word.
  • the dictionary may comprise a bilingual or multilingual dictionary, in which case querying the dictionary means retrieving a translation of the word to another language.
  • Additional dictionary applications comprise dictionaries that are specific to particular professional disciplines and phrasebooks that translate phrases from one language to another. Other dictionary applications will be apparent to those skilled in the art, and can be implemented using the methods described hereinbelow.
  • the term “dictionary” pertains to any such dictionary application.
  • the term “dictionary entry” refers to the definition or the translation of a word or phrase, as relevant to the particular application.
  • FIG. 2A is a block diagram that schematically illustrates mobile device 26 , in accordance with an embodiment of the present invention.
  • Mobile device 26 comprises an input device, such as a microphone 27 , that accepts speech input from the user.
  • the speech comprises a query word or phrase, spelled letter-by-letter by the user.
  • a sampler 32 samples the speech input and produces digitized speech.
  • a spelling processor 34 processes the digitized speech and produces a list of possible recognized words.
  • the spelling processor is typically implemented as a software process that runs on a central processing unit (CPU) of the mobile device.
  • the spelling processor queries an electronic dictionary 36 , which is stored in a memory of the mobile device, and retrieves dictionary entries corresponding to the recognized words.
  • the spelling processor typically displays the list of results using an output device such as display 30 .
  • the output device comprises a text to speech (TTS) generator 38 that converts the list of results, or parts of it, to speech and plays it to the user.
  • FIG. 2B is a block diagram that schematically shows details of spelling processor 34 , in accordance with an embodiment of the present invention.
  • the spelling recognition process carried out by processor 34 can be divided into two consecutive steps.
  • a speech recognizer 39 in processor 34 accepts the digitized speech.
  • the speech recognizer applies a suitable model to the digitized speech so as to produce one or more letter sequences, each of which represents a possibly-recognized word.
  • Each letter sequence is assigned a probability value indicating the probability of the particular letter sequence representing the word spelled by the user.
  • speech recognizer 39 queries dictionary 36 as part of the recognition process.
  • the model used by recognizer 39 already contains at least part of the dictionary.
  • a post processor 41 in spelling processor 34 accepts the letter sequences and associated probabilities from recognizer 39.
  • the post processor queries dictionary 36 with the recognized words and produces an ordered list of results.
  • the list comprises the recognized words and the associated dictionary definitions of these words.
  • the configuration of spelling processor 34 shown in FIG. 2B is typically used in both the local configuration shown in FIG. 2A above and in the remote server configuration shown in FIGS. 3 and 4 below.
  • speech recognizer 39 and post processor 41 are implemented as two software processes managed by spelling processor 34 .
  • FIG. 3 is a block diagram that schematically illustrates a remote server system for querying electronic dictionary 36 , in accordance with another embodiment of the present invention.
  • In some cases, it is preferable to implement the dictionary application using a remote server configuration.
  • the electronic dictionary is located in a single central location. Multiple users can query the dictionary using distributed speech recognition (DSR) techniques, as are known in the art.
  • a centralized dictionary configuration is sometimes preferred because it enables the use of larger dictionaries.
  • Large dictionaries, or dictionaries holding large and detailed entries, may significantly exceed the memory storage capabilities of typical mobile devices. Additionally, maintaining and updating information in a centralized dictionary data structure is often easier than managing multiple dictionaries distributed between multiple users.
  • The configuration shown in FIG. 3 comprises an application server 40.
  • Spelling processor 34 and dictionary 36 are located in server 40 .
  • Although FIG. 3 shows a single spelling processor, typical implementations of server 40 comprise multiple spelling processors 34 that interact with multiple mobile devices 26.
  • the multiple spelling processors are typically implemented as parallel software instances or threads running on one or more CPUs of server 40 .
  • Dictionary 36 can be implemented using any suitable data structure, such as a database, suitable for multi-user access.
  • mobile device 26 comprises a client processor 42 that accepts the speech input from the user via microphone 27 and sampler 32 (not shown in this figure).
  • Processor 42 compresses the captured and digitized speech and transmits it, typically in a compact form, such as a stream of compressed feature vectors, to spelling processor 34 in server 40 .
  • the spelling processor decompresses the feature vectors, processes the decompressed speech and queries dictionary 36 , according to the method of FIG. 5 below.
  • the processing performed by spelling processor 34 in the remote server configuration is similar to that performed in the local configuration shown in FIG. 2A above.
  • the spelling processor sends the list of recognized words and the corresponding dictionary entries to client processor 42 in the mobile device.
  • the client processor presents the results to the user using display 30 and/or TTS generator 38 .
  • the client processor handles the user interface, which allows the user to scroll and edit the list of results using keypad 28 and/or voice commands. Again, the user interface is explained in detail in the description of FIG. 5 below.
  • Mobile device 26 and server 40 are linked by a communication channel.
  • the channel is used to send compressed speech to the server, send result lists to the mobile device and exchange miscellaneous control information.
  • the communication channel may comprise any suitable medium, such as an Internet connection, a telephone line, a wireless data network, a cellular network, or a combination of several such media.
  • FIG. 4 is a block diagram that schematically illustrates a remote server system for querying electronic dictionary 36 , in accordance with yet another embodiment of the present invention.
  • the configuration of FIG. 4 is similar to the configuration of FIG. 3 above, except that in the configuration of FIG. 4 the text-to-speech conversion function is also split between the server and the mobile device.
  • Server 40 here comprises TTS generator 38 , which in this embodiment accepts the list of results from the spelling processor and converts it (or parts of it) to a stream of compressed speech feature vectors.
  • the compressed speech is then sent to the mobile device over the communication channel.
  • a speech decoder in the mobile device decompresses and decodes the received feature vectors and plays the decoded speech to the user.
  • spelling processor 34 and client processor 42 comprise general-purpose computer processors, which are programmed in software to carry out the functions described herein.
  • the software may be downloaded to the computers in electronic form, over a network, for example, or it may alternatively be supplied to the computers on tangible media, such as CD-ROM.
  • the spelling processor may be a standalone unit, or it may alternatively be integrated with other computing functions of mobile device 26 or server 40 . Additionally or alternatively, at least some of the functions of the spelling processor may be implemented using dedicated hardware.
  • Client processor 42 may also be integrated with other computing functions of mobile device 26 .
  • FIG. 5 is a flow chart that schematically illustrates a method for querying electronic dictionary 36 , in accordance with an embodiment of the present invention.
  • the method begins with user 22 entering a query word or phrase, at a word entry step 50 .
  • the user first initiates the dictionary application running on mobile device 26 .
  • the user then starts the speech acquisition process, for example by clicking a button on keypad 28 .
  • the user spells the query word vocally, letter by letter. After spelling the entire word the user stops the speech acquisition process, for example using keypad 28 .
  • the mobile device captures the speech comprising the sequence of spelled letters using microphone 27 .
  • Sampler 32 digitizes the captured speech.
  • the user can start and stop the speech acquisition process using predetermined voice commands.
  • client processor 42 transmits data, typically in the form of a stream of compressed feature vectors, that represent the processed speech to the spelling processor, at a speech transmission step 52 .
  • the spelling processor in such a configuration is part of server 40 . If the method is implemented locally in the mobile device, as shown in FIG. 2A above, step 52 is omitted.
  • Speech recognizer 39 and post processor 41 in spelling processor 34 process the digitized speech, at a speech processing step 54 .
  • Speech recognizer 39 analyzes the digitized speech, typically segmenting the speech into phonetic components that represent individual letters of the query word.
  • Various methods are known in the art for identifying a phonetic sound within a limited vocabulary. Any suitable method can be used by the speech recognizer to identify the spelled letters in the captured speech. Most methods do not require user-specific training (sometimes referred to as “user enrollment”) because of the small vocabulary and the small user-dependent differences in pronunciation of spelled letters.
  • speech recognizer 39 extracts additional information from the digitized speech, to be used in the recognition process as will be explained below.
  • the speech recognizer uses a suitable acoustic model for assigning a likelihood score to each identified spelled letter.
  • Each likelihood score quantifies the likelihood that the particular letter was indeed uttered by the user.
  • the speech recognizer uses a language model, which may be based in whole or in part on the dictionary being used. Using the language model, the speech recognizer generates one or more letter sequences that represent possibly-recognized words in response to the captured input speech.
  • the language model comprises a graph representing the dictionary, which is commonly referred to as a Finite State Grammar (FSG).
  • Finite state grammars (sometimes also referred to as finite-state networks) are described, for example, by Rabiner and Juang in “Fundamentals of Speech Recognition,” Prentice Hall, April 1993, pages 414-416.
  • the nodes of the FSG represent letters of the alphabet. (In typical implementations, each letter of the alphabet appears several times in the graph.) Arcs between nodes represent adjacent letters in legitimate words. In other words, each word in the dictionary is represented as a trajectory or path through the graph.
  • only part of the dictionary is represented as a FSG.
  • FSG-based models are used for small to medium size vocabularies and dictionaries, typically up to several thousands of words.
  • When using the FSG, the speech recognizer typically compares the sequence of spelled letters of the digitized speech to the different trajectories through the FSG. In some embodiments, the speech recognizer assigns likelihood scores to the trajectories. The speech recognizer produces the letter sequences and the associated likelihood scores. (A minimal FSG-matching sketch appears after this list.)
  • the language model comprises a probabilistic language model, which assigns probabilities to different letter sequences in the vocabulary.
  • Probabilistic language models are described, for example, by Young in “A Review of Large-Vocabulary Continuous-Speech Recognition,” IEEE Signal Processing Magazine, September 1996, pages 45-57. Probabilistic language models are typically used when the size of the dictionary is very large, making it difficult to represent every word in the model explicitly.
  • speech recognizer 39 produces one or more letter sequences that resemble the sequence of spelled letters, with associated likelihood scores in accordance with the probabilistic language model.
  • the speech recognizer represents the different letter sequences produced by the probabilistic language model in terms of a lattice.
  • the lattice is a graph comprising the possible sequences of letters, with each sequence assigned a respective likelihood score, according to the probabilistic language model.
  • speech recognizer 39 provides to post processor 41 one or more letter sequences with associated likelihood scores, as described above.
  • the letter sequences provided to post processor 41 are already legitimate words that appear in dictionary 36 .
  • post processor 41 selects a subset of the letter sequences in the lattice, having the highest likelihood scores. Since not all of the possible letter sequences in the lattice necessarily correspond to legitimate dictionary words, post processor 41 typically queries dictionary 36 with the selected letter sequences, and discards words that do not appear in the dictionary. (A sketch of this pruning step appears after this list.)
  • In some embodiments in which speech recognizer 39 uses a probabilistic language model, the speech recognizer outputs only the letter sequence having the maximum likelihood score (referred to hereinbelow as the highest ranking sequence).
  • Post processor 41 constructs a list of alternative letter sequences based on the highest ranking sequence by using letter classes, as explained below.
  • Spelled letters can be classified into letter classes based on their pronunciation characteristics.
  • some spelled letters may be mistaken for one another.
  • the spelled letters /b/, /c/, /d/, /e/, /g/, /p/, /t/, /v/ and /z/ all belong to the same letter class (referred to as the “e-class”).
  • These letters all have similar vowel sounds when spelled.
  • the speech recognizer may erroneously mistake one such letter for another.
  • the speech recognizer may erroneously interchange letters belonging to the “a-class” (/a/, /h/, /j/, /k/), the “i-class” (/i/, /y/) and the “u-class” (/u/, /q/).
  • the probabilities of mistaking one letter for another are typically represented as a matrix, which is called a “confusion matrix.”
  • the probability of interchanging letters belonging to different letter classes is assumed to be small.
  • the post processor constructs the list of alternative letter sequences by replacing each letter of the highest ranking sequence with similarly-sounding letters, according to the letter classes described above. (A sketch of this letter-class substitution appears after this list.)
  • the post processor typically ranks the list, for example by computing likelihood scores based on the confusion matrix.
  • the alternative letter sequences may also comprise a different number of letters, or letters from other letter classes.
  • the query word “cat” can also be recognized as “beat.”
  • the spelling processor may request the user's assistance in determining which one of the recognized letter sequences, or recognized words, is the original query word entered by the user.
  • the post processor prepares a list of results, at a list preparation step 56 .
  • the post processor produces the list of results in accordance with one of the language models described above.
  • the post processor sorts the list of results in descending order of relevance. The relevance score of a particular recognized word is typically determined in accordance with the language model being used, as described above. Alternatively, the list can be sorted alphabetically, or using any other suitable criterion. (A sketch of result-list preparation appears after this list.)
  • If the disclosed method is implemented using a remote server configuration, as shown in FIGS. 3 and 4 above, spelling processor 34 in server 40 transmits the list of results to client processor 42, at a result transmission step 58. If the method is implemented locally in the mobile device, as shown in FIG. 2A above, step 58 is omitted.
  • the spelling processor presents the list of results to the user, at a presentation step 60 .
  • the list of recognized words is displayed as text on display 30 of the mobile device.
  • the user may scroll through the list using keypad 28 until he or she finds the intended query word and the corresponding dictionary entry. Alternatively, only the first word on the list is displayed together with its dictionary entry. If the first recognized word on the result list is incorrect, the user may scroll down and select the next word. Any other suitable presentation method can be used, depending upon the particular application and the capabilities of keypad 28 and display 30 of the mobile device. Additionally, the user can also edit the displayed recognized words at any time using the keypad, so as to enter part or all of the intended query word.
  • the list of results is converted to speech using TTS generator 38 and played to the user through speaker 31 .
  • the user can indicate, either using the keypad or by uttering a voice command, when the correct word is being played. After selecting the correct word, the TTS generator plays the corresponding dictionary entry.
  • Although the disclosed methods mainly address spelling-based dictionary lookup in mobile devices, the same methods can be used in a variety of additional applications.
  • the disclosed methods can also be used in desktop or mainframe computer applications that require high quality word recognition.
  • Such applications include, for example, directory assistance services and name dialing applications.
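The sketches below illustrate, in simplified form, some of the mechanisms described in the items above; all function names, scores and data in them are assumptions rather than material from the disclosure. This first sketch shows a small dictionary represented as a letter graph (a toy finite state grammar) and a best-trajectory search over hypothetical per-letter recognition scores; for simplicity it only matches words with the same number of letters as the spelled input.

```python
# A toy finite state grammar: a trie in which every root-to-terminal path
# spells a dictionary word. Per-letter scores from a hypothetical recognizer
# are summed along each trajectory.

def build_fsg(words):
    """Build a letter trie; '#' marks the end of a legal word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["#"] = True
    return root

def best_trajectories(fsg, letter_scores, top_n=3):
    """letter_scores: one {letter: log-likelihood} dict per spelled-letter position."""
    results = []

    def walk(node, pos, prefix, score):
        if pos == len(letter_scores):
            if "#" in node:
                results.append((score, prefix))
            return
        for letter, child in node.items():
            if letter == "#":
                continue
            walk(child, pos + 1, prefix + letter,
                 score + letter_scores[pos].get(letter, -10.0))

    walk(fsg, 0, "", 0.0)
    return sorted(results, reverse=True)[:top_n]

if __name__ == "__main__":
    fsg = build_fsg(["cat", "bat", "cab", "beat"])
    # Hypothetical scores for a user who spelled the letters "c", "a", "t".
    scores = [{"c": -0.2, "b": -0.4}, {"a": -0.1}, {"t": -0.3, "b": -2.0}]
    print(best_trajectories(fsg, scores))  # the best trajectory should be "cat"
```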
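A second sketch, assuming the recognizer's lattice has already been flattened into scored letter sequences: the post-processing step described above keeps the highest-scoring candidates and discards sequences that are not legitimate dictionary words. The sequences, scores and dictionary here are illustrative.

```python
# Prune a flattened lattice of scored letter sequences to the top candidates,
# then keep only sequences that are legitimate dictionary words.

def prune_lattice(scored_sequences, dictionary, top_n=5):
    """scored_sequences: list of (letter_sequence, likelihood_score) pairs."""
    ranked = sorted(scored_sequences, key=lambda item: item[1], reverse=True)
    return [(word, score) for word, score in ranked[:top_n] if word in dictionary]

if __name__ == "__main__":
    dictionary = {"cat": "a small domesticated feline", "beat": "to strike repeatedly"}
    lattice = [("cat", -0.6), ("kat", -0.9), ("beat", -1.4), ("bap", -2.1)]
    print(prune_lattice(lattice, dictionary))  # [('cat', -0.6), ('beat', -1.4)]
```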
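A third sketch of letter-class substitution: the highest ranking sequence is expanded by swapping in similar-sounding letters from the e-, a-, i- and u-classes listed above, and the alternatives that are dictionary words are ranked with simple confusion probabilities. The probability values are invented for illustration rather than taken from a measured confusion matrix.

```python
from itertools import product

# Expand the highest ranking sequence by substituting similar-sounding letters
# from the same letter class, then rank the alternatives that are dictionary
# words. The confusion probabilities are invented, illustrative values.

LETTER_CLASSES = [
    set("bcdegptvz"),  # "e-class"
    set("ahjk"),       # "a-class"
    set("iy"),         # "i-class"
    set("uq"),         # "u-class"
]

def class_of(letter):
    for cls in LETTER_CLASSES:
        if letter in cls:
            return cls
    return {letter}

def alternatives(best_sequence, dictionary, keep_prob=0.7):
    """Return dictionary words reachable by letter-class substitution, ranked."""
    per_letter = []
    for letter in best_sequence:
        others = class_of(letter) - {letter}
        # The recognized letter keeps most of the probability mass; the rest
        # is spread evenly over its confusable class mates.
        share = (1.0 - keep_prob) / len(others) if others else 0.0
        per_letter.append([(letter, keep_prob)] + [(other, share) for other in others])

    scored = {}
    for combo in product(*per_letter):
        word = "".join(letter for letter, _ in combo)
        if word in dictionary:
            score = 1.0
            for _, prob in combo:
                score *= prob
            scored[word] = max(score, scored.get(word, 0.0))
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    print(alternatives("cat", {"cat", "bat", "pat", "hat"}))
```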
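A final sketch for this list, pairing recognized words with their dictionary entries and sorting the result list in descending order of relevance before presentation; the data structures are assumptions, not the described system's.

```python
# Pair each recognized word with its dictionary entry and sort the result
# list in descending order of relevance before presenting it to the user.

def prepare_results(recognized, dictionary):
    """recognized: list of (word, relevance_score); dictionary: word -> entry."""
    results = [(word, score, dictionary[word])
               for word, score in recognized if word in dictionary]
    return sorted(results, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    dictionary = {"cat": "a small domesticated feline", "beat": "to strike repeatedly"}
    for word, score, entry in prepare_results([("beat", 0.02), ("cat", 0.34)], dictionary):
        print(f"{word} ({score:.2f}): {entry}")
```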

Abstract

A method for querying an electronic dictionary using letters of an alphabet enunciated by a user includes accepting a speech input from the user. The speech input includes a sequence of spelled letters enunciated by the user that spell a query word. The speech input is analyzed to determine one or more sequences of the letters that approximate the sequence of spelled letters. The one or more sequences of the letters are post-processed so as to produce a plurality of recognized words approximating the query word. The electronic dictionary is queried with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries. A list of results including the plurality of recognized words and the respective plurality of dictionary entries is presented to the user.

Description

    FIELD OF THE INVENTION
  • The present invention relates generally to speech recognition systems, and particularly to methods and systems for querying an electronic dictionary using spoken input.
  • BACKGROUND OF THE INVENTION
  • Many mobile devices and desktop applications enable users to query electronic dictionaries. A dictionary may comprise, for example, a thesaurus or lexicon that provides definitions of words or phrases. In other applications, bilingual or multilingual dictionaries provide translation of words from one language to another.
  • A number of data entry methods are known in the art for entering a word or phrase to be looked up in the dictionary. In some applications, the user types the query word using a keyboard or keypad. For example, Ectaco, Inc. (Long Island City, N.Y.) offers a number of handheld electronic dictionaries and translators. One exemplary product is described in www.ectaco.com/dictionaries/view_info.php3?refid=831&pagelang=23&dict_id=92. Other applications use speech recognition methods, in which the user vocally pronounces the query word. For example, Ectaco, Inc., offers a multilingual translator called “UT-103 Universal Translator” that supports voice input. Additional details regarding this product can be found at www.universal-translator.net.
  • Some dictionary applications use Optical Character Recognition (OCR) methods for entering queries. For example, Wizcom Technologies, Ltd. (Jerusalem, Israel), offers a family of translators and dictionaries called “Quicktionary.” The Quicktionary products are pen-shaped handheld devices that use OCR methods to scan and analyze printed text. Additional details regarding the Quicktionary products can be found at www.wizcomtech.com. Another example of the use of OCR techniques is described by Elgan in “Nothing Lost in Translation,” HP World Magazine, (5:6), June 2002. This article is also available at www.interex.org/hpworldnews/hpw206/pub_hpw_features1.jsp. According to this method, the user takes a picture of the required word using a digital camera. An OCR module produces a string comprising the letters of the word, which is then used for querying the dictionary.
  • Generally speaking, data entry methods are prone to errors. Therefore, some applications use methods for detecting errors or reducing the possibility of erroneous data entry. One way of reducing the probability of error is using two or more different data entry methods for the same word. This approach is sometimes referred to as “multimodal” data entry. For example, some speech recognition applications use alphanumeric data entry from a telephone keypad. Such a technique is described by Parthasarathy in “Experiments in Keypad-Aided Spelling Recognition,” The 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Quebec, Canada, May, 2004. The author describes several schemes for augmenting speech input with input from a telephone keypad in a call-center application.
  • Another example is a flight reservation system that uses keypad entry for error detection, described by Filisko and Seneff in “Error Detection and Recovery in Spoken Dialogue Systems,” Proceedings of the Human Language Technology Conference, North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL 2004), Workshop on Spoken Language Understanding for Conversational Systems, Boston, Mass., May, 2004, pages 31-38.
  • Some applications use letter spelling or phonetic spelling as a mode for data entry. The paper by Filisko and Seneff cited above also describes a “speak and spell” method, in which the user is asked to spell words as an error recovery measure. Another application, in which a user enters a target word using phonetic spelling, is described in U.S. Pat. No. 6,321,196. Spelling a word phonetically means representing each letter in the word to be spelled by a commonly understood word. For example, one may phonetically spell the word “key” by saying “kilo echo yankee.” The inventor describes a speech recognition system in which the user says a sequence of words selected from a given vocabulary without being restricted to a pre-specified phonetic alphabet. The system recognizes the spoken words, associates letters with these words and then arranges the letters to form the target word.
  • Another spelling-based application is described in U.S. Pat. No. 5,995,928. The inventors describe a speech recognition system capable of recognizing a word based on a continuous spelling of the word by a user. The system continuously outputs an updated string of hypothesized letters, based on the letters uttered by the user. The system compares each string of hypothesized letters to a vocabulary list of words and returns a best match for the string.
  • In some speech recognition applications, the user is presented with several alternative results following the automatic recognition process. For example, U.S. Pat. No. 5,027,406 describes a method for creating word models in a natural language dictation system. After the user dictates a word, the system displays a list of the words in the active vocabulary which best match the spoken word. By keyboard or voice command, the user may choose the correct word from the list or may choose to edit a similar word if the correct word is not on the list. Alternatively, the user may type or speak the initial letters of the word.
  • Another user-assisted method is described in U.S. Patent Application Publication 2002/0064257 A1. The inventors describe a voice-activated dialing system that uses a DTMF (dual-tone multi-frequency) entry device to narrow the possibilities for the selection of a phonetically based name. The user enters a DTMF signature of a name and the signature is used by a dictionary to generate likely possibilities for the word. The user is asked to confirm whether the suggested name is the name entered.
  • SUMMARY OF THE INVENTION
  • There is therefore provided, in accordance with an embodiment of the present invention, a method for querying an electronic dictionary using letters of an alphabet enunciated by a user. The method includes accepting a speech input from the user, the speech input including a sequence of spelled letters enunciated by the user that spell a query word. The speech input is analyzed to determine one or more sequences of the letters that approximate the sequence of spelled letters. The one or more sequences of the letters are post-processed so as to produce a plurality of recognized words approximating the query word. The electronic dictionary is queried with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries. A list of results including the plurality of recognized words and the respective plurality of dictionary entries is presented to the user.
  • In an embodiment, analyzing the speech input includes applying at least one of an acoustic model and a language model to the speech input. Additionally or alternatively, applying the language model includes representing at least part of the dictionary in terms of a finite state grammar (FSG). Further additionally or alternatively, applying the language model includes assigning probabilities to the sequences of the letters based on a probabilistic language model.
  • In another embodiment, post-processing the sequences includes defining two or more letter classes including subsets of the letters in the alphabet that have similar sounds, and constructing sequences of the letters by substituting at least one of the letters belonging to the same letter class as at least one of the letters of the query word, so as to produce the plurality of recognized words.
  • In yet another embodiment, querying the dictionary includes accepting a user command including at least one of a typed input and a voice command, and modifying at least one letter of one of the recognized words based on the user command.
  • In still another embodiment, presenting the list of results includes assigning likelihood scores to the recognized words on the list and sorting the list based on the likelihood scores. Additionally or alternatively, presenting the list of results includes converting at least part of the list to a speech output, and playing the speech output to the user. Further additionally or alternatively, presenting the list of results includes accepting a user command including at least one of a typed input and a voice command, and scrolling through the list responsively to the user command.
  • In an embodiment, accepting the speech input includes receiving the speech input via an audio interface associated with a mobile device including at least one of a mobile telephone, a portable computer and a personal digital assistant (PDA), and presenting the list includes providing the list via an output of the mobile device.
  • In another embodiment, accepting the speech input includes sending the speech input from the mobile device to a remote server that serves one or more users, and presenting the list of results includes transmitting the list of results from the remote server to the mobile device for presentation to the user.
  • Apparatus and a computer software product for querying an electronic dictionary are also provided.
  • There is additionally provided, in accordance with an embodiment of the present invention, a system for querying an electronic dictionary using letters of an alphabet enunciated by a user. The system includes a remote server including a memory, which is coupled to store the electronic dictionary.
  • The system includes one or more spelling processors, which are coupled to accept a speech input from the user, the speech input including a sequence of spelled letters enunciated by the user that spell a query word, to analyze the speech input so as to determine one or more sequences of the letters approximating the sequence of spelled letters, to post-process the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word, to query the electronic dictionary stored in the memory with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries, and to generate a list of results including the plurality of recognized words and the respective plurality of dictionary entries.
  • The system also includes a user device, including a client processor, which is coupled to receive the speech input from the user and to send the speech input to the remote server, and which is coupled to receive, responsively to the speech input, the list of results. The user device includes an output device, which is coupled to present the list of results generated by the spelling processor to the user.
  • The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic, pictorial illustration of a system for querying an electronic dictionary, in accordance with an embodiment of the present invention;
  • FIG. 2A is a block diagram that schematically illustrates a mobile device, in accordance with an embodiment of the present invention;
  • FIG. 2B is a block diagram that schematically illustrates a spelling processor, in accordance with an embodiment of the present invention;
  • FIG. 3 is a block diagram that schematically illustrates a system for querying an electronic dictionary, in accordance with another embodiment of the present invention;
  • FIG. 4 is a block diagram that schematically illustrates a system for querying an electronic dictionary, in accordance with yet another embodiment of the present invention; and
  • FIG. 5 is a flow chart that schematically illustrates a method for querying an electronic dictionary, in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS Overview
  • Embodiments of the present invention provide improved methods and systems that allow users of mobile devices to query an electronic dictionary using spelling recognition. Instead of pronouncing the query word as a whole, as implemented in conventional speech recognition systems, the user vocally spells the query word letter by letter. A spelling processor in the mobile device captures and processes the spelled word. A list of possible recognized words is produced, according to predefined models. A list of results, comprising the recognized words along with the corresponding dictionary entries, is presented to the user. The user can then scroll through the results and identify the correct word and dictionary entry.
  • In comparison with conventional speech recognition methods that recognize the entire word, spelling recognition typically achieves better recognition performance. Embodiments of the present invention provide a method and a system that are particularly suitable for users who are not familiar with the language in question, such as tourists or foreigners. Such users may not know the correct pronunciation of words but can easily spell them out. Users with speech impairments, whose pronunciation of words may be difficult to understand, may also benefit from the disclosed methods.
  • On the other hand, reliable letter-by-letter spelling recognition is a non-trivial task that introduces other types of error mechanisms, as will be explained below. The disclosed methods address these error mechanisms by defining appropriate models that determine the list of alternative recognized words. The list is typically sorted by relevance, using relevance measures that are based on the same error mechanisms and/or the model being used.
  • Some embodiments of the present invention also provide a quick and simple user interface for users of mobile devices. The user interface combines spelling recognition with keypad functions and/or voice commands. This multimodal functionality enables quick and smooth operation of the dictionary application by both ordinary users and users with special needs.
  • Additionally, the disclosed user interface enables the user to query the dictionary without having to move his or her eyes from the written text. For blind users who read text written in Braille, the user interface enables querying the dictionary without moving the user's fingers away from the page.
  • In a disclosed embodiment, the result list is converted to speech and played to the user using a text-to-speech (TTS) generator. This implementation is also particularly suitable for blind users and for users who operate the system while driving or carrying out other tasks that require continuous visual attention.
  • In another embodiment, the dictionary query system is implemented in a remote server configuration using distributed speech recognition (DSR).
  • System Description
  • FIG. 1 is a schematic, pictorial illustration of a system for querying an electronic dictionary, in accordance with an embodiment of the present invention. A user 22 communicates using speech 24 with a mobile device 26, for querying an electronic dictionary. The mobile device may comprise a personal digital assistant (PDA), such as one of the palmOne™ PDA products (see www.palmone.com). The mobile device may alternatively comprise a laptop computer, a mobile phone or another device with suitable computational and I/O capabilities. Although the embodiments described hereinbelow relate to mobile devices by way of illustration, the principles of the present invention may also be applied in non-mobile computing devices, such as desktop computers.
  • The mobile device typically comprises a microphone 27 for accepting speech from the user and a keypad 28 for accepting user input. A display 30 presents textual information to the user. In some embodiments, mobile device 26 also comprises a speaker 31 for playing synthesized speech to the user, as will be explained below.
  • The electronic dictionary application may comprise a thesaurus or a lexicon, in which case querying the dictionary means retrieving a definition of a word. Alternatively, the dictionary may comprise a bilingual or multilingual dictionary, in which case querying the dictionary means retrieving a translation of the word to another language. Additional dictionary applications comprise dictionaries that are specific to particular professional disciplines and phrasebooks that translate phrases from one language to another. Other dictionary applications will be apparent to those skilled in the art, and can be implemented using the methods described hereinbelow. In the context of the present patent application and in the claims, the term “dictionary” pertains to any such dictionary application. The term “dictionary entry” refers to the definition or the translation of a word or phrase, as relevant to the particular application.
  • FIG. 2A is a block diagram that schematically illustrates mobile device 26, in accordance with an embodiment of the present invention. Mobile device 26 comprises an input device, such as a microphone 27, that accepts speech input from the user. The speech comprises a query word or phrase, spelled letter-by-letter by the user. A sampler 32 samples the speech input and produces digitized speech. A spelling processor 34 processes the digitized speech and produces a list of possible recognized words. Several alternative recognition methods are explained in detail in the description of FIG. 5 below.
  • The spelling processor is typically implemented as a software process that runs on a central processing unit (CPU) of the mobile device. The spelling processor queries an electronic dictionary 36, which is stored in a memory of the mobile device, and retrieves dictionary entries corresponding to the recognized words. The spelling processor typically displays the list of results using an output device such as display 30. Additionally or alternatively, the output device comprises a text to speech (TTS) generator 38 that converts the list of results, or parts of it, to speech and plays it to the user. Again, a detailed description of the method and the associated user interfaces is given in the description of FIG. 5 below.
  • FIG. 2B is a block diagram that schematically shows details of spelling processor 34, in accordance with an embodiment of the present invention. In some embodiments, the spelling recognition process carried out by processor 34 can be divided into two consecutive steps. A speech recognizer 39 in processor 34 accepts the digitized speech. The speech recognizer applies a suitable model to the digitized speech so as to produce one or more letter sequences, each of which represents a possibly-recognized word. Each letter sequence is assigned a probability value indicating the probability of the particular letter sequence representing the word spelled by the user. In some embodiments, speech recognizer 39 queries dictionary 36 as part of the recognition process. In alternative embodiments, the model used by recognizer 39 already contains at least part of the dictionary.
  • A post processor 41 in spelling processor 34 accepts the letter sequences and associated probabilities from recognizer 39. The post processor queries dictionary 36 with the recognized words and produces an ordered list of results. The list comprises the recognized words and the associated dictionary definitions of these words. The configuration of spelling processor 34 shown in FIG. 2B is typically used in both the local configuration shown in FIG. 2A above and in the remote server configuration shown in FIGS. 3 and 4 below. In some embodiments, speech recognizer 39 and post processor 41 are implemented as two software processes managed by spelling processor 34.
  • FIG. 3 is a block diagram that schematically illustrates a remote server system for querying electronic dictionary 36, in accordance with another embodiment of the present invention. In some cases it is preferable to implement the dictionary application using a remote server configuration. In a remote server configuration, the electronic dictionary is located in a single central location. Multiple users can query the dictionary using distributed speech recognition (DSR) techniques, as are known in the art.
  • A centralized dictionary configuration is sometimes preferred because it enables the use of larger dictionaries. Large dictionaries, or dictionaries holding large and detailed entries, may significantly exceed the memory storage capabilities of typical mobile devices. Additionally, maintaining and updating information in a centralized dictionary data structure is often easier than managing multiple dictionaries distributed between multiple users.
  • The configuration shown in FIG. 3 comprises an application server 40. Spelling processor 34 and dictionary 36 are located in server 40. Although FIG. 3 shows a single spelling processor, typical implementations of server 40 comprise multiple spelling processors 34 that interact with multiple mobile devices 26. The multiple spelling processors are typically implemented as parallel software instances or threads running on one or more CPUs of server 40. Dictionary 36 can be implemented using any suitable data structure, such as a database, suitable for multi-user access.
  • In the remote server configuration, mobile device 26 comprises a client processor 42 that accepts the speech input from the user via microphone 27 and sampler 32 (not shown in this figure). Processor 42 compresses the captured and digitized speech and transmits it, typically in a compact form, such as a stream of compressed feature vectors, to spelling processor 34 in server 40. The spelling processor decompresses the feature vectors, processes the decompressed speech and queries dictionary 36, according to the method of FIG. 5 below. The processing performed by spelling processor 34 in the remote server configuration is similar to that performed in the local configuration shown in FIG. 2A above. The spelling processor sends the list of recognized words and the corresponding dictionary entries to client processor 42 in the mobile device. The client processor presents the results to the user using display 30 and/or TTS generator 38. The client processor handles the user interface, which allows the user to scroll and edit the list of results using keypad 28 and/or voice commands. Again, the user interface is explained in detail in the description of FIG. 5 below. (Simplified sketches of this client-side transmission and of the parallel server-side processing follow the description of speech processing step 54 below.)
  • Mobile device 26 and server 40 are linked by a communication channel. The channel is used to send compressed speech to the server, send result lists to the mobile device and exchange miscellaneous control information. The communication channel may comprise any suitable medium, such as an Internet connection, a telephone line, a wireless data network, a cellular network, or a combination of several such media.
  • FIG. 4 is a block diagram that schematically illustrates a remote server system for querying electronic dictionary 36, in accordance with yet another embodiment of the present invention. The configuration of FIG. 4 is similar to the configuration of FIG. 3 above, except that in the configuration of FIG. 4 the text-to-speech conversion function is also split between the server and the mobile device. Server 40 here comprises TTS generator 38, which in this embodiment accepts the list of results from the spelling processor and converts it (or parts of it) to a stream of compressed speech feature vectors. The compressed speech is then sent to the mobile device over the communication channel. A speech decoder in the mobile device decompresses and decodes the received feature vectors and plays the decoded speech to the user.
  • Typically, spelling processor 34 and client processor 42 comprise general-purpose computer processors, which are programmed in software to carry out the functions described herein. The software may be downloaded to the computers in electronic form, over a network, for example, or it may alternatively be supplied to the computers on tangible media, such as CD-ROM. The spelling processor may be a standalone unit, or it may be integrated with other computing functions of mobile device 26 or server 40. Additionally or alternatively, at least some of the functions of the spelling processor may be implemented using dedicated hardware. Client processor 42 may likewise be integrated with other computing functions of mobile device 26.
  • Dictionary Querying Method Description
  • FIG. 5 is a flow chart that schematically illustrates a method for querying electronic dictionary 36, in accordance with an embodiment of the present invention. The method begins with user 22 entering a query word or phrase, at a word entry step 50. For this purpose, the user first initiates the dictionary application running on mobile device 26. The user then starts the speech acquisition process, for example by clicking a button on keypad 28. The user spells the query word vocally, letter by letter. After spelling the entire word the user stops the speech acquisition process, for example using keypad 28. The mobile device captures the speech comprising the sequence of spelled letters using microphone 27. Sampler 32 digitizes the captured speech. In another embodiment, the user can start and stop the speech acquisition process using predetermined voice commands.
  • (If the disclosed method is implemented using a remote server configuration, as shown in FIGS. 3 and 4 above, client processor 42 transmits data, typically in the form of a stream of compressed feature vectors, that represent the processed speech to the spelling processor, at a speech transmission step 52. As shown in FIGS. 3 and 4 above, the spelling processor in such a configuration is part of server 40. If the method is implemented locally in the mobile device, as shown in FIG. 2A above, step 52 is omitted.)
  • Speech recognizer 39 and post processor 41 in spelling processor 34 (FIG. 2B) process the digitized speech, at a speech processing step 54. Speech recognizer 39 analyzes the digitized speech, typically segmenting the speech into phonetic components that represent individual letters of the query word. Various methods are known in the art for identifying a phonetic sound within a limited vocabulary. Any suitable method can be used by the speech recognizer to identify the spelled letters in the captured speech. Most methods do not require user-specific training (sometimes referred to as “user enrollment”) because of the small vocabulary and the small user-dependent differences in pronunciation of spelled letters.
  • However, in specific cases, such as users with speech impairments or users with heavy accents, the use of learned user-specific speech characteristics may improve the quality of recognition. In some embodiments, speech recognizer 39 extracts additional information from the digitized speech, to be used in the recognition process as will be explained below.
  • In some embodiments, the speech recognizer uses a suitable acoustic model for assigning a likelihood score to each identified spelled letter. Each likelihood score quantifies the likelihood that the particular letter was indeed uttered by the user.
  • The speech recognizer uses a language model, which may be based in whole or in part on the dictionary being used. Using the language model, the speech recognizer generates one or more letter sequences that represent possibly-recognized words in response to the captured input speech.
  • In some embodiments, the language model comprises a graph representing the dictionary, which is commonly referred to as a Finite State Grammar (FSG). Finite state grammars (sometimes also referred to as finite-state networks) are described, for example, by Rabiner and Juang in “Fundamentals of Speech Recognition,” Prentice Hall, April 1993, pages 414-416. The nodes of the FSG represent letters of the alphabet. (In typical implementations, each letter of the alphabet appears several times in the graph.) Arcs between nodes represent adjacent letters in legitimate words. In other words, each word in the dictionary is represented as a trajectory or path through the graph.
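  • For readers who prefer code to prose, the sketch below builds a simple trie-style letter graph from a handful of words; it is only one possible realization of the FSG idea described above, and the small word list is a hypothetical stand-in for dictionary 36.

```python
# Minimal sketch (assumed realization): a trie-style letter graph in which
# arcs connect adjacent letters of legitimate words, so every dictionary word
# is a path from the root.
WORDS = ["cat", "bat", "pad", "the", "beat"]

def build_letter_graph(words):
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})   # arc to the next letter
        node["<end>"] = {}                       # marks a complete word
    return root

def is_word(graph, letters):
    """Return True if the letter sequence traces a complete path (a word)."""
    node = graph
    for letter in letters:
        if letter not in node:
            return False
        node = node[letter]
    return "<end>" in node

graph = build_letter_graph(WORDS)
print(is_word(graph, "cat"))   # True
print(is_word(graph, "cax"))   # False
```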
  • In some embodiments, only part of the dictionary is represented as an FSG. In many practical cases, FSG-based models are used for small to medium size vocabularies and dictionaries, typically up to several thousand words.
  • When using the FSG, the speech recognizer typically compares the sequence of spelled letters of the digitized speech to the different trajectories through the FSG. In some embodiments, the speech recognizer assigns likelihood scores to the trajectories. The speech recognizer produces the letter sequences and the associated likelihood scores.
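  • A minimal sketch of such trajectory scoring appears below, assuming that an acoustic model has already supplied per-position log-likelihoods for candidate letters; the scores and the small word list are invented for illustration only.

```python
# Minimal sketch (assumed scores): rank FSG trajectories by summing
# hypothetical per-position letter log-likelihoods over each dictionary word.
import math

def score_trajectories(words, letter_loglik):
    """letter_loglik: one dict per spelled position, mapping
    letter -> log-likelihood that this letter was spoken there."""
    scored = []
    for word in words:
        if len(word) != len(letter_loglik):
            continue                      # trajectory length must match
        score = sum(pos.get(ch, -math.inf) for ch, pos in zip(word, letter_loglik))
        if score > -math.inf:
            scored.append((score, word))
    return sorted(scored, reverse=True)

# Hypothetical scores for a three-letter utterance resembling /c/ /a/ /t/.
loglik = [{"c": -0.2, "b": -1.1, "t": -1.4},
          {"a": -0.1, "h": -2.0},
          {"t": -0.3, "d": -1.0, "e": -1.8}]
print(score_trajectories(["cat", "bat", "the", "pad"], loglik))
```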
  • In other embodiments, the language model comprises a probabilistic language model, which assigns probabilities to different letter sequences in the vocabulary. Probabilistic language models are described, for example, by Young in “A Review of Large-Vocabulary Continuous-Speech Recognition,” IEEE Signal Processing Magazine, September 1996, pages 45-57. Probabilistic language models are typically used when the size of the dictionary is very large, making it difficult to represent every word in the model explicitly. In these embodiments, speech recognizer 39 produces one or more letter sequences that resemble the sequence of spelled letters, with associated likelihood scores in accordance with the probabilistic language model.
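  • The following sketch illustrates one very simple probabilistic language model of this kind, a smoothed letter-bigram model; the training words and the smoothing constant are assumptions chosen only to make the example run, not parameters taken from the patent.

```python
# Minimal sketch (assumed model): a smoothed letter-bigram language model that
# assigns log-probabilities to arbitrary letter sequences.
import math
from collections import defaultdict

def train_bigram(words, alpha=0.1):
    """Count letter-to-letter transitions, with ^ and $ as word boundaries."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1.0
    return counts, alpha

def log_prob(model, word, alphabet="abcdefghijklmnopqrstuvwxyz$"):
    """Smoothed log-probability of a letter sequence under the bigram model."""
    counts, alpha = model
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts[a]
        denom = sum(row.values()) + alpha * len(alphabet)
        total += math.log((row[b] + alpha) / denom)
    return total

model = train_bigram(["cat", "cab", "bat", "the", "that"])
print(log_prob(model, "cat"), log_prob(model, "czt"))  # "cat" scores higher
```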
  • In yet another embodiment, the speech recognizer represents the different letter sequences produced by the probabilistic language model in terms of a lattice. The lattice is a graph comprising the possible sequences of letters, with each sequence assigned a respective likelihood score, according to the probabilistic language model.
  • Following the speech recognition process, speech recognizer 39 provides to post processor 41 one or more letter sequences with associated likelihood scores, as described above.
  • In one embodiment, when speech recognizer 39 uses a FSG as the language model, the letter sequences provided to post processor 41 are already legitimate words that appear in dictionary 36.
  • In another embodiment, in which speech recognizer 39 uses a probabilistic language model with lattice output, as described above, post processor 41 selects a subset of the letter sequences in the lattice, having the highest likelihood scores. Since not all of the possible letter sequences in the lattice necessarily correspond to legitimate dictionary words, post processor 41 typically queries dictionary 36 with the selected letter sequences, and discards words that do not appear in the dictionary.
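  • The short sketch below illustrates this post-processing step under simplified assumptions: the candidate list stands in for real lattice output, and the three-entry dictionary stands in for dictionary 36.

```python
# Minimal sketch (assumed data): keep the highest-scoring letter sequences and
# discard any that are not actual dictionary entries.
DICTIONARY = {"cat": "a small domesticated feline",
              "bat": "a flying nocturnal mammal",
              "the": "definite article"}

def lookup_candidates(candidates, dictionary, top_k=3):
    """candidates: list of (letter_sequence, likelihood_score) pairs."""
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    return [(seq, dictionary[seq]) for seq, _ in best if seq in dictionary]

candidates = [("cat", -0.6), ("cad", -0.9), ("bat", -1.5), ("tat", -2.2)]
print(lookup_candidates(candidates, DICTIONARY))   # "cad" and "tat" dropped
```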
  • In yet another embodiment, in which speech recognizer 39 uses a probabilistic language model, speech recognizer 39 outputs only the letter sequence having the maximum likelihood score (referred to hereinbelow as the highest ranking sequence). Post processor 41 constructs a list of alternative letter sequences based on the highest ranking sequence by using letter classes, as explained below.
  • Spelled letters can be classified into letter classes based on their pronunciation characteristics. During speech recognition, some spelled letters may be mistaken for one another. For example, the spelled letters /b/, /c/, /d/, /e/, /g/, /p/, /t/, /v/ and /z/ all belong to the same letter class (referred to as the "e-class"). These letters all have similar vowel sounds when spelled, and the speech recognizer may erroneously substitute one such letter for another. Similarly, the speech recognizer may erroneously interchange letters belonging to the "a-class" (/a/, /h/, /j/, /k/), the "i-class" (/i/, /y/) and the "u-class" (/u/, /q/).
  • The probabilities of mistaking one letter for another are typically represented as a matrix, which is called a "confusion matrix." The probability of interchanging letters belonging to different letter classes is assumed to be small. When using letter classes, the post processor constructs the list of alternative letter sequences by replacing each letter of the highest ranking sequence with similarly-sounding letters, according to the letter classes described above. The post processor typically ranks the list, for example by computing likelihood scores based on the confusion matrix.
  • For example, assume the user has spelled the word "cat" and that the highest ranking sequence recognized by speech recognizer 39 is /c/, /a/, /t/. Using the letter classes described above, the post processor constructs a list of alternative letter sequences defined by [{e-class}, {a-class}, {e-class}] (i.e., all 9×4×9=324 three-letter strings in which the first letter belongs to the e-class, the second letter belongs to the a-class and the third letter again belongs to the e-class). In some embodiments, the alternative letter sequences may also comprise a different number of letters, or letters from other letter classes. For example, the query word "cat" could also be recognized as "beat."
  • Obviously, only a few of the alternative letter sequences produced in the above example (such as "bat", "the", "pad" and the original "cat") are meaningful words. Most are meaningless strings. Note also that the pronunciation of an alternative word, read as a whole, may be very different from the pronunciation of the query word. As an extreme example, the sound of the word "the" is very different from the sound of the word "cat". Nevertheless, these two words are both considered legitimate alternative letter sequences by the spelling processor, because the spelled sequence /t/, /h/, /e/ does sound similar to the spelled sequence /c/, /a/, /t/. The post processor maintains (or produces in the first place) only the letter sequences that correspond to meaningful words. The post processor may differentiate between meaningful and meaningless letter sequences by querying dictionary 36 or by using suitable grammatical rules, which are part of the language model being used.
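  • As a concrete (and deliberately simplified) illustration of the letter-class expansion and dictionary filtering described above, the sketch below expands the sequence "cat" through the e-, a-, i- and u-classes, keeps only strings that are dictionary words, and ranks the survivors by a crude letter-agreement score that merely stands in for a real confusion-matrix likelihood.

```python
# Minimal sketch (assumptions throughout): expand the highest ranking sequence
# through letter classes, keep only dictionary words, and rank survivors by how
# many letters they share with the recognized sequence.
from itertools import product

LETTER_CLASSES = [set("bcdegptvz"),  # "e-class"
                  set("ahjk"),       # "a-class"
                  set("iy"),         # "i-class"
                  set("uq")]         # "u-class"

def letter_class(letter):
    for members in LETTER_CLASSES:
        if letter in members:
            return members
    return {letter}        # letters outside any class map only to themselves

def expand_by_class(best_sequence, dictionary):
    classes = [letter_class(ch) for ch in best_sequence]
    candidates = ("".join(chars) for chars in product(*classes))
    words = [w for w in candidates if w in dictionary]

    def agreement(w):      # crude stand-in for a confusion-matrix score
        return sum(a == b for a, b in zip(w, best_sequence))

    return sorted(words, key=agreement, reverse=True)

WORDS = {"cat", "bat", "pad", "the", "tab", "zag"}
print(expand_by_class("cat", WORDS))   # e.g. 'cat' first, then 'bat', ...
```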
  • In order to minimize the probability of false recognition, the spelling processor may request the user's assistance in determining which one of the recognized letter sequences, or recognized words, is the original query word entered by the user. For this purpose, the post processor prepares a list of results, at a list preparation step 56. In some embodiments, the post processor produces the list of results in accordance with one of the language models described above. In some embodiments, the post processor sorts the list of results in descending order of relevance. The relevance score of a particular recognized word is typically determined in accordance with the language model being used, as described above. Alternatively, the list can be sorted alphabetically, or using any other suitable criterion.
  • (If the disclosed method is implemented using a remote server configuration, as shown in FIGS. 3 and 4 above, spelling processor 34 in server 40 transmits the list of results to client processor 42, at a result transmission step 58. If the method is implemented locally in the mobile device, as shown in FIG. 2A above, step 58 is omitted.)
  • The spelling processor presents the list of results to the user, at a presentation step 60. Typically, the list of recognized words is displayed as text on display 30 of the mobile device. The user may scroll through the list using keypad 28 until he or she finds the intended query word and the corresponding dictionary entry. Alternatively, only the first word on the list is displayed, together with its dictionary entry. If the first recognized word on the result list is incorrect, the user may scroll down and select the next word. Any other suitable presentation method can be used, depending upon the particular application and the capabilities of keypad 28 and display 30 of the mobile device. Additionally, the user can edit the displayed recognized words at any time using the keypad, so as to enter part or all of the intended query word.
  • In another embodiment, the list of results is converted to speech using TTS generator 38 and played to the user through speaker 31. The user can indicate, either using the keypad or by uttering a voice command, when the correct word is being played. After selecting the correct word, the TTS generator plays the corresponding dictionary entry.
  • Although the disclosed methods mainly address spelling-based dictionary lookup in mobile devices, the same methods can be used in a variety of additional applications. For example, the disclosed methods can also be used in desktop or mainframe computer applications that require high quality word recognition. Such applications include, for example, directory assistance services and name dialing applications.
  • It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

Claims (22)

1. A method for querying an electronic dictionary using letters of an alphabet enunciated by a user, the method comprising:
accepting a speech input from the user, the speech input comprising a sequence of spelled letters enunciated by the user that spell a query word;
analyzing the speech input to determine one or more sequences of the letters that approximate the sequence of spelled letters;
post-processing the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word;
querying the electronic dictionary with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries; and
presenting a list of results comprising the plurality of recognized words and the respective plurality of dictionary entries to the user.
2. The method according to claim 1, wherein analyzing the speech input comprises applying at least one of an acoustic model and a language model to the speech input.
3. The method according to claim 2, wherein applying the language model comprises representing at least part of the dictionary in terms of a finite state grammar (FSG).
4. The method according to claim 2, wherein applying the language model comprises assigning probabilities to the sequences of the letters based on a probabilistic language model.
5. The method according to claim 1, wherein post-processing the sequences comprises defining two or more letter classes comprising subsets of the letters in the alphabet that have similar sounds, and constructing sequences of the letters by substituting at least one of the letters belonging to the same letter class as at least one of the letters of the query word, so as to produce the plurality of recognized words.
6. The method according to claim 1, wherein querying the dictionary comprises accepting a user command comprising at least one of a typed input and a voice command, and modifying at least one letter of one of the recognized words based on the user command.
7. The method according to claim 1, wherein presenting the list of results comprises assigning likelihood scores to the recognized words on the list and sorting the list based on the likelihood scores.
8. The method according to claim 1, wherein presenting the list of results comprises converting at least part of the list to a speech output, and playing the speech output to the user.
9. The method according to claim 1, wherein presenting the list of results comprises accepting a user command comprising at least one of a typed input and a voice command, and scrolling through the list responsively to the user command.
10. The method according to claim 1, wherein accepting the speech input comprises receiving the speech input via an audio interface associated with a mobile device comprising at least one of a mobile telephone, a portable computer and a personal digital assistant (PDA), and wherein presenting the list comprises providing the list via an output of the mobile device.
11. The method according to claim 10, wherein accepting the speech input comprises sending the speech input from the mobile device to a remote server that serves one or more users, and wherein presenting the list of results comprises transmitting the list of results from the remote server to the mobile device for presentation to the user.
12. Apparatus for querying an electronic dictionary using letters of an alphabet enunciated by a user, the apparatus comprising:
a memory, which is arranged to store the electronic dictionary;
a spelling processor, which is arranged to accept a speech input from the user, the speech input comprising a sequence of spelled letters enunciated by the user that spell a query word, to analyze the speech input so as to determine one or more sequences of the letters that approximate the sequence of spelled letters, to post-process the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word, to query the electronic dictionary stored in the memory with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries, and to generate a list of results comprising the plurality of recognized words and the respective plurality of dictionary entries; and
an output device, which is coupled to present the list of results generated by the spelling processor to the user.
13. The apparatus according to claim 12, wherein the spelling processor comprises a speech recognizer, which is arranged to apply at least one of an acoustic model and a language model so as to analyze the speech input.
14. The apparatus according to claim 13, wherein the language model comprises a finite state grammar (FSG) representing at least part of the dictionary.
15. The apparatus according to claim 13, wherein the language model comprises a probabilistic language model, and wherein the speech recognizer is arranged to assign probabilities to the recognized words based on the probabilistic language model.
16. The apparatus according to claim 12, wherein the spelling processor is arranged to define two or more letter classes comprising subsets of the letters in the alphabet that have similar sounds, and to construct sequences of the letters by substituting at least one of the letters belonging to the same letter class as at least one of the letters of the query word, so as to produce the plurality of recognized words.
17. The apparatus according to claim 12, wherein the spelling processor is arranged to accept a user command comprising at least one of a typed input and a voice command, and to modify at least one letter of one of the recognized words based on the user command.
18. The apparatus according to claim 12, wherein the spelling processor is arranged to assign likelihood scores to the recognized words on the list of results and to sort the list based on the likelihood scores.
19. The apparatus according to claim 12, wherein the output device comprises a text-to-speech converter, which is arranged to convert at least part of the list to a speech output and to play the speech output to the user.
20. The apparatus according to claim 12, wherein the spelling processor is arranged to receive the speech input via an audio interface associated with a mobile device comprising at least one of a mobile telephone, a portable computer and a personal digital assistant (PDA), and to provide the list of results via an output of the mobile device.
21. A system for querying an electronic dictionary using letters of an alphabet enunciated by a user, the system comprising:
a remote server comprising:
a memory, which is coupled to store the electronic dictionary; and
one or more spelling processors, which are coupled to accept a speech input from the user, the speech input comprising a sequence of spelled letters enunciated by the user that spell a query word, to analyze the speech input so as to determine one or more sequences of the letters approximating the sequence of spelled letters, to post-process the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word, to query the electronic dictionary stored in the memory with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries, and to generate a list of results comprising the plurality of recognized words and the respective plurality of dictionary entries; and
a user device, comprising:
a client processor, which is coupled to receive the speech input from the user and to send the speech input to the remote server, and which is coupled to receive, responsively to the speech input, the list of results; and
an output device, which is coupled to present the list of results generated by the spelling processor to the user.
22. A computer software product for querying an electronic dictionary using letters of an alphabet enunciated by a user, the product comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to accept a speech input from the user, the speech input comprising a sequence of spelled letters enunciated by the user that spell a query word, to analyze the speech input so as to determine one or more sequences of the letters approximating the sequence of spelled letters, to post-process the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word, to query the electronic dictionary with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries, to generate a list of results comprising the plurality of recognized words and the respective plurality of dictionary entries, and to output the list of results generated by the spelling processor for presentation to the user.
US11/176,154 2005-07-07 2005-07-07 Dictionary lookup for mobile devices using spelling recognition Abandoned US20070016420A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
US11/176,154 US20070016420A1 (en) 2005-07-07 2005-07-07 Dictionary lookup for mobile devices using spelling recognition
EP06763137A EP1905001A1 (en) 2005-07-07 2006-05-12 Dictionary lookup for mobile devices using spelling recognition
CNA2006800245515A CN101218625A (en) 2005-07-07 2006-05-12 Dictionary lookup for mobile devices using spelling recognition
CA002613154A CA2613154A1 (en) 2005-07-07 2006-05-12 Dictionary lookup for mobile devices using spelling recognition
PCT/EP2006/062284 WO2007006596A1 (en) 2005-07-07 2006-05-12 Dictionary lookup for mobile devices using spelling recognition
BRPI0613699-0A BRPI0613699A2 (en) 2005-07-07 2006-05-12 mobile dictionary search that uses handwriting recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/176,154 US20070016420A1 (en) 2005-07-07 2005-07-07 Dictionary lookup for mobile devices using spelling recognition

Publications (1)

Publication Number Publication Date
US20070016420A1 true US20070016420A1 (en) 2007-01-18

Family

ID=36617037

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/176,154 Abandoned US20070016420A1 (en) 2005-07-07 2005-07-07 Dictionary lookup for mobile devices using spelling recognition

Country Status (6)

Country Link
US (1) US20070016420A1 (en)
EP (1) EP1905001A1 (en)
CN (1) CN101218625A (en)
BR (1) BRPI0613699A2 (en)
CA (1) CA2613154A1 (en)
WO (1) WO2007006596A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722525A (en) * 2012-05-15 2012-10-10 北京百度网讯科技有限公司 Methods and systems for establishing language model of address book names and searching voice
CN105096945A (en) * 2015-08-31 2015-11-25 百度在线网络技术(北京)有限公司 Voice recognition method and voice recognition device for terminal
US10446143B2 (en) * 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
CN110019667A (en) * 2017-10-20 2019-07-16 沪江教育科技(上海)股份有限公司 It is a kind of that word method and device is looked into based on voice input information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6182039B1 (en) * 1998-03-24 2001-01-30 Matsushita Electric Industrial Co., Ltd. Method and apparatus using probabilistic language model based on confusable sets for speech recognition
DE19944608A1 (en) * 1999-09-17 2001-03-22 Philips Corp Intellectual Pty Recognition of spoken speech input in spelled form
EP1352388B1 (en) * 2000-12-14 2005-04-27 Siemens Aktiengesellschaft Speech recognition method and system for a handheld device
EP1396840A1 (en) * 2002-08-12 2004-03-10 Siemens Aktiengesellschaft Spelling speech recognition apparatus

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4890230A (en) * 1986-12-19 1989-12-26 Electric Industry Co., Ltd. Electronic dictionary
US5027406A (en) * 1988-12-06 1991-06-25 Dragon Systems, Inc. Method for interactive speech recognition and training
US20020032566A1 (en) * 1996-02-09 2002-03-14 Eli Tzirkel-Hancock Apparatus, method and computer readable memory medium for speech recogniton using dynamic programming
US5995928A (en) * 1996-10-02 1999-11-30 Speechworks International, Inc. Method and apparatus for continuous spelling speech recognition with early identification
US6047257A (en) * 1997-03-01 2000-04-04 Agfa-Gevaert Identification of medical images through speech recognition
US20020013707A1 (en) * 1998-12-18 2002-01-31 Rhonda Shaw System for developing word-pronunciation pairs
US6321196B1 (en) * 1999-07-02 2001-11-20 International Business Machines Corporation Phonetic spelling for speech recognition
US20040117189A1 (en) * 1999-11-12 2004-06-17 Bennett Ian M. Query engine for processing voice based queries including semantic decoding
US6304844B1 (en) * 2000-03-30 2001-10-16 Verbaltek, Inc. Spelling speech recognition apparatus and method for communications
US20040014484A1 (en) * 2000-09-25 2004-01-22 Takahiro Kawashima Mobile terminal device
US20020064257A1 (en) * 2000-11-30 2002-05-30 Denenberg Lawrence A. System for storing voice recognizable identifiers using a limited input device such as a telephone key pad
US6728348B2 (en) * 2000-11-30 2004-04-27 Comverse, Inc. System for storing voice recognizable identifiers using a limited input device such as a telephone key pad
US20040049388A1 (en) * 2001-09-05 2004-03-11 Roth Daniel L. Methods, systems, and programming for performing speech recognition
US20030067495A1 (en) * 2001-10-04 2003-04-10 Infogation Corporation System and method for dynamic key assignment in enhanced user interface
US20040172258A1 (en) * 2002-12-10 2004-09-02 Dominach Richard F. Techniques for disambiguating speech input using multimodal interfaces
US20060100871A1 (en) * 2004-10-27 2006-05-11 Samsung Electronics Co., Ltd. Speech recognition method, apparatus and navigation system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080120110A1 (en) * 2006-11-20 2008-05-22 Mcdonald Samuel A Handheld voice activated spelling device
US8756063B2 (en) * 2006-11-20 2014-06-17 Samuel A. McDonald Handheld voice activated spelling device
US20110137638A1 (en) * 2009-12-04 2011-06-09 Gm Global Technology Operations, Inc. Robust speech recognition based on spelling with phonetic letter families
US8195456B2 (en) * 2009-12-04 2012-06-05 GM Global Technology Operations LLC Robust speech recognition based on spelling with phonetic letter families
US10290299B2 (en) 2014-07-17 2019-05-14 Microsoft Technology Licensing, Llc Speech recognition using a foreign word grammar
US11514904B2 (en) * 2017-11-30 2022-11-29 International Business Machines Corporation Filtering directive invoking vocal utterances
CN113053362A (en) * 2021-03-30 2021-06-29 建信金融科技有限责任公司 Method, device, equipment and computer readable medium for speech recognition

Also Published As

Publication number Publication date
BRPI0613699A2 (en) 2011-01-25
CA2613154A1 (en) 2007-01-18
CN101218625A (en) 2008-07-09
WO2007006596A1 (en) 2007-01-18
EP1905001A1 (en) 2008-04-02

Similar Documents

Publication Publication Date Title
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US7047195B2 (en) Speech translation device and computer readable medium
US6067520A (en) System and method of recognizing continuous mandarin speech utilizing chinese hidden markou models
JP4267081B2 (en) Pattern recognition registration in distributed systems
US6937983B2 (en) Method and system for semantic speech recognition
US7162423B2 (en) Method and apparatus for generating and displaying N-Best alternatives in a speech recognition system
US6526380B1 (en) Speech recognition system having parallel large vocabulary recognition engines
KR100769029B1 (en) Method and system for voice recognition of names in multiple languages
KR101309042B1 (en) Apparatus for multi domain sound communication and method for multi domain sound communication using the same
US20070016420A1 (en) Dictionary lookup for mobile devices using spelling recognition
US5937383A (en) Apparatus and methods for speech recognition including individual or speaker class dependent decoding history caches for fast word acceptance or rejection
JP5703491B2 (en) Language model / speech recognition dictionary creation device and information processing device using language model / speech recognition dictionary created thereby
JP4987682B2 (en) Voice chat system, information processing apparatus, voice recognition method and program
JP2002540477A (en) Client-server speech recognition
KR20010108402A (en) Client-server speech recognition
EP1617409A1 (en) Multimodal method to provide input to a computing device
KR20060037086A (en) Method and apparatus for speech recognition, and navigation system using for the same
Bai et al. Syllable-based Chinese text/spoken document retrieval using text/speech queries
KR101250897B1 (en) Apparatus for word entry searching in a portable electronic dictionary and method thereof
JP2000056795A (en) Speech recognition device
EP1135768B1 (en) Spell mode in a speech recognizer
JP2008083165A (en) Voice recognition processing program and voice recognition processing method
JP3748429B2 (en) Speech input type compound noun search device and speech input type compound noun search method
JPH05119793A (en) Method and device for speech recognition
Wang et al. Browsing the Chinese Web pages using Mandarin speech

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AZULAI, OPHIR;HOORY, RON;SIVAN, ZOHAR;REEL/FRAME:016548/0150

Effective date: 20050627

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION