US6393398B1 - Continuous speech recognizing apparatus and a recording medium thereof - Google Patents


Info

Publication number
US6393398B1
US6393398B1 (application US09/447,391; US44739199A)
Authority
US
United States
Prior art keywords
word string
speech recognition
word
speech
pass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US09/447,391
Inventor
Toru Imai
Akio Ando
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Japan Broadcasting Corp
Original Assignee
Nippon Hoso Kyokai NHK
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Hoso Kyokai NHK filed Critical Nippon Hoso Kyokai NHK
Assigned to NIPPON HOSO KYOKAI (assignment of assignors interest; see document for details). Assignors: ANDO, AKIO; IMAI, TORU
Application granted
Publication of US6393398B1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search


Abstract

A second pass processor detects a stable portion in a 1-best word string obtained by a second pass processing, and determines the word string in the detected stable portion as a speech recognition result.

Description

This application is based on Patent Application No. 11-269457 (1999) filed on Sep. 22, 1999 in Japan, the content of which is incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a continuous speech recognizing apparatus for recognizing continuous speech, and particularly to a continuous speech recognizing apparatus for carrying out speech recognition using a probabilistic language model and to a recording medium.
2. Description of the Related Art
As one of the conventional continuous speech recognizing apparatuses that carry out speech recognition using a probabilistic language model, an apparatus that recognizes continuous speech using a multiple-pass decoder is known. The system narrows the list of word candidates for the speech to be recognized by carrying out a time synchronous search in a first pass circuit using a simple model. Subsequently, after the speech input is completed, it determines the word candidates in the list obtained in the first pass in a second pass circuit using a complex model (Imai, et al., Technical Report of Information Processing Society of Japan, SLP-23-11 (October 1998)). The inventors of the present application also proposed a continuous speech recognizing apparatus that carries out a time synchronous Viterbi beam search using a bigram in the first pass (Imai, et al., Proceedings of Autumn Meeting of the Acoustical Society of Japan, 3-1-12 (September 1998)).
That continuous speech recognizing apparatus carries out a word-dependent N-best search of a tree-structure phoneme network (see R. Schwarz, et al., ICASSP-91, pp. 701-704 (May 1991)).
It obtains N-best sentences by recursively tracing back a word lattice composed of the end time of each word candidate, its score, and a pointer to the first previous word (see R. Schwarz, et al., ICASSP-91, pp. 701-704 (May 1991)). Then, it determines a maximum likelihood word string as the recognition result by rescoring the N-best sentences using a trigram.
When the one-pass processing of continuous speech is executed in such a multiple-pass continuous speech recognizing apparatus, the speech recognition candidate for the word string at the final position at the current time tends to differ from the word string that will be obtained at the next time at the corresponding location. As a result, the speech recognition candidates for a sentence are unstable until the speech input of the sentence has been completed, and hence the second pass circuit cannot determine the speech recognition result until then. This causes a large time lag (delay) between the instant the speech is input and the output of the speech recognition result from the continuous speech recognizing apparatus.
Such a time lag presents a problem when producing real-time subtitles by recognizing speech broadcast in news programs.
SUMMARY OF THE INVENTION
Therefore, an object of the present invention is to provide a continuous speech recognizing apparatus and a recording medium capable of reducing, in a multiple-pass speech recognizing apparatus, the time lag between the input of speech and the output of the speech recognition result.
In the first aspect of the present invention, there is provided a continuous speech recognizing apparatus that obtains from input continuous speech a plurality of speech recognition candidates of a word string using a simple probabilistic language model in a first pass processor, and that determines a speech recognition result of the plurality of speech recognition candidates using a complex probabilistic language model in a second pass processor, wherein
the first pass processor obtains word strings of the plurality of speech recognition candidates of the continuous speech at fixed time intervals from an input start time, and
the second pass processor comprises:
word string selecting means for selecting, using the complex probabilistic language model, a maximum likelihood word string from among the word strings of the plurality of speech recognition candidates obtained at the fixed time intervals, and
speech recognition result determining means for detecting a stable portion in the word strings detected at every fixed interval, and for successively determining a word string of the stable portion as the speech recognition result.
Here, the speech recognition result determining means may comprise:
a comparator for comparing a first word string with a second word string, the first word string consisting of a word string currently detected by the word string selecting means with the exception of a final portion of the word string, and the second word string consisting of speech recognition candidates previously obtained by the word string selecting means; and
a determining section for determining, when the comparator makes a decision that a same word string as the second word string is contained in the first word string, the second word string as the speech recognition result.
The first pass processor may obtain the plurality of speech recognition candidates by tracing back a word lattice beginning from a phoneme with a maximum score as of now when a plurality of speech recognition candidates of a word string are obtained by using the simple probabilistic language model.
Trace back timing of the word lattice may be made variable.
The first pass processor may trace back the word lattice beginning from a plurality of currently active phonemes.
In the second aspect of the present invention, there is provided a recording medium having a computer executable program code means for obtaining from input continuous speech a plurality of speech recognition candidates of a word string using a simple probabilistic language model in a first pass, and for determining a speech recognition result of the plurality of speech recognition candidates using a complex probabilistic language model in a second pass,
wherein
the first pass comprises a step of obtaining, beginning from an input start time, word strings of the plurality of speech recognition candidates of the continuous speech at fixed time intervals, and
the second pass comprises:
a word string selecting step of selecting, using the complex probabilistic language model, a maximum likelihood word string from among the word strings of the plurality of speech recognition candidates obtained at the fixed time intervals, and
a speech recognition result determining step of detecting a stable portion in the word strings detected at every fixed interval, and of successively determining a word string of the stable portion as the speech recognition result.
Here, the speech recognition result determining step may comprise:
a comparing step of comparing a first word string with a second word string, the first word string consisting of a word string currently detected in the word string selecting step with the exception of a final portion of the word string, and the second word string consisting of speech recognition candidates previously obtained in the word string selecting step; and
a determining step of determining, when the comparing step makes a decision that a same word string as the second word string is contained in the first word string, the second word string as the speech recognition result.
The first pass may obtain the plurality of speech recognition candidates by tracing back a word lattice beginning from a phoneme with a maximum score as of now when a plurality of speech recognition candidates of a word string are obtained by using the simple probabilistic language model.
Trace back timing of the word lattice may be made variable.
The first pass may trace back the word lattice beginning from a plurality of currently active phonemes.
The present invention detects a stable portion of a maximum likelihood word string (the 1-best word string in the following embodiments) obtained by carrying out the two-pass processing successively, and makes it a partial speech recognition result. This makes it possible to determine the speech recognition result successively while continuous speech is being input. In addition, it makes it possible to reduce the time lag between the speech input and the subtitle output to a minimum while maintaining recognition accuracy, even when generating subtitles automatically by recognizing the speech of television news.
The above and other objects, effects, features and advantages of the present invention will become more apparent from the following description of embodiments thereof taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing a functional configuration of an embodiment in accordance with the present invention;
FIG. 2 is a block diagram showing a functional configuration of a first pass processor;
FIG. 3 is a block diagram showing a configuration of a second pass processor;
FIG. 4 is a diagram illustrating a processing of the embodiment in accordance with the present invention;
FIG. 5 is a block diagram showing a hardware configuration of the embodiment in accordance with the present invention;
FIG. 6 is a flowchart illustrating the processing executed by a CPU;
FIG. 7 is a flowchart illustrating the contents of the first pass processing executed by the CPU; and
FIG. 8 is a flowchart illustrating the contents of the second pass processing executed by the CPU.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
FIG. 1 shows a functional configuration of a continuous speech recognizing apparatus in accordance with the present invention. FIG. 2 shows the detail of the first pass processor 2 as shown in FIG. 1; and FIG. 3 shows the detail of the second pass processor 3 as shown in FIG. 1.
An acoustic analyzer 1 carries out analog/digital (A/D) conversion of a speech signal input from a speech input section such as a microphone, followed by acoustic analysis, and outputs parameters indicating speech features. As the acoustic analyzer 1, a conventional acoustic analyzer can be utilized.
A first pass processor 2 uses a simple probabilistic language model, such as a word bigram, and successively generates a word lattice 4, which is composed of the end time of each word candidate, its score, and a pointer to the previous word, using a word lattice generator 21. Then, it has a trace back section 22 trace back the word lattice 4, beginning from the phoneme with the maximum score as of now, at every Δt frames (for example, 30 frames, where one frame is 10 ms) of the input speech, thereby obtaining N (for example, 200) word string candidates (called N-best word strings from now on). The first pass processor 2 can also have a configuration analogous to conventional hardware (for example, see Imai, et al., Proceedings of Autumn Meeting of the Acoustical Society of Japan, 3-1-12 (September 1998)), except that, whereas the conventional system does not output the N-best word string candidates for a sentence until the speech input of the sentence is completed, the present embodiment outputs, while the speech is being input, the N-best word strings generated up to that instant at every Δt frames, even midway through the sentence.
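To make the lattice and trace back concrete, here is a minimal Python sketch; the names (LatticeEntry, trace_back) and the structure are illustrative assumptions, since the patent describes the data held by the lattice (end time, score, pointer to the previous word) but not an implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LatticeEntry:
    """One word candidate in the word lattice: the word itself, the end
    time (frame index) of the candidate, its accumulated score, and a
    back pointer to the first previous word (None at utterance start)."""
    word: str
    end_frame: int
    score: float
    prev: Optional["LatticeEntry"] = None

def trace_back(entry: Optional[LatticeEntry]) -> List[str]:
    """Follow back pointers from a lattice entry (e.g. the hypothesis
    with the maximum score as of now) to recover its word string."""
    words: List[str] = []
    while entry is not None:
        words.append(entry.word)
        entry = entry.prev
    return list(reversed(words))
```

In the embodiment, such a trace back would be triggered every Δt frames (every 30 frames, i.e. 300 ms, in the example) rather than only at the end of the sentence.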
A second pass processor 3 has a rescoring section 31 rescore the N-best word strings generated at every Δt frames by using a more complex probabilistic language model 6 (see, FIG. 3), such as a word trigram, and selects from among the N-best word strings a best word string (1-best word string) with the maximum score.
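A minimal sketch of this rescoring step follows, assuming the N-best list arrives as (word string, first-pass score) pairs and the word trigram is a plain probability table; rescore_1best, lm_weight, and the log-domain combination of scores are assumptions for illustration, not the patent's specification.

```python
import math
from typing import Dict, List, Tuple

Trigram = Dict[Tuple[str, str, str], float]  # (w1, w2, w3) -> P(w3 | w1 w2)

def rescore_1best(nbest: List[Tuple[List[str], float]],
                  trigram: Trigram,
                  lm_weight: float = 1.0,
                  floor: float = 1e-7) -> List[str]:
    """Rescore each N-best word string with the trigram model and return
    the word string with the maximum combined score (the 1-best)."""
    best_words: List[str] = []
    best_score = float("-inf")
    for words, first_pass_score in nbest:
        padded = ["<s>", "<s>"] + words  # sentence-start padding for trigram context
        lm_score = sum(
            math.log(trigram.get((padded[i - 2], padded[i - 1], padded[i]), floor))
            for i in range(2, len(padded)))
        total = first_pass_score + lm_weight * lm_score
        if total > best_score:
            best_words, best_score = words, total
    return best_words
```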
A word comparing/determining section 32 compares the current 1-best word string with the previous 1-best word string obtained Δt frames before to detect a stable portion in the current 1-best word string, and when the same word string is included in both the current and previous 1-best word strings, it determines the same word string as a speech recognition result (see, FIG. 4).
In this case, the final M (for example, one) words of the current 1-best word string are not counted as candidates to be determined. Furthermore, even if a variation in the current 1-best word string takes place in a section that has already been fixed, the variation is ignored.
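The comparison and determination described above can be sketched as follows, assuming word strings are lists of words; stable_prefix, already_fixed (the length of the section already fixed, whose later variations are ignored), and final_m (the final M words excluded from determination) are hypothetical names:

```python
from typing import List

def stable_prefix(current_1best: List[str],
                  previous_1best: List[str],
                  already_fixed: int,
                  final_m: int = 1) -> List[str]:
    """Return the newly determined words: the portion of the current
    1-best string (excluding its final `final_m` words) that coincides,
    word for word, with the previous 1-best string, minus the first
    `already_fixed` words that were determined earlier."""
    candidate = current_1best[:len(current_1best) - final_m]
    stable: List[str] = []
    for cur, prev in zip(candidate, previous_1best):
        if cur != prev:
            break
        stable.append(cur)
    return stable[already_fixed:]
```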
Because the second pass processor 3 determines the 1-best word string up to that instant in the input speech at every Δt frames as long as the speech input continues, the time lag between the speech input of each sentence and the output of the speech recognition result is about the Δt frame time (about 300 ms with the example values above: 30 frames × 10 ms). Considering that the time lag of the conventional system is nearly equal to the duration of the input speech of each sentence, it is clear that the present embodiment can sharply reduce the time lag.
Although the foregoing continuous speech recognizing apparatus can be implemented in the form of a digital circuit, it can also be implemented using a computer that executes programs which will be described below. A configuration of a computer system for this purpose is shown in FIG. 5. In FIG. 5, a CPU 100 executes programs which will be described below, and carries out a continuous speech recognizing processing. A system memory 110 temporarily stores input/output data for the information processing of the CPU 100. A hard disk drive (abbreviated to HDD from now on) 130 stores the foregoing simple model 5 and complex model 6 among others.
The HDD 130 also stores a continuous speech recognizing program, which is loaded from the HDD 130 into the system memory 110 in response to an instruction from a keyboard or mouse (not shown) and is executed by the CPU 100.
An input interface (I/O) 120 carries out the A/D conversion of a speech signal input from a microphone, and supplies the CPU 100 with a digital speech signal.
Although it is assumed in the following description that the present embodiment employs a personal computer as its computer, a one-chip digital processor can also be utilized as the computer. In this case, a nonvolatile memory such as a ROM can preferably be used instead of the HDD 130.
The operation of the continuous speech recognizing apparatus with the foregoing configuration will now be described with reference to FIGS. 6-8. FIG. 6 illustrates the main processing of a program for the continuous speech recognizing processing. FIG. 7 illustrates details of the first pass processing of FIG. 6, and FIG. 8 illustrates details of the second pass processing.
Although the programs are written in a programming language the CPU 100 can execute and are stored in the HDD 130, FIGS. 6-8 illustrate the functions of the individual processings for convenience of description.
When a speech input such as “A, B, C, D, E, . . . ” is made from the microphone, the CPU 100 temporarily stores the digital speech signal in sequence into the system memory 110 through the I/O 120. The CPU 100 carries out acoustic analysis of the temporarily stored speech signal in a conventional manner, for example phoneme by phoneme, and stores the acoustic analysis result in the system memory 110 (step S10).
Receiving the speech input “A”, for example, the CPU 100 carries out word speech recognition in the first pass processing at step S20 using the simple model (word bigram) stored in the HDD 130 and the acoustic analysis result, captures word string candidates such as “A” (in this case the word string is one character long), or those in the form of Hiragana or Kanji, and generates a word lattice composed, for each candidate, of the end time of the word candidate, its score (points indicating the probability of the recognition result), and a pointer to the first previous word (step S100 of FIG. 7). The word lattice generated is temporarily stored in the system memory 110 (step S110). When the time Δt has elapsed from the speech input, the CPU 100 traces back the word lattice on the system memory 110 beginning from the phoneme with the currently maximum score, and transfers to the second pass processing at step S30 the initial N word string candidates such as “A”, or those in the form of Hiragana or Kanji, that is, the N-best word strings (steps S120→S130).
Subsequently, following the processing procedure as shown in FIG. 8, the CPU 100 carries out rescoring (recalculation of the scores) of the N-best word strings using the complex model. The word string candidate with the highest score in the N-best word strings, that is, “A” in this case, is selected as the 1-best word string, and is temporarily stored in the system memory 110 (step S210).
Because no other speech is input before the “A” in this case, the processing proceeds from step S220 to Return in FIG. 8, so that the execution procedure of the CPU 100 transfers to the end decision at step S40. Since the user does not provide the end instruction of the program at this instant, the procedure proceeds through steps S40→S10→S20 to execute the acoustic analysis and the first pass processing of the next speech input “B”. Incidentally, while the Δt frame time has not elapsed since the previous trace back processing in the first pass processing, the loop through step S120 of FIG. 7→step S10 of FIG. 6→S20→step S100 of FIG. 7→S120 is repeated, and the word lattice generated is stored in the system memory 110.
When the Δt frame time has elapsed, the decision processing at step S120 of FIG. 7 becomes positive (“YES”), and the procedure proceeds to the trace back processing at step S130. The trace back processing, receiving an undetermined word string to be traced back, generates N-best word strings by combining the word string candidates of the speech input “A” with those of the speech input “B”, and provides them to the second pass processing at step S30 of FIG. 6.
In the second pass processing, the CPU 100 recalculates the scores of the received two-word candidates using the complex model (step S200 of FIG. 8). Because the current speech input is “A B”, the score of “A B” is the highest among the two-word candidates, and hence it is selected as the 1-best word string. The 1-best word string selected is stored in the system memory 110.
Because the 1-best word strings currently stored in the system memory 110 are “A” and “A B”, the current 1-best word string “A B” is compared with the Δt frame previous 1-best word string “A”.
Because the current 1-best word string “A B” includes a word string that agrees with the previous 1-best word string “A”, the decision result at step S220 of FIG. 8 is positive (“YES”). In response to the positive decision, the CPU 100 determines the coincident word string “A” as the partial speech recognition result in the continuous speech.
On the other hand, when a negative decision (“NO”) is obtained at step S220, the determination processing at step S230 is skipped, and the processing is returned to step S40 of FIG. 6, followed by the acoustic analysis of the speech (step S10).
By repeating the foregoing processing procedure, the next speech input “C” yields the word string “B C” as the word string following the portion already determined as the speech recognition result. Since the word string “B” also follows the previously determined portion in the previous 1-best word string, the decision processing at step S220 of FIG. 8 makes a coincident decision on “B”, and the coincident word string “B” is determined as the speech recognition result. The speech recognition result thus decided is cumulatively stored in the system memory 110 (step S230). Thus, the cumulatively stored speech recognition result at this time is “A B”.
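Using the hypothetical stable_prefix sketch shown earlier, the walkthrough above would play out like this (illustration only):

```python
# After the second trace back: previous 1-best "A", current 1-best "A B";
# the final M = 1 word ("B") is excluded, and "A" is newly determined.
print(stable_prefix(["A", "B"], ["A"], already_fixed=0))            # ['A']

# After the next trace back: previous 1-best "A B", current "A B C";
# "A" is already fixed, so only "B" is newly determined ("A B" so far).
print(stable_prefix(["A", "B", "C"], ["A", "B"], already_fixed=1))  # ['B']
```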
As described above, in the present embodiment, the foregoing processing procedure carries out the two-pass processing every time the speech input takes place, detects the stable portion in the 1-best word strings obtained by the two-pass processing, and sequentially outputs the stable portions as the speech recognition candidates to be determined. This enables the speech recognition result to be displayed in such a manner that it follows the speech input in real time.
When the user inputs an end instruction to the CPU 100 through a keyboard or mouse (not shown), the end decision at step S40 of FIG. 6 becomes positive (“YES”), completing the processing procedure of FIG. 6.
The following modifications can be implemented in addition to the foregoing embodiment.
1) Although the foregoing embodiment employs a hard disk storage as the recording medium for storing the continuous speech recognizing programs, IC memories such as a ROM or RAM, or a portable recording medium such as a floppy disk or magneto-optical disk, can also be used. When the program is downloaded from an external system through a communications line, the storage that stores the program on the external system side corresponds to the recording medium in accordance with the present invention.
2) Although the foregoing embodiment compares the two adjacent 1-best word strings to obtain the stable portion in the 1-best word string, the interval between trace backs can be made variable. Tracing back from a plurality of currently active phonemes will also provide a more stable word string (one possible reading of this is sketched after this list).
3) Although the foregoing embodiment employs the word bigram (which gives the probability of a two-word sequence) as the simple probabilistic language model, and the word trigram (which gives the probability of a three-word sequence) as the complex probabilistic language model, the present invention is not limited to these.
4) The foregoing embodiment is an example for describing the present invention, and various modifications of the foregoing embodiment can be made without departing from the technical idea defined in the claims of the present invention. The variations of the embodiment fall within the scope of the present application.
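The following is a sketch of one plausible reading of modification 2) above, under the assumption that "more stable" means tracing back from all currently active phonemes and keeping only the word prefix on which every resulting string agrees; common_prefix and stable_from_active are illustrative names, and trace_back is the hypothetical helper sketched earlier:

```python
from typing import Iterable, List

def common_prefix(strings: List[List[str]]) -> List[str]:
    """Longest word prefix shared by all of the given word strings."""
    prefix: List[str] = []
    for column in zip(*strings):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def stable_from_active(active_entries: Iterable["LatticeEntry"]) -> List[str]:
    """Trace back from every currently active lattice entry and keep
    only the portion on which all hypotheses already agree."""
    return common_prefix([trace_back(e) for e in active_entries])
```

Because every active hypothesis must agree before a word is accepted, this variant is more conservative than tracing back from the single best phoneme, which is one way the modification could yield a more stable word string.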
As described above, according to the present invention, the stable portion in the word string having a maximum likelihood (the 1-best word string in the embodiment) is detected by executing the two-pass processings successively, and is made the partial speech recognition result. This makes it possible to determine the speech recognition result successively while the continuous speech is being input, and to reduce the time lag involved in generating subtitles from the speech while maintaining a high recognition rate, even when producing the subtitles automatically through speech recognition of TV news.
The present invention has been described in detail with respect to preferred embodiments, and it will now be apparent from the foregoing to those skilled in the art that changes and modifications may be made without departing from the invention in its broader aspects, and it is the intention, therefore, in the appended claims to cover all such changes and modifications as fall within the true spirit of the invention.

Claims (10)

What is claimed is:
1. A continuous speech recognizing apparatus that obtains from input continuous speech a plurality of speech recognition candidates of a word string using a simple probabilistic language model in a first pass processor, and that determines a speech recognition result of the plurality of speech recognition candidates using a complex probabilistic language model in a second pass processor, wherein
said first pass processor obtains word strings of the plurality of speech recognition candidates of the continuous speech at fixed time intervals from an input start time, and
said second pass processor comprises:
word string selecting means for selecting, using the complex probabilistic language model, a maximum likelihood word string from among the word strings of the plurality of speech recognition candidates obtained at the fixed time intervals, and
speech recognition result determining means for detecting a stable portion in word strings detected at every fixed interval, and for successively determining a word string of the stable portion as the speech recognition result.
2. The continuous speech recognizing apparatus as claimed in claim 1, wherein said speech recognition result determining means comprises:
a comparator for comparing a first word string with a second word string, the first word string consisting of a word string currently detected by said word string selecting means with the exception of a final portion of the word string, and the second word string consisting of speech recognition candidates previously obtained by said word string selecting means; and
a determining section for determining, when said comparator makes a decision that a same word string as the second word string is contained in the first word string, the second word string as the speech recognition result.
3. The continuous speech recognizing apparatus as claimed in claim 1, wherein said first pass processor obtains the plurality of speech recognition candidates by tracing back a word lattice beginning from a phoneme with a maximum score as of now when a plurality of speech recognition candidates of a word string are obtained by using the simple probabilistic language model.
4. The continuous speech recognizing apparatus as claimed in claim 3, wherein trace back timing of the word lattice is made variable.
5. The continuous speech recognizing apparatus as claimed in claim 3, wherein said first pass processor traces back the word lattice beginning from a plurality of currently active phonemes.
6. A recording medium having a computer executable program code means for obtaining from input continuous speech a plurality of speech recognition candidates of a word string using a simple probabilistic language model in a first pass, and for determining a speech recognition result of the plurality of speech recognition candidates using a complex probabilistic language model in a second pass, wherein
said first pass comprises a step of obtaining, beginning from an input start time, word strings of the plurality of speech recognition candidates of the continuous speech at fixed time intervals, and
said second pass comprises:
a word string selecting step of selecting, using the complex probabilistic language model, a maximum likelihood word string from among the word strings of the plurality of speech recognition candidates obtained at the fixed time intervals, and
a speech recognition result determining step of detecting a stable portion in word strings detected at every fixed interval, and of successively determining a word string of the stable portion as the speech recognition result.
7. The recording medium as claimed in claim 6, wherein said speech recognition result determining step comprises:
a comparing step of comparing a first word string with a second word string, the first word string consisting of a word string currently detected in said word string selecting step with the exception of a final portion of the word string, and the second word string consisting of speech recognition candidates previously obtained in said word string selecting step; and
a determining step of determining, when said comparing step makes a decision that a same word string as the second word string is contained in the first word string, the second word string as the speech recognition result.
8. The recording medium as claimed in claim 6, wherein said first pass obtains the plurality of speech recognition candidates by tracing back a word lattice beginning from a phoneme with a maximum score as of now when a plurality of speech recognition candidates of a word string are obtained by using the simple probabilistic language model.
9. The recording medium as claimed in claim 8, wherein trace back timing of the word lattice is made variable.
10. The recording medium as claimed in claim 8, wherein said first pass traces back the word lattice beginning from a plurality of currently active phonemes.
US09/447,391 1999-09-22 1999-11-22 Continuous speech recognizing apparatus and a recording medium thereof Expired - Lifetime US6393398B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP26945799A JP3834169B2 (en) 1999-09-22 1999-09-22 Continuous speech recognition apparatus and recording medium
JP11-269457 1999-09-22

Publications (1)

Publication Number Publication Date
US6393398B1

Family

ID=17472712

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/447,391 Expired - Lifetime US6393398B1 (en) 1999-09-22 1999-11-22 Continuous speech recognizing apparatus and a recording medium thereof

Country Status (2)

Country Link
US (1) US6393398B1 (en)
JP (1) JP3834169B2 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010128560A1 (en) * 2009-05-08 2010-11-11 パイオニア株式会社 Voice recognition device, voice recognition method, and voice recognition program
US9583095B2 (en) 2009-07-17 2017-02-28 Nec Corporation Speech processing device, method, and storage medium
JP6508808B2 (en) * 2014-10-16 2019-05-08 日本放送協会 Speech recognition error correction device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4489435A (en) * 1981-10-05 1984-12-18 Exxon Corporation Method and apparatus for continuous word string recognition
US5349645A (en) * 1991-12-31 1994-09-20 Matsushita Electric Industrial Co., Ltd. Word hypothesizer for continuous speech decoding using stressed-vowel centered bidirectional tree searches
US5737489A (en) * 1995-09-15 1998-04-07 Lucent Technologies Inc. Discriminative utterance verification for connected digits recognition
US6076057A (en) * 1997-05-21 2000-06-13 At&T Corp Unsupervised HMM adaptation based on speech-silence discrimination
US5953701A (en) * 1998-01-22 1999-09-14 International Business Machines Corporation Speech recognition models combining gender-dependent and gender-independent phone states and using phonetic-context-dependence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Imai, et al., A Broadcast News Transcription System for Captioning, Information Processing Society of Japan, SLP-23-11 (Oct. 1998). (no translation).
Imai, et al., Development of a Decoder for News Speech Recognition, Proceedings of Autumn Meeting of the Acoustical Society of Japan, 3-1-12 (Sep. 1998). (translation provided).
R. Schwarz, et al., A Comparison of Several Approximate Algorithms for Finding Multiple (N-BEST) Sentence Hypotheses, ICASSP-91, pp. 701-704, (May 1991).

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7013277B2 (en) * 2000-02-28 2006-03-14 Sony Corporation Speech recognition apparatus, speech recognition method, and storage medium
US20010020226A1 (en) * 2000-02-28 2001-09-06 Katsuki Minamino Voice recognition apparatus, voice recognition method, and recording medium
US7035788B1 (en) * 2000-04-25 2006-04-25 Microsoft Corporation Language model sharing
US20060173674A1 (en) * 2000-04-25 2006-08-03 Microsoft Corporation Language model sharing
US7895031B2 (en) 2000-04-25 2011-02-22 Microsoft Corporation Language model sharing
US20020156627A1 (en) * 2001-02-20 2002-10-24 International Business Machines Corporation Speech recognition apparatus and computer system therefor, speech recognition method and program and recording medium therefor
US6985863B2 (en) * 2001-02-20 2006-01-10 International Business Machines Corporation Speech recognition apparatus and method utilizing a language model prepared for expressions unique to spontaneous speech
US20020160342A1 (en) * 2001-04-26 2002-10-31 Felix Castro Teaching method and device
US6802717B2 (en) * 2001-04-26 2004-10-12 Felix Castro Teaching method and device
US20030023439A1 (en) * 2001-05-02 2003-01-30 Gregory Ciurpita Method and apparatus for automatic recognition of long sequences of spoken digits
US7725318B2 (en) * 2004-07-30 2010-05-25 Nice Systems Inc. System and method for improving the accuracy of audio searching
US20060074898A1 (en) * 2004-07-30 2006-04-06 Marsal Gavalda System and method for improving the accuracy of audio searching
US7925506B2 (en) * 2004-10-05 2011-04-12 Inago Corporation Speech recognition accuracy via concept to keyword mapping
US20110191099A1 (en) * 2004-10-05 2011-08-04 Inago Corporation System and Methods for Improving Accuracy of Speech Recognition
US20060074671A1 (en) * 2004-10-05 2006-04-06 Gary Farmaner System and methods for improving accuracy of speech recognition
US8352266B2 (en) 2004-10-05 2013-01-08 Inago Corporation System and methods for improving accuracy of speech recognition utilizing concept to keyword mapping
US20070132834A1 (en) * 2005-12-08 2007-06-14 International Business Machines Corporation Speech disambiguation in a composite services enablement environment
US20080040111A1 (en) * 2006-03-24 2008-02-14 Kohtaroh Miyamoto Caption Correction Device
US7729917B2 (en) 2006-03-24 2010-06-01 Nuance Communications, Inc. Correction of a caption produced by speech recognition
US20080208577A1 (en) * 2007-02-23 2008-08-28 Samsung Electronics Co., Ltd. Multi-stage speech recognition apparatus and method
US8762142B2 (en) * 2007-02-23 2014-06-24 Samsung Electronics Co., Ltd. Multi-stage speech recognition apparatus and method
US20090030680A1 (en) * 2007-07-23 2009-01-29 Jonathan Joseph Mamou Method and System of Indexing Speech Data
US9405823B2 (en) 2007-07-23 2016-08-02 Nuance Communications, Inc. Spoken document retrieval using multiple speech transcription indices
US8831946B2 (en) * 2007-07-23 2014-09-09 Nuance Communications, Inc. Method and system of indexing speech data
US20120116560A1 (en) * 2009-04-01 2012-05-10 Motorola Mobility, Inc. Apparatus and Method for Generating an Output Audio Data Signal
US9230555B2 (en) * 2009-04-01 2016-01-05 Google Technology Holdings LLC Apparatus and method for generating an output audio data signal
US8965545B2 (en) 2010-09-30 2015-02-24 Google Inc. Progressive encoding of audio
US8509931B2 (en) 2010-09-30 2013-08-13 Google Inc. Progressive encoding of audio
US9129609B2 (en) * 2011-01-28 2015-09-08 Nippon Hoso Kyokai Speech speed conversion factor determining device, speech speed conversion device, program, and storage medium
US20130325456A1 (en) * 2011-01-28 2013-12-05 Nippon Hoso Kyokai Speech speed conversion factor determining device, speech speed conversion device, program, and storage medium
US8909512B2 (en) 2011-11-01 2014-12-09 Google Inc. Enhanced stability prediction for incrementally generated speech recognition hypotheses based on an age of a hypothesis
CN103918026A (en) * 2011-11-01 2014-07-09 谷歌公司 Enhanced stability prediction for incrementally generated speech recognition hypotheses
WO2013066468A1 (en) * 2011-11-01 2013-05-10 Google Inc. Enhanced stability prediction for incrementally generated speech recognition hypotheses
CN103918026B (en) * 2011-11-01 2018-01-02 谷歌公司 The stability prediction for the enhancing assumed for the speech recognition being incrementally generated
CN103811014A (en) * 2012-11-15 2014-05-21 纬创资通股份有限公司 Voice interference filtering method and voice interference filtering system
US20140358533A1 (en) * 2013-05-30 2014-12-04 International Business Machines Corporation Pronunciation accuracy in speech recognition
US9384730B2 (en) * 2013-05-30 2016-07-05 International Business Machines Corporation Pronunciation accuracy in speech recognition
US20160210964A1 (en) * 2013-05-30 2016-07-21 International Business Machines Corporation Pronunciation accuracy in speech recognition
US9978364B2 (en) * 2013-05-30 2018-05-22 International Business Machines Corporation Pronunciation accuracy in speech recognition
US10572538B2 (en) 2015-04-28 2020-02-25 Kabushiki Kaisha Toshiba Lattice finalization device, pattern recognition device, lattice finalization method, and computer program product

Also Published As

Publication number Publication date
JP2001092496A (en) 2001-04-06
JP3834169B2 (en) 2006-10-18

Similar Documents

Publication Publication Date Title
US6393398B1 (en) Continuous speech recognizing apparatus and a recording medium thereof
US5218668A (en) Keyword recognition system and method using template concantenation model
US5884259A (en) Method and apparatus for a time-synchronous tree-based search strategy
US6275801B1 (en) Non-leaf node penalty score assignment system and method for improving acoustic fast match speed in large vocabulary systems
US6076056A (en) Speech recognition system for recognizing continuous and isolated speech
US6801892B2 (en) Method and system for the reduction of processing time in a speech recognition system using the hidden markov model
US6178401B1 (en) Method for reducing search complexity in a speech recognition system
US6374219B1 (en) System for using silence in speech recognition
JPH11191000A (en) Method for aligning text and voice signal
US6801891B2 (en) Speech processing system
US6662159B2 (en) Recognizing speech data using a state transition model
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
JP2008216756A (en) Technique for acquiring character string or the like to be newly recognized as phrase
US5987409A (en) Method of and apparatus for deriving a plurality of sequences of words from a speech signal
CN111552777B (en) Audio identification method and device, electronic equipment and storage medium
JP2002215187A (en) Speech recognition method and device for the same
US6374218B2 (en) Speech recognition system which displays a subject for recognizing an inputted voice
JP2000056795A (en) Speech recognition device
KR100374921B1 (en) Word string recognition method and word string determination device
JP3006496B2 (en) Voice recognition device
US20040148163A1 (en) System and method for utilizing an anchor to reduce memory requirements for speech recognition
Bai et al. A multi-phase approach for fast spotting of large vocabulary Chinese keywords from Mandarin speech using prosodic information
WO2021181451A1 (en) Speech recognition device, control method, and program
CN112542159B (en) Data processing method and device
JP2004012615A (en) Continuous speech recognition apparatus and continuous speech recognition method, continuous speech recognition program and program recording medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: NIPPON HOSO KYOKAI, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:IMAI, TORU;ANDO, AKIO;REEL/FRAME:010564/0991

Effective date: 20000207

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

FPAY Fee payment

Year of fee payment: 12