CN100578612C - Speech processing device - Google Patents

Speech processing device

Info

Publication number
CN100578612C
CN100578612C (application CN200610006603A)
Authority
CN
China
Prior art keywords
speech
voice
dictionary
sound
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200610006603A
Other languages
Chinese (zh)
Other versions
CN1819016A (en)
Inventor
关根直树
柿野友成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba TEC Corp
Original Assignee
Toshiba TEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba TEC Corp
Publication of CN1819016A
Application granted
Publication of CN100578612C
Status: Expired - Fee Related

Abstract

The object is to perform speech recognition without requiring a forward trigger. The system includes: a speech/non-speech discrimination section 5 that determines whether sound input from a speech input section 3 is speech or non-speech; a keyword dictionary 10; a dictionary 13 for speech recognition; a speech recognition section 8 that performs recognition based on the dictionary 13; a speech keyword detection section 11 that detects whether sound judged to be speech by the discrimination section 5 matches a word registered in advance in the keyword dictionary 10; and a recognition instruction section 9 that, at the moment the sound input from the speech input section 3 is detected to contain a word registered in the keyword dictionary 10, instructs the speech recognition section 8 to perform speech recognition on the input sound. Speech recognition is thus triggered by a specific utterance spoken after the user utters the desired phrase.

Description

Speech processing apparatus
Technical field
The present invention relates to a speech processing apparatus capable of performing speech recognition and speaker identification, used for controlling various devices by voice.
Background art
In speech processing for speech recognition and speaker identification, misrecognition commonly occurs because ambient environmental sound is picked up along with the target speech. To eliminate this drawback, Patent Document 1 below discloses a technique in which the user presses a speech operation button before uttering the target phrase; this is commonly called push-to-talk (PTT). Patent Document 2 below discloses a technique that replaces the operation button of Patent Document 1 with the utterance of a specific keyword: the system waits for the single word that serves as the keyword, and recognizes the information that follows it. This is called the voice-command ("magic word") method. Such a specific operation performed before the target phrase, whether pressing the operation button or uttering the keyword, is referred to below as a forward trigger.
[Patent Document 1] JP-A-8-328584
[Patent Document 2] JP-A-2000-322078
[Non-Patent Document 1] Sadaoki Furui, "Acoustics and Speech Engineering", Kindai Kagaku Sha
A forward trigger forces the user to perform an operation or utterance that is not something the user intends to do for its own sake, and is therefore a burden. Moreover, after performing the forward trigger, the user is required to speak reliably and accurately. Since the target phrase comes after the trigger (the button press or the keyword utterance), the speaker is conscious of this accuracy requirement and becomes tense, which makes stumbling or misspeaking highly likely. Misrecognition caused by the user's own utterance is therefore hard to avoid.
Summary of the invention
The invention provides a speech processing apparatus comprising:
a speech input section capable of receiving sound that includes a user's speech;
an AD conversion section that converts the sound input from the speech input section into a digital signal;
a speech/non-speech discrimination section that determines whether the sound input from the AD conversion section is speech or non-speech;
a speech recording section that records, in a recording data storage, the sound judged to be speech by the speech/non-speech discrimination section;
a keyword dictionary in which keywords can be registered in advance;
a speech keyword detection section that detects whether the sound judged to be speech by the speech/non-speech discrimination section is a word registered in advance in the keyword dictionary;
a dictionary for speech recognition;
a speech recognition section that, on receiving an instruction to perform speech recognition, recognizes the sound recorded in the recording data storage based on the dictionary for speech recognition; and
a recognition instruction section that issues the instruction to perform speech recognition to the speech recognition section,
wherein the recognition instruction section instructs the speech recognition section to start speech recognition at the moment the speech keyword detection section detects a word registered in advance in the keyword dictionary.
The invention also provides a speech processing apparatus comprising:
a speech input section capable of receiving sound that includes a user's speech;
an AD conversion section that converts the sound input from the speech input section into a digital signal;
a speech/non-speech discrimination section that determines whether the sound input from the AD conversion section is speech or non-speech;
a keyword dictionary in which keywords can be registered in advance;
a speech keyword detection section that detects whether the sound judged to be speech by the speech/non-speech discrimination section is a word registered in advance in the keyword dictionary;
a dictionary for speech recognition; and
a speech recognition section that recognizes, based on the dictionary for speech recognition, the sound judged to be speech by the speech/non-speech discrimination section,
wherein the apparatus has a recognition result judgment section with the function of accepting the result of the speech recognition section at the moment the speech keyword detection section detects a word registered in the keyword dictionary.
The invention thus includes: a speech/non-speech discrimination section that determines whether sound input through the speech input section, including the user's speech, is speech or non-speech; a keyword dictionary in which keywords can be registered in advance; a dictionary for speech recognition; a speech recognition section that performs recognition based on that dictionary; a speech keyword detection section that detects whether sound judged to be speech is a word registered in advance in the keyword dictionary; and a recognition instruction section that, at the moment the sound input from the speech input section is detected to contain a registered keyword, instructs the speech recognition section to recognize the input sound. Speech recognition is triggered by a specific utterance (the keyword) spoken after the user's target phrase.
The user is not forced to perform a forward trigger before speaking, and speech can be recognized with a natural utterance. That is, unlike the voice-command method, the specific utterance (keyword) comes after the target phrase; by the time the keyword is spoken, the target phrase has already been said, so there is no tension. The likelihood of stumbling over or misspeaking the keyword is therefore reduced, and reliable speech recognition is possible with natural utterances.
Brief description of the drawings
Fig. 1 is a block diagram of the speech processing apparatus of a first embodiment of the present invention.
Fig. 2 is a waveform chart of the sound source information of speech.
Fig. 3 is a waveform chart of the sound source information of non-speech.
Fig. 4 is a graph showing the distribution (frequency of occurrence) of the maximum of the spectrum correlation feature for speech and non-speech.
Fig. 5 is a schematic diagram of the speech keyword detection section.
Fig. 6 is a flowchart of the operation from the speech keyword detection section to the speech recognition section.
Fig. 7 is a chart showing the relation between the elapsed time of an utterance and the transitions of each section's operation.
Fig. 8 is a block diagram of the speech processing apparatus of a second embodiment of the present invention.
Fig. 9 is an explanatory diagram of the relation between users and keywords.
Fig. 10 is a block diagram of the speech processing apparatus of a third embodiment of the present invention.
Fig. 11 is a block diagram of the speech processing apparatus of a fourth embodiment of the present invention.
Embodiments
A first embodiment of the present invention is described based on Figs. 1 to 7.
Fig. 1 is a block diagram showing the overall structure of a speech processing apparatus 1. The block diagram of Fig. 1 is a functional block diagram, and the functions shown in it are executed by a computer (not shown). In other words, the functions of Fig. 1 are realized by a processor performing arithmetic processing according to program code that makes the computer execute those functions. The processor and the storage medium holding the program code may be firmware forming an integrated circuit, or may be constituted by, for example, a general-purpose computer. In the latter case, as one example, the program code is installed in advance on the HDD or the like of the general-purpose computer; the installed code is copied to, for example, RAM, and the computer's built-in processor executes the functions of Fig. 1 according to the copied code.
The speech processing apparatus 1 of this embodiment has a speech input section 3 capable of receiving sound that includes speech uttered by a speaker 2. Connected in series to this speech input section 3 are: an AD conversion section 4 that converts the sound input from the speech input section 3 into a digital signal; a speech/non-speech discrimination section 5 that determines whether the sound input from the AD conversion section 4 is speech or non-speech; a speech recording section 7 that records, in a recording data storage 6, the sound judged to be speech by the speech/non-speech discrimination section 5; and a recognition instruction section 9 having the function of sending the sound recorded in the recording data storage 6 to a speech recognition section 8 in a later stage. A speech keyword detection section 11 is connected between the speech/non-speech discrimination section 5 and the recognition instruction section 9; a keyword dictionary 10 is connected to a keyword changing section 12; and the speech keyword detection section 11 detects whether the sound judged to be speech by the speech/non-speech discrimination section 5 is a word registered in advance in the keyword dictionary 10. The recording data storage 6 is connected between the speech recording section 7 and the speech recognition section 8, and the speech recognition section 8 is connected to a dictionary 13 for speech recognition.
The speech input section 3 is a transducer, typically a microphone, that converts input speech into an electrical analog sound signal. The AD conversion section 4 converts the input analog signal into a digital signal at a prescribed sampling frequency and quantization bit depth. The speech input section 3 and the AD conversion section 4 together constitute the input block for sound.
The speech/non-speech discrimination section 5 has the function of determining whether the input sound is a human voice. A representative structure based on sound source information is described below, but the method is not limited to this. Speech is produced when the vibration of the vocal cords is modified by the vocal tract. The vocal cord vibration is called the sound source information, and the variation of the vocal tract is called the vocal tract characteristic; the information corresponding to the vocal cord vibration in particular carries the features that distinguish speech from non-speech. Below, this information is called the sound source information. A representative method of extracting the sound source information is the linear prediction residual. For the time series x(n) of the digitized input speech, linear prediction analysis predicts the current sample x(n) from a linear combination of the past p samples x(n-p), ..., x(n-1):

x~(n) = α1·x(n-1) + α2·x(n-2) + ... + αp·x(n-p)

The difference e(n) = x(n) - x~(n) is called the linear prediction residual, and it is a feature quantity corresponding to the sound source information. Details are given from page 124 of Non-Patent Document 1.
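As a rough sketch of the residual computation just described (assuming the predictor coefficients α1..αp are already known; in practice they would be estimated, for example by the Levinson-Durbin recursion, which is outside this sketch):

```python
def lpc_residual(x, alpha):
    """Linear prediction residual e(n) = x(n) - x~(n), where
    x~(n) = alpha[0]*x(n-1) + ... + alpha[p-1]*x(n-p)."""
    p = len(alpha)
    residual = []
    for n in range(p, len(x)):
        predicted = sum(alpha[k] * x[n - 1 - k] for k in range(p))
        residual.append(x[n] - predicted)
    return residual

# A signal that follows the predictor exactly yields a zero residual.
signal = [1.0, 0.5, 0.25, 0.125, 0.0625]
print(lpc_residual(signal, [0.5]))  # -> [0.0, 0.0, 0.0, 0.0]
```

For voiced speech the residual retains a periodic pulse train from the vocal cords, which is what the periodicity test below exploits.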
Fig. 2 shows the sound source information of speech, and Fig. 3 that of non-speech; the horizontal axis represents frequency and the vertical axis represents level (speech energy). Comparing the two figures, the sound source information of speech in Fig. 2 shows periodicity in roughly the 0 kHz to 2.5 kHz range, while the sound source information of non-speech in Fig. 3 is aperiodic. A well-known way of judging the presence of this periodicity is the correlation method, which computes the correlation feature c(j) = Σ y(i)·y(i+j) for a sequence {y(1), ..., y(n)}. Fig. 4 shows a histogram of the maximum of this correlation feature; as can be seen from Fig. 4, speech and non-speech separate at around a correlation feature value of 0.3 on the horizontal axis. The speech/non-speech discrimination section 5 is constructed by exploiting this difference.
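The periodicity test can be sketched as follows, using the 0.3 separation value suggested by Fig. 4 as the threshold; the frame length and lag range here are illustrative assumptions, not values taken from the patent:

```python
import math

def normalized_autocorr(y, j):
    """Correlation feature c(j) = sum_i y(i)*y(i+j), normalized by the
    frame energy so a strongly periodic signal approaches 1."""
    num = sum(y[i] * y[i + j] for i in range(len(y) - j))
    den = sum(v * v for v in y)
    return num / den if den else 0.0

def is_speech(frame, lags=range(20, 160), threshold=0.3):
    """Judge a frame as speech when the maximum correlation feature over
    candidate pitch lags exceeds the threshold (periodic = voiced)."""
    return max(normalized_autocorr(frame, j) for j in lags) > threshold

# Periodic (voiced-like) signal: period 40 samples.
voiced = [math.sin(2 * math.pi * n / 40) for n in range(400)]
print(is_speech(voiced))  # -> True
```

An aperiodic noise frame scores a maximum well below 0.3 over the same lags, so the same call returns False for non-speech.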
When the input is judged to be non-speech, the speech/non-speech discrimination section 5 performs no further action. In other words, only when the input is judged to be speech is the signal sent to the speech recording section 7 and the speech keyword detection section 11. The speech recording section 7 performs the function of recording, in the recording data storage 6, the input sound judged to be speech. The storage medium may be any storage area, such as an HDD or memory.
The speech keyword detection section 11 accepts only the recognition phrases registered in the keyword dictionary 10. Fig. 5 shows the details of the speech keyword detection section 11. An acoustic analysis section 14 receives the digital data passed from the speech/non-speech discrimination section 5, performs frequency analysis such as FFT (fast Fourier transform) processing, and outputs, in time sequence for each predetermined interval of the input speech (for example, phoneme units or word units), the feature information required for speech recognition of each interval (for example, a spectrum).
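A minimal sketch of the per-interval frequency analysis. For self-containment this uses a direct DFT rather than an FFT library, and the frame size is an arbitrary assumption:

```python
import cmath
import math

def spectrum(frame):
    """Magnitude spectrum of one analysis frame via a direct DFT.
    (A real system would use an FFT; this naive O(N^2) form is for clarity.)"""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]

# A sine with exactly 4 cycles per frame concentrates its energy in bin k = 4.
frame = [math.sin(2 * math.pi * 4 * n / 32) for n in range(32)]
mags = spectrum(frame)
print(mags.index(max(mags)))  # -> 4
```

Such magnitude spectra, computed frame by frame, are the kind of time-sequenced feature information the acoustic analysis section would hand to the matching stage.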
An acoustic matching section 15 receives the feature information output from the acoustic analysis section 14, matches it against the words registered in the keyword dictionary 10, and computes the similarity between the input speech interval and recognition candidates (for example, phone-string units such as phonemes, syllables, or prosodic phrases, or character-string units such as words), thereby determining whether the input is a word registered in the keyword dictionary 10. The above processing in the acoustic matching section 15 can be realized by applying the keyword dictionary 10 to existing matching techniques such as HMM (hidden Markov models), DP (dynamic programming), or NN (neural networks). Once the acoustic matching section 15 has determined whether the input is a word registered in the keyword dictionary 10, the result is sent to the recognition instruction section 9. The broken-line frame in Fig. 6 shows the details of the operation of the recognition instruction section 9: only when a word in the keyword dictionary 10 is detected does it instruct speech recognition to be performed on the data in the recording data storage 6. This function can be realized by a software branch instruction (if statement, etc.).
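As an illustration of the DP option named above, here is a toy dynamic-programming (DTW-style) comparison over one-dimensional feature sequences. The feature values, templates, and distance threshold are invented for the example; a real matcher would use HMMs or DP over spectral vectors:

```python
def dp_distance(a, b):
    """Dynamic-programming (DTW-style) alignment cost between two
    feature sequences: lower cost = more similar."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def detect_keyword(features, keyword_dict, max_dist=1.0):
    """Return the registered keyword whose template best matches the input
    features, or None when nothing is close enough (no trigger)."""
    best = min(keyword_dict, key=lambda w: dp_distance(features, keyword_dict[w]))
    return best if dp_distance(features, keyword_dict[best]) <= max_dist else None

templates = {"chori kanryo": [1.0, 3.0, 2.0], "nyushitsu": [5.0, 5.0]}
print(detect_keyword([1.0, 3.1, 2.0], templates))  # -> 'chori kanryo'
print(detect_keyword([9.0, 9.0, 9.0], templates))  # -> None
```

The None branch corresponds to the if-statement in the recognition instruction section: with no keyword detected, no recognition instruction is issued.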
As a concrete example, suppose the user, speaker 2, says "sake teishoku chōri kanryō" ("salmon set meal, cooking complete"), and that "chōri kanryō" ("cooking complete") is registered in the keyword dictionary 10 as the keyword. The moment speaker 2 begins with "sa-ke...", the speech/non-speech discrimination section 5 judges the input to be speech; the speech recording section 7 starts recording the speech and stores it in the recording data storage 6. As the utterance continues and reaches "chōri kanryō", it matches the recognition phrase registered in the keyword dictionary 10, so the speech keyword detection section 11 notifies the recognition instruction section 9 with the message "keyword detected".
On receiving the keyword-detection notice, the recognition instruction section 9 stops the recording as shown in the flowchart of Fig. 6, and sends the speech recorded in the recording data storage 6 to the speech recognition section 8 in the order 1 to n of Fig. 6.
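The overall record-then-trigger flow can be sketched at word granularity. This is a deliberately simplified stand-in: real input is audio frames and "recognition" is the acoustic matching described above, and all names here are invented for the illustration:

```python
class RearTriggeredRecognizer:
    """Buffer speech while it arrives; when a registered keyword is detected,
    stop recording and recognize what was buffered before the keyword."""

    def __init__(self, keywords, recognize):
        self.keywords = set(keywords)  # stand-in for keyword dictionary 10
        self.recognize = recognize     # stand-in for speech recognition section 8
        self.buffer = []               # stand-in for recording data storage 6

    def feed(self, unit, is_speech=True):
        if not is_speech:              # speech/non-speech discrimination section 5
            return None
        self.buffer.append(unit)       # speech recording section 7
        if unit in self.keywords:      # speech keyword detection section 11
            recorded, self.buffer = self.buffer, []
            # recognition instruction section 9: recognize what preceded the keyword
            return self.recognize(recorded[:-1])
        return None

rec = RearTriggeredRecognizer({"chori kanryo"}, recognize=" ".join)
print(rec.feed("hum", is_speech=False))  # -> None (non-speech is ignored)
print(rec.feed("sake teishoku"))         # -> None (no keyword yet)
print(rec.feed("chori kanryo"))          # -> 'sake teishoku'
```

Until the rear trigger arrives, nothing is recognized; a stumbled or abandoned utterance simply stays in the buffer and causes no action.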
The speech recognition section 8 starts speech recognition based on the recognition phrases registered in advance in the dictionary 13 for speech recognition. The speech recognition section 8 can be realized by providing an acoustic analysis section and an acoustic matching section, like the speech keyword detection section 11 of Fig. 5.
If the recognition phrase "sake teishoku" ("salmon set meal") is registered in the dictionary 13 for speech recognition, then, as described above, it matches the interval "sake teishoku" within the recorded utterance "sake teishoku chōri kanryō", so the speech recognition section 8 outputs the correct result "sake teishoku".
Fig. 7 schematically shows the operation of this embodiment; the horizontal axis represents the elapsed time of the utterance and the vertical axis the operation sequence of the apparatus. First, starting from a non-speech state, speaker 2 utters "sake teishoku, chōri kanryō". The amplitude of the uttered sound follows the illustrated waveform in the order "non-speech", "sake teishoku", "chōri kanryō". On the apparatus side, the speech/non-speech discrimination section 5 detects no speech during the "non-speech" period and detects speech at the timing of "sake teishoku". When speech is detected, the speech recording section 7 starts recording; since the detected speech is "sake teishoku", the speech keyword detection section 11 does not detect a keyword. Then, at the timing of "chōri kanryō", the speech keyword detection section 11 detects the keyword. The recognition instruction section 9 thereupon issues the recognition instruction, and the speech recognition section 8 outputs the correct result "sake teishoku".
By this method, speech recognition equivalent to the existing keyword-input method can be performed by speaker 2's utterance alone. In other words, the speech keyword detection section 11 does the work of the "return key" that confirms input in keyboard operation. In this case, even if speaker 2 stumbles, saying "sa-, sake te-, -ku", or notices that the phrase was misspoken, for example as "ayu teishoku", speech recognition is simply not performed in this embodiment as long as "chōri kanryō" is not uttered. Malfunctions caused by misrecognition of speech can thus be reduced significantly.
If the speech processing apparatus 1 of this embodiment is used in another situation, the keyword "chōri kanryō" may be unsuitable. In such a case, the keyword changing section 12 can be used to register a phrase suited to the situation as the keyword.
Next, a second embodiment of the present invention is described based on Figs. 8 and 9. Parts identical to those described for Figs. 1 to 7 are given the same reference numerals, and their description is omitted.
Fig. 8 is a block diagram showing the overall structure of a speech processing apparatus 16. The block diagram of Fig. 8 is a functional block diagram, and the functions shown in it are executed by a computer (not shown), realized by a processor performing arithmetic processing according to program code, in the same manner as in the first embodiment: the processor and the storage medium holding the program code may be firmware forming an integrated circuit, or may be a general-purpose computer on whose HDD the program code is installed in advance, copied to RAM, and executed by the built-in processor.
The speech processing apparatus 16 of this embodiment has the same structure as the speech processing apparatus 1 in the following respects. It has a speech input section 3 capable of receiving sound that includes speech uttered by a speaker 2, to which are connected in series: an AD conversion section 4 that converts the speech input from the speech input section 3 into a digital signal; a speech/non-speech discrimination section 5 that determines whether the sound input from the AD conversion section 4 is speech or non-speech; a speech recording section 7 that records, in a recording data storage 6, the sound judged to be speech by the speech/non-speech discrimination section 5; and a recognition instruction section 9 having the function of sending the sound recorded in the recording data storage 6 to a speech recognition section 8 in a later stage. A speech keyword detection section 11 is connected between the speech/non-speech discrimination section 5 and the recognition instruction section 9 and detects whether the sound judged to be speech is a word registered in advance in the keyword dictionary 10, and the speech recognition section 8 is connected to a dictionary 13 for speech recognition. This embodiment is characterized in that a speaker identification section 18 is connected to the speech/non-speech discrimination section 5; the speaker identification section 18 is connected to a dictionary 17 for speaker identification, which records the information for identifying speaker 2 from speaker 2's voice information, and to a keyword selection section 19, which in turn is connected to the keyword dictionary 10.
The role of the speaker identification section 18 newly added in this embodiment is described below. Speaker identification is a technique for identifying the individual speaker 2 from speaker 2's voice information (not specific word information, but features contained in the speaker's voice), and is mainly used for security. Speaker 2 can be identified by registering the speaker's voice information in the dictionary 17 in advance. As in Fig. 5, the speaker identification section 18 is composed of an acoustic analysis section 14 and an acoustic matching section 15. The acoustic analysis section receives the digital data output from the speech/non-speech discrimination section 5, performs frequency analysis such as FFT (fast Fourier transform) processing, and outputs, in time sequence for each predetermined interval of the input speech (for example, phoneme units or word units), the feature information required for speaker identification of each interval (for example, a spectrum).
The acoustic matching section 15 receives the feature information output from the acoustic analysis section 14, matches it against the voice information of speakers registered in the dictionary 17 for speaker identification, and computes the similarity to the speaker candidates for the input speech interval, thereby identifying speaker 2. The above processing in the acoustic matching section 15 can be realized with existing matching techniques such as HMM (hidden Markov models), eigenvalue expansion, or VQ (vector quantization).
The speaker identification section 18 identifies the individual from speaker 2's voice, and the person's name is sent to the keyword selection section 19. Fig. 9 shows an example of the keyword selection section 19. Suppose speaker 2 says "Yamada Taro, nyūshitsu" ("Yamada Taro, entering the room"). If speaker 2 really is Yamada Taro, the keyword selection section 19 takes "nyūshitsu" ("entering the room") from the list as the keyword and registers it in the keyword dictionary 10. Specifically, the moment the speaker utters "ya-ma-da...", the speech/non-speech discrimination section 5 judges it to be speech, the speaker identification section 18 identifies the speaker as Yamada Taro himself, and the keyword selection section 19 then selects the keyword "nyūshitsu" and registers it in the keyword dictionary 10. As in the first embodiment, the speech recording section 7 starts recording the speech from the moment the speech/non-speech discrimination section 5 judges it to be speech. As the utterance continues and reaches "nyūshitsu", it matches the recognition phrase registered in the keyword dictionary 10, so the speech keyword detection section 11 notifies the recognition instruction section 9 with the message "keyword detected" and the recording stops. The subsequent operation is the same as in the first embodiment.
This embodiment not only produces the same effect as the first embodiment but also allows the input-confirming utterance to be changed for each user transparently. That is, for utterances such as "Yamada Taro, entering the room", "Fukuzawa Jiro, leaving the office", or another user's "unlocking", the keyword is switched to "entering the room", "leaving the office", or "unlocking" based on speaker identification of the respective user. Moreover, even if the user Fukuzawa Jiro impersonates a Yamada and utters "yamada jirō, nyūshitsu" ("Yamada Jiro, entering the room"), the speaker identification section 18 does not identify him as Yamada Taro, and since the utterance does not match "leaving the office", the input-confirming phrase for Fukuzawa Jiro, the apparatus does not act. This also strengthens security with respect to the speaker.
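The user-to-keyword selection of Fig. 9, combined with the speaker-identity check, can be sketched as a lookup. The name/keyword pairs mirror the example above, and the upstream identification step is assumed to have already produced the speaker's name:

```python
# Keyword selection table in the spirit of Fig. 9 (illustrative pairs).
USER_KEYWORDS = {
    "Yamada Taro": "nyushitsu",    # "entering the room"
    "Fukuzawa Jiro": "taisha",     # "leaving the office"
}

def select_keyword(identified_speaker):
    """Keyword selection section 19: pick the confirming keyword for this user."""
    return USER_KEYWORDS.get(identified_speaker)

def authorize(identified_speaker, spoken_keyword):
    """Act only when the spoken keyword matches the identified speaker's own
    confirming keyword; an impersonator is still identified as himself."""
    return spoken_keyword == select_keyword(identified_speaker)

print(authorize("Yamada Taro", "nyushitsu"))    # -> True
# Fukuzawa Jiro claiming to be a Yamada is still identified as himself,
# and "nyushitsu" is not his confirming keyword:
print(authorize("Fukuzawa Jiro", "nyushitsu"))  # -> False
```

Because the lookup is keyed by the identified speaker rather than the claimed name, a mismatched voice and keyword produce no action, which is the security property the embodiment describes.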
Next, a third embodiment of the present invention is described based on Fig. 10. Parts identical to those described for Figs. 1 to 7 are given the same reference numerals, and their description is omitted.
Fig. 10 is a block diagram showing the overall structure of a speech processing apparatus 20. The block diagram of Fig. 10 is a functional block diagram, and the functions shown in it are executed by a computer (not shown), realized by a processor performing arithmetic processing according to program code, in the same manner as in the first embodiment: the processor and the storage medium holding the program code may be firmware forming an integrated circuit, or may be a general-purpose computer on whose HDD the program code is installed in advance, copied to RAM, and executed by the built-in processor.
At first, be same structure from the phonetic entry portion 3 of voice processing apparatus 20 to the speech/non-speech judegment part 5 and first embodiment, but do not comprise recording data recording section 6.Under speech/non-speech judegment part 5 is differentiated for the situation of non-voice, do not carry out later action.In other words, only under differentiating for the situation of voice, input signal is sent to speech recognition portion 8, voiced keyword detection portion 11.Voiced keyword detection portion 11 only accepts the identification statement of registration in the keyword dictionary 10.The implementation method of voiced keyword detection portion 11 that possesses this function is as first embodiment.
As a concrete example, suppose the user, that is, the speaker 2, says 'さけていしょく ちょうりかんりょう' ('salmon set meal, cooking complete'). Suppose also that 'ちょうりかんりょう' ('cooking complete') is registered in the keyword dictionary 10. At the moment the speaker 2 begins to say 'さけ...', the speech/non-speech discrimination section 5 judges the input to be 'speech' and sends it to the speech recognition section 8. At that moment, the speech recognition section 8 begins speech recognition based on the recognition statements registered in advance in the speech recognition dictionary 13. In the first embodiment, storage by the recording data storage section 6 was used; in the present embodiment, because speech recognition begins earlier, the recognition result can be returned sooner. When 'さけていしょく' ('salmon set meal') is registered as a recognition statement in the speech recognition dictionary 13, the 'さけていしょく' part of 'さけていしょく ちょうりかんりょう' is recognized by the speech recognition section 8 before the keyword interval is detected by the voiced keyword detection section 11, and the correct result 'salmon set meal' is obtained. As the utterance continues and reaches 'ちょうりかんりょう', this matches the statement registered in the keyword dictionary 10, so the voiced keyword detection section 11 notifies the recognition result determination section 21 with the message 'keyword detected'. On receiving this notification, the recognition result determination section 21 outputs the correct result 'salmon set meal' from the speech recognition section 8 as the output of the voice processing apparatus 20.
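The end-of-utterance trigger described above can be sketched as follows. This is a simplified model under assumed names (`process_utterance`, the romanized keyword string): real recognition is incremental over audio, not over pre-segmented words, and the romanization of the example utterance is an illustrative stand-in.

```python
# Minimal sketch of the third embodiment's flow: recognition runs from the
# start of the utterance, and the recognition result determination section 21
# releases the accumulated result only once the trailing keyword registered
# in keyword dictionary 10 is detected. Names are assumptions.

KEYWORD = "chouri kanryou"   # "cooking complete" (ちょうりかんりょう)

def process_utterance(words, keyword=KEYWORD):
    """Recognize words incrementally; return the recognized command once the
    utterance ends with the registered keyword, otherwise return None."""
    recognized = []                  # running output of speech recognition section 8
    for w in words:
        recognized.append(w)         # recognition proceeds as the utterance arrives
        tail = " ".join(recognized)
        if tail.endswith(keyword):   # voiced keyword detection section 11 fires here
            # Recognition result determination section 21: everything recognized
            # before the keyword becomes the output of the apparatus.
            return tail[: -len(keyword)].strip()
    return None                      # keyword never uttered: nothing is output

result = process_utterance(["sake", "teishoku", "chouri", "kanryou"])
```

Note that an utterance lacking the trailing keyword produces no output at all, which is how the apparatus avoids acting on unrelated speech without any forward trigger.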
In the present embodiment, not only can an effect equal to that of the first embodiment be obtained, but since no recording of data is required, the embodiment is also superior in recognition speed. For example, suppose the user utters a long phrase such as 'ソース付き…煮込みハンバーグ定食、調理完了' ('simmered hamburger steak set meal with … sauce, cooking complete'). In the first embodiment, speech recognition of the long '… hamburger steak set meal' portion would delay the output of the result, whereas in the present embodiment the result can be output immediately.
When the voice processing apparatus 20 of the present embodiment is used in other situations, the keyword 'ちょうりかんりょう' ('cooking complete') may be unsuitable. In such cases, the keyword changing section 12 can be used to register a statement suited to the situation as the keyword.
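A sketch of this keyword replacement is shown below. The class and function names (`KeywordDictionary`, `change_keyword`) and the replacement keyword are illustrative assumptions; the patent only specifies that the keyword changing section 12 rewrites the registered statement.

```python
# Hypothetical sketch of keyword dictionary 10 and keyword changing
# section 12: the registered trigger statement can be swapped to suit
# the deployment. All names here are assumptions for illustration.

class KeywordDictionary:
    """Keyword dictionary 10: holds the registered trigger statement(s)."""
    def __init__(self, keyword):
        self.keywords = [keyword]

    def contains(self, phrase):
        """Voiced keyword detection: does the phrase end in a registered keyword?"""
        return any(phrase.endswith(k) for k in self.keywords)

def change_keyword(dictionary, old, new):
    """Keyword changing section 12: replace one registered keyword."""
    dictionary.keywords = [new if k == old else k for k in dictionary.keywords]

kw = KeywordDictionary("chouri kanryou")                   # "cooking complete"
change_keyword(kw, "chouri kanryou", "nyuuryoku kanryou")  # e.g. "input complete"
```

After the change, only utterances ending in the new keyword trigger output, so the same apparatus can serve settings where "cooking complete" would be meaningless.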
Next, a fourth embodiment of the present invention is described with reference to Figure 11.
Figure 11 is a block diagram showing the overall structure of the voice processing apparatus 22. The block diagram of Figure 11 is a functional block diagram, and the various functions shown in it are executed by a computer (not shown). In other words, the functions of Figure 11 are realized by arithmetic processing in which a processor executes program code that causes the computer to perform those functions. The processor and the recording medium storing the program code may be a firmware structure forming an integrated circuit, or may be constituted by, for example, a general-purpose computer. When they are constituted by a general-purpose computer, as one example, the program code is installed in advance on the HDD or the like of the general-purpose computer. The installed program code is then copied, for example, into RAM, and the processor built into the general-purpose computer executes the various functions shown in Figure 11 according to the copied program code.
The voice processing apparatus 22 of the present embodiment, as in the second embodiment, adds the speaker identification dictionary 17, the speaker identification section 18, and the keyword selection section 19 to the third embodiment. A detailed description is therefore omitted; the features of the third embodiment are added to those of the second embodiment, and high-speed processing can be realized.
In the present invention, a speaker identification section and a speaker identification dictionary perform speaker identification on sound detected as speech by the speech/non-speech discrimination section. The recognition instruction section instructs the speech recognition section to begin speech recognition of the sound recorded by the recording data storage section at the moment the voiced keyword detection section detects that a speaker registered in the speaker identification dictionary has uttered a word registered in the keyword dictionary. The apparatus can therefore identify the user, and its security function is strengthened.
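The speaker-gated trigger can be sketched as a conjunction of two checks. This is a toy model under assumed names (`identify_speaker`, `should_instruct_recognition`, exact-match "voiceprints"); real speaker identification compares acoustic features against enrolled models, which is abstracted away here.

```python
# Hedged sketch of the fourth embodiment's gating: the recognition
# instruction is issued only when the trailing word is a registered
# keyword AND the speaker matches speaker identification dictionary 17.
# All names and the exact-match "voiceprint" are illustrative assumptions.

SPEAKER_DICT = {"alice": "voiceprint-a"}   # speaker identification dictionary 17
KEYWORDS = {"chouri kanryou"}              # keyword dictionary 10

def identify_speaker(voiceprint, speaker_dict=SPEAKER_DICT):
    """Speaker identification section 18: match a voiceprint to a
    registered speaker, or return None for an unknown speaker."""
    for name, registered_print in speaker_dict.items():
        if registered_print == voiceprint:
            return name
    return None

def should_instruct_recognition(voiceprint, trailing_word):
    """Recognition instruction section: fire only when a registered
    speaker utters a registered keyword."""
    return identify_speaker(voiceprint) is not None and trailing_word in KEYWORDS

ok = should_instruct_recognition("voiceprint-a", "chouri kanryou")        # registered speaker
rejected = should_instruct_recognition("voiceprint-x", "chouri kanryou")  # unknown speaker
```

Requiring both conditions is what turns the trigger into a security measure: the keyword alone, spoken by an unregistered speaker, does nothing.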
In addition, keywords corresponding to the speaker identified by the speaker identification section and the speaker identification dictionary are registered in the keyword dictionary, so the utterance with which each user confirms an input operation can be changed individually.
Furthermore, the registered contents of the keyword dictionary can be changed easily.
Moreover, the keyword dictionary can hold a plurality of keywords, so a variety of uses can be handled.
Further, a voice processing apparatus comprises: a speech input section capable of inputting sound including the user's voice; an AD conversion section that converts the sound input from the speech input section into a digital signal; a speech/non-speech discrimination section that discriminates whether the sound input from the AD conversion section is speech or non-speech; a voice recording section that causes the recording data storage section to record sound judged to be speech by the speech/non-speech discrimination section; a keyword dictionary capable of holding a keyword in advance; a voiced keyword detection section that detects whether sound judged to be speech by the speech/non-speech discrimination section is a word registered in advance in the keyword dictionary; a speech recognition dictionary; and a speech recognition section that, based on the speech recognition dictionary, performs speech recognition on sound judged to be speech by the speech/non-speech discrimination section. By including a recognition result determination section that has the function of accepting the result of the speech recognition section at the moment the voiced keyword detection section detects a word registered in the keyword dictionary, high-speed processing can be realized.
In addition, by registering in the keyword dictionary keywords corresponding to the speaker identified by the speaker identification section and the speaker identification dictionary, a variety of usage modes can be handled.

Claims (9)

1. A voice processing apparatus comprising:
a speech input section (3) capable of inputting sound including a user's voice;
an AD conversion section (4) that converts the sound input from said speech input section into a digital signal;
a speech/non-speech discrimination section (5) that discriminates whether the sound input from said AD conversion section is speech or non-speech;
a voice recording section (7) that causes a recording data storage section (6) to record the sound judged to be speech by said speech/non-speech discrimination section;
a keyword dictionary (10) capable of holding a keyword in advance;
a voiced keyword detection section (11) that detects whether the sound judged to be speech by said speech/non-speech discrimination section is a word registered in advance in said keyword dictionary;
a speech recognition dictionary (13) used for performing speech recognition;
a speech recognition section (8) that, in accordance with an instruction to perform speech recognition, performs speech recognition on the sound recorded by said recording data storage section based on said speech recognition dictionary; and
a recognition instruction section (9) that issues the instruction to perform speech recognition to said speech recognition section (8),
characterized in that said recognition instruction section (9) issues an instruction to said speech recognition section (8) so that it begins speech recognition at the moment said voiced keyword detection section (11) detects a word registered in advance in said keyword dictionary (10).
2. The voice processing apparatus according to claim 1, characterized by further comprising:
a speaker identification dictionary (17) in which information for identifying a speaker's voice is recorded, and a speaker identification section (18) that performs speaker identification on the sound detected as speech by said speech/non-speech discrimination section to determine whether it is the voice of a speaker registered in said speaker identification dictionary,
wherein said recognition instruction section issues an instruction to said speech recognition section so that it begins speech recognition at the moment the sound detected as speech by said speech/non-speech discrimination section is identified by said speaker identification section as the voice of a speaker registered in said speaker identification dictionary, and said voiced keyword detection section detects that this sound is a word registered in said keyword dictionary.
3. The voice processing apparatus according to claim 2, characterized in that
a keyword corresponding to the speaker identified by said speaker identification section and said speaker identification dictionary can be registered in said keyword dictionary.
4. The voice processing apparatus according to claim 2, characterized in that
the registered contents of the keyword dictionary can be changed.
5. The voice processing apparatus according to any one of claims 1, 2, 3 and 4, characterized in that
said keyword dictionary can hold a plurality of keywords.
6. A voice processing apparatus comprising:
a speech input section (3) capable of inputting sound including a user's voice;
an AD conversion section (4) that converts the sound input from said speech input section into a digital signal;
a speech/non-speech discrimination section (5) that discriminates whether the sound input from said AD conversion section is speech or non-speech;
a keyword dictionary (10) capable of holding a keyword in advance;
a voiced keyword detection section (11) that detects whether the sound judged to be speech by said speech/non-speech discrimination section is a word registered in advance in said keyword dictionary;
a speech recognition dictionary (13) used for performing speech recognition; and
a speech recognition section (8) that performs speech recognition on the sound judged to be speech by said speech/non-speech discrimination section based on said speech recognition dictionary,
characterized by comprising a recognition result determination section (21) having the function of accepting the result of said speech recognition section (8) at the moment said voiced keyword detection section (11) detects a word registered in said keyword dictionary (10).
7. The voice processing apparatus according to claim 6, characterized by further comprising:
a speaker identification dictionary (17) in which information for identifying a speaker's voice is recorded, and a speaker identification section (18) that performs speaker identification on the sound detected as speech by said speech/non-speech discrimination section to determine whether it is the voice of a speaker registered in said speaker identification dictionary,
wherein at the moment the sound detected as speech is identified by said speaker identification section as the voice of a speaker registered in said speaker identification dictionary, and said voiced keyword detection section detects that this sound is a word registered in said keyword dictionary, said recognition result determination section (21) begins to accept the result of said speech recognition section.
8. The voice processing apparatus according to claim 7, characterized in that
a keyword corresponding to the speaker identified by said speaker identification section and said speaker identification dictionary can be registered in said keyword dictionary.
9. The voice processing apparatus according to claim 7, characterized in that
the registered contents of said keyword dictionary can be changed.
CN200610006603A 2005-02-07 2006-01-26 Speech processing device Expired - Fee Related CN100578612C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP31032/05 2005-02-07
JP2005031032A JP4237713B2 (en) 2005-02-07 2005-02-07 Audio processing device

Publications (2)

Publication Number Publication Date
CN1819016A CN1819016A (en) 2006-08-16
CN100578612C true CN100578612C (en) 2010-01-06

Family

ID=36918998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200610006603A Expired - Fee Related CN100578612C (en) 2005-02-07 2006-01-26 Speech processing device

Country Status (2)

Country Link
JP (1) JP4237713B2 (en)
CN (1) CN100578612C (en)



Also Published As

Publication number Publication date
CN1819016A (en) 2006-08-16
JP2006215499A (en) 2006-08-17
JP4237713B2 (en) 2009-03-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100106

Termination date: 20120126