WO2004008295A2 - System and method for voice characteristic medical analysis - Google Patents

System and method for voice characteristic medical analysis

Info

Publication number
WO2004008295A2
Authority
WO
WIPO (PCT)
Prior art keywords
voice
data
originator
captured
template
Prior art date
Application number
PCT/US2003/022636
Other languages
French (fr)
Other versions
WO2004008295A3 (en
Inventor
Steven J. Keough
Original Assignee
Keough Steven J
Priority date
Filing date
Publication date
Application filed by Keough Steven J
Priority to AU2003259177A (AU2003259177A1)
Publication of WO2004008295A2
Publication of WO2004008295A3

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Definitions

  • a patient's voice characteristic is frequently an indicator of an unwanted medical condition to a primary care physician, regardless of any prior knowledge of the patient.
  • a shortage of physicians and a combined inability of patients to be able to get to physicians and medical care promotes greater interest in telemedicine techniques. Improvements to telemedicine are long overdue, and provide opportunities not previously recognized to general populations.
  • a tape or digital recording device is used to record someone's voice and thereby retain it for future listening and replay as it was recorded originally, or portions of the original recording may be played as desired.
  • These devices and methods of voice recording also include a range of artificial voices, created by computers, which may be used for many different functions, including for example telephone automatic assistance and verification, very basic speech between toys or equipment and users, synthesized voices for the film and entertainment industry, and the like.
  • these artificial voices are preprogrammed to a narrow set of responses according to a specific input.
  • these artificial voice sounds are nevertheless simple compared to the robust voice capabilities of the present invention. Indeed, in certain embodiments of the invention there are elements that are either quite different from such systems or which take the previous technology far beyond that ever contemplated or even suggested by such prior discoveries or innovations.
  • Figure 1 is a flow diagram of one embodiment of the system operation of the invention.
  • Figure 2 is a schematic diagram of one embodiment of a voice capture subsystem.
  • Figure 3 is a schematic diagram of one embodiment of a voice analysis subsystem.
  • Figure 4 is a schematic diagram of one embodiment of a voice characterization subsystem.
  • Figure 5 is a schematic diagram of one embodiment of a voice template subsystem.
  • Figure 6 is a schematic diagram of one embodiment of a voice template signal bundler subsystem.
  • Figure 7 is a schematic diagram of one embodiment of the system of the invention used with remote information download and upload options.
  • Figure 8 is an exemplary plan view of an embodiment of the invention in a mobile, compact component.
  • Figure 9 is an exemplary plan view of an embodiment of the invention used with a visual media source.
  • Systems and methods are provided for recording or otherwise capturing an enabling amount of a specific person's voice to form a voice pattern template. That template is then useful as a tool for building new speech sounding like that precise voice, using the template, with the new speech probably never having been actually said or never having been said in the precise context or sentences by the specific human but actually sounding identical in all aspects to that specific human's actual speech.
  • The enabling portion is designed to capture the elements of the actual voice necessary to reconstruct the actual voice; however, a confidence rating is available to predict the limits of the reconstructed or re-created speech in the event there is not enough enabling speech to start with.
  • a new voice or voices may be used with a database of subject matter, historical data, and adaptive or artificial intelligence modules to enable new discussions with the user just as if the templated voice's originator were present.
  • This system and method may be combined with other media, such as a software file, a chip embedded tool, or other forms. Interactive use of this system and method may occur in various manners.
  • A unit module itself may comprise the entirety of an embodiment of this invention, e.g., a chip or electronic board which is configured to capture and enable use of a voice in the manner disclosed herein.
  • The template is useful, for example, as a tool for capturing and creating new dialogs with people who are no longer immediately available, who may be deceased, or even those who consent to having their voices templated and used in this manner.
  • Another example is the application to media, such as film or photos or other depictions of the actual voice(s) originator to create on-demand virtual dialog with the originator.
  • Various other uses and applications are contemplated within the scope of the invention.

Detailed Description of the Invention

  • Voice is a sound of extraordinary power among mammals.
  • the sound of a mother's voice is recognized by and soothes a child even before birth, and the sound of a grandfather's voice calms the fears of even a grown person.
  • Other voices may inspire complete strangers or may elicit memories from loved ones of long past events and moments.
  • this particularity of one's voice derives from the genetic contribution of the parents resulting in the shape, size, position and development of the various human body components that influence the way one sounds when speaking or otherwise communicating with voice or through the mouth and nasal passages.
  • One method of synthesizing voices and sounds is referred to as concatenative, and refers to the recordings of wave form data samples or real human speech. The method then breaks down the pre-recorded original human speech into segments and generates speech utterances by linking these human speech segments to build syllables, words, or phrases. The size of these segments varies.
  • Another method of human speech synthesis is known as parametric. In this method, mathematical models are used to recreate a desired speech sound. For each desired sound, a mathematical model or function is used to generate that sound. As such, the parametric method is generally without human sound as an element.
  • Of the parametric speech synthesizers, there are generally a few well-known types.
  • One is the articulatory synthesizer, which mathematically models the physical aspects of the human lungs, larynx, and vocal and nasal tracts.
  • Another is the formant synthesizer, which mathematically models the acoustic aspects of the human vocal tract.
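  • The contrast between the concatenative and parametric approaches described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the 8000 Hz sample rate, the linear crossfade, and the use of plain summed sinusoids as a crude stand-in for a formant model (practical formant synthesizers typically pass a source signal through resonant filters). It is not the method of the invention.

```python
import math

SAMPLE_RATE = 8000  # samples per second; assumed value for illustration


def concatenate_units(units, names, crossfade=4):
    """Concatenative approach: join pre-recorded segments in order.

    `units` maps a segment name to a list of samples. A short linear
    crossfade (hypothetical choice) smooths each joint.
    """
    out = []
    for name in names:
        seg = units[name]
        n = min(crossfade, len(out), len(seg))
        # Blend the tail of the output with the head of the next segment.
        for i in range(n):
            w = (i + 1) / (n + 1)
            idx = len(out) - n + i
            out[idx] = (1 - w) * out[idx] + w * seg[i]
        out.extend(seg[n:])
    return out


def parametric_vowel(formants_hz, duration_s=0.01):
    """Parametric approach: no recorded speech is used.

    Samples are computed from a mathematical function, here an average
    of sinusoids placed at the given formant frequencies.
    """
    n = round(SAMPLE_RATE * duration_s)
    return [
        sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in formants_hz)
        / len(formants_hz)
        for t in range(n)
    ]


# Usage: two recorded "segments" joined, and one generated tone.
recorded = {"a": [1.0] * 10, "b": [0.0] * 10}
joined = concatenate_units(recorded, ["a", "b"])
tone = parametric_vowel([700.0, 1200.0])
```

  • The first function needs stored human speech as input, while the second needs only parameters, which is the distinction the text draws between the two families.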
  • This invention describes various techniques for templating specific human voices; disclosure regarding development of standardized classification, distribution, and reconstruction modalities for novel human voice templates built with unique voice profile subcodes; and various alternative paths for analyzing and assessing human voice characteristics, and creating very specific patient-related and overall individual person-related voice codes. All of these embodiments are incorporated into the current application.
  • the invention comprises developing remote phone-based analyzing modalities of patient voices for telemedicine screening applications.
  • the product and service inventions will provide remote identification of certain disease onset via detectable changes in patient voice.
  • Patient confidentiality may also be enhanced and provided for using this feature as a biometric validation during a phone call.
  • A patient's voice characteristic is frequently an indicator of an unwanted medical condition to a primary care physician, regardless of any prior knowledge of the patient. Examples include detection of tremor, Parkinsonian voice (paucity of expression), the effects of psychoactive medications (monotone), peritonsillar abscess (known as "hot potato in the throat"), epiglottitis, and neurological changes causing dysarthria (changed speech due to brain changes).
  • a product or service utilizing this technology will provide instantaneous analysis and recurrent updates of a patient's voice, compare the updates to a critical code which defines a patient baseline voice (when the patient is known), and generate a differential diagnosis based on likely conditions triggering certain anomalies to that patient's voice, whether the patient is a first time caller or not.
  • a digital voice code with ultra-high patient specificity, and novel medical condition recognition capabilities (preferably) by digital signal analyses are provided and disclosed herein.
  • the digital voice code is derived from any of the plurality of techniques and algorithms, as disclosed herein below.
  • the medical condition recognition capabilities and algorithms are those being developed by Applicant in relation to the above-listed conditions and others.
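  • The baseline comparison described above can be sketched as follows. This is a hedged illustration only: the feature names (`pitch_variability`, `tremor_index`, `speech_rate`), the 20% tolerance, and the rule table are hypothetical and are not taken from the application, which does not disclose the actual recognition algorithms. Nothing here is clinically validated.

```python
def flag_deviations(baseline, current, tolerance=0.2):
    """Return features whose relative change from baseline exceeds tolerance.

    `baseline` plays the role of the stored patient voice code and
    `current` the newly analyzed call. Both map feature name -> value.
    """
    flagged = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # cannot compute a relative change
        change = (current[name] - base) / abs(base)
        if abs(change) > tolerance:
            flagged[name] = round(change, 3)
    return flagged


# Hypothetical mapping from (feature, direction) to a screening note.
SCREENING_RULES = {
    ("pitch_variability", "decrease"): "reduced expressiveness or monotone",
    ("tremor_index", "increase"): "possible tremor",
}


def screening_notes(flagged):
    """Translate flagged deviations into notes for clinician review."""
    notes = []
    for feature, change in flagged.items():
        direction = "increase" if change > 0 else "decrease"
        note = SCREENING_RULES.get((feature, direction))
        if note:
            notes.append(note)
    return notes


baseline = {"pitch_variability": 20.0, "tremor_index": 0.1, "speech_rate": 4.0}
todays_call = {"pitch_variability": 15.0, "tremor_index": 0.1, "speech_rate": 4.2}
flagged = flag_deviations(baseline, todays_call)
notes = screening_notes(flagged)
```

  • For a first-time caller with no baseline, a population reference would have to stand in for `baseline`; that substitution is likewise an assumption of this sketch.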
  • this technology also provides additional business method opportunities.
  • Utilization of this technology in a blind, voluntary, automated screening system for an organization-communication system, with private queuing to voice originators, would allow for non-traditional intervention and courtesy alerts to non-patient participants in the plan as well as to traditional patients, in order to screen for major disease onsets and have a wider organizational impact on healthcare savings.
  • Use by large non-medical organizations as a daily screening tool with their phone systems is a very likely utilization of this technology, in addition to telemedicine call-in applications utilized within healthcare organizations.
  • this technology has not been possible because the enabling voice platform technology and the medical condition recognition capabilities and algorithms were not created.
  • This telemedicine approach is particularly unique in its use of the digital code for each person's voice, which allows extensive data comparison and record checking capabilities for each patient as well.
  • Research indicates a very large population which is readily reachable and is also affected by the main diseases/conditions currently being investigated for use by this technology.
  • Other applications exist as well. For example, patients who are status post-laryngectomy, or who have no vocal cords, may use this technology as an alternative to the currently used esophageal speech or vibrator, and then be classified accordingly.
  • One goal of this technology is to create a uniform classification system which captures, defines, and quantifies all parameters of a particular human voice.
  • a specific digital code will be generated to uniquely identify and define a particular human's voice, in all regards.
  • This unique voice identifying code will be useful in later generation of a machine spoken voice using text or other data when combined with the code for that selected human's voice.
  • an important element of this technology is the development of standardized classification, distribution, and reconstitution modalities for novel human voice templates built with unique voice profile sub-codes, as disclosed herein.
  • The classification system will capture and normalize the results of different voice analysis techniques. These techniques are drawn from voice recognition, voice synthesis/production, text-to-speech know-how, and other fields.
  • Various analytical tools will be categorized from both the traditional analytical realm (e.g., vocal fold analysis, glottal pulse response, histogram/spectrogram analysis, vocal tract dimensions, fractal analysis, audio-visual analysis, short time Fourier transform, etc.) as well as more novel areas (e.g., genetic characteristics, age, accent, education, health, etc.).
  • This voice identifying and classifying system will yield a unique, data-rich descriptive code for each human voice, and such code will function as a cipher key to generate a synthetic voice having identical sound and other characteristics as the original human voice.
  • Each code will then be formatted in various ways to facilitate its use as a building block in licensing or other commercial applications, including electronic transfer and storage. Such a system will enable dramatic changes to the human interface field as well as other technical areas.
  • One objective of this technical effort is to identify the best methodology for classifying all elements of a specific voice and then arranging a logical and unique classification system for future standard categorization protocols, which require development. It is well known that numerous technical approaches to voice recognition, authentication, generation, biometrics, and digital communications have been developed and successfully implemented. Our technology will use many of these underlying technical approaches as building blocks to develop voice classification, distribution and re-creation of actual voice profiles. Analysis of an enabling portion of an actual speaker according to a very extensive parsing methodology produces a digital code which uniquely defines the various elements of the speaker's voice. This digital code is referred to as a "voice DNA" and becomes the basis to the particular speaker's voice descriptive code. The diverse yet excellent work done in the voice recognition field is the source of best available technology in this regard. However, that technology has been developed with considerable constraints, and has not focused on the underlying voice DNA of speakers. Rather, it has been more content-dependent on the whole.
  • a third technical objective and embodiment is to identify the uses of specific genetic codes and/or gene sequence codes to assist in the unique classification of a particular speaker's voice and to enhance the fidelity of any voice re-created using a voice DNA product. This has always been considered (by us) to be an important tool for classifying the enabling portion of the original voice. We believe this feature alone merits research support, particularly in light of recent discoveries involving the FOXP2 gene, but in this context is considered to be one part of a larger effort to define the best techniques to achieve unique classification of each human's voice. An additional opportunity exists because a derivative of understanding and classifying the mechanisms of individual speech is improved understanding of the specific brain circuitry that underlies such faculty in each person.
  • The cognitive information processing role, per subject, is a basis of the interpretation and formation of speech. This technical effort will endeavor to utilize this basic element as a further enhancement of the specificity of each person's voice DNA, mindful of identifying future opportunities for research in communication among human-human and human-machine neural interfaces.
  • a fourth technical objective is the distribution, validation and control over use of voice DNA. Identifying the best format for these voice classification signals, and the various interface requirements of the reconstituting data, are essential technical milestones. The reality of fusing numerous underlying voice technologies requires technical reconciliation and normalization, at certain levels. The resulting integration will create a new technology field, with considerable revenue and scientific spin-offs. It is believed that a driver of revenue will be the ability to distribute voice DNA or templates via digital networks and the like for use with different products and services. This will require proper validation, transfer and proliferation/derivative rights control protocols and regimes.
  • Voice template technology will create revenue from these and other products: chip logic; PC cards; HPCs/PPCs/PDAs; cellular phone systems; travel and ID cards and related biometric systems; internet transfers; voice use license fees; and numerous interface applications within many industries.
  • the potential for unforeseen spin-off applications for this technology is also quite high.
  • Systems and methods are proposed to capture and enhance an enabling portion of an individual's voice and then to create a voice template or profile signal which may be combined at a later time with noise/text of another origin to utilize the original voice. Such voice may then be used to speak any form/content/text provided via digital or other input thereto, and to "speak" content which was not spoken in an original form by the original voice.
  • Products and processes for online use are planned, as are certain business methods and industry applications. Specific objectives guide the technology development, as follows:
  • One goal is to identify the best methodology for classifying all elements of a specific voice into a logical and unique code utilizing standard categorization protocols, which require development.
  • the particularity of voice derives from the genetic contribution of the parents resulting in the shape, size, position and development of the various human body morphology that influences the way one sounds when speaking or otherwise communicating with voice or through the mouth and nasal passages. Other influences exist as well. It is understandable therefore that there is a range of differences among people, often even within the same family. Indeed, even the same person may sound slightly different according to temporal influences such as the health, stress level, emotional state, fatigue, the ambient temperature around the person, or other factors. Regardless, the ability of humans to associate through their senses is remarkable, particularly as such sensing relates to identification and association with the human voice.
  • One method of synthesizing voices and sounds is referred to as concatenative, and refers to the recordings of wave form data samples or real human speech. The method then breaks down the pre-recorded original human speech into segments and generates speech utterances by linking these human speech segments to build syllables, words, or phrases. The size of these segments varies.
  • Another method of human speech synthesis is known as parametric. In this method, mathematical models are used to recreate a desired speech sound. For each desired sound, a mathematical model or function is used to generate that sound. As such, the parametric method is generally without human sound as an element.
  • Of the parametric speech synthesizers, there are generally a few well-known types.
  • One is the articulatory synthesizer, which mathematically models the physical aspects of the human lungs, larynx, and vocal and nasal tracts.
  • Another is the formant synthesizer, which mathematically models the acoustic aspects of the human vocal tract.
  • Other systems include means for recognizing a specific voice, once the using system has been trained in that voice. Examples of this include the various speech recognition systems useful in the field of capturing spoken language and then translating those sounds into text, such as with systems for dictation and the like. Other speech related systems concern the field of biometrics, and use of certain spoken words as security codes or ciphers. What has long been overlooked, however, is a system and method for preserving the key or cipher to voices of other beings in a dynamic and adaptive manner for future use and benefit by the originator or by others.
  • Table I is one of several uncorrelated tables which identify technically relevant characteristics (C) of human voice, and tools/technology (T) having known analytical value.
  • An initial task is to validate these Tables and then to rank the best tools, T 1 through T x, for each C.
  • The stability of each ranking will need testing against various C categories, for example, to determine whether T 1 is a reliable and stable choice for all voice types or for all voices under stress.
  • a correlation ranking among selected T categories per C category will also be done. This will result in the best techniques for deriving each characteristic, and then lead to a classification system and reconciliation algorithms suitable for fusing the output of the best techniques into an integrated classification code.
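  • The ranking and stability testing described above can be sketched as follows. The scores, the labels `C1`, `T1`, `T2`, and the spread threshold are hypothetical placeholders, and ranking by mean score is only one of many possible criteria; the application does not specify one.

```python
from statistics import mean


def rank_tools(scores):
    """Rank tools T for each characteristic C, best first.

    `scores` maps characteristic -> tool -> list of scores, one score
    per voice category (for example normal voice, voice under stress).
    """
    ranking = {}
    for characteristic, per_tool in scores.items():
        ranking[characteristic] = sorted(
            per_tool, key=lambda tool: mean(per_tool[tool]), reverse=True
        )
    return ranking


def is_stable(tool_scores, max_spread=0.15):
    """A tool counts as stable if its scores vary little across categories."""
    return max(tool_scores) - min(tool_scores) <= max_spread


# Hypothetical evaluation results: [normal voice, voice under stress].
scores = {"C1": {"T1": [0.9, 0.5], "T2": [0.8, 0.75]}}
ranking = rank_tools(scores)
stable_tools = [t for t in ranking["C1"] if is_stable(scores["C1"][t])]
```

  • In this made-up data T1 scores highest on normal voices but degrades under stress, so a stability check of this kind would favor T2 for characteristic C1.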
  • a master topographic selection model is currently preferred due to its intuitive adaptability to voice dynamics, somewhat similar to that of Genetic Algorithms. Use of GAs and other techniques will likely aid in more rapid assessment and selection of best models.
  • Figure 1 is a schematic diagram of one embodiment of a system 10 for capturing an enabling portion of a specific voice sufficient for using that portion as a template in further use of the voice characteristics.
  • System 10 may be part of a handheld device, such as an electronic handheld device, or part of a computing device the size of a laptop, a notebook, or a desktop. System 10 may also be merely a circuit board within another device, or an electronics component or element designed for temporary or permanent placement in or use with another electronic element, circuit, or system. System 10 may, in whole or in part, comprise computer readable code or merely a logic or functional circuit in a neural system, or it may be formed as some other device or product such as a distributed network-style system.
  • system 10 comprises input or capture means 15 for capturing or receiving a portion of a voice for processing and construction of a voice algorithm or template means 19, which may be formed as a stream of data, a data package, a telecommunications signal, software code means for defining and re-generating a specific voice, or a plurality of voice characteristics organized for application to or template on another organization of sound or noise suitable to arrange the sound or noise as an apparent voice of an originator's voice.
  • Other means of formatting computer readable program code means, or other means, for causing use of certain identified voice characteristics data to artificially generate a voice is also contemplated within this invention.
  • The logic or rules of the algorithm or template means 19 are preferably formed with a minimum of voice input; however, various amounts of voice and other data may be desired to form an acceptable data set for a particular voice.
  • An enabling portion of a human voice may be captured, for example, with a small amount of analog or digital recording, or real-time live input, of the person's voice that is to be templated.
  • a prescribed grouping of words may be formed to optimize data capture of the most relevant voice characteristics of the person to enable accurate replication of the voice.
  • Analysis means are contemplated for most efficiently determining what form of enabling portion is best for a particular person. Whether by a single data input or a series of inputs, the voice data is captured and stored in at least one portion of storage means 22.
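  • One way to form a prescribed grouping of words that covers the most relevant voice characteristics is a greedy coverage selection, sketched below. This is an assumption for illustration, not a technique named in the application: the phrase names, the sound-unit labels, and the idea of reporting the covered fraction as a rough indicator of template completeness are all hypothetical.

```python
def choose_prompts(candidates, target_units):
    """Greedily pick phrases until the target sound units are covered.

    `candidates` maps a phrase to the set of sound units it contains.
    Returns the chosen phrases and the fraction of targets covered.
    """
    remaining = set(target_units)
    chosen = []
    while remaining:
        # Pick the phrase that adds the most still-missing units.
        best = max(candidates, key=lambda p: len(candidates[p] & remaining))
        gained = candidates[best] & remaining
        if not gained:
            break  # no phrase can cover what is left
        chosen.append(best)
        remaining -= gained
    coverage = 1 - len(remaining) / len(target_units) if target_units else 0.0
    return chosen, coverage


candidates = {
    "phrase_1": {"a", "b"},
    "phrase_2": {"b", "c", "d"},
    "phrase_3": {"e"},
}
prompts, coverage = choose_prompts(candidates, {"a", "b", "c", "d", "x"})
```

  • A coverage value below 1.0 signals that some target units were never captured, which is the kind of shortfall a confidence rating on the resulting template would need to reflect.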
  • A template of the voice is, in one embodiment, stored until called for by the processor means 25. For example, after voice AA has had an enabling portion captured, analyzed and templated (now referred to as AA t), it is stored in a storage means 22 (which may be either resident near the other components or located in a remote or distributed mode at one or more locations) until a demand request occurs.
  • An example of a demand request is a user of system 10 submitting a request via representative input means 29 to utilize the voice AA template AA t in a newly created conversation, with voice AA participating as a generated voice rather than as an actual, live use of voice AA.
  • This may occur in conjunction with or utilization of one or more various databases, a few of which are represented by situational database 33 or personal database 36.
  • Voice AA template AA t is called and provided as a forming mechanism with certain other noise to create a new conversational voice AA 1 that, once formed, sounds precisely like the original voice AA of the originally inputted data.
  • Although the new voice AA 1 sounds like original voice AA in all respects, it is actually an artificially created voice, with the template AA t providing the matching key, such as a genetic code, to voice AA.
  • an enabling portion of an actual voice may encode the system 10 using a template to allow regeneration and unlimited utilization of the captured voice in virtually any way desired by the user.
  • This is not simply a synthesis of prior utterances of bits of voice AA which are electronically fused together, by either concatenation or formant techniques, but rather an entirely new voice that is designed, manufactured and assembled or constructed using the voice data characteristics of voice AA (i.e., the voice template or profile), and possibly other characteristics relevant to the originator of voice AA, e.g. genetic code, tissue DNA applicable to a specific voice, or other physiologic precursor.
  • connection means 41 represents pathways for energy or data flow which may be actual leads, light channels, or other electronic, biologic or other activatable paths among system components.
  • power means 44 is shown within system 10, but may also be remote if desired.
  • the algorithm, signal, code means or template which is created in whole or in part may be returned for storage or refinement within either storage means 22, template means 19, or other system component or architecture.
  • This capability permits and facilitates improvement or adaptation of the specific voice template according to the instructions of the creator or another user. This could be accomplished, for example, if multiple data sets of the same person's voice could be inputted over time, or if different ages, development, or other changes to physiology or temperament of the originator of the voice occur. Indeed, it is possible to train the templated voice to recall the context of previous engagements and to include such knowledge in future operations.
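  • The store, demand-request, and refinement cycle described above can be sketched as a small data structure. The class and method names, the dictionary of characteristics, and the version counter are hypothetical; the sketch does not synthesize audio and only shows how a stored template might be retrieved on demand and later replaced by a refined version.

```python
from dataclasses import dataclass


@dataclass
class VoiceTemplate:
    originator: str        # e.g. "AA"
    characteristics: dict  # hypothetical voice characteristic data
    version: int = 1


class TemplateStore:
    """Stands in for storage means 22: latest template per originator."""

    def __init__(self):
        self._templates = {}

    def save(self, template):
        self._templates[template.originator] = template

    def refine(self, originator, new_characteristics):
        """Merge newer data (e.g. a later recording) into the template."""
        old = self._templates[originator]
        merged = {**old.characteristics, **new_characteristics}
        refined = VoiceTemplate(originator, merged, old.version + 1)
        self.save(refined)
        return refined

    def demand_request(self, originator, text):
        """Describe a generation job for new text in the templated voice."""
        template = self._templates[originator]
        return {
            "voice": f"{originator} 1",
            "template_version": template.version,
            "text": text,
        }


store = TemplateStore()
store.save(VoiceTemplate("AA", {"pitch_hz": 110.0}))
store.refine("AA", {"pitch_hz": 108.0, "speech_rate": 4.1})
job = store.demand_request("AA", "Hello again.")
```

  • Keeping a version number on each refinement is one simple way to let a user fall back to an earlier template if a later refinement proves less faithful.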
  • voice AA 1 template AA t
  • voice or template AA t
  • location of a person with a voice BB that comprises one or more voice characteristics that are similar to voice AA, which was the originator for voice template AA t.
  • It may be useful to input the one or more similar characteristics from voice BB as either limited or general refinement inputs to voice AA 1 or voice template AA
  • It is then possible to also retain voice BB and create a voice BB 1 and voice template BB t, either of which may be useful at a future date.
  • Another example includes creation of a database of variously refined voices for a single originator of the voice, useful on demand or as appropriate by system or user, according to the situation that is presented.
  • a service may be offered to voice match and provide suitable refinement tools, such as natural or artificially generated waveforms or other acoustic or signal elements, to refine voice templates according to the user's desires.
  • any use of a voice-like noise which is generated by data provided to and data resulting from a template or coding tool for creation of that voice-like noise, is captured within the scope of this invention, particularly when such coding tool is used with other noise or sound generating means, if needed, to re-create a voice sound that is virtually identical to the originator's actual voice.
  • the ability to provide machine, component, or computer readable code means as part of the signal forming or transmitting of the voice template process or product further facilitates use of this technology.
  • Means to tie or activate use of this voice templating and voice generating technology to streaming or other forms of data allows for virtual dialog, which may be adaptive and intelligent, as well as merely informational or reactive, and with such dialog or conversations being with voices selected by the user. It is also recognized that the technology herein disclosed may be utilized with visual images as well as aural sounds.
  • a voice template as described herein may be created using data that does not include an actual enabling portion of an originator's voice, but that the enabling portion of the originator's voice may be used, possibly with other data, to validate the replication accuracy of the originator's voice.
  • a templated or replicated voice may be used to interact with or prompt users of computers or other machines and systems. The user may select such templated voice from either her own library of templated voices, another source of templated voices, or she may simply create a new voice.
  • templated voice AA 1 may be selected by the user for voicemail prompts or reading of texts, or other communication interface, whereas templated voice CC may be selected for use in relation to an interactive entertainment use.
  • Troubleshooting or problems lurking in the user's machine, or alerting signals to a user of a device, may be identified or resolved by the user while working with templated voice DD.
  • Template selection and use, and generated voice creation and use may be accomplished either within the user's machine or device, partially within the user's machine or device, or external of the user's machine or device.
  • For example, suppose a parent desired her child to learn about race relations in the United States in the decade of the 1960s using the voice of one of the child's deceased grandparents.
  • The templated voice of the selected grandparent would be designed, manufactured and designated for use.
  • System 10 would access one or more databases to harvest information and knowledge about the designated topic and provide that information to one or more databases within system 10, such as situational database 33 for use as needed.
  • The grandparent's templated voice EE 1 would be used, following access to the desired information, and the demand request would be met by the templated voice EE 1 commencing a discussion on the designated topic when desired.
  • Such discussion can be saved for later use within system 10 or at a remote location as desired, or the discussion may be interactive between the "grandparent" i.e. the templated voice, and the child.
  • This feature is possible by use of a voice recognition module to know in advance of the discussion the identity of the child's voice and to include adequate vocabulary and neural cognition of the various question combinations likely from the child.
  • a bridge would be provided from the input and voice recognition module to the templated voice portion of the system, to enable responsiveness by the templated voice.
  • Various speech recognition tools are conceivable for use in this manner, when so configured according to the novel uses described herein. Of course this configuration also requires means to rapidly search for the answer to the question and to formulate a response appropriate to the listening child.
  • this example illustrates the extraordinary potential of this technology, particularly when combined with suitable data, system power, and system speed.
  • With the optional voice recognition module, it is possible to utilize only limited features to enable a listener of a templated voice to direct the generated voice to cease or continue, or to enable certain other features with certain commands. This would be a form of limited interactive mode appropriate for some but not all types of use. Even if the user chose not to use the optional features and instead merely arranged for a story or a discussion in the absent grandparent's voice, the effect and utility of this would be enormous for this and other types of use.
  • the templated voice may again be that of the grandparent selected above (templated voice EE 1 ), and the filter of DATA DATES is used with a selected date of "BEFORE DECEMBER 1963" for a discussion of race relations in the United States in the decade of the 1960s. The result would be a discussion that would not include any information that occurred after the designated date. In this example, the "grandparent" could not discuss the Voting Rights Act of 1965 or the urban riots of the late 1960s in that country.
  • a user may direct a templated voice of a loved one or someone else to read to the user.
  • it is possible for people of all ages to have books read to them in the voice of an absent or deceased family member or other person known to the user.
  • this innovation alone will provide enormous benefit to users.
  • This type of use has wide applications beyond the specific example just provided. Indeed, an even broader use of this technology in this manner is to have available a database of authorized and templated voices which may be accessible and useable by others for a fee or other form of compensation.
  • When used for music, this technology has similarly profound implications, particularly if one can access templated voices of past and present singers of renown, many of whose voices are still available for templating. Clearly, this technology enables a new industry of manufacturing, leasing, purchasing, or otherwise using voice templates and associated means, techniques and methods of conducting business therewith.
  • the invention may also have utility in medical treatments for certain minor or major psychological ailments, for which proper use of templated voice therapy may be quite palliative or even therapeutic.
  • Yet another possible use of this technology is to create a newly designed voice for use, but one which has a basis or precursor in one or more templated voices from actual mammalian origin. Ownership and further use of the newly created voice may be controllable under various means or legal enforcement, such as licensing or royalties and the like. Of course, such voices may be retained as private possessions for limited use by the creator as well.
  • Such voices will represent the creative aspirations of the creator, but each voice will actually have a component or strain of actual mammalian voice as a basis through use of the templating tool or code, similar to a strand of tissue DNA but applicable to a specific voice.
  • This type of combination presents powerful new communication capabilities and relationships based on voice and other sounds created by mammals.
  • Systems according to the invention may be handheld or of other size. Systems may be embedded in other systems or may be stand alone in operation. The systems and methods herein may have part or all of the elements in a distributed, network or other remote system of relationship. Systems and methods herein may utilize downloadable or remotely accessible data, and may be used for control of various other systems or methods or processes. Embodiments of the invention include exposed interface routines for requesting and implementing the methods and operations disclosed herein but which may be carried out in whole or in part by other operating or application systems. The templating process and the use of templated voices may be accomplished and used by either mammals or artificial machines or processes. For example, a bot or other intelligent aide may create or use one or more templated voices of this type.
  • Such an aide may also be utilized to search for voices automatically according to certain general or limited criteria, and may then generate templated voices in voice factories, either virtual or physical. In this manner, large databases of templated voices may be efficiently created. In this or similar systemic use, it may be desirable to create and apply data or other types of tagging and identification technology to one or more portions of the actual voice utilized to create a templated voice.
  • a templating process using elements of the embodiments herein yields a voice coding signal, comprising the logic structure of characteristics of a specific voice essential for accurately replicating the sound of that voice.
  • Example 4 A home energy monitor, reporter, or mate, using one or more selected voices using the technology herein.
  • a hotel room assistant, or automobile assistant to prompt the user according to desired prompting such as for example a wake-up call in a hotel in the voice selected by the user.
  • an operator of a vehicle might receive information in the voice or voices selected by the user.
  • voice template technology in combination with other visual media, such as with a photograph, digital video or a holographic image.
  • a personal device that scans and updates downloadable information for a user as desired in voice or voices of one's choosing. For example, this may be useful for organizing actions capable of being done by a bot, such as an info-bot for background searching and interface while the user is not available and then reporting status to the user in one or more designated voices using the technology herein.
  • a safety reminder when used with one or more components of gear or equipment in the workplace, such as a personal computer posture monitor, electrical equipment, dangerous equipment, etc.
  • voice activated systems such as dictation devices, as prompts, companions, or text readers.
  • Use of the technology disclosed herein as a social mediation or control mechanism, such as a tool against road rage or other forms of anger and frustration, activatable by the driver, automatically, or by other means.
  • Use as a VoiceSelect™ brand of movie or video match technology to utilize preferred voices for templating of entertainment script already used by the original performer or subsequently created for voice template technology combination uses.
  • an "alter ego" device such as a handheld unit which engages on "SelectVoice™" brand or "VoiceX™" brand mode(s) of operation and has a database of images of those who match the voice as well as anonymous models which can be selected, similar to that referred to in Example 7.
  • Use of the technology disclosed herein as a bedtime reader or a night mate in a dwelling for monitoring and interactive security.
  • Figure 2 is a flow diagram of one embodiment of a voice capture subsystem which may comprise computer readable code means or method for accomplishing the capture, analysis and use of a voice AA designated for templating.
  • Figure 3 is one embodiment of a voice analysis subsystem which may comprise logic or method means for efficiently determining voice data characterization routing.
  • voice AA is captured in acquisition module or step 103 and then routed by logic steps and data conductive pathways, such as pathway 106, through the templating process. Capture may be accomplished by either digital or analog methods and components.
  • the signal which then represents captured voice AA is routed through analysis means 111 or method to determine whether an existing voice profile or template matches voice AA.
  • This may be accomplished, for example, by comparing one or a plurality of characteristics (such as those shown in voice characterization subsystem 113 of Figure 4) as determined by either acquisition module 103 or analysis means 111, and then comparing those one or more characteristics with known voice profiles or templates available for access, such as at analysis step 111.
  • Representative feedback and initial analysis loop 114 facilitates these steps, as does pathway 116.
  • Such comparison may include querying of a voice profile database or other storage medium, either locally or remotely.
  • the analysis step at analysis module 111 and voice characterization subsystem 113 may be repeated according to algorithmic, statistical or other techniques to affirm whether the voice being analyzed does or does not relate or match an existing voice profile or data file.
  • Figure 4 provides further detail of voice characterization subsystem 113.
  • If the signal corresponding to voice AA does not have a match or is not identified with an existing voice profile set, then the signal is routed to the voice characterization subsystem for comprehensive characterization.
  • creation of a template may not be required at module/step 127.
  • the signal might be analyzed and/or characterized for possible generation of a revised profile or template- which itself may then be stored or applied. This situation might occur, for example, when additional characterization data is available (such as size of enabling portion, existence or lack of stress, or other factors) which had not been previously available.
  • a specific voice data file might comprise a plurality of templates.
  • the template creation module/step 127 of Figure 2 comprises utilizing the voice characterization subsystem to create a unique identifier, preferably a digital identifier, for that specific voice being templated or profiled.
  • This data is similar, in the abstract, to genetic codes, gene sequence codes, or bar codes, and like identifiers of singularly unique objects, entities or phenomena. Accordingly, applicants refer to this voice profile or template as "Voice Template Technology™" as well as "Voice DNA™ or VDNA™" and "Voice Sequence Codes™ or Voice Sequence Coding™".
  • the terms "Profile, Profiles or Profiling” and derivative terms may be substituted in the above trademark or other reference terms for this new technology.
  • the voice template may be stored (shown at storage module or step 161) or applied in use (shown at module or step 164).
  • Figure 4 is a schematic representation of a voice characterization subsystem.
  • This disclosure comprises at least one embodiment of characterization data and means for determining and characterizing salient data to define a voice using voice templating or profiling, as disclosed herein.
  • various types of data are available for comparison in formulating the characterization data.
  • This characterization data will then be used to create the voice template or profile according to coding criteria.
  • Although the data in Figure 4 appears to be arranged in discrete modules, an open comparator process may be preferred in which any data may be accessed for comparison in any of various sequences or weighted priorities.
  • data may comprise the categories of language, gender, dialect, region, or accent (shown as "Voice Characteristics" output signal VC 0 at module or step 201); frequency, pitch, tone, duration, or amplitude (shown as output signal VC 1 at module or step 203); age, health, pronunciation, vocabulary, or physiology- either genetic or otherwise (shown as output signal VC 2 at module or step 205); patterns, syntax, volume, transition, or voice type (shown as output signal VC 3 at module or step 207); education, experience, phase, repetition, or grammar (shown as output signal VC 4 at module or step 209); occupation, nationality, ethnicity, custom or setting (shown as output signal VC 5 at module or step 211); context, variances, rules/models, enabling portion type, size or number (shown as output signal VC 6 at module or step 213); speed, emotion, cluster, similarities, or acoustic model (shown as output signal VC 7 at module or step 215); math model
  • VC X encompasses any known categorization technique at the time of interpretation, regardless of mention herein, provided it is useful in then defining a unique voice profile or template for a specific voice- and is used according to the novel teachings disclosed herein.
  • data combined in voice characteristic files and output signals VC 0 , VC 1 , VC 2 , VC 3 , VC 4 , VC 5 , VC 6 , VC 7 , VC 8 , VC 9 , VC 10 , VC 11 , VC 12 , and VC X may be prioritized and combined in various ways in order to accurately and efficiently analyze and characterize a voice, with VC X representing still further techniques incorporated herein by reference.
  • Another goal of this technology is to identify protocols for reconstituting encoded voice DNA with content into a voice recognizable as an originator.
  • the voice DNA will be combined with text or similar input data in a manner that is likely similar to compiling steps in known text-to-speech synthesizers.
  • a key difference is the addition of the very specific voice DNA template, which functions as a unique recipe or filter for each generated voice.
  • Figures 5 and 6 illustrate an exemplary signal bundler suitable for receiving the various voice characteristic data, such as digital or coded data representative of the information deemed relevant and formative of the voice being templated.
  • the signal bundler 316 then combines the output of signal content module or step 332 and values/scoring from one or more signals VC 0 - VC X and formats the signal or code at module or step 343 as appropriate for proper transfer and use by various potential user interfaces, devices or transmission means to create an output voice template, code, or signal VT X . It is recognized that various methods are possible to create a unique identifier to delineate the various voice characteristics- and that such various possibilities are enabled herein in view of the broader context and scope of this invention- to a certain degree independent of some component methodology.
  • Another goal of this technology is to identify the use of specific genetic codes and/or gene sequence codes to assist in the unique classification of a particular speaker's voice, and to enhance the fidelity of any voice re-created using a voice DNA product.
  • Another goal is to determine the best methodology for distribution, validation and use-control of voice DNA. It is recognized, of course, that the implications of this technology are vast, and safeguards will be necessary to maintain the proper use of this templated voice technology. Indeed, this technology may require further use of authorization protocols to only allow authorized users to access and use the voice template technology and data. An additional necessity may be to have mechanisms for verifying that voices heard are either real or templated, in order to ensure against fraudulent or unauthorized use of such created voices.
  • Systems using this technology may be handheld or of other size.
  • Systems may be embedded in other systems or may be stand alone in operation.
  • the systems and methods herein may have part or all of the elements in a distributed, network or other remote system of relationship.
  • Systems and methods herein may utilize downloadable or remotely accessible data, and may be used for control of various other systems or methods or processes.
  • Embodiments of the technology include exposed interface routines for requesting and implementing the methods and operations disclosed herein but which may be carried out in whole or in part by other operating or application systems.
  • the templating process and the use of templated voices may be accomplished and used by either mammals or artificial machines or processes. For example, a bot or other intelligent aide may create or use one or more templated voices of this type.
  • Such an aide may also be utilized to search for voices automatically according to certain general or limited criteria, and may then generate templated voices in voice factories, either virtual or physical. In this manner, large databases of templated voices may be efficiently created. In this or similar systemic use, it may be desirable to create and apply data or other types of tagging and identification technology to one or more portions of the actual voice utilized to create a templated voice or to use voice tags for marking other data.
  • Figure 7 is a representative organization and method of an electronic query and transfer between a voice template generation or storage facility 404 and a remote user.
  • enabling portions may be sent to a remote voice template generation or storage facility 404 by any number of various users 410, 413, 416.
  • the facility 404 then generates or retrieves a voice template data file and creates or retrieves a voice template signal.
  • the template signal is then transmitted or downloaded to the user or its designee, shown at step 437.
  • the template signal is formatted for appropriate use by a destination device, including activation instructions and protocols, shown at step/module 457.
  • Figure 8 is a schematic representation of a mobile medium, such as a card, disk, or chip on which are essential components, depending on the user mode and need, for utilizing voice template technology.
  • a hotel door card 477 may be provided at check-in to a hotel by a traveler.
  • additional features incorporating aspects of this invention may be made available.
  • a schematic representation of optional features within such a card include means 481 for receiving and using a voice template for a voice or voices selected by the traveler for various purposes during the traveler's stay at the hotel.
  • such features may include a template receiving and storage element 501, a noise generator or generator circuitry 506, a central processing unit 511, input output circuitry 515, digital to analog/analog to digital elements 518, and clock means 521.
  • various other elements may be utilized, such as voice compression or expansion means- such as those known in the cellular phone industry, or other components to enable the card to function as desired.
  • the user may then enjoy dialog or interface with inanimate devices within the hotel in the voice(s) selected by the traveler. Indeed, a traveler profile may even retain such voice preference information, as appropriate, and certain added billings or benefits may accrue through use of this invention. It is recognized that the invention may be employed in a wide variety of applications and articles, and the example of Figures 8 and 9 should not be considered limiting.
  • Figure 9 is a depiction of a photograph 602 which is configured for interactive use of voice template technology with voice JJ attributable to figure F JJ and voice KK attributable to figure F KK . Means are combined with the frame 610 or other structure, whether computer readable code means or simple three dimensional material, for interfacing the subjects or objects of the photo (or other media) with the appropriate voice templates to recreate a dialogue that either likely occurred or could have occurred, as desired by the user.
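
The profile-matching step described above for analysis means 111 (comparing one or more characteristics of captured voice AA against known voice profiles or templates) might be sketched as follows. This is only an illustrative sketch, not the disclosed method: the feature names, the normalization of every characteristic to the range 0 to 1, the Euclidean distance measure, and the threshold value are all assumptions introduced for the example.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance over the characteristics both profiles share."""
    shared = a.keys() & b.keys()
    if not shared:
        return float("inf")
    return sqrt(sum((a[k] - b[k]) ** 2 for k in shared))

def find_matching_profile(captured, stored_profiles, threshold=0.25):
    """Return (profile_id, distance) for the closest stored profile.

    profile_id is None when no stored profile lies within the threshold,
    in which case the voice would go on to full characterization.
    """
    best_id, best_dist = None, float("inf")
    for profile_id, profile in stored_profiles.items():
        d = distance(captured, profile)
        if d < best_dist:
            best_id, best_dist = profile_id, d
    if best_dist <= threshold:
        return best_id, best_dist
    return None, best_dist

# Hypothetical stored profiles; all values are assumed to be normalized to 0..1.
stored = {
    "EE1": {"pitch": 0.42, "tempo": 0.55, "amplitude": 0.60},
    "DD": {"pitch": 0.80, "tempo": 0.30, "amplitude": 0.45},
}
captured = {"pitch": 0.40, "tempo": 0.57, "amplitude": 0.61}
unknown = {"pitch": 0.10, "tempo": 0.90, "amplitude": 0.10}

match_id, match_dist = find_matching_profile(captured, stored)
no_match_id, _ = find_matching_profile(unknown, stored)
```

Here `captured` lies closest to the hypothetical profile `EE1` and is reported as a match, while `unknown` matches nothing within the threshold. A practical system would presumably weight characteristics differently and repeat the comparison statistically, as the text suggests.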

Abstract

Systems and methods are disclosed to capture (103) an enabling portion of a voice and then to create a voice template (127) or profile signal which may be combined at a later time with noise of another origin to reconstitute the original voice. Such reconstituted voice may then be used to speak any form or content provided via digital input thereto, and to say content, which was not spoken in an original form by the original voice. Products and processes for online use are disclosed, as are certain business methods and industry applications.

Description

SYSTEM AND METHOD FOR VOICE CHARACTERISTIC MEDICAL ANALYSIS
Field of the Invention
Systems, methods, and products for preserving and adapting sound, and more specifically human voices. A patient's voice characteristic is frequently an indicator of an unwanted medical condition to a primary care physician, regardless of any prior knowledge of the patient. However, a shortage of physicians and a combined inability of patients to be able to get to physicians and medical care promotes greater interest in telemedicine techniques. Improvements to telemedicine are long overdue, and provide opportunities not previously recognized to general populations.
Background of the Invention
Since the beginning of time mammals and other creatures have communicated in some form by voice or similar noises. Indeed, such noises are normally quite distinct in view of the differences in morphology of creatures- even within species. The distinctiveness of creatures includes the very distinct elements of speech patterns and tones. Unfortunately, the joy of listening to the speech of others with a voice of particular interest is lost when that person dies or ceases contact with the listener.
Only the very basic forms of media capture exist today by which voices may be preserved. For example, a tape or digital recording device is used to record someone's voice and thereby retain it for future listening and replay as it was recorded originally, or portions of the original recording may be played as desired. These devices and methods of voice recording also include a range of artificial voices, created by computers, which may be used for many different functions, including for example telephone automatic assistance and verification, very basic speech between toys or equipment and users, synthesized voices for the film and entertainment industry, and the like. In some applications, these artificial voices are preprogrammed to a narrow set of responses according to a specific input. Although more responsive, in some instances, than a mere recording of an actual voice, these artificial voice sounds are nevertheless simple compared to the robust voice capabilities of the present invention. Indeed, in certain embodiments of the invention there are elements that are either quite different from such systems or which take the previous technology far beyond that ever contemplated or even suggested by such prior discoveries or innovations.
Many publications worldwide disclose aspects of artificial vocalization. In similar fashion, some references disclose systems and techniques of using and creating artificial voice sounds. However, none of these references disclose the concepts of the present invention.
Brief Description of the Drawings
Figure 1 is a flow diagram of one embodiment of the system operation of the invention.
Figure 2 is a schematic diagram of one embodiment of a voice capture subsystem.
Figure 3 is a schematic diagram of one embodiment of a voice analysis subsystem.
Figure 4 is a schematic diagram of one embodiment of a voice characterization subsystem.
Figure 5 is a schematic diagram of one embodiment of a voice template subsystem.
Figure 6 is a schematic diagram of one embodiment of a voice template signal bundler subsystem.
Figure 7 is a schematic diagram of one embodiment of the system of the invention used with remote information download and upload options.
Figure 8 is an exemplary plan view of one embodiment of the invention embodied in a mobile, compact component.
Figure 9 is an exemplary plan view of an embodiment of the invention used with a visual media source.
Summary of the Invention
Systems and methods are provided for recording or otherwise capturing an enabling amount of a specific person's voice to form a voice pattern template. That template is then useful as a tool for building new speech sounding like that precise voice, using the template, with the new speech probably never having been actually said or never having been said in the precise context or sentences by the specific human but actually sounding identical in all aspects to that specific human's actual speech. The enabling portion is designed to capture the elements of the actual voice necessary to reconstruct the actual voice; however, a confidence rating is available to predict the limits of the reconstructed or re-created speech in the event there is not enough enabling speech to start with. A new voice or voices may be used with a database of subject matter, historical data, and adaptive or artificial intelligence modules to enable new discussions with the user just as if the templated voice's originator were present. This system and method may be combined with other media, such as a software file, a chip embedded tool, or other forms. Interactive use of this system and method may occur in various manners. A unit module itself may comprise the entirety of an embodiment of this invention, e.g. a chip or electronic board which is configured to capture and enable use of a voice in the manner disclosed herein.
The template is useful, for example, as a tool for capturing and creating new dialogs with people who are no longer immediately available, who may be deceased, or even those who consent to having their voices templated and used in this manner. Another example is the application to media, such as film or photos or other depictions of the originator of the actual voice(s), to create on-demand virtual dialog with the originator. Various other uses and applications are contemplated within the scope of the invention.
Detailed Description of the Invention
Voice is a sound of extraordinary power among mammals. The sound of a mother's voice is recognized by and soothes a child even before birth, and the sound of a grandfather's voice calms the fears of even a grown person. Other voices may inspire complete strangers or may elicit memories from loved ones of long past events and moments. These are but a few examples of the great gift of distinctiveness that the human and other species have, and of their ability to influence others (and themselves) by the unique sound of each creature's voice. In humans, for example, this particularity of one's voice derives from the genetic contribution of the parents resulting in the shape, size, position and development of the various human body components that influence the way one sounds when speaking or otherwise communicating with voice or through the mouth and nasal passages. Other influences exist as well. It is understandable therefore that there is a range of differences among people, often even within the same family. Indeed, even the same person may sound slightly different according to temporal influences such as health, stress level, emotional state, fatigue, the ambient temperature around the person, or other factors.
There is general agreement worldwide, however, that a person's voice qualities present a unique combination that is discernible to those who have heard the voice before. The ability of humans to associate through their senses is remarkable, particularly as such sensing relates to identification and association with the human voice. Life's grand and small events are often recalled many years or decades later by the nature of comments made or tones remembered. Such is the enduring strength and emotive power of voice.
It is of course well known to capture and play back human voice on various media and machines. Basic manipulation of recorded human voice has been done for many decades, both intentionally and unintentionally, in tape and digital media. However, this manipulation has been generally limited by the bounds of what has actually been stated by the human rather than what could be stated by that human. For example, segments of actual statements by the human have been played, edited, mixed and re-played, sometimes even at different speeds. Other examples of human voice use include playback of intentionally distorted voice segments, such as may be used in cartoons or other audio related to animation or certain music. Of course, the animation medium also has used artificial voice not necessarily created using actual voice. One example of this is a computer generated "voice" operator used by some telephone and communication systems. One method of synthesizing voices and sounds is referred to as concatenative, and refers to the recordings of wave form data samples or real human speech. The method then breaks down the pre-recorded original human speech into segments and generates speech utterances by linking these human speech segments to build syllables, words, or phrases. The size of these segments varies. Another method of human speech synthesis is known as parametric. In this method, mathematical models are used to recreate a desired speech sound. For each desired sound, a mathematical model or function is used to generate that sound. As such, the parametric method is generally without human sound as an element. Finally, there are generally two well-known types of parametric speech synthesizers. One is known as an articulatory synthesizer, which mathematically models the physical aspects of the human lungs, larynx, and vocal and nasal tracts.
The other type of parametric speech synthesizer is known as a formant synthesizer, which mathematically models the acoustic aspects of the human vocal tract.
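As a rough illustration of the formant approach just described, the following sketch passes a simple pulse train (standing in for the glottal source) through a cascade of second-order digital resonators, one per formant. It is a minimal sketch of a generic, well-known parametric technique and is not the method of this disclosure. The resonator form follows the commonly used Klatt-style difference equation; the formant frequencies, bandwidths, pitch, and sample rate are illustrative assumptions.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at 0 Hz)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Generate a vowel-like sound: pulse train source, then cascaded formant resonators."""
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / pitch_hz))
    # One unit pulse per pitch period approximates the voiced excitation.
    out = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        out = resonate(out, freq_hz, bandwidth_hz, sample_rate)
    return out

# Illustrative (assumed) formant frequencies and bandwidths, in Hz, for an open vowel.
samples = synthesize_vowel([(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)])
```

The resulting list of samples could be scaled and written to an audio file. Note that no recorded human sound enters the computation, which is the distinction the text draws between parametric and concatenative methods.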
Other systems include means for recognizing a specific voice, once the using system has been trained in that voice. Examples of this include the various speech recognition systems useful in the field of capturing spoken language and then translating those sounds into text, such as with systems for dictation and the like. Other speech related systems concern the field of biometrics, and use of certain spoken words as security codes or ciphers. None of these systems, methods, means or other forms of disclosure recognize the various inventions disclosed herein, nor do any such disclosures even recognize a need for such technical innovations. What has long been needed is a system and method for preserving the voices of other beings in a dynamic and adaptive manner for future use and benefit by the originator or by others. What has been further needed are systems and methods for accomplishing and utilizing such voice capture or profiling in manners which present a seamless, articulate, or otherwise genuine vocalization or voice in the voice of the original person in ways possibly never contemplated by that person. Certain additional advantages accrue to systems and methods for accomplishing this which are easily used by all people of virtually any skill, culture or language. What has been further needed is a new business method, technique and model, along with implementing apparatus and other means, to create and to facilitate access to specific voice templates and then facilitate use of those voice templates for personal needs or desires, whether related to business or pleasure. Once again, although much has been accomplished in the field of voice technology, none of these past efforts contemplate the instant inventions and merely highlight the novel and heretofore unrecognized need for these inventions.
This invention describes various techniques for templating specific human voices; disclosure regarding development of standardized classification, distribution, and reconstruction modalities for novel human voice templates built with unique voice profile subcodes; and various alternative paths for analyzing and assessing human voice characteristics, and creating very specific patient-related and overall individual person-related voice codes. All of these embodiments are incorporated into the current application.
In one embodiment, the invention comprises developing remote phone-based analyzing modalities of patient voices for telemedicine screening applications. The product and service inventions will provide remote identification of certain disease onset via detectable changes in patient voice. Patient confidentiality may also be enhanced and provided for using this feature as a biometric validation during a phone call. As noted above, a patient's voice characteristic is frequently an indicator of an unwanted medical condition to a primary care physician, regardless of any prior knowledge of the patient. Examples include detection of tremor, Parkinsonian voice (paucity of expression), psychoactive medications (monotone), peritonsillar abscess (known as "hot potato in the throat"), epiglottitis, and neurological changes causing dysarthria (changed speech due to brain changes). In one embodiment, a product or service utilizing this technology will provide instantaneous analysis and recurrent updates of a patient's voice, compare the updates to a critical code which defines a patient baseline voice (when the patient is known), and generate a differential diagnosis based on likely conditions triggering certain anomalies to that patient's voice, whether the patient is a first time caller or not.
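The comparison of a voice update against a patient baseline code can be illustrated with a minimal sketch. The feature names (`pitch_hz`, `pitch_variability`, `tremor_index`), the `VoiceCode` container, and the 15% tolerance are all hypothetical and are not taken from this disclosure; the sketch only shows the general idea of flagging deviations from a stored baseline, and it does not perform any diagnosis.

```python
from dataclasses import dataclass


@dataclass
class VoiceCode:
    """Hypothetical per-patient voice code: named acoustic measurements."""
    features: dict


def flag_anomalies(baseline, update, tolerance=0.15):
    """Return features whose relative change from baseline exceeds tolerance."""
    flagged = {}
    for name, base_value in baseline.features.items():
        # Skip features missing from the update or with a zero baseline.
        if name not in update.features or base_value == 0:
            continue
        change = (update.features[name] - base_value) / abs(base_value)
        if abs(change) > tolerance:
            flagged[name] = round(change, 3)
    return flagged


# Illustrative values only; not measured data.
baseline = VoiceCode({"pitch_hz": 120.0, "pitch_variability": 0.20, "tremor_index": 0.05})
update = VoiceCode({"pitch_hz": 118.0, "pitch_variability": 0.08, "tremor_index": 0.05})
anomalies = flag_anomalies(baseline, update)
```

In this illustration only `pitch_variability` is flagged, since it drops well below the baseline while the other features stay within tolerance. A real screening system would map such flags to candidate conditions using clinically validated rules.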
A digital voice code with ultra-high patient specificity, and novel medical condition recognition capabilities (preferably) by digital signal analyses are provided and disclosed herein. The digital voice code is derived from any of the plurality of techniques and algorithms, as disclosed herein below. The medical condition recognition capabilities and algorithms are those being developed by Applicant in relation to the above-listed conditions and others.
Early screening via telemedicine enhances rapid intervention by healthcare providers, which in turn reduces the life cycle cost of these conditions. Additionally, there are several applications for this technology which enable revenue through more voluntary routine screening and through use as a revenue-generating or add-on fee-for-service telemedicine stream. Independent analyses indicate very high ratios of savings/revenue versus cost from earlier intervention, compliance monitoring, and elective fee-based screening.
In addition to the business and other methods noted above, this technology also provides additional business method opportunities. For example, utilization of this technology in a blind voluntary automated screening system for an organization-communication system with private queuing to voice originators would allow for non-traditional intervention and courtesy alerts to non-patient participants in the plan, as well as to traditional patients, in order to screen for major disease onsets and have wider organizational impact on healthcare savings. Use by large non-medical organizations as a daily screening tool with their phone systems is a very likely utilization of this technology, in addition to telemedicine call-in applications utilized within healthcare organizations.
Heretofore, this technology has not been possible because the enabling voice platform technology and the medical condition recognition capabilities and algorithms had not been created. This telemedicine approach is particularly unique in its use of the digital code for each person's voice, which allows extensive data comparison and record checking capabilities for each patient as well. Research indicates a very large population which is readily reachable and is also affected by the main diseases/conditions currently being investigated for use by this technology. Other applications exist as well. For example, patients who are status post-laryngectomy, or who have no vocal cords, may use this technology as an alternative to the currently used esophageal speech or vibrator, and then be classified accordingly.
One goal of this technology is to create a uniform classification system which captures, defines, and quantifies all parameters of a particular human voice. A specific digital code will be generated to uniquely identify and define a particular human's voice, in all regards. This unique voice identifying code will be useful in later generation of a machine spoken voice using text or other data when combined with the code for that selected human's voice. Accordingly, an important element of this technology is the development of standardized classification, distribution, and reconstitution modalities for novel human voice templates built with unique voice profile sub-codes, as disclosed herein.
The classification system will capture and normalize the results of different voice analysis techniques. These techniques are drawn from voice recognition, voice synthesis/production, text-to-speech know-how, and other fields. Various analytical tools will be categorized from both the traditional analytical realm (e.g., vocal fold analysis, glottal pulse response, histogram/spectrogram analysis, vocal tract dimensions, fractal analysis, audio-visual analysis, short time Fourier transform, etc.) as well as more novel areas (e.g., genetic characteristics, age, accent, education, health, etc.).
This voice identifying and classifying system will yield a unique, data-rich descriptive code for each human voice, and such code will function as a cipher key to generate a synthetic voice having identical sound and other characteristics as the original human voice. Each code will then be formatted in various ways to facilitate its use as a building block in licensing or other commercial applications, including electronic transfer and storage. Such a system will enable dramatic changes to the human interface field as well as other technical areas.
One objective of this technical effort is to identify the best methodology for classifying all elements of a specific voice and then arranging a logical and unique classification system for future standard categorization protocols, which require development. It is well known that numerous technical approaches to voice recognition, authentication, generation, biometrics, and digital communications have been developed and successfully implemented. Our technology will use many of these underlying technical approaches as building blocks to develop voice classification, distribution and re-creation of actual voice profiles. Analysis of an enabling portion of an actual speaker according to a very extensive parsing methodology produces a digital code which uniquely defines the various elements of the speaker's voice. This digital code is referred to as a "voice DNA" and becomes the basis to the particular speaker's voice descriptive code. The diverse yet excellent work done in the voice recognition field is the source of best available technology in this regard. However, that technology has been developed with considerable constraints, and has not focused on the underlying voice DNA of speakers. Rather, it has been more content-dependent on the whole.
Creation of the protocols for reconstitution of the encoded voice DNA with the desired data is another technical objective and embodiment. The technology of this effort goes well beyond merely encoding and decoding known voice signals to enable compression transfer (as in RF or cellular phone technology). Indeed, the scope of the new technology goes beyond even a first goal of creating a uniform classification system to define each person's voice, independent of spoken content. The voice DNA will be designed for use as a key or template for future reconstitution of an artificially generated or synthesized voice which uses the voice DNA in combination with other digitized subject matter/text. This permits the exact sound of the original speaker's voice to "speak" words and phrases which were neither stored in whole or in part, but rather which are formed with the voice DNA and the digitized subject matter/text input, which may come from virtually any source. It is expected that the current state of the art of text-to-speech software and signal processing will greatly facilitate this aspect of the technical development, but much depends on the structure, size and format of the voice DNA code that will be selected for the unifying methodology.
A third technical objective and embodiment is to identify the uses of specific genetic codes and/or gene sequence codes to assist in the unique classification of a particular speaker's voice and to enhance the fidelity of any voice re-created using a voice DNA product. This has always been considered (by us) to be an important tool for classifying the enabling portion of the original voice. We believe this feature alone merits research support, particularly in light of recent discoveries involving the FOXP2 gene, but in this context is considered to be one part of a larger effort to define the best techniques to achieve unique classification of each human's voice. An additional opportunity exists because a derivative of understanding and classifying the mechanisms of individual speech is improved understanding of the specific brain circuitry that underlies such faculty in each person. The cognitive information processing role, per subject, is a basis of the interpretation and formation of speech. This technical effort will endeavor to utilize this basic element as a further enhancement of the specificity of each person's voice DNA, mindful of identifying future opportunities for research in communication among human-human and human-machine neural interfaces.
A fourth technical objective is the distribution, validation and control over use of voice DNA. Identifying the best format for these voice classification signals, and the various interface requirements of the reconstituting data, are essential technical milestones. The reality of fusing numerous underlying voice technologies requires technical reconciliation and normalization, at certain levels. The resulting integration will create a new technology field, with considerable revenue and scientific spin-offs. It is believed that a driver of revenue will be the ability to distribute voice DNA or templates via digital networks and the like for use with different products and services. This will require proper validation, transfer and proliferation/derivative rights control protocols and regimes.
Although great advances have been made in the voice-related fields of science and commerce, the current state of the art is akin to that of genomics in 1980. A dramatic series of integrating and accelerating technologies is long overdue in this field, and our technology represents a bridge to such phenomenon. ATP funding is appropriate due to the integration this Project requires among numerous competitive technologies, and disparate scientific/engineering disciplines. ATP support will underpin the strategic importance of leading this highly innovative development/use of future voice-related products and services worldwide.
Voice template technology will create revenue from these and other products: chip logic; PC cards; HPCs/PPCs/PDAs; cellular phone systems; travel and ID cards and related biometric systems; internet transfers; voice use license fees; and numerous interface applications within many industries. The potential for unforeseen spin-off applications for this technology is also quite high. Systems and methods are proposed to capture and enhance an enabling portion of an individual's voice and then to create a voice template or profile signal which may be combined at a later time with noise/text of another origin to utilize the original voice. Such voice may then be used to speak any form/content/text provided via digital or other input thereto, and to "speak" content which was not spoken in an original form by the original voice. Products and processes for online use are planned, as are certain business methods and industry applications. Specific objectives guide the technology development, as follows:
1. Identify best methodology for classifying all elements of a specific voice and then arrange a logical and unique classification system for future use.
2. Identify protocols for reconstitution of the encoded voice DNA into a voice recognizable as an originator but with other content than that spoken by the originator.
3. Identify use of specific genetic codes and/or gene sequence codes to assist in the unique classification of a particular speaker's voice, and to enhance the fidelity of any voice recreated using a voice DNA product.
4. Determine best methodology for distribution, validation and use-control of voice DNA.
One goal is to identify the best methodology for classifying all elements of a specific voice into a logical and unique code utilizing standard categorization protocols, which require development. In humans, the particularity of voice derives from the genetic contribution of the parents, resulting in the shape, size, position and development of the various aspects of human body morphology that influence the way one sounds when speaking or otherwise communicating with voice or through the mouth and nasal passages. Other influences exist as well. It is understandable therefore that there is a range of differences among people, often even within the same family. Indeed, even the same person may sound slightly different according to temporal influences such as health, stress level, emotional state, fatigue, the ambient temperature around the person, or other factors. Regardless, the ability of humans to associate through their senses is remarkable, particularly as such sensing relates to identification and association with the human voice.
It is of course well known to capture and play back human voice on various media and machines. Basic manipulation of recorded human voice has been done for many decades, both intentionally and unintentionally, in analog and digital media. However, this manipulation has been generally limited by the bounds of what has actually been stated by the human rather than what could be stated by that human. For example, segments of actual statements by the human have been played, edited, mixed and re-played, sometimes even at different speeds. Other examples of human voice use include playback of intentionally distorted voice segments, such as may be used in cartoons or other audio related to animation or certain music. Of course, the animation medium also has used artificial voice not necessarily created using actual voice. One example of this is a computer generated "voice" operator used by some telephone and communication systems. One method of synthesizing voices and sounds is referred to as concatenative, and uses recordings of waveform data samples or real human speech. The method breaks down the pre-recorded original human speech into segments and generates speech utterances by linking these human speech segments to build syllables, words, or phrases. The size of these segments varies. Another method of human speech synthesis is known as parametric. In this method, mathematical models are used to recreate a desired speech sound. For each desired sound, a mathematical model or function is used to generate that sound. As such, the parametric method is generally without human sound as an element. Finally, there are generally two well-known types of parametric speech synthesizers. One is known as an articulatory synthesizer, which mathematically models the physical aspects of the human lungs, larynx, and vocal and nasal tracts. The other is known as a formant synthesizer, which mathematically models the acoustic aspects of the human vocal tract.
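The difference between the concatenative and parametric approaches described above can be sketched in a few lines. This is a deliberately simplified illustration, not a working speech synthesizer: the sample rate, the segment names, and the formant frequencies are assumptions, the "recorded" segments are stand-in tones, and a real formant synthesizer would pass an excitation signal through resonant filters rather than sum sinusoids.

```python
import math

SAMPLE_RATE = 8000  # samples per second, assumed for illustration


def tone(freq_hz, seconds, amplitude=1.0):
    """Parametric-style generation: samples come from a formula, not a recording."""
    n = round(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def formant_like(formants_hz, seconds):
    """Crude vowel-like sound: equal-weight sum of sinusoids at assumed formants."""
    parts = [tone(f, seconds, 1.0 / len(formants_hz)) for f in formants_hz]
    return [sum(samples) for samples in zip(*parts)]


def concatenate(segment_names, inventory):
    """Concatenative-style generation: link stored segments end to end."""
    out = []
    for name in segment_names:
        out.extend(inventory[name])
    return out


# Stand-ins for recorded segments; a real system would load actual speech samples.
inventory = {"he": tone(200, 0.05), "lo": tone(150, 0.05)}
utterance = concatenate(["he", "lo"], inventory)
vowel = formant_like([700, 1200, 2600], 0.05)
```

The concatenative path can only rearrange what is already in `inventory`, while the parametric path can produce any sound its model can describe. This is the limitation the disclosure contrasts with template-driven voice generation.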
Other systems include means for recognizing a specific voice, once the using system has been trained in that voice. Examples of this include the various speech recognition systems useful in the field of capturing spoken language and then translating those sounds into text, such as with systems for dictation and the like. Other speech related systems concern the field of biometrics, and use of certain spoken words as security codes or ciphers. What has long been overlooked, however, is a system and method for preserving the key or cipher to voices of other beings in a dynamic and adaptive manner for future use and benefit by the originator or by others.
Table I is one of several uncorrelated tables which identify technically relevant characteristics (C) of human voice, and tools/technology (T) having known analytical value. An initial task is to validate these Tables and then to rank best T1_x for each C. The stability of each ranking will need testing against various C categories, for example, to determine whether T1 is a reliable and stable choice for all voice types or all voices under stress. A correlation ranking among selected T categories per C category will also be done. This will result in the best techniques for deriving each characteristic, and then lead to a classification system and reconciliation algorithms suitable for fusing the output of the best techniques into an integrated classification code.
Due to the very high number of variables, it is expected that multiple analytical models and mapping techniques will be utilized to build voice DNA. A master topographic selection model is currently preferred due to its intuitive adaptability to voice dynamics, somewhat similar to that of Genetic Algorithms. Use of GAs and other techniques will likely aid in more rapid assessment and selection of best models.
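The ranking task described above, choosing the best tool T for each characteristic C while checking that the choice stays stable across voice categories, can be sketched as follows. The scores, the tool and category names, and the "mean minus spread" scoring rule are all hypothetical; the disclosure does not specify how rankings are computed.

```python
from statistics import mean, pstdev

# Hypothetical accuracy scores: scores[characteristic][tool][voice_category]
scores = {
    "pitch": {
        "FFT": {"neutral": 0.92, "stressed": 0.90},
        "LPC": {"neutral": 0.95, "stressed": 0.70},
    },
    "formants": {
        "FFT": {"neutral": 0.75, "stressed": 0.72},
        "LPC": {"neutral": 0.88, "stressed": 0.86},
    },
}


def rank_tools(per_tool):
    """Rank tools by mean score penalized by spread across voice categories."""
    def stable_score(tool):
        values = list(per_tool[tool].values())
        # A tool that is accurate only for some categories is penalized.
        return mean(values) - pstdev(values)
    return sorted(per_tool, key=stable_score, reverse=True)


best = {characteristic: rank_tools(tools)[0]
        for characteristic, tools in scores.items()}
```

With these illustrative numbers, LPC has the higher peak score for pitch but degrades under stress, so the stability penalty moves FFT ahead of it for that characteristic.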
Representative Table I of n Tables

Voice Characteristic C    Tool/Technology T
pitch                     Fast Fourier Transform
amplitude                 parametric
duration                  dictionary
spectral                  concatenative
accent                    context dependent/free
phrasing                  lexical stress
melody                    letter-to-phoneme
timing                    morphological
prosody                   back-end
syntax                    acoustic
frequency                 modeling
punctuation               linear predictive coding
articulation              multi-pulse LPC
formants                  residual-excited LPC
intonation                pitch synchronous overlap & add (PSOLA)
closed syllables          source filter
open syllables            diacritics
stress                    spectrogram
phonemes                  statistical clustering
allophones                synthesis-by-concatenation
diphones                  Hidden Markov modeling
triphones                 articulatory
demisyllables             synthesis-by-rule
suballophone              warped linear predictor
subsyllable               acoustic
stability                 confidence measuring
suprasegmental            maximum likelihood linear regression
synthesis unit            variant pronunciation rules
transition                vector quantization
vowel                     Gaussian mixture modeling
consonant                 Code Excited Linear Prediction
Again, numerous other features and topographies of the human voice are contemplated, such as frequencies, volume, pronunciation, vocal tract physiology, patterns, laughter, tone, amplitude, gender, dialect, origin, region, language, health, voice type, education, experience, grammar, breathing rate, timbre, resonance, harmonics, genetic, and others. Other analytic techniques are also available.
Although much has been accomplished in the field of voice technology, none of these past efforts contemplate the instant research products. This highlights the novel and heretofore unrecognized need for these innovations. The following are high-level examples of systems and methods for implementing or commercializing the research and development of these technical concepts.
Figure 1 is a schematic diagram of one embodiment of a system 10 for capturing an enabling portion of a specific voice sufficient for using that portion as a template in further use of the voice characteristics. System 10 may be part of a handheld device, such as an electronic handheld device, or it may be part of a computing device of the size of a laptop, a notebook, or a desktop. System 10 may also be part of merely a circuit board within another device, or an electronics component or element designed for temporary or permanent placement in or use with another electronic element, circuit, or system. Alternatively, system 10 may, in whole or in part, comprise computer readable code or merely a logic or functional circuit in a neural system, or system 10 may be formed as some other device or product such as a distributed network-style system. In one embodiment, system 10 comprises input or capture means 15 for capturing or receiving a portion of a voice for processing and construction of a voice algorithm or template means 19, which may be formed as a stream of data, a data package, a telecommunications signal, software code means for defining and re-generating a specific voice, or a plurality of voice characteristics organized for application to or template on another organization of sound or noise suitable to arrange the sound or noise as an apparent voice of an originator's voice. Other means of formatting computer readable program code means, or other means, for causing use of certain identified voice characteristics data to artificially generate a voice are also contemplated within this invention. The logic or rules of the algorithm or template means 19 are preferably formed with a minimum of voice input; however, various amounts of voice and other data may be desired to form an acceptable data set for a particular voice.
In one embodiment of the invention, it is desired to capture an enabling portion of a human voice, for example, with a small amount of analog or digital recording, or real-time live input, of the person's voice that is to be templated. Indeed, a prescribed grouping of words may be formed to optimize data capture of the most relevant voice characteristics of the person to enable accurate replication of the voice. Analysis means are contemplated for most efficiently determining what form of enabling portion is best for a particular person. Whether by a single data input or a series of inputs, the voice data is captured and stored in at least one portion of storage means 22.
Analysis of the voice data is performed at processor means 25, to identify characteristics useful in creating a template of that specific user's voice. It is recognized that the voice data may be routed directly to the processor means and need not necessarily go initially to the storage means 22. Further exemplary discussion of the interaction among the processor means, storage means, and the template means is found below, and in relation to Figures 2-8. After adequate voice data has been analyzed, a template of the voice is, in one embodiment, stored until called for by the processor means 25. For example, after voice AA has had an enabling portion captured, analyzed and templated (now referred to as AAt), it is stored in a storage means 22 (which may be either resident near the other components or located in a remote or distributed mode at one or more locations) until a demand request occurs. One example of a demand request is a user of system 10 submitting a request via representative input means 29 to utilize the voice AA template AAt in a newly created conversation with voice AA participating as a generated voice rather than an actual, live use of voice AA. This may occur in conjunction with or utilization of one or more various databases, a few of which are represented by situational database 33 or personal database 36. In turn, voice AA template AAt is then called and provided as a forming mechanism with certain other noise to create a new conversational voice AA1 that sounds precisely like the original voice AA of the originally inputted data, once formed. Although the new voice AA1 sounds like original voice AA in all respects, it is actually an artificially created voice with the template AAt providing the matching key, such as a genetic code, to voice AA. 
In this way an enabling portion of an actual voice may encode the system 10 using a template to allow regeneration and unlimited utilization of the captured voice in virtually any way desired by the user. This is not simply a synthesis of prior utterances of bits of voice AA which are electronically fused together, by either concatenation or formant techniques, but rather an entirely new voice that is designed, manufactured and assembled or constructed using the voice data characteristics of voice AA (i.e., the voice template or profile), and possibly other characteristics relevant to the originator of voice AA, e.g. genetic code, tissue DNA applicable to a specific voice, or other physiologic precursor.
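The flow from captured voice AA, to stored template AAt, to a generated voice AA1 produced on a demand request can be outlined as a minimal data-flow sketch. The class names, the `pitch_hz` characteristic, and the shape of the returned request are hypothetical placeholders; no synthesis is performed here, and the actual template format is left open by the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceTemplate:
    """Hypothetical template (e.g. AAt) derived from an enabling voice portion."""
    voice_id: str
    characteristics: dict


@dataclass
class TemplateStore:
    """Stand-in for storage means 22: holds templates until a demand request."""
    _templates: dict = field(default_factory=dict)

    def save(self, template):
        self._templates[template.voice_id] = template

    def demand(self, voice_id, text):
        """Look up the template and pair it with new content to be spoken."""
        template = self._templates[voice_id]
        # Placeholder for synthesis: a real system would drive a voice
        # generator with the template characteristics and the new text.
        return {
            "generated_voice": voice_id + "1",
            "text": text,
            "characteristics": dict(template.characteristics),
        }


store = TemplateStore()
store.save(VoiceTemplate("AA", {"pitch_hz": 110.0}))
request = store.demand("AA", "Content never spoken by the originator.")
```

The point of the sketch is that the stored object is a description of the voice rather than a recording, so the text supplied at demand time is unconstrained by what the originator actually said.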
It is recognized, of course, that the implications of this technology are vast, and safeguards will be necessary to maintain the proper use of this templated voice technology. Indeed, this technology may require further use of authorization means to only allow authorized users to access and use the voice template technology and data. An additional necessity may be to have means for verifying that voices heard are either real or templated, in order to ensure against fraudulent or unauthorized use of such created voices. Legal mechanisms may need to be created to recognize this realm of technology, in addition to the licensing, contract, and other mechanisms already in existence in most countries.
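One conceivable safeguard for template data is a keyed signature that lets an authorized system detect tampering or unauthorized copies. The sketch below uses HMAC-SHA256 from the Python standard library. This is an assumed mechanism chosen for illustration, not one named in this disclosure, and the template fields and key are hypothetical.

```python
import hashlib
import hmac
import json


def sign_template(template, key):
    """Compute an HMAC-SHA256 signature over a canonical JSON form of the template."""
    payload = json.dumps(template, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_authentic(template, signature, key):
    """Check the signature using a constant-time comparison."""
    return hmac.compare_digest(sign_template(template, key), signature)


key = b"demo-key-for-illustration-only"  # hypothetical; never hard-code real keys
template = {"voice_id": "AA", "pitch_hz": 110.0}
signature = sign_template(template, key)
tampered = dict(template, pitch_hz=90.0)
```

Here `is_authentic(template, signature, key)` holds for the untouched template and fails for `tampered`. Verifying whether a voice that is heard is real or templated is a separate problem that a signature on stored data does not address.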
In Figure 1, connection means 41 represents pathways for energy or data flow which may be actual leads, light channels, or other electronic, biologic or other activatable paths among system components. In one embodiment power means 44 is shown within system 10, but may also be remote if desired.
In another embodiment of system 10, the algorithm, signal, code means or template which is created in whole or in part may be returned for storage or refinement within either storage means 22, template means 19, or other system component or architecture. This capability permits and facilitates improvement or adaptation of the specific voice template according to the instructions of the creator or another user. This could be accomplished, for example, if multiple data sets of the same person's voice could be inputted over time, or if different ages, development, or other changes to physiology or temperament of the originator of the voice occur. Indeed, it is possible to train the templated voice to recall the context of previous engagements and to include such knowledge in future operations. In these instances it may be useful to select a refinement mode to retrieve voice AA1 template (AAt) and refine the voice or template with a comparison and update using the analysis means 22 or input means 29. Yet another example includes location of a person with a voice BB that comprises one or more voice characteristics that are similar to voice AA, which was the originator for voice template AAt. In this case it may be useful to input the one or more similar characteristics from voice BB as either limited or general refinement inputs to voice AA1 or voice template AAt. It is then possible to also retain voice BB and create a voice BB1 and voice template BBt, either of which may be useful at a future date. Another example includes creation of a database of variously refined voices for a single originator of the voice, useful on demand or as appropriate by system or user, according to the situation that is presented. In yet another example, a service may be offered to voice match and provide suitable refinement tools, such as natural or artificially generated waveforms or other acoustic or signal elements, to refine voice templates according to the user's desires.
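Refinement of a stored template with later input can be sketched as a weighted blend of old and new measurements. The characteristic names and the blending weight are hypothetical, and a weighted average is only one assumed update rule; the disclosure leaves the refinement method open.

```python
def refine(template, new_capture, weight=0.2):
    """Blend new measurements into a copy of the template.

    weight is the share given to the new capture; characteristics seen
    for the first time are adopted as-is.
    """
    refined = dict(template)  # leave the stored template untouched
    for name, value in new_capture.items():
        if name in refined:
            refined[name] = (1 - weight) * refined[name] + weight * value
        else:
            refined[name] = value
    return refined


# Illustrative values only.
aat = {"pitch_hz": 110.0, "speech_rate": 4.0}
later_capture = {"pitch_hz": 100.0, "breathiness": 0.3}
aat_refined = refine(aat, later_capture)
```

Because `refine` returns a new dictionary, both the original and the refined template can be retained, which fits the idea of keeping a database of variously refined voices for a single originator.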
Prior to describing further embodiments of system 10 or related systems and methods, it is useful to examine possible applications of this technology. In general, there are applications so numerous as to be difficult to list them all. However, it is contemplated that any use of a voice-like noise, which is generated by data provided to and data resulting from a template or coding tool for creation of that voice-like noise, is captured within the scope of this invention, particularly when such coding tool is used with other noise or sound generating means, if needed, to re-create a voice sound that is virtually identical to the originator's actual voice. The use of the generated voice in completely new sentences, or other language structures, is also within the scope of this invention. The ability to provide machine, component, or computer readable code means as part of the signal forming or transmitting of the voice template process or product further facilitates use of this technology. Means to tie or activate use of this voice templating and voice generating technology to streaming or other forms of data allows for virtual dialog, which may be adaptive and intelligent, as well as merely informational or reactive, and with such dialog or conversations being with voices selected by the user. It is also recognized that the technology herein disclosed may be utilized with visual images as well as aural sounds.
Moreover, it is believed that a voice template as described herein may be created using data that does not include an actual enabling portion of an originator's voice, but that the enabling portion of the originator's voice may be used, possibly with other data, to validate the replication accuracy of the originator's voice. In this manner, it is possible to either use an enabling portion of a voice in either the templating of the voice or merely in the validation of the accuracy of an otherwise templated voice. A templated or replicated voice may be used to interact with or prompt users of computers or other machines and systems. The user may select such templated voice from either her own library of templated voices, another source of templated voices, or she may simply create a new voice. For example, templated voice AA1 may be selected by the user for voicemail prompts or reading of texts, or other communication interface, whereas templated voice CC may be selected for use in relation to an interactive entertainment use. Troubleshooting or problems lurking in the user's machine, or alerting signals to a user of a device, may be identified or resolved by the user while working with templated voice DD. These are simply examples of how this technology will enable improved user interface and association by the user with functions, tasks, modes or other features by use of templated voice technology. Template selection and use, and generated voice creation and use may be accomplished either within the user's machine or device, partially within the user's machine or device, or external of the user's machine or device. There may be instances of only temporal use of one or more devices, such as in a hotel room, a visiting office, or other transient scenario or with a temporary device use, but which nevertheless provides the above features in the above-varied manner. 
For example, a traveler may wish to carry or access certain voices for accompaniment of the traveler on aircraft, or in hotel rooms. The invention may be useful in hospital or hospice rooms, or other locations. These uses are possible with one or more of the embodiments herein. Interestingly, this system may also be used by some individuals on their own voice and given as a legacy to others. Many other uses are within the scope of the teachings herein.
Other uses of the inventions disclosed herein include education, such as teaching children and others about historical events using a templated voice of choice. For example, if a parent desired her child to learn about race relations in the United States in the decade of the 1960s using the voice of one of the child's deceased grandparents, then the templated voice of the selected grandparent (if available) would be designed, manufactured and designated for use. System 10 would access one or more databases to harvest information and knowledge about the designated topic and provide that information to one or more databases within system 10, such as situational database 33, for use as needed. The grandparent's templated voice EE1 would be used, following access to the desired information, and the demand request would be met by the templated voice EE1 commencing a discussion on the designated topic when desired. Such discussion can be saved for later use within system 10 or at a remote location as desired, or the discussion may be interactive between the "grandparent", i.e. the templated voice, and the child. This feature is possible by use of a voice recognition module to know in advance of the discussion the identity of the child's voice and to include adequate vocabulary and neural cognition of the various question combinations likely from the child. In addition, a bridge would be provided from the input and voice recognition module to the templated voice portion of the system, to enable responsiveness by the templated voice. Various speech recognition tools are conceivable for use in this manner, when so configured according to the novel uses described herein. Of course, this configuration also requires means to rapidly search for the answer to the question and to formulate a response appropriate to the listening child. 
Clearly this example illustrates the extraordinary potential of this technology, particularly when combined with suitable data, system power, and system speed.
Alternatively, using the optional voice recognition module, it is possible to utilize only limited features to enable a listener of a templated voice to direct the generated voice to cease or continue, or to enable certain other features with certain commands. This would be a form of limited interactive mode appropriate for some but not all types of use. Even if the user chose not to use the optional features and instead merely arranged for a story or a discussion in the absent grandparent's voice, the effect and utility would be enormous for this and other types of uses.
In the event the user wishes to only use a templated voice consistent with the education and life experiences of the originator of that voice, then such is possible through input of various filters or modifiers. For example, the templated voice may again be that of the grandparent selected above (templated voice EE1), and the filter of DATA DATES is used with a selected date of "BEFORE DECEMBER 1963" for a discussion of race relations in the United States in the decade of the 1960s. The result would be a discussion that would not include any information that occurred after the designated date. In this example, the "grandparent" could not discuss the Voting Rights Act of 1965 or the urban riots of the late 1960s in that country. In similar fashion it is possible to adjust the numerous different aspects of the data or the templated voice itself, for example using the characteristics type of data shown in Figure 4. It is recognized, however, that other adjustments are possible and contemplated within the scope of the inventions herein, and that the above examples are merely representative of the capabilities of the invented technology.
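The DATA DATES filter described above can be illustrated with a minimal sketch. It assumes that each harvested item of topic knowledge carries the date of the event it describes. The `Fact` structure, the function name `apply_data_dates_filter`, and the reading of "BEFORE DECEMBER 1963" as a cutoff of 1 December 1963 are hypothetical choices made for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One harvested item of topic knowledge (hypothetical structure)."""
    text: str
    occurred: date  # date of the event the item describes


def apply_data_dates_filter(facts, before):
    """Keep only the facts whose event date falls before the cutoff."""
    return [fact for fact in facts if fact.occurred < before]


# Illustrative topic data for the race-relations example.
FACTS = [
    Fact("March on Washington", date(1963, 8, 28)),
    Fact("Voting Rights Act signed", date(1965, 8, 6)),
]

# "BEFORE DECEMBER 1963" is read here as a cutoff of 1 December 1963.
allowed = apply_data_dates_filter(FACTS, before=date(1963, 12, 1))
print([fact.text for fact in allowed])
```

With this cutoff only the 1963 item remains available to templated voice EE1, which matches the behavior described for the Voting Rights Act of 1965.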
In another embodiment of the systems and methods disclosed herein, a user may direct a templated voice of a loved one or someone else to read to the user. In this example it is possible for people of all ages to have books read to them in the voice of an absent or deceased family member or other person known to the user. When combined with a vast array of properly configured media and computer readable code means to implement the data links, this innovation alone will provide enormous benefit to users. This type of use has wide applications beyond the specific example just provided. Indeed, an even broader use of this technology in this manner is to have available a database of authorized and templated voices which may be accessible and useable by others for a fee or other form of compensation. When used for music, this technology has similar profound implications, particularly if one can access templated voices of past and present singers of renown, many of whose voices are still available for templating. Clearly, this technology enables a new industry of manufacturing, leasing, purchasing, or otherwise using voice templates and associated means, techniques and methods of conducting business therewith.
The invention may also have utility in medical treatments for certain minor or major psychological ailments, for which proper use of templated voice therapy may be quite palliative or even therapeutic. Yet another possible use of this technology is to create a newly designed voice for use, but one which has a basis or precursor in one or more templated voices from actual mammalian origin. Ownership and further use of the newly created voice may be controllable under various means or legal enforcement, such as licensing or royalties and the like. Of course, such voices may be retained as private possessions for limited use by the creator as well. One can imagine the nature of such libraries which may be created. Such voices will represent the creative aspirations of the creator, but each voice will actually have a component or strain of actual mammalian voice as a basis through use of the templating tool or code, similar to a strand of tissue DNA but applicable to a specific voice. This type of combination presents powerful new communication capabilities and relationships based on voice and other sounds created by mammals.
Systems according to the invention may be handheld or of other size. Systems may be embedded in other systems or may be stand alone in operation. The systems and methods herein may have part or all of the elements in a distributed, network or other remote system of relationship. Systems and methods herein may utilize downloadable or remotely accessible data, and may be used for control of various other systems or methods or processes. Embodiments of the invention include exposed interface routines for requesting and implementing the methods and operations disclosed herein but which may be carried out in whole or in part by other operating or application systems. The templating process and the use of templated voices may be accomplished and used by either mammals or artificial machines or processes. For example, a bot or other intelligent aide may create or use one or more templated voices of this type. Such an aide may also be utilized to search for voices automatically according to certain general or limited criteria, and may then generate templated voices in voice factories, either virtual or physical. In this manner, large databases of templated voices may be efficiently created. In this or similar systemic use, it may be desirable to create and apply data or other types of tagging and identification technology to one or more portions of the actual voice utilized to create a templated voice.
The following are examples of applications using the technology disclosed herein. These are not meant to be limiting, but rather are provided as representative possible uses in addition to those enabled and otherwise suggested elsewhere in this disclosure.
Example 1
A templating process using elements of the embodiments herein yields a voice coding signal, comprising the logic structure of characteristics of a specific voice essential for accurately replicating the sound of that voice.
Example 2
A personal computer prompter and updater, status reporter, or mate using one or more selected voices using the technology herein.
Example 3
A home energy monitor, reporter, or mate, using one or more selected voices using the technology herein.
Example 4
A hotel room assistant, or automobile assistant to prompt the user according to desired prompting, such as for example a wake-up call in a hotel in the voice selected by the user. In similar manner, an operator of a vehicle might receive information in the voice or voices selected by the user.
Example 5
Using one or more selected voices using the technology herein in a personal digital assistant, a handheld personal computing device, or other electronic device or component at any time for voice capture, mate, alerter, etc.
Example 6
Creating or managing one or more selected voices or voice templates in computer/electronic chip logic, instructions, or code means for implementing the business and technology methods and manufactures disclosed herein.
Example 7
Using the voice template technology in combination with other visual media, such as with a photograph, digital video or a holographic image.
Example 8
Using the technology disclosed herein with a flash-memory based profile card for plug-in with any device that can record, play, or reconstitute a voice.
Example 9
Using the technology disclosed herein with a personal device that scans and updates downloadable information for a user as desired in voice or voices of one's choosing. For example, this may be useful for organizing actions capable of being done by a bot, such as an info-bot for background searching and interface while the user is not available and then reporting status to the user in one or more designated voices using the technology herein.
Example 10
Using the technology disclosed herein in combination with one or more components of a vehicle or other transportation system.
Example 11
Using the technology disclosed herein with one or more components of an airplane for an in-flight companion.
Example 12
Using the technology disclosed herein as a safety reminder when used with one or more components of gear or equipment in the workplace, such as a personal computer posture monitor, electrical equipment, dangerous equipment, etc.
Example 13
Using the technology disclosed herein as an add-on to other voice activated systems, such as dictation devices, as prompts, companions, or text readers.
Example 14
Using the technology disclosed herein as a social mediation or control mechanism, such as a tool against road rage or other forms of anger and frustration, activatable by the driver, automatically, or by other means.
Example 15
Using the technology disclosed herein as a teaching tool in home, school or the workplace.
Example 16
Using the technology disclosed herein for inspirational readings.
Example 17
Using the technology disclosed herein as a tool to act as a family history machine.
Example 18
Using the technology disclosed herein as a MusicMatch™ brand of voice sourcing and matching technology for singers with best or desired voice.
Example 19
Using the technology disclosed herein as a VoiceSelect™ brand of movie or video match technology to utilize preferred voices for templating of entertainment script already used by the original performer or subsequently created for voice template technology combination uses.
Example 20
Using the technology disclosed herein as an "alter ego" device such as a handheld unit which engages on "SelectVoice™" brand or "VoiceX™" brand mode(s) of operation and has a database of images of those who match the voice as well as anonymous models which can be selected, similar to that referred to in Example 7.
Example 21
Using the technology disclosed herein to create a profile of a profiled or templated voice.
Example 22
Using the technology disclosed herein as a bedtime reader or a night mate in a dwelling for monitoring and interactive security.
Figure 2 is a flow diagram of one embodiment of a voice capture subsystem which may comprise computer readable code means or method for accomplishing the capture, analysis and use of a voice AA designated for templating. Figure 3 is one embodiment of a voice analysis subsystem which may comprise logic or method means for efficiently determining voice data characterization routing. In these embodiments, voice AA is captured in acquisition module or step 103 and then routed by logic steps and data conductive pathways, such as pathway 106, through the templating process. Capture may be accomplished by either digital or analog methods and components. The signal which then represents captured voice AA is routed through analysis means 111 or method to determine whether an existing voice profile or template matches voice AA. This may be accomplished, for example, by comparing one or a plurality of characteristics (such as those shown in voice characterization subsystem 113 of Figure 4) as determined by either acquisition module 103 or analysis means 111, and then comparing those one or more characteristics with known voice profiles or templates available for access, such as at analysis step 111. Representative feedback and initial analysis loop 114 facilitates these steps, as does pathway 116. Such comparison may include querying of a voice profile database or other storage medium, either locally or remotely. The analysis step at analysis module 111 and voice characterization subsystem 113 may be repeated according to algorithmic, statistical or other techniques to affirm whether the voice being analyzed does or does not relate or match an existing voice profile or data file. Figure 4 provides further detail of voice characterization subsystem 113.
Referring again to Figure 2, if the signal corresponding to voice AA does not have a match or is not identified with an existing voice profile set, then the signal is routed to the voice characterization subsystem for comprehensive characterization. However, if an existing voice profile data file matches the profile signal of voice AA, then creation of a template may not be required at module/step 127. In that situation, the signal might be analyzed and/or characterized for possible generation of a revised profile or template, which itself may then be stored or applied. This situation might occur, for example, when additional characterization data is available (such as size of enabling portion, existence or lack of stress, or other factors) which had not been previously available. Accordingly, a specific voice data file might comprise a plurality of templates. This is a validation process, having logic steps and system components shown generally at validation subsystem 133 in Figures 2 and 3. It is emphasized that, as to relational location to subsystems and components, these Figures are generally schematic. Also, as shown in Figure 3, after determination that a voice profile data file exists (step 137), the validation logic at step 139 will, optionally, occur. If a revision of an existing template is merited, then it is generated at step 142. Alternatively, logic step 145 notes that no revision to an existing template is to be made. Following either step 143 or 145, the new, revised, or previous voice profile or template is stored or used at step 155.
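The matching and validation logic of Figures 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: voice characteristics are reduced to a numeric feature vector, cosine similarity with a fixed threshold stands in for the unspecified comparison technique, and the names `similarity`, `find_matching_profile` and `process_voice` are hypothetical.

```python
import math


def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_matching_profile(features, profiles, threshold=0.95):
    """Return the id of the best profile at or above the threshold, else None."""
    best_id, best_score = None, threshold
    for profile_id, templates in profiles.items():
        for template in templates:
            score = similarity(features, template)
            if score >= best_score:
                best_id, best_score = profile_id, score
    return best_id


def process_voice(features, profiles, new_data_available=False, threshold=0.95):
    """Create, revise, or leave unchanged a voice profile for the captured voice.

    `profiles` maps a profile id to a list of templates, since one voice
    data file may comprise a plurality of templates.
    """
    profile_id = find_matching_profile(features, profiles, threshold)
    if profile_id is None:
        # No match: characterize the voice and create a new template.
        profile_id = f"voice-{len(profiles) + 1}"
        profiles[profile_id] = [list(features)]
        return profile_id, "created"
    if new_data_available:
        # Match found and revision merited: add a revised template.
        profiles[profile_id].append(list(features))
        return profile_id, "revised"
    # Match found and no revision is to be made.
    return profile_id, "unchanged"
```

In this sketch a caller would pass `new_data_available=True` when, for example, a larger enabling portion has been captured since the existing template was made.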
The template creation module/step 127 of Figure 2 comprises utilizing the voice characterization subsystem to create a unique identifier, preferably a digital identifier, for that specific voice being templated or profiled. This data is similar, in the abstract, to genetic codes, gene sequence codes, or bar codes, and like identifiers of singularly unique objects, entities or phenomena. Accordingly, applicants refer to this voice profile or template as "Voice Template Technology™" as well as "Voice DNA™ or VDNA™" and "Voice Sequence Codes™ or Voice Sequence Coding™". The terms "Profile, Profiles or Profiling" and derivative terms may be substituted in the above trademark or other reference terms for this new technology. Following completion of template creation, the voice template may be stored (shown at storage module or step 161) or applied in use (shown at module or step 164).
Figure 4 is a schematic representation of a voice characterization subsystem. This disclosure comprises at least one embodiment of characterization data and means for determining and characterizing salient data to define a voice using voice templating or profiling, as disclosed herein. As shown, various types of data are available for comparison in formulating the characterization data. This characterization data will then be used to create the voice template or profile according to coding criteria. Although the data in Figure 4 appears to be arranged in discrete modules, an open comparator process may be preferred in which any data may be accessed for comparison in any of various sequences or weighted priorities. Regardless, as shown in this figure, data may comprise the categories of language, gender, dialect, region, or accent (shown as "Voice Characteristics" output signal VC0 at module or step 201); frequency, pitch, tone, duration, or amplitude (shown as output signal VC1 at module or step 203); age, health, pronunciation, vocabulary, or physiology, either genetic or otherwise (shown as output signal VC2 at module or step 205); patterns, syntax, volume, transition, or voice type (shown as output signal VC3 at module or step 207); education, experience, phase, repetition, or grammar (shown as output signal VC4 at module or step 209); occupation, nationality, ethnicity, custom or setting (shown as output signal VC5 at module or step 211); context, variances, rules/models, enabling portion type, size or number (shown as output signal VC6 at module or step 213); speed, emotion, cluster, similarities, or acoustic model (shown as output signal VC7 at module or step 215); math model, processing model, signal model, sounds-like model, or shared model (shown as output signal VC8 at module or step 217); vector model, adaptive data, classifications, phonetic, or articulation (shown as output signal VC9 at module or step 219); segments, syllables, combinations, self-learned, or silence (shown as output signal VC10 at module or step 221); packets, breathing rate, timbre, resonance, or recurrence model (shown as output signal VC11 at module or step 223); harmonics, synthesis models, resolution, fidelity, or other characteristics (shown as output signal VC12 at module or step 225); or various other techniques for uniquely identifying a portion (whether fractional or in its entirety) of a voice. For example, this may further include a digital or analog voice signature, modulation, synthesizer input data, or other data formed or useful for this purpose, all of which is shown as output signal VCX at module or step 227.
It is recognized that one or more data types from any one or more modules or steps may provide value to a voice template. Also, for purposes of this invention, VCX encompasses any known categorization technique at the time of interpretation, regardless of mention herein, provided it is useful in then defining a unique voice profile or template for a specific voice and is used according to the novel teachings disclosed herein. Again, it is recognized that data combined in voice characteristic files and output signals VC0, VC1, VC2, VC3, VC4, VC5, VC6, VC7, VC8, VC9, VC10, VC11, VC12, and VCX may be prioritized and combined in various ways in order to accurately and efficiently analyze and characterize a voice, with VCX representing still further techniques incorporated herein by reference.
Another goal of this technology is to identify protocols for reconstituting encoded voice DNA with content into a voice recognizable as an originator. The voice DNA will be combined with text or similar input data in a manner that is likely similar to compiling steps in known text-to-speech synthesizers. A key difference is the addition of the very specific voice DNA template, which functions as a unique recipe or filter for each generated voice.
Figures 5 and 6 illustrate an exemplary signal bundler suitable for receiving the various voice characteristic data, such as digital or coded data representative of the information deemed relevant and formative of the voice being templated. The signal bundler 316 then combines the output of signal content module or step 332 and values/scoring from one or more signals VC0 - VCX and formats the signal or code at module or step 343 as appropriate for proper transfer and use by various potential user interfaces, devices or transmission means to create an output voice template, code, or signal VTX. It is recognized that various methods are possible to create a unique identifier to delineate the various voice characteristics, and that such various possibilities are enabled herein in view of the broader context and scope of this invention, to a certain degree independent of some component methodology.
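One way to picture the bundling and formatting steps is the sketch below. It assumes the characteristic signals arrive as a mapping from signal name (VC0 through VCX) to numeric values. The JSON encoding, the per-signal weights, and the SHA-256 digest used as the identifier are illustrative stand-ins chosen here, not the coding criteria of the disclosure, and `bundle_signals` is a hypothetical name.

```python
import hashlib
import json


def bundle_signals(content, vc_signals, weights=None):
    """Combine content and characteristic signals into one template record.

    Returns a dict with a canonical encoded payload and a digest of that
    payload serving as the template identifier (a stand-in for VTX).
    """
    weights = weights or {}
    characteristics = {
        name: {"values": values, "weight": weights.get(name, 1.0)}
        for name, values in vc_signals.items()
    }
    payload = {"content": content, "characteristics": characteristics}
    # Sorted keys and fixed separators make the encoding independent of the
    # order in which the signals were supplied.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(encoded.encode("utf-8")).hexdigest()
    return {"id": digest, "payload": encoded}


template = bundle_signals(
    content="enabling portion AA",
    vc_signals={"VC1": [220.0, 0.8], "VC0": [1, 0, 3]},
    weights={"VC1": 2.0},
)
print(template["id"][:16])
```

Because the encoding is canonical, the same set of signals yields the same identifier regardless of input order, while any change in the underlying data produces a different one.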
Another goal of this technology is to identify the use of specific genetic codes and/or gene sequence codes to assist in the unique classification of a particular speaker's voice, and to enhance the fidelity of any voice re-created using a voice DNA product. In the 1990s, there was significant development in the language impairment gene search, culminating in the more recent FOXP2 isolation/identification as having a role in language and speech development. When we began developing our technology, we recognized the likelihood that a specific gene or genetic code might influence our results. We continue to believe that such research provides clues and possibly tools for enhancing the classification of the enabling portion of an original voice.
Such use is an example of where the technology of our company will lead. No entity has ever suggested analyzing this basic genetic code and then correlating that information into a novel classification system for a voice template. The effect of this is to draw upon the genetic descriptors, combine those with the learned voice-traits of the individual, and then further combine this data with other ambient or environmental factors of the individual. The resulting "voice DNA code" provides a level of specificity of that particular voice that, in principle, should be an extraordinary match of that person.
As noted above, an additional opportunity exists because a derivative of understanding and classifying the mechanisms of individual speech is improved understanding of the specific brain circuitry that underlies such faculty in each person. The cognitive information processing role, per subject, is a basis of the interpretation and formation of speech. This technology will also utilize this basic element as a further enhancement of the specificity of each person's voice DNA, mindful of identifying future opportunities for research in communication among human-human and human-machine neural interfaces. While this realm is still nascent as to its impact on our digital classification system and coding of each person's voice sound/characteristics, it is believed that this will be a productive and worthwhile area of our technical exploration and interface analysis.
Another goal is to determine the best methodology for distribution, validation and use-control of voice DNA. It is recognized, of course, that the implications of this technology are vast, and safeguards will be necessary to maintain the proper use of this templated voice technology. Indeed, this technology may require further use of authorization protocols to only allow authorized users to access and use the voice template technology and data. An additional necessity may be to have mechanisms for verifying that voices heard are either real or templated, in order to ensure against fraudulent or unauthorized use of such created voices.
Systems using this technology effort may be handheld or of other size. Systems may be embedded in other systems or may be stand alone in operation. The systems and methods herein may have part or all of the elements in a distributed, network or other remote system of relationship. Systems and methods herein may utilize downloadable or remotely accessible data, and may be used for control of various other systems or methods or processes. Embodiments of the technology include exposed interface routines for requesting and implementing the methods and operations disclosed herein but which may be carried out in whole or in part by other operating or application systems. The templating process and the use of templated voices may be accomplished and used by either mammals or artificial machines or processes. For example, a bot or other intelligent aide may create or use one or more templated voices of this type. Such an aide may also be utilized to search for voices automatically according to certain general or limited criteria, and may then generate templated voices in voice factories, either virtual or physical. In this manner, large databases of templated voices may be efficiently created. In this or similar systemic use, it may be desirable to create and apply data or other types of tagging and identification technology to one or more portions of the actual voice utilized to create a templated voice or to use voice tags for marking other data.
Figure 7 is a representative organization and method of an electronic query and transfer between a voice template generation or storage facility 404 and a remote user. In this representation, enabling portions may be sent to a remote voice template generation or storage facility 404 by any number of various users 410, 413, 416. The facility 404 then generates or retrieves a voice template data file and creates or retrieves a voice template signal. The template signal is then transmitted or downloaded to the user or its designee, shown at step 437. At the time of download, or later, following a user request 441, the template signal is formatted for appropriate use by a destination device, including activation instructions and protocols, shown at step/module 457.
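The query and transfer sequence of Figure 7 can be sketched in a few lines. Everything named here is hypothetical: the class `VoiceTemplateFacility`, the placeholder template contents, and the activation fields are assumptions made only to show the order of operations (submit an enabling portion, generate or retrieve the template, then format it for a destination device on request).

```python
class VoiceTemplateFacility:
    """Stand-in for a voice template generation or storage facility."""

    def __init__(self):
        self._templates = {}

    def submit(self, voice_id, enabling_portion):
        """Generate a template on first submission, otherwise retrieve it."""
        if voice_id not in self._templates:
            self._templates[voice_id] = self._generate(enabling_portion)
        return self._templates[voice_id]

    @staticmethod
    def _generate(enabling_portion):
        # Placeholder analysis: a real facility would characterize the voice.
        return {
            "samples": len(enabling_portion),
            "checksum": sum(enabling_portion) % 65536,
        }


def format_for_device(template, device):
    """Wrap a template signal with assumed activation instructions."""
    return {
        "device": device,
        "template": template,
        "activation": {"protocol": "hypothetical-v1", "enabled": True},
    }


facility = VoiceTemplateFacility()
signal = facility.submit("voice-AA", b"\x01\x02\x03")        # user sends enabling portion
package = format_for_device(signal, device="hotel-door-card")  # formatted on user request
print(package["template"])
```

Keeping the formatting step separate from generation mirrors the description, in which formatting may occur at the time of download or later, following a user request.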
Figure 8 is a schematic representation of a mobile medium, such as a card, disk, or chip on which are essential components, depending on the user mode and need, for utilizing voice template technology. For example, using Figures 7 and 8, a hotel door card 477 may be provided at check-in to a hotel by a traveler. However, in addition to the normal onsite security code programming and circuitry 479 applied to the card, additional features incorporating aspects of this invention may be made available. A schematic representation of optional features within such a card includes means 481 for receiving and using a voice template for a voice or voices selected by the traveler for various purposes during the traveler's stay at the hotel. As shown, such features may include a template receiving and storage element 501, a noise generator or generator circuitry 506, a central processing unit 511, input/output circuitry 515, digital to analog/analog to digital elements 518, and clock means 521. Again, various other elements may be utilized, such as voice compression or expansion means (such as those known in the cellular phone industry) or other components to enable the card to function as desired. The user may then enjoy dialog or interface with inanimate devices within the hotel in the voice(s) selected by the traveler. Indeed, a traveler profile may even retain such voice preference information, as appropriate, and certain added billings or benefits may accrue through use of this invention. It is recognized that the invention may be employed in a wide variety of applications and articles, and the example of Figures 8 and 9 should not be considered limiting.
Figure 9 is a depiction of a photograph 602 which is configured for interactive use of voice template technology with voice JJ attributable to figure FJJ and voice KK attributable to figure FKK. Means are combined with the frame 610 or other structure, whether computer readable code means or simple three dimensional material, for interfacing the subjects or objects of the photo (or other media) with the appropriate voice templates to recreate a dialogue that either likely occurred or could have occurred, as desired by the user.
It is recognized that various means and methods exist to capture, analyze, and synthesize real and artificial voice components. For example, the following United States patents, and their cited or listed references, illustrate a few of the means for capturing, synthesizing, translating, recognizing, characterizing or otherwise analyzing voices, and are incorporated herein in their entirety by reference for such teachings: 4,493,050; 4,710,959; 5,930,755; 5,307,444; 5,890,117; 5,030,101; 4,257,304; 5,794,193; 5,774,837; 5,634,085; 5,704,007; 5,280,527; 5,465,290; 5,428,707; 5,231,670; 4,914,703; 4,803,729; 5,850,627; 5,765,132; 5,715,367; 4,829,578; 4,903,305; 4,805,218; 5,915,236; 5,920,836; 5,909,666; 5,920,837; 4,907,279; 5,859,913; 5,978,765; 5,475,796; 5,483,579; 4,122,742; 5,278,943; 4,833,718; 4,757,737; 4,754,485; 4,975,957; 4,912,768; 4,907,279; 4,888,806; 4,682,292; 4,415,767; 4,181,821; 3,982,070; and 4,884,972. None of these references illustrates the inventive contributions claimed or elsewhere disclosed herein. Rather, the above patents illustrate tools that may be useful rather than necessary in practicing one or more embodiments of this invention. Thus, it is recognized that various systems, products, means, methods, processes, data formats, data related storage and transfer media, data contents and other aspects are contemplated within this invention to achieve the novel and nonobvious innovations, advantages, products and applications of the technology disclosed herein. Therefore the above disclosures shall be considered exemplary rather than limiting, where appropriate, so that the claims are afforded the breadth of scope to which this pioneering technology should be entitled without limitation by the pace of development and availability of implementing technologies.

Claims

What is claimed:
1. A system for capturing an enabling portion of a specific voice sufficient for using that portion as a template in further use of the voice, comprising: a. a subsystem for capturing an enabling portion of a voice in a form useful for analysis as to voice characteristics; b. an analysis subsystem for receiving and analyzing the captured voice and for characterizing elements of the captured voice as characterization data; c. a storage subsystem for receiving characterization data from the analysis means for a specific voice; and d. a retrieval subsystem for retrieving the analysis and characterization data for further use.
2. The system of claim 1 in which the subsystem for capturing the voice comprises digital recording means.
3. The system of claim 1 in which the subsystem for capturing the voice comprises a flash memory card.
4. The system of claim 1 in which the subsystem for capturing the voice comprises analog recording means.
5. The system of claim 1 in which the subsystem for capturing the voice comprises input means for receiving a live voice and for transmitting that live voice to the analysis means.
6. The system of claim 1 in which the analysis subsystem comprises digital data storage means.
7. The system of claim 1 in which the analysis subsystem comprises components for identifying specific patterns, syntax, frequency, pitch and tones of speech in the captured voice data.
8. The system of claim 1 in which the analysis subsystem comprises components for identifying specific vocabulary, pronunciation, or accent unique to the captured voice.
9. The system of claim 1 in which the analysis subsystem comprises components for identifying specific features unique to the captured voice deriving principally from specific anatomic structures or also from genetic sequence information of the originator of the voice.
10. The system of claim 1 in which the analysis subsystem comprises components for determining the vocabulary of the originator of the captured voice.
11. The system of claim 10 in which the analysis subsystem comprises components for setting the vocabulary as characterization data for use in forming a future templated voice.
12. The system of claim 1 in which the analysis subsystem comprises digital processing apparatus for digitally processing input data in the form of a voice or digital representation of a recorded voice.
13. The system of claim 1 in which the analysis subsystem comprises second input components for receiving additional data regarding the physiology of the voice originator.
14. The system of claim 13 in which the analysis subsystem second input components comprises digital signal processing structure suitable for selectively receiving audio or other data comprising visualization information on the morphology of the voice originator.
15. The system of claim 1 in which the analysis subsystem comprises comparison components for comparing an input voice data set with stored data comprising age data, language data, educational data, gender data, occupation data, accent data, nationality data, ethnic data, voice type data, custom data and setting data.
16. The system of claim 1 in which the analysis subsystem comprises third input components for receiving data regarding the voice originator comprising age data, educational data, gender data, occupation data, accent data, nationality data, ethnic data, voice type data, custom data, language data, genetic sequence data and setting data.
17. A method of creating a voice-like noise which is identical in sound to an actual specific human's voice, comprising the steps of: a. capturing an enabling portion of a specific human's voice for storage and use; b. storing the enabling portion of the specific human's voice; c. analyzing the enabling portion to identify essential components or characteristics of the captured voice; and d. utilizing the identified essential components or characteristics to create a new voice which, when assigned data from one or more database means and when heard, sounds identical in all respects to the voice of the specific human's voice to a listener having normal aural discretion abilities.
18. The method of claim 17 in which the analyzing step comprises the steps of identifying the components in the captured enabling portion of the specific human's voice relating to at least one of the components including frequency, tone, pitch, volume, accent, gender, harmonic structure, acoustic power, phonetic or timing accent, power and periodicity.
19. The method of claim 18 in which the step of capturing an enabling portion of a specific human's voice for storage and use includes capturing either larynx generated noise or turbulence generated noise of the specific human's voice.
20. A method of accurately replicating a human voice comprising the steps of:
a. identifying a minimum size data set comprising a combination of words, sounds or phrases which must be emitted by the originator of a voice to be replicated; b. capturing the emission of the combination of words, sounds or phrases by the originator of the voice to be replicated in a medium; c. analyzing the captured emission to identify voice characteristics of the originator of the voice sufficient to allow artificial generation of the voice, using the identified characteristics, so that the artificially generated voice is substantially identical in all respects to a listener having normal aural discretion abilities when the listener hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice.
21. An article of manufacture comprising: a. a computer usable medium having computer readable program code means embodied therein for causing replication of a human voice, the computer readable program code means in said article of manufacture comprising: b. computer readable program code means for causing a computer to effect an analysis of a captured enabling portion of an originator's voice to identify voice characteristics data sufficient to allow artificial generation of the voice; and c. computer readable program code means for causing use of the identified voice characteristics data to artificially generate a voice, so that the artificially generated voice is substantially identical in sound and usage to a listener when the listener hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice.
22. The article of manufacture of claim 21 further comprising computer readable program code means for storing the generated voice for later use.
23. The article of manufacture of claim 21 further comprising computer readable program code means for using the voice characteristics data to create a voice profile of the originator of the voice.
24. The article of manufacture of claim 21 further comprising computer readable program code means for accessing data base means for storing data comprising age data, educational data, gender data, occupation data, accent data, language, nationality data, ethnic data, voice type data, custom data, genetic data, general data and setting data.
25. A computer program product for use with an aural output device, said computer program product comprising: a. a computer usable medium having computer readable program code means embodied therein for causing replication of a human voice via an output aural device, the computer program product comprising: b. computer readable program code means for causing a computer to effect an analysis of a captured enabling portion of an originator's voice to identify voice characteristics data sufficient to allow artificial generation of the voice; and c. computer readable program code means for causing use of the identified voice characteristics data to artificially generate and output a voice via an aural output device, so that the artificially generated voice is substantially identical in sound and usage to a listener when the listener hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice.
26. A computer program product for use with a display device, said computer program product comprising: a. a computer usable medium having computer readable program code means embodied therein for causing replication of a human voice and verification of the accuracy of the replicated voice displayed on the display device, the computer program product comprising: d. computer readable program code means for causing a computer to effect an analysis of a captured enabling portion of an originator's voice to identify voice characteristics data sufficient to allow artificial generation of the voice; and e. computer readable program code means for causing use of the identified voice characteristics data to artificially generate a voice and to compare the characteristics of the generated voice to the originator's voice on a display device, so that the artificially generated voice is substantially identical in sound to a listener when the display device so indicates and when a listener actually hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice.
27. A computer program product for use with an aural output device, said computer program product comprising: a. a computer usable medium having computer readable program code means embodied therein for initiating replication of a human voice via an output aural device, the computer program product comprising: b. computer readable program code means for causing a computer to receive and activate a voice characteristics data file unique to a specific voice sufficient to allow artificial generation of the voice; and c. computer readable program code means for causing use of the identified voice characteristics data to artificially generate and output a voice via an aural output device, so that the artificially generated voice is substantially identical in sound to a listener when the listener hears the generated voice and a captured emission of the originator's actual voice.
28. A computer program product for use with an electronic device, said computer program product comprising: a. a computer usable medium having computer readable program code means embodied therein for initiating replication of a human voice, the computer program product comprising: b. computer readable program code means for causing receipt and activation of a voice characteristics data file unique to a specific voice sufficient to allow artificial generation of the voice; and c. computer readable program code means for causing use of the identified voice characteristics data file and a noise generation means sound output to artificially generate a voice, so that the artificially generated voice is substantially identical in sound to the originator's actual voice.
29. A memory for storing data for access by an application program being executed on a data processing sub-system, comprising: a. a data structure stored in said memory, said data structure including information resident in a database used by said application program and including: b. at least one voice enabling portion data file stored in said memory, each of said voice enabling portion data file set containing information substantially different from any other voice enabling portion data file set; c. a plurality of voice characteristics data files containing different reference information for a plurality of voice characteristics; and d. a plurality of voice profile sets each having at least one voice profile data file having data unique to that data file only; wherein the data structure allows access to the voice characteristics data files and the voice profile data files to conduct comparison operations with at least one voice enabling portion data file.
30. A data processing system executing an application program and containing a database used by said application program, said data processing system comprising: a. CPU means for processing said application program; and b. memory means for holding a data structure for access by said application program, said data structure being composed of information resident in a database used by said application program and including: at least one voice enabling portion data file stored in said memory, each of said voice enabling portion data file set containing information substantially different from any other voice enabling portion data file set; a plurality of voice characteristics data files containing different reference information for a plurality of voice characteristics; a plurality of voice profile sets each having at least one voice profile data file having data unique to that data file only; and c. wherein the data processing system allows access to the voice characteristics data files and the voice profile data files to conduct comparison operations with at least one voice enabling portion data file.
31. A computer data signal embodied in a transmission medium comprising: a. an encryption source code for a unique voice profile template useful for keying additional electronic noise to create a specific generated voice; and b. a carrier medium suitable for carrying the encryption source code to a location and configured so that the encryption source code is removable from the carrier medium to be applied as a key to create a generated voice.
32. A method for using a selected voice as a personal voice assistant with an electronic device, comprising the steps of: a. activating electronic means for accessing a remote database; b. transmitting a signal portion to a remote database having a voice database containing a plurality of voice profile sets each having at least one voice profile data file having data unique to that data file only and identifiable by a unique identifier; c. transmitting a signal portion to the remote database to uniquely identify a desired data file and then to effect transfer of the data file content to the user's designated electronic device location; and d. implementing use of the selected and transferred data file as a voice template, in combination with appropriate noise generated either by the electronic device or other means for generating such noise, so that as desired the user may receive noise from the electronic device in the sound of the selected voice as determined by the identified voice.
33. The method of claim 32 in which the data file includes data characteristics of the selected voice arranged as computer readable program code means for causing use of the identified voice characteristics data to artificially generate a voice template.
34. The method of claim 32 in which the implementing step comprises application of authorization means to only allow authorized users to access and use the voice template technology and data.
35. The method of claim 32 in which the implementing step comprises application of selectively accessible verification means for verifying that voices heard are either real or template generated.
36. A method of doing business in which a system is used for capturing an enabling portion of a specific voice sufficient for using that portion as a template in further use of the voice, comprising the steps of: a. capturing an enabling portion of a voice in a form useful for analysis as to voice characteristics; b. inputting the enabling portion into an analysis module for characterizing elements of the captured voice as characterization data; c. receiving the characterization data from the analysis module for a specific voice; and d. storing the characterization data for further use.
37. The method of claim 36 in which the means for capturing the voice comprises digital input means.
38. The method of claim 36 in which the enabling portion of the voice is received electronically.
39. The method of claim 36 in which the characterization data is bundled to form a voice template signal useful for combining with generated noise to create a templated voice which sounds like the original specific voice.
40. The method of claim 36 in which the templated voice is controlled so that the templated voice may receive speech input commands to elicit new words in the templated voice but which were not inputted by the specific voice.
41. An automated machine for capturing an enabling portion of a specific voice and for using that portion as a template useful for further use of the templated voice, comprising: a. an acquisition module for acquiring an enabling portion of a voice in a form useful for analysis as to voice characteristics; b. an analysis module for receiving and analyzing the captured voice and for characterizing elements of the captured voice as characterization data; and c. a template generator module for automatically generating a voice template signal as a unique identifier of the acquired specific voice.
42. The machine of claim 41 further comprising communication means for communicating with storage means for receiving characterization data from a database.
43. The machine of claim 41 further comprising communication means for communicating with storage means for storing the generated template until requested.
44. An online method for creating voice templates and generating revenue for such generation, comprising: a. capturing an enabling portion of a specific voice; b. analyzing the enabling portion of the specific voice to generate a data profile which defines the characteristics of the captured voice in a way that can be reconstituted for later use; c. generating a voice template signal as a unique identifier of the acquired specific voice; and d. providing at least one generated data profile for commercial use by another.
45. A machine operated method for creating a voice template and generating revenue for such generation, comprising: a. capturing an enabling portion of a specific voice; b. analyzing the enabling portion of the specific voice to generate a data profile which defines the characteristics of the captured voice in a way that can be reconstituted for later use; c. using the data profile, generating a voice template signal as a unique identifier of the captured specific voice; and d. providing at least one voice template signal for commercial use.
46. A business method for creating a voice template, comprising: a. capturing an enabling portion of a specific voice or templated voice; b. using computer means, analyzing the enabling portion of the voice to generate a data profile which defines the characteristics of the captured voice in a way that can be reconstituted for later use; c. electronically generating or retrieving a voice template signal as a unique identifier of the captured voice; and d. providing at least one voice template signal for commercial use.
47. The method of doing business of claim 46 in which the step of providing is accomplished on an electronic data exchange.
48. A method for creating a voice template from a plurality of voices, comprising: a. capturing an enabling portion of a plurality of voices or templated voices; b. using computer means, analyzing the enabling portions of the voices to generate a data profile which defines the characteristics of the captured voices in a way that can be bundled as a single voice signal suitable for reconstitution for later use; and c. electronically generating a voice template signal as a unique identifier of the newly generated voice.
49. A method of accurately replicating a human voice of someone who lost the ability to speak in the desired normal voice, comprising the steps of: a. identifying a minimum size data set comprising a combination of words, sounds or phrases which must be emitted by the originator of a voice to be replicated; b. capturing the emission of the combination of words, sounds or phrases by the originator of the voice to be replicated in a medium; c. analyzing the captured emission to identify voice characteristics of the originator of the voice sufficient to allow artificial generation of the voice, using the identified characteristics, so that the artificially generated voice is substantially identical in all respects to a listener having normal aural discretion abilities when the listener hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice.
50. The method of claim 49 further comprising identifying the voice to be replicated by genetic code.
51. The method of claim 49 further comprising a step of validating or adjusting the artificially generated voice by use of genetic code analysis of the originator of the voice being replicated.
52. A method of accurately replicating an actual human voice, comprising the steps of: a. identifying a minimum size data set comprising a combination of fractions or segments of actual words, sounds or phrases which were emitted by the originator of the voice to be replicated; b. capturing the emission of the combination of words, sounds or phrases by the originator of the voice to be replicated in a medium; c. analyzing the captured emission to identify voice characteristics of the originator of the voice by analysis of fractions or segments of the words, sounds or phrases sufficient to allow artificial generation of the voice, using the identified characteristics, so that the artificially generated voice is substantially identical in all respects to a listener having normal aural discretion abilities when the listener hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice.
53. A method for creating a voice template from a plurality of voice fragments, comprising: a. capturing an enabling portion of a plurality of voice fragments; b. using computer means, analyzing the enabling portions of the voice fragments to generate a voice fragment data code which defines the characteristics of the captured voice fragments in a way that can be bundled as a single voice signal suitable for reconstitution for later use; and c. electronically generating a voice template signal as a unique identifier of the newly generated voice.
54. A system and method of remote communication-based analysis of patient voices for telemedicine screening, comprising the steps of:
(a) capturing an enabling portion of a plurality of voice fragments; and (b) using computer steps and equipment, analyzing the enabling portions of the voice fragments and comparing the results of such analysis with medical condition recognition algorithms to produce an outcome prediction of patient medical condition.
55. The system and method of claim 54, further comprising the step of electronically generating a queue or signal indicative of a detected medical condition within the captured voice fragments.
56. A method of creating a voice-like noise which is identical in sound to an actual specific human's voice, comprising the steps of: e. capturing an enabling portion of a specific human's voice for storage and use; f. storing the enabling portion of the specific human's voice; g. analyzing the enabling portion to identify listener-sensitive essential components or characteristics of the captured voice by identifying the components in the captured enabling portion of the specific human's voice which relate to a plurality of voice components including frequency, tone, pitch, volume, accent, gender, harmonic structure, acoustic power, phoneticization, timing, accent, power and periodicity; with said identifying being accomplished by comparing at least a part of the enabling portion with data modules including known voice characteristics to facilitate the classification of the specific human's voice into a digital identifier; and h. utilizing the identified essential components or characteristics, as now defined by the digital identifier, to create a new voice which, when assigned and combined with previously unspoken data from one or more database means and when heard, sounds identical in all respects to the voice of the specific human's voice to a listener having normal aural discretion abilities.
57. A method of accurately replicating a human voice comprising the steps of: d. identifying a minimum size data set comprising a combination of words, sounds or phrases which must be emitted by the originator of a voice to be replicated; e. capturing the emission of the combination of words, sounds or phrases by the originator of the voice to be replicated in a medium; f. analyzing the captured emission to identify voice characteristics of the originator of the voice sufficient to allow artificial generation of the voice, using the identified characteristics, so that the artificially generated voice sounds substantially identical in all respects to a listener having normal aural discretion abilities when the listener hears the generated voice utilizing some language not contained in the captured emission of the originator's actual voice; with said identifying being accomplished by comparing at least a part of the enabling portion with data modules including known voice characteristics to facilitate the classification of the characteristics of the specific human's voice into a digital identifier which allows rapid artificial generation of the voice when the digital identifier is used with a source of unspoken data to further characterize the data so that the listener believes the generated voice is that of the originator.
58. A method of accurately replicating a human voice comprising the steps of: a. identifying a minimum size data set comprising a combination of words, sounds or phrases which must be emitted by the originator of a voice to be replicated; b. capturing the emission of the combination of words, sounds or phrases by the originator of the voice to be replicated in a medium; c. analyzing the characteristics of frequency, tone, pitch, volume, accent, gender, harmonic structure, dialect, education, acoustic power, phonetics, timing, rhythm, accent, power and periodicity of the captured emission to identify voice characteristics of the originator of the voice sufficient to allow artificial generation of the voice, using the identified characteristics as elements in a template identifier, so that the artificially generated voice is substantially identical to the originator of the voice in all respects to a listener having normal aural discretion abilities when the listener hears the generated voice utilizing some language components not contained in the captured emission of the originator's actual voice; and d. generating a generated voice using the identified characteristics with a template identifier and a source of data that comprises noise not previously spoken.
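As an illustration only, the data structure recited in claims 29 and 30 (voice enabling portion data files, voice characteristics data files holding reference information, and voice profile data files, arranged so that comparison operations can be conducted) might be sketched as follows. The claims do not specify any file format, numeric representation or comparison metric; every class name, field and the normalized-difference measure below is a hypothetical assumption chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EnablingPortion:
    """Hypothetical: a captured voice sample reduced to named numeric characteristics."""
    originator_id: str
    characteristics: Dict[str, float]


@dataclass
class VoiceCharacteristicsFile:
    """Hypothetical reference information for one characteristic (e.g. pitch)."""
    name: str
    low: float   # assumed lower bound of the reference range
    high: float  # assumed upper bound of the reference range

    def normalize(self, value: float) -> float:
        """Scale a raw value against the reference range."""
        span = self.high - self.low
        return (value - self.low) / span if span else 0.0


@dataclass
class VoiceProfile:
    """Hypothetical stored profile with data unique to that profile."""
    profile_id: str
    characteristics: Dict[str, float]


@dataclass
class VoiceDatabase:
    """Holds the three kinds of data file and supports comparison operations."""
    enabling_portions: List[EnablingPortion] = field(default_factory=list)
    characteristics_files: Dict[str, VoiceCharacteristicsFile] = field(default_factory=dict)
    profiles: List[VoiceProfile] = field(default_factory=list)

    def compare(self, portion: EnablingPortion, profile: VoiceProfile) -> float:
        """Mean absolute normalized difference over shared characteristics (0.0 = identical)."""
        shared = [
            name for name in portion.characteristics
            if name in profile.characteristics and name in self.characteristics_files
        ]
        if not shared:
            raise ValueError("no shared characteristics to compare")
        total = 0.0
        for name in shared:
            ref = self.characteristics_files[name]
            total += abs(
                ref.normalize(portion.characteristics[name])
                - ref.normalize(profile.characteristics[name])
            )
        return total / len(shared)

    def closest_profile(self, portion: EnablingPortion) -> VoiceProfile:
        """Return the stored profile with the smallest difference from the portion."""
        if not self.profiles:
            raise ValueError("database holds no voice profiles")
        return min(self.profiles, key=lambda p: self.compare(portion, p))
```

In this sketch the reference ranges play the role of the "different reference information for a plurality of voice characteristics", and `closest_profile` is one possible comparison operation between an enabling portion and the stored profiles; a real system could use any other representation or metric.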
PCT/US2003/022636 2002-07-17 2003-07-17 System and method for voice characteristic medical analysis WO2004008295A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003259177A AU2003259177A1 (en) 2002-07-17 2003-07-17 System and method for voice characteristic medical analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US39670202P 2002-07-17 2002-07-17
US60/396,702 2002-07-17

Publications (2)

Publication Number Publication Date
WO2004008295A2 WO2004008295A2 (en) 2004-01-22
WO2004008295A3 WO2004008295A3 (en) 2004-04-15

Family

ID=30116053

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/022636 WO2004008295A2 (en) 2002-07-17 2003-07-17 System and method for voice characteristic medical analysis

Country Status (2)

Country Link
AU (1) AU2003259177A1 (en)
WO (1) WO2004008295A2 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105448291A (en) * 2015-12-02 2016-03-30 南京邮电大学 Parkinsonism detection method and detection system based on voice
CN110211566A (en) * 2019-06-08 2019-09-06 安徽中医药大学 A kind of classification method of compressed sensing based hepatolenticular degeneration disfluency
CN110223688A (en) * 2019-06-08 2019-09-10 安徽中医药大学 A kind of self-evaluating system of compressed sensing based hepatolenticular degeneration disfluency
CN111326162A (en) * 2020-04-15 2020-06-23 厦门快商通科技股份有限公司 Voiceprint feature acquisition method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5007081A (en) * 1989-01-05 1991-04-09 Origin Technology, Inc. Speech activated telephone
US5594789A (en) * 1994-10-13 1997-01-14 Bell Atlantic Network Services, Inc. Transaction implementation in video dial tone network
US5717828A (en) * 1995-03-15 1998-02-10 Syracuse Language Systems Speech recognition apparatus and method for learning
US5774841A (en) * 1995-09-20 1998-06-30 The United States Of America As Represented By The Adminstrator Of The National Aeronautics And Space Administration Real-time reconfigurable adaptive speech recognition command and control apparatus and method

Also Published As

Publication number Publication date
WO2004008295A3 (en) 2004-04-15
AU2003259177A8 (en) 2004-02-02
AU2003259177A1 (en) 2004-02-02

Similar Documents

Publication Publication Date Title
Fagherazzi et al. Voice for health: the use of vocal biomarkers from research to clinical practice
US20020072900A1 (en) System and method of templating specific human voices
Bachorowski Vocal expression and perception of emotion
Kraljic et al. First impressions and last resorts: How listeners adjust to speaker variability
Rachman et al. DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in running speech
CN108132995A (en) For handling the method and apparatus of audio-frequency information
Johar Emotion, affect and personality in speech: The Bias of language and paralanguage
CN113010138B (en) Article voice playing method, device and equipment and computer readable storage medium
US20050108011A1 (en) System and method of templating specific human voices
Caponetti et al. Biologically inspired emotion recognition from speech
AU2048001A (en) System and method of templating specific human voices
Lotfian et al. Lexical dependent emotion detection using synthetic speech reference
Wu et al. Exemplar-based emotive speech synthesis
Hashem et al. Speech emotion recognition approaches: A systematic review
Qadri et al. A critical insight into multi-languages speech emotion databases
Garcia-Cuesta et al. EmoMatchSpanishDB: study of speech emotion recognition machine learning models in a new Spanish elicited database
WO2004008295A2 (en) System and method for voice characteristic medical analysis
Potapova et al. Forensic identification of foreign-language speakers by the method of structural-melodic analysis of phonograms
He Stress and emotion recognition in natural speech in the work and family environments
Anumanchipalli Intra-lingual and cross-lingual prosody modelling
Lee et al. The Sound of Hallucinations: Toward a more convincing emulation of internalized voices
de Vries et al. “You Can Do It!”—Crowdsourcing Motivational Speech and Text Messages
Hatem et al. Human Speaker Recognition Based Database Method
Midtlyng et al. Voice adaptation by color-encoded frame matching as a multi-objective optimization problem for future games
Jardebrand Talk2me, a voice controlled user interface used in the initial ambulance care process

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP