US20100262603A1 - Search engine methods and systems for displaying relevant topics - Google Patents
Search engine methods and systems for displaying relevant topics
- Publication number
- US20100262603A1 (application US12/764,792)
- Authority
- US
- United States
- Prior art keywords
- topics
- search
- data
- topic
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Definitions
- the present invention relates to search engines, and more particularly, to search engine methods and systems that provide relevant and timely topics.
- Illustrative supervised classification technologies include semantic networks and neural networks. While supervised systems generally derive classifications more attuned to what a human would generate, they often require substantial training and tuning by expert operators and, in addition, often rely for their results on data that is more consistent or homogeneous than is often possible to obtain in practice. Hybrid systems attempt to fuse the benefits of manual classification methods with the speed and processing capabilities employed by unsupervised and supervised systems. In known hybrid systems, human operators are used to derive “rules of thumb” which drive the underlying classification engines.
- the boss would like the individual to send a copy of the email and the references back to him as soon as possible. Also, he would like the individual to check for additional references to see if the conclusions in the memo need to be updated.
- the boss requires that the project be completed within fifteen minutes.
- the worker is not disorganized, but as is common, does not have total recall of how the information was gathered or where the email is stored. After thirty minutes, the worker finally finds the email. But, the worker still needs to search for additional information as requested by his boss. The end result is that, because no efficient search mechanism existed, the worker has missed his boss' deadline.
- search results driven by website popularity can often lead to useless results.
- search engine operations facility there is an army of personnel and massive server farms humming away to potentially deliver hundreds of thousands of results to every search query that an individual enters.
- Web searching, search advertising, and enterprise searching are not consistently providing acceptable search resolution for the user.
- the missing ingredient in current search technology is “true relevance”. Relevance can only be defined by the user for a specific search. Relevancy has no predictable pattern. No generalized algorithm is going to repeatably produce relevant information, because in the end, any generalization is arbitrary.
- search engines do not display topics associated with every subject matter domain related to a search constraint entered by a user. Rather a search engine may only show search results or topics that are most popular without regard to different subject matter domains that search results may belong to. For example, when a user enters the search constraint, Jaguar.
- the data items belonging to the search results may include topics that correspond to subject matter domains that include autos (e.g., there is a car named Jaguar), animals (e.g., there is an animal called Jaguar), software (e.g., there is a software package referred to as Jaguar), resorts (e.g., there are resorts in South America referred to as Jaguar resorts), football (e.g., there is a football team referred to as the Jacksonville Jaguars) and game (e.g., there is a game referred to as Jaguar).
- Those search engines that provide results based only on popularity of website hits might only display topics or search results associated with the subject matter domain Auto. Or, at the very least, items associated with Resorts would be on page 27 of the search results.
- search methods and systems that can efficiently generate search results to identify and display topics by considering, at any given time, the relative significance of a topic based on current events and that ensure coverage of all subject matter domains associated with a search constraint.
- the present invention provides search engine methods and systems for displaying relevant and timely topics.
- FIG. 1 is a flowchart of a method to identify topics in a corpus of data in accordance with one embodiment of the invention.
- FIG. 2 is a flowchart of a method to generate a domain specific word list in accordance with one embodiment of the invention.
- FIG. 3 is a flowchart of a method to identify topics in a corpus of data in accordance with one embodiment of the invention.
- FIG. 4 is a flowchart of a method to measure actual usage of significant words in a corpus of data in accordance with one embodiment of the invention.
- FIG. 5 is a flowchart of a topic refinement process in accordance with one embodiment of the invention.
- FIG. 6 is a flowchart of a topic identification method in accordance with one embodiment of the invention.
- FIG. 7 is a flowchart of one method in accordance with the invention to identify those topics for display during a user query operation.
- FIG. 8 is a diagram that shows enterprise information sources.
- FIG. 9 is a flowchart of a method for displaying topics, according to an embodiment of the invention.
- FIG. 10 provides a screen shot of a search engine web site, according to an embodiment of the invention.
- FIG. 11 is a flowchart of a method for displaying topics, according to an embodiment of the invention.
- FIG. 12 is a flowchart of a method for displaying topics, according to an embodiment of the invention.
- FIG. 13 is a flowchart of a method to rank topics into one of four general rankings, according to an embodiment of the invention.
- FIG. 14 is a diagram that illustrates topic clustering, according to an embodiment of the invention.
- FIG. 15 is a block diagram of a system, according to an embodiment of the invention.
- a collection of topics is determined for a first corpus of data, wherein the topics are domain specific, based on a statistical analysis of the first data corpus, and substantially automatically generated.
- the topics may be associated with each “segment” of a second corpus of data, wherein a segment is a user-defined quantum of information.
- Example segments include, but are not limited to, sentences, paragraphs, headings (e.g., chapter headings, titles of manuscripts, titles of brochures and the like), chapters and complete documents.
- Data comprising the data corpus may be unstructured (e.g., text) or structured (e.g., spreadsheets and database tables).
- topics may be used during user query operations to return a result set based on a user's query input.
- one method in accordance with the invention uses domain specific word list 100 as a starting point from which to analyze data 105 (block 110 ) to generate domain specific topic list 115 .
- topic list 115 entries may be associated with each segment of data 105 (block 120 ) and stored in database 125 where it may be queried by user 135 through user interface 130 .
- Word list 100 may comprise a list of words or word combinations that are meaningful to the domain from which data 105 is drawn. For example, if data 105 represents medical documents then word list 100 may be those words that are meaningful to the medical field or those subfields within the field of medicine relevant to data 105 .
- Data 105 may be substantially any form of data, structured or unstructured.
- data 105 comprises unstructured text files such as medical abstracts and/or articles.
- data 105 comprises books, newspapers, magazine content or a combination of these sources.
- data 105 comprises structured data such as design documents and spreadsheets describing an oil refinery process.
- data 105 comprises content tagged image data, video data and/or audio data.
- data 105 comprises a combination of structured and unstructured data.
- Data 105 may also include data gathered from across a network, such as the Internet.
- Acts in accordance with block 110 use word list 100 entries to statistically analyze data 105 on a segment-by-segment basis.
- a segment may be defined as a sentence and/or heading and/or title.
- a segment may be defined as a paragraph and/or heading and/or title.
- a segment may be defined as a chapter and/or heading and/or title.
- a segment may be defined as a complete document and/or heading and/or title.
- Other definitions may be appropriate for certain types of data and, while different from those enumerated here, would be obvious to one of ordinary skill in the art. For example, headings and titles may be excluded from consideration.
- a first portion of data 105 may be used to generate topic list 115 , with the topics so identified being associated with the entire corpus of data during the acts of block 120 .
- data 105 comprises the text of approximately 12 million abstracts from the Medline ® data collection. These abstracts include approximately 2.8 million unique words, representing approximately 40 Gigabytes of raw data.
- MEDLINE ® Medical Literature, Analysis, and Retrieval System Online
- NLM National Library of Medicine's
- the database contains bibliographic citations and author abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries.
- Medline® is searchable at no cost from the NLM's web site at http://www.nlm.nih.gov.
- word list 100 may be generated by first compiling a preliminary list of domain specific words 200 and then pruning from that list those entries that do not significantly and/or uniquely identify concepts or topics within the target domain (block 205 ).
- Preliminary list 200 may, for example, be comprised of words from a dictionary, thesaurus, glossary, domain specific word list or a combination of these sources.
- the Internet may be used to obtain preliminary word lists for virtually any field.
- Words removed in accordance with block 205 may include standard STOP words as illustrated in Table 2. (One of ordinary skill in the art will recognize that other STOP words may be used.)
- a general domain word list may be created that comprises those words commonly used in English (or another language), including those that are specific to a number of different domains. This “general word list” may be used to prune words from a preliminary domain specific word list. In another embodiment, some common words removed as a result of the general word list pruning just described may be added back into preliminary word list 200 because, while used across a number of domains, they have a particular importance in the particular domain.
- preliminary word list 200 was derived from the Unified Medical Language System Semantic Network (see http://www.nlm.nih.gov/databases/leased.html#umls) and included 4,000,000 unique single-word entries. Of these, roughly 3,945,000 were removed in accordance with block 205. Accordingly, word list 100 comprised approximately 55,000 one-word entries.
- Example word list 200 entries for the medical domain include: abdomen, biotherapy, chlorided, distichiasis, enzyme, enzymes, freckle, gustatory, immune, kyphoplasty, laryngectomy, malabsorption, nebulize, obstetrics, pancytopenia, quadriparesis, retinae, sideeffect, tonsils, unguium, vermicular, womb, xerostomia, yersinia, and zygote.
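The pruning of block 205 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described in the application: the function name `prune_word_list`, its parameters, and the toy word sets are all hypothetical and chosen only for illustration.

```python
def prune_word_list(preliminary, stop_words, general_words, add_back=()):
    """Return a pruned domain word list (illustrative sketch of block 205).

    preliminary   -- candidate domain words (preliminary list 200)
    stop_words    -- STOP words to drop
    general_words -- common cross-domain words to drop
    add_back      -- common words restored because they matter in this domain
    """
    keep = set(add_back)
    seen = set()
    pruned = []
    for word in preliminary:
        w = word.lower()
        if w in seen:          # skip duplicates, ignoring case
            continue
        seen.add(w)
        if w in keep or (w not in stop_words and w not in general_words):
            pruned.append(w)
    return pruned


# Hypothetical toy data, not taken from the application's tables.
preliminary_200 = ["the", "enzyme", "patient", "zygote", "of", "Enzyme", "study"]
word_list_100 = prune_word_list(
    preliminary_200,
    stop_words={"the", "of"},
    general_words={"patient", "study"},
    add_back={"patient"},
)
print(word_list_100)
```

Here "patient" survives only because it is listed in `add_back`, mirroring the embodiment in which common words of particular importance to the domain are restored.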
- word list 100 provides an initial estimation of domain specific concepts/topics. Analysis in accordance with the invention beneficially expands the semantic breadth of word list 100 , however, by identifying word collections (e.g., pairs and triplets) as topics (i.e., topic list 115 ). Once topics are identified, each segment in data 105 may be associated with those topics (block 120 ) that exist in that segment. Accordingly, if a corpus of data comprises information from a plurality of domains, analysis in accordance with FIG. 1 may be run multiple times, each time with a different word list 100 .
- word collections e.g., pairs and triplets
- each segment may be analyzed for each domain list before a next segment is analyzed.
- undifferentiated data i.e., data not identified as belonging to one or another specific domain
- word list 100 may be unique for each target domain but, once developed, may be used against multiple data collections in that field.
- it is beneficial to refine the contents of word list 100 for each domain so as to make the list as domain-specific as possible. It has been empirically determined that tightly focused domain-specific word lists yield a more concise collection of topics which, in turn, provide improved search results (see discussion below).
- FIG. 3 illustrates one method in accordance with the invention to identify topics (block 110 of FIG. 1 ) in data 105 using word list 100 as a starting point.
- data 105 or a portion thereof
- a result of this initial step is preliminary topic list 305 .
- an expected value for each entry in preliminary topic list 305 is computed (block 310 ) and compared with the actual usage value determined during block 300 (block 315 ). If the measured actual usage of a preliminary topic list entry is significantly greater than the computed expected value of the entry (the “yes” prong of block 315 ), that entry is added to topic list 115 (block 320 ).
- Topic List 115: For the data set identified in Tables 1 and 3, 10 of the 35 Gigabytes were used to generate topic list 115.
- topic list 115 comprised approximately 506,000 entries. In one embodiment, each of these entries is a double-word entry.
- Illustrative topics identified for Medline® abstract content in accordance with the invention include: adenine nucleotide, heart disease, left ventricular, atria ventricles, heart failure, muscle, heart rate, fatty acids, loss bone, patient case, bone marrow, and arterial hypertension.
- one method to measure the actual usage of significant words in data 105 is to determine three statistics for each entry in word list 100 : S 1 (block 400 ); S 2 (block 405 ); and S 3 (block 410 ).
- statistics S 1 , S 2 and S 3 measure the actual frequency of usage of various words and word combinations in data 105 at the granularity of the user-defined segment. More specifically:
- Statistic S 1 (block 400 ) is a segment-level frequency count for each entry in word list 100 .
- the value of S 1 for word-i is the number of unique paragraphs in data 105 in which word-i is found.
- An S 1 value may also be computed for non-word list 100 words if they are identified as part of a word combination as described below with respect to statistic S 2 .
- Statistic S 2 (block 405 ) is a segment-level frequency count for each significant word combination in data 105 . Those word combinations having a non-zero S 2 value may be identified as preliminary topics 305 .
- a “significant word combination” comprises any two entries in word list 100 that are in the same segment.
- a “significant word combination” comprises any two entries in word list 100 that are in the same segment and contiguous.
- a “significant word combination” comprises any two entries in word list 100 that are in the same segment and contiguous or separated only by one or more STOP words.
- a “significant word combination” comprises any two words that are in the same segment and contiguous or separated only by one or more STOP words where at least one of the words in the word combination is in word list 100 .
- a “significant word combination” comprises a two or more word combination appearing in any data item within Data 105 .
- word list 100 would not be used.
- a “significant word combination” comprises any two or more words that are in the same segment and separated by ‘N’ or fewer specified other words: N may be zero or more; and the specified words are typically STOP words.
- word combinations comprising non-word list 100 words may be ignored if they appear in less than a specified number of segments in data 105 (e.g., less than 10 segments).
- the value of S 2 for word-combination-i is the number of unique paragraphs in data 105 in which word-combination-i is found.
- Statistic S 3 (block 410 ) indicates the number of unique word combinations (identified by having non-zero S 2 values, for example) each word in word list 100 was found in.
- word-z's S 3 value is 3.
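The three statistics can be illustrated with a short sketch. It assumes one of the definitions listed above, namely that a significant word combination is a pair of word list entries that are contiguous or separated only by STOP words, and it treats segments as pre-tokenized lists. The function name `usage_statistics` and the toy corpus are hypothetical.

```python
from collections import defaultdict


def usage_statistics(segments, word_list, stop_words):
    """Compute S1, S2 and S3 over tokenized segments (illustrative sketch).

    segments   -- list of token lists, one per segment
    word_list  -- set of domain words
    stop_words -- set of STOP words that may separate the words of a pair
    """
    s1 = defaultdict(int)  # word -> number of segments containing it
    s2 = defaultdict(int)  # (word, word) -> number of segments containing the pair
    for tokens in segments:
        for word in set(tokens) & word_list:
            s1[word] += 1
        # Dropping STOP words makes "separated only by STOP words" adjacent.
        content = [t for t in tokens if t not in stop_words]
        pairs = {
            (a, b)
            for a, b in zip(content, content[1:])
            if a in word_list and b in word_list
        }
        for pair in pairs:  # counted once per segment
            s2[pair] += 1
    s3 = defaultdict(int)  # word -> number of distinct pairs it takes part in
    for a, b in s2:
        s3[a] += 1
        if b != a:
            s3[b] += 1
    return dict(s1), dict(s2), dict(s3)


# Hypothetical toy corpus of four segments.
demo_segments = [
    ["heart", "rate", "of", "the", "patient"],
    ["heart", "failure", "in", "rats"],
    ["bone", "marrow"],
    ["resting", "heart", "rate"],
]
demo_words = {"heart", "rate", "failure", "bone", "marrow"}
s1, s2, s3 = usage_statistics(demo_segments, demo_words, {"of", "the", "in"})
```

In this toy corpus "heart" occurs in three segments and takes part in two distinct pairs ("heart rate" and "heart failure"), so its S1 value is 3 and its S3 value is 2.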
- One method to compute the expected usage of significant words in data 105 is to calculate the expected value for each preliminary topic list 305 entry based only on its overall frequency of use in data 105 .
- the expected value for each word pair in preliminary topic list 305 may be computed as follows:
- S 1 (word-i) and S 1 (word-j) represent the S 1 statistic values for word-i and word-j respectively
- N represents the total number of segments in the data corpus being analyzed.
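The formula itself is not reproduced in this text. One reading consistent with the symbols defined above, offered here purely as an assumption, is the count expected if the two words occurred independently of one another: N × (S1(word-i)/N) × (S1(word-j)/N) = S1(word-i) × S1(word-j) / N. The function name `expected_usage` is hypothetical.

```python
def expected_usage(s1_i, s1_j, n_segments):
    """Expected number of segments containing both words, ASSUMING the two
    words occur independently: S1_i * S1_j / N. This form is an assumption;
    the original formula is not present in the text."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    return s1_i * s1_j / n_segments


# Hypothetical counts: word-i in 200 segments, word-j in 50, out of 10,000.
example_expected = expected_usage(200, 50, 10_000)
print(example_expected)
```

Under this assumption, two words found in 200 and 50 of 10,000 segments would be expected to share a single segment by chance alone.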
- the test (block 315 ) of whether a topic's measured usage (block 300 ) is significantly greater than the topic's expected usage (block 310 ) uses a constant multiplier. For example, if the measured usage of preliminary topic list entry-i is twice that of preliminary topic list entry-i's expected usage, preliminary topic list entry-i may be added to topic list 115 in accordance with block 320 .
- if the measured usage of preliminary topic list entry-i is greater than a threshold value (e.g., 10) across all segments, then that preliminary topic list entry is selected as a topic.
- a threshold value e.g. 10
- a different multiplier e.g. 1.5 or 3
- conventional statistical tests of significance may be used.
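The multiplier and threshold variants of the block 315 test can be sketched together. The function name `is_topic` and its parameters are hypothetical; the default multiplier of 2 and the example threshold of 10 come from the text, while the choice of `>=` for the multiplier comparison is an assumption.

```python
def is_topic(measured, expected, multiplier=2.0, min_count=None):
    """Decide whether a preliminary topic is promoted to the topic list.

    measured   -- measured usage (S2) of the word combination
    expected   -- expected usage of the word combination
    multiplier -- constant multiplier for the "significantly greater" test
    min_count  -- if given, use the absolute-threshold variant instead
    """
    if min_count is not None:
        return measured > min_count
    return measured >= multiplier * expected


print(is_topic(4, 1.0))                     # multiplier variant, default 2x
print(is_topic(11, 100.0, min_count=10))    # absolute-threshold variant
```

A conventional statistical test of significance could replace either branch without changing the surrounding flow.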
- topic list 115 may be refined in accordance with FIG. 5 .
- this refinement process will be described in terms of two-word topics.
- One of ordinary skill in the art will recognize that the technique is equally applicable to topics having more than two words.
- a first two word topic is selected (block 500 ). If both words comprising the topic are found in word list 100 (the “Yes” prong of block 505 ), the two word topic is retained (block 510 ).
- both words comprising the topic are not found in word list 100 (the “no” prong of block 505 ), but the S 3 value for that word which is in word list 100 is not significantly less than the S 3 value for the other word (the “yes” prong of block 515 ), the two word topic is retained (block 510 ). If, on the other hand, one of the topic's words is not in word list 100 (the “no” prong of block 505 ) and the S 3 value for that word which is in word list 100 is significantly less than the S 3 value for the other word (the “no” prong of block 515 ), only the low S 3 value word is retained in topic list 115 as a topic (block 520 ).
- the test for significance is based on whether the “high” S 3 value is in the upper one-third of all S 3 values and the “low” S 3 value is in the lower one-third of all S 3 values. For example, if the S 3 statistic for a corpus of data has a range of zero to 12,000, a “low” S 3 value is less than or equal to 4,000 and a “high” S 3 value is greater than or equal to 8,000.
- the test for significance in accordance with block 515 may be based on quartiles, quintiles or Bayesian tests. Refinement processes such as that outlined in FIG. 5 acknowledge word associations within data, while ignoring individual words that are so prevalent alone (high S 3 value) as to offer substantially no differentiation as to content.
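The refinement of FIG. 5 can be sketched for a single two-word topic using the one-third test described above. The sketch assumes that at least one word of the topic is in the word list and that S 3 values range from zero to `s3_max`; the function name `refine_topic` and the sample values are hypothetical.

```python
def refine_topic(topic, word_list, s3, s3_max):
    """Apply the FIG. 5 refinement to one two-word topic (illustrative).

    Returns the topic to keep: the original pair, or a one-word tuple
    holding only the low-S3 word. Assumes at least one word of the pair
    is in word_list.
    """
    a, b = topic
    if a in word_list and b in word_list:
        return topic                              # block 510: retain pair
    listed, other = (a, b) if a in word_list else (b, a)
    low_cut, high_cut = s3_max / 3, 2 * s3_max / 3
    significantly_less = (
        s3.get(listed, 0) <= low_cut and s3.get(other, 0) >= high_cut
    )
    if significantly_less:
        return (listed,)                          # block 520: keep low-S3 word
    return topic                                  # block 510: retain pair


# Hypothetical S3 values on the 0..12,000 range used in the example above.
demo_s3 = {"kyphoplasty": 3000, "patient": 9000, "case": 5000}
demo_list = {"kyphoplasty", "heart", "rate"}
print(refine_topic(("patient", "kyphoplasty"), demo_list, demo_s3, 12_000))
print(refine_topic(("case", "kyphoplasty"), demo_list, demo_s3, 12_000))
```

In the first call "patient" is so prevalent (upper third) that only "kyphoplasty" is kept; in the second, "case" falls in the middle third, so the pair is retained.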
- each segment in data 105 may be associated with those topics which exist within it (block 120 ) and stored in database 125 .
- Topics may be associated with a data segment in any desired fashion. For example, topics found in a segment may be stored as metadata for the segment. In addition, stored topics may be indexed for improved retrieval performance during subsequent lookup operations.
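One way to picture the association and indexing step is a per-segment metadata map plus an inverted index from topic to segment. This is a sketch only: the storage layout, the function name `associate_topics`, and the use of plain substring matching are assumptions standing in for whatever database 125 actually does.

```python
from collections import defaultdict


def associate_topics(segments, topic_list):
    """Attach topics to segments and build a topic -> segment-id index.

    segments   -- dict of segment id -> text (hypothetical storage layout)
    topic_list -- iterable of topic strings
    """
    metadata = {}              # segment id -> sorted topics found in it
    index = defaultdict(set)   # topic -> ids of segments containing it
    for seg_id, text in segments.items():
        lowered = text.lower()
        found = sorted(t for t in topic_list if t in lowered)
        metadata[seg_id] = found
        for topic in found:
            index[topic].add(seg_id)
    return metadata, dict(index)


# Hypothetical segments, not drawn from the Medline data set.
segments_demo = {
    1: "Heart rate and blood pressure were recorded.",
    2: "Bone marrow samples were examined.",
    3: "Resting heart rate fell after treatment.",
}
metadata_demo, index_demo = associate_topics(
    segments_demo, ["heart rate", "bone marrow", "blood pressure"]
)
```

The inverted index lets a later lookup find every segment for a selected topic without rescanning the text.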
- Empirical studies show that the large majority of user queries are “under-defined.” That is, the query itself does not identify any particular subject matter with sufficient specificity to allow a search engine to return the user's desired data in a result set (i.e., that collection of results presented to the user) that is acceptably small.
- a typical user query may be a single word such as, for example, “kidney.”
- prior art search techniques generally return large result sets—often containing thousands, or tens of thousands, of “hits.” Such large result sets are almost never useful to a user as they do not have the time to go through every entry to find that one having the information they seek.
- topics associated with data segments in accordance with the invention may be used to facilitate data retrieval operations as shown in FIG. 6 .
- When a user query is received (block 600 ) it may be used to generate an initial result set (block 605 ) in a conventional manner. For example, a literal text search of the query term may identify 100,000 documents (or objects stored in database 125 ) that contain the search term. From this initial result set, a subset may be selected for analysis in accordance with topics (block 610 ). In one embodiment, the subset is a randomly chosen 1% of the initial result set. In another embodiment, the subset is a randomly chosen 1,000 entries from the initial result set. In yet another embodiment, a specified number of entries are selected from the initial result set (chosen in any manner desired).
- While the number of entries in the initial result subset may be chosen in substantially any manner desired, it is preferable to select at least a number that provides “coverage” (in a statistical sense) for the initial result set. In other words, it is desirable that the selected subset mirror the initial result set in terms of topics. With an appropriately chosen result subset, the most relevant topics associated with those results may be identified (block 615 ) and displayed to the user (block 620 ).
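The subset selection of block 610 can be sketched with the standard library's random sampling. The function name `sample_result_subset` and its parameters are hypothetical; the 1% fraction and the 1,000-entry count are the two embodiments named above.

```python
import random


def sample_result_subset(initial_results, fraction=None, count=1000, seed=None):
    """Randomly pick a subset of the initial result set (illustrative).

    fraction -- if given, sample this share of the results (e.g. 0.01)
    count    -- otherwise sample this many entries (capped at the set size)
    seed     -- optional seed, only to make the sketch reproducible
    """
    rng = random.Random(seed)
    if fraction is not None:
        k = max(1, round(len(initial_results) * fraction))
    else:
        k = count
    k = min(k, len(initial_results))
    return rng.sample(initial_results, k)


initial = list(range(100_000))          # stand-in for 100,000 document ids
subset_pct = sample_result_subset(initial, fraction=0.01, seed=0)
subset_small = sample_result_subset(list(range(50)), count=1000, seed=0)
```

Sampling without replacement keeps the subset free of duplicates, which matters when topic frequencies in the subset are meant to mirror those of the full result set.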
- FIG. 7 shows one method in accordance with the invention to identify those topics for display (block 615 ).
- all unique topics associated with the result subset are identified (block 700 ), and those topics that appear in more than a specified fraction of the result subset are removed (block 705 ). For example, those topics appearing in 80% or more of the segments comprising the result subset may be ignored for the purposes of this analysis. (A percentage higher or lower than this may be selected without altering the salient characteristics of the process.)
- that topic which appears in the most result subset entries is selected for display (block 710 ). If more than one topic ties for having the most coverage, one may be selected for display in any manner desired.
- the specified threshold of block 715 is 20%, although a percentage higher or lower than this may be selected without altering the salient characteristics of the process.
- the remaining topics are serialized and duplicate words are eliminated (block 725 ). That is, topics comprising two or more words are broken apart and treated as single-word topics.
- the single-word topic that appears in the most result subset entries not already excluded is selected for display (block 730 ). As before, if more than one topic ties for having the most coverage, one may be selected for display in any manner desired.
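The multi-word part of this selection (blocks 700 to 710) can be sketched as a greedy coverage loop. The text here does not spell out how the 20% threshold of block 715 is applied, so the sketch assumes it is a stopping rule: selection stops once the best remaining topic covers less than that share of the subset. The function name `select_display_topics`, the tie-break, and the toy data are hypothetical, and the single-word pass of blocks 725 and 730 is not shown.

```python
from collections import Counter


def select_display_topics(subset_topics, max_fraction=0.8, min_fraction=0.2):
    """Greedy topic selection over a result subset (illustrative sketch).

    subset_topics -- list of sets, one set of topics per subset entry
    max_fraction  -- topics in this share of entries or more are ignored
    min_fraction  -- ASSUMED stopping rule standing in for block 715
    """
    n = len(subset_topics)
    counts = Counter(t for topics in subset_topics for t in topics)
    candidates = {t for t, c in counts.items() if c / n < max_fraction}
    remaining = list(subset_topics)
    selected = []
    while remaining and candidates:
        cover = Counter(
            t for topics in remaining for t in topics if t in candidates
        )
        if not cover:
            break
        # Ties are broken by topic name; the text allows any tie-break.
        best, hits = max(cover.items(), key=lambda kv: (kv[1], kv[0]))
        if hits / n < min_fraction:
            break
        selected.append(best)
        candidates.discard(best)
        # Entries already covered by a selected topic are set aside.
        remaining = [topics for topics in remaining if best not in topics]
    return selected


# Hypothetical six-entry result subset for the query "kidney".
demo_subset = [
    {"kidney", "renal function"},
    {"kidney", "renal function"},
    {"kidney", "blood pressure"},
    {"kidney", "renal function"},
    {"kidney", "blood pressure"},
    {"kidney", "body weight"},
]
demo_display = select_display_topics(demo_subset)
```

In the toy data "kidney" appears in every entry and is therefore ignored, "renal function" and "blood pressure" are selected in turn, and "body weight" is dropped because it covers only one entry in six.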
- the topics identified in accordance with FIG. 7 may be displayed to the user (block 620 in FIG. 6 ).
- data retrieval operations in accordance with the invention return one or more topics which the user may select to pursue or redefine their initial search.
- a specified number of search result entries may be displayed in conjunction with the displayed topics.
- a user may be presented with those data corresponding to the selected topics.
- Topics may, for example, be combined through Boolean “and” and/or “or” operators.
- the user may be presented with another list of topics based on the “new” result set in a manner described above.
- search operations in accordance with the invention respond to user queries by presenting a series of likely topics that most closely reflect the subjects that their initial search query relates to. Subsequent selection of a topic by the user, in effect, supplies additional search information which is used to refine the search.
- Example Query Result: For the data set identified in Tables 1, 3 and 4, a search on the single word “kidney” returns an initial result set comprising 147,549 hits. (That is, 147,549 segments had the word kidney in them.) Of these, 1,000 were chosen as the initial result subset. Using the specified thresholds discussed above, the following topics were represented in the result set: amino acid, dependent presence, amino terminal, kidney transplantation, transcriptional regulation, liver kidney, body weight, rat kidney, filtration fraction, rats treated, heart kidney, renal transplantation, blood pressure, and renal function.
- Selection of the “renal function” topic identified a total of 6,853 entries divided among the following topics: effects renal, kidney transplantation, renal parenchyma, glomerular filtration, loss renal, blood flow, histological examination, renal artery, creatinine clearance, intensive care, and renal failure. Selection of the “glomerular filtration” topic from this list identified a total of 1,400 entries. Thus, in two steps the number of “hits” through which a person must search was reduced from approximately 148,000 to 1,500, a reduction of nearly two orders of magnitude.
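The "nearly two orders of magnitude" figure can be checked with a line of arithmetic. The sketch below uses only the hit counts quoted in the example; the variable names are illustrative.

```python
import math

initial_hits = 147_549      # hits for "kidney"
rounded_final = 1_500       # rounded figure quoted in the summary sentence
reported_final = 1_400      # entries reported for "glomerular filtration"

# Orders of magnitude = base-10 logarithm of the reduction ratio.
orders_rounded = math.log10(initial_hits / rounded_final)
orders_reported = math.log10(initial_hits / reported_final)
print(round(orders_rounded, 2), round(orders_reported, 2))
```

With the rounded figure of 1,500 the reduction is just under two orders of magnitude; with the reported 1,400 entries it is just over two.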
- retrieval operations in accordance with FIG. 6 may not be needed for all queries. For example, if a user query includes multiple search words or a quoted phrase that, using literal text-based search techniques, returns a relatively small result set (e.g., 50 hits or fewer), the presentation of this relatively small result set may be made immediately without resort to the topic-based approach of FIG. 6 . What size of initial result set that triggers use of a topic-based retrieval operation in accordance with the invention is a matter of design choice. In one embodiment, all initial result sets having more than 50 hits use a method in accordance with FIG. 6 . In another embodiment, only initial result sets having more than 200 results trigger use of a method in accordance with FIG. 6 .
- FIG. 8 provides a diagram that shows enterprise information sources. An office worker seated at his desk in front of the computer with a need to find information has a dilemma.
- the diagram illustrates that there are at least four main sources of information: enterprise information, server and PC information, Internet information, and email and attachments.
- Enterprise information can include data warehouses, multiple databases, and document systems.
- Server and PC information can include reports, presentations and data generated by the worker or his colleagues.
- Internet information can include a wealth of information, including business websites and business news. These are a few examples of the types of information that can be searched using the present invention, and are not intended to limit the scope of the invention.
- Information within the enterprise is doubling every five years and doubling every 6 years on the web. And that is not counting the scores of duplicate emails, attachments, and corporate documents. More and more time is being spent trying to find information and less of all the relevant information is being found. So, productivity is negatively affected. The quality of the decisions is poorer because of incomplete information and the risk of negative economic impacts rises.
- the first step in addressing the information dilemma is to provide real-time aggregation of information where the context (e.g. title, to, from, name, product, etc.) is identified and maintained. This must be done without requiring normalization of the data. Or, in other words, the information must be imported “as is” without having to reformat or transform the information into some common form. Examples of methods for aggregating the data are taught in commonly owned U.S. Pat. No. 5,842,213, entitled Method for Modeling, Storing and Transferring Data in Neutral Form, issued Nov. 24, 1998 to Odom et al., and U.S. Pat. No.
- the second step relates to the search problem, or put another way, finding the needed information - the proverbial needle in the haystack.
- True relevancy is the missing ingredient in search.
- the industry is looking for ways to produce better results for the user. This is particularly true when the user is searching for specific content as opposed to general information from an omnibus website. The emphasis is on trying to find a way to easily determine which information is relevant to the user.
- search engines do not directly factor in time relevancy, and these topics would be mixed in with the tens of thousands of other possible topic results. Thus, a user would not likely receive as relevant search results as would be desired.
- search engines do not display topics associated with every subject matter domain related to a search constraint entered by a user. Rather a search engine may only show search results that are most popular without regard to different subject matter domains that search results may belong to. For users interested in a particular domain, the search results displayed would not be particularly relevant and their specific areas of interest difficult to find. Thus, a user once again may not receive search results relevant to their particular area of interest.
- the present invention addresses these shortcomings of existing search engines and methods.
- embodiments of the present invention provide search methods and systems that can efficiently generate search results to identify and display topics by considering, at any given time, the relative significance of a topic based on current events and that ensure coverage of all subject matter domains associated with a search constraint.
- a topic comprises a word combination of two or more substantially contiguous words. Two words are substantially contiguous if they are separated only by zero or more words selected from a predetermined list of words. In one embodiment, the predetermined list of words consists of STOP words.
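The definition of "substantially contiguous" translates directly into a small check over a tokenized segment. The function name `substantially_contiguous` and the sample tokens are hypothetical, and the sketch assumes the predetermined list is a set of STOP words.

```python
def substantially_contiguous(tokens, first, second, stop_words):
    """Return True if `second` follows `first` with nothing but words from
    `stop_words` (possibly none) between them."""
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] in stop_words:
            j += 1                      # skip over STOP words only
        if j < len(tokens) and tokens[j] == second:
            return True
    return False


demo_tokens = "loss of bone mass".split()
print(substantially_contiguous(demo_tokens, "loss", "bone", {"of"}))
print(substantially_contiguous(demo_tokens, "loss", "mass", {"of"}))
```

The first check succeeds because only the STOP word "of" separates "loss" and "bone"; the second fails because "bone" is not in the predetermined list.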
- the set of information includes one or more of information located within an enterprise network, information located within a server, information located within a personal computer, information located on the Internet, or information contained within email messages or email attachments.
- data item includes one or more of text documents, graphic documents, audio files, video files, multimedia documents, email messages, email attachments, or Internet web pages.
- FIG. 9 provides a flowchart of method 900 for displaying topics related to a search constraint entered by a user to request search results that identify data items within a set of information that are related to the search constraint, according to an embodiment of the invention.
- Method 900 begins in step 910 .
- FIG. 10 provides a screen shot of a search engine web site, according to an embodiment of the invention. The screen shot of FIG. 10 is for illustrative purposes, and not intended to limit the scope of the invention.
- In step 910, a search constraint is received.
- the search constraint is “Pittsburgh Steelers.”
- a first preliminary set of topics related to the search constraint is identified.
- the first preliminary set of topics is representative of a sample set of general data items.
- the general data items could include a generic sampling of data items located across the Internet.
- a second preliminary set of topics related to the search constraint is identified.
- the second preliminary set of topics are representative of a sample set of current event data items.
- the sample set of current event data items are gathered by receiving feeds from current event websites, such as CNN.COM, MSN.COM, ESPN.COM and the like.
- the current event data items are updated periodically. In one embodiment periodic updates are a function of the subject matter. For example, sports information is updated every thirty minutes, financial information is updated every thirty minutes, health information is updated once a day and other news information is updated every two hours.
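- The subject-dependent update periods in this example could be captured in a small lookup table. The following is a minimal sketch only; the dictionary keys, the function name is_due and the use of a default for other news are assumptions, not part of the disclosure.

```python
from datetime import datetime, timedelta

# Update periods taken from the example in the text; key names are assumed.
UPDATE_INTERVALS = {
    "sports": timedelta(minutes=30),
    "financial": timedelta(minutes=30),
    "health": timedelta(days=1),
}
DEFAULT_INTERVAL = timedelta(hours=2)  # "other news information"


def is_due(subject, last_updated, now):
    """Return True when the feed for this subject should be refreshed."""
    interval = UPDATE_INTERVALS.get(subject, DEFAULT_INTERVAL)
    return now - last_updated >= interval


noon = datetime(2010, 1, 1, 12, 0)
print(is_due("sports", noon, noon + timedelta(minutes=30)))
print(is_due("health", noon, noon + timedelta(minutes=30)))
```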
- the current event data items database contains approximately 20,000 data items.
- a set of display topics is identified that is a subset of the first preliminary set of topics and the second preliminary set of topics.
- identifying a set of display topics includes selecting a certain number, referred to as the general topic threshold number, of topics from the first preliminary set of topics and selecting a certain number, referred to as the current event topic threshold number, of topics from the second preliminary set of topics. Additionally, in a further embodiment a certain number, referred to as the proper name topic threshold, of proper names from the second preliminary set of topics are also selected. In one embodiment, the proper names are randomly selected from a set of proper names contained within the second preliminary set of topics.
- a personal interest topic repository can be created.
- the personal interest topic repository includes topics that have been identified as relevant to a user. These topics, for example, may be topics associated with frequent searches conducted by a user, topics generated based on a personal profile, or topics that a user may have previously selected.
- step 940 can also include selecting a certain number, referred to as the personal interest topic threshold, of topics from the first preliminary set of topics.
- step 950 the set of display topics identified in step 940 is displayed.
- the topics may be displayed on a computer terminal, cell phone or other display device.
- step 960 method 900 ends.
- the topic display threshold is twenty topics. Of these twenty topics, six topics are identified from the current event topics, six proper names (which are considered topics) are also taken from the current event topics, and eight topics are identified from the general topics. Of the eight topics from the general topics, two of these are personal interest topics, when personal interest topics are available.
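- The twenty-topic example above can be sketched as a selection routine. This is a hedged illustration: the function name, parameter names and the assumption that each input list is already ordered by preference are hypothetical, and duplicates across the lists are not handled.

```python
import random


def select_display_topics(general, current, proper_names, personal_repo=frozenset(),
                          *, n_current=6, n_proper=6, n_general=8, n_personal=2,
                          rng=None):
    """Pick display topics using the thresholds from the example (6 + 6 + 8 = 20)."""
    rng = rng or random.Random()
    # Current event topics: take the first n_current (input assumed pre-ordered).
    current_pick = list(current)[:n_current]
    # Proper names: randomly selected from those found among current event topics.
    names = list(proper_names)
    proper_pick = rng.sample(names, min(n_proper, len(names)))
    # General topics: up to n_personal of the n_general slots go to personal
    # interest topics, when such topics are available.
    personal_pick = [t for t in general if t in personal_repo][:n_personal]
    rest = [t for t in general if t not in personal_pick]
    general_pick = personal_pick + rest[:n_general - len(personal_pick)]
    return current_pick + proper_pick + general_pick


general = [f"g{i}" for i in range(10)]
current = [f"c{i}" for i in range(10)]
names = [f"p{i}" for i in range(10)]
topics = select_display_topics(general, current, names, {"g7", "g9"})
print(len(topics))
```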
- the column labeled AUTOTOPICS displays the set of display topics.
- the topics include, for example, Franco Harris, Pittsburgh Post, and autographed photos.
- FIG. 11 provides a flowchart of method 1100 for displaying topics related to a search constraint entered by a user to request search results that identify data items within a set of information that are related to the search constraint.
- FIG. 10 will again be used.
- the screen shot of FIG. 10 is for illustrative purposes, and not intended to limit the scope of the invention.
- Method 1100 begins in step 1110 .
- step 1110 a search constraint is received. For example, referring to FIG. 10, the search constraint is “Pittsburgh Steelers.”
- identifying a set of topics includes conducting a search to generate search results.
- the search results include a set of data items.
- Example searches that can be used include searches using GOOGLE, YAHOO, MSN, ASK.COM and A9 search engines. Other types of search engines can also be used.
- a search can be conducted on a representative sample of data within the set of information that is of interest. For example, when searching the Internet a representative set of data items from the Internet can be used. In one embodiment the representative set of data items includes about 25 million data items.
- a search can be conducted on data items contained within a current events data item database.
- the sample set of current event data items are gathered by receiving feeds from current event websites, such as CNN.COM, MSN.COM, ESPN.COM and the like.
- the current event data items are updated periodically.
- periodic updates are a function of the subject matter. For example, sports information is updated every thirty minutes, financial information is updated every thirty minutes, health information is updated once a day and other news information is updated every two hours.
- the current event data items database contains approximately 20,000 current event data items.
- the set of topics can then be determined from the search results by extracting topics associated with each data item in the search results.
- the topification methods disclosed in the '026 Patent Application can be used to identify the set of topics from any of the above search results using general data items, representative data items and current event data items.
- topics can be generated from a combination of these or other source data items.
- FIG. 13 provides a flowchart of a method 1300 to rank topics into one of four general rankings.
- the highest ranking is assigned to a topic when the topic is a current topic and a personal interest topic.
- a topic is a current topic when it is found in the current event topics.
- a topic is a personal interest topic when it is found in the personal interest topic repository for a particular user.
- step 1320 the second highest ranking is assigned to a topic within the identified set of topics when the topic is a current topic.
- step 1330 the third highest ranking is assigned to a topic when the topic is a personal interest topic.
- step 1340 the fourth highest ranking is assigned to a topic when the topic is neither a current topic nor a personal interest topic.
- topics are further ranked based on their frequency of occurrence with search result data items. Those topics that occur least frequently among the data items are considered most relevant and given a higher ranking.
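- The four general rankings of method 1300, together with the frequency-based ordering within a ranking, could be sketched as follows. This is a minimal illustration; the function names and the dictionary of per-topic counts are assumptions introduced here.

```python
def tier(topic, current_topics, personal_topics):
    """Return 1 (highest) through 4 (lowest) following the four rankings."""
    is_current = topic in current_topics
    is_personal = topic in personal_topics
    if is_current and is_personal:
        return 1
    if is_current:
        return 2
    if is_personal:
        return 3
    return 4


def rank_topics(topic_counts, current_topics, personal_topics):
    """Order topics by ranking, then by ascending frequency of occurrence.

    topic_counts maps each topic to the number of search result data items
    in which it occurs; less frequent topics rank higher within a ranking.
    """
    return sorted(
        topic_counts,
        key=lambda t: (tier(t, current_topics, personal_topics), topic_counts[t]),
    )


counts = {"playoffs": 40, "Franco Harris": 5, "autographed photos": 12,
          "tickets": 30, "stadium": 3}
ranked = rank_topics(counts,
                     current_topics={"playoffs", "Franco Harris"},
                     personal_topics={"Franco Harris", "autographed photos"})
print(ranked)
```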
- FIG. 14 provides a diagram that graphically illustrates this process.
- Set 1410 represents the complete set of topics found in the data items in the search results.
- three subject matter domains are illustrated. These are subject matter domains 1420 , 1430 and 1440 .
- Subject matter domains include a collection of topics associated with the data items within the search results.
- subject matter domains include data items, such as data item 1450.
- Associated with data item 1450 will be one or more topics.
- Data items that have overlapping sets of topics, represented by the shaded area 1460 for subject matter domain 1430, are clustered together to form a subject matter domain.
- Subject matter domains will have some overlap, as indicated by overlap 1470 .
- the process of clustering includes clustering data items that have overlapping topics, and then creating subject matter domains based on clustering of data items that minimizes the overlap of topics across subject matter areas, such as overlap 1470 .
- Individuals skilled in the relevant arts will be able to apply statistical clustering methods to determine the optimal clustering.
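- One simple way to group data items with overlapping topics is a greedy pass using Jaccard similarity. The sketch below is a stand-in chosen for brevity, not the optimal statistical clustering referred to above; the function names, the 0.3 threshold and the sample data are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two topic sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_items(item_topics, threshold=0.3):
    """Greedily group data items into subject matter domains.

    item_topics maps a data item id to its set of topics. An item joins the
    most similar existing domain when similarity meets the threshold;
    otherwise it starts a new domain.
    """
    domains = []
    for item, topics in item_topics.items():
        best = max(domains, key=lambda d: jaccard(topics, d["topics"]), default=None)
        if best is not None and jaccard(topics, best["topics"]) >= threshold:
            best["items"].append(item)
            best["topics"] |= topics
        else:
            domains.append({"items": [item], "topics": set(topics)})
    return domains


items = {
    "d1": {"jaguar", "sedan", "dealer"},
    "d2": {"jaguar", "sedan", "engine"},
    "d3": {"jaguar", "rainforest", "predator"},
    "d4": {"jaguar", "predator", "habitat"},
}
for domain in cluster_items(items):
    print(domain["items"], sorted(domain["topics"]))
```

In this sample the two resulting domains still share the topic "jaguar", which corresponds to the kind of residual overlap indicated by overlap 1470.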
- the most representative topic for each subject matter domain is determined.
- the most representative topic is determined by identifying those topics within a subject matter domain that occur in more than some fraction of the distribution of data items (e.g., more than 90% of the data items) within the set of information.
- the most representative topic is then determined from this set of topics by identifying the topic for each subject matter domain with the highest current event and personal interest topic ranking. As necessary, the frequency of occurrence of the topics can be used to further rank the topics as discussed above.
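- The two-stage selection described for method 1100 might look like the sketch below. It assumes the fraction is measured against the data items of the domain itself, which is one reading of the text; the function name, parameters and the alphabetical final tie-break are hypothetical.

```python
def representative_topic(domain_items, current_topics, personal_topics,
                         min_fraction=0.9):
    """Pick the most representative topic for one subject matter domain.

    domain_items is a list of topic sets, one per data item in the domain.
    """
    counts = {}
    for topics in domain_items:
        for t in topics:
            counts[t] = counts.get(t, 0) + 1
    # Stage 1: keep topics occurring in more than min_fraction of the items.
    cutoff = min_fraction * len(domain_items)
    candidates = [t for t, c in counts.items() if c > cutoff]

    # Stage 2: prefer the best current event / personal interest ranking,
    # then the lower frequency of occurrence, then alphabetical order.
    def key(t):
        ranking = {(True, True): 1, (True, False): 2,
                   (False, True): 3, (False, False): 4}
        return (ranking[(t in current_topics, t in personal_topics)], counts[t], t)

    return min(candidates, key=key, default=None)


items = [{"jaguar", "recall", "dealer"}] * 5 + [{"jaguar", "recall"}] * 5
print(representative_topic(items, current_topics={"recall"}, personal_topics=set()))
```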
- step 1160 the most representative topic for each subject matter domain is displayed.
- step 1170 the highest ranked topics not previously displayed are displayed.
- step 1180 method 1100 ends.
- FIG. 12 provides a flowchart of method 1200 for displaying topics related to a search constraint entered by a user to request search results that identify data items within a set of information that are related to the search constraint.
- FIG. 10 will again be used.
- the screen shot of FIG. 10 is for illustrative purposes, and not intended to limit the scope of the invention.
- Method 1200 begins in step 1210 .
- step 1210 a search constraint is received.
- the search constraint is “Pittsburgh Steelers.”
- identifying a set of topics includes conducting a search to generate search results.
- the search results include a set of data items.
- Example searches that can be used include searches using GOOGLE, YAHOO, MSN, ASK.COM and A9 search engines. Other types of search engines can also be used.
- a search can be conducted on a representative sample of data within the set of information that is of interest. For example, when searching the Internet a representative set of data items from the Internet can be used. In one embodiment the representative set of data items includes about 25 million data items.
- a search can be conducted on data items contained within a current event data item database.
- the sample set of current event data items are gathered by receiving feeds from current event websites, such as CNN.COM, MSN.COM, ESPN.COM and the like.
- the current event data items are updated periodically.
- periodic updates are a function of the subject matter. For example, sports information is updated every thirty minutes, financial information is updated every thirty minutes, health information is updated once a day and other news information is updated every two hours.
- the current event data items database contains approximately 20,000 current event data items.
- the set of topics can then be determined from the search results by extracting topics associated with each data item in the search results.
- the topification methods disclosed in the '026 patent application can be used to identify the set of topics from any of the above search results using general data items, representative data items and current event data items.
- topics can be generated from a combination of these or other source data items.
- subject matter domains associated with the set of topics are created.
- FIG. 14 provides a diagram that graphically illustrates this process.
- Set 1410 represents the complete set of topics found in the data items in the search results.
- three subject matter domains are illustrated. These are subject matter domains 1420 , 1430 and 1440 .
- Subject matter domains include a collection of topics associated with the data items within the search results.
- subject matter domains include data items, such as data item 1450.
- Associated with data item 1450 will be one or more topics.
- Data items that have overlapping sets of topics, represented by the shaded area 1460 for subject matter domain 1430, are clustered together to form a subject matter domain.
- Subject matter domains will have some overlap, as indicated by overlap 1470 .
- the process of clustering includes clustering data items that have overlapping topics, and then creating subject matter domains based on clustering of data items that minimizes the overlap of topics across subject matter areas, such as overlap 1470 .
- Individuals skilled in the relevant arts will be able to apply statistical clustering methods to determine the optimal clustering.
- the most representative topic for each subject matter domain is determined.
- the most representative topic is determined by identifying those topics within a subject matter domain that occur in more than some fraction of the distribution of data items (e.g., more than 90% of the data items) within the set of information.
- the most representative topic is then determined from this set of topics by identifying the topic for each subject matter domain that has the least frequent number of occurrences in the search result data items.
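- For method 1200, the final choice favors the candidate topic that is rarest across all search result data items. A minimal sketch follows, again assuming the fraction is measured within the domain; the function name and sample data are illustrative assumptions.

```python
from collections import Counter


def representative_by_rarity(domain_items, all_items, min_fraction=0.9):
    """Pick the domain's candidate topic with the fewest occurrences overall.

    domain_items and all_items are lists of topic sets, one per data item.
    """
    domain_counts = Counter(t for topics in domain_items for t in topics)
    overall_counts = Counter(t for topics in all_items for t in topics)
    cutoff = min_fraction * len(domain_items)
    candidates = [t for t, c in domain_counts.items() if c > cutoff]
    # Ties are broken alphabetically (an assumption for determinism).
    return min(candidates, key=lambda t: (overall_counts[t], t), default=None)


autos = [{"jaguar", "sedan"}] * 3
animals = [{"jaguar", "predator"}] * 3
everything = autos + animals
print(representative_by_rarity(autos, everything))
print(representative_by_rarity(animals, everything))
```

Because "jaguar" appears in every data item, it does not distinguish the domains; the rarer topics "sedan" and "predator" are selected instead.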
- step 1250 the most representative topic for each subject matter domain is displayed.
- step 1250 method 1200 ends.
- the set of topics identified that are related to the search constraint can be ranked as was done in step 1130 in method 1100 . Based on these rankings, additional topics can be displayed as was done in step 1170 in method 1100 .
- a programmable control device can include, but is not limited to a personal computer, a laptop computer, a network computer, a wireless telephone, a personal data assistant (“PDA”) and the like.
- programmable control device comprises computer system 1505 that includes central processing unit 1510 , storage 1515 , network interface card 1520 for coupling computer system 1505 to network 1525 , display unit 1530 , keyboard 1535 and mouse 1540 .
- a programmable control device may be a multiprocessor computer system or a custom designed state machine.
- Custom designed state machines may be embodied in a hardware device such as a printed circuit board comprising discrete logic, integrated circuits, or specially designed Application Specific Integrated Circuits (ASICs).
- Storage devices, such as device 1515, suitable for tangibly embodying program module(s) 1500 include all forms of non-volatile memory including, but not limited to: semiconductor memory devices such as Electrically Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and flash devices; magnetic disks (fixed, floppy, and removable); other magnetic media such as tape; and optical media such as CD-ROM disks.
Abstract
Description
- The present application is a continuation of U.S. patent application Ser. No. 11/712,557, filed Mar. 1, 2007, which is a continuation-in-part of U.S. patent application Ser. No. 10/086,026, filed Feb. 26, 2002, U.S. Pat. No. 7,340,466. U.S. patent application Ser. No. 11/712,557 also claims priority to U.S. Provisional Patent Application No. 60/777,576, filed Mar. 1, 2006. These applications are incorporated by reference herein in their entirety and for all purposes.
- 1. Field of the Invention
- The present invention relates to search engines, and more particularly, to search engine methods and systems that provide relevant and timely topics.
- 2. Background of Invention
- The world economic order is shifting from one based on manufacturing to one based on the generation, organization and use of information. To successfully manage this transition, organizations must collect and classify vast amounts of data so that it may be searched and retrieved in a meaningful manner. Traditional techniques to classify data may be divided into four approaches: (1) manual; (2) unsupervised learning; (3) supervised learning; and (4) hybrid approaches.
- Manual classification relies on individuals reviewing and indexing data against a predetermined list of categories. For example, the National Library of Medicine's MEDLINE® (Medical Literature, Analysis, and Retrieval System Online) database of journal articles uses this approach. While manual approaches benefit from the ability of humans to determine what concepts the data represent, they also suffer from the drawbacks of high cost, human error and a relatively low rate of processing. Unsupervised classification techniques rely on computer software to examine the content of data to make initial judgments as to which classification the data belongs to. Many unsupervised classification technologies rely on Bayesian clustering algorithms. While reducing the cost of analyzing large data collections, unsupervised learning techniques often return classifications that have no obvious basis in the underlying business or technical aspects of the data.
- This disconnect between the data's business or technical framework and the derived classifications makes it difficult for users to effectively query the resulting classifications. Supervised classification techniques attempt to overcome this drawback by relying on individuals to “train” the classification engines so that derived classifications more closely reflect what a human would produce.
- Illustrative supervised classification technologies include semantic networks and neural networks. While supervised systems generally derive classifications more attuned to what a human would generate, they often require substantial training and tuning by expert operators and, in addition, often rely for their results on data that is more consistent or homogeneous than is often possible to obtain in practice. Hybrid systems attempt to fuse the benefits of manual classification methods with the speed and processing capabilities employed by unsupervised and supervised systems. In known hybrid systems, human operators are used to derive “rules of thumb” which drive the underlying classification engines.
- No known data classification approach provides a fast, low-cost and substantially automated means to classify large amounts of data that is consistent with the semantic content of the data itself. Thus, it would be beneficial to provide a mechanism to determine a collection of topics that are explicitly related to both the domain of interest and the data corpus analyzed. Commonly owned, co-pending U.S. patent application Ser. No. 10/086,026, entitled Topic Identification and Use Thereof in Information Retrieval Systems, filed on Feb. 26, 2002 by Paul Odom, provides such a mechanism.
- At the same time, the emergence of the Information Age has created a wealth of information that is available electronically. Unfortunately, much of this information is often inaccessible to individuals because they do not know where to look for it, or, if they do know where to look, the information cannot be found efficiently. For example, an individual is working at his desk and his boss requests that he find an electronic copy of a memo that the individual sent last month. The memo contains information that was obtained from a website, which included a spreadsheet that had data extracted from a division report.
- The boss would like the individual to send a copy of the email and the references back to him as soon as possible. Also, he would like the individual to check for additional references to see if the conclusions in the memo need to be updated. The boss requires that the project be completed within fifteen minutes. The worker is not disorganized, but as is common, does not have total recall of how the information was gathered or where the email is stored. After thirty minutes, the worker finally finds the email. But, the worker still needs to search for additional information as requested by his boss. The end result is that because no efficient search mechanism existed the worker has missed his boss' deadline.
- The above example commonly occurs within the workplace, and involves not just email, but all forms of electronically stored information. Human worker studies show that it is not unusual for some office workers to spend more than 10% of each work day looking for information. The same studies claim that fewer than half of those searches are successful. Databases, data warehouses, document management systems, and file searches are often too difficult or “hit and miss” to be used effectively and efficiently. Corporate enterprises and government organizations have spent billions of dollars to aggregate and integrate information, so it will be more accessible. Of course, an individual can get answers if he is a database or document system expert and if the individual remembers the exact title, the exact phrasing used in the document, or the ever elusive primary key associated with the document of interest. Unfortunately, more often than not, this level of detail is not available to assist in finding the information.
- Internet based searches are often times even more frustrating, and less productive. For example, it is not particularly useful when you know that there are approximately 6,120,000 answers to the search criteria you just entered. Ads associated with search engines are also often frustratingly irrelevant to a search and therefore of little interest to the users and of minimal value to the advertiser. The search engine ads try to identify promising content to be associated with. Unfortunately, these are often not very relevant either. For example, you entered “plasma injectors” and you get several ads for plasma televisions. Individuals have learned that keyword ads are not usually very useful, so individuals often completely ignore keyword ads.
- Furthermore, because website popularity has nothing to do with what might be relevant in the thousands of search results, search results driven by website popularity can often lead to useless results. Meanwhile, at a search engine operations facility there is an army of personnel and massive server farms humming away to potentially deliver hundreds of thousands of results to every search query that an individual enters.
- Web searching, search advertising, and enterprise searching are not consistently providing acceptable search resolution for the user. The missing ingredient in current search technology is “true relevance”. Relevance can only be defined by the user for a specific search. Relevancy has no predictable pattern. No generalized algorithm is going to repeatably produce relevant information, because in the end, any generalization is arbitrary.
- What has occurred, so far in the industry, is a fragmentation of search applications as vendors try to address niche search markets in an attempt to improve relevancy by narrowing the domain. For example, sites that are product specific, area-of-interest specific, group specific, or subject specific, have all been implemented. So far, there have been no successful generalized search applications that consistently provide high levels of relevancy.
- Present search and topification algorithms generally assume that topics are relatively static. However, the relevance of topics to a particular search query is not only based on what appears in the content of the query, but the relevance can also be a function of current events. For example, if an individual had conducted a search of the Internet in January 2006 using the search string “NFL,” then one would expect the topics Denver vs. Pittsburgh and Charlotte vs. Seattle to be of interest, since these were the team pairings in the American Football Conference and National Football Conference championship games. This set of topics is time sensitive to the playoffs. While a search engine may have these topics in its database, these topics would be part of tens of thousands of possible topic results for a query using the term “NFL.” During the January 2006 time frame, the “Denver vs. Pittsburgh” and “Charlotte vs. Seattle” topics would likely be very meaningful topic results. Unfortunately, search engines do not directly factor in time relevancy, and these topics would be mixed in with the tens of thousands of other possible topic results. Thus, a user would likely not receive search results as relevant as would be desired.
- Another shortcoming of current search engines that display topics or search results is that search engines do not display topics associated with every subject matter domain related to a search constraint entered by a user. Rather, a search engine may only show search results or topics that are most popular without regard to different subject matter domains that search results may belong to. For example, when a user enters the search constraint Jaguar, the data items belonging to the search results may include topics that correspond to subject matter domains that include autos (e.g., there is a car named Jaguar), animals (e.g., there is an animal called Jaguar), software (e.g., there is a software package referred to as Jaguar), resorts (e.g., there are resorts in South America referred to as Jaguar resorts), football (e.g., there is a football team referred to as the Jacksonville Jaguars) and game (e.g., there is a game referred to as Jaguar). Those search engines that provide results based only on popularity of website hits might only display topics or search results associated with the subject matter domain Auto. Or, at the very least, items associated with Resorts would be on page 27 of the search results. More often than not, a user probably would be looking for data items in the subject matter domain Auto. However, a reasonable proportion of users may also be interested in other domains that may be less popular. For these users, the search results displayed would not be particularly relevant and their specific areas of interest difficult to find. Thus, a user once again may not receive search results relevant to their particular area of interest.
- What are needed are search methods and systems that can efficiently generate search results to identify and display topics by considering, at any given time, the relative significance of a topic based on current events and that ensure coverage of all subject matter domains associated with a search constraint.
- The present invention provides search engine methods and systems for displaying relevant and timely topics.
- Further embodiments, features, and advantages of the invention, as well as the structure and operation of the various embodiments of the invention are described in detail below with reference to accompanying drawings.
- The present invention is described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. The drawing in which an element first appears is indicated by the left-most digit in the corresponding reference number.
- FIG. 1 is a flowchart of a method to identify topics in a corpus of data in accordance with one embodiment of the invention.
- FIG. 2 is a flowchart of a method to generate a domain specific word list in accordance with one embodiment of the invention.
- FIG. 3 is a flowchart of a method to identify topics in a corpus of data in accordance with one embodiment of the invention.
- FIG. 4 is a flowchart of a method to measure actual usage of significant words in a corpus of data in accordance with one embodiment of the invention.
- FIG. 5 is a flowchart of a topic refinement process in accordance with one embodiment of the invention.
- FIG. 6 is a flowchart of a topic identification method in accordance with one embodiment of the invention.
- FIG. 7 is a flowchart of one method in accordance with the invention to identify those topics for display during a user query operation.
- FIG. 8 is a diagram that shows enterprise information sources.
- FIG. 9 is a flowchart of a method for displaying topics, according to an embodiment of the invention.
- FIG. 10 provides a screen shot of a search engine web site, according to an embodiment of the invention.
- FIG. 11 is a flowchart of a method for displaying topics, according to an embodiment of the invention.
- FIG. 12 is a flowchart of a method for displaying topics, according to an embodiment of the invention.
- FIG. 13 is a flowchart of a method to rank topics into one of four general rankings, according to an embodiment of the invention.
- FIG. 14 is a diagram that illustrates topic clustering, according to an embodiment of the invention.
- FIG. 15 is a block diagram of a system, according to an embodiment of the invention.
- While the present invention is described herein with reference to illustrative embodiments for particular applications, it should be understood that the invention is not limited thereto. Those skilled in the art with access to the teachings provided herein will recognize additional modifications, applications, and embodiments within the scope thereof and additional fields in which the invention would be of significant utility.
- Techniques (methods and devices) to generate domain specific topics for a corpus of data are described. Other techniques (methods and devices) to associate the generated topics with individual documents, or portions thereof, for use in electronic search actions are also described. The following embodiments of the inventive techniques are illustrative only and are not to be considered limiting in any respect.
- In one embodiment of the invention, a collection of topics is determined for a first corpus of data, wherein the topics are domain specific, based on a statistical analysis of the first data corpus, and substantially automatically generated. In another embodiment of the invention, the topics may be associated with each “segment” of a second corpus of data, wherein a segment is a user-defined quantum of information. Example segments include, but are not limited to, sentences, paragraphs, headings (e.g., chapter headings, titles of manuscripts, titles of brochures and the like), chapters and complete documents. Data comprising the data corpus may be unstructured (e.g., text) or structured (e.g., spreadsheets and database tables). In yet another embodiment of the invention, topics may be used during user query operations to return a result set based on a user's query input.
- Referring to FIG. 1, one method in accordance with the invention uses domain specific word list 100 as a starting point from which to analyze data 105 (block 110) to generate domain specific topic list 115. Once generated, topic list 115 entries may be associated with each segment of data 105 (block 120) and stored in database 125 where it may be queried by user 135 through user interface 130. Word list 100 may comprise a list of words or word combinations that are meaningful to the domain from which data 105 is drawn. For example, if data 105 represents medical documents then word list 100 may be those words that are meaningful to the medical field or those subfields within the field of medicine relevant to data 105. Similarly, if data 105 is drawn from the accounting, corporate governance, or the oil processing and refining business, word list 100 will comprise words that hold particular importance to those fields. Data 105 may be substantially any form of data, structured or unstructured. In one embodiment, data 105 comprises unstructured text files such as medical abstracts and/or articles. In another embodiment, data 105 comprises books, newspapers, magazine content or a combination of these sources. In still another embodiment, data 105 comprises structured data such as design documents and spreadsheets describing an oil refinery process. In yet other embodiments, data 105 comprises content tagged image data, video data and/or audio data. In still another embodiment, data 105 comprises a combination of structured and unstructured data. Data 105 may also include data gathered from across a network, such as the Internet.
- Acts in accordance with block 110 use word list 100 entries to statistically analyze data 105 on a segment-by-segment basis. In one embodiment, a segment may be defined as a sentence and/or heading and/or title. In another embodiment, a segment may be defined as a paragraph and/or heading and/or title. In yet another embodiment, a segment may be defined as a chapter and/or heading and/or title. In still another embodiment, a segment may be defined as a complete document and/or heading and/or title. Other definitions may be appropriate for certain types of data and, while different from those enumerated here, would be obvious to one of ordinary skill in the art. For example, headings and titles may be excluded from consideration. It is noted that only a portion of data 105 need be analyzed in accordance with block 110. That is, a first portion of data 105 may be used to generate topic list 115, with the topics so identified being associated with the entire corpus of data during the acts of block 120.
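- As a rough illustration of segment-by-segment analysis, the sketch below treats each sentence as a segment and counts occurrences of word list entries per segment. The statistical analysis actually performed in block 110 is not specified in this passage, so the counting step, the function names and the sample text are all assumptions.

```python
import re
from collections import Counter


def split_segments(text):
    """Treat each sentence as a segment (one of the definitions mentioned)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_list_usage(text, word_list):
    """Return, per segment, a Counter of the word list entries it contains."""
    vocab = {w.lower() for w in word_list}
    usage = []
    for segment in split_segments(text):
        tokens = re.findall(r"[a-z0-9]+", segment.lower())
        usage.append(Counter(t for t in tokens if t in vocab))
    return usage


text = ("Aspirin reduces fever. The patient took aspirin and ibuprofen. "
        "No change was seen.")
for counts in word_list_usage(text, ["aspirin", "ibuprofen", "fever"]):
    print(dict(counts))
```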
TABLE 1 Example Data
By way of example only, in one embodiment data 105 comprises the text of approximately 12 million abstracts from the Medline ® data collection. These abstracts include approximately 2.8 million unique words, representing approximately 40 Gigabytes of raw data. MEDLINE ® (Medical Literature, Analysis, and Retrieval System Online) is the U.S. National Library of Medicine's (NLM) bibliographic database of journal articles covering basic biomedical research and the clinical sciences including: nursing, dentistry, veterinary medicine, pharmacy, allied health, pre-clinical sciences, environmental science, marine biology, plant and animal science, biophysics and chemistry. The database contains bibliographic citations and author abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries. Medline ® is searchable at no cost from the NLM's web site at http://www.nlm.nih.gov. - Referring to
FIG. 2, in one embodiment of the invention word list 100 may be generated by first compiling a preliminary list of domain-specific words 200 and then pruning from that list those entries that do not significantly and/or uniquely identify concepts or topics within the target domain (block 205). Preliminary list 200 may, for example, be comprised of words from a dictionary, thesaurus, glossary, domain-specific word list or a combination of these sources. For example, the Internet may be used to obtain preliminary word lists for virtually any field. Words removed in accordance with block 205 may include standard STOP words as illustrated in Table 2. (One of ordinary skill in the art will recognize that other STOP words may be used.) In addition, it may be beneficial to remove words from preliminary word list 200 that are not unique to the larger domain. For example, while the word “reservoir” has a particular meaning in the field of oil and gas development, it is also a word of common use. Accordingly, it may be beneficial to remove this word from a word list specific to the oil and gas domain. In one embodiment, a general domain word list may be created that comprises those words commonly used in English (or another language), including those that are specific to a number of different domains. This “general word list” may be used to prune words from a preliminary domain-specific word list. In another embodiment, some common words removed as a result of the general word list pruning just described may be added back into preliminary word list 200 because, while used across a number of domains, they have a particular importance in the particular domain. -
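The pruning of block 205 can be read as a sequence of set operations. The following is a minimal Python sketch under that reading, not the patented implementation; the function name `prune_word_list`, its parameters, and the sample words are hypothetical and chosen only for illustration.

```python
def prune_word_list(preliminary, stop_words, general_words, keep_anyway=frozenset()):
    """Return a pruned, domain-specific word list (hypothetical helper)."""
    # Normalize case so that "Reservoir" and "reservoir" are treated alike.
    candidates = {w.lower() for w in preliminary}
    # Remove STOP words and words in common use across many domains.
    pruned = candidates - set(stop_words) - set(general_words)
    # Add back common words that carry particular importance in this domain.
    pruned |= {w.lower() for w in keep_anyway} & candidates
    return sorted(pruned)


# Illustrative data only; a real preliminary list would hold many more entries.
preliminary_list = ["enzyme", "the", "reservoir", "kyphoplasty", "view", "immune"]
stop_words = {"the", "view"}
general_words = {"reservoir", "immune"}

word_list = prune_word_list(preliminary_list, stop_words, general_words,
                            keep_anyway={"immune"})
print(word_list)  # ['enzyme', 'immune', 'kyphoplasty']
```

Here “immune” is first removed by the general word list and then restored through `keep_anyway`, mirroring the add-back embodiment described above.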
TABLE 2 Example Stop Words
a, about, affect, after, again, all, along, also, although, among, an, and, another, any, anything, are, as, at, be, became, because, been, before, both, but, by, can, difference, each, even, ever, every, everyone, for, from, great, had, has, have, having, he, hence, here, his, how, however, I, if, in, inbetween, into, is, it, its, join, keep, last, lastly, let, many, may, me, more, most, much, next, no, none, not, nothing, now, of, on, only, or, other, our, pause, quickly, quietly, relationship, relatively, see, she, should, since, so, some, somebody, someone, something, sometimes, successful, successfully, such, take, than, that, the, their, there, these, they, this, those, thus, to, unusual, upon, us, use, usual, view, was, we, went, what, when, whence, where, whether, which, while, who, whose, will, with, within, without, yes, yet, you, your -
TABLE 3 Example Word List
For the data set identified in Table 1, preliminary word list 200 was derived from the Unified Medical Language System Semantic Network (see http://www.nlm.nih.gov/databases/leased.html#umls) and included 4,000,000 unique single-word entries. Of these, roughly 3,945,000 were removed in accordance with block 205. Accordingly, word list 100 comprised approximately 55,000 one word entries. Example word list 200 entries for the medical domain include: abdomen, biotherapy, chlorided, distichiasis, enzyme, enzymes, freckle, gustatory, immune, kyphoplasty, laryngectomy, malabsorption, nebulize, obstetrics, pancytopenia, quadriparesis, retinae, sideeffect, tonsils, unguium, vermicular, womb, xerostomia, yersinia, and zygote. - Conceptually,
word list 100 provides an initial estimation of domain-specific concepts/topics. Analysis in accordance with the invention beneficially expands the semantic breadth of word list 100, however, by identifying word collections (e.g., pairs and triplets) as topics (i.e., topic list 115). Once topics are identified, each segment in data 105 may be associated with those topics (block 120) that exist in that segment. Accordingly, if a corpus of data comprises information from a plurality of domains, analysis in accordance with FIG. 1 may be run multiple times, each time with a different word list 100. (Alternatively, each segment may be analyzed for each domain list before a next segment is analyzed.) In this manner, undifferentiated data (i.e., data not identified as belonging to one or another specific domain) may be automatically analyzed and “indexed” with topics. It is noted that word list 100 may be unique for each target domain but, once developed, may be used against multiple data collections in that field. Thus, it is beneficial to refine the contents of word list 100 for each domain so as to make the list as domain-specific as possible. It has been empirically determined that tightly focused domain-specific word lists yield a more concise collection of topics which, in turn, provide improved search results (see discussion below). -
FIG. 3 illustrates one method in accordance with the invention to identify topics (block 110 of FIG. 1) in data 105 using word list 100 as a starting point. Initially, data 105 (or a portion thereof) is analyzed on a segment-by-segment basis to determine the actual usage of significant words and word combinations (block 300). A result of this initial step is preliminary topic list 305. Next, an expected value for each entry in preliminary topic list 305 is computed (block 310) and compared with the actual usage value determined during block 300 (block 315). If the measured actual usage of a preliminary topic list entry is significantly greater than the computed expected value of the entry (the “yes” prong of block 315), that entry is added to topic list 115 (block 320). If the measured actual usage of a preliminary topic list entry is not significantly greater than the computed expected value of the entry (the “no” prong of block 315), that entry is not added to topic list 115. The acts of these blocks are repeated until all preliminary topic list 305 entries have been reviewed (the “yes” prong of block 325). -
TABLE 4 Example Topic List
For the data set identified in Tables 1 and 3, 10 of the 35 Gigabytes were used to generate topic list 115. In accordance with FIG. 3, topic list 115 comprised approximately 506,000 entries. In one embodiment, each of these entries is a double word entry. Illustrative topics identified for Medline ® abstract content in accordance with the invention include: adenine nucleotide, heart disease, left ventricular, atria ventricles, heart failure, muscle, heart rate, fatty acids, loss bone, patient case, bone marrow, and arterial hypertension. - As shown in
FIG. 4, one method to measure the actual usage of significant words in data 105 (block 300) is to determine three statistics for each entry in word list 100: S1 (block 400); S2 (block 405); and S3 (block 410). In general, statistics S1, S2 and S3 measure the actual frequency of usage of various words and word combinations in data 105 at the granularity of the user-defined segment. More specifically: - Statistic S1 (block 400) is a segment-level frequency count for each entry in
word list 100. - For example, if a segment is defined as a paragraph, then the value of S1 for word-i is the number of unique paragraphs in
data 105 in which word-i is found. - An S1 value may also be computed for
non-word list 100 words if they are identified as part of a word combination as described below with respect to statistic S2. - Statistic S2 (block 405) is a segment-level frequency count for each significant word combination in
data 105. Those word combinations having a non-zero S2 value may be identified as preliminary topics 305. In one embodiment, a “significant word combination” comprises any two entries in word list 100 that are in the same segment. In another embodiment, a “significant word combination” comprises any two entries in word list 100 that are in the same segment and contiguous. In still another embodiment, a “significant word combination” comprises any two entries in word list 100 that are in the same segment and contiguous or separated only by one or more STOP words. In yet another embodiment, a “significant word combination” comprises any two words that are in the same segment and contiguous or separated only by one or more STOP words where at least one of the words in the word combination is in word list 100. In still another embodiment, a “significant word combination” comprises a two or more word combination appearing in any data item within data 105. In this embodiment, word list 100 would not be used. In general, a “significant word combination” comprises any two or more words that are in the same segment and separated by ‘N’ or fewer specified other words: N may be zero or more; and the specified words are typically STOP words. As a practical matter, word combinations comprising non-word list 100 words may be ignored if they appear in fewer than a specified number of segments in data 105 (e.g., fewer than 10 segments). - For example, if a segment is defined as a paragraph, then the value of S2 for word-combination-i is the number of unique paragraphs in
data 105 in which word-combination-i is found. - Statistic S3 (block 410) indicates the number of unique word combinations (identified by having non-zero S2 values, for example) each word in
word list 100 was found in. - For example, if word-z is only a member of word-combination-i, word-combination-j and word-combination-k and the S2 statistic for each of word-combination-i, word-combination-j and word-combination-k is non-zero, then word-z's S3 value is 3.
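The three statistics can be illustrated with a short Python sketch. It assumes the embodiment in which a significant word combination is two word list entries that are contiguous or separated only by STOP words; the function name `usage_statistics`, the simple whitespace tokenizer, the treatment of pairs as ordered, and the sample segments are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def usage_statistics(segments, word_list, stop_words):
    """Compute S1, S2 and S3 over a list of text segments (illustrative sketch)."""
    s1 = Counter()  # word -> number of segments containing the word
    s2 = Counter()  # (word, word) -> number of segments containing the pair
    for segment in segments:
        tokens = [t.strip(".,;:()").lower() for t in segment.split()]
        # Dropping STOP words makes words separated only by STOP words adjacent.
        content = [t for t in tokens if t and t not in stop_words]
        pairs = {
            (a, b)
            for a, b in zip(content, content[1:])
            if a in word_list and b in word_list
        }
        # Each word and each pair is counted at most once per segment.
        s1.update({t for t in content if t in word_list})
        s2.update(pairs)
    # S3: number of distinct pairs (non-zero S2) in which each word takes part.
    members = defaultdict(set)
    for pair in s2:
        for word in pair:
            members[word].add(pair)
    s3 = {word: len(found) for word, found in members.items()}
    return s1, s2, s3


# Illustrative segments; each string stands for one segment (e.g., a sentence).
segments = [
    "Heart failure was observed; the heart rate fell.",
    "The heart rate of the patient increased.",
    "Bone marrow findings preceded heart failure.",
]
word_list = {"heart", "failure", "rate", "bone", "marrow"}
stop_words = {"was", "the", "of"}

s1, s2, s3 = usage_statistics(segments, word_list, stop_words)
print(s1["heart"])            # 3
print(s2[("heart", "rate")])  # 2
print(s3["heart"])            # 2
```

In this sample, “heart” appears in all three segments (S1 of 3) and takes part in two distinct pairs, “heart failure” and “heart rate” (S3 of 2).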
- One method to compute the expected usage of significant words in data 105 (block 310) is to calculate the expected value for each
preliminary topic list 305 entry based only on its overall frequency of use in data 105. In one embodiment, the expected value for each word pair in preliminary topic list 305 may be computed as follows: -
{S1(word-i)×S1(word-j)}÷N - where S1(word-i) and S1(word-j) represent the S1 statistic values for word-i and word-j respectively, and N represents the total number of segments in the data corpus being analyzed. One of ordinary skill in the art will recognize that the equation above may be easily extended to word combinations having more than two words.
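As a worked illustration of the expression above, the following Python sketch computes the expected usage of a word pair and compares it with a measured S2 value. The counts are hypothetical, and the comparison rule (measured usage at least `multiplier` times the expected usage, with a default of 2.0) is an assumed stand-in for the "significantly greater" test rather than a prescribed one.

```python
def expected_pair_usage(s1_i, s1_j, n_segments):
    """Expected segment count for a word pair: S1(word-i) x S1(word-j) / N."""
    return (s1_i * s1_j) / n_segments


def is_topic(actual_s2, s1_i, s1_j, n_segments, multiplier=2.0):
    """Assumed test: measured usage is at least `multiplier` times expected usage."""
    return actual_s2 >= multiplier * expected_pair_usage(s1_i, s1_j, n_segments)


# Hypothetical counts: 1,000 segments, word-i in 50 of them, word-j in 40.
expected = expected_pair_usage(50, 40, 1000)
print(expected)                    # 2.0
print(is_topic(12, 50, 40, 1000))  # True: 12 >= 2 * 2.0
print(is_topic(3, 50, 40, 1000))   # False: 3 < 2 * 2.0
```

The expression equals the number of segments in which both words would be expected to appear if each word were spread across the N segments without regard to the other, which is why a measured count well above it suggests a meaningful association.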
- Referring again to
FIG. 3, with measured and computed usage values it is possible to determine which entries in preliminary topic list 305 are suitable for identifying topics within data 105. In one embodiment, the test (block 315) of whether a topic's measured usage (block 300) is significantly greater than the topic's expected usage (block 310) uses a constant multiplier. For example, if the measured usage of preliminary topic list entry-i is twice entry-i's expected usage, preliminary topic list entry-i may be added to topic list 115 in accordance with block 320. In another embodiment of the invention, if the measured usage of preliminary topic list entry-i is greater than a threshold value (e.g., 10) across all segments, then that preliminary topic list entry is selected as a topic. One of ordinary skill in the art will recognize alternative tests may also be used. For example, a different multiplier may be used (e.g., 1.5 or 3). Additionally, conventional statistical tests of significance may be used. - In one embodiment,
topic list 115 may be refined in accordance with FIG. 5. (For convenience, this refinement process will be described in terms of two-word topics. One of ordinary skill in the art will recognize that the technique is equally applicable to topics having more than two words.) As shown, a first two word topic is selected (block 500). If both words comprising the topic are found in word list 100 (the “yes” prong of block 505), the two word topic is retained (block 510). If the two words comprising the topic are not both found in word list 100 (the “no” prong of block 505), but the S3 value for that word which is in word list 100 is not significantly less than the S3 value for the other word (the “yes” prong of block 515), the two word topic is retained (block 510). If, on the other hand, one of the topic's words is not in word list 100 (the “no” prong of block 505) and the S3 value for that word which is in word list 100 is significantly less than the S3 value for the other word (the “no” prong of block 515), only the low S3 value word is retained in topic list 115 as a topic (block 520). The acts of blocks 500-520 are repeated as necessary for each two word topic in topic list 115 (see block 525). In one embodiment, the test for significance (block 515) is based on whether the “high” S3 value is in the upper one-third of all S3 values and the “low” S3 value is in the lower one-third of all S3 values. For example, if the S3 statistic for a corpus of data has a range of zero to 12,000, a “low” S3 value is less than or equal to 4,000 and a “high” S3 value is greater than or equal to 8,000. In another embodiment, the test for significance in accordance with block 515 may be based on quartiles, quintiles or Bayesian tests. Refinement processes such as that outlined in FIG. 5 acknowledge word associations within data, while ignoring individual words that are so prevalent alone (high S3 value) as to offer substantially no differentiation as to content. - Referring again to
FIG. 1, once topic list 115 is established, each segment in data 105 may be associated with those topics which exist within it (block 120) and stored in database 125. Topics may be associated with a data segment in any desired fashion. For example, topics found in a segment may be stored as metadata for the segment. In addition, stored topics may be indexed for improved retrieval performance during subsequent lookup operations. Empirical studies show that the large majority of user queries are “under-defined.” That is, the query itself does not identify any particular subject matter with sufficient specificity to allow a search engine to return the user's desired data in a result set (i.e., that collection of results presented to the user) that is acceptably small. A typical user query may be a single word such as, for example, “kidney.” In response to under-defined queries, prior art search techniques generally return large result sets, often containing thousands, or tens of thousands, of “hits.” Such large result sets are almost never useful to users, who do not have the time to go through every entry to find the one having the information they seek. - In one embodiment, topics associated with data segments in accordance with the invention may be used to facilitate data retrieval operations as shown in
FIG. 6 . When a user query is received (block 600) it may be used to generate an initial result set (block 605) in a conventional manner. For example, a literal text search of the query term may identify 100,000 documents (or objects stored in database 125) that contain the search term. From this initial result set, a subset may be selected for analysis in accordance with topics (block 610). In one embodiment, the subset is a randomly chosen 1% of the initial result set. In another embodiment, the subset is a randomly chosen 1,000 entries from the initial result set. In yet another embodiment, a specified number of entries are selected from the initial result set (chosen in any manner desired). While the number of entries in the initial result subset may be chosen in substantially any manner desired, it is preferable to select at least a number that provides “coverage” (in a statistical sense) for the initial result set. In other words, it is desirable that the selected subset mirror the initial result set in terms of topics. With an appropriately chosen result subset, the most relevant topics associated with those results may be identified (block 615) and displayed to the user (block 620). -
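The subset selection of block 610 can be sketched with a random sample. The following Python sketch is illustrative only; the function name `select_result_subset`, the `seed` parameter, and the stand-in document identifiers are assumptions, and the two calls correspond to the fixed-count and 1% embodiments described above.

```python
import random


def select_result_subset(initial_results, size=1000, seed=None):
    """Randomly choose up to `size` entries from the initial result set."""
    rng = random.Random(seed)  # a seed is accepted only to make runs repeatable
    size = min(size, len(initial_results))
    return rng.sample(initial_results, size)


# Stand-in identifiers for 100,000 documents returned by a literal text search.
initial_results = list(range(100_000))

subset_fixed = select_result_subset(initial_results, size=1000, seed=0)
subset_percent = select_result_subset(
    initial_results, size=len(initial_results) // 100, seed=0)
print(len(subset_fixed), len(subset_percent))  # 1000 1000
```

Sampling without replacement keeps every entry in the subset distinct, which is one simple way to approximate the statistical "coverage" of the initial result set.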
FIG. 7 shows one method in accordance with the invention to identify those topics for display (block 615). Initially, all unique topics associated with the result subset are identified (block 700), and those topics that appear in more than a specified fraction of the result subset are removed (block 705). For example, those topics appearing in 80% or more of the segments comprising the result subset may be ignored for the purposes of this analysis. (A percentage higher or lower than this may be selected without altering the salient characteristics of the process.) Next, that topic which appears in the most result subset entries is selected for display (block 710). If more than one topic ties for having the most coverage, one may be selected for display in any manner desired. If, after ignoring those result subset entries associated with the selected topic, there remains more than a specified fraction of the result subset (the “yes” prong of block 715), that topic having the next highest coverage is selected (block 720). This process is repeated until no more than the specified fraction of the result subset remains. In one embodiment, the specified fraction of block 715 is 20%, although a percentage higher or lower than this may be selected without altering the salient characteristics of the process. - If, after ignoring those result subset entries associated with the selected topic(s), there remains less than a specified fraction of the result subset (the “no” prong of block 715), the remaining topics are serialized and duplicate words are eliminated (block 725). That is, topics comprising two or more words are broken apart and treated as single-word topics. Next, the single-word topic that appears in the most result subset entries not already excluded is selected for display (block 730). As before, if more than one topic ties for having the most coverage, one may be selected for display in any manner desired. 
If, after ignoring those result subset entries associated with the selected topic, result subset entries remain un-chosen (the “yes” prong of block 735), that topic having the next highest coverage is selected (block 740). This process is repeated until no result subset entries remain un-chosen. - The topics identified in accordance with
FIG. 7 may be displayed to the user (block 620 in FIG. 6). Thus, data retrieval operations in accordance with the invention return one or more topics which the user may select to pursue or redefine their initial search. Optionally, a specified number of search result entries may be displayed in conjunction with the displayed topics. By selecting one or more of the displayed topics, a user may be presented with those data corresponding to the selected topics. (Topics may, for example, be combined through Boolean “and” and/or “or” operators.) In addition, the user may be presented with another list of topics based on the “new” result set in a manner described above. In summary, search operations in accordance with the invention respond to user queries by presenting a series of likely topics that most closely reflect the subjects to which the initial search query relates. Subsequent selection of a topic by the user, in effect, supplies additional search information which is used to refine the search. -
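The topic selection process of FIG. 7 can be sketched as a two-pass greedy coverage loop. The Python sketch below is one interpretation under stated assumptions, not the patented implementation: the function name `select_display_topics` and its parameters are hypothetical, coverage is recounted over the entries not yet covered, ties are broken alphabetically, and a single word is taken to "appear in" an entry when it is part of one of that entry's topics.

```python
from collections import Counter


def select_display_topics(entry_topics, max_share=0.8, stop_share=0.2):
    """entry_topics: one set of topic strings per result subset entry."""
    total = len(entry_topics)
    counts = Counter(t for topics in entry_topics for t in topics)
    # Block 705: ignore topics found in too large a share of the subset.
    candidates = {t for t, c in counts.items() if c / total < max_share}

    remaining = list(range(total))  # indices of entries not yet covered
    selected = []

    def pick(units_of):
        """Select the unit covering the most remaining entries, then drop them."""
        coverage = Counter(u for i in remaining for u in units_of(i))
        if not coverage:
            return False
        best = max(sorted(coverage), key=coverage.get)  # alphabetical tie-break
        if best not in selected:
            selected.append(best)
        remaining[:] = [i for i in remaining if best not in units_of(i)]
        return True

    def whole_topics(i):
        return entry_topics[i] & candidates

    def single_words(i):
        return {w for t in whole_topics(i) for w in t.split()}

    # Blocks 710-720: whole topics while more than stop_share of entries remain.
    while remaining and len(remaining) / total > stop_share:
        if not pick(whole_topics):
            break
    # Blocks 725-740: single-word topics until no coverable entries remain.
    while remaining:
        if not pick(single_words):
            break
    return selected


# Illustrative result subset of five entries with their associated topics.
entry_topics = [
    {"kidney disease", "renal function"},
    {"kidney disease", "renal function", "blood pressure"},
    {"kidney disease", "blood pressure"},
    {"kidney disease", "renal transplantation"},
    {"kidney disease", "rat kidney"},
]
print(select_display_topics(entry_topics))
# ['blood pressure', 'rat kidney', 'renal function', 'renal']
```

In this sample “kidney disease” is dropped because it appears in every entry, three whole topics are chosen until only 20% of the entries remain, and the last entry is covered by the single word “renal”.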
TABLE 5 Example Query Result
For the data set identified in Tables 1, 3 and 4, a search on the single word “kidney” returns an initial result set comprising 147,549 hits. (That is, 147,549 segments had the word kidney in them.) Of these, 1,000 were chosen as the initial result subset. Using the specified thresholds discussed above, the following topics were represented in the result set: amino acid, dependent presence, amino terminal, kidney transplantation, transcriptional regulation, liver kidney, body weight, rat kidney, filtration fraction, rats treated, heart kidney, renal transplantation, blood pressure, and renal function. Selection of the “renal function” topic identified a total of 6,853 entries divided among the following topics: effects renal, kidney transplantation, renal parenchyma, glomerular filtration, loss renal, blood flow, histological examination, renal artery, creatinine clearance, intensive care, and renal failure. Selection of the “glomerular filtration” topic from this list identified a total of 1,400 entries. Thus, in two steps the number of “hits” through which a person must search was reduced from approximately 148,000 to 1,500, a reduction of nearly two orders of magnitude. - It is noted that retrieval operations in accordance with
FIG. 6 may not be needed for all queries. For example, if a user query includes multiple search words or a quoted phrase that, using literal text-based search techniques, returns a relatively small result set (e.g., 50 hits or fewer), the presentation of this relatively small result set may be made immediately without resort to the topic-based approach of FIG. 6. The size of initial result set that triggers use of a topic-based retrieval operation in accordance with the invention is a matter of design choice. In one embodiment, all initial result sets having more than 50 hits use a method in accordance with FIG. 6. In another embodiment, only initial result sets having more than 200 results trigger use of a method in accordance with FIG. 6. - One of ordinary skill in the art will recognize that various changes in the details of the illustrated operational methods are possible without departing from the scope of the claims. For example, various acts may be performed in a different order from that shown in
FIGS. 1 through 7. In addition, usage statistics other than those disclosed herein may be employed to measure a word's (or a word combination's) actual usage in a targeted corpus of data. Further, query result display methods in accordance with FIGS. 6 and 7 may use selection thresholds other than those disclosed herein. -
FIG. 8 provides a diagram that shows enterprise information sources. An office worker seated at his desk in front of the computer with a need to find information has a dilemma. The diagram illustrates that there are at least four main sources of information: enterprise information, server and PC information, Internet information, and email and attachments. Enterprise information can include data warehouses, multiple databases, and document systems. Server and PC information can include reports, presentations and data generated by the worker or his colleagues. Internet information can include a wealth of information, including business websites and business news. These are a few examples of the types of information that can be searched using the present invention, and are not intended to limit the scope of the invention. - The dilemma facing the office worker is where to find the information. Can the information be found locally in a file? Is it on the department's server, in a file, in an email, or in an attachment to an email? Is it in a corporate database or warehouse or in a document management system? Or finally, is it on the web?
- Information within the enterprise is doubling every five years, and information on the web is doubling every six years. And that is not counting the scores of duplicate emails, attachments, and corporate documents. More and more time is being spent trying to find information, and less of the relevant information is being found. As a result, productivity is negatively affected. The quality of decisions is poorer because of incomplete information, and the risk of negative economic impacts rises.
- The first step in addressing the information dilemma is to provide real-time aggregation of information where the context (e.g. title, to, from, name, product, etc.) is identified and maintained. This must be done without requiring normalization of the data. Or, in other words, the information must be imported “as is” without having to reformat or transform the information into some common form. Examples of methods for aggregating the data are taught in commonly owned U.S. Pat. No. 5,842,213, entitled Method for Modeling, Storing and Transferring Data in Neutral Form, issued Nov. 24, 1998 to Odom et al., and U.S. Pat. No. 6,393,426, also entitled Method for Modeling, Storing and Transferring Data in Neutral Form, issued May 21, 2002 to Odom et al., which are herein incorporated by reference in their entireties. These are provided as example methods of modeling and storing data, and are not intended to limit the scope of the present invention.
- This aggregation addresses the issue of practically pooling diverse information. The second step relates to the search problem, or put another way, finding the needed information - the proverbial needle in the haystack.
- True relevancy is the missing ingredient in search. The industry is looking for ways to produce better results for the user. This is particularly true when the user is searching for specific content as opposed to general information from an omnibus website. The emphasis is on trying to find a way to easily determine which information is relevant to the user.
- One part of understanding which information is relevant to the user is trying to understand the intent of what the user enters for the search. More sophisticated natural language processing (NLP) is required to achieve “intent-based” search. The other part of determining what is relevant to the searcher is to extract that information directly from the person doing the search, effortlessly if possible. Both of these requirements will be resource intensive with current technologies. Search engine vendors already have massive hardware installations. Imagine what a quadrupling of resource requirements would do to the present cost structures, not to mention the resource logistics. Co-pending, commonly owned U.S. patent application Ser. No. 11/194,766, filed on Aug. 2, 2005, which is hereby incorporated herein by reference in its entirety, addresses aspects of this relevancy challenge. The methods provided in that application can be coupled with the methods described herein to further improve the relevancy of search results and topics to be displayed.
- As discussed within the background section, present search and topification algorithms generally assume that topics are relatively static. However, the relevance of topics to a particular search query is not only based on what appears in the content of the query, but the relevance can also be a function of current events. Unfortunately, search engines do not directly factor in time relevancy, and these topics would be mixed in with the tens of thousands of other possible topic results. Thus, a user would not likely receive as relevant search results as would be desired.
- Another shortcoming of current search engines that display topics or search results is that search engines do not display topics associated with every subject matter domain related to a search constraint entered by a user. Rather a search engine may only show search results that are most popular without regard to different subject matter domains that search results may belong to. For users interested in a particular domain, the search results displayed would not be particularly relevant and their specific areas of interest difficult to find. Thus, a user once again may not receive search results relevant to their particular area of interest.
- In a set of embodiments, the present invention addresses these shortcomings of existing search engines and methods. In particular, embodiments of the present invention provide search methods and systems that can efficiently generate search results to identify and display topics by considering, at any given time, the relative significance of a topic based on current events, and that ensure coverage of all subject matter domains associated with a search constraint.
- In each of
methods - As used herein the set of information includes one or more of information located within an enterprise network, information located within a server, information located within a personal computer, information located on the Internet, or information contained within email messages or email attachments.
- Also, as used herein, a data item includes one or more of text documents, graphic documents, audio files, video files, multimedia documents, email messages, email attachments, or Internet web pages.
-
FIG. 9 provides a flowchart of method 900 for displaying topics related to a search constraint entered by a user to request search results that identify data items within a set of information that are related to the search constraint, according to an embodiment of the invention. Method 900 begins in step 910. For use in illustrating the steps in method 900, FIG. 10 will be used. FIG. 10 provides a screen shot of a search engine web site, according to an embodiment of the invention. The screen shot of FIG. 10 is for illustrative purposes, and not intended to limit the scope of the invention. - In step 910 a search constraint is received. For example, referring to
FIG. 10 the search constraint is “Pittsburgh Steelers.” - In step 920 a first preliminary set of topics related to the search constraint is identified. In an embodiment, the first preliminary set of topics is representative of a sample set of general data items. For example, the general data items could include a generic sampling of data items located across the Internet.
- In step 930 a second preliminary set of topics related to the search constraint is identified. In an embodiment, the second preliminary set of topics is representative of a sample set of current event data items. In an embodiment, the sample set of current event data items is gathered by receiving feeds from current event websites, such as CNN.COM, MSN.COM, ESPN.COM and the like. The current event data items are updated periodically. In one embodiment, periodic updates are a function of the subject matter. For example, sports information is updated every thirty minutes, financial information is updated every thirty minutes, health information is updated once a day and other news information is updated every two hours. In one embodiment, the current event data items database contains approximately 20,000 data items.
- In step 940 a set of display topics is identified that is a subset of the first preliminary set of topics and the second preliminary set of topics. In an embodiment, identifying a set of display topics includes selecting a certain number, referred to as the general topic threshold number, of topics from the first preliminary set of topics and selecting a certain number, referred to as the current event topic threshold number, of topics from the second preliminary set of topics. Additionally, in a further embodiment, a certain number, referred to as the proper name topic threshold, of proper names from the second preliminary set of topics is also selected. In one embodiment, the proper names are randomly selected from a set of proper names contained within the second preliminary set of topics.
- In an additional embodiment, a personal interest topic repository can be created. The personal interest topic repository includes topics that have been identified as relevant to a user. These topics, for example, may be topics associated with frequent searches conducted by a user, topics generated based on a personal profile, or topics that a user may have previously selected. When a personal topic repository is available, step 940 can also include selecting a certain number, referred to as the personal interest topic threshold, of topics from the first preliminary set of topics.
- In step 950 the set of display topics identified in step 940 is displayed. The topics may be displayed on a computer terminal, cell phone or other display device. In step 960 method 900 ends.
- In an embodiment, the topic display threshold is twenty topics. Of these twenty topics, six topics are identified from the current event topics, six proper names (which are considered topics) are also taken from the current event topics, and eight topics are identified from the general topics. Of the eight topics from the general topics, two are personal interest topics, when personal interest topics are available. For example, referring back to FIG. 10, the column labeled AUTOTOPICS displays the set of display topics. The topics include, for example, Franco Harris, Pittsburgh Post, and autographed photos.
- FIG. 11 provides a flowchart of method 1100 for displaying topics related to a search constraint entered by a user to request search results that identify data items within a set of information that are related to the search constraint. For use in illustrating the steps in method 1100, FIG. 10 will again be used. The screen shot of FIG. 10 is for illustrative purposes, and not intended to limit the scope of the invention. Method 1100 begins in step 1110.
- In step 1110 a search constraint is received. For example, referring to FIG. 10, the search constraint is “Pittsburgh Steelers.”
- In step 1120 a set of topics related to the search constraint is identified. In an embodiment, identifying a set of topics includes conducting a search to generate search results. The search results include a set of data items. Example searches that can be used include searches using the GOOGLE, YAHOO, MSN, ASK.COM and A9 search engines. Other types of search engines can also be used.
- In another embodiment a search can be conducted on a representative sample of data within the set of information that is of interest. For example, when searching the Internet a representative set of data items from the Internet can be used. In one embodiment the representative set of data items includes about 25 million data items.
- In another embodiment a search can be conducted on data items contained within a current events data item database. As discussed above, in an embodiment, the sample set of current event data items are gathered by receiving feeds from current event websites, such as CNN.COM, MSN.COM, ESPN.COM and the like. The current event data items are updated periodically. In one embodiment periodic updates are a function of the subject matter. For example, sports information is updated every thirty minutes, financial information is updated every thirty minutes, health information is updated once a day and other news information is updated every two hours. In one embodiment the current event data items database contains approximately 20,000 current event data items.
- The set of topics can then be determined from the search results by extracting topics associated with each data item in the search results. For example, the topification methods disclosed in the '026 Patent Application can be used to identify the set of topics from any of the above search results using general data items, representative data items and current event data items. In alternative embodiments, topics can be generated from a combination of these or other source data items.
- Once the topics are identified, in step 1130 each of the topics within the set of topics is ranked. FIG. 13 provides a flowchart of a method 1300 to rank topics into one of four general rankings. In step 1310, the highest ranking is assigned to a topic when the topic is a current topic and a personal interest topic. A topic is a current topic when it is found in the current event topics. A topic is a personal interest topic when it is found in the personal interest topic repository for a particular user.
- In step 1320 the second highest ranking is assigned to a topic within the identified set of topics when the topic is a current topic. In step 1330 the third highest ranking is assigned to a topic when the topic is a personal interest topic. In step 1340 the fourth highest ranking is assigned to a topic when the topic is neither a current topic nor a personal interest topic. Within each level of ranking, topics are further ranked based on their frequency of occurrence within the search result data items. Those topics that occur least frequently among the data items are considered most relevant and given a higher ranking.
- In step 1140 subject matter domains associated with the set of topics are created. FIG. 14 provides a diagram that graphically illustrates this process. Set 1410 represents the complete set of topics found in the data items in the search results. Within set 1410, three subject matter domains are illustrated. These are subject matter domains data item 1450. Associated with data item 1450 will be one or more topics. Data items that have overlapping sets of topics, represented by the shaded area 1460 for subject matter domain 1430, are clustered together to form a subject matter domain. Subject matter domains will have some overlap, as indicated by overlap 1470.
- In an embodiment, the process of clustering includes clustering data items that have overlapping topics, and then creating subject matter domains based on clustering of data items that minimizes the overlap of topics across subject matter areas, such as overlap 1470. Individuals skilled in the relevant arts will be able to apply statistical clustering methods to determine the optimal clustering.
- In step 1150 the most representative topic for each subject matter domain is determined. In an embodiment, the most representative topic is determined by identifying those topics within a subject matter domain that occur in more than some fraction of the distribution of data items (e.g., more than 90% of the data items) within the set of information. The most representative topic is then determined from this set of topics by identifying the topic for each subject matter domain with the highest current event and personal interest topic ranking. As necessary, the frequency of occurrence of the topics can be used to further rank the topics as discussed above.
- In step 1160 the most representative topic for each subject matter domain is displayed. In step 1170 the highest ranked topics not previously displayed are displayed. In step 1180 method 1100 ends.
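The ranking of method 1300 and the selection in step 1150 can be sketched as follows. This is a hedged illustration: the function names and the alphabetical tie-break are assumptions, and the 90% cutoff is read here as a fraction of the data items within one subject matter domain, which is one possible interpretation of the text above.

```python
from collections import Counter


def rank_topics(item_topics, current_topics, personal_topics):
    """Order topics into four tiers, rarest topic first within each tier.

    item_topics -- list of topic sets, one per search result data item
    """
    frequency = Counter(t for topics in item_topics for t in set(topics))

    def tier(topic):
        is_current = topic in current_topics
        is_personal = topic in personal_topics
        if is_current and is_personal:
            return 0  # step 1310: highest ranking
        if is_current:
            return 1  # step 1320: second highest
        if is_personal:
            return 2  # step 1330: third highest
        return 3      # step 1340: fourth highest

    # Assumption: remaining ties are broken alphabetically.
    return sorted(frequency, key=lambda t: (tier(t), frequency[t], t))


def most_representative(domain_items, ranked, min_fraction=0.9):
    """Return the best-ranked topic found in more than min_fraction of a domain's items."""
    counts = Counter(t for topics in domain_items for t in set(topics))
    cutoff = min_fraction * len(domain_items)
    candidates = {t for t, n in counts.items() if n > cutoff}
    for topic in ranked:
        if topic in candidates:
            return topic
    return None  # no topic is frequent enough in this domain
```

Sorting on a `(tier, frequency, name)` tuple keeps the four rankings and the least-frequent-first rule in a single pass.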
- FIG. 12 provides a flowchart of method 1200 for displaying topics related to a search constraint entered by a user to request search results that identify data items within a set of information that are related to the search constraint. For use in illustrating the steps in method 1200, FIG. 10 will again be used. The screen shot of FIG. 10 is for illustrative purposes, and not intended to limit the scope of the invention. Method 1200 begins in step 1210.
- In step 1210 a search constraint is received. For example, referring to FIG. 10, the search constraint is “Pittsburgh Steelers.”
- In step 1220 a set of topics related to the search constraint is identified. In an embodiment, identifying a set of topics includes conducting a search to generate search results. The search results include a set of data items. Example searches that can be used include searches using the GOOGLE, YAHOO, MSN, ASK.COM and A9 search engines. Other types of search engines can also be used.
- In another embodiment a search can be conducted on a representative sample of data within the set of information that is of interest. For example, when searching the Internet a representative set of data items from the Internet can be used. In one embodiment the representative set of data items includes about 25 million data items.
- In another embodiment a search can be conducted on data items contained within a current event data item database. As discussed above, in an embodiment, the sample set of current event data items are gathered by receiving feeds from current event websites, such as CNN.COM, MSN.COM, ESPN.COM and the like. The current event data items are updated periodically. In one embodiment periodic updates are a function of the subject matter. For example, sports information is updated every thirty minutes, financial information is updated every thirty minutes, health information is updated once a day and other news information is updated every two hours. In one embodiment the current event data items database contains approximately 20,000 current event data items.
- The set of topics can then be determined from the search results by extracting topics associated with each data item in the search results. For example, the topification methods disclosed in the '026 patent application can be used to identify the set of topics from any of the above search results using general data items, representative data items and current event data items. In alternative embodiments, topics can be generated from a combination of these or other source data items.
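Both methods group data items into subject matter domains according to overlapping topics, as illustrated for FIG. 14. The text leaves the choice of statistical clustering method to the practitioner, so the sketch below substitutes a simple greedy pass based on Jaccard similarity. The function names, the dictionary layout, and the 0.5 overlap threshold are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Overlap between two topic sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_items(item_topics, min_overlap=0.5):
    """Greedily group data items whose topic sets overlap.

    item_topics -- dict mapping a data item id to its set of topics
    Returns a list of domains, each a dict with "items" and "topics".
    """
    domains = []
    for item_id, topics in item_topics.items():
        best, best_score = None, 0.0
        for domain in domains:
            score = jaccard(topics, domain["topics"])
            if score > best_score:
                best, best_score = domain, score
        if best is not None and best_score >= min_overlap:
            # Join the closest existing domain and widen its topic set.
            best["items"].append(item_id)
            best["topics"] |= topics
        else:
            # Nothing overlaps enough, so this item starts a new domain.
            domains.append({"items": [item_id], "topics": set(topics)})
    return domains
```

A greedy pass depends on the order of the data items and does not minimize overlap across domains; it is meant only to make the idea of clustering on shared topics concrete.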
- In step 1230 subject matter domains associated with the set of topics are created. As discussed above, FIG. 14 provides a diagram that graphically illustrates this process. Set 1410 represents the complete set of topics found in the data items in the search results. Within set 1410, three subject matter domains are illustrated. These are subject matter domains data item 1450. Associated with data item 1450 will be one or more topics. Data items that have overlapping sets of topics, represented by the shaded area 1460 for subject matter domain 1430, are clustered together to form a subject matter domain. Subject matter domains will have some overlap, as indicated by overlap 1470.
- In an embodiment, the process of clustering includes clustering data items that have overlapping topics, and then creating subject matter domains based on clustering of data items that minimizes the overlap of topics across subject matter areas, such as overlap 1470. Individuals skilled in the relevant arts will be able to apply statistical clustering methods to determine the optimal clustering.
- In step 1240 the most representative topic for each subject matter domain is determined. In an embodiment, the most representative topic is determined by identifying those topics within a subject matter domain that occur in more than some fraction of the distribution of data items (e.g., more than 90% of the data items) within the set of information. The most representative topic is then determined from this set of topics by identifying the topic for each subject matter domain that has the least frequent number of occurrences in the search result data items.
- In step 1250 the most representative topic for each subject matter domain is displayed. In step 1250 method 1200 ends. In alternative embodiments, the set of topics identified that are related to the search constraint can be ranked as was done in step 1130 in method 1100. Based on these rankings, additional topics can be displayed as was done in step 1170 in method 1100.
- Referring to FIG. 15, acts in accordance with any, or a portion of any, of FIGS. 1 through 14 may be performed by a programmable control device executing instructions organized into one or more program modules 1500. A programmable control device can include, but is not limited to, a personal computer, a laptop computer, a network computer, a wireless telephone, a personal data assistant (“PDA”) and the like. In one embodiment, the programmable control device comprises computer system 1505 that includes central processing unit 1510, storage 1515, network interface card 1520 for coupling computer system 1505 to network 1525, display unit 1530, keyboard 1535 and mouse 1540. In addition to a single processor system shown in FIG. 15, a programmable control device may be a multiprocessor computer system or a custom designed state machine.
- Custom designed state machines may be embodied in a hardware device such as a printed circuit board comprising discrete logic, integrated circuits, or specially designed Application Specific Integrated Circuits (ASICs). Storage devices, such as device 1515, suitable for tangibly embodying program module(s) 1500 include all forms of non-volatile memory including, but not limited to: semiconductor memory devices such as Electrically Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and flash devices; magnetic disks (fixed, floppy, and removable); other magnetic media such as tape; and optical media such as CD-ROM disks.
- Exemplary embodiments of the present invention have been presented. The invention is not limited to these examples. These examples are presented herein for purposes of illustration, and not limitation. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the invention.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/764,792 US20100262603A1 (en) | 2002-02-26 | 2010-04-21 | Search engine methods and systems for displaying relevant topics |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/086,026 US7340466B2 (en) | 2002-02-26 | 2002-02-26 | Topic identification and use thereof in information retrieval systems |
US77757606P | 2006-03-01 | 2006-03-01 | |
US11/712,557 US7716207B2 (en) | 2002-02-26 | 2007-03-01 | Search engine methods and systems for displaying relevant topics |
US12/764,792 US20100262603A1 (en) | 2002-02-26 | 2010-04-21 | Search engine methods and systems for displaying relevant topics |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/712,557 Continuation US7716207B2 (en) | 2002-02-26 | 2007-03-01 | Search engine methods and systems for displaying relevant topics |
Publications (1)
Publication Number | Publication Date |
---|---|
US20100262603A1 true US20100262603A1 (en) | 2010-10-14 |
Family
ID=46327414
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/712,557 Expired - Fee Related US7716207B2 (en) | 2002-02-26 | 2007-03-01 | Search engine methods and systems for displaying relevant topics |
US12/764,792 Abandoned US20100262603A1 (en) | 2002-02-26 | 2010-04-21 | Search engine methods and systems for displaying relevant topics |
Country Status (1)
Country | Link |
---|---|
US (2) | US7716207B2 (en) |
Citations (86)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4580218A (en) * | 1983-09-08 | 1986-04-01 | At&T Bell Laboratories | Indexing subject-locating method |
US5265065A (en) * | 1991-10-08 | 1993-11-23 | West Publishing Company | Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query |
US5278980A (en) * | 1991-08-16 | 1994-01-11 | Xerox Corporation | Iterative technique for phrase query formation and an information retrieval system employing same |
US5490061A (en) * | 1987-02-05 | 1996-02-06 | Toltran, Ltd. | Improved translation system utilizing a morphological stripping process to reduce words to their root configuration to produce reduction of database size |
US5625748A (en) * | 1994-04-18 | 1997-04-29 | Bbn Corporation | Topic discriminator using posterior probability or confidence scores |
US5745776A (en) * | 1995-04-19 | 1998-04-28 | Sheppard, Ii; Charles Bradford | Enhanced electronic dictionary |
US5842206A (en) * | 1996-08-20 | 1998-11-24 | Iconovex Corporation | Computerized method and system for qualified searching of electronically stored documents |
US5920854A (en) * | 1996-08-14 | 1999-07-06 | Infoseek Corporation | Real-time document collection search engine with phrase indexing |
US5924105A (en) * | 1997-01-27 | 1999-07-13 | Michigan State University | Method and product for determining salient features for use in information searching |
US5937422A (en) * | 1997-04-15 | 1999-08-10 | The United States Of America As Represented By The National Security Agency | Automatically generating a topic description for text and searching and sorting text by topic using the same |
US5940821A (en) * | 1997-05-21 | 1999-08-17 | Oracle Corporation | Information presentation in a knowledge base search and retrieval system |
US5960385A (en) * | 1995-06-30 | 1999-09-28 | The Research Foundation Of The State University Of New York | Sentence reconstruction using word ambiguity resolution |
US5987454A (en) * | 1997-06-09 | 1999-11-16 | Hobbs; Allen | Method and apparatus for selectively augmenting retrieved text, numbers, maps, charts, still pictures and/or graphics, moving pictures and/or graphics and audio information from a network resource |
US5987460A (en) * | 1996-07-05 | 1999-11-16 | Hitachi, Ltd. | Document retrieval-assisting method and system for the same and document retrieval service using the same with document frequency and term frequency |
US6038560A (en) * | 1997-05-21 | 2000-03-14 | Oracle Corporation | Concept knowledge base search and retrieval system |
US6070133A (en) * | 1997-07-21 | 2000-05-30 | Battelle Memorial Institute | Information retrieval system utilizing wavelet transform |
US6085187A (en) * | 1997-11-24 | 2000-07-04 | International Business Machines Corporation | Method and apparatus for navigating multiple inheritance concept hierarchies |
US6115718A (en) * | 1998-04-01 | 2000-09-05 | Xerox Corporation | Method and apparatus for predicting document access in a collection of linked documents featuring link proprabilities and spreading activation |
US6125362A (en) * | 1996-12-04 | 2000-09-26 | Canon Kabushiki Kaisha | Data processing method and apparatus for identifying classification to which data belongs |
US6212532B1 (en) * | 1998-10-22 | 2001-04-03 | International Business Machines Corporation | Text categorization toolkit |
US6226792B1 (en) * | 1998-10-14 | 2001-05-01 | Unisys Corporation | Object management system supporting the use of application domain knowledge mapped to technology domain knowledge |
US6233575B1 (en) * | 1997-06-24 | 2001-05-15 | International Business Machines Corporation | Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values |
US6236958B1 (en) * | 1997-06-27 | 2001-05-22 | International Business Machines Corporation | Method and system for extracting pairs of multilingual terminology from an aligned multilingual text |
US6363378B1 (en) * | 1998-10-13 | 2002-03-26 | Oracle Corporation | Ranking of query feedback terms in an information retrieval system |
US6363374B1 (en) * | 1998-12-31 | 2002-03-26 | Microsoft Corporation | Text proximity filtering in search systems using same sentence restrictions |
US20020046018A1 (en) * | 2000-05-11 | 2002-04-18 | Daniel Marcu | Discourse parsing and summarization |
US6421675B1 (en) * | 1998-03-16 | 2002-07-16 | S. L. I. Systems, Inc. | Search engine |
US20020099730A1 (en) * | 2000-05-12 | 2002-07-25 | Applied Psychology Research Limited | Automatic text classification system |
US20020099700A1 (en) * | 1999-12-14 | 2002-07-25 | Wen-Syan Li | Focused search engine and method |
US20020103788A1 (en) * | 2000-08-08 | 2002-08-01 | Donaldson Thomas E. | Filtering search results |
US20020103799A1 (en) * | 2000-12-06 | 2002-08-01 | Science Applications International Corp. | Method for document comparison and selection |
US6446061B1 (en) * | 1998-07-31 | 2002-09-03 | International Business Machines Corporation | Taxonomy generation for document collections |
US20020123994A1 (en) * | 2000-04-26 | 2002-09-05 | Yves Schabes | System for fulfilling an information need using extended matching techniques |
US6460034B1 (en) * | 1997-05-21 | 2002-10-01 | Oracle Corporation | Document knowledge base research and retrieval system |
US6473730B1 (en) * | 1999-04-12 | 2002-10-29 | The Trustees Of Columbia University In The City Of New York | Method and system for topical segmentation, segment significance and segment function |
US6505151B1 (en) * | 2000-03-15 | 2003-01-07 | Bridgewell Inc. | Method for dividing sentences into phrases using entropy calculations of word combinations based on adjacent words |
US20030018659A1 (en) * | 2001-03-14 | 2003-01-23 | Lingomotors, Inc. | Category-based selections in an information access environment |
US6529902B1 (en) * | 1999-11-08 | 2003-03-04 | International Business Machines Corporation | Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling |
US6542889B1 (en) * | 2000-01-28 | 2003-04-01 | International Business Machines Corporation | Methods and apparatus for similarity text search based on conceptual indexing |
US6556987B1 (en) * | 2000-05-12 | 2003-04-29 | Applied Psychology Research, Ltd. | Automatic text classification system |
US20030097375A1 (en) * | 1996-09-13 | 2003-05-22 | Pennock Kelly A. | System for information discovery |
US6594654B1 (en) * | 2000-03-03 | 2003-07-15 | Aly A. Salam | Systems and methods for continuously accumulating research information via a computer network |
US6606659B1 (en) * | 2000-01-28 | 2003-08-12 | Websense, Inc. | System and method for controlling access to internet sites |
US20030154071A1 (en) * | 2002-02-11 | 2003-08-14 | Shreve Gregory M. | Process for the document management and computer-assisted translation of documents utilizing document corpora constructed by intelligent agents |
US20030167252A1 (en) * | 2002-02-26 | 2003-09-04 | Pliant Technologies, Inc. | Topic identification and use thereof in information retrieval systems |
US20030212669A1 (en) * | 2002-05-07 | 2003-11-13 | Aatish Dedhia | System and method for context based searching of electronic catalog database, aided with graphical feedback to the user |
US20030220913A1 (en) * | 2002-05-24 | 2003-11-27 | International Business Machines Corporation | Techniques for personalized and adaptive search services |
US6665661B1 (en) * | 2000-09-29 | 2003-12-16 | Battelle Memorial Institute | System and method for use in text analysis of documents and records |
US6678694B1 (en) * | 2000-11-08 | 2004-01-13 | Frank Meik | Indexed, extensible, interactive document retrieval system |
US20040024752A1 (en) * | 2002-08-05 | 2004-02-05 | Yahoo! Inc. | Method and apparatus for search ranking using human input and automated ranking |
US20040024739A1 (en) * | 1999-06-15 | 2004-02-05 | Kanisa Inc. | System and method for implementing a knowledge management system |
US20040024583A1 (en) * | 2000-03-20 | 2004-02-05 | Freeman Robert J | Natural-language processing system using a large corpus |
US20040049541A1 (en) * | 2002-09-10 | 2004-03-11 | Swahn Alan Earl | Information retrieval and display system |
US6708162B1 (en) * | 2000-05-08 | 2004-03-16 | Microsoft Corporation | Method and system for unifying search strategy and sharing search output data across multiple program modules |
US20040093327A1 (en) * | 2002-09-24 | 2004-05-13 | Darrell Anderson | Serving advertisements based on content |
US6751611B2 (en) * | 2002-03-01 | 2004-06-15 | Paul Jeffrey Krupin | Method and system for creating improved search queries |
US20040128267A1 (en) * | 2000-05-17 | 2004-07-01 | Gideon Berger | Method and system for data classification in the presence of a temporal non-stationarity |
US6775677B1 (en) * | 2000-03-02 | 2004-08-10 | International Business Machines Corporation | System, method, and program product for identifying and describing topics in a collection of electronic documents |
US20040186828A1 (en) * | 2002-12-24 | 2004-09-23 | Prem Yadav | Systems and methods for enabling a user to find information of interest to the user |
US20040199375A1 (en) * | 1999-05-28 | 2004-10-07 | Farzad Ehsani | Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface |
US20050060286A1 (en) * | 2003-09-15 | 2005-03-17 | Microsoft Corporation | Free text search within a relational database |
US20050091211A1 (en) * | 1998-10-06 | 2005-04-28 | Crystal Reference Systems Limited | Apparatus for classifying or disambiguating data |
US20050114198A1 (en) * | 2003-11-24 | 2005-05-26 | Ross Koningstein | Using concepts for ad targeting |
US20050131758A1 (en) * | 2003-12-11 | 2005-06-16 | Desikan Pavan K. | Systems and methods detecting for providing advertisements in a communications network |
US20050154617A1 (en) * | 2000-09-30 | 2005-07-14 | Tom Ruggieri | System and method for providing global information on risks and related hedging strategies |
US20050160082A1 (en) * | 2004-01-16 | 2005-07-21 | The Regents Of The University Of California | System and method of context-specific searching in an electronic database |
US20050165782A1 (en) * | 2003-12-02 | 2005-07-28 | Sony Corporation | Information processing apparatus, information processing method, program for implementing information processing method, information processing system, and method for information processing system |
US20050171761A1 (en) * | 2001-01-31 | 2005-08-04 | Microsoft Corporation | Disambiguation language model |
US6941513B2 (en) * | 2000-06-15 | 2005-09-06 | Cognisphere, Inc. | System and method for text structuring and text generation |
US20050222987A1 (en) * | 2004-04-02 | 2005-10-06 | Vadon Eric R | Automated detection of associations between search criteria and item categories based on collective analysis of user activity data |
US20050240580A1 (en) * | 2003-09-30 | 2005-10-27 | Zamir Oren E | Personalization of placed content ordering in search results |
US20060004732A1 (en) * | 2002-02-26 | 2006-01-05 | Odom Paul S | Search engine methods and systems for generating relevant search results and advertisements |
US20060026013A1 (en) * | 2004-07-29 | 2006-02-02 | Yahoo! Inc. | Search systems and methods using in-line contextual queries |
US7003513B2 (en) * | 2000-07-04 | 2006-02-21 | International Business Machines Corporation | Method and system of weighted context feedback for result improvement in information retrieval |
US20060129381A1 (en) * | 1998-06-04 | 2006-06-15 | Yumi Wakita | Language transference rule producing apparatus, language transferring apparatus method, and program recording medium |
US20060167857A1 (en) * | 2004-07-29 | 2006-07-27 | Yahoo! Inc. | Systems and methods for contextual transaction proposals |
US20060184354A1 (en) * | 2000-06-01 | 2006-08-17 | Microsoft Corporation | Creating a language model for a language processing system |
US20070038603A1 (en) * | 2005-08-10 | 2007-02-15 | Guha Ramanathan V | Sharing context data across programmable search engines |
US20070038614A1 (en) * | 2005-08-10 | 2007-02-15 | Guha Ramanathan V | Generating and presenting advertisements based on context data for programmable search engines |
US20070078822A1 (en) * | 2005-09-30 | 2007-04-05 | Microsoft Corporation | Arbitration of specialized content using search results |
US7219105B2 (en) * | 2003-09-17 | 2007-05-15 | International Business Machines Corporation | Method, system and computer program product for profiling entities |
US20070250501A1 (en) * | 2005-09-27 | 2007-10-25 | Grubb Michael L | Search result delivery engine |
US20080021860A1 (en) * | 2006-07-21 | 2008-01-24 | Aol Llc | Culturally relevant search results |
US7496567B1 (en) * | 2004-10-01 | 2009-02-24 | Terril John Steichen | System and method for document categorization |
US7548910B1 (en) * | 2004-01-30 | 2009-06-16 | The Regents Of The University Of California | System and method for retrieving scenario-specific documents |
US7613690B2 (en) * | 2005-10-21 | 2009-11-03 | Aol Llc | Real time query trends with multi-document summarization |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63231221A (en) * | 1987-03-19 | 1988-09-27 | Mitsubishi Electric Corp | Measuring instrument for intake air quantity of engine |
JPH07294215A (en) * | 1994-04-25 | 1995-11-10 | Canon Inc | Method and apparatus for processing image |
- 2007-03-01: US application US11/712,557, granted as US7716207B2 (not active; Expired - Fee Related)
- 2010-04-21: US application US12/764,792, published as US20100262603A1 (not active; Abandoned)
Patent Citations (93)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4580218A (en) * | 1983-09-08 | 1986-04-01 | At&T Bell Laboratories | Indexing subject-locating method |
US5490061A (en) * | 1987-02-05 | 1996-02-06 | Toltran, Ltd. | Improved translation system utilizing a morphological stripping process to reduce words to their root configuration to produce reduction of database size |
US5278980A (en) * | 1991-08-16 | 1994-01-11 | Xerox Corporation | Iterative technique for phrase query formation and an information retrieval system employing same |
US5265065A (en) * | 1991-10-08 | 1993-11-23 | West Publishing Company | Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query |
US5625748A (en) * | 1994-04-18 | 1997-04-29 | Bbn Corporation | Topic discriminator using posterior probability or confidence scores |
US5745776A (en) * | 1995-04-19 | 1998-04-28 | Sheppard, Ii; Charles Bradford | Enhanced electronic dictionary |
US5960385A (en) * | 1995-06-30 | 1999-09-28 | The Research Foundation Of The State University Of New York | Sentence reconstruction using word ambiguity resolution |
US5987460A (en) * | 1996-07-05 | 1999-11-16 | Hitachi, Ltd. | Document retrieval-assisting method and system for the same and document retrieval service using the same with document frequency and term frequency |
US5920854A (en) * | 1996-08-14 | 1999-07-06 | Infoseek Corporation | Real-time document collection search engine with phrase indexing |
US5842206A (en) * | 1996-08-20 | 1998-11-24 | Iconovex Corporation | Computerized method and system for qualified searching of electronically stored documents |
US6772170B2 (en) * | 1996-09-13 | 2004-08-03 | Battelle Memorial Institute | System and method for interpreting document contents |
US20030097375A1 (en) * | 1996-09-13 | 2003-05-22 | Pennock Kelly A. | System for information discovery |
US6125362A (en) * | 1996-12-04 | 2000-09-26 | Canon Kabushiki Kaisha | Data processing method and apparatus for identifying classification to which data belongs |
US5924105A (en) * | 1997-01-27 | 1999-07-13 | Michigan State University | Method and product for determining salient features for use in information searching |
US5937422A (en) * | 1997-04-15 | 1999-08-10 | The United States Of America As Represented By The National Security Agency | Automatically generating a topic description for text and searching and sorting text by topic using the same |
US5940821A (en) * | 1997-05-21 | 1999-08-17 | Oracle Corporation | Information presentation in a knowledge base search and retrieval system |
US6460034B1 (en) * | 1997-05-21 | 2002-10-01 | Oracle Corporation | Document knowledge base research and retrieval system |
US6038560A (en) * | 1997-05-21 | 2000-03-14 | Oracle Corporation | Concept knowledge base search and retrieval system |
US5987454A (en) * | 1997-06-09 | 1999-11-16 | Hobbs; Allen | Method and apparatus for selectively augmenting retrieved text, numbers, maps, charts, still pictures and/or graphics, moving pictures and/or graphics and audio information from a network resource |
US6233575B1 (en) * | 1997-06-24 | 2001-05-15 | International Business Machines Corporation | Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values |
US6236958B1 (en) * | 1997-06-27 | 2001-05-22 | International Business Machines Corporation | Method and system for extracting pairs of multilingual terminology from an aligned multilingual text |
US6070133A (en) * | 1997-07-21 | 2000-05-30 | Battelle Memorial Institute | Information retrieval system utilizing wavelet transform |
US6085187A (en) * | 1997-11-24 | 2000-07-04 | International Business Machines Corporation | Method and apparatus for navigating multiple inheritance concept hierarchies |
US6421675B1 (en) * | 1998-03-16 | 2002-07-16 | S. L. I. Systems, Inc. | Search engine |
US6115718A (en) * | 1998-04-01 | 2000-09-05 | Xerox Corporation | Method and apparatus for predicting document access in a collection of linked documents featuring link proprabilities and spreading activation |
US20060129381A1 (en) * | 1998-06-04 | 2006-06-15 | Yumi Wakita | Language transference rule producing apparatus, language transferring apparatus method, and program recording medium |
US7321850B2 (en) * | 1998-06-04 | 2008-01-22 | Matsushita Electric Industrial Co., Ltd. | Language transference rule producing apparatus, language transferring apparatus method, and program recording medium |
US6446061B1 (en) * | 1998-07-31 | 2002-09-03 | International Business Machines Corporation | Taxonomy generation for document collections |
US20050091211A1 (en) * | 1998-10-06 | 2005-04-28 | Crystal Reference Systems Limited | Apparatus for classifying or disambiguating data |
US7305415B2 (en) * | 1998-10-06 | 2007-12-04 | Crystal Reference Systems Limited | Apparatus for classifying or disambiguating data |
US6363378B1 (en) * | 1998-10-13 | 2002-03-26 | Oracle Corporation | Ranking of query feedback terms in an information retrieval system |
US6226792B1 (en) * | 1998-10-14 | 2001-05-01 | Unisys Corporation | Object management system supporting the use of application domain knowledge mapped to technology domain knowledge |
US6212532B1 (en) * | 1998-10-22 | 2001-04-03 | International Business Machines Corporation | Text categorization toolkit |
US6363374B1 (en) * | 1998-12-31 | 2002-03-26 | Microsoft Corporation | Text proximity filtering in search systems using same sentence restrictions |
US6473730B1 (en) * | 1999-04-12 | 2002-10-29 | The Trustees Of Columbia University In The City Of New York | Method and system for topical segmentation, segment significance and segment function |
US20040199375A1 (en) * | 1999-05-28 | 2004-10-07 | Farzad Ehsani | Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface |
US20040024739A1 (en) * | 1999-06-15 | 2004-02-05 | Kanisa Inc. | System and method for implementing a knowledge management system |
US6529902B1 (en) * | 1999-11-08 | 2003-03-04 | International Business Machines Corporation | Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling |
US20020099700A1 (en) * | 1999-12-14 | 2002-07-25 | Wen-Syan Li | Focused search engine and method |
US6606659B1 (en) * | 2000-01-28 | 2003-08-12 | Websense, Inc. | System and method for controlling access to internet sites |
US6542889B1 (en) * | 2000-01-28 | 2003-04-01 | International Business Machines Corporation | Methods and apparatus for similarity text search based on conceptual indexing |
US6775677B1 (en) * | 2000-03-02 | 2004-08-10 | International Business Machines Corporation | System, method, and program product for identifying and describing topics in a collection of electronic documents |
US6594654B1 (en) * | 2000-03-03 | 2003-07-15 | Aly A. Salam | Systems and methods for continuously accumulating research information via a computer network |
US6505151B1 (en) * | 2000-03-15 | 2003-01-07 | Bridgewell Inc. | Method for dividing sentences into phrases using entropy calculations of word combinations based on adjacent words |
US20040024583A1 (en) * | 2000-03-20 | 2004-02-05 | Freeman Robert J | Natural-language processing system using a large corpus |
US20020123994A1 (en) * | 2000-04-26 | 2002-09-05 | Yves Schabes | System for fulfilling an information need using extended matching techniques |
US6708162B1 (en) * | 2000-05-08 | 2004-03-16 | Microsoft Corporation | Method and system for unifying search strategy and sharing search output data across multiple program modules |
US20020046018A1 (en) * | 2000-05-11 | 2002-04-18 | Daniel Marcu | Discourse parsing and summarization |
US6556987B1 (en) * | 2000-05-12 | 2003-04-29 | Applied Psychology Research, Ltd. | Automatic text classification system |
US20020099730A1 (en) * | 2000-05-12 | 2002-07-25 | Applied Psychology Research Limited | Automatic text classification system |
US20040128267A1 (en) * | 2000-05-17 | 2004-07-01 | Gideon Berger | Method and system for data classification in the presence of a temporal non-stationarity |
US20060184354A1 (en) * | 2000-06-01 | 2006-08-17 | Microsoft Corporation | Creating a language model for a language processing system |
US7286978B2 (en) * | 2000-06-01 | 2007-10-23 | Microsoft Corporation | Creating a language model for a language processing system |
US6941513B2 (en) * | 2000-06-15 | 2005-09-06 | Cognisphere, Inc. | System and method for text structuring and text generation |
US7003513B2 (en) * | 2000-07-04 | 2006-02-21 | International Business Machines Corporation | Method and system of weighted context feedback for result improvement in information retrieval |
US20020103788A1 (en) * | 2000-08-08 | 2002-08-01 | Donaldson Thomas E. | Filtering search results |
US6665661B1 (en) * | 2000-09-29 | 2003-12-16 | Battelle Memorial Institute | System and method for use in text analysis of documents and records |
US20050154617A1 (en) * | 2000-09-30 | 2005-07-14 | Tom Ruggieri | System and method for providing global information on risks and related hedging strategies |
US6678694B1 (en) * | 2000-11-08 | 2004-01-13 | Frank Meik | Indexed, extensible, interactive document retrieval system |
US7113943B2 (en) * | 2000-12-06 | 2006-09-26 | Content Analyst Company, Llc | Method for document comparison and selection |
US20020103799A1 (en) * | 2000-12-06 | 2002-08-01 | Science Applications International Corp. | Method for document comparison and selection |
US20050171761A1 (en) * | 2001-01-31 | 2005-08-04 | Microsoft Corporation | Disambiguation language model |
US7251600B2 (en) * | 2001-01-31 | 2007-07-31 | Microsoft Corporation | Disambiguation language model |
US20030018659A1 (en) * | 2001-03-14 | 2003-01-23 | Lingomotors, Inc. | Category-based selections in an information access environment |
US20030154071A1 (en) * | 2002-02-11 | 2003-08-14 | Shreve Gregory M. | Process for the document management and computer-assisted translation of documents utilizing document corpora constructed by intelligent agents |
US7340466B2 (en) * | 2002-02-26 | 2008-03-04 | Kang Jo Mgmt. Limited Liability Company | Topic identification and use thereof in information retrieval systems |
US20030167252A1 (en) * | 2002-02-26 | 2003-09-04 | Pliant Technologies, Inc. | Topic identification and use thereof in information retrieval systems |
US20060004732A1 (en) * | 2002-02-26 | 2006-01-05 | Odom Paul S | Search engine methods and systems for generating relevant search results and advertisements |
US6751611B2 (en) * | 2002-03-01 | 2004-06-15 | Paul Jeffrey Krupin | Method and system for creating improved search queries |
US20030212669A1 (en) * | 2002-05-07 | 2003-11-13 | Aatish Dedhia | System and method for context based searching of electronic catalog database, aided with graphical feedback to the user |
US20030220913A1 (en) * | 2002-05-24 | 2003-11-27 | International Business Machines Corporation | Techniques for personalized and adaptive search services |
US20040024752A1 (en) * | 2002-08-05 | 2004-02-05 | Yahoo! Inc. | Method and apparatus for search ranking using human input and automated ranking |
US20040049541A1 (en) * | 2002-09-10 | 2004-03-11 | Swahn Alan Earl | Information retrieval and display system |
US20040093327A1 (en) * | 2002-09-24 | 2004-05-13 | Darrell Anderson | Serving advertisements based on content |
US20040186828A1 (en) * | 2002-12-24 | 2004-09-23 | Prem Yadav | Systems and methods for enabling a user to find information of interest to the user |
US20050060286A1 (en) * | 2003-09-15 | 2005-03-17 | Microsoft Corporation | Free text search within a relational database |
US7219105B2 (en) * | 2003-09-17 | 2007-05-15 | International Business Machines Corporation | Method, system and computer program product for profiling entities |
US20050240580A1 (en) * | 2003-09-30 | 2005-10-27 | Zamir Oren E | Personalization of placed content ordering in search results |
US20050114198A1 (en) * | 2003-11-24 | 2005-05-26 | Ross Koningstein | Using concepts for ad targeting |
US20050165782A1 (en) * | 2003-12-02 | 2005-07-28 | Sony Corporation | Information processing apparatus, information processing method, program for implementing information processing method, information processing system, and method for information processing system |
US20050131758A1 (en) * | 2003-12-11 | 2005-06-16 | Desikan Pavan K. | Systems and methods detecting for providing advertisements in a communications network |
US20050160082A1 (en) * | 2004-01-16 | 2005-07-21 | The Regents Of The University Of California | System and method of context-specific searching in an electronic database |
US7548910B1 (en) * | 2004-01-30 | 2009-06-16 | The Regents Of The University Of California | System and method for retrieving scenario-specific documents |
US20050222987A1 (en) * | 2004-04-02 | 2005-10-06 | Vadon Eric R | Automated detection of associations between search criteria and item categories based on collective analysis of user activity data |
US20060167857A1 (en) * | 2004-07-29 | 2006-07-27 | Yahoo! Inc. | Systems and methods for contextual transaction proposals |
US20060026013A1 (en) * | 2004-07-29 | 2006-02-02 | Yahoo! Inc. | Search systems and methods using in-line contextual queries |
US7496567B1 (en) * | 2004-10-01 | 2009-02-24 | Terril John Steichen | System and method for document categorization |
US20070038614A1 (en) * | 2005-08-10 | 2007-02-15 | Guha Ramanathan V | Generating and presenting advertisements based on context data for programmable search engines |
US20070038603A1 (en) * | 2005-08-10 | 2007-02-15 | Guha Ramanathan V | Sharing context data across programmable search engines |
US20070250501A1 (en) * | 2005-09-27 | 2007-10-25 | Grubb Michael L | Search result delivery engine |
US20070078822A1 (en) * | 2005-09-30 | 2007-04-05 | Microsoft Corporation | Arbitration of specialized content using search results |
US7613690B2 (en) * | 2005-10-21 | 2009-11-03 | Aol Llc | Real time query trends with multi-document summarization |
US20080021860A1 (en) * | 2006-07-21 | 2008-01-24 | Aol Llc | Culturally relevant search results |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10713440B2 (en) | 2007-01-04 | 2020-07-14 | Children's Hospital Medical Center | Processing text with domain-specific spreading activation methods |
US20110145227A1 (en) * | 2009-12-16 | 2011-06-16 | Microsoft Corporation | Determining preferences from user queries |
US8612472B2 (en) * | 2009-12-16 | 2013-12-17 | Microsoft Corporation | Determining preferences from user queries |
US9043350B2 (en) * | 2011-09-22 | 2015-05-26 | Microsoft Technology Licensing, Llc | Providing topic based search guidance |
US20130080460A1 (en) * | 2011-09-22 | 2013-03-28 | Microsoft Corporation | Providing topic based search guidance |
US11746377B2 (en) | 2011-11-30 | 2023-09-05 | Children's Hospital Medical Center | Personalized pain management and anesthesia: preemptive risk identification and therapeutic decision support |
US11597978B2 (en) | 2011-11-30 | 2023-03-07 | Children's Hospital Medical Center | Personalized pain management and anesthesia: preemptive risk identification and therapeutic decision support |
WO2015017731A1 (en) * | 2013-08-01 | 2015-02-05 | Children's Hospital Medical Center | Identification of surgery candidates using natural language processing |
US10878939B2 (en) | 2014-02-24 | 2020-12-29 | Children's Hospital Medical Center | Methods and compositions for personalized pain management |
US10422004B2 (en) | 2014-08-08 | 2019-09-24 | Children's Hospital Medical Center | Diagnostic method for distinguishing forms of esophageal eosinophilia |
US11564905B2 (en) | 2016-01-13 | 2023-01-31 | Children's Hospital Medical Center | Compositions and methods for treating allergic inflammatory conditions |
US10997249B2 (en) * | 2016-09-26 | 2021-05-04 | International Business Machines Corporation | Search query intent |
US10296659B2 (en) * | 2016-09-26 | 2019-05-21 | International Business Machines Corporation | Search query intent |
US11618924B2 (en) | 2017-01-20 | 2023-04-04 | Children's Hospital Medical Center | Methods and compositions relating to OPRM1 DNA methylation for personalized pain management |
US11859250B1 (en) | 2018-02-23 | 2024-01-02 | Children's Hospital Medical Center | Methods for treating eosinophilic esophagitis |
US11314794B2 (en) | 2018-12-14 | 2022-04-26 | Industrial Technology Research Institute | System and method for adaptively adjusting related search words |
TWI681304B (en) * | 2018-12-14 | 2020-01-01 | 財團法人工業技術研究院 | System and method for adaptively adjusting related search words |
Also Published As
Publication number | Publication date |
---|---|
US7716207B2 (en) | 2010-05-11 |
US20070265996A1 (en) | 2007-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7716207B2 (en) | | Search engine methods and systems for displaying relevant topics |
US20060004732A1 (en) | | Search engine methods and systems for generating relevant search results and advertisements |
US7340466B2 (en) | | Topic identification and use thereof in information retrieval systems |
US7634466B2 (en) | | Realtime indexing and search in large, rapidly changing document collections |
Tsatsaronis et al. | | Bioasq: A challenge on large-scale biomedical semantic indexing and question answering |
US7617199B2 (en) | | Characterizing context-sensitive search results as non-spam |
Kalashnikov et al. | | Web people search via connection analysis |
US20110191335A1 (en) | | Method and system for conducting legal research using clustering analytics |
Wolfram | | Search characteristics in different types of Web-based IR environments: Are they the same? |
Muller | | Comparing tagging vocabularies among four enterprise tag-based services |
Price et al. | | Using semantic components to search for domain-specific documents: An evaluation from the system perspective and the user perspective |
Spangler et al. | | Simple: Interactive analytics on patent data |
Liu et al. | | Detecting promotion campaigns in query auto completion |
WO2007103096A2 (en) | | Search engine methods and systems for displaying relevant topics |
EP1836555A2 (en) | | Search engine methods and systems for generating relevant search results and advertisements |
Briscoe et al. | | Intelligent information access from scientific papers |
Heenan | | A Review of Academic Research on Information Retrieval |
Jagan et al. | | A Query Recommendation System for Efficient Biomedical Information Retrieval |
Wang et al. | | Mining web search behaviors: Strategies and techniques for data modeling and analysis |
Lin et al. | | Personalized optimal search in local query expansion |
Gedam et al. | | Study of Existing Method of Finding Aliases and Improve Method of Finding Aliases from Web [J] |
Hersh | | Terms, Models, and Resources |
Parnto | | Suffix stripping, improving information retrieval efficiency |
Nandi et al. | | HAMSTER: Human Assisted Mapping of Schema & Taxonomies to Enhance Relevance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: KANG JO MGMT. LIMITED LIABILITY COMPANY, DELAWARE; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: SCIENTIGO, INC.; REEL/FRAME: 026000/0438; Effective date: 20070510 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
| AS | Assignment | Owner name: INTELLECTUAL VENTURES ASSETS 151 LLC, DELAWARE; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: RATEZE REMOTE MGMT. L.L.C.; REEL/FRAME: 050915/0741; Effective date: 20191031 |
| AS | Assignment | Owner name: DATACLOUD TECHNOLOGIES, LLC, GEORGIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: INTELLECTUAL VENTURES ASSETS 151 LLC; REEL/FRAME: 051463/0934; Effective date: 20191115 |