WO2001035281A1 - Content engine - Google Patents

Publication number
WO2001035281A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
electronic content
electronic
filter elements
category
Prior art date
Application number
PCT/US2000/031016
Other languages
French (fr)
Inventor
Alan S. Ellman
Brian C. Mcguinty
James P. Vinett
Original Assignee
Screamingmedia Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Screamingmedia Inc.
Priority to AU14842/01A
Publication of WO2001035281A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • the present invention relates to the filtering, categorization, and delivery of electronic content to any user of the internet, and more particularly, to an automated process whereby content can be read and understood in relation to a client defined filter for a content topic.
  • a system and method of processing electronic content involves storing filter elements associated with a content category and categorizing electronic content in the content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content.
  • the filter elements may be a word, a phrase and/or a citation and may be part of a Boolean filter which expresses a relationship between the filter elements.
  • the system and method receives the electronic content or a plurality of electronic content in a data stream from a content provider, across a network (e.g., Internet), and categorizes each electronic content in any one of a plurality of content categories having associated therewith corresponding filter elements.
  • the system and method retrieves configuration information defining attributes of a data stream in which the electronic content is received and normalizes the electronic content according to the configuration information.
  • the normalization of the received content may involve sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information and performing interpolations within a payload of a discrete electronic content unit including the electronic content.
  • interpolations may involve stripping out contact information, expanding at least one of unique and common abbreviations, tagging at least one of paragraphs or tables, and converting control characters outside an ASCII range to human readable text.
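  • the normalization of the electronic content is preferably performed by a generic, non-data stream or content provider specific program.
  • the system and method indexes the received electronic content in a persistent data store.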
  • the system and method hashes at least the body content of the electronic content, and creates a searchable vector of nodes.
  • the vector of nodes may be searched to determine the appearance of any filter elements associated with the content category in the content body of the electronic content as well as to determine a relationship between appearing filter elements in the content body of the electronic content.
  • the system and method categorizes electronic content by determining a first value for the electronic content based on whether any filter elements appear in the content body of the electronic content, determining a second value for the electronic content depending on whether any references associated with the content category are cited in the content body of the electronic content, and comparing a threshold value associated with the content category to a third value based on the first and second values to determine whether to assign the electronic content to the content category.
  • the first and second values may be weighted differently in determining whether to categorize electronic content in a content category.
  • the filter elements such as the Boolean filter elements and citation entries, may be weighted differently in determining whether to categorize electronic content in a content category.
  • a content category may be associated with a client.
  • the system and method provides the electronic content to the client if the electronic content is categorized in that content category.
  • the electronic content may be delivered to a content management system of the client, across a network.
  • the system and method may maintain the electronic content for the client, and deliver the electronic content directly to a user requesting the electronic content from the client across a network.
  • the system and method provides an address corresponding to a location of the electronic content to the client to enable access to the electronic content.
  • Fig. 1 is a system overview of an electronic content distribution system
  • Fig. 2 is a schematic block diagram of the central server of Fig. 1 including a content engine
  • Fig. 3 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, categorizes electronic content
  • Fig. 4 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, normalizes electronic content
  • Fig. 5 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, analyzes a content body as well as traditional index fields of the electronic content to categorize the electronic content.
  • electronic content distribution system 100 includes a central server 110, a plurality of content servers 120, a plurality of client servers 130 and a user computer 140, all of which are connected across network backbone 105.
  • Network backbone 105 may include an internet backbone, an intranet backbone or any other conventional network backbone or a combination thereof.
  • Content server 120 may be a conventional server which includes conventional computer hardware and functionality.
  • Content server 120 may be associated with a content provider, such as a publisher (e.g., a magazine publisher, book publisher, etc.), a news agency, or any distributor or provider of electronic content.
  • Electronic content may correspond to any publications (e.g., a news or magazine article), reports, technical papers and so forth.
  • Electronic content may include a content body including text and/or images with associated meta-data as well as traditional index fields generally provided in a header or trailer section of the electronic content. These traditional index fields are typically determined and inserted by human editors.
  • Client server 130 may also be a conventional server which includes conventional computer hardware and functionality and a content management system 135 for managing the storage of and the access to electronic content, for example, associated with a client operated web site accessible to a user of user computer 140.
  • a client may operate a web site, via client server 130, which provides access to electronic content and which is accessible by the user of user computer 140 through the use of a browser program 145, over the internet.
  • Client server 130 may be associated with any operator of a web site, for example, a business (e.g., etailer), an individual, and so forth.
  • Central server 110 may be a conventional server which includes conventional computer hardware and functionality.
  • Central server 110 may be operated by or associated with a vendor which provides electronic content to clients according to their needs, e.g., according to the type of content or content category desired or defined by a client.
  • Central server 110 is configured to receive electronic content 125 from any of a plurality of content providers 120 across network backbone 105.
  • Central server 110 categorizes electronic content 125 in a content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content through the use of content engine 115.
  • Central server 110 then provides the electronic content 125 to a client associated with the content category, e.g., to deliver the electronic content to a client server 130 of the client or to maintain the electronic content for the client.
  • electronic content may be categorized to a level of granularity to satisfy the expanding needs of content clients while minimizing or eliminating reliance on human editors or predefined categorizations in the categorization process.
  • Fig. 2 is a schematic block diagram illustrating the components of central server 110 of Fig. 1.
  • Conventional computer components are included, such as a processor 200, user input devices 205 (e.g., keyboard, mouse, etc.) for receiving user inputs, network interface 210 for interconnection to content servers 120 and client servers 130, RAM 215, ROM 220, display 225 and storage device 230.
  • Storage device 230 stores content engine 115, persistent object store 240, client configuration files 245 and citation library 250.
  • Processor 200, in combination with content engine 115, is configured to categorize electronic content 125 in a content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content.
  • operational examples of content engine 115 are described in detail below with reference to Figs. 3-5.
  • Persistent object store 240 is a dynamic object-oriented database which maintains a plurality of stores of information, such as a store of electronic content vended or provided to clients, a store of client information associated with distribution of electronic content to clients, and so forth.
  • Each client configuration file 245 is a data file that defines a client content category or topic. This definition is preferably constructed from a series of filter elements, such as words and phrases, joined by Boolean operators, following the fundamental principles of mathematical order of operations.
  • also specified in each client configuration file 245 are a subsection(s) of persistent object store 240 which content engine 115 is to examine in the performance of the filtering, categorization and distribution of electronic content to clients, executable programs that are invoked in the event of a filter match by content engine 115, a name of a disk cached index file to which content engine 115 should output filter match results, syntax of the disk cached index file output, result sort criteria, persistent store search criteria, persistent store maintenance information, topical information, and a filter threshold specification.
  • Client configuration files 245 may generally contain any other information for a plurality of clients associated with the provision and distribution of electronic content to the clients.
  • Citation library 250 maintains citation entries or lists for a plurality of content categories.
  • Citation library 250 is preferably a dynamic data store that can be manipulated through human intervention or in an automated fashion by the electronic content units passing through content engine 115.
  • while various databases in Fig. 2, such as citation library 250, are shown as being separate from persistent object store 240, these databases may be indexed and maintained in persistent object store 240.
  • the databases maintained in storage device 230 may also be distributed across a plurality of storage devices situated at different locations.
  • Fig. 3 is a flowchart illustrating a process 300 by which central server 110, in combination with content engine 115, categorizes electronic content in one embodiment.
  • central server 110 receives electronic content having a content body from any one of a plurality of content servers 120 associated with a content provider.
  • the electronic content may be provided to central server 110 in a data stream from any one of the plurality of content servers 120, across network backbone 105.
  • the electronic content, particularly the content body, may include text and/or images with associated meta-data as well as header and trailer sections including traditional index fields.
  • central server 110 normalizes the electronic content.
  • central server 110 indexes the normalized electronic content into persistent object store 240. This allows the electronic content to be read and examined by content engine 115 in its entirety, e.g., traditional index fields as well as the electronic content payload or body section.
  • central server 110 hashes the electronic content including the content body of the electronic content and creates a vector of searchable match nodes. Each element in the vector preferably has a subsequent node chain that points at the next word in the electronic content unit.
  • central server 110 categorizes the electronic content in a content category based on whether filter elements associated with a content category appear in the content body and traditional index fields.
  • These filter elements may be a word, phrase, citation entries or any information or characteristic which may be identified in a content body of electronic content for the purposes of categorizing the electronic content in a content category.
  • content engine 115 can iterate through the vector of match nodes to identify whether any filter elements associated with a content category appear in the content body and traditional index fields. That is, content engine 115 examines the vector of match nodes to find or identify any filter matches.
  • Content engine 115 can also determine a relationship between those appearing filter elements or filter matches in the content body as well as in the traditional index fields of the electronic content.
  • Content engine 115 may then determine whether the electronic content belongs in the content category based on the filter matches and/or the relationship between the matched filter elements appearing in the content body as well as in the traditional index fields.
  • central server 110 writes, to a disk cached index file for the content category, the information needed to vend or provide the electronic content to a client associated with that content category.
  • the location of the disk cached index file may be specified in client configuration files 245 of Fig. 2.
  • central server 110 provides the electronic content to the client, for example, to content management system 135 of client server 130 of the client, associated with the content category.
  • central server 110 may deliver to client server 130, via network backbone 105, the electronic content in a variety of formats, such as in HTML, ASCII and so forth, preferably in a format desired by the client.
  • This information may be maintained in a client configuration file 245 of Fig. 2 associated with the client.
  • central server 110 may provide the electronic content to the client by maintaining the electronic content locally and delivering the electronic content to a user via a hyperlink on a web site provided or operated by a client server 130 of the client. This provides a simple method of content delivery since client server 130 does not need to hold or manage the electronic content. Central server 110 simply needs to provide the client with data related to a location for accessing the electronic content, which may then be incorporated onto the client's web site. While the above describes the normalization, the categorization and the distribution of electronic content related to one content category, central server 110 may perform the above operations to normalize and categorize any one of a plurality of electronic content in any one of a plurality of content categories and to provide them to a plurality of clients.
  • Fig. 4 is a flowchart illustrating a process 400 by which central server 110, in combination with content engine 115, normalizes electronic content received in a data stream from any one of a plurality of content providers via their content server 120.
  • central server 110 reads local configuration files defining unique features of a data stream to be parsed.
  • the configuration files are maintained at a location accessible by central server 110, for example, persistent object store 240.
  • the local configuration files define the layout of a header section of a discrete electronic content unit in the data stream, unique aspects of the payload of a discrete electronic content unit in the data stream, a trailer of a discrete electronic content unit in the data stream, and unique interpolations that are to take place in the body of a discrete electronic content unit in the data stream.
  • central server 110 isolates beginning and end points of atomic electronic content units.
  • central server 110 sub-parses a header section of a discrete electronic content unit to yield the traditional elements of electronic content categorization, e.g., CATCODE, SELCODE, and so forth.
  • central server 110 similarly sub-parses a trailer section of the atomic electronic content units to yield other traditional elements of electronic content categorization, e.g., CATCODE, SELCODE, and so forth.
  • central server 110 performs unique interpolations within the payload of a discrete electronic content unit as specified in the configuration files.
  • Unique interpolations are functions that are to be performed within the payload of a discrete electronic content unit. These functions may include stripping out contact information, expanding unique and common abbreviations, tagging paragraphs and tables and converting unique control characters outside the ASCII range to human readable text.
  • Fig. 5 is a flowchart illustrating a process 500 by which central server 110, in combination with content engine 115, determines whether electronic content is to be categorized in a content category based on an analysis of a content body as well as traditional index fields of the electronic content.
  • central server 110 examines the vector of match nodes of hashed electronic content against a filter, e.g., a Boolean filter, associated with a content category.
  • central server 110 determines a first score or value for the electronic content based on the filter matches and/or the relationship between the matched filter elements appearing in the content body as well as in the traditional index fields.
  • the first score may be based on how many filter matches and/or a proximity of the filter matching elements in the content body as well as in the traditional index fields.
  • the first score is preferably a number between zero (0) and one (1).
  • central server 110 examines the vector of match nodes of hashed electronic content against citation entries of references associated with the content category. That is, central server 110 checks whether any of the citation entries, e.g., references, have been referred to, referenced in or cited in the content body as well as the traditional index fields of the electronic content. These citation entries are maintained in citation library 250.
  • central server 110 determines a second score for the electronic content based on any appearances of references to a citation entry associated with the content category.
  • different citation entry matches may have different weights associated therewith.
  • the second score is also preferably a number between zero (0) and one (1).
  • central server 110 determines a final score based on the first and second scores.
  • the final score may be the first score multiplied by the second score.
  • the final score is preferably a number between zero (0) and one (1).
  • these first and second scores may also be weighted differently in the determination of the final score. These weights may be preset according to the client or determined after an initial or preliminary examination through the hashed content of the electronic content based on the appearance or non-appearance of filter elements of the Boolean filter or the citation entries associated with the content category. For example, a greater weight may be given to the score in which more filter matches occurred in the initial examination.
  • central server 110 determines whether the final score is less than a threshold score for the content category.
  • This threshold score may be maintained in the client configuration files 245 of a client associated with the content category and is also preferably a number between zero (0) and one (1). If the final score is not less than the threshold score, then central server 110 assigns the electronic content to the content category.
  • the electronic content may thereafter be provided to the client associated with the category.
  • central server 110 may perform the above operations to categorize any one of a plurality of electronic content in any one of a plurality of content categories associated with a plurality of clients.
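
The final scoring steps of Fig. 5 listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the final score is taken as the plain product of the first and second scores (the text says it "may be" that product), and all three inputs are assumed to lie between 0 and 1 as the text prefers.

```python
def final_score(first_score, second_score):
    """Combine the filter-match score and the citation score (assumed: product)."""
    return first_score * second_score


def belongs_to_category(first_score, second_score, threshold):
    """Assign the content when the final score is not less than the threshold."""
    for value in (first_score, second_score, threshold):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and threshold are expected in [0, 1]")
    return not (final_score(first_score, second_score) < threshold)


if __name__ == "__main__":
    print(belongs_to_category(0.9, 0.5, 0.4))  # 0.45 is not less than 0.4
    print(belongs_to_category(0.5, 0.5, 0.4))  # 0.25 is less than 0.4
```

Because the product of two numbers in [0, 1] is never larger than either factor, a low citation score alone can keep content out of a category in this reading; the weighted variants mentioned in the text would relax that.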

Abstract

A system and method of processing electronic content (100) involves categorizing electronic content (125) in a content category (125) based on whether any filter elements (115), associated with the content category (125), appear in a content body of the electronic content (125).

Description

CONTENT ENGINE
BACKGROUND OF THE INVENTION
The present invention relates to the filtering, categorization, and delivery of electronic content to any user of the internet, and more particularly, to an automated process whereby content can be read and understood in relation to a client defined filter for a content topic.
To date, the filtering of electronic content has been done by reading, parsing, and isolating various fields and values from the electronic content header. These are fields that are determined and inserted by human editors. Current systems for automated categorization of electronic content are completely reliant on these header fields and are incapable of categorizing electronic content with a finer level of granularity than is provided by these human editors. One problem with this approach is that electronic content users are limited to these predefined categories and have no facility for defining their own fields, or redefining currently existing categorical fields. In short, existing methods of electronic content categorization do not provide the breadth of categories or granularity needed to fully satisfy the expanding needs of content clients. Accordingly, there is a need to refine this categorization process so that it is no longer reliant on human editors or predefined categorizations.
SUMMARY OF THE INVENTION
A system and method of processing electronic content involves storing filter elements associated with a content category and categorizing electronic content in the content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content. The filter elements may be a word, a phrase and/or a citation and may be part of a Boolean filter which expresses a relationship between the filter elements. In one embodiment, the system and method receives the electronic content or a plurality of electronic content in a data stream from a content provider, across a network (e.g., Internet), and categorizes each electronic content in any one of a plurality of content categories having associated therewith corresponding filter elements.
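
As a minimal sketch of the simplest case described above, the following assumes each content category stores a plain list of word or phrase filter elements and that content falls in a category when at least one element appears in the content body. The function names, the whole-word matching rule and the sample categories are illustrative assumptions, not details taken from the specification.

```python
import re


def normalize(text):
    """Lower-case the text and reduce it to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def matching_elements(content_body, filter_elements):
    """Return the filter elements that appear as whole words or phrases."""
    padded = f" {normalize(content_body)} "
    return [e for e in filter_elements if f" {normalize(e)} " in padded]


def categorize(content_body, categories):
    """Return every category with at least one filter element in the body."""
    return [
        name
        for name, elements in categories.items()
        if matching_elements(content_body, elements)
    ]


if __name__ == "__main__":
    body = "The Federal Reserve raised interest rates on Tuesday."
    categories = {
        "finance": ["interest rates", "bond"],
        "sports": ["goal", "rate"],
    }
    print(categorize(body, categories))
```

Matching on whole words keeps the element "rate" from firing on "rates"; a real filter might instead apply stemming, which the text does not discuss.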
In another embodiment, the system and method retrieves configuration information defining attributes of a data stream in which the electronic content is received and normalizes the electronic content according to the configuration information. The normalization of the received content may involve sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information and performing interpolations within a payload of a discrete electronic content unit including the electronic content.
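
The sub-parsing step can be illustrated with a short sketch. It assumes, purely for illustration, that the configuration information names one marker that ends the header and one that begins the trailer, and that header and trailer fields are "NAME: value" lines; `StreamConfig`, `sub_parse` and the marker strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamConfig:
    """Hypothetical per-stream configuration describing the unit layout."""
    header_end: str            # marker that terminates the header section
    trailer_start: str         # marker that begins the trailer section
    field_separator: str = ":"


def parse_fields(section, separator):
    """Read 'NAME<sep>value' lines of a header or trailer into a dict."""
    fields = {}
    for line in section.splitlines():
        if separator in line:
            key, value = line.split(separator, 1)
            fields[key.strip()] = value.strip()
    return fields


def sub_parse(unit, config):
    """Split one discrete content unit into header, payload and trailer."""
    header, _, rest = unit.partition(config.header_end)
    payload, _, trailer = rest.partition(config.trailer_start)
    return {
        "header": parse_fields(header, config.field_separator),
        "payload": payload.strip(),
        "trailer": parse_fields(trailer, config.field_separator),
    }


if __name__ == "__main__":
    unit = (
        "CATCODE: f\nSELCODE: fin01\n--BODY--\n"
        "Rates rose today.\n--END--\nSOURCE: wire\n"
    )
    print(sub_parse(unit, StreamConfig("--BODY--", "--END--")))
```

Keeping the layout in a configuration object rather than in code is what lets one generic parser serve many provider streams.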
These interpolations may involve stripping out contact information, expanding at least one of unique and common abbreviations, tagging at least one of paragraphs or tables, and converting control characters outside an ASCII range to human readable text. The normalization of the electronic content is preferably performed by a generic, non-data stream or content provider specific program.
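
The interpolations named above might be chained as in the following sketch. Every rule here is an assumption chosen for illustration: the abbreviation table, the pattern used to recognize contact lines, the `<p>` tagging, and the choice to render non-ASCII characters as a bracketed code point are not specified in the text.

```python
import re

# Hypothetical abbreviation table; a real one would come from configuration.
ABBREVIATIONS = {"Corp.": "Corporation", "Inc.": "Incorporated"}

# Assumed shape of a contact line: starts with a contact-style label.
CONTACT_LINE = re.compile(
    r"^[ \t]*(contact|tel|phone|e-?mail)\b.*$", re.IGNORECASE | re.MULTILINE
)


def strip_contact_info(text):
    """Blank out lines that look like contact details."""
    return CONTACT_LINE.sub("", text)


def expand_abbreviations(text, table):
    """Replace each known abbreviation with its expansion."""
    for short, full in table.items():
        text = text.replace(short, full)
    return text


def replace_non_ascii(text):
    """Render characters outside the ASCII range as a readable code point."""
    return "".join(ch if ord(ch) < 128 else f"[U+{ord(ch):04X}]" for ch in text)


def tag_paragraphs(text):
    """Wrap blank-line-separated paragraphs in <p> tags."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


def normalize_payload(text, abbreviations=ABBREVIATIONS):
    """Apply the interpolations in a fixed order."""
    text = strip_contact_info(text)
    text = expand_abbreviations(text, abbreviations)
    text = replace_non_ascii(text)
    return tag_paragraphs(text)


if __name__ == "__main__":
    payload = (
        "Acme Corp. posted gains.\n\n"
        "Contact: Jane Doe 555-0100\n\n"
        "Shares rose 5\u00a0percent."
    )
    print(normalize_payload(payload))
```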
In a further embodiment, the system and method indexes the received electronic content in a persistent data store. In yet a further embodiment, the system and method hashes at least the body content of the electronic content, and creates a searchable vector of nodes. The vector of nodes may be searched to determine the appearance of any filter elements associated with the content category in the content body of the electronic content as well as to determine a relationship between appearing filter elements in the content body of the electronic content.
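
One way to picture the hashed, searchable vector of nodes is sketched below. It assumes a Python dictionary as the hash, one node per word occurrence, and a `next` link from each node to the following word so that phrases can be confirmed by walking the chain. The names `MatchNode`, `build_index` and `find_phrase` are hypothetical, and the specification does not prescribe this particular layout.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MatchNode:
    """One word occurrence, chained to the next word in the content."""
    word: str
    position: int
    next: Optional["MatchNode"] = field(default=None, repr=False, compare=False)


def build_index(content_body: str) -> Dict[str, List[MatchNode]]:
    """Hash every word of the body to the list of nodes where it occurs."""
    words = re.findall(r"[a-z0-9]+", content_body.lower())
    nodes = [MatchNode(word, i) for i, word in enumerate(words)]
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    index: Dict[str, List[MatchNode]] = {}
    for node in nodes:
        index.setdefault(node.word, []).append(node)
    return index


def find_phrase(index: Dict[str, List[MatchNode]], phrase: str) -> List[int]:
    """Return the word positions at which the phrase starts."""
    target = re.findall(r"[a-z0-9]+", phrase.lower())
    if not target:
        return []
    positions = []
    for start in index.get(target[0], []):
        node = start
        for word in target:
            if node is None or node.word != word:
                break
            node = node.next
        else:
            positions.append(start.position)
    return positions


if __name__ == "__main__":
    idx = build_index(
        "Interest rates rose as the central bank raised interest rates again."
    )
    print(find_phrase(idx, "interest rates"))
    print(find_phrase(idx, "central bank"))
```

The stored positions also give a simple handle on the "relationship" between matches, for example the distance in words between two matched elements.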
In another embodiment, the system and method categorizes electronic content by determining a first value for the electronic content based on whether any filter elements appear in the content body of the electronic content, determining a second value for the electronic content depending on whether any references associated with the content category are cited in the content body of the electronic content, and comparing a threshold value associated with the content category to a third value based on the first and second values to determine whether to assign the electronic content to the content category. The first and second values may be weighted differently in determining whether to categorize electronic content in a content category.
The filter elements, such as the Boolean filter elements and citation entries, may be weighted differently in determining whether to categorize electronic content in a content category.
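
Differently weighted filter elements could be scored as in this sketch, which assumes the score is the matched weight divided by the total weight so that it stays between 0 and 1. Both that formula and the case-insensitive substring test are assumptions made for illustration; the text only states that elements may carry different weights.

```python
def weighted_match_score(content_body, weighted_elements):
    """Score content as matched weight over total weight (assumed formula).

    weighted_elements maps each filter element or citation entry to a
    non-negative weight.
    """
    text = content_body.lower()
    total = sum(weighted_elements.values())
    if total <= 0:
        return 0.0
    matched = sum(
        weight
        for element, weight in weighted_elements.items()
        if element.lower() in text
    )
    return matched / total


if __name__ == "__main__":
    citations = {"Wall Street Journal": 3.0, "Reuters": 1.0}
    print(weighted_match_score("According to Reuters, markets fell.", citations))
```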
In a further embodiment, a content category may be associated with a client. The system and method provides the electronic content to the client if the electronic content is categorized in that content category. The electronic content may be delivered to a content management system of the client, across a network. Alternatively, the system and method may maintain the electronic content for the client, and deliver the electronic content directly to a user requesting the electronic content from the client across a network. The system and method provides an address corresponding to a location of the electronic content to the client to enable access to the electronic content. Other and further aspects of the present invention will become apparent during the course of the following description and by reference to the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a system overview of an electronic content distribution system;
Fig. 2 is a schematic block diagram of the central server of Fig. 1 including a content engine;
Fig. 3 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, categorizes electronic content;
Fig. 4 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, normalizes electronic content; and
Fig. 5 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, analyzes a content body as well as traditional index fields of the electronic content to categorize the electronic content.
DETAILED DESCRIPTION OF THE SEVERAL EMBODIMENTS OF THE PRESENT INVENTION
With reference to the Figures, several embodiments of the present invention will now be shown and described. Referring to Fig. 1, electronic content distribution system 100 includes a central server 110, a plurality of content servers 120, a plurality of client servers 130 and a user computer 140, all of which are connected across network backbone 105. Network backbone 105 may include an internet backbone, an intranet backbone or any other conventional network backbone or a combination thereof. Content server 120 may be a conventional server which includes conventional computer hardware and functionality. Content server 120 may be associated with a content provider, such as a publisher (e.g., a magazine publisher, book publisher, etc.), a news agency, or any distributor or provider of electronic content. Electronic content may correspond to any publications (e.g., a news or magazine article), reports, technical papers and so forth. Electronic content may include a content body including text and/or images with associated meta-data as well as traditional index fields generally provided in a header or trailer section of the electronic content. These traditional index fields are typically determined and inserted by human editors.
Client server 130 may also be a conventional server which includes conventional computer hardware and functionality and a content management system 135 for managing the storage of and the access to electronic content, for example, associated with a client operated web site accessible to a user of user computer 140. For example, a client may operate a web site, via client server 130, which provides access to electronic content and which is accessible by the user of user computer 140 through the use of a browser program 145, over the internet. Client server 130 may be associated with any operator of a web site, for example, a business (e.g., etailer), an individual, and so forth. Central server 110 may be a conventional server which includes conventional computer hardware and functionality. Central server 110 may be operated by or associated with a vendor which provides electronic content to clients according to their needs, e.g., according to the type of content or content category desired or defined by a client.
Central server 110 is configured to receive electronic content 125 from any of a plurality of content providers 120 across network backbone 105. Central server 110 categorizes electronic content 125 in a content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content through the use of content engine 115. Central server 110 then provides the electronic content 125 to a client associated with the content category, e.g., to deliver the electronic content to a client server 130 of the client or to maintain the electronic content for the client.
In this way, electronic content may be categorized to a level of granularity to satisfy the expanding needs of content clients while minimizing or eliminating reliance on human editors or predefined categorizations in the categorization process.
Fig. 2 is a schematic block diagram illustrating the components of central server 110 of Fig. 1. Conventional computer components are included, such as a processor 200, user input devices 205, e.g., keyboard, mouse, etc., for receiving user inputs, network interface 210 for interconnection to content servers 120 and client servers 130, RAM 215, ROM 220, display 225 and storage device 230. Storage device 230 stores content engine 115, persistent object store 240, client configuration files 245 and citation library 250.
Processor 200, in combination with content engine 115, is configured to categorize electronic content 125 in a content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content. Operational examples of content engine 115 are described in detail below with reference to Figs. 3-5.
Persistent object store 240 is a dynamic object-oriented database which maintains a plurality of stores of information, such as a store of electronic content vended or provided to clients, a store of client information associated with distribution of electronic content to clients, and so forth.
Each client configuration file 245 is a data file that defines a client content category or topic. This definition is preferably constructed from a series of filter elements, such as words and phrases, joined by Boolean operators, following the fundamental principles of mathematical order of operations.
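By way of illustration only, a category definition of this kind could be sketched as below. The nested-tuple representation, the function names `appears` and `evaluate`, and the example filter are assumptions made for this sketch; the specification does not prescribe a syntax for the definition.

```python
import re

def appears(element: str, text: str) -> bool:
    """Return True if the word or phrase occurs in text as whole words (case-insensitive)."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in element.split()) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def evaluate(node, text: str) -> bool:
    """Evaluate a filter tree: a string is a filter element,
    a tuple is (operator, operand, ...)."""
    if isinstance(node, str):
        return appears(node, text)
    op, *args = node
    if op == "AND":
        return all(evaluate(a, text) for a in args)
    if op == "OR":
        return any(evaluate(a, text) for a in args)
    if op == "NOT":
        return not evaluate(args[0], text)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical definition: ("interest rate" OR "bond") AND NOT "james bond"
bond_filter = ("AND", ("OR", "interest rate", "bond"), ("NOT", "james bond"))
```

Nesting the tuples makes the order of operations explicit, so no separate precedence rules are needed in this sketch.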
Each client configuration file 245 also specifies: the subsection(s) of persistent object store 240 which content engine 115 is to examine in the performance of the filtering, categorization and distribution of electronic content to clients; executable programs that are invoked in the event of a filter match by content engine 115; the name of a disk cached index file to which content engine 115 should output filter match results; the syntax of the disk cached index file output; result sort criteria; persistent store search criteria; persistent store maintenance information; topical information; and a filter threshold specification.
Client configuration files 245 may generally contain any other information, for a plurality of clients, associated with the provision and distribution of electronic content to the clients.
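As a rough sketch of how such a configuration might be held in memory, the record below mirrors a subset of the items listed above. The class name `ClientConfig`, every field name, and the default values are hypothetical; the specification does not define a file layout.

```python
from dataclasses import dataclass, field

@dataclass
class ClientConfig:
    """Hypothetical in-memory form of one client configuration file."""
    category: str                      # client content category or topic
    filter_definition: str             # Boolean filter over words and phrases
    store_subsections: list[str] = field(default_factory=list)   # parts of the object store to examine
    on_match_programs: list[str] = field(default_factory=list)   # executables invoked on a filter match
    index_file: str = ""               # disk cached index file for match results
    sort_by: str = "score"             # result sort criterion
    threshold: float = 0.5             # filter threshold, assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        # Reject thresholds outside the assumed zero-to-one range.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
```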
Citation library 250 maintains citation entries or lists for a plurality of content categories. Citation library 250 is preferably a dynamic data store that can be manipulated through human intervention or in an automated fashion by the electronic content units passing through content engine 115.
While various databases in Fig. 2, such as citation library 250, are shown as being separate from persistent object store 240, these databases may be indexed and maintained in persistent object store 240. The databases maintained in storage device 230 may also be distributed across a plurality of storage devices situated at different locations.
While central server 110 is shown as a single server unit, the functionality of central server 110 may be distributed across a plurality of servers and network devices.
Fig. 3 is a flowchart illustrating a process 300 by which central server 110, in combination with content engine 115, categorizes electronic content in one embodiment.
At step 310, central server 110 receives electronic content having a content body from any one of a plurality of content servers 120 associated with a content provider. The electronic content may be provided to central server 110 in a data stream from any one of the plurality of content servers 120, across network backbone 105. The electronic content, particularly the content body, may include text and/or images with associated meta-data as well as header and trailer sections including traditional index fields.
At step 320, central server 110 normalizes the electronic content. This enables electronic content from multiple sources to be examined uniformly, so that a diversity of electronic content formats can be "viewed" in a single manner.
At step 330, central server 110 indexes the normalized electronic content into persistent object store 240. This allows the electronic content to be read and examined by content engine 115 in its entirety, e.g., traditional index fields as well as the electronic content payload or body section.
At step 340, central server 110 hashes the electronic content, including the content body of the electronic content, and creates a vector of searchable match nodes. Each element in the vector preferably has a subsequent node chain that points at the next word in the electronic content unit.
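One possible reading of step 340 is sketched below, assuming that "hashing" means keying each word into a hash table and that each match node links to the node for the following word. The names `MatchNode`, `build_match_index` and `find_phrase`, and the tokenization rule, are illustrative assumptions rather than details taken from the specification.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class MatchNode:
    word: str
    position: int
    next: Optional["MatchNode"] = None  # node chain to the following word

def build_match_index(text: str) -> dict[str, list[MatchNode]]:
    """Hash each word to the nodes where it occurs and link nodes in reading order."""
    index: dict[str, list[MatchNode]] = defaultdict(list)
    previous = None
    for position, word in enumerate(re.findall(r"[a-z0-9']+", text.lower())):
        node = MatchNode(word, position)
        if previous is not None:
            previous.next = node
        index[word].append(node)
        previous = node
    return index

def find_phrase(index: dict[str, list[MatchNode]], phrase: str) -> list[int]:
    """Return the start positions of a word or phrase by following each node chain."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for start in index.get(words[0], []):
        node = start
        for expected in words[1:]:
            node = node.next
            if node is None or node.word != expected:
                break
        else:
            hits.append(start.position)
    return hits
```

With this layout a single-word filter element is one hash lookup, and a phrase costs one lookup plus a short walk along the chain from each candidate start.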
At step 350, central server 110 categorizes the electronic content in a content category based on whether filter elements associated with a content category appear in the content body and traditional index fields. These filter elements may be a word, a phrase, citation entries or any information or characteristic which may be identified in a content body of electronic content for the purposes of categorizing the electronic content in a content category. For example, content engine 115 can iterate through the vector of match nodes to identify whether any filter elements associated with a content category appear in the content body and traditional index fields. That is, content engine 115 examines the vector of match nodes to find or identify any filter matches. Content engine 115 can also determine a relationship between those appearing filter elements or filter matches in the content body as well as in the traditional index fields of the electronic content. Content engine 115 may then determine whether the electronic content belongs in the content category based on the filter matches and/or the relationship between the matched filter elements appearing in the content body as well as in the traditional index fields.
At step 360, in the event that the electronic content is categorized in the content category, central server 110 writes, to a disk cached index file for the content category, the information needed to vend or provide the electronic content to a client associated with that content category. The location of the disk cached index file may be specified in client configuration files 245 of Fig. 2.
At step 370, central server 110 provides the electronic content to the client, for example, to content management system 135 of client server 130 of the client, associated with the content category.
For example, central server 110 may deliver to client server 130, via network backbone 105, the electronic content in a variety of formats, such as HTML, ASCII and so forth, preferably in a format desired by the client. The desired format may be maintained in a client configuration file 245 of Fig. 2 associated with the client.
Alternatively, central server 110 may provide the electronic content to the client by maintaining the electronic content locally and delivering the electronic content to a user via a hyperlink on a web site provided or operated by a client server 130 of the client. This provides a simple method of content delivery since client server 130 does not need to hold or manage the electronic content. Central server 110 simply needs to provide the client with data related to a location for accessing the electronic content, which may then be incorporated onto the client's web site.
While the above describes the normalization, the categorization and the distribution of electronic content related to one content category, central server 110 may perform the above operations to normalize and categorize any one of a plurality of electronic content in any one of a plurality of content categories and to provide them to a plurality of clients.
Fig. 4 is a flowchart illustrating a process 400 by which central server 110, in combination with content engine 115, normalizes electronic content received in a data stream from any one of a plurality of content providers via their content server 120.
At step 410, central server 110 reads local configuration files defining unique features of a data stream to be parsed. The configuration files are maintained at a location accessible by central server 110, for example, persistent object store 240.
The local configuration files define the layout of a header section of a discrete electronic content unit in the data stream, unique aspects of the payload of a discrete electronic content unit in the data stream, a trailer of a discrete electronic content unit in the data stream, and unique interpolations that are to take place in the body of a discrete electronic content unit in the data stream.
By maintaining these local configuration files, it is possible to employ a generic program which is not specific to any data stream format or any content provider format to normalize the electronic content in the data stream, as discussed in the steps below. As such, a normalization program does not need to be customized to any data stream format or content provider format. This is particularly useful because of the diverse nature of electronic content received from different sources.
At step 420, central server 110 isolates beginning and end points of atomic electronic content units within the data stream.
At step 430, central server 110 sub-parses a header section of a discrete electronic content unit to yield the traditional elements of electronic content categorization, e.g., CATCODE, SELCODE, and so forth.
At step 440, central server 110 similarly sub-parses a trailer section of the atomic electronic content units to yield other traditional elements of electronic content categorization, e.g., CATCODE, SELCODE, and so forth.
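Steps 410 through 440 can be illustrated with a generic, configuration-driven parser. This is a minimal sketch under stated assumptions: the `StreamConfig` fields, the marker strings and the `KEY: value` field syntax are all hypothetical, since the specification leaves the layout of each data stream to the local configuration files.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical per-provider description of a data stream."""
    unit_start: str     # marker that opens a content unit
    unit_end: str       # marker that closes a content unit
    header_end: str     # separates the header from the payload
    trailer_start: str  # separates the payload from the trailer

def parse_fields(section: str) -> dict[str, str]:
    """Sub-parse 'KEY: value' lines into a dictionary of index fields."""
    fields = {}
    for line in section.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def split_units(stream: str, cfg: StreamConfig) -> list[dict]:
    """Isolate each unit, then split it into header fields, payload and trailer fields."""
    units = []
    cursor = 0
    while True:
        start = stream.find(cfg.unit_start, cursor)
        if start == -1:
            break
        end = stream.find(cfg.unit_end, start + len(cfg.unit_start))
        if end == -1:
            break  # incomplete unit at the end of the stream
        raw = stream[start + len(cfg.unit_start):end]
        header, _, rest = raw.partition(cfg.header_end)
        payload, _, trailer = rest.partition(cfg.trailer_start)
        units.append({
            "header": parse_fields(header),
            "payload": payload.strip(),
            "trailer": parse_fields(trailer),
        })
        cursor = end + len(cfg.unit_end)
    return units
```

Because every format-specific detail lives in `StreamConfig`, supporting another provider in this sketch means adding a configuration record, not changing the parser.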
At step 450, central server 110 performs unique interpolations within the payload of a discrete electronic content unit as specified in the configuration files. Unique interpolations, as defined herein, are functions that are to be performed within the payload of a discrete electronic content unit. These functions may include stripping out contact information, expanding unique and common abbreviations, tagging paragraphs and tables, and converting unique control characters outside the ASCII range to human readable text.
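The interpolations of step 450 might be sketched as a table of named functions selected by the configuration. The abbreviation table, the control-character map, the `CONTACT:` line convention and the `<p>` tags are assumptions chosen for this example only, and the string replacement used for abbreviations is deliberately naive.

```python
import re

ABBREVIATIONS = {"Corp.": "Corporation", "Q3": "third quarter"}  # hypothetical table
CONTROL_MAP = {"\x92": "'", "\x97": "-"}  # hypothetical characters outside the ASCII range

def strip_contact_info(text: str) -> str:
    """Drop lines that begin with a contact label (assumed convention)."""
    kept = [ln for ln in text.splitlines()
            if not re.match(r"\s*CONTACT:", ln, re.IGNORECASE)]
    return "\n".join(kept)

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expansion (plain substring replace)."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text

def convert_control_chars(text: str) -> str:
    """Map selected control characters to human readable text."""
    return text.translate(str.maketrans(CONTROL_MAP))

def tag_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated paragraph in <p> tags."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)

INTERPOLATIONS = {
    "strip_contact_info": strip_contact_info,
    "expand_abbreviations": expand_abbreviations,
    "convert_control_chars": convert_control_chars,
    "tag_paragraphs": tag_paragraphs,
}

def interpolate(payload: str, names: list[str]) -> str:
    """Apply the interpolations named in the configuration, in order."""
    for name in names:
        payload = INTERPOLATIONS[name](payload)
    return payload
```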
Fig. 5 is a flowchart illustrating a process 500 by which central server 110, in combination with content engine 115, determines whether electronic content is to be categorized in a content category based on an analysis of a content body as well as traditional index fields of the electronic content.
At step 510, central server 110 examines the vector of match nodes of hashed electronic content against a filter, e.g., a Boolean filter, associated with a content category.
At step 515, central server 110 then determines a first score or value for the electronic content based on the filter matches and/or the relationship between the matched filter elements appearing in the content body as well as in the traditional index fields. For example, the first score may be based on the number of filter matches and/or the proximity of the matching filter elements in the content body as well as in the traditional index fields.
In determining the first score, different weights may also be applied for different filter element matches and/or filter match relations. The first score is preferably a number between zero (0) and one (1).
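One simple way to obtain a weighted first score between zero and one is the weighted fraction of filter elements that appear in the content. The formula, the function name `first_score` and the use of case-insensitive substring matching are assumptions of this sketch; the specification does not state how the score is computed, and proximity is not modelled here.

```python
def first_score(text: str, weighted_elements: dict[str, float]) -> float:
    """Weighted fraction of filter elements found in text, in the range [0, 1]."""
    total = sum(weighted_elements.values())
    if total <= 0:
        return 0.0
    lowered = text.lower()
    matched = sum(weight for element, weight in weighted_elements.items()
                  if element.lower() in lowered)
    return matched / total
```

For instance, with hypothetical weights of 2.0 for "interest rate" and 1.0 each for "bond" and "yield", a story mentioning only the first two would score 3.0 / 4.0 = 0.75.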
At step 520, central server 110 examines the vector of match nodes of hashed electronic content against citation entries of references associated with the content category. That is, central server 110 checks whether any of the citation entries, e.g., references, have been referred to, referenced in or cited in the content body as well as the traditional index fields of the electronic content. These citation entries are maintained in citation library 250.
At step 525, central server 110 then determines a second score for the electronic content based on any appearances of references to a citation entry associated with the content category.
In determining the second score, different citation entry matches may have different weights associated therewith. The second score is also preferably a number between zero (0) and one (1).
At step 530, central server 110 determines a final score based on the first and second scores. For example, the final score may be the first score multiplied by the second score. The final score is preferably a number between zero (0) and one (1).
Additionally, these first and second scores may also be weighted differently in the determination of the final score. These weights may be preset according to the client or determined after an initial or preliminary examination through the hashed content of the electronic content based on the appearance or non-appearance of filter elements of the Boolean filter or the citation entries associated with the content category. For example, a greater weight may be given to the score in which more filter matches occurred in the initial examination.
At step 535, central server 110 determines whether the final score is less than a threshold score for the content category. This threshold score may be maintained in the client configuration files 245 of a client associated with the content category and is also preferably a number between zero (0) and one (1).
If the final score is not less than the threshold score, then central server 110 assigns the electronic content to the content category. The electronic content may thereafter be provided to the client associated with the category.
Otherwise, the electronic content is not assigned to the content category.
While the above describes categorization of electronic content in one content category, central server 110 may perform the above operations to categorize any one of a plurality of electronic content in any one of a plurality of content categories associated with a plurality of clients.
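Steps 530 and 535 can be summarized in a short sketch. The product of the two scores and the "not less than the threshold" test follow the description above; applying the optional weights as exponents is an assumption made here because it keeps the result between zero and one, and the function names are hypothetical.

```python
def final_score(first: float, second: float,
                w_first: float = 1.0, w_second: float = 1.0) -> float:
    """Combine the two scores; with both weights equal to 1 this is first * second.

    Using the weights as exponents is an assumption of this sketch.
    """
    return (first ** w_first) * (second ** w_second)

def assign_to_category(first: float, second: float, threshold: float) -> bool:
    """Assign the content when the final score is not less than the threshold."""
    return final_score(first, second) >= threshold
```

Note that a plain product is zero whenever either score is zero, so under this sketch content that cites no reference from the citation library is never assigned unless the threshold is also zero.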
The many features and advantages of the present invention are apparent from the detailed specification, and thus, it is intended by the appended claims to cover all such features and advantages of the invention which fall within the true scope of the present invention.
Furthermore, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired that the present invention be limited to the exact construction and operation illustrated and described herein, and accordingly, all suitable modifications and equivalents which may be resorted to are intended to fall within the scope of the claims.

Claims

What is claimed is:
1. A method of processing electronic content, comprising: storing filter elements associated with a content category; and categorizing the electronic content in the content category based on whether any of the filter elements, associated with the content category, appear in a content body of the electronic content.
2. The method according to claim 1, wherein the filter elements are selected from the group consisting of a word, a phrase and a citation.
3. The method according to claim 2, wherein the filter elements are part of a Boolean filter expressing a relationship between the filter elements, said categorizing categorizing the electronic content in the content category based on a relationship of the appearing filter elements in a content body of the electronic content through the use of the Boolean filter.
4. The method according to claim 1, further comprising receiving in a data stream the electronic content from a content provider, across a network.
5. The method according to claim 4, wherein said receiving receives a plurality of electronic content from any one of a plurality of content providers across the network, said categorizing being performed for each of the plurality of electronic content.
6. The method according to claim 1, wherein said categorizing is performed on the electronic content to determine whether the electronic content is to be categorized in any one of a plurality of content categories having associated therewith corresponding filter elements.
7. The method according to claim 1, further comprising: retrieving configuration information defining attributes of a data stream in which the electronic content is received; and normalizing the electronic content according to the configuration information.
8. The method according to claim 7, further comprising maintaining configuration information associated with a plurality of content providers from which electronic content is to be received.
9. The method according to claim 7, wherein said normalizing comprises sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information.
10. The method according to claim 7, wherein said normalizing comprises performing interpolations within a payload of a discrete electronic content unit including the electronic content.
11. The method according to claim 10, wherein the interpolations are selected from the group consisting of: stripping out contact information; expanding at least one of unique and common abbreviations; tagging at least one of paragraphs or tables; and converting control characters outside an ASCII range to human readable text.
12. The method according to claim 10, wherein the configuration information specifies the interpolations to be performed on the payload.
13. The method according to claim 7, wherein said normalizing is performed by a generic, non-data stream or content provider specific program.
14. The method according to claim 1, further comprising indexing the received electronic content in a persistent data store.
15. The method according to claim 1, further comprising: hashing at least the body content of the electronic content; and creating a searchable vector of nodes.
16. The method according to claim 15, wherein said categorizing searches the vector of nodes to determine the appearance of any filter elements associated with the content category in the content body of the electronic content.
17. The method according to claim 15, wherein said categorizing searches the vector of nodes to determine a relationship between appearing filter elements in the content body of the electronic content.
18. The method according to claim 1, wherein said categorizing comprises: determining whether any filter elements associated with the content category appear in the content body of the electronic content; and assigning the electronic content to the content category according to the appearance of the filter element in the content body.
19. The method according to claim 18, wherein said assigning the electronic content further assigns the electronic content to the content category based on a relationship of any appearing filter elements to each other in the content body.
20. The method according to claim 1, wherein said categorizing comprises: determining a first value for the electronic content based on whether any filter elements appear in the content body of the electronic content; determining a second value for the electronic content depending on whether any references associated with the content category are cited in the content body of the electronic content; and comparing a threshold value associated with the content category to a third value based on the first and second values to determine whether to assign the electronic content to the content category.
21. The method according to claim 1, wherein the content category is associated with a client, the method further comprising providing the electronic content to the client.
22. The method according to claim 21, wherein said providing the electronic content comprises delivering the electronic content to a content management system of the client, across a network.
23. The method according to claim 21, wherein said providing the electronic content comprises: maintaining the electronic content for the client; and delivering the electronic content directly to a user requesting the electronic content from the client across a network.
24. The method according to claim 21, wherein said providing the electronic content comprises providing an address corresponding to a location of the electronic content to the client to enable access to the electronic content.
25. The method according to claim 1, wherein the content body includes at least one of text and images.
26. The method according to claim 1, wherein each filter element has a weight associated therewith.
27. An apparatus for processing electronic content, comprising: means for storing filter elements associated with a content category; and means for categorizing electronic content in the content category based on whether any of the filter elements, associated with the content category, appear in a content body of the electronic content.
28. The apparatus according to claim 27, wherein the filter elements are selected from the group consisting of a word, a phrase and a citation.
29. The apparatus according to claim 28, wherein the filter elements are part of a Boolean filter expressing a relationship between the filter elements, said means for categorizing categorizing the electronic content in the content category based on a relationship of the appearing filter elements in a content body of the electronic content through the use of the Boolean filter.
30. The apparatus according to claim 27, further comprising means for receiving in a data stream the electronic content from a content provider, across a network.
31. The apparatus according to claim 30, wherein said means for receiving receives a plurality of electronic content from any one of a plurality of content providers across the network, said means for categorizing categorizing each of the plurality of electronic content.
32. The apparatus according to claim 27, wherein said means for categorizing categorizes the electronic content in any one of a plurality of content categories having associated therewith corresponding filter elements.
33. The apparatus according to claim 27, further comprising: means for retrieving configuration information defining attributes of a data stream in which the electronic content is received; and means for normalizing the electronic content according to the configuration information.
34. The apparatus according to claim 33, further comprising means for maintaining configuration information associated with a plurality of content providers from which electronic content is to be received.
35. The apparatus according to claim 33, wherein said means for normalizing comprises means for sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information.
36. The apparatus according to claim 33, wherein said means for normalizing comprises means for performing interpolations within a payload of a discrete electronic content unit including the electronic content.
37. The apparatus according to claim 36, wherein the interpolations are selected from the group consisting of: stripping out contact information; expanding at least one of unique and common abbreviations; tagging at least one of paragraphs or tables; and converting control characters outside an ASCII range to human readable text.
38. The apparatus according to claim 36, wherein the configuration information specifies the interpolations to be performed on the payload.
39. The apparatus according to claim 33, wherein said means for normalizing comprises a generic, non-data stream or content provider specific program.
40. The apparatus according to claim 27, further comprising means for indexing the received electronic content in a persistent data store.
41. The apparatus according to claim 27, further comprising: means for hashing at least the body content of the electronic content; and means for creating a searchable vector of nodes.
42. The apparatus according to claim 41, wherein said means for categorizing comprises means for searching the vector of nodes to determine the appearance of any filter elements associated with the content category in the content body of the electronic content.
43. The apparatus according to claim 41, wherein said means for categorizing comprises means for searching the vector of nodes to determine a relationship between appearing filter elements in the content body of the electronic content.
44. The apparatus according to claim 27, wherein said means for categorizing comprises: means for determining whether any filter elements associated with the content category appear in the content body of the electronic content; and means for assigning the electronic content to the content category according to the appearance of the filter element in the content body.
45. The apparatus according to claim 44, wherein said means for assigning the electronic content further assigns the electronic content to the content category based on a relationship of any appearing filter elements to each other in the content body.
46. The apparatus according to claim 27, wherein said means for categorizing comprises: means for determining a first value for the electronic content based on whether any filter elements appear in the content body of the electronic content; means for determining a second value for the electronic content depending on whether any references associated with the content category are cited in the content body of the electronic content; and means for comparing a threshold value associated with the content category to a third value based on the first and second values to determine whether to assign the electronic content to the content category.
47. The apparatus according to claim 27, wherein the content category is associated with a client, the apparatus further comprising means for providing the electronic content to the client.
48. The apparatus according to claim 47, wherein said means for providing the electronic content comprises means for delivering the electronic content to a content management system of the client, across a network.
49. The apparatus according to claim 47, wherein said means for providing the electronic content comprises: means for maintaining the electronic content for the client; and means for delivering the electronic content directly to a user requesting the electronic content from the client across a network.
50. The apparatus according to claim 47, wherein said means for providing the electronic content comprises means for providing an address corresponding to a location of the electronic content to the client to enable access to the electronic content.
51. The apparatus according to claim 27, wherein the content body includes one of text and images.
52. The method according to claim 27, wherein each filter element has a weight associated therewith.
53. An apparatus for processing electronic content, comprising: a memory device for storing filter elements associated with a content category; and a processor for categorizing electronic content in the content category based on whether any filter elements associated with the content category appear in a content body of the electronic content.
54. The apparatus according to claim 53, wherein the filter elements are selected from the group consisting of a word, a phrase and a citation.
55. The apparatus according to claim 54, wherein the filter elements are part of a Boolean filter expressing a relationship between the filter elements, said processor categorizing the electronic content in the content category based on a relationship of the appearing filter elements in a content body of the electronic content through the use of the Boolean filter.
56. The apparatus according to claim 53, further comprising a communications interface for receiving in a data stream the electronic content from a content provider, across a network.
57. The apparatus according to claim 56, wherein the communications interface receives a plurality of electronic content from any one of a plurality of content providers across the network, said processor categorizing each of the plurality of electronic content.
58. The apparatus according to claim 53, wherein said processor categorizes the electronic content in any one of a plurality of content categories having associated therewith corresponding filter elements.
59. The apparatus according to claim 53, wherein said processor retrieves configuration information defining attributes of a data stream in which the electronic content is received and normalizes the electronic content according to the configuration information.
60. The apparatus according to claim 59, wherein said memory stores configuration information associated with a plurality of content providers from which electronic content is to be received.
61. The apparatus according to claim 59, wherein said processor normalizes the electronic content by sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information.
62. The apparatus according to claim 59, wherein said processor normalizes the electronic content by performing interpolations within a payload of a discrete electronic content unit including the electronic content.
63. The apparatus according to claim 62, wherein the interpolations are selected from the group consisting of: stripping out contact information; expanding at least one of unique and common abbreviations; tagging at least one of paragraphs or tables; and converting control characters outside an ASCII range to human readable text.
64. The apparatus according to claim 62, wherein the configuration information specifies the interpolations to be performed on the payload.
65. The apparatus according to claim 59, wherein said processor normalizes the electronic content through the use of a generic, non-data stream or content provider specific program.
66. The apparatus according to claim 53, wherein said processor indexes the electronic content in a persistent data store.
67. The apparatus according to claim 53, wherein said processor hashes at least the body content of the electronic content and creates a searchable vector of nodes.
68. The apparatus according to claim 67, wherein said processor means categorizes the electronic content by searching the vector of nodes to determine the appearance of any filter elements associated with the content category in the content body of the electronic content.
69. The apparatus according to claim 67, wherein said processor categorizes the electronic content by searching the vector of nodes to determine a relationship between appearing filter elements in the content body of the electronic content.
70. The apparatus according to claim 53, wherein said processor categorizes the electronic content by: determining whether any filter elements associated with the content category appear in the content body of the electronic content; and assigning the electronic content to the content category according to the appearance of the filter element in the content body.
71. The apparatus according to claim 70, wherein said processor further assigns the electronic content to the content category based on a relationship of any appearing filter elements to each other in the content body.
72. The apparatus according to claim 53, wherein said processor categorizes the electronic content by: determining a first value for the electronic content based on whether any filter elements appear in the content body of the electronic content; determining a second value for the electronic content depending on whether any references associated with the content category are cited in the content body of the electronic content; and comparing a threshold value associated with the content category to a third value based on the first and second values to determine whether to assign the electronic content to the content category.
73. The apparatus according to claim 53, wherein the content category is associated with a client, the processor enabling provision of the electronic content to the client.
74. The apparatus according to claim 73, wherein said processor enables provision of the electronic content by causing the electronic content to be delivered to a content management system of the client, across a network via said communications interface.
75. The apparatus according to claim 73, wherein said processor causes the electronic content to be provided to a user requesting the electronic content from the client across a network.
76. The apparatus according to claim 73, wherein said processor causes information associated with an address corresponding to a location of the electronic content to be provided to the client via said communications interface.
77. The apparatus according to claim 53, wherein the content body includes one of text and images.
78. A method of processing electronic content received in a data stream across a network, comprising: retrieving configuration information defining attributes of a data stream in which the electronic content is received; and normalizing the electronic content according to the configuration information.
79. The method according to claim 78, further comprising maintaining configuration information associated with a plurality of content providers from which electronic content is to be received.
80. The method according to claim 78, wherein said normalizing comprises sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information.
81. The method according to claim 78, wherein said normalizing comprises performing interpolations within a payload of a discrete electronic content unit including the electronic content.
82. The method according to claim 81, wherein the interpolations are selected from the group consisting of: stripping out contact information; expanding at least one of unique and common abbreviations; tagging at least one of paragraphs or tables; and converting control characters outside an ASCII range to human readable text.
83. The method according to claim 81, wherein the configuration information specifies the interpolations to be performed on the payload.
84. The method according to claim 78, wherein said normalizing is performed by a generic, non-data stream or content provider specific program.
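The categorization of claim 72 and the configuration-driven normalization of claims 78 to 83 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `StreamConfig`, `normalize` and `categorize`, the marker strings, the "Contact:" heuristic, the use of a simple sum as the third value, and the "greater than or equal" threshold test are all assumptions made for illustration; the claims do not specify them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StreamConfig:
    """Hypothetical per-provider configuration (claims 78, 79 and 83)."""
    header_end: str        # assumed marker that closes the header section
    trailer_start: str     # assumed marker that opens the trailer section
    abbreviations: dict = field(default_factory=dict)
    strip_contact_info: bool = True


def normalize(raw: str, cfg: StreamConfig) -> dict:
    """Sub-parse raw content into header, payload and trailer (claim 80),
    then apply the configured interpolations to the payload (claims 81-83)."""
    header, _, rest = raw.partition(cfg.header_end)
    payload, _, trailer = rest.partition(cfg.trailer_start)
    if cfg.strip_contact_info:
        # Assumed heuristic: drop any line that starts with "Contact:".
        payload = "\n".join(
            line for line in payload.splitlines()
            if not line.strip().lower().startswith("contact:")
        )
    for short, full in cfg.abbreviations.items():
        # Expand whole-word abbreviations; the lambda avoids escape handling
        # in the replacement string.
        payload = re.sub(rf"\b{re.escape(short)}\b", lambda m, f=full: f, payload)
    # Assumed conversion: show non-ASCII characters as readable escapes.
    payload = payload.encode("ascii", "backslashreplace").decode("ascii")
    return {
        "header": header.strip(),
        "payload": payload.strip(),
        "trailer": trailer.strip(),
    }


def categorize(payload, filter_elements, references, threshold) -> bool:
    """Claim 72: derive a first and second value, combine, compare to threshold."""
    text = payload.lower()
    first = sum(1 for f in filter_elements if f.lower() in text)   # filter hits
    second = sum(1 for r in references if r.lower() in text)       # cited references
    third = first + second   # assumed combination; the claim leaves this open
    return third >= threshold


# Usage with made-up content and a made-up provider configuration.
raw = (
    "SRC=wire1\n--BODY--\n"
    "Acme posted Q3 profit, Reuters said.\n"
    "Contact: pr@acme.example\n"
    "--END--\nchecksum"
)
cfg = StreamConfig("--BODY--", "--END--", abbreviations={"Q3": "third-quarter"})
doc = normalize(raw, cfg)
assigned = categorize(doc["payload"], ["profit", "merger"], ["Reuters"], threshold=2)
```

Because the markers and interpolations live in the configuration object, the same `normalize` function serves every content provider, which is the property recited in claim 84.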
PCT/US2000/031016 1999-11-10 2000-11-09 Content engine WO2001035281A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU14842/01A AU1484201A (en) 1999-11-10 2000-11-09 Content engine

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US43800499A 1999-11-10 1999-11-10
US09/438,004 1999-11-10

Publications (1)

Publication Number Publication Date
WO2001035281A1 (en) 2001-05-17

Family

ID=23738828

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/031016 WO2001035281A1 (en) 1999-11-10 2000-11-09 Content engine

Country Status (2)

Country Link
AU (1) AU1484201A (en)
WO (1) WO2001035281A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5867799A (en) * 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs
US5905862A (en) * 1996-09-04 1999-05-18 Intel Corporation Automatic web site registration with multiple search engines
US5913215A (en) * 1996-04-09 1999-06-15 Seymour I. Rubinstein Browse by prompted keyword phrases with an improved method for obtaining an initial document set
US5973696A (en) * 1996-08-08 1999-10-26 Agranat Systems, Inc. Embedded web server

Also Published As

Publication number Publication date
AU1484201A (en) 2001-06-06

Similar Documents

Publication Publication Date Title
US6012053A (en) Computer system with user-controlled relevance ranking of search results
US8099423B2 (en) Hierarchical metadata generator for retrieval systems
US7949660B2 (en) Method and apparatus for searching and resource discovery in a distributed enterprise system
US6182066B1 (en) Category processing of query topics and electronic document content topics
US6334132B1 (en) Method and apparatus for creating a customized summary of text by selection of sub-sections thereof ranked by comparison to target data items
US7562076B2 (en) Systems and methods for search query processing using trend analysis
US8014997B2 (en) Method of search content enhancement
JP3755134B2 (en) Computer-based matched text search system and method
US7039625B2 (en) International information search and delivery system providing search results personalized to a particular natural language
US6236991B1 (en) Method and system for providing access for categorized information from online internet and intranet sources
US6826576B2 (en) Very-large-scale automatic categorizer for web content
JP4274689B2 (en) Method and system for selecting data sets
US6327589B1 (en) Method for searching a file having a format unsupported by a search engine
US8290956B2 (en) Methods and systems for searching and associating information resources such as web pages
US7092938B2 (en) Universal search management over one or more networks
US20050065774A1 (en) Method of self enhancement of search results through analysis of system logs
US20050108200A1 (en) Category based, extensible and interactive system for document retrieval
US20020065857A1 (en) System and method for analysis and clustering of documents for search engine
Crabtree et al. Improving web clustering by cluster selection
EP1606704A2 (en) Systems and methods for interactive search query refinement
US20040015485A1 (en) Method and apparatus for improved internet searching
CN111737607A (en) Data processing method, data processing device, electronic equipment and storage medium
US7483877B2 (en) Dynamic comparison of search systems in a controlled environment
WO2001035281A1 (en) Content engine
KR102351264B1 (en) Method for providing personalized information of new books and system for the same

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase