US20100281009A1 - Hierarchical conditional random fields for web extraction - Google Patents

Hierarchical conditional random fields for web extraction

Publication number
US20100281009A1
Authority
US
United States
Prior art keywords
vertices
clique
observation
observations
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/776,308
Inventor
Ji-Rong Wen
Wei-Ying Ma
Zaiqing Nie
Jun Zhu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US12/776,308
Publication of US20100281009A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC. Assignment of assignors interest (see document for details). Assignors: MICROSOFT CORPORATION
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/904 Browsing; Visualisation therefor

Definitions

  • Web pages accessible via the Internet contain a vast amount of information.
  • a web page may contain information about various types of objects such as products, people, papers, organizations, and so on.
  • one web page may contain a product review of a certain model of camera, and another web page may contain an advertisement offering to sell that model of camera at a certain price.
  • one web page may contain a journal article, and another web page may be the homepage of an author of the journal article.
  • a person who is searching for information about an object may need information that is contained in different web pages. For example, a person who is interested in purchasing a certain camera may want to read reviews of the camera and to determine who is offering the camera at the lowest price.
  • a person would typically use a search engine to find web pages that contain information about the camera.
  • the person would enter a search query that may include the manufacturer and model number of the camera.
  • the search engine then identifies web pages that match the search query and presents those web pages to the user in an order that is based on how relevant the content of the web page is to the search query.
  • the person would then need to view the various web pages to find the desired information. For example, the person may first try to find web pages that contain reviews of the camera. After reading the reviews, the person may then try to locate a web page that contains an advertisement for the camera at the lowest price.
  • Web search systems have not been particularly helpful to users trying to find information about a specific product because of the difficulty in accurately identifying objects and their attributes from web pages.
  • Web pages often allocate a record for each object that is to be displayed. For example, a web page that lists several cameras for sale may include a record for each camera. Each record contains attributes of the object such as an image of the camera, its make and model, and its price.
  • Web pages contain a wide variety of layouts of records and layouts of attributes within records. Systems identifying records and their attributes from web pages are typically either template-dependent or template-independent. Template-dependent systems may have templates for both the layout of records on web pages and the layout of attributes within a record.
  • Such a system finds record templates that match portions of a web page and then finds attribute templates that match the attributes of the record.
  • Template-independent systems typically try to identify whether a web page is a list page (i.e., listing multiple records) or a detail page (i.e., a single record). The template-independent system then tries to identify records from meta-data of the web page (e.g., tables) based on this distinction. Such systems may then use various heuristics to identify the attributes of the records.
  • a difficulty with these systems is that records are often incorrectly identified. An error in the identification of a record will propagate to the identification of attributes. As a result, the overall accuracy is limited by the accuracy of the identification of records.
  • Another difficulty with these systems is that they typically do not take into consideration the semantics of the content of a portion that is identified as a record. A person, in contrast, can easily identify records by factoring in the semantics of their content.
  • a method and system for labeling object information of an information page is provided.
  • a labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements.
  • the labeling system jointly identifies records and labels elements in a way that is more effective than if performed separately.
  • the labeling system generates a hierarchical representation of blocks of an information page with blocks being represented as vertices of the hierarchical representation.
  • the labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks.
  • the labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks.
  • the labeling system may use a propagation technique to propagate the effect of a labeling of one block to the other blocks within the hierarchical representation.
  • the labeling system searches for the labeling of records and elements that has the highest probability of being correct.
  • a labeling system uses a hierarchical conditional random fields (“CRF”) technique to label the object elements.
  • FIG. 1 is a diagram that illustrates an example web page and its corresponding vision tree.
  • FIG. 2 is a diagram that illustrates the graphical structure of the hierarchical conditional random fields of a vision tree.
  • FIG. 3 is a diagram that represents the junction tree corresponding to FIG. 2 .
  • FIG. 4 is a block diagram that illustrates components of the labeling system in one embodiment.
  • FIG. 5 is a flow diagram that illustrates the processing of the label documents component of the labeling system in one embodiment.
  • FIG. 6 is a flow diagram that illustrates the processing of the generate junction tree component of the labeling system in one embodiment.
  • FIG. 7 is a flow diagram that illustrates the processing of the propagate beliefs component of the labeling system in one embodiment.
  • FIG. 8 is a flow diagram that illustrates the processing of the collect component of the labeling system in one embodiment.
  • FIG. 9 is a flow diagram that illustrates the processing of the distribute component of the labeling system in one embodiment.
  • FIG. 10 is a flow diagram that illustrates the processing of the learn parameters component of the labeling system in one embodiment.
  • a labeling system identifies an object record of an information page, such as a web page, based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements.
  • the labeling system generates a hierarchical representation of blocks of a web page with blocks being represented as vertices of the hierarchical representation.
  • a block may represent a collection of information of a web page that is visually related.
  • a root block represents the entire web page
  • a leaf block is an atomic unit (such as an element of a record)
  • inner blocks represent collections of their child blocks.
  • the labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks.
  • the labeling system generates a feature vector for each block to represent the block.
  • the labeling system calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks.
  • a related block may be a block that is either a parent block or a nearest sibling block within the hierarchical representation.
  • a collection of related blocks is referred to as a “clique.”
  • the labeling system may define feature functions that generate scores, which are combined to give an overall score for a label for a block.
  • a feature function may evaluate the features of a block itself, the combined features of a block and a related block, and the combined features of a block and all its related blocks.
  • the labeling system may use a propagation technique to propagate the effect of a labeling of one block to the other blocks within the hierarchical representation.
  • the labeling system searches for the labeling of records and elements that has the highest probability of being correct.
  • the labeling system uses a vision-based page segmentation (“VIPS”) technique to generate a hierarchical representation of blocks of a web page.
  • One VIPS technique is described in Cai, D., Yu, S., Wen, J., and Ma, W., “VIPS: A Vision-Based Page Segmentation Algorithm,” Microsoft Technical Report, MSR-TR-2003-79, 2003, which is hereby incorporated by reference.
  • a VIPS technique uses page layout features (e.g., font, color, and size) to construct a “vision tree” for a web page. The technique identifies nodes from the HTML tag tree and identifies separators (e.g., horizontal and vertical lines) between the nodes.
  • FIG. 1 is a diagram that illustrates an example web page and its corresponding vision tree.
  • the web page 110 includes data records 120 and 130 , which correspond to blocks 160 and 170 , respectively, of vision tree 150 .
  • the vision tree includes leaf blocks 162 and 163 corresponding to image 122 and description 123 and leaf blocks 172 , 174 , and 175 , corresponding to image 132 and descriptions 134 and 135 .
  • the labeling system performs a joint optimization for record identification and element (or attribute) labeling.
  • the labeling system generates a feature vector for each block.
  • the goal of the labeling system is to calculate the maximum posterior probability of Y and extract data from the assignment as represented by the following: $$y^{*} = \arg\max_{y}\, p(y \mid x)$$
  • $y^{*}$ represents the labeling with the highest probability for the block represented by the feature vector x.
  • the labeling system thus provides a uniform framework for record identification and attribute labeling. As a result, records that are wrongly identified and cause attribute labeling to perform badly will have a low probability and thus not be selected as the correct labeling. Furthermore, since record identification and attribute labeling are conducted simultaneously, the labeling system can leverage the attribute labels for a better record identification.
  • the labeling system uses a hierarchical conditional random fields (“CRF”) technique to label the records and elements of the vision tree representing a web page.
  • CRFs are Markov random fields globally conditioned on the observations X.
  • the conditional distribution of the labels y given the observations x has the form represented by the following: $$p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \varphi_c\left(c, y|_c, x\right) \qquad (1)$$
  • $C$ represents a set of cliques in graph $G$
  • $y|_c$ represents the components of y associated with clique c
  • $\varphi_c$ represents a potential function defined on $y|_c$
  • $Z$ is a normalization factor
  • FIG. 2 is a diagram that illustrates the graphical structure of the hierarchical conditional random fields of a vision tree.
  • the circles represent inner blocks and the rectangles represent leaf blocks. The observations that are globally conditioned are not shown.
  • the labeling system assumes that every inner block contains at least two child blocks. If an inner block does not contain two child blocks, the labeling system replaces the parent block with the child block.
  • the cliques of the graph in FIG. 2 are its vertices, edges, and triangles.
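The single-child replacement rule and the three clique types described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Block` class and both function names are hypothetical, and a triangle is taken to be a parent together with two adjacent siblings, which is inferred from the parent and nearest-sibling relation rather than spelled out in the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical vision-tree block: a name plus its child blocks."""
    name: str
    children: List["Block"] = field(default_factory=list)


def collapse_single_child(block: Block) -> Block:
    """Replace every inner block that has exactly one child with that child."""
    while len(block.children) == 1:
        block = block.children[0]
    block.children = [collapse_single_child(c) for c in block.children]
    return block


def enumerate_cliques(root: Block) -> Tuple[list, list, list]:
    """List vertex, edge, and triangle cliques of the hierarchical graph.

    Assumption: edges join parent-child and adjacent-sibling pairs, and a
    triangle is a parent with two adjacent children.
    """
    vertices, edges, triangles = [], [], []
    stack = [root]
    while stack:
        block = stack.pop()
        vertices.append(block.name)
        kids = block.children
        for kid in kids:
            edges.append((block.name, kid.name))          # parent-child edge
        for left, right in zip(kids, kids[1:]):
            edges.append((left.name, right.name))         # nearest-sibling edge
            triangles.append((block.name, left.name, right.name))
        stack.extend(reversed(kids))                      # visit children left to right
    return vertices, edges, triangles
```

Because every inner block keeps at least two children after collapsing, each inner block contributes at least one triangle, which is what makes the triangle the largest clique in the graph.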
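  • the labeling system represents the conditional probability of Equation 1 as follows: $$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{v,k} \mu_k\, g_k(v, y|_v, x) + \sum_{e,k} \lambda_k\, f_k(e, y|_e, x) + \sum_{t,k} \gamma_k\, h_k(t, y|_t, x) \Big)$$
  • $g_k$, $f_k$, and $h_k$ represent feature functions defined on three types of cliques (i.e., vertex, edge, and triangle, respectively); $\mu_k$, $\lambda_k$, and $\gamma_k$ represent the corresponding weights; $v \in V$; $e \in E$; and t is a triangle, which is also a maximum clique.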
  • Although the feature functions generate real values, the labeling system may be implemented so that they are Boolean, that is, true if the feature matches and false otherwise.
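As a rough illustration of how weighted Boolean feature functions on vertex, edge, and triangle cliques combine into a conditional probability, the sketch below scores a three-block toy graph and normalizes by brute-force enumeration. Everything named here (the block names, the observation flags, the feature functions, and the weights) is a made-up assumption for illustration, and enumeration is only practical for tiny graphs.

```python
import itertools
import math


def log_score(y, x, graph, feats, weights):
    """Weighted sum of feature values over all vertex, edge, and triangle cliques."""
    total = 0.0
    for kind in ("vertex", "edge", "triangle"):
        for clique in graph[kind]:
            y_c = tuple(y[b] for b in clique)   # labels restricted to this clique
            for w, fn in zip(weights[kind], feats[kind]):
                total += w * float(fn(clique, y_c, x))
    return total


def conditional_probability(y, x, graph, feats, weights, labels):
    """p(y | x): exponentiated score divided by the sum over every labeling."""
    blocks = [v[0] for v in graph["vertex"]]
    z = sum(
        math.exp(log_score(dict(zip(blocks, ys)), x, graph, feats, weights))
        for ys in itertools.product(labels, repeat=len(blocks))
    )
    return math.exp(log_score(y, x, graph, feats, weights)) / z


# Hypothetical toy page: one record block containing an image and a text block.
LABELS = ("record", "image", "text")
GRAPH = {
    "vertex": [("rec",), ("img",), ("txt",)],
    "edge": [("rec", "img"), ("rec", "txt"), ("img", "txt")],
    "triangle": [("rec", "img", "txt")],
}
X = {
    "rec": {"has_image": False, "has_text": False, "is_container": True},
    "img": {"has_image": True, "has_text": False, "is_container": False},
    "txt": {"has_image": False, "has_text": True, "is_container": False},
}
FEATS = {
    "vertex": [
        lambda c, y, x: y[0] == "image" and x[c[0]]["has_image"],
        lambda c, y, x: y[0] == "text" and x[c[0]]["has_text"],
    ],
    "edge": [
        lambda c, y, x: x[c[0]]["is_container"] and y[0] == "record" and y[1] != "record",
    ],
    "triangle": [lambda c, y, x: y == ("record", "image", "text")],
}
WEIGHTS = {"vertex": [2.0, 2.0], "edge": [1.0], "triangle": [1.5]}
```

With these toy weights, the labeling that marks the container as a record and its children as image and text receives the largest score, so it also has the highest conditional probability.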
  • An example feature function is represented by the following:
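  • the labeling system represents the log-likelihood of the empirical distribution $\tilde{p}(x, y)$ with respect to a conditional model $p(y \mid x, \Theta)$, where $\Theta$ denotes the weights, as follows: $$L(\Theta) = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x, \Theta)$$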
  • the labeling system identifies the weights as the values that optimize the concave log-likelihood function.
  • the labeling system may use various techniques to determine the weights. For example, the labeling system can use techniques used in other maximum-entropy models as described in Lafferty, J., McCallum, A., & Pereira, F., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in Proc. ICML, 2001.
  • the labeling system may also use a gradient-based L-BFGS as described in Liu, D. C., & Nocedal, J., “On The Limited Memory BFGS Method for Large Scale Optimization,” Mathematical Programming 45, pp. 503-528, 1989.
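  • the gradient-based model represents each element of the gradient vector of the log-likelihood $L(\Theta)$, here with respect to the weight $\lambda_k$ of $f_k$, as follows: $$\frac{\partial L(\Theta)}{\partial \lambda_k} = E_{\tilde{p}(y, x)}[f_k] - E_{p(y \mid x, \Theta)}[f_k]$$
  • $E_{\tilde{p}(y, x)}[f_k]$ is the expectation with respect to the empirical distribution
  • $E_{p(y \mid x, \Theta)}[f_k]$ is the expectation with respect to the conditional model distribution.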
  • the expectations of f k are:
  • the labeling system calculates the expectation for the empirical distribution once and calculates the marginal probabilities for the model distribution during each iteration while solving the optimization problem of Equation 4. Since the graph of FIG. 2 is a chordal graph, the labeling system performs inference to calculate the marginal probabilities using a junction tree algorithm.
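The gradient described above, an empirical expectation minus a model expectation, can be checked numerically on a toy model. The sketch below is an assumption-laden illustration: it uses a single edge clique with two labels, made-up feature functions, and brute-force enumeration in place of the junction tree inference the text describes.

```python
import itertools
import math

LABELS = ("name", "price")


def features(y, x):
    """Hypothetical Boolean features on one edge clique (two sibling blocks)."""
    return [
        float(y[0] == "name" and x[0][:1].isupper()),
        float(y[1] == "price" and x[1].startswith("$")),
        float(y[0] != y[1]),
    ]


def _scores(weights, x):
    """Unnormalized log-scores of every candidate labeling of the clique."""
    cands = list(itertools.product(LABELS, repeat=2))
    return cands, [sum(w * f for w, f in zip(weights, features(c, x))) for c in cands]


def log_likelihood(weights, data):
    """Average conditional log-likelihood of the training labelings."""
    total = 0.0
    for x, y in data:
        cands, scores = _scores(weights, x)
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += scores[cands.index(y)] - log_z
    return total / len(data)


def gradient(weights, data):
    """Empirical feature expectation minus model feature expectation."""
    grad = [0.0] * len(weights)
    for x, y in data:
        cands, scores = _scores(weights, x)
        z = sum(math.exp(s) for s in scores)
        for k, f in enumerate(features(y, x)):
            grad[k] += f                          # empirical term
        for c, s in zip(cands, scores):
            p = math.exp(s) / z                   # model probability of labeling c
            for k, f in enumerate(features(c, x)):
                grad[k] -= p * f                  # model term
    return [g / len(data) for g in grad]


DATA = [
    (("Canon", "$199"), ("name", "price")),
    (("canon", "199"), ("price", "name")),
]
```

Comparing `gradient` against a finite-difference estimate of `log_likelihood` is a quick way to confirm that the two expectations were accumulated with the right signs.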
  • a junction tree algorithm is described in Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D., “Probabilistic Networks and Expert Systems,” Springer-Verlag, 1999.
  • a junction tree algorithm constructs the junction tree, initializes potentials of the vertices of the junction tree, and propagates beliefs among the vertices.
  • FIG. 3 represents the junction tree corresponding to FIG. 2 .
  • the ellipses represent cliques, and the rectangles represent separators. All the cliques have size 3, since the maximum clique in FIG. 2 is size 3.
  • the labeling system builds the junction tree by obtaining a set of maximal elimination cliques using node elimination. The labeling system then builds a complete cluster graph with weights over cliques. The labeling system selects the spanning tree with the maximum weight as the junction tree.
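The three construction steps above (node elimination, a weighted cluster graph, and a maximum-weight spanning tree) can be sketched as follows. The function names are hypothetical, and weighting each cluster-graph edge by the size of the shared separator is an assumption based on common junction tree practice; the text does not state how the weights are chosen.

```python
import itertools


def elimination_cliques(adjacency, order):
    """Eliminate nodes in the given order and keep the maximal elimination cliques."""
    adj = {v: set(n) for v, n in adjacency.items()}
    cliques = []
    for v in order:
        cliques.append(frozenset(adj[v] | {v}))
        # Connect the remaining neighbours (no new edges arise for a chordal
        # graph eliminated in a perfect order).
        for a, b in itertools.combinations(adj[v], 2):
            adj[a].add(b)
            adj[b].add(a)
        for n in adj[v]:
            adj[n].discard(v)
        del adj[v]
    return [c for c in cliques if not any(c < other for other in cliques)]


def junction_tree(cliques):
    """Maximum-weight spanning tree of the cluster graph (Kruskal's algorithm).

    Assumption: the weight of an edge is the number of shared variables.
    Returns (clique, clique, separator) triples.
    """
    edges = sorted(
        ((len(a & b), i, j)
         for (i, a), (j, b) in itertools.combinations(enumerate(cliques), 2)
         if a & b),
        reverse=True,
    )
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                     # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((cliques[i], cliques[j], cliques[i] & cliques[j]))
    return tree
```

For a parent `p` with three children `a`, `b`, `c` in sibling order, eliminating the outer children first yields the two triangles `{p, a, b}` and `{p, b, c}`, joined by the separator `{p, b}`.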
  • the labeling system initializes all the potentials of the junction tree to have a value of 1 and multiplies the potential of a vertex, an edge, or a triangle into the potential of any one clique node of T which covers its variables.
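  • the potential of a vertex v, an edge e, and a triangle t is represented by the following, where $\mu_k$, $\lambda_k$, and $\gamma_k$ are the weights of the feature functions $g_k$, $f_k$, and $h_k$:
  • $$\varphi_v(v, y|_v, x) = \exp\Big( \sum_k \mu_k\, g_k(v, y|_v, x) \Big)$$
  • $$\varphi_e(e, y|_e, x) = \exp\Big( \sum_k \lambda_k\, f_k(e, y|_e, x) \Big)$$
  • $$\varphi_t(t, y|_t, x) = \exp\Big( \sum_k \gamma_k\, h_k(t, y|_t, x) \Big)$$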
  • the labeling system in one embodiment uses a two-phase schedule algorithm to propagate beliefs within the junction tree.
  • a two-phase schedule algorithm is described in Jensen, F., Lauritzen, S., and Olesen, K., “Bayesian Updating in Causal Probabilistic Networks by Local Computations,” Computational Statistics Quarterly, 4:269-82, 1990.
  • the algorithm uses a collection and distribution phase to calculate the potentials for the cliques and separators.
  • One skilled in the art will appreciate that the labeling system can use other message passing techniques to propagate beliefs.
  • the potentials represent marginal potentials that are used by the labeling system to guide finding the solution for the weights that best match the training data.
  • the labeling system uses the weights to find labels for the blocks of web pages.
  • the labeling system uses the VIPS technique, the junction tree algorithm, and a modified two-phase schedule algorithm to find the best labeling.
  • the labeling system generates a vision tree from the web page and generates a junction tree.
  • the labeling system modifies the two-phase schedule algorithm by replacing its summations with maximizations.
  • the best labeling for a block is found from the potential of any clique that contains the block.
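The two-phase schedule and its max variant can be sketched with one pair of recursive routines whose aggregation operator is a parameter: `sum` gives marginal potentials for training, and `max` gives the potentials from which the best labeling is read. This is a minimal sketch, not the patented implementation; the `Clique` class, the dictionary-based potential tables, and the division-based update are assumptions chosen for brevity.

```python
import itertools


class Clique:
    """One junction tree node: variables, a potential table, and child cliques."""

    def __init__(self, variables, potential_fn, labels=(0, 1), children=()):
        self.vars = tuple(variables)
        self.pot = {a: float(potential_fn(*a))
                    for a in itertools.product(labels, repeat=len(self.vars))}
        self.children = list(children)
        self.up = None  # (separator variables, message sent to the parent)


def project(clique, keep, aggregate):
    """Aggregate (sum or max) the clique potential down to the variables in keep."""
    idx = [clique.vars.index(v) for v in keep]
    grouped = {}
    for assignment, value in clique.pot.items():
        grouped.setdefault(tuple(assignment[i] for i in idx), []).append(value)
    return {key: aggregate(values) for key, values in grouped.items()}


def absorb(clique, sep, message):
    """Multiply a separator message into the clique potential."""
    idx = [clique.vars.index(v) for v in sep]
    for assignment in clique.pot:
        clique.pot[assignment] *= message[tuple(assignment[i] for i in idx)]


def collect(clique, aggregate=sum):
    """Phase 1: pull messages from the leaves up to the passed clique."""
    for child in clique.children:
        collect(child, aggregate)
        sep = tuple(v for v in child.vars if v in clique.vars)
        child.up = (sep, project(child, sep, aggregate))
        absorb(clique, *child.up)


def distribute(clique, aggregate=sum):
    """Phase 2: push calibrated separator potentials down from the passed clique."""
    for child in clique.children:
        sep, up = child.up
        down = project(clique, sep, aggregate)
        absorb(child, sep, {k: down[k] / up[k] if up[k] else 0.0 for k in down})
        distribute(child, aggregate)


def propagate_beliefs(root, aggregate=sum):
    collect(root, aggregate)
    distribute(root, aggregate)


def best_label(clique, variable):
    """Read a block's best label from any max-calibrated clique that contains it."""
    best = max(clique.pot, key=clique.pot.get)
    return best[clique.vars.index(variable)]
```

Calling `propagate_beliefs(root, max)` and then `best_label` on any clique containing a block returns that block's label in the highest-scoring joint labeling, provided that labeling is unique.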
  • FIG. 4 is a block diagram that illustrates components of the labeling system in one embodiment.
  • the labeling system 410 is connected to web sites 420 via communications link 430 .
  • the labeling system includes a document store 411 and a training data store 412 .
  • the document store contains web pages that may be collected from the various web sites.
  • the training data store contains vision trees generated from the web pages along with the correct labeling of the records and elements of the web pages.
  • the labeling system also includes a learn parameters component 413 and a label documents component 414 .
  • the learn parameters component inputs the training data and generates the weights for the feature functions.
  • the label documents component inputs a web page and identifies the correct labeling for the blocks within the web page.
  • the labeling system also includes auxiliary components such as a generate junction tree component 415 , a propagate beliefs component 416 , a collect component 417 , and a distribute component 418 as described below in detail.
  • the computing devices on which the labeling system may be implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives).
  • the memory and storage devices are computer-readable media that may contain instructions that implement the labeling system.
  • the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link.
  • Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection.
  • the labeling system may be used in various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • the labeling system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types.
  • functionality of the program modules may be combined or distributed as desired in various embodiments.
  • FIG. 5 is a flow diagram that illustrates the processing of the label documents component of the labeling system in one embodiment.
  • the label documents component is passed a web page and returns an assignment of labels for the blocks of the web page.
  • the component generates a vision tree for the web page.
  • the component invokes the generate junction tree component to generate a junction tree based on the vision tree.
  • the component invokes the propagate beliefs component to propagate the beliefs for the assignments of the labels to the vertices of the junction tree.
  • the component assigns the labels with the highest probability to the blocks of the web page and then completes.
  • FIG. 6 is a flow diagram that illustrates the processing of the generate junction tree component of the labeling system in one embodiment.
  • the component is passed a vision tree and generates a junction tree as described in Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D., “Probabilistic Networks and Expert Systems,” Springer-Verlag, 1999.
  • the component orders the nodes of the vision tree.
  • the component identifies the elimination cliques using node elimination.
  • the component generates a cluster graph from the elimination cliques.
  • the component adds weights to the edges based on the probabilities.
  • the component identifies a spanning tree with the maximum weight. The identified spanning tree is the junction tree. The component then returns.
  • FIG. 7 is a flow diagram that illustrates the processing of the propagate beliefs component of the labeling system in one embodiment.
  • the component implements the two-phase schedule algorithm by invoking the collect component in block 701 passing the root node of the junction tree and then invoking the distribute component in block 702 passing the root node of the junction tree.
  • the collect component and the distribute component are recursive routines that collect and distribute the potentials of the nodes of the junction tree. The component then returns.
  • FIG. 8 is a flow diagram that illustrates the processing of the collect component of the labeling system in one embodiment.
  • the component recursively invokes itself for each child clique of the passed clique to collect the potentials from the child cliques.
  • the component calculates the potential of the passed clique.
  • the component loops selecting each child clique of the passed clique.
  • the component selects the next child clique.
  • In decision block 803, if all the child cliques have already been selected, then the component returns the accumulated potential, else the component continues at block 804.
  • the component invokes the collect component recursively passing the selected child clique.
  • the component accumulates the product of the potential of the passed clique with the potential provided by the selected child clique and then loops to block 802 to select the next child clique.
  • FIG. 9 is a flow diagram that illustrates the processing of the distribute component of the labeling system in one embodiment.
  • the component is passed a clique along with a potential to be distributed to that clique.
  • the component calculates a new potential for the clique factoring in the passed potential.
  • the component loops selecting each child clique and recursively invoking itself.
  • the component selects the next child clique.
  • In decision block 903, if all the child cliques have already been selected, then the component returns, else the component continues at block 904.
  • the component recursively invokes the distribute component passing the child clique and then loops to block 902 to select the next child clique.
  • FIG. 10 is a flow diagram that illustrates the processing of the learn parameters component of the labeling system in one embodiment.
  • the learn parameters component inputs the training data and learns the weights for the feature functions.
  • the component inputs the training data.
  • the component calculates the expectations based on the training data.
  • the component generates a junction tree for each web page of the training data by invoking the generate junction tree component for each web page.
  • the component initializes potentials of the junction trees.
  • the component loops selecting new weights until the weights converge on a solution.
  • the component selects the next junction tree.
  • In decision block 1006, if all the junction trees have already been selected, then the component continues at block 1008, else the component continues at block 1007.
  • the component invokes the propagate beliefs component to propagate the beliefs for the selected junction tree and then loops to block 1005 to select the next junction tree.
  • the component calculates a differential between the expectation based on the propagated beliefs and the expectation based on the training data.
  • In decision block 1009, if the differential approaches zero, then the component returns with a solution, else the component continues at block 1010.
  • the component adjusts the weights of the feature functions in the direction of the minimum gradient descent and then loops to block 1005 to propagate beliefs of the junction tree with the new weights.
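The learning loop above can be sketched on a deliberately tiny model: the empirical expectation is computed once, and each iteration recomputes the model expectation, tests whether the differential is near zero, and otherwise adjusts the weights. The sketch makes several assumptions: it uses vertex-only features with made-up definitions, plain gradient steps instead of L-BFGS, and direct enumeration instead of belief propagation over junction trees.

```python
import math

LABELS = ("name", "price")


def features(label, text):
    """Hypothetical Boolean vertex features for a leaf block's text."""
    return [
        float(label == "price" and text.startswith("$")),
        float(label == "name" and text[:1].isupper()),
    ]


def model_expectation(weights, data):
    """Expected feature values under the current conditional model."""
    expected = [0.0] * len(weights)
    for text, _ in data:
        scores = [math.exp(sum(w * f for w, f in zip(weights, features(l, text))))
                  for l in LABELS]
        z = sum(scores)
        for label, s in zip(LABELS, scores):
            for k, f in enumerate(features(label, text)):
                expected[k] += (s / z) * f
    return [e / len(data) for e in expected]


def learn_parameters(data, step=0.5, tol=1e-4, max_iter=10000):
    """Adjust weights until empirical and model expectations nearly agree."""
    n = len(features(LABELS[0], ""))
    # The empirical expectation depends only on the training data: compute it once.
    empirical = [sum(features(y, t)[k] for t, y in data) / len(data) for k in range(n)]
    weights = [0.0] * n
    for _ in range(max_iter):
        model = model_expectation(weights, data)
        differential = [e - m for e, m in zip(empirical, model)]
        if max(abs(d) for d in differential) < tol:
            break                                   # differential approaches zero
        weights = [w + step * d for w, d in zip(weights, differential)]
    return weights


TRAINING = [
    ("$199", "price"), ("$249", "price"), ("$5 Guide", "name"),
    ("Canon", "name"), ("Nikon", "name"), ("Sale", "price"),
]
```

The toy training set is intentionally noisy so that the optimum is finite; with perfectly separable data the weights would keep growing and the loop would stop only at the tolerance or the iteration cap.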

Abstract

A method and system for labeling object information of an information page is provided. A labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. To identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of an information page. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. The labeling system searches for the labeling of records and elements that has the highest probability of being correct.

Description

    BACKGROUND
  • Web pages accessible via the Internet contain a vast amount of information. A web page may contain information about various types of objects such as products, people, papers, organizations, and so on. For example, one web page may contain a product review of a certain model of camera, and another web page may contain an advertisement offering to sell that model of camera at a certain price. As another example, one web page may contain a journal article, and another web page may be the homepage of an author of the journal article. A person who is searching for information about an object may need information that is contained in different web pages. For example, a person who is interested in purchasing a certain camera may want to read reviews of the camera and to determine who is offering the camera at the lowest price.
  • To obtain such information, a person would typically use a search engine to find web pages that contain information about the camera. The person would enter a search query that may include the manufacturer and model number of the camera. The search engine then identifies web pages that match the search query and presents those web pages to the user in an order that is based on how relevant the content of the web page is to the search query. The person would then need to view the various web pages to find the desired information. For example, the person may first try to find web pages that contain reviews of the camera. After reading the reviews, the person may then try to locate a web page that contains an advertisement for the camera at the lowest price.
  • Web search systems have not been particularly helpful to users trying to find information about a specific product because of the difficulty in accurately identifying objects and their attributes from web pages. Web pages often allocate a record for each object that is to be displayed. For example, a web page that lists several cameras for sale may include a record for each camera. Each record contains attributes of the object such as an image of the camera, its make and model, and its price. Web pages contain a wide variety of layouts of records and layouts of attributes within records. Systems identifying records and their attributes from web pages are typically either template-dependent or template-independent. Template-dependent systems may have templates for both the layout of records on web pages and the layout of attributes within a record. Such a system finds record templates that match portions of a web page and then finds attribute templates that match the attributes of the record. Template-independent systems, in contrast, typically try to identify whether a web page is a list page (i.e., listing multiple records) or a detail page (i.e., a single record). The template-independent system then tries to identify records from meta-data of the web page (e.g., tables) based on this distinction. Such systems may then use various heuristics to identify the attributes of the records.
  • A difficulty with these systems is that records are often incorrectly identified. An error in the identification of a record will propagate to the identification of attributes. As a result, the overall accuracy is limited by the accuracy of the identification of records. Another difficulty with these systems is that they typically do not take into consideration the semantics of the content of a portion that is identified as a record. A person, in contrast, can easily identify records by factoring in the semantics of their content.
    SUMMARY
  • A method and system for labeling object information of an information page is provided. A labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. Thus, the labeling system jointly identifies records and labels elements in a way that is more effective than if performed separately. To jointly identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of an information page with blocks being represented as vertices of the hierarchical representation. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. The labeling system may use a propagation technique to propagate the effect of a labeling of one block to the other blocks within the hierarchical representation. The labeling system searches for the labeling of records and elements that has the highest probability of being correct.
  • A labeling system uses a hierarchical conditional random fields (“CRF”) technique to label the object elements.
  • This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a diagram that illustrates an example web page and its corresponding vision tree.
  • FIG. 2 is a diagram that illustrates the graphical structure of the hierarchical conditional random fields of a vision tree.
  • FIG. 3 is a diagram that represents the junction tree corresponding to FIG. 2.
  • FIG. 4 is a block diagram that illustrates components of the labeling system in one embodiment.
  • FIG. 5 is a flow diagram that illustrates the processing of the label documents component of the labeling system in one embodiment.
  • FIG. 6 is a flow diagram that illustrates the processing of the generate junction tree component of the labeling system in one embodiment.
  • FIG. 7 is a flow diagram that illustrates the processing of the propagate beliefs component of the labeling system in one embodiment.
  • FIG. 8 is a flow diagram that illustrates the processing of the collect component of the labeling system in one embodiment.
  • FIG. 9 is a flow diagram that illustrates the processing of the distribute component of the labeling system in one embodiment.
  • FIG. 10 is a flow diagram that illustrates the processing of the learn parameters component of the labeling system in one embodiment.
  • DETAILED DESCRIPTION
  • A method and system for labeling object information of an information page is provided. In one embodiment, a labeling system identifies an object record of an information page, such as a web page, based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. To identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of a web page with blocks being represented as vertices of the hierarchical representation. A block may represent a collection of information of a web page that is visually related. A root block represents the entire web page, a leaf block is an atomic unit (such as an element of a record), and inner blocks represent collections of their child blocks. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block. The labeling system calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. A related block may be a block that is either a parent block or a nearest sibling block within the hierarchical representation. A collection of related blocks is referred to as a “clique.” The labeling system may define feature functions that generate scores, which are combined to give an overall score for a label for a block. A feature function may evaluate the features of a block itself, the combined features of a block and a related block, and the combined features of a block and all its related blocks. The labeling system may use a propagation technique to propagate the effect of a labeling of one block to the other blocks within the hierarchical representation. The labeling system searches for the labeling of records and elements that has the highest probability of being correct.
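  • The relationship between blocks and cliques can be illustrated with a minimal Python sketch. The `Block` class, the `enumerate_cliques` function, and the block names are hypothetical and chosen only for illustration; the sketch assumes that each block is related to its parent and to its nearest sibling, so that every parent with two adjacent children forms a triangle.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """One vertex of the hierarchical representation (hypothetical structure)."""
    name: str
    children: List["Block"] = field(default_factory=list)


def enumerate_cliques(root):
    """Return (vertices, edges, triangles) from parent and nearest-sibling relations."""
    vertices, edges, triangles = [], [], []
    stack = [root]
    while stack:
        block = stack.pop()
        vertices.append(block.name)
        kids = block.children
        for child in kids:
            edges.append((block.name, child.name))        # parent-child edge
        for left, right in zip(kids, kids[1:]):
            edges.append((left.name, right.name))         # nearest-sibling edge
            triangles.append((block.name, left.name, right.name))
        stack.extend(reversed(kids))
    return vertices, edges, triangles


# A page with two records, loosely modeled on the layout of FIG. 1.
page = Block("page", [
    Block("record_1", [Block("image_1"), Block("desc_1")]),
    Block("record_2", [Block("image_2"), Block("desc_2a"), Block("desc_2b")]),
])
vertices, edges, triangles = enumerate_cliques(page)
print(len(vertices), len(edges), len(triangles))
```

  • Each returned vertex, edge, and triangle is a candidate clique on which a feature function could be defined.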
  • In one embodiment, the labeling system uses a vision-based page segmentation (“VIPS”) technique to generate a hierarchical representation of blocks of a web page. One VIPS technique is described in Cai, D., Yu, S., Wen, J., and Ma, W., “VIPS: A Vision-Based Page Segmentation Algorithm,” Microsoft Technical Report, MSR-TR-2003-79, 2003, which is hereby incorporated by reference. A VIPS technique uses page layout features (e.g., font, color, and size) to construct a “vision tree” for a web page. The technique identifies nodes from the HTML tag tree and identifies separators (e.g., horizontal and vertical lines) between the nodes. The technique creates a vision tree that has a vertex, referred to as a block, for each identified node. The hierarchical representation of the blocks can effectively keep related blocks together while separating semantically different blocks. FIG. 1 is a diagram that illustrates an example web page and its corresponding vision tree. The web page 110 includes data records 120 and 130, which correspond to blocks 160 and 170, respectively, of vision tree 150. The vision tree includes leaf blocks 162 and 163 corresponding to image 122 and description 123 and leaf blocks 172, 174, and 175, corresponding to image 132 and descriptions 134 and 135.
  • In one embodiment, the labeling system performs a joint optimization for record identification and element (or attribute) labeling. The labeling system generates a feature vector for each block. The feature vectors are represented as X={X0, X1, . . . , XN-1} where Xi represents the feature vector for block i. The labeling system represents the labels of the blocks as the vectors Y={Y0, Y1, . . . , YN-1} where Yi represents the label for block i. The goal of the labeling system is to calculate the maximum posterior probability of Y and extract data from the assignment as represented by the following:

  • $$y^{*} = \arg\max_{y}\; p(y \mid x)$$
  • where y* represents the labeling with the highest probability for the block represented by the feature vector x. The labeling system thus provides a uniform framework for record identification and attribute labeling. As a result, records that are wrongly identified and cause attribute labeling to perform badly will have a low probability and thus not be selected as the correct labeling. Furthermore, since record identification and attribute labeling are conducted simultaneously, the labeling system can leverage the attribute labels for a better record identification.
  • In one embodiment, the labeling system uses a hierarchical conditional random fields (“CRF”) technique to label the records and elements of the vision tree representing a web page. CRFs are Markov random fields globally conditioned on the observations X. The graph G=(V, E) is an undirected graph of CRFs. According to CRFs, the conditional distribution of the labels y given the observations x has the form represented by the following:
  • $$p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c\left(c,\, y|_c,\, x\right) \qquad (1)$$
  • where C represents a set of cliques in graph G, y|c represents the components of y associated with clique c, φc represents a potential function defined on y|c, and Z is a normalization factor.
  • FIG. 2 is a diagram that illustrates the graphical structure of the hierarchical conditional random fields of a vision tree. The circles represent inner blocks and the rectangles represent leaf blocks. The observations that are globally conditioned are not shown. The labeling system assumes that every inner block contains at least two child blocks. If an inner block does not contain two child blocks, the labeling system replaces the parent block with the child block. The cliques of the graph in FIG. 2 are its vertices, edges, and triangles. The labeling system represents the conditional probability of Equation 1 as follows:
  • $$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{v,k} \mu_k\, g_k\left(v,\, y|_v,\, x\right) + \sum_{e,k} \lambda_k\, f_k\left(e,\, y|_e,\, x\right) + \sum_{t,k} \gamma_k\, h_k\left(t,\, y|_t,\, x\right) \right) \qquad (2)$$
  • where gk, fk, and hk represent feature functions defined on three types of cliques (i.e., vertex, edge, and triangle, respectively); μk, λk, and γk represent the corresponding weights; v∈V; e∈E; and t is a triangle, which is also a maximum clique. Although the feature functions generate real values, the labeling system may be implemented so that they are Boolean, that is, true if the feature matches and false otherwise. An example feature function is represented by the following:
  • $$g_k(y_i, x) = \begin{cases} \text{true}, & \text{if } y_i = \text{Name and } x \text{ is capitalized} \\ \text{false}, & \text{otherwise} \end{cases} \qquad (3)$$
  • which means that if the content of vertex x is capitalized, then the function returns a value of true when the label yi is “Name.”
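  • As an illustration of Equations 2 and 3, the following Python sketch scores every labeling of a single triangle (one parent block and two leaf blocks) and normalizes by Z(x) through exhaustive enumeration. The label set, the feature functions other than the capitalization test, the weights, and the block text are all assumptions made for this example. Enumeration is practical only for very small graphs.

```python
import itertools
import math

LABELS = ["Name", "Price", "Record"]   # assumed label set


# Hypothetical Boolean feature functions in the spirit of Equation 3.
def g_name_capitalized(label, text):
    return label == "Name" and text[:1].isupper()


def g_price_has_digit(label, text):
    return label == "Price" and any(ch.isdigit() for ch in text)


def f_name_before_price(left_label, right_label):
    return left_label == "Name" and right_label == "Price"


def h_record_over_name_price(parent_label, left_label, right_label):
    return (parent_label, left_label, right_label) == ("Record", "Name", "Price")


MU = {"name_cap": 1.0, "price_digit": 1.5}   # assumed vertex weights
LAMBDA = 0.8                                 # assumed edge weight
GAMMA = 2.0                                  # assumed triangle weight


def score(y, x):
    """Exponent of Equation 2 for one triangle: y = (parent, left leaf, right leaf)."""
    parent, left, right = y
    total = 0.0
    for label, text in zip(y, x):
        total += MU["name_cap"] * g_name_capitalized(label, text)
        total += MU["price_digit"] * g_price_has_digit(label, text)
    total += LAMBDA * f_name_before_price(left, right)     # sibling edge only
    total += GAMMA * h_record_over_name_price(parent, left, right)
    return total


def conditional(x):
    """Return p(y | x) for every labeling y, normalized by Z(x)."""
    labelings = list(itertools.product(LABELS, repeat=3))
    weights = [math.exp(score(y, x)) for y in labelings]
    z = sum(weights)
    return {y: w / z for y, w in zip(labelings, weights)}


x = ("Canon camera $299", "Canon PowerShot", "$299.00")
p = conditional(x)
best = max(p, key=p.get)        # y* = arg max p(y | x)
print(best, round(p[best], 3))
```

  • With these assumed weights, the labeling that marks the parent as a record and its leaves as a name followed by a price receives the highest probability, because the vertex, edge, and triangle features all fire for it.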
  • The labeling system determines weights for the feature functions using the training data $D=\{(y^{i},x^{i})\}_{i=1}^{N}$ with the empirical distribution $\tilde{p}(x,y)$, where N is the number of sets of labeled observations in the training data. The labeling system represents the log-likelihood of $\tilde{p}(x,y)$ with respect to a conditional model $p(y \mid x,\Theta)$ according to the following:
  • $$L(\Theta) = \sum_{x,y} \tilde{p}(x,y)\, \log p(y \mid x, \Theta) \qquad (4)$$
  • where Θ={μ1, μ2, . . . ; λ1, λ2, . . . ; γ1, γ2, . . . } represents the set of weights for the feature functions. The labeling system identifies the weights as the values that optimize the concave log-likelihood function. The labeling system may use various techniques to determine the weights. For example, the labeling system can use techniques used in other maximum-entropy models as described in Lafferty, J., McCallum, A., & Pereira, F., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in Proc. ICML, 2001. The labeling system may also use a gradient-based L-BFGS as described in Liu, D. C., & Nocedal, J., “On The Limited Memory BFGS Method for Large Scale Optimization,” Mathematical Programming 45, pp. 503-528, 1989. The gradient-based model represents each element of the gradient vector as follows:
  • $$\frac{\partial L(\Theta)}{\partial \lambda_k} = E_{\tilde{p}(x,y)}[f_k] - E_{p(y \mid x,\Theta)}[f_k] \qquad (5)$$
  • where $E_{\tilde{p}(x,y)}[f_k]$ is the expectation with respect to the empirical distribution and $E_{p(y \mid x,\Theta)}[f_k]$ is the expectation with respect to the conditional model distribution. For example, the expectations of fk are:
  • $$\begin{aligned} E_{\tilde{p}(x,y)}[f_k] &= \sum_{x,y} \tilde{p}(x,y) \sum_{e \in E} \sum_{y_i, y_j} f_k(e, y_i, y_j, x) \\ E_{p(y \mid x,\Theta)}[f_k] &= \sum_{x} \tilde{p}(x) \sum_{e \in E} \sum_{y_i, y_j} p(y_i, y_j \mid x)\, f_k(e, y_i, y_j, x) \end{aligned} \qquad (4)$$
  • where e=(i, j) is an edge.
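  • The difference of expectations that forms one element of the gradient can be sketched in Python for the smallest possible case: a graph consisting of a single edge and a single edge feature fk with weight λk. The label set, the feature, and the training pairs are assumptions made for illustration, and the pairwise marginal p(yi, yj | x) is obtained by enumeration rather than by the junction tree inference the labeling system uses.

```python
import itertools
import math
from collections import Counter

LABELS = ["Name", "Price"]   # assumed label set


def f_k(y_i, y_j, x):
    """Hypothetical Boolean edge feature: a Name block followed by a Price block."""
    return float(y_i == "Name" and y_j == "Price")


def edge_marginal(x, lam):
    """p(y_i, y_j | x) for a graph that is a single edge, by enumeration."""
    pairs = list(itertools.product(LABELS, repeat=2))
    weights = [math.exp(lam * f_k(a, b, x)) for a, b in pairs]
    z = sum(weights)
    return {pair: w / z for pair, w in zip(pairs, weights)}


def gradient(data, lam):
    """Empirical expectation of f_k minus its expectation under the model."""
    n = len(data)
    empirical = sum(f_k(y[0], y[1], x) for x, y in data) / n
    model = 0.0
    for x, count in Counter(x for x, _ in data).items():
        p_x = count / n                       # empirical p(x)
        marginal = edge_marginal(x, lam)
        model += p_x * sum(p * f_k(a, b, x) for (a, b), p in marginal.items())
    return empirical - model


# Assumed training pairs: (observations of the two blocks, their labels).
data = [
    (("Canon PowerShot", "$299.00"), ("Name", "Price")),
    (("Canon PowerShot", "$299.00"), ("Name", "Price")),
    (("Canon PowerShot", "$299.00"), ("Name", "Price")),
    (("$299.00", "Canon PowerShot"), ("Price", "Name")),
]
print(gradient(data, 0.0))
```

  • The gradient is positive when the feature fires more often in the training data than the model predicts, which pushes the weight upward, and it vanishes once the two expectations agree.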
  • The labeling system calculates the expectation for the empirical distribution once and calculates the marginal probabilities for the model distribution during each iteration while solving the optimization problem of Equation 4. Since the graph of FIG. 2 is a chordal graph, the labeling system performs inference to calculate the marginal probabilities using a junction tree algorithm. A junction tree algorithm is described in Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D., “Probabilistic Networks and Expert Systems,” Springer-Verlag, 1999. A junction tree algorithm constructs the junction tree, initializes potentials of the vertices of the junction tree, and propagates beliefs among the vertices. FIG. 3 represents the junction tree corresponding to FIG. 2. The ellipses represent cliques, and the rectangles represent separators. All the cliques have size 3, since the maximum clique in FIG. 2 is size 3. The labeling system builds the junction tree by obtaining a set of maximal elimination cliques using node elimination. The labeling system then builds a complete cluster graph with weights over cliques. The labeling system selects the spanning tree with the maximum weight as the junction tree.
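  • The construction of the junction tree can be sketched in Python as follows. The sketch makes several assumptions that are not stated above: nodes are eliminated in min-degree order, the weight of an edge of the cluster graph is the number of variables shared by the two cliques, only overlapping cliques are connected, and the spanning tree is found with Kruskal's algorithm. The function names and the example graph are hypothetical.

```python
import itertools


def build_adjacency(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def elimination_cliques(adjacency):
    """Maximal elimination cliques found by node elimination (min-degree order)."""
    graph = {v: set(ns) for v, ns in adjacency.items()}
    cliques = []
    while graph:
        node = min(sorted(graph), key=lambda v: len(graph[v]))
        neighbors = graph.pop(node)
        clique = frozenset(neighbors | {node})
        for a, b in itertools.combinations(neighbors, 2):
            graph[a].add(b)          # fill-in edge, if the pair was not yet joined
            graph[b].add(a)
        for n in neighbors:
            graph[n].discard(node)
        if not any(clique <= other for other in cliques):
            cliques.append(clique)   # keep only maximal cliques
    return cliques


def junction_tree(cliques):
    """Maximum-weight spanning tree of the cluster graph (weight = separator size)."""
    candidates = sorted(
        ((len(a & b), i, j)
         for (i, a), (j, b) in itertools.combinations(enumerate(cliques), 2)
         if a & b),
        reverse=True,
    )
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for _weight, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((cliques[i], cliques[j], cliques[i] & cliques[j]))
    return tree


# Root r with children a, b, c; block a has children a1 and a2.
edges = [("r", "a"), ("r", "b"), ("r", "c"), ("a", "b"), ("b", "c"),
         ("a", "a1"), ("a", "a2"), ("a1", "a2")]
cliques = elimination_cliques(build_adjacency(edges))
tree = junction_tree(cliques)
for left, right, separator in tree:
    print(sorted(left), sorted(right), sorted(separator))
```

  • For this example graph every clique is a triangle and each tree edge carries the separator shared by the two cliques it joins.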
  • After the junction tree has been constructed, the labeling system initializes all the potentials of the junction tree to have a value of 1 and multiplies the potential of a vertex, an edge, or a triangle into the potential of any one clique node of the junction tree T that covers its variables. The potential of a vertex v, an edge e, and a triangle t is represented by the following:
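  • $$\phi_v\left(y|_v, x\right) = \exp\left( \sum_{k} \mu_k\, g_k\left(v, y|_v, x\right) \right), \quad \phi_e\left(y|_e, x\right) = \exp\left( \sum_{k} \lambda_k\, f_k\left(e, y|_e, x\right) \right), \quad \text{and} \quad \phi_t\left(y|_t, x\right) = \exp\left( \sum_{k} \gamma_k\, h_k\left(t, y|_t, x\right) \right). \qquad (5)$$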
  • The labeling system in one embodiment uses a two-phase schedule algorithm to propagate beliefs within the junction tree. A two-phase schedule algorithm is described in Jensen, F., Lauritzen, S., and Olesen, K., “Bayesian Updating in Causal Probabilistic Networks by Local Computations,” Computational Statistics Quarterly, 4:269-82, 1990. The algorithm uses a collection and distribution phase to calculate the potentials for the cliques and separators. One skilled in the art will appreciate that the labeling system can use other message passing techniques to propagate beliefs. Upon completion of the distribution phase, the potentials represent marginal potentials that are used by the labeling system to guide finding the solution for the weights that best match the training data.
  • After learning the weights, the labeling system uses the weights to find labels for the blocks of web pages. The labeling system uses the VIPS technique, the junction tree algorithm, and a modified two-phase schedule algorithm to find the best labeling. The labeling system generates a vision tree from the web page and generates a junction tree. The labeling system modifies the two-phase schedule algorithm by replacing its summations with maximizations. The best labeling for a block is found from the potential of any clique that contains the block.
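  • The effect of replacing summations with maximizations can be sketched in Python on a simplified structure. The sketch below assumes a chain of three blocks with two labels instead of a junction tree, and its potentials are arbitrary illustrative numbers. Messages carry the best score of a partial labeling rather than a sum, and back-pointers recover the best labeling, which is checked against exhaustive enumeration.

```python
import itertools

LABELS = [0, 1]   # assumed two-label problem


def max_product_chain(unary, pairwise):
    """Best labeling of a chain; pairwise[i][a][b] links node i (a) to node i+1 (b)."""
    best = [list(unary[0])]   # best[i][b]: best score of a prefix ending in label b
    back = []
    for i in range(1, len(unary)):
        scores, pointers = [], []
        for b in LABELS:
            options = [best[-1][a] * pairwise[i - 1][a][b] for a in LABELS]
            a_star = max(LABELS, key=lambda a: options[a])   # max replaces sum
            scores.append(options[a_star] * unary[i][b])
            pointers.append(a_star)
        best.append(scores)
        back.append(pointers)
    last = max(LABELS, key=lambda b: best[-1][b])
    labeling = [last]
    for pointers in reversed(back):
        labeling.append(pointers[labeling[-1]])
    return list(reversed(labeling)), best[-1][last]


def brute_force(unary, pairwise):
    """Reference answer by enumerating every labeling."""
    def weight(y):
        w = 1.0
        for i, label in enumerate(y):
            w *= unary[i][label]
        for i in range(len(y) - 1):
            w *= pairwise[i][y[i]][y[i + 1]]
        return w
    best = max(itertools.product(LABELS, repeat=len(unary)), key=weight)
    return list(best), weight(best)


unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
pairwise = [[[2.0, 1.0], [1.0, 3.0]],
            [[1.0, 2.0], [4.0, 1.0]]]
print(max_product_chain(unary, pairwise))
print(brute_force(unary, pairwise))
```

  • The same substitution applies to messages passed between cliques of a junction tree; the chain is used here only to keep the example short.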
  • FIG. 4 is a block diagram that illustrates components of the labeling system in one embodiment. The labeling system 410 is connected to web sites 420 via communications link 430. The labeling system includes a document store 411 and a training data store 412. The document store contains web pages that may be collected from the various web sites. The training data store contains vision trees generated from the web pages along with the correct labeling of the records and elements of the web pages. The labeling system also includes a learn parameters component 413 and a label documents component 414. The learn parameters component inputs the training data and generates the weights for the feature functions. The label documents component inputs a web page and identifies the correct labeling for the blocks within the web page. The labeling system also includes auxiliary components such as a generate junction tree component 415, a propagate beliefs component 416, a collect component 417, and a distribute component 418 as described below in detail.
  • The computing devices on which the labeling system may be implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may contain instructions that implement the labeling system. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection.
  • The labeling system may be used in various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • The labeling system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
  • FIG. 5 is a flow diagram that illustrates the processing of the label documents component of the labeling system in one embodiment. The label documents component is passed a web page and returns an assignment of labels for the blocks of the web page. In block 501, the component generates a vision tree for the web page. In block 502, the component invokes the generate junction tree component to generate a junction tree based on the vision tree. In block 503, the component invokes the propagate beliefs component to propagate the beliefs for the assignments of the labels to the vertices of the junction tree. In block 504, the component assigns the labels with the highest probability to the blocks of the web page and then completes.
  • FIG. 6 is a flow diagram that illustrates the processing of the generate junction tree component of the labeling system in one embodiment. The component is passed a vision tree and generates a junction tree as described in Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D., “Probabilistic Networks and Expert Systems,” Springer-Verlag, 1999. In block 601, the component orders the nodes of the vision tree. In block 602, the component identifies the elimination cliques using node elimination. In block 603, the component generates a cluster graph from the elimination cliques. In block 604, the component adds weights to the edges based on the probabilities. In block 605, the component identifies a spanning tree with the maximum weight. The identified spanning tree is the junction tree. The component then returns.
  • FIG. 7 is a flow diagram that illustrates the processing of the propagate beliefs component of the labeling system in one embodiment. The component implements the two-phase schedule algorithm by invoking the collect component in block 701 passing the root node of the junction tree and then invoking the distribute component in block 702 passing the root node of the junction tree. The collect component and the distribute component are recursive routines that collect and distribute the potentials of the nodes of the junction tree. The component then returns.
  • FIG. 8 is a flow diagram that illustrates the processing of the collect component of the labeling system in one embodiment. The component recursively invokes itself for each child clique of the passed clique to collect the potentials from the child cliques. In block 801, the component calculates the potential of the passed clique. In blocks 802-805, the component loops selecting each child clique of the passed clique. In block 802, the component selects the next child clique. In decision block 803, if all the child cliques have already been selected, then the component returns the accumulated potential, else the component continues at block 804. In block 804, the component invokes the collect component recursively passing the selected child clique. In block 805, the component accumulates the product of the potential of the passed clique with the potential provided by the selected child clique and then loops to block 802 to select the next child clique.
  • FIG. 9 is a flow diagram that illustrates the processing of the distribute component of the labeling system in one embodiment. The component is passed a clique along with a potential to be distributed to that clique. In block 901, the component calculates a new potential for the clique factoring in the passed potential. In blocks 902-904, the component loops selecting each child clique and recursively invoking itself. In block 902, the component selects the next child clique. In decision block 903, if all the child cliques have already been selected, then the component returns, else the component continues at block 904. In block 904, the component recursively invokes the distribute component passing the child clique and then loops to block 902 to select the next child clique.
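  • The collect and distribute routines can be sketched in Python as a pair of recursive functions over a tree of cliques. This is a minimal sketch under stated assumptions: the `Clique` class and helper names are hypothetical, labels are binary, potentials are strictly positive tables, and the update rule is the common form in which a child's separator marginal is multiplied into its parent during collection and the ratio of new to old separator marginals is multiplied into the child during distribution.

```python
import itertools

LABELS = (0, 1)   # assumed binary labels


class Clique:
    def __init__(self, variables, potential, children=()):
        self.variables = tuple(variables)
        self.potential = dict(potential)    # assignment tuple -> value
        self.children = list(children)
        self.separator = {}                 # marginal last sent toward the parent


def table(variables, fn):
    return {y: fn(*y) for y in itertools.product(LABELS, repeat=len(variables))}


def shared(a, b):
    return tuple(v for v in a.variables if v in b.variables)


def marginalize(clique, keep):
    """Sum the clique potential down to the variables in `keep`."""
    idx = [clique.variables.index(v) for v in keep]
    out = {}
    for assignment, value in clique.potential.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def absorb(clique, keep, factor):
    """Multiply a table over `keep` into the clique potential."""
    idx = [clique.variables.index(v) for v in keep]
    for assignment in clique.potential:
        key = tuple(assignment[i] for i in idx)
        clique.potential[assignment] *= factor[key]


def collect(clique):
    """Pull potentials from the child cliques up into `clique`."""
    for child in clique.children:
        collect(child)
        sep = shared(child, clique)
        child.separator = marginalize(child, sep)
        absorb(clique, sep, child.separator)


def distribute(clique):
    """Push the updated potential of `clique` down to its child cliques."""
    for child in clique.children:
        sep = shared(child, clique)
        new_sep = marginalize(clique, sep)
        ratio = {k: new_sep[k] / child.separator[k] for k in new_sep}
        absorb(child, sep, ratio)
        child.separator = new_sep
        distribute(child)


child = Clique(("a", "a1", "a2"),
               table(("a", "a1", "a2"), lambda a, a1, a2: 1.0 + a + 2 * a1 * a2))
root = Clique(("r", "a", "b"),
              table(("r", "a", "b"), lambda r, a, b: 1.0 + r * a + b), [child])
collect(root)      # first phase, started at the root
distribute(root)   # second phase, started at the root
print(sum(root.potential.values()), sum(child.potential.values()))
```

  • After both phases, each clique potential is proportional to the marginal over that clique's variables, so every clique sums to the same normalization constant.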
  • FIG. 10 is a flow diagram that illustrates the processing of the learn parameters component of the labeling system in one embodiment. The learn parameters component inputs the training data and learns the weights for the feature functions. In block 1001, the component inputs the training data. In block 1002, the component calculates the expectations based on the training data. In block 1003, the component generates a junction tree for each web page of the training data by invoking the generate junction tree component for each web page. In block 1004, the component initializes potentials of the junction trees. In blocks 1005-1010, the component loops selecting new weights until the weights converge on a solution. In block 1005, the component selects the next junction tree. In decision block 1006, if all the junction trees have already been selected, then the component continues at block 1008, else the component continues at block 1007. In block 1007, the component invokes the propagate beliefs component to propagate the beliefs for the selected junction tree and then loops to block 1005 to select the next junction tree. In block 1008, the component calculates a differential between the expectation based on the propagated beliefs and the expectation based on the training data. In decision block 1009, if the differential approaches zero, then the component returns with a solution, else the component continues at block 1010. In block 1010, the component adjusts the weights of the feature functions in the direction of the minimum gradient descent and then loops to block 1005 to propagate beliefs of the junction tree with the new weights.
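  • The loop of FIG. 10 can be sketched in Python for a deliberately reduced model that has only vertex cliques, so that inference reduces to normalizing over the labels of one block. The features, the training pairs, the fixed step size, and the tolerance are assumptions made for illustration, and a plain gradient step stands in for the L-BFGS technique mentioned earlier.

```python
import math

LABELS = ["Name", "Price"]   # assumed label set


def features(label, text):
    """Two hypothetical Boolean vertex features, returned as 0.0 or 1.0."""
    return [
        float(label == "Name" and text[:1].isupper()),
        float(label == "Price" and any(ch.isdigit() for ch in text)),
    ]


def model_distribution(text, weights):
    """p(label | text) for a single block under the current weights."""
    scores = [math.exp(sum(w * f for w, f in zip(weights, features(label, text))))
              for label in LABELS]
    z = sum(scores)
    return [s / z for s in scores]


def learn(data, step=0.5, tolerance=1e-6, max_iterations=10000):
    n, k = len(data), 2
    # Block 1002: the expectation based on the training data is computed once.
    empirical = [sum(features(y, x)[j] for x, y in data) / n for j in range(k)]
    weights = [0.0] * k
    differential = list(empirical)
    for _ in range(max_iterations):
        # Blocks 1005-1007: expectation under the model with the current weights.
        model = [0.0] * k
        for x, _y in data:
            for label, p in zip(LABELS, model_distribution(x, weights)):
                for j, f in enumerate(features(label, x)):
                    model[j] += p * f / n
        # Block 1008: differential between the two expectations.
        differential = [e - m for e, m in zip(empirical, model)]
        # Block 1009: stop when the differential approaches zero.
        if max(abs(d) for d in differential) < tolerance:
            break
        # Block 1010: adjust the weights along the gradient.
        weights = [w + step * d for w, d in zip(weights, differential)]
    return weights, differential


data = [("Canon", "Name"), ("Sony", "Name"), ("Nikon D50", "Name"),
        ("Sale 20", "Price"), ("$299", "Price"), ("1999 Model", "Name")]
weights, differential = learn(data)
print([round(w, 3) for w in weights])
```

  • Because the log-likelihood is concave, a sufficiently small fixed step moves the weights toward the single optimum; the loop ends when the model's expected feature values match those observed in the training data.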
  • Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. In particular, the two-dimensional CRF technique may be used to label any type of observations that have a two-dimensional relationship. Accordingly, the invention is not limited except as by the appended claims.

Claims (21)

1-20. (canceled)
21. A method performed by a computing device with a processor and memory for labeling observations, the method comprising:
receiving observations having hierarchical relationships represented by a graph having vertices representing observations and edges representing relationships, a collection of related vertices being a clique, a clique being a subset of vertices of the graph in which each pair of distinct vertices in the subset is joined by an edge;
storing the received observations in the memory;
determining by the computing device a labeling for the observations using a conditional random fields technique that factors in the hierarchical relationships, a conditional probability of a label for a given observation being based on feature functions for a vertex clique, an edge clique, and a triangle clique for the label; and
storing by the computing device the labeling for the observations.
22. The method of claim 21 wherein the observations are represented as a tree of observation vertices and the determining includes identifying a hierarchy of cliques of observation vertices within the tree and calculating a probability for sets of labels based on probabilities derived from features of components of the cliques that contain the observation vertices.
23. The method of claim 22 wherein the components of a clique include the edges and vertices of the clique.
24. The method of claim 22 wherein the calculating of the probability for a set of labels includes generating a junction tree of the cliques and propagating a belief to the cliques of the junction tree.
25. The method of claim 24 wherein the beliefs are propagated using a collection phase and a distribution phase.
26. The method of claim 21 including deriving weights for feature functions based on training data and wherein the determining includes calculating a probability for a set of labels based on the training data.
27. The method of claim 26 wherein the deriving includes optimizing a log-likelihood function based on the training data.
28. The method of claim 26 wherein the optimizing uses a gradient-based L-BFGS technique.
29. The method of claim 21 wherein the determining of the labeling includes propagating probability-related calculations from observation to observation.
30. A computer-readable storage medium containing instructions for controlling a computing device to identify object records and object elements of a web page, by a method comprising:
receiving a hierarchical representation of blocks of the web page, each block representing an object record or an object element, the blocks represented by observations having hierarchical relationships represented by a graph having vertices representing observations and edges representing relationships, a collection of related vertices being a clique, a clique being a subset of vertices of the graph in which each pair of distinct vertices in the subset is joined by an edge; and
applying a hierarchical conditional random fields technique to jointly identify a set of record labels and element labels for the blocks based on the hierarchical relationship of the blocks of the web page, the applying including identifying the labels uses a conditional random fields technique that factors in the hierarchical relationships, a conditional probability of a label for a given observation being based on feature functions for a vertex clique, an edge clique, and a triangle clique for the label.
31. The computer-readable storage medium of claim 30 wherein the observations are represented as a tree of observation vertices and the identifying includes identifying a hierarchy of cliques of observation vertices within the tree and calculating a probability for sets of labels based on probabilities derived from features of components of the cliques that contain the observation vertices.
32. The computer-readable storage medium of claim 31 wherein the calculating of the probability for a set of labels includes generating a junction tree of the cliques and propagating a belief to the cliques of the junction tree.
33. The computer-readable storage medium of claim 32 wherein the beliefs are propagated using a collection phase and a distribution phase.
34. The computer-readable storage medium of claim 30 including deriving weights for feature functions based on training data and wherein the identifying includes calculating a probability for a set of labels based on the training data.
35. The computer-readable storage medium of claim 30 wherein the identifying of the labeling includes propagating probability-related calculations from observation to observation.
36. A computing device for labeling observations, comprising:
a memory storing computer-executable instructions that:
receive observations having hierarchical relationships represented by a graph having vertices representing observations and edges representing relationships, a collection of related vertices being a clique;
determine a labeling for the observations using a conditional random fields technique that factors in the hierarchical relationships, a conditional probability of a label for a given observation being based on feature functions for a vertex clique, an edge clique, and a triangle clique for the label; and
a processor for executing the computer-executable instructions stored in the memory.
37. The computing device of claim 36 wherein the observations are represented as a tree of observation vertices and the determination includes identification of a hierarchy of cliques of observation vertices within the tree and calculation of a probability for sets of labels based on probabilities derived from features of components of the cliques that contain the observation vertices.
38. The computing device of claim 37 wherein the components of a clique include the edges and vertices of the clique.
39. The computing device of claim 36 wherein a clique is a subset of vertices of the graph in which each pair of distinct vertices in the subset is joined by an edge.
40. The computing device of claim 36 including deriving weights for feature functions based on training data and wherein the determination includes calculation of a probability for a set of labels based on the training data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/776,308 US20100281009A1 (en) 2006-07-31 2010-05-07 Hierarchical conditional random fields for web extraction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/461,400 US7720830B2 (en) 2006-07-31 2006-07-31 Hierarchical conditional random fields for web extraction
US12/776,308 US20100281009A1 (en) 2006-07-31 2010-05-07 Hierarchical conditional random fields for web extraction

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/461,400 Continuation US7720830B2 (en) 2006-07-31 2006-07-31 Hierarchical conditional random fields for web extraction

Publications (1)

Publication Number Publication Date
US20100281009A1 true US20100281009A1 (en) 2010-11-04

Family

ID=38987632

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/461,400 Expired - Fee Related US7720830B2 (en) 2006-07-31 2006-07-31 Hierarchical conditional random fields for web extraction
US12/776,308 Abandoned US20100281009A1 (en) 2006-07-31 2010-05-07 Hierarchical conditional random fields for web extraction

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/461,400 Expired - Fee Related US7720830B2 (en) 2006-07-31 2006-07-31 Hierarchical conditional random fields for web extraction

Country Status (1)

Country Link
US (2) US7720830B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080027910A1 (en) * 2006-07-25 2008-01-31 Microsoft Corporation Web object retrieval based on a language model
US20090319533A1 (en) * 2008-06-23 2009-12-24 Ashwin Tengli Assigning Human-Understandable Labels to Web Pages
US20130103618A1 (en) * 2011-10-24 2013-04-25 Oracle International Corporation Decision making with analytically combined split conditions
US8873812B2 (en) 2012-08-06 2014-10-28 Xerox Corporation Image segmentation using hierarchical unsupervised segmentation and hierarchical classifiers
US9164983B2 (en) 2011-05-27 2015-10-20 Robert Bosch Gmbh Broad-coverage normalization system for social media language
US11372934B2 (en) 2019-04-18 2022-06-28 Capital One Services, Llc Identifying web elements based on user browsing activity and machine learning

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8037102B2 (en) 2004-02-09 2011-10-11 Robert T. and Virginia T. Jenkins Manipulating sets of hierarchical data
US7606793B2 (en) 2004-09-27 2009-10-20 Microsoft Corporation System and method for scoping searches using index keys
US7761448B2 (en) * 2004-09-30 2010-07-20 Microsoft Corporation System and method for ranking search results using click distance
US7801923B2 (en) 2004-10-29 2010-09-21 Robert T. and Virginia T. Jenkins as Trustees of the Jenkins Family Trust Method and/or system for tagging trees
US7627591B2 (en) 2004-10-29 2009-12-01 Skyler Technology, Inc. Method and/or system for manipulating tree expressions
US7630995B2 (en) 2004-11-30 2009-12-08 Skyler Technology, Inc. Method and/or system for transmitting and/or receiving data
US7636727B2 (en) 2004-12-06 2009-12-22 Skyler Technology, Inc. Enumeration of trees from finite number of nodes
US8316059B1 (en) 2004-12-30 2012-11-20 Robert T. and Virginia T. Jenkins Enumeration of rooted partial subtrees
US8615530B1 (en) 2005-01-31 2013-12-24 Robert T. and Virginia T. Jenkins as Trustees for the Jenkins Family Trust Method and/or system for tree transformation
US7681177B2 (en) * 2005-02-28 2010-03-16 Skyler Technology, Inc. Method and/or system for transforming between trees and strings
US7899821B1 (en) 2005-04-29 2011-03-01 Karl Schiffmann Manipulation and/or analysis of hierarchical data
US7921106B2 (en) * 2006-08-03 2011-04-05 Microsoft Corporation Group-by attribute value in search results
US9092434B2 (en) * 2007-01-23 2015-07-28 Symantec Corporation Systems and methods for tagging emails by discussions
US9348912B2 (en) * 2007-10-18 2016-05-24 Microsoft Technology Licensing, Llc Document length as a static relevance feature for ranking search results
WO2009094672A2 (en) * 2008-01-25 2009-07-30 Trustees Of Columbia University In The City Of New York Belief propagation for generalized matching
US8812493B2 (en) 2008-04-11 2014-08-19 Microsoft Corporation Search results ranking using editing distance and document information
US9342589B2 (en) 2008-07-30 2016-05-17 Nec Corporation Data classifier system, data classifier method and data classifier program stored on storage medium
US9361367B2 (en) * 2008-07-30 2016-06-07 Nec Corporation Data classifier system, data classifier method and data classifier program
US8285719B1 (en) * 2008-08-08 2012-10-09 The Research Foundation Of State University Of New York System and method for probabilistic relational clustering
EP2377080A4 (en) 2008-12-12 2014-01-08 Univ Columbia Machine optimization devices, methods, and systems
US20100223214A1 (en) * 2009-02-27 2010-09-02 Kirpal Alok S Automatic extraction using machine learning based robust structural extractors
US20100257440A1 (en) * 2009-04-01 2010-10-07 Meghana Kshirsagar High precision web extraction using site knowledge
WO2010135586A1 (en) 2009-05-20 2010-11-25 The Trustees Of Columbia University In The City Of New York Systems devices and methods for estimating
US9092424B2 (en) * 2009-09-30 2015-07-28 Microsoft Technology Licensing, Llc Webpage entity extraction through joint understanding of page structures and sentences
US20110106807A1 (en) * 2009-10-30 2011-05-05 Janya, Inc Systems and methods for information integration through context-based entity disambiguation
US8738635B2 (en) 2010-06-01 2014-05-27 Microsoft Corporation Detection of junk in search result ranking
US20120005207A1 (en) * 2010-07-01 2012-01-05 Yahoo! Inc. Method and system for web extraction
US8767580B2 (en) * 2011-03-08 2014-07-01 Nec Laboratories America, Inc. Femtocell resource management for interference mitigation
US8832000B2 (en) 2011-06-07 2014-09-09 The Trustees Of Columbia University In The City Of New York Systems, device, and methods for parameter optimization
US20120330880A1 (en) * 2011-06-23 2012-12-27 Microsoft Corporation Synthetic data generation
US9082082B2 (en) 2011-12-06 2015-07-14 The Trustees Of Columbia University In The City Of New York Network information methods devices and systems
US9495462B2 (en) 2012-01-27 2016-11-15 Microsoft Technology Licensing, Llc Re-ranking search results
US10535014B2 (en) 2014-03-10 2020-01-14 California Institute Of Technology Alternative training distribution data in machine learning
US9858534B2 (en) 2013-11-22 2018-01-02 California Institute Of Technology Weight generation in machine learning
US9953271B2 (en) 2013-11-22 2018-04-24 California Institute Of Technology Generation of weights in machine learning
US10558935B2 (en) * 2013-11-22 2020-02-11 California Institute Of Technology Weight benefit evaluator for training data
EP3511868A1 (en) * 2018-01-11 2019-07-17 Onfido Ltd Document authenticity determination
US11227065B2 (en) 2018-11-06 2022-01-18 Microsoft Technology Licensing, Llc Static data masking

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5594911A (en) * 1994-07-13 1997-01-14 Bell Communications Research, Inc. System and method for preprocessing and delivering multimedia presentations
US6148349A (en) * 1998-02-06 2000-11-14 Ncr Corporation Dynamic and consistent naming of fabric attached storage by a file system on a compute node storing information mapping API system I/O calls for data objects with a globally unique identification
US6266664B1 (en) * 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
US6493706B1 (en) * 1999-10-26 2002-12-10 Cisco Technology, Inc. Arrangement for enhancing weighted element searches in dynamically balanced trees
US6539395B1 (en) * 2000-03-22 2003-03-25 Mood Logic, Inc. Method for creating a database for comparing music
US6549896B1 (en) * 2000-04-07 2003-04-15 Nec Usa, Inc. System and method employing random walks for mining web page associations and usage to optimize user-oriented web page refresh and pre-fetch scheduling
US20030220906A1 (en) * 2002-05-16 2003-11-27 Chickering David Maxwell System and method of employing efficient operators for bayesian network search
US6684222B1 (en) * 2000-11-09 2004-01-27 Accenture Llp Method and system for translating data associated with a relational database
US20040034652A1 (en) * 2000-07-26 2004-02-19 Thomas Hofmann System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models
US20040080549A1 (en) * 2000-05-31 2004-04-29 Lord Robert William Data referencing within a database graph
US7003516B2 (en) * 2002-07-03 2006-02-21 Word Data Corp. Text representation and method
US20060080353A1 (en) * 2001-01-11 2006-04-13 Vladimir Miloushev Directory aggregation for files distributed over a plurality of servers in a switched file system
US20060101060A1 (en) * 2004-11-08 2006-05-11 Kai Li Similarity search system with compact data structures
US20060115145A1 (en) * 2004-11-30 2006-06-01 Microsoft Corporation Bayesian conditional random fields
US7058913B1 (en) * 2001-09-06 2006-06-06 Cadence Design Systems, Inc. Analytical placement method and apparatus
US20060147114A1 (en) * 2003-06-12 2006-07-06 Kaus Michael R Image segmentation in time-series images
US20060167928A1 (en) * 2005-01-27 2006-07-27 Amit Chakraborty Method for querying XML documents using a weighted navigational index
US7231388B2 (en) * 2001-11-29 2007-06-12 Hitachi, Ltd. Similar document retrieving method and system
US20070156617A1 (en) * 2005-12-29 2007-07-05 Microsoft Corporation Partitioning data elements

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6076088A (en) * 1996-02-09 2000-06-13 Paik; Woojin Information extraction system and method using concept relation concept (CRC) triples
WO2000057311A2 (en) 1999-03-23 2000-09-28 Quansoo Group, Inc. Method and system for manipulating data from multiple sources
US6304864B1 (en) * 1999-04-20 2001-10-16 Textwise Llc System for retrieving multimedia information from the internet using multiple evolving intelligent agents
WO2000073942A2 (en) 1999-05-27 2000-12-07 Mobile Engines, Inc. Intelligent agent parallel search and comparison engine
US6418434B1 (en) * 1999-06-25 2002-07-09 International Business Machines Corporation Two stage automated electronic messaging system
US6631369B1 (en) * 1999-06-30 2003-10-07 Microsoft Corporation Method and system for incremental web crawling
US6353825B1 (en) * 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques
US6665665B1 (en) * 1999-07-30 2003-12-16 Verizon Laboratories Inc. Compressed document surrogates
US6418448B1 (en) * 1999-12-06 2002-07-09 Shyam Sundar Sarkar Method and apparatus for processing markup language specifications for data and metadata used inside multiple related internet documents to navigate, query and manipulate information from a plurality of object relational databases over the web
US6519580B1 (en) * 2000-06-08 2003-02-11 International Business Machines Corporation Decision-tree-based symbolic rule induction system for text categorization
WO2002048866A2 (en) * 2000-12-11 2002-06-20 Microsoft Corporation Method and system for management of multiple network resources
WO2003005235A1 (en) * 2001-07-04 2003-01-16 Cogisum Intermedia Ag Category based, extensible and interactive system for document retrieval
US6965903B1 (en) * 2002-05-07 2005-11-15 Oracle International Corporation Techniques for managing hierarchical data with link attributes in a relational database
US7231395B2 (en) * 2002-05-24 2007-06-12 Overture Services, Inc. Method and apparatus for categorizing and presenting documents of a distributed database
US7305129B2 (en) * 2003-01-29 2007-12-04 Microsoft Corporation Methods and apparatus for populating electronic forms from scanned documents
JP2006048536A (en) * 2004-08-06 2006-02-16 Canon Inc Information processor, document retrieval method, program and storage medium
US20060074881A1 (en) * 2004-10-02 2006-04-06 Adventnet, Inc. Structure independent searching in disparate databases
US7512273B2 (en) * 2004-10-21 2009-03-31 Microsoft Corporation Digital ink labeling
US7685197B2 (en) * 2005-05-05 2010-03-23 Yahoo! Inc. System and methods for indentifying the potential advertising value of terms found on web pages
US7529761B2 (en) * 2005-12-14 2009-05-05 Microsoft Corporation Two-dimensional conditional random fields for web extraction
US8001130B2 (en) * 2006-07-25 2011-08-16 Microsoft Corporation Web object retrieval based on a language model

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5594911A (en) * 1994-07-13 1997-01-14 Bell Communications Research, Inc. System and method for preprocessing and delivering multimedia presentations
US6266664B1 (en) * 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
US6148349A (en) * 1998-02-06 2000-11-14 Ncr Corporation Dynamic and consistent naming of fabric attached storage by a file system on a compute node storing information mapping API system I/O calls for data objects with a globally unique identification
US6493706B1 (en) * 1999-10-26 2002-12-10 Cisco Technology, Inc. Arrangement for enhancing weighted element searches in dynamically balanced trees
US6539395B1 (en) * 2000-03-22 2003-03-25 Mood Logic, Inc. Method for creating a database for comparing music
US6549896B1 (en) * 2000-04-07 2003-04-15 Nec Usa, Inc. System and method employing random walks for mining web page associations and usage to optimize user-oriented web page refresh and pre-fetch scheduling
US20040080549A1 (en) * 2000-05-31 2004-04-29 Lord Robert William Data referencing within a database graph
US20040034652A1 (en) * 2000-07-26 2004-02-19 Thomas Hofmann System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models
US6684222B1 (en) * 2000-11-09 2004-01-27 Accenture Llp Method and system for translating data associated with a relational database
US20060080353A1 (en) * 2001-01-11 2006-04-13 Vladimir Miloushev Directory aggregation for files distributed over a plurality of servers in a switched file system
US7058913B1 (en) * 2001-09-06 2006-06-06 Cadence Design Systems, Inc. Analytical placement method and apparatus
US7231388B2 (en) * 2001-11-29 2007-06-12 Hitachi, Ltd. Similar document retrieving method and system
US20030220906A1 (en) * 2002-05-16 2003-11-27 Chickering David Maxwell System and method of employing efficient operators for bayesian network search
US7003516B2 (en) * 2002-07-03 2006-02-21 Word Data Corp. Text representation and method
US20060147114A1 (en) * 2003-06-12 2006-07-06 Kaus Michael R Image segmentation in time-series images
US20060101060A1 (en) * 2004-11-08 2006-05-11 Kai Li Similarity search system with compact data structures
US20060115145A1 (en) * 2004-11-30 2006-06-01 Microsoft Corporation Bayesian conditional random fields
US20060167928A1 (en) * 2005-01-27 2006-07-27 Amit Chakraborty Method for querying XML documents using a weighted navigational index
US20070156617A1 (en) * 2005-12-29 2007-07-05 Microsoft Corporation Partitioning data elements

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deng Cai; VIPS: a Vision-based Page Segmentation Algorithm; 2003; Microsoft; Pgs 1-29 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080027910A1 (en) * 2006-07-25 2008-01-31 Microsoft Corporation Web object retrieval based on a language model
US8001130B2 (en) 2006-07-25 2011-08-16 Microsoft Corporation Web object retrieval based on a language model
US20090319533A1 (en) * 2008-06-23 2009-12-24 Ashwin Tengli Assigning Human-Understandable Labels to Web Pages
US8185528B2 (en) * 2008-06-23 2012-05-22 Yahoo! Inc. Assigning human-understandable labels to web pages
US9164983B2 (en) 2011-05-27 2015-10-20 Robert Bosch Gmbh Broad-coverage normalization system for social media language
US20130103618A1 (en) * 2011-10-24 2013-04-25 Oracle International Corporation Decision making with analytically combined split conditions
US8868473B2 (en) * 2011-10-24 2014-10-21 Oracle International Corporation Decision making with analytically combined split conditions
US8873812B2 (en) 2012-08-06 2014-10-28 Xerox Corporation Image segmentation using hierarchical unsupervised segmentation and hierarchical classifiers
US11372934B2 (en) 2019-04-18 2022-06-28 Capital One Services, Llc Identifying web elements based on user browsing activity and machine learning
US11874884B2 (en) 2019-04-18 2024-01-16 Capital One Services, Llc Identifying web elements based on user browsing activity and machine learning

Also Published As

Publication number Publication date
US7720830B2 (en) 2010-05-18
US20080027969A1 (en) 2008-01-31

Similar Documents

Publication Publication Date Title
US7720830B2 (en) Hierarchical conditional random fields for web extraction
US7383254B2 (en) Method and system for identifying object information
US10762283B2 (en) Multimedia document summarization
CA3052527C (en) Target document template generation
US7853596B2 (en) Mining geographic knowledge using a location aware topic model
US8245135B2 (en) Producing a visual summarization of text documents
US8612364B2 (en) Method for categorizing linked documents by co-trained label expansion
US7720773B2 (en) Partitioning data elements of a visual display of a tree using weights obtained during the training state and a maximum a posteriori solution for optimum labeling and probability
US9390165B2 (en) Summarization of short comments
US20130246456A1 (en) Publishing Product Information
US7529761B2 (en) Two-dimensional conditional random fields for web extraction
US20130159277A1 (en) Target based indexing of micro-blog content
US20130246442A1 (en) System for requirement identification and analysis based on capability model structure
US11238225B2 (en) Reading difficulty level based resource recommendation
US8676566B2 (en) Method of extracting experience sentence and classifying verb in blog
Giabelli et al. NEO: A tool for taxonomy enrichment with new emerging occupations
Shaikh et al. Bloom’s learning outcomes’ automatic classification using lstm and pretrained word embeddings
CN110472155A (en) Collaborative recommendation method, device, equipment and the storage medium of knowledge based map
US7783588B2 (en) Context modeling architecture and framework
Jensen et al. Semi-supervised fuzzy-rough feature selection
CN116521892A (en) Knowledge graph application method, knowledge graph application device, electronic equipment, medium and program product
CN116383354A (en) Automatic visual question-answering method based on knowledge graph
Zubrinic et al. Comparison of Naïve Bayes and SVM Classifiers in Categorization of Concept Maps
Kang et al. ExpFinder: An Ensemble Expert Finding Model Integrating N-gram Vector Space Model and μCO-HITS
Robinson Disaster tweet classification using parts-of-speech tags: a domain adaptation approach

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0001

Effective date: 20141014

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION