WO2013043159A1 - Document analysis - Google Patents

Document analysis

Info

Publication number
WO2013043159A1
WO2013043159A1 (PCT/US2011/052374)
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
document
graph
meaningfulness
text features
Prior art date
Application number
PCT/US2011/052374
Other languages
French (fr)
Inventor
Helen Y. BALINSKY
Alexander BALINSKY
Steven J. Simske
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2011/052374
Publication of WO2013043159A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling

Definitions

  • In some previous approaches, co-occurrence graphs are constructed by selecting words as nodes, and edges are introduced between two words based on the appearance of the two words in the same sentence.
  • In contrast, various examples of the present disclosure utilize graphs built with text features other than single words as the nodes. The set of edges depends on the meaningfulness parameter (ε), which reflects a level of meaningfulness of the relationship between the text features, thus forming a one-parameter family of graphs.
  • A set of meaningful words in a paragraph P can be defined as the words with Meaning(w, P, D) > ε.
  • MeaningfulSet(ε) can be defined as the set of all words with Meaning(w, P, D) > ε for at least one paragraph P.
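  • For illustration, the selection rule above can be sketched as follows, assuming per-(word, paragraph) meaning scores Meaning(w, P, D) have already been computed; the Helmholtz computation of those scores is not reproduced in this excerpt, and the names `meaning_scores` and `meaningful_set` are hypothetical, used only for this sketch:

```python
# Minimal sketch of MeaningfulSet(epsilon), assuming precomputed scores of the
# form {(word, paragraph_index): Meaning(w, P, D)}. Illustrative only.

def meaningful_set(meaning_scores, epsilon):
    """Words whose meaning score exceeds epsilon in at least one paragraph."""
    return {word for (word, _p), score in meaning_scores.items() if score > epsilon}

# Toy scores (invented values, purely for demonstration):
scores = {("network", 0): 2.3, ("network", 4): 0.7, ("the", 1): -5.1, ("graph", 2): 1.4}
print(meaningful_set(scores, epsilon=1.0))   # contains 'network' and 'graph'
print(meaningful_set(scores, epsilon=3.0))   # set() -- empty for large epsilon
```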
  • Paragraphs need not be disjoint. If a document does not have a natural subdivision into paragraphs, then several consecutive sentences (e.g., four or five consecutive sentences) can be used as the text feature (e.g., paragraph).
  • For sufficiently large values of ε, the set MeaningfulSet(ε) may be empty.
  • For sufficiently small (e.g., negative) values of ε, the set MeaningfulSet(ε) can contain all the words from D. It has been observed for test documents that the size of MeaningfulSet(ε) can have a sharp drop from the total number of words in a document toward zero words around some reference value ε₀ > 0.
  • A one-parameter family of graphs Gr(D, ε) can be defined for a document D.
  • Document D can be pre-processed, for example, by splitting the words at non-alphabetic characters and down-casing all words. Stemming can then be applied.
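  • A minimal pre-processing sketch along these lines follows; the stemmer is optional and caller-supplied (e.g., NLTK's PorterStemmer), which is an assumption of this sketch rather than a requirement of the disclosure:

```python
import re

def preprocess(text, stemmer=None):
    """Split at non-alphabetic characters, lower-case, and optionally stem each word."""
    words = [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]
    if stemmer is not None:              # e.g., nltk.stem.PorterStemmer().stem
        words = [stemmer(w) for w in words]
    return words

print(preprocess("Small-world networks: 435 nodes."))
# ['small', 'world', 'networks', 'nodes']
```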
  • Let S_1, S_2, ..., S_n denote the sequence of consecutive text features (e.g., sentences) in the document D. For the discussion that follows, sentences are used to illustrate the method.
  • The graph Gr(D, ε) can have the sentences S_1, S_2, ..., S_n as its vertex set.
  • For sufficiently large values of ε, MeaningfulSet(ε) is empty, and thus Gr(D, ε) is the path graph (an example of a path graph is illustrated in Figure 1).
  • As ε decreases, MeaningfulSet(ε) increases in size. More and more edges can be added to the graph until the graph Gr(D, ε) can look like a random graph with a large number of edges.
  • The path graph and the large random graph are two extreme cases, neither of which reveals desired document analysis information. Of more interest is what happens between these two extreme scenarios.
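  • A sketch of one way to realize Gr(D, ε), assuming sentences have already been pre-processed into word lists and a meaningful-word set has been chosen; nearest text features are joined sequentially, and, per the connection rule described elsewhere in this document, two sentences sharing at least one meaningful word are also joined (networkx is used only as an assumed convenience):

```python
import networkx as nx

def build_graph(sentences, meaningful_words):
    """sentences: list of word lists, one per sentence, in document order."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    # Local/sequential edges between nearest text features (the path graph of Figure 1).
    for i in range(len(sentences) - 1):
        g.add_edge(i, i + 1)
    # Keyword edges between sentences sharing at least one meaningful word (O(n^2) sketch).
    bags = [set(s) & meaningful_words for s in sentences]
    for i in range(len(sentences)):
        for j in range(i + 2, len(sentences)):
            if bags[i] & bags[j]:
                g.add_edge(i, j)
    return g

# With an empty meaningful set (large epsilon), only the path-graph edges remain.
print(build_graph([["a"], ["b"], ["c"], ["d"]], set()).number_of_edges())   # 3
```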
  • Different clustering measures for Gr(D, ε), as well as the presence of hubs (i.e., strongly connected nodes), can also be utilized.
  • Graphs with a small world structure are common in social networks, where there are a lot of local connections with a few long-range ones. What makes such graphs informative is that a small number of long-range shortcuts makes the resulting graphs much more compact than the original regular graphs with only local connections.
  • The Gr(D, ε) models of the present disclosure are much closer to the Newman and Weinberg models than to the Watts-Strogatz model.
  • The documents can be pre-processed, including splitting the words at non-alphabetic characters, converting all words to lower case, and applying stemming, for example.
  • Natural paragraphs were used as a text feature for the two documents illustrated in Figure 4A and, where a natural subdivision into paragraphs was not available, a text feature (e.g., paragraph) was defined as any four nearest sentences.
  • Let l(i, j) denote the geodesic distance between two different nodes v_i and v_j.
  • The geodesic distance is the length of a shortest path, counted as the number of edges in the path.
  • The characteristic path length (or mean inter-node distance), L, is defined as the average of l(i, j) over all pairs of different nodes (i, j): L = (1/(n(n-1))) Σ_{i≠j} l(i, j), where n is the number of nodes.
  • The graph Gr(D, ε) depends on the parameter ε, so the characteristic path length becomes a function L(ε) of the parameter ε.
  • L(ε) is also a non-decreasing function of ε. That is, the path length between nodes tends to increase as the parameter ε increases, and can exhibit dramatic increases for values of the parameter ε approximately greater than 1, which can be used to identify a range of the parameter ε for which the network is a small world structure.
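  • A sketch of computing L(ε) for one graph of the family; the fallback to the largest connected component for disconnected graphs is an assumption of this sketch, since the definition above presumes the distance is defined for every pair:

```python
import networkx as nx

def characteristic_path_length(g):
    """Average geodesic distance over all pairs of distinct nodes."""
    if not nx.is_connected(g):
        # Assumed convention for disconnected graphs: use the largest component.
        g = g.subgraph(max(nx.connected_components(g), key=len))
    return nx.average_shortest_path_length(g)

print(characteristic_path_length(nx.path_graph(5)))   # 2.0 for the five-node path graph
```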
  • Example values of the characteristic path length L(ε) for the tested documents are shown in Table II below:
  • Clustering is a description of the interconnectedness of the nearest neighbors of a node in a graph. Clustering is a non-local characteristic of a node and goes one step further than the degree. Clustering can be used in the study of many social networks. There are two widely used measures of clustering: the clustering coefficient and transitivity.
  • The clustering coefficient C(v_i) of a node v_i is the probability that two nearest neighbors of v_i are themselves nearest neighbors. In other words, C(v_i) is the ratio of the number of edges between the nearest neighbors of v_i to the maximum possible number of such edges, k_i(k_i - 1)/2, where k_i is the degree of v_i. The mean clustering C_WS is the average over all nodes, C_WS = (1/n) Σ_i C(v_i), where n is the number of vertices in the network.
  • The network Gr(D, ε) is a small world structure in the case of the 2000 State of the Union address given by President Bill Clinton and in the case of the 2011 State of the Union address given by President Barack Obama, as may be seen by graphing the mean clustering C_WS as a function of the parameter ε. Both documents have a small degree of separation, high mean clustering C_WS, and a relatively small number of edges.
  • The range ε ∈ [2, 3] also produces a small world structure with even more striking values of the mean clustering C_WS. Historically, C_WS was the first measure of clustering used in the study of networks, and it can be used as a characteristic indication in the method of the present disclosure. Another measure of clustering, transitivity, can also be used.
  • The clustering coefficient and the transitivity are not equivalent. They can produce substantially different values for a given network. Many consider the transitivity to be a more reliable characteristic of a small world structure than the clustering coefficient. Transitivity is often an interesting and natural concept in social network modeling.
  • The level of transitivity can be quantified in graphs as follows. If u is connected to v and v is connected to w, then there is a path uvw of two edges in the graph. If u is also connected to w, the path is a triangle. The transitivity of a network is defined as the fraction of paths of length two in the network that form triangles: C = 3 * (number of triangles in the network) / (number of connected triples of nodes).
  • A "connected triple" means three nodes u, v, and w with edges (u, v) and (v, w).
  • The factor of three in the numerator arises because each triangle is counted three times when counting all connected triples in the network.
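  • Both measures can be computed directly from a graph. In networkx, used here as an assumed convenience with a toy graph, average_clustering corresponds to the mean clustering C_WS and transitivity implements 3 * triangles / connected triples as defined above:

```python
import networkx as nx

g = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])   # one triangle plus a pendant edge

# Mean clustering C_WS: average of the per-node clustering coefficients.
print(nx.average_clustering(g))   # (1 + 1 + 1/3 + 0) / 4 = 0.583...
# Transitivity: 3 * (number of triangles) / (number of connected triples).
print(nx.transitivity(g))         # 3 * 1 / 5 = 0.6
```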
  • Some typical values of transitivity for social networks are provided for context.
  • The network Gr(D, ε) also has high transitivity in the case of the 2000 State of the Union address given by President Bill Clinton and in the case of the 2011 State of the Union address given by President Barack Obama, as may be seen by graphing the transitivity C as a function of the meaningfulness parameter ε.
  • Gr(D, ε) has 435 nodes in the Obama 2011 address, 533 nodes in the Clinton 2000 address, and 2343 nodes in the case of the Book of Genesis, so it is not easy to represent such graphs graphically. A much nicer picture can be produced for a graph with the text features being paragraphs as the node set.
  • The paragraphs can be connected by the same example rule provided above: two paragraphs are connected if they have meaningful words in common.
  • The degree-rank function d(ε) can be plotted for several values of ε (e.g., its first fifty elements) to compare the degree distributions for different values of ε.
  • The values can be scaled such that the largest one, d_1(ε), is equal to one.
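  • A sketch of the scaled degree-rank function (the function and variable names below are illustrative, not taken from the disclosure):

```python
import networkx as nx

def degree_rank(g, k=50):
    """Node degrees sorted in decreasing order, scaled so the largest equals one."""
    degrees = sorted((d for _, d in g.degree()), reverse=True)[:k]
    top = degrees[0] if degrees and degrees[0] > 0 else 1
    return [d / top for d in degrees]

print(degree_rank(nx.path_graph(6), k=5))   # [1.0, 1.0, 1.0, 1.0, 0.5]
```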
  • PageRank-type methods can be used to produce relevant rankings of nodes.
  • Studies of known social networks have demonstrated that real-world networks can become denser over time, and their diameters effectively become smaller over time.
  • A time parameter t can also be introduced in the method of the present disclosure by considering various document portions (e.g., the first t sentences of a document).
  • Figure 5 illustrates an example method for document analysis according to various examples of the present disclosure.
  • The example method of document analysis includes determining, via a computing system, a parametric family of graphs corresponding to a structure of a document at 560, and varying a parameter of the parametric family of graphs via the computing system at 562. At least one graph with a small world structure is identified via the computing system, as illustrated at 564.
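  • A sketch of that overall flow, assuming graph builders like those sketched earlier produce one graph per parameter value; the particular small-world test used here (path length near log N together with clustering well above a comparable random graph) is a common heuristic chosen for illustration, not a criterion fixed by the disclosure:

```python
import math
import networkx as nx

def looks_small_world(g):
    """Heuristic test: short characteristic path length and clustering well above
    the <k>/N level expected for a random graph of the same size and density."""
    if g.number_of_edges() == 0 or not nx.is_connected(g):
        return False
    n = g.number_of_nodes()
    mean_degree = 2 * g.number_of_edges() / n
    random_clustering = mean_degree / n       # approximate clustering of an Erdos-Renyi graph
    return (nx.average_shortest_path_length(g) <= 2 * math.log(n)
            and nx.average_clustering(g) >= 5 * random_clustering)

def small_world_range(graphs_by_epsilon):
    """graphs_by_epsilon: {epsilon: graph}. Return the epsilon values whose graph passes."""
    return sorted(eps for eps, g in graphs_by_epsilon.items() if looks_small_world(g))
```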
  • Figure 6 illustrates a block diagram of an example computing system used to implement a document analysis system according to the present disclosure.
  • The computing system 674 can comprise a number of computing resources communicatively coupled to the network 678.
  • Figure 6 shows a first computing device 675 that may also have an associated data source 676, and may have input/output devices (e.g., keyboard, electronic display).
  • a second computing device 679 is also shown in Figure 6 being communicatively coupled to the network 678, such that executable instructions may be communicated through the network between the first and second computing devices.
  • Second computing device 679 may include a processor 680 in communication with a non-transitory computer-readable medium 681.
  • The non-transitory computer-readable medium 681 may be structured to store executable instructions 682 (e.g., programs) that can be executed by the processor 680 and/or data.
  • The second computing device 679 may be further communicatively coupled to a production device 683 (e.g., electronic display, printer, etc.). Second computing device 679 can also be communicatively coupled to an external computer-readable memory 684.
  • The second computing device 679 can cause an output to the production device 683, for example, as a result of executing instructions of programs stored on non-transitory computer-readable medium 681, by the at least one processor 680, to implement a system for document analysis according to the present disclosure.
  • Causing an output can include, but is not limited to, displaying text and images to an electronic display and/or printing text and images to a tangible medium (e.g., paper).
  • Executable instructions used to implement document analysis may be executed by the first 675 and/or second 679 computing device, stored in a database such as may be maintained in external computer-readable memory 684, output to production device 683, and/or printed to a tangible medium.
  • Additional computers 677 may also be communicatively coupled to the network 678 via a communication link that includes a wired and/or wireless portion.
  • The computing system can comprise multiple additional interconnected computing devices, such as server devices and/or clients.
  • Each computing device can include control circuitry such as a processor, a state machine, application specific integrated circuit (ASIC), controller, and/or similar machine.
  • The control circuitry can have a structure that provides a given functionality, and/or can execute computer-readable instructions that are stored on a non-transitory computer-readable medium (e.g., 676, 681, and 684).
  • The non-transitory computer-readable medium can be integral to (e.g., 681), or communicatively coupled with, the respective computing device.
  • For example, the non-transitory computer-readable medium can be an internal memory, a portable memory, a portable disk, or a memory located internal to another computing resource (e.g., enabling the computer-readable instructions to be downloaded over the Internet).
  • The non-transitory computer-readable medium (e.g., 676, 681, and 684) can have computer-readable instructions stored thereon that are executed by the control circuitry (e.g., processor) to provide a particular functionality.
  • The non-transitory computer-readable medium can include volatile and/or non-volatile memory.
  • Volatile memory can include memory that depends upon power to store information, such as various types of dynamic random access memory (DRAM), among others.
  • Non-volatile memory can include memory that does not depend upon power to store information. Examples of non-volatile memory can include solid state media such as flash memory, EEPROM, phase change random access memory (PCRAM), among others.
  • The non-transitory computer-readable medium can also include optical discs, digital video discs (DVD), Blu-ray discs, compact discs (CD), and laser discs; magnetic media such as tape drives, floppy discs, and hard drives; solid state media such as flash memory, EEPROM, and phase change random access memory (PCRAM); as well as other types of machine-readable media.
  • Logic can be used to implement the method(s) of the present disclosure, in whole or in part. Logic can be implemented using appropriately configured hardware and/or software (i.e., machine-readable instructions). The above-mentioned logic portions may be discretely implemented and/or implemented in a common arrangement.
  • FIG. 7 illustrates a block diagram of an example computer readable medium (CRM) 795 in communication, e.g., via a communication path 796, with processing resources 793 according to the present disclosure.
  • The processing resources 793 can include one or a plurality of processors 794, such as in a parallel processing arrangement.
  • A computing device having processing resources can be in communication with, and/or receive, a tangible non-transitory computer readable medium (CRM) 795 storing a set of computer readable instructions for document analysis, as described herein.

Abstract

Methods, systems, and computer readable media with executable instructions, and/or logic are provided for document analysis. An example method of document analysis can include determining, via a computing system (674), a parametric family of graphs (314) corresponding to a structure of a document (300) (560). A parameter of the parametric family of graphs (314) is varied, via the computing system (674) (562), and at least one graph (314) with a small world structure is identified via the computing system (674) (564).

Description

DOCUMENT ANALYSIS
Background
With the number of electronically-accessible documents now greater than ever before in business, academic, and other settings, techniques for accurately summarizing and indexing large bodies of documents are of increasing importance. Automated natural language processing techniques may be used to perform a variety of document-related tasks. For example, in some applications, a business, academic organization, or other entity may desire to automatically classify documents, and/or create a searchable database of documents, such that a user may quickly access a desired document using search terms.
Brief Description of the Drawings
Figure 1 illustrates a path graph that may be formed based on one value of a meaningfulness parameter according to various examples of the present disclosure.
Figure 2 illustrates a graph that may be formed based on another value of a meaningfulness parameter according to various examples of the present disclosure.
Figure 3 conceptually illustrates one example of a method for document analysis according to various examples of the present disclosure.
Figures 4A and 4B plot a relationship between a number of edges and a level of meaningfulness parameter for documents analyzed according to various examples of the present disclosure.
Figure 5 illustrates an example method for document analysis according to various examples of the present disclosure.
Figure 6 illustrates a block diagram of an example computing system used to implement a method for document analysis according to the present disclosure.
Figure 7 illustrates a block diagram of an example computer readable medium (CRM) in communication with processing resources according to the present disclosure.
Detailed Description
Examples of the present disclosure may include methods, systems, and computer readable media with executable instructions, and/or logic. According to various examples of the present disclosure, an example method of document analysis can include determining, via a computing system, a parametric family of graphs corresponding to a structure of a document. A parameter of the parametric family of graphs is varied, via the computing system, and at least one graph with a small world structure is identified via the computing system.
Previous approaches to automated natural language processing are often limited to empirical keyword analysis and typically have not utilized graph-based techniques, at least in part because of the difficulty of determining an appropriate graphing scheme. The present disclosure is directed to graph-based document analysis methods and systems.
According to various examples of the present disclosure, a text document is modeled by a one-parameter family of graphs with text features (e.g., sentences, paragraphs, sections, pages, chapters, or other language structures that include more than one sentence) as the set of graph nodes. Text features, as used herein, exclude language structures smaller than one sentence, such as words alone or phrases (i.e., partial sentences). Edges are defined by the relationships between the text features. Such relationships can be determined based on keywords and associated characteristics such as occurrence, proximity, and other attributes discussed later. Keywords can be a carefully selected family of "meaningful" words. For example, a family of meaningful words can be selected using the Helmholtz principle, with edges being defined therefrom.
Meaningfulness can be determined relative to a meaningfulness parameter that can be used as a threshold for determining words, and/or relationships based on the keywords, as being meaningful. While a document is built from words, the document is not simply a bag of words. The words are organized into various text features to convey the meaning of documents. Relationships between different text features, the position of respective text features, and the order in which the text features appear in a document can all be relevant for document understanding.
Documents can be modeled by networks, with text features as the nodes. The edges between the nodes are used to represent the relationships between pairs of entities. Nearest text features in a document are often logically connected to create a flow of information. Therefore, an edge can be created between a pair of nearest text features. Such edges can be referred to as local or sequential network connections.
Figure 1 illustrates a path graph that may be formed based on one value of a meaningfulness parameter according to various examples of the present disclosure. In a simple form, a network of relationships between nodes (e.g., 132, 134) representing text features is a linear path graph 130, in which each text feature is connected to its nearest text features by an edge 136 created between each pair of nearest text features. That is, in a document, text features can be arranged one after another, and they are at least related by their relative locations to one another. This arrangement relationship can be represented by the path graph 130 shown in Figure 1, illustrating a first text feature being related to a second text feature by proximity thereto in the document.
However, a document can have a more complicated structure. Different parts of a document, either proximate or distant, can be logically connected. Therefore, text features should be connected if they have something in common and are referring to something similar. For example, an author can recall or reference words in one location that also appear in another location. Such references can create distant relations inside the document. Due to these types of logical relationships that exist in the document, the relationships between text features can be reflected in the modeling technique of the present disclosure. Edges can be established between nodes representing non-adjacent text features based on a keyword, logical, or other relationship between the text features. A meaningfulness parameter can be used as a threshold with respect to the "strength" characteristic(s) of a relationship for establishing an edge between particular nodes on the graph.
For example, a set of keywords of a document can be identified. The size and elements of the set of keywords can be a function of a meaningfulness parameter, which may operate as a threshold for meaningfulness of the keywords included in the set of keywords. "Meaningfulness" can be determined according to various techniques, such as identification and ranking as defined by the Helmholtz principle (discussed below).
A document can be represented by a network, where nodes of the network correspond to text features of the document, and edges between particular nodes represent relationships between the text features represented by the particular nodes. Nodes representing adjacent text features in the document can be joined by an edge, as is shown in Figure 1. Furthermore, nodes representing text features that include at least one keyword included in the identified set of keywords can also be joined by an edge, such as is shown below with respect to Figure 2.
Representing a document as a network according to the present disclosure can be distinguished from previous approaches, which may let nodes represent terms such as words or phrases (rather than text features such as sentences or larger), and may let edges merely represent co-occurrence of terms or weighting factors based upon mutual information between terms, which may not account for proximity or other relationship aspects that can convey information between text features larger than words/phrases. More specifically, for a previous-approach network having nodes representing words occurring in a document, one node represents a word that can appear in many instances across the document. Such a representation may not capture the connectivity of information of the text features (e.g., sentences, paragraphs) across the many instances of the word represented by a same node.
A network according to the present disclosure, however, can capture these relationships between sentences and paragraphs, based on proximity and keyword occurrence, since sentences and paragraphs (and other text features) are typically unique. Thus, a node in a network according to the present disclosure typically represents a singular occurrence in a document (e.g., of a sentence or paragraph) rather than multiple occurrences of words, and can result in finer tuning of the text feature relationships within the document.
Furthermore, the methods of the present disclosure provide a mechanism for ranking nodes of the network representing the document, and thus for ranking the corresponding text features.
The graph of the network can also include attributes of the nodes (e.g., 132, 134) and/or edges 136 therebetween conveying certain information regarding the characteristics of the respective text feature or relationship(s). For example, node size can be used to indicate text feature size, such as the number of words in a paragraph. The edge length can be used to indicate some information about the relationship between the connected text features, such as whether the adjacent text features are successive paragraphs within a chapter, or successive paragraphs that end one chapter and begin a next chapter.
Figure 2 illustrates a graph that may be formed based on another value of a meaningfulness parameter according to various examples of the present disclosure. At one extreme, every node can be connected to every other node in a network of relationships between the text features, which can correspond to a very low threshold for determining some relationship exists between each node representing a text feature. The graph 240 shown in Figure 2 represents an intermediate scenario between the simple path graph shown in Figure 1 and the extreme condition of almost every node being connected to almost every other node.
As shown in Figure 2, nodes (e.g., 242, 244) representing text features are shown around a periphery, with edges 246 connecting certain nodes.
Although not readily visible due to the scale of the graph, the nodes around the periphery can be connected to adjacent nodes similar to that shown in Figure 1, representative of physical proximity of text features in a document. The nodes representing the first and last appearing text features in a document may not be interconnected in Figure 2. The graph shown in Figure 2 can include some nodes (e.g., 242) that are not connected to non-adjacent nodes, and some nodes (e.g., 244) that are connected to non-adjacent nodes.
The quantity of edges in the relationship network shown in the graph can change as a value of a meaningfulness parameter varies, where the meaningfulness parameter is a threshold for determining whether a relationship exists between pairs of nodes. That is, an edge can be defined to exist when a relationship between nodes representing text features is determined relative to the meaningfulness parameter. For example, the graph 240 shown in Figure 2 can display those relationships that meet or exceed the threshold based on the meaningfulness parameter. Therefore, the appearance of the graph 240 can change as the meaningfulness parameter varies.
The document analysis methodology of the present disclosure does not require the graph to be plotted and/or visually displayed. The graph can be formed or represented mathematically without plotting and/or display, such as by data stored in a memory, and attributes of the graph determined by computational techniques other than by visual inspection. The consequence of the meaningfulness parameter on the structure of the network relationships of a document is discussed in more detail with respect to Figures 4A and 4B below.
Figure 3 conceptually illustrates one example of a method for document analysis according to various examples of the present disclosure. Figure 3 shows a document analysis system 308 for implementing graph-based natural language text processing. The document analysis system 308 can be a computing system such as described further with respect to Figure 6. The document analysis system 308 can access a document (e.g., natural language text, or collection of texts) 300 that includes a plurality of text features. In various alternative examples, the natural language text or collection of texts 300 may be in any desirable format including, but not limited to, formats associated with known word processing programs, markup languages, and the like.
Furthermore, the texts 300 can be in any language or combination of languages.
As will be discussed in detail below, the document analysis system 308 can identify and/or select keywords (e.g., 312-1, 312-2, ..., 312-N) and/or text features from the text 300. These can be organized into list(s) 310. The document analysis system 308 can also determine various connecting relationships between the text features, and the network of relationships formed by the nodes and edges, which can be based on a value of a meaningfulness parameter (e.g., used as a threshold for characterization of relationships). The graph 314 includes graph nodes 316 associated with the text features and graph edges 318 associated with the connecting relationships.
As previously discussed, the text features may include any desirable type of text features including, but not limited to, sentences 302, paragraphs 304, sections, pages, chapters, other language structures that include more than one sentence, and combinations thereof.
The document analysis system 308 can further determine (e.g., form, compute, draw, represent, etc.) a parametric family of graphs 314 (such as those shown and described with respect to Figures 1 and 2) for the network of relationships, including those relationship networks that have a small world structure. Such a small world structure can occur in many biological, social, and man-made systems, and other applications of networks. The document analysis system 308 can also determine, from the determined graph 314, corresponding value(s), or ranges of values, of a meaningfulness parameter for which the graph(s) exhibit certain structural characteristics (e.g., small world structure).
That is, the document analysis system 308 can analyze the graph 314 for small world structure, such as by analyzing certain characteristics of the graph 314. One such analysis can be the relationship between a number of edges and the meaningfulness parameter, as shown in Figure 3 by chart 320. Chart 320 plots a curve 326 of the number of edges 322 versus a level of the meaningfulness parameter (ε) 324. As an example of possible behavior for a document, the meaningfulness parameter (ε) 324 can vary from negative values to positive values.
Chart 320 shows a curve 326 of the number of edges 322 as a function of the meaningfulness parameter (ε) 324. Chart 320 is representative of a parametric family of graphs that can correspond to a structure of a document. For example, chart 320 can indicate at least one graph, of the parametric family of graphs, with a small world structure. The curve 326 has a first portion 327 that is relatively flat (e.g., small slope) for negative values of the meaningfulness parameter (ε) 324, and a second portion 329 for positive values of the meaningfulness parameter (ε) 324 greater than 1. The curve 326 also includes a third portion, between the first 327 and second 329 portions, for values of the meaningfulness parameter (ε) 324 approximately between 0 and 1, where the curve 326 becomes steep and can include an inflection point. The range of curve 326 after the third portion (e.g., the second portion 329), for values of the meaningfulness parameter (ε) 324 beyond which curve 326 includes a steep portion, can be associated with the network of relationships having a small world structure.
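As a concrete and simplified illustration of reading such a curve, the steep region can be located numerically from edge counts computed over a grid of ε values; the "largest relative drop" rule below is an assumption, one simple stand-in for identifying the steep third portion of the curve:

```python
def steepest_drop(edge_counts):
    """edge_counts: list of (epsilon, number_of_edges) pairs sorted by epsilon.
    Return the epsilon interval with the largest relative drop in edge count."""
    best, best_interval = 0.0, None
    for (e0, n0), (e1, n1) in zip(edge_counts, edge_counts[1:]):
        if n0 > 0 and (n0 - n1) / n0 > best:
            best, best_interval = (n0 - n1) / n0, (e0, e1)
    return best_interval

# Toy curve shaped like chart 320: flat for negative epsilon, steep near [0, 1], flat after.
curve = [(-2, 9000), (-1, 8800), (0, 8500), (0.5, 3000), (1, 600), (2, 520), (3, 500)]
print(steepest_drop(curve))   # (0.5, 1) -- just before the flatter small-world range
```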
In mathematics, physics, and sociology, a small-world network can be a type of mathematical graph in which most nodes are not neighbors of one another, but most nodes can be reached from every other node by a small number of hops or steps. More specifically, a small-world network can be defined to be a network where the typical distance L between two randomly chosen nodes (the number of steps) grows proportionally to the logarithm of the number of nodes N in the network, that is, L ∝ log N.
The idea of a small average length of edges accompanied by high clustering was first introduced in the classical Watts-Strogatz model, which was further refined in models by Newman and Weinberg.
The small world structure (e.g., topology) can be expected to appear after the sharp drop in the number of edges 322 as a function of the meaningfulness parameter (ε) 324, as can be observed from a graph thereof. This is because there are of the order of N² edges in a complete graph (i.e., every node connected to every other node), while the number of edges of a network having a small world structure is of the order of N log N. However, it should be noted that it is not sufficient to remove edges at random; the small world structure will not appear.
The behavior of the small world structure relative to the N log N behavior usually anticipated may be a signature for the category/classification of the text itself. Therefore, small world behavior (e.g., structure, topology) can be tested to see if the text fits a certain category. For example, more learned text (e.g., more extensive vocabulary) may be greater than N log N and less learned text (e.g., pulp fiction) may be less than N log N, etc.
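For a sense of scale, the two edge counts can be compared numerically; the node count of 435 used below is the one reported elsewhere in this document for the 2011 address, and the natural logarithm is assumed since the base is not specified:

```python
import math

n = 435                                  # node count reported for one analyzed document
complete_edges = n * (n - 1) // 2        # order N^2: every node joined to every other node
small_world_scale = n * math.log(n)      # order N log N expected for a small world structure
print(complete_edges)                    # 94395
print(round(small_world_scale))          # 2643
```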
In the context of a social network, this can result in the small world phenomenon of strangers being linked by a mutual acquaintance. In the context of document analysis, and the graph of network relationships between nodes representing text features and edges representing relationships above a threshold between the text features, a small world structure can represent text features being linked to other text features by a relationship of a defined strength.
The results of analyzing the graph 314 may be represented as a chart, such as 320, or as a list or table quantifying the relationship between the meaningfulness parameter (ε) 324 and the number of edges 322. Other attributes of graph 314 may also be analyzed with respect to text features and relationships therebetween. As used herein, the phrase "analyzing the graph" can refer to techniques for determining appropriate indicators of a small world structure.
Figures 4A and 4B plot a relationship between a number of edges and a level of meaningfulness parameter for documents analyzed according to various examples of the present disclosure. Figure 4A is a plot 450 of experimental results regarding the relationship between the meaningfulness parameter (ε) 424 and the number of edges 422 for several documents, including the 2011 State of the Union address given by President Barack Obama 426-O and the 2000 State of the Union address given by President Bill Clinton 426-C. Figure 4B is a plot of experimental results regarding the relationship between the meaningfulness parameter (ε) 424 and the number of edges 422 for the Book of Genesis 426-G. The text feature represented by nodes in a network relationship graph in each case was sentences. Therefore, edges represented relationships between sentences. The discussion that follows refers specifically to sentences as an example of a text feature, but examples of the present disclosure are not so limited and can be applied to other text features as previously defined.
Generally, when the meaningfulness parameter goes from a negative to a large positive value, a network relationship graph (e.g., as shown in Figure 2) transforms from a large random graph to a regular graph. The transition into a small world structure (e.g., topology) can happen in between the extreme cases, such as just after the portion of the curve where the number of edges decreases significantly, as was discussed with respect to graph 320 in Figure 3 (e.g., the second portion 329 of curve 326 for a range of the meaningfulness parameter greater than one). Having a small world topology is of interest in ranking text features, since in such graphs different nodes have different contributions to the graph being a small world.
One example approach of the present disclosure to the challenge of defining relationships between text features is as follows. A one-parameter family of sets of meaningful words, MeaningfulSet(ε), can be constructed for the document. That is, elements of the MeaningfulSet(ε) are keywords. Two text features can be connected by an edge if they have at least one word from the MeaningfulSet(ε) in common. This type of network is common in the modeling of social networks as affiliation networks. The underlying idea behind affiliation networks is that in social networks there are two types of entities: actors and societies. The entities can be related by affiliation of the actors to the societies. In an affiliation network, the actors are represented by the nodes. Two actors can be related if they both belong to at least one common society.
With respect to document analysis, the sentences can be the "actors" and each member of the MeaningfulSet(ε) can be a "society." A sentence can "belong" to a word if this word appears in the sentence. The set of societies, MeaningfulSet(ε), can depend on the meaningfulness parameter ε. The family of graphs can become an affiliation network with a variable number of societies. A one-parameter family of graphs can have the same set of nodes but a different set of edges for different values of the meaningfulness parameter. If the set of meaningful words is too small, then only the local relations (e.g., physical proximity of adjacent nodes) can be present and the graph will look like a regular graph. If, however, too many meaningful words are selected, then the graph can look like a large random graph with too many edges.
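For illustration only, the following minimal Python sketch (not part of the claimed method; the sentences and the meaningful set shown are assumed toy data) builds the affiliation-style relation described above, connecting two sentences (actors) whenever they are affiliated with at least one common word (society):

from itertools import combinations

def affiliation_edges(sentences, meaningful_set):
    # Connect two sentences (actors) if they share at least one word
    # (society) from the given meaningful set.
    memberships = [set(s.lower().split()) & meaningful_set for s in sentences]
    edges = set()
    for i, j in combinations(range(len(sentences)), 2):
        if memberships[i] & memberships[j]:
            edges.add((i, j))
    return edges

# Toy example with an assumed meaningful set.
sentences = [
    "the recovery act funded new jobs",
    "tax cuts and new jobs followed",
    "the weather was pleasant",
]
print(affiliation_edges(sentences, {"jobs", "recovery"}))  # {(0, 1)}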
According to various examples of the present disclosure, the MeaningfulSet(ε) can be selected using the Helmholtz principle such that the one-parameter family of graphs becomes an interpolation between these two limiting cases with a defined "phase transition" (e.g., for values of the meaningfulness parameter (ε) 424 where the slope of a plot of the number of edges 422 as a function of the meaningfulness parameter (ε) 424 becomes steep). The graphs become a small world structure, and can form a self-organized system, for some range of the meaningfulness parameter (ε) 424 (e.g., greater than one).
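As an informal illustration of locating such a phase transition, the following Python sketch (the edge counts used are invented for illustration, not measured data) scans a list of (ε, number-of-edges) pairs and reports the interval over which the edge count falls most sharply:

def steepest_drop(curve):
    # curve: (epsilon, edge_count) pairs sorted by epsilon. Returns the
    # epsilon interval over which the edge count falls most sharply,
    # a rough indicator of where a small world structure may appear.
    best = None
    for (e0, n0), (e1, n1) in zip(curve, curve[1:]):
        slope = (n0 - n1) / (e1 - e0)  # edges lost per unit epsilon
        if best is None or slope > best[0]:
            best = (slope, (e0, e1))
    return best[1]

# Illustrative edge counts only.
curve = [(-1.0, 9000), (0.0, 7500), (1.0, 2200), (2.0, 700), (3.0, 520), (4.0, 434)]
print(steepest_drop(curve))  # (0.0, 1.0): the small world range is expected just after this drop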
According to various examples of the present disclosure, when a graph topology becomes a small world structure, the most relevant nodes and edges of such a graph can be identified. That is, for a small world structure graph topology, the nodes and edges that contribute to the graph being a small world structure can be ascertained, which can provide a mechanism for determining the most relevant text features of a document. Since nodes can represent text features of a document according to the document analysis techniques of the present disclosure, identifying the most relevant nodes in a small world structure identifies the most relevant text features in a document. Once identified, these relevant text features can be used for further document processing techniques. Such an approach can bring a better understanding of complex logical structures and flows in text documents.
Some previous approaches to text data mining used the concept of a small world from social networking for keyword extraction in documents. Co-occurrence graphs are constructed by selecting words as nodes, and edges are introduced between two words based on the appearance of the two words in the same sentence. In contrast, various examples of the present disclosure utilize graphs built with text features other than single words as the nodes. The set of edges depends on the meaningfulness parameter (ε), which reflects a level of meaningfulness of the relationship between the text features, thus forming a one-parameter family of graphs. A more rigorous discussion of example graphs of network relationships ascertained from document analysis follows. Let D denote a text document and P denote a text feature portion of text document D. P can be a paragraph of the text document D, for example, where the document is divided into paragraphs. P can alternatively be several consecutive sentences, for example, where the document is not divided into paragraphs.
Based on the Helmholtz Principle from the Gestalt Theory of human perception, a measure of meaningfulness of a word w from D inside P can be defined. If the word w appears m times in P and K times in the whole document D, then the number of false alarms NFA(w, P, D) can be defined by the following expression:

NFA(w, P, D) = C(K, m) · (1/N)^(m - 1),   (1)

where C(K, m) = K!/(m!(K - m)!) is a binomial coefficient. In equation (1) the number N is [L/B], where L is the length of the document D and B is the length of P in words. The following expression is a measure of meaningfulness of the word w in P:

Meaning(w, P, D) = -(1/m) log NFA(w, P, D).   (2)
The justification for using Meaning(w, P, D) is based on arguments from statistical physics.
A set of meaningful words in P is defined as words with Meaning(w, P, D) > 0, and larger positive values of Meaning(w, P, D) give larger levels of meaningfulness. For example, given a document subdivided into paragraphs, MeaningfulSet(ε) can be defined as a set of all words with Meaning(w, P, D) > ε for at least one paragraph P. In general, paragraphs need not be disjoint. If a document does not have a natural subdivision into paragraphs, then several consecutive sentences (e.g., four or five consecutive sentences) can be used as the text feature (e.g., paragraph).
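For illustration, equations (1) and (2) can be sketched in Python as follows. This is a minimal sketch under the form of equation (1) reconstructed above; the function name, argument names and the example counts are ours:

from math import comb, log, floor

def meaning(m, K, L, B):
    # Meaning(w, P, D) = -(1/m) * log NFA(w, P, D), where
    # NFA(w, P, D) = C(K, m) * (1/N)**(m - 1) and N = floor(L / B).
    # m: occurrences of w in P; K: occurrences of w in D;
    # L: length of D in words; B: length of P in words.
    N = max(floor(L / B), 1)
    nfa = comb(K, m) * (1.0 / N) ** (m - 1)
    return -log(nfa) / m

# A word appearing 4 times in one paragraph but only 6 times in a
# 10,000-word document (100-word paragraphs) is highly meaningful there.
print(round(meaning(4, 6, 10_000, 100), 3))  # about 2.777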
For a sufficiently large positive ε, the set MeaningfulSet(ε) may be empty. For ε << 0 the set MeaningfulSet(ε) can contain all the words from D. It has been observed for test documents that the size of MeaningfulSet(ε) can have a sharp drop from the total number of words in a document toward zero words around some reference value ε0 > 0.
A one-parameter family of graphs Gr(D, ε) can be defined for a document D. Document D can be pre-processed, for example, by splitting the words by non-alphabetic characters and down-casing all words. Stemming can be applied, for example, thereafter. Let S1, S2, ..., Sn denote the sequence of consecutive text features (e.g., sentences) in the document D. For the discussion that follows, sentences are used to illustrate the method.
The graph Gr(D, ε) can have the sentences S1, S2, ..., Sn as its vertex set.
Since the order of text features (e.g., sentences) is relevant in documents, and since the nearest sentences are usually related, an edge can be added for every pair of consecutive sentences (Si, Si+1). This also assists connectivity of the graph and avoids unnecessary complications that can be associated with several connected components. Finally, if two sentences share at least one word from the set MeaningfulSet(ε), they too can be connected by an edge. In this manner, the family of graphs Gr(D, ε) can be defined, for example.
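A minimal Python sketch of this construction (our own illustration; it assumes pre-processing such as down-casing and stemming has already been applied to the sentences, and it uses an adjacency-set representation of our choosing) is:

from collections import defaultdict

def build_graph(sentences, meaningful_set):
    # Gr(D, eps) as an adjacency map: every pair of consecutive sentences
    # is joined, and any two sentences sharing a word from MeaningfulSet(eps)
    # are also joined.
    words = [set(s.split()) & meaningful_set for s in sentences]
    adj = defaultdict(set)
    n = len(sentences)
    for i in range(n - 1):            # path-graph backbone of consecutive sentences
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    for i in range(n):
        for j in range(i + 2, n):     # long-range "meaningful" edges
            if words[i] & words[j]:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def num_edges(adj):
    return sum(len(v) for v in adj.values()) // 2

g = build_graph(["a b", "c d", "a e"], {"a"})
print(dict(g), num_edges(g))  # sentences 0 and 2 share the meaningful word "a"

Sweeping the parameter ε changes MeaningfulSet(ε) and hence the number of long-range edges in the returned graph.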
For a sufficiently large positive number ε, MeaningfulSet(ε) is empty, and thus Gr(D, ε) is the path graph (e.g., an example of a path graph is illustrated in Figure 1). As ε decreases, the MeaningfulSet(ε) increases in size. More and more edges can be added to the graph until the graph Gr(D, ε) can look like a random graph with a large number of edges. As previously mentioned, the path graph and the large random graph are two extreme cases, neither of which reveals desired document analysis information. Of more interest is what happens between these two extreme scenarios.
There is a range of the parameter ε where Gr(D, ε) becomes a small world structure. That is, for some range of the parameter ε there can be a large change (e.g. , drop) in the inter-node distances after adding a relatively small number of edges.
Different clustering measures for Gr(D, ε) can also be utilized. With respect to complex architectures, hubs (i.e., strongly connected nodes) serve a pivotal role for ranking and classification of nodes representing text features for analysis of documents. Graphs with a small world structure are usual in social networks, where there are many local connections with a few long-range ones. What makes such graphs informative is that a small number of long-range short-cuts make the resulting graphs much more compact than the original regular graphs with only local connections. The Gr(D, ε) models of the present disclosure are much closer to the Newman and Weinberg models than to the Watts-Strogatz one.
Referring again to Figures 4A and 4B, these present experimental results for numerical experiments on the three different text documents indicated. As discussed generally above, the documents can be pre-processed, including splitting the words by non-alphabetic characters, making all words lower case, and applying stemming, for example. With respect to the three test documents, natural paragraphs were used as the text feature for the two documents illustrated in Figure 4A, and for the Book of Genesis a text feature (e.g., paragraph) was defined as any four nearest sentences.
For the three indicated text documents, the numbers of sentences, paragraphs, words and different words are presented in Table I.
TABLE I - DOCUMENT STATISTICS
To better understand the properties of the networks Gr(D, ε), different measures and metrics were examined. First of all, the number of edges in Gr(D, ε) was plotted for each of the three documents as a function of ε, the results of which are illustrated in Figures 4A and 4B. As can be observed in Figures 4A and 4B, there is a dramatic change (e.g., drop) in the number of edges in Gr(D, ε) for some ranges of positive values of ε. These are areas where small world structures are expected to be observed for the graphs Gr(D, ε). To formalize the notion of a small world structure, Watts and Strogatz defined the clustering coefficient and the characteristic path length of a network. Let G = (V, E) be a simple, undirected and connected graph with the set of nodes V = {v1, ..., vn} and the set of edges E. Let l_ij denote the geodesic distance between two different nodes vi and vj. The geodesic distance is the length of a shortest path counted in number of edges in the path. The characteristic path length (or the mean inter-node distance), L, is defined as the average of l_ij over all pairs of different nodes (i, j):

L = (2 / (n(n - 1))) Σ_{i<j} l_ij.
The graph Gr(D, ε) depends on the parameter ε, so the characteristic path length becomes a function L(ε) of the parameter ε. L(ε) is also a non-decreasing function of ε. That is, the path length (i.e., between nodes) tends to increase as the parameter ε increases, and can exhibit dramatic increases for values of the parameter ε approximately greater than 1, which can be used to identify a range for the parameter ε for which the network is a small world structure. Example values of the characteristic path length L(ε) for the tested documents are shown in Table II below:
ε      Obama        Clinton      The Book of Genesis
-1.0   1.358748     1.319542     1.309066
 0.0   1.622702     1.773237     1.527610
 1.0   2.937931     2.861523     2.079833
 1.5   5.514275     3.945697     2.580943
 2.0   12.274517    12.715485    3.727103
 2.5   22.471095    52.442205    7.280936
 3.0   89.049007    113.237971   18.874327
 3.5   144.854071   177.272814   96.873744
 4.0   145.333333   178.000000   317.638370
 4.5   145.333333   178.000000   779.802265

TABLE II - SOME VALUES OF L(ε) FOR THE 3 TEST DOCUMENTS
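The characteristic path length can be computed directly by breadth-first search. The following Python sketch (our own, for a connected graph stored as an adjacency map) implements the definition of L given above:

from collections import deque

def characteristic_path_length(adj):
    # Mean geodesic distance over all ordered pairs of distinct nodes,
    # computed by breadth-first search from every node (graph assumed connected).
    nodes = list(adj)
    total, pairs = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(d for node, d in dist.items() if node != src)
        pairs += len(nodes) - 1
    return total / pairs

# On a path graph of four nodes the mean inter-node distance is 5/3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(characteristic_path_length(path))  # 1.666...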
With respect to clustering properties of the parametric graph Gr(D, ε), clustering is a description of the interconnectedness of the nearest neighbors of a node in a graph. Clustering is a non-local characteristic of a node and goes one step further than the degree. Clustering can be used in the study of many social networks. There are two widely-used measures of clustering: the clustering coefficient and transitivity. The clustering coefficient C(vi) of a node vi is the probability that two nearest neighbors of vi are themselves nearest neighbors. In other words,

C(vi) = (number of pairs of neighbors of vi that are connected) / (number of pairs of neighbors of vi) = ti / (qi(qi - 1)/2),

where qi is the number of nearest neighbors of vi (the degree of the vertex) with ti connections between them. C(vi) is always between 0 and 1. When all the nearest neighbors of a node vi are interconnected, C(vi) = 1, and when there are no connections between the nearest neighbors, as in trees, C(vi) = 0. Most real-world networks have strong clustering. The clustering coefficient (or mean clustering) for an entire network can be calculated as the mean of the local clustering coefficients of all nodes:

C_WS = (1/n) Σ_i C(vi),

where n is the number of vertices in the network. As examples of C_WS for real-world networks: for the collaboration graph of actors C_WS ≈ 0.79, for the electrical power grid of the western United States C_WS = 0.08, and for the neural network of the nematode worm C. elegans C_WS = 0.28.
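A minimal Python sketch of both the local clustering coefficient C(vi) and the mean clustering C_WS (our own illustration, using the same adjacency-map representation as above) is:

def local_clustering(adj, v):
    # C(v): fraction of pairs of neighbors of v that are themselves connected.
    nbrs = list(adj[v])
    q = len(nbrs)
    if q < 2:
        return 0.0
    t = sum(1 for a in range(q) for b in range(a + 1, q) if nbrs[b] in adj[nbrs[a]])
    return t / (q * (q - 1) / 2)

def mean_clustering(adj):
    # C_WS: mean of the local clustering coefficients over all nodes.
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# A triangle with one pendant node: C(v0) = C(v1) = 1, C(v2) = 1/3, C(v3) = 0.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(mean_clustering(g))  # approximately 0.583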
In the range ε ∈ [1.0, 2.5] the network Gr(D, ε) is a small world structure in the case of the 2000 State of the Union address given by President Bill Clinton and in the case of the 2011 State of the Union address given by President Barack Obama, as may be seen by graphing the mean clustering C_WS as a function of the parameter ε. Both documents have a small degree of separation, high mean clustering C_WS, and a relatively small number of edges. For the Book of Genesis, the range ε ∈ [2, 3] also produces a small world structure with even more striking values of the mean clustering C_WS. Historically, C_WS was the first measure of clustering in the study of networks and can be a characteristic used as an indication of the method of the present disclosure. Another measure of clustering, transitivity, can also be used.
The clustering coefficient and the transitivity are not equivalent. They can produce substantially different values for a given network. Many consider the transitivity to be a more reliable characteristic of a small world structure than the clustering coefficient. Transitivity is often an interesting and natural concept in social networks modeling.
In mathematics, a relation R is said to be transitive if aRb and bRc together imply aRc. In networks, there are many different relationships between pairs of nodes. The simplest relation is "connected by an edge." If the
"connected by an edge" relation was transitive it would mean that if a node u is connected to a node v, and v is connected to w, then u is also connected to w. For social networks this can mean that "the friend of my friend is also my friend." Perfect transitivity can occur in networks where each connected component is a complete graph (i.e., all nodes are connected to all other nodes). In general, the friend of my friend is not necessarily my friend.
However, intuitively, a high level of transitivity can be expected between people. In the case of the document analysis graphs Gr(D, ε), transitivity can mean that if a sentence Si describes something similar to a sentence Sj, and Sj is also similar to a sentence Sk, then Si and Sk probably also have something in common. So, it is natural to expect a high level of transitivity in the graph Gr(D, ε) for some range of the parameter ε.
The level of transitivity can be quantified in graphs as follows. If u is connected to v and v is connected to w, then there is a path uvw of two edges in the graph. If u is also connected to w, the path is a triangle. If the transitivity of a network is defined as the fraction of paths of length two in the network that are closed into triangles, then:

C = 3 × (number of triangles) / (number of connected triples),

where a "connected triple" means three nodes u, v and w with edges (u, v) and (v, w). The factor of three in the numerator arises because each triangle will be counted three times when counting all connected triples in the network.
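For illustration, the transitivity can be computed from an adjacency map as in the following Python sketch (ours; the counting visits each connected triple once through its centre node, so each triangle is counted three times, matching the factor of three in the definition):

from itertools import combinations

def transitivity(adj):
    # C = 3 * (number of triangles) / (number of connected triples).
    closed = triples = 0
    for v in adj:
        for a, b in combinations(adj[v], 2):  # a connected triple centred on v
            triples += 1
            if b in adj[a]:
                closed += 1                   # the triple closes into a triangle
    # Each triangle is centred on each of its three nodes, so `closed` already
    # equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(round(transitivity(g), 3))  # 3 * 1 / 5 = 0.6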
Some typical values of transitivity for social networks are provided for context. For example, the network of film actor collaborations has been found to have C = 0.20; a network of collaborations between biologists has C = 0.09; a network of people who send email to other people in a large university has C = 0.16. In the range ε ∈ [1.0, 2.5], the network Gr(D, ε) has high transitivity in the case of the 2000 State of the Union address given by President Bill Clinton and in the case of the 2011 State of the Union address given by President Barack Obama, as may be seen by graphing the transitivity C as a function of the parameter ε. For the Book of Genesis, with ε in the range [2, 3], the transitivity is also quite high (i.e., greater than 0.6).
From Table I, Gr(D, ε) has 435 nodes in the case of the Obama 2011 address, 533 nodes in the case of the Clinton 2000 address, and 2343 nodes in the case of the Book of Genesis. So, it is not easy to represent such graphs graphically. A much nicer picture can be produced for a graph with paragraphs, rather than sentences, as the node set. The paragraphs can be connected by the same example rule provided above: two paragraphs are connected if they have meaningful words in common.
In the case of the 2011 State of the Union address given by President Barack Obama there are 95 paragraphs. For the value ε = 2, several highly-connected nodes make a small world structure. If nodes are ranked according to some ranking function, this function should provide a wide range of values. With respect to ranking of sentences according to their degree, all nodes in Gr(D, ε) can be sorted in decreasing order of degree to get a degree sequence d(ε) = {d1(ε), ..., dn(ε)}, where d1(ε) ≥ d2(ε) ≥ ... ≥ dn(ε).
Consider, for example, the first fifty values of di(ε) in the case of the Obama speech. To have a reliable selection of five, ten, or more highest-ranked sentences, a wide range of values of the degree function is needed. The sequence d(ε) (e.g., its first fifty elements) can be plotted for several values of ε as the degree-rank function for different values of ε. The values can be scaled such that the largest one, d1(ε), is equal to one. The values ε = 1.0 and ε = 2.0 have the best dynamic range, and these are also values where the graphs have a small world structure. According to experimental results, the most connected sentence in the 2011 Obama address (for ε = 2) is "The plan that has made all of this possible, from the tax cuts to the jobs, is the Recovery Act," with a degree of 29.
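A simple Python sketch of such a degree ranking (our own; the adjacency map and sentence labels are toy stand-ins, not data from the tested documents) is:

def degree_rank(adj, sentences, top=5):
    # Sort nodes by decreasing degree and scale so the largest degree equals one.
    degrees = sorted(((len(adj[i]), i) for i in adj), reverse=True)
    d_max = degrees[0][0] or 1
    return [(d / d_max, sentences[i]) for d, i in degrees[:top]]

g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
sents = ["S1", "S2", "S3", "S4"]
print(degree_rank(g, sents, top=2))  # [(1.0, 'S1'), (0.666..., 'S3')]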
Different measures and metrics for complex networks, such as the eigenvector centrality, Katz centrality, hubs and authorities, betweenness centrality, power law and scale-free networks, can be used to evaluate document analysis effectiveness. These metrics and measures can be used to help quantify document analysis criteria.
For two connected text features, it can be determined which one appears first and which one appears second, according to their position in a document. This can make such a graph look like a small WWW-type network, and PageRank-type methods can be used to produce relevant rankings of nodes. Studies of known social networks have demonstrated that real-world networks can become denser over time, and their diameters effectively become smaller over time. A time parameter t can also be introduced in the method of the present disclosure by considering various document portions (e.g., the first t sentences of a document).
Figure 5 illustrates an example method for document analysis according to various examples of the present disclosure. The example method of document analysis includes determining, via a computing system, a parametric family of graphs corresponding to a structure of a document at 560, and varying a parameter of the parametric family of graphs via the computing system at 562. At least one graph with a small world structure is identified via the computing system, as illustrated at 564.
Figure 6 illustrates a block diagram of an example computing system used to implement a document analysis system according to the present disclosure. The computing system 674 can be comprised of a number of computing resources communicatively coupled to the network 678. Figure 6 shows a first computing device 675 that may also have an associated data source 676, and may have input/output devices (e.g., keyboard, electronic display). A second computing device 679 is also shown in Figure 6 being communicatively coupled to the network 678, such that executable instructions may be communicated through the network between the first and second computing devices.
Second computing device 679 may include a processor 680
communicatively coupled to a non-transitory computer-readable medium 681. The non-transitory computer-readable medium 681 may be structured to store executable instructions 682 (e.g., programs) that can be executed by the processor 680 and/or data. The second computing device 679 may be further communicatively coupled to a production device 683 (e.g., electronic display, printer, etc.). Second computing device 679 can also be communicatively coupled to an external computer-readable memory 684.
The second computing device 679 can cause an output to the production device 683, for example, as a result of executing instructions of programs stored on non-transitory computer-readable medium 681, by the at least one processor 680, to implement a system for document analysis according to the present disclosure. Causing an output can include, but is not limited to, displaying text and images to an electronic display and/or printing text and images to a tangible medium (e.g., paper). Executable instructions to implement document analysis may be executed by the first 675 and/or second 679 computing device, stored in a database such as may be maintained in external computer-readable memory 684, output to production device 683, and/or printed to a tangible medium.
Additional computers 677 may also be communicatively coupled to the network 678 via a communication link that includes a wired and/or wireless portion. The computing system can be comprised of additional multiple interconnected computing devices, such as server devices and/or clients. Each computing device can include control circuitry such as a processor, a state machine, application specific integrated circuit (ASIC), controller, and/or similar machine.
The control circuitry can have a structure that provides a given
functionality, and/or execute computer-readable instructions that are stored on a non-transitory computer-readable medium (e.g., 676, 681 , and 684). The non- transitory computer-readable medium can be integral (e.g., 681 ), or
communicatively coupled (e.g. , 676, 684) to the respective computing device (e.g. 675, 679) in either a wired or wireless manner. For example, the non- transitory computer-readable medium can be an internal memory, a portable memory, a portable disk, or a memory located internal to another computing resource (e.g., enabling the computer-readable instructions to be downloaded over the Internet). The non-transitory computer-readable medium (e.g., 676, 681 , and 684) can have computer-readable instructions stored thereon that are executed by the control circuitry (e.g., processor) to provide a particular functionality.
The non-transitory computer-readable medium, as used herein, can include volatile and/or non-volatile memory. Volatile memory can include memory that depends upon power to store information, such as various types of dynamic random access memory (DRAM), among others. Non-volatile memory can include memory that does not depend upon power to store information. Examples of non-volatile memory can include solid state media such as flash memory, EEPROM, phase change random access memory (PCRAM), among others. The non-transitory computer-readable medium can include optical discs, digital video discs (DVD), Blu-ray discs, compact discs (CD), laser discs, and magnetic media such as tape drives, floppy discs, and hard drives, solid state media such as flash memory, EEPROM, phase change random access memory (PCRAM), as well as other types of machine-readable media.
Logic can be used to implement the method(s) of the present disclosure, in whole or in part. Logic can be implemented using appropriately configured hardware and/or software (i.e., machine readable instructions). The above-mentioned logic portions may be discretely implemented and/or implemented in a common arrangement.
Figure 7 illustrates a block diagram of an example computer readable medium (CRM) 795 in communication, e.g., via a communication path 796, with processing resources 793 according to the present disclosure. As used herein, processing resources 793 can include one or a plurality of processors 794, such as in a parallel processing arrangement. A computing device having processing resources can be in communication with, and/or receive, a tangible non-transitory computer readable medium (CRM) 795 storing a set of computer readable instructions for document analysis, as described herein.
The above specification, examples and data provide a description of the method and applications, and use of the system and method of the present disclosure. Since many examples can be made without departing from the spirit and scope of the system and method of the present disclosure, this specification merely sets forth some of the many possible example configurations and implementations.
Although specific examples have been illustrated and described herein, those of ordinary skill in the art will appreciate that an arrangement calculated to achieve the same results can be substituted for the specific examples shown. This disclosure is intended to cover adaptations or variations of various examples provided herein. The above description has been made in an illustrative fashion, and not a restrictive one. Combinations of the above examples, and other examples not specifically described herein, will be apparent upon reviewing the above description. Therefore, the scope of various examples of the present disclosure should be determined based on the appended claims, along with the full range of equivalents to which such claims are entitled.
Throughout the specification and claims, the meanings identified below do not necessarily limit the terms, but merely provide illustrative examples for the terms. The meaning of "a," "an," and "the" includes plural reference, and the meaning of "in" includes "in" and "on." "Example," as used herein, does not necessarily refer to the same example, although it may.
In the foregoing discussion of the present disclosure, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration how examples of the disclosure may be practiced. These examples are described in sufficient detail to enable those of ordinary skill in the art to practice the examples of this disclosure, and it is to be understood that other examples may be utilized and that process, electrical, and/or structural changes may be made without departing from the scope of this disclosure.
Some features are grouped together in a single example for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the disclosed examples of the present disclosure have to use more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed example. Thus, the following claims are hereby
incorporated into the Detailed Description, with each claim standing on its own as a separate example.

Claims

What is claimed:
1. A method, comprising:
determining, via a computing system (674), a parametric family of graphs (314) corresponding to a structure of a document (300) (560);
varying, via the computing system (674), a parameter of the parametric family of graphs (314) (562); and
identifying, via the computing system (674), at least one graph (314) with a small world structure (564).
2. The method of claim 1 , further comprising:
identifying, via the computing system (674), a set of keywords (312-1 , 312-2, 312-N) of the document (300) as a function of a meaningfulness parameter (324);
representing, via the computing system (674), a network, wherein nodes
(316) of the graph (314) correspond to text features (302, 304) of the document (300) and edges (318) between particular nodes (316) represent relationships between the text features (302, 304) represented by the particular nodes (316); joining, via the computing system (674), nodes (316) representing adjacent text features (302, 304) in the document (300) by an edge (318) and nodes (316) representing text features (302, 304) that include at least one keyword (312-1 , 312-2, 312-N) included in the identified set of keywords (312-1 , 312-2, 312-N) by an edge (318); and
identifying, via the computing system (674), text features (302, 304) of the document (300) based on connectivity of nodes (316) for at least one value of the meaningfulness parameter (324) for which the graph (314) has a small world structure.
3. The method of claim 2, wherein the text features (302, 304) are language structures that include more than one paragraph (304).
4. The method of claim 2, further comprising ranking keywords (312-1, 312-2, ..., 312-N) of the set of keywords (312-1, 312-2, ..., 312-N) based on the meaningfulness parameter (324).
5. The method of claim 4, wherein identifying the text features (302, 304) of the document (300) includes ranking the text features (302, 304) based on the quantity of edges (318) associated with the respective nodes (316) corresponding to the text features (302, 304).
6. The method of claim 4, wherein identifying the set of document (300) keywords (312-1, 312-2, ..., 312-N) includes selecting and ranking keywords (312-1, 312-2, ..., 312-N) of the set of keywords (312-1, 312-2, ..., 312-N) using the Helmholtz principle.
7. The method of claim 2, wherein:
varying the parameter of the parametric family of graphs (314) includes varying the meaningfulness parameter (324);
the parametric family of graphs (314) is a one-parametric graph corresponding to the varying meaningfulness parameter (324); and
varying the parameter of the parametric family of graphs (314) includes determining a range (329, 328) for the meaningfulness parameter (324) for which the one-parametric graphs (314) have a small world structure.
8. The method of claim 2, wherein the graph (314) has the small world structure within a defined range about an inflection point in a graph (320) of the quantity of edges (318) as a function of a value of the meaningfulness parameter (324).
9. The method of claim 1 , wherein identifying at least one graph (314) with the small world structure includes:
determining a signature for a category of text of the document (300), the category of text of the document (300) having a range (329, 328) for the meaningfulness parameter (324) defined as being a small world structure; and determining a quantity of edges (318) within the range (329, 328) for the meaningfulness parameter (324) defined as being a small world structure for at least one graph (314).
10. The method of claim 1, wherein the graph (314) has the small world structure where there is a large change in inter-node distances corresponding to addition of a small quantity of edges (318).
11. A non-transitory computer-readable medium (676, 681, 684, 795) having computer-readable instructions (682) stored thereon that, if executed by a processor (680, 794), cause the processor (680, 794) to:
determine a one-parameter family of graphs (314) corresponding to a structure of a document (300);
vary the parameter; and
identify, based on the parameter, at least one graph (314) with a small world structure,
wherein the parameter is a meaningfulness parameter (324).
12. The non-transitory computer-readable medium (676, 681 , 684, 795) of claim 11 , further having computer-readable instructions (682) stored thereon that, if executed by the processor (680, 794), cause the processor (680, 794) to: determine a set of keywords (312-1 , 312-2, 312-N) in the document
(300), the quantity of keywords (312-1 , 312-2, 312-N) being relative to the meaningfulness parameter (324);
form a graph (314), wherein nodes (316) represent text features (302,
304) in the document (300) and edges (318) represent relationships between text features (302, 304), wherein relationships between text features (302, 304) are based on the set of keywords (312-1 , 312-2, 312-N); and
identify text features (302, 304) of the document (300) based on connectivity of nodes (316) for a range (329, 328) of the meaningfulness parameter (324) for which the graph (314) has a small world structure.
13. The non-transitory computer-readable medium (676, 681 , 684, 795) of claim 12, further having computer-readable instructions (682) stored thereon that, if executed by the processor (680, 794), cause the processor (680, 794) to rank text features (302, 304) corresponding to nodes (316) according to quantity of hub node connections in the graph (314) having a small world structure.
14. A computing system (674), comprising:
a non-transitory computer-readable medium (676, 681 , 684, 795) having computer-readable instructions (682) stored thereon; and
a processor (680, 794) coupled to the non-transitory computer-readable medium (676, 681, 684, 795), wherein the processor (680, 794) executes the computer-readable instructions (682) to:
determine a one-parameter family of graphs (314) corresponding to a structure of a document (300);
vary the parameter; and
identify, based on the parameter, at least one graph (314) with a small world structure,
wherein the parameter is a Helmholtz meaningfulness parameter (324).
15. The computing system (674) of claim 14, wherein the processor (680, 794) executes the computer-readable instructions (682) to:
determine a set of keywords (312-1 , 312-2, 312-N) in the document (300) based on the Helmholtz meaningfulness parameter (324);
represent a graph (314) having nodes (316) representative of text features (302, 304) in the document (300) and edges (318) representative of relationships between text features (302, 304) based on the keywords (312-1, 312-2, ..., 312-N); and
vary the set of keywords (312-1, 312-2, ..., 312-N) according to the Helmholtz meaningfulness parameter (324);
rank text features (302, 304) corresponding to hub nodes according to a quantity of edges (318) connected thereto; and
select a number of highest ranked identified text features (302, 304).

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050276479A1 (en) * 2004-06-10 2005-12-15 The Board Of Trustees Of The University Of Illinois Methods and systems for computer based collaboration
US20050278325A1 (en) * 2004-06-14 2005-12-15 Rada Mihalcea Graph-based ranking algorithms for text processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y.MATSUO ET AL.: "A document as a small world", LECTURE NOTES IN COMPUTER SCIENCE, vol. 2253, 1 January 2001 (2001-01-01), pages 444 - 448 *
YUTAKA MATSUO ET AL.: "KeyWorld: Extracting Keywords from a Document as a Small World", DISCOVERY SCIENCE, 28 November 2001 (2001-11-28), pages 271 - 280 *

