WO2002063497A1

WO2002063497A1 - Proximity for computer represented graphs

Info

Publication number: WO2002063497A1
Application number: PCT/AU2002/000121
Authority: WO
Inventors: Kwok Kay Wong
Original assignee: Soda Technologies Pty Ltd
Priority date: 2001-02-08
Filing date: 2002-02-08
Publication date: 2002-08-15
Also published as: AUPR295501A0

Abstract

The present invention relates to a method and apparatus for searching information which may be stored in a semi-structured database. The invention uses the idea of searching for information by 'proximity'. Semi-structured information can be represented in graph form, the graph including a plurality of nodes connected by edges. The present invention provides a simple coding method, which can result in rapid searching, for coding the positions of the nodes and edges so that the proximity of respective nodes can be easily determined.

Description

STRUCTURAL PROXIMITY SEARCHING INFORMATION

FIELD OF THE INVENTION

The present invention relates to a method and apparatus for facilitating searching of collections of unstructured and semi-structured data, a method and apparatus for producing coded indexes representing graphs which represent information, the graphs including nodes representing information objects interconnected by edges.

BACKGROUND OF THE INVENTION

Information, including any collection of data, may be stored in many locations, and in many forms. For example, information may be stored in large structured databases. Searches of these structured databases may be made using queries dictated by the database structure. The problem with structured databases, of course, is that a structure has to be provided, and much information is not suited to be organised within a rigid structure.

There therefore exist many sources of unstructured and semi-structured data. Searching such data is difficult because there is no structure on the basis of which any logical query can be formulated, in the traditional sense of structured databases. Examples of semi-structured data can be found in documents distributed over computer networks such as the Internet.

One of the languages now increasingly used to store information over the Internet is the Extensible Markup Language ("XML"). Each XML document will have some structure. However, that structure may or may not be consistent between different documents. The hierarchical and semantic information about the content contained in one XML document may depend solely upon the author. It is likely that different authors will encode the same information in entirely different ways, resulting in different structures encoding the same meaning. For example, one person may encode "comedy" as a category of "movies", whereas another person may do the opposite.

Suppose, for example, we are looking for trends in "insurance claims" related to "smoking". The information we are after may be contained in insurance company records, court transcripts, or even newspaper articles. Even if we decide we are only interested in examining court transcripts, we do not know the structural relationship between the terms of interest. We are left in the predicament of knowing exactly what we are looking for, but not knowing how to find it.

A number of solutions to this problem have already been explored. Naturally, all these solutions are less precise than an exact database query where the structure of the data is known. However, an increasing number of users are engaging in interactive searches, such as online web based searching, where queries are successively refined until the desired information is located. In such circumstances, a useful and accepted approach is to provide approximate or "likely" answers which can be refined by the user.

A number of proposals adapt methods from information retrieval (IR) to address this problem. Methods such as keyword searching and building traditional IR indexes for different element types are adapted to XML query processing.

Hayashi et al [Y. Hayashi, J. Tomita, and G. Kikui. Searching Text-rich XML Documents with Relevance Ranking. In ACM SIGIR 2000 Workshop on XML and Information Retrieval, July 2000] have proposed a method for searching XML documents with relevance ranking. They define a subset of the XML tags to be search fields, and then build indexes on these search fields using traditional IR methods. Query execution involves accessing these indexes, combining and ranking results. This method again provides the ability to search data without knowledge of the underlying structure. This approach, however, requires a separate index for each search field, and does not provide a mechanism for efficiently reflecting changes to the structure of the document.

Navarro and Baeza-Yates [G. Navarro and R. Baeza-Yates. Proximal Nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems, 15(4):400-435, October 1997] have proposed a generic model in which proximal nodes are used to query document databases by content and structure. The paper also includes an extensive survey of related work from the IR community. However, implementation and system details, efficiency, indexing, etc. are not the focus of the paper.

An alternative to these approaches is to treat the data as a graph, and use proximity (in the structural sense) between elements as a means of determining the nodes of interest. This approach is especially applicable to XML, for which a precise and well defined mapping exists between the data and its representation as a graph.

Florescu et al [D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into XML query processing. WWW9/Computer Networks, 33(1-6):119-135, 2000] have proposed a novel method for extending XML query processing by incorporating a keyword search facility on element names. They utilise an inverted file to index the name and depth of individual element names. During a search, nodes can be efficiently retrieved by name, and optionally limited by the depth of occurrence. This allows a user to query data without prior knowledge of the underlying structure. Their approach, however, is limited to relatively static data, and considers only the raw tree structure of the XML document.

Goldman et al [R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity Search in Databases. In International Conference on VLDB, 26-37, 1998] have proposed a method for returning one set of nodes which is close (in the structural sense) to another set of nodes, with returned nodes ranked by their proximity to elements of the second set. They precompute the shortest distance between all points in the graph, and use a method called hub-indexing to significantly reduce the space requirements of the index. This method produces good results for users who are not aware of the structure of the data. Their approach, however, relies on pre-computed values to determine proximity, and does not provide a mechanism for efficiently reflecting changes to the structure of the data. Their index size is further reduced by considering only distances less than some pre-defined value. As such, their ranking is only precise for nodes at a distance less than this pre-defined value.

There is a need for an improved way of searching unstructured and semi-structured data, and in particular data that can be represented in graph form.

SUMMARY OF THE INVENTION

In accordance with the first aspect, the present invention provides a method of searching information, the information being representable in graph form and including nodes representing objects and edges representing relationships between nodes, the method comprising the steps of encoding a representation of the graph, the code identifying the position of the nodes within the graph, and providing the code as an index arranged to facilitate a determination of proximity between the positions of nodes within the graph, whereby to facilitate searching of the information represented by the node.

Preferably, the code is arranged to represent the pathway to nodes via interconnecting edges connecting the nodes and a root node.

Preferably, the step of encoding comprises the step of assigning edge identifiers to each edge, the edge identifier for a particular edge being the smallest unused positive number which is unique only amongst all edges originating from a given node. Preferably, a code is assigned to each node that corresponds to the sequence of edge identifiers of the edges in the pathway to that node through connected nodes from the root node. The codes are preferably implemented in the form of bit arrays. This has the advantage that the index can be included in a very small storage means.

Preferably, the code produced for the index can be used to determine the positional proximity within the graph of a first node (termed a "find" node) with respect to a second node (termed a "near" node), and this can be done by comparing an encoded subgraph of all nodes and edges of the find node with the encoded subgraphical nodes and edges of the near node. Such a searching step can therefore return all the find nodes in a set near all the nearest nodes in a set, and these can all preferably be ranked in accordance with proximity. That is, it would be possible to find the nearest "restaurant" to "Newtown". Proximity is proximity within the graph, not physical location.

Comparison of subgraphs is preferably done by comparing bit patterns from the codes for each of the subgraphs.

The method of searching of the present invention can be applied to any information and is not limited to XML data, although XML data is convenient. As long as the data or information can be represented in graph form (prior art methods of translating semi-structured information and other information into graph form including nodes and edges are already known) then the present invention can be applied.

BRIEF DESCRIPTION OF THE DRAWINGS

Features and advantages of the present invention will become apparent from the following description of an embodiment thereof, with reference to the accompanying drawings, in which:

Figure 1 illustrates an example graph showing edges and nodes for sample data;

Figure 2 is a representation of a graph illustrating an encoding scheme utilised in an embodiment of the present invention;

Figure 3 is an illustration of a compressed bit array utilised in accordance with an embodiment of present invention;

Figure 4 is an illustration of a sample compressed array with multiple encoding, utilised in an embodiment of the present invention;

Figure 5 is a series of example graphs illustrating application of a multi-element comparison algorithm, in accordance with an embodiment of the present invention;

Figure 6 is a further illustration of a graph for illustrating an implementation of an encoding system in accordance with an embodiment of the present invention;

Figure 7 is a further illustration of an encoded bit pattern of code in accordance with the present invention, similar to that of figure 3;

Figure 8 is a further illustration of an encoded bit pattern in accordance with an embodiment of the present invention, similar to figure 4;

Figure 9 is a schematic diagram of a searching system in accordance with an embodiment of the present invention; and

Figure 10 is a diagram of a searching system in accordance with a further embodiment of the present invention.

DESCRIPTION OF PREFERRED EMBODIMENT

The following description of a preferred embodiment gives a description of coding schemes which can be used to encode a representation of a graph including nodes representing objects and edges representing relationships between nodes, being representative of information being held, for example, within a database. The coding schemes can be used to compare portions (sub-graphs of the graphs) in order to determine the proximity between nodes. As the relationships represented by the edges can be logical relationships between the nodes (representing information), determination of proximity of nodes can be used to search the information in a logical fashion.

The following description will first of all give an overview of the theory behind the approach of this embodiment of the present invention, and then will give examples of implementations of coding schemes and algorithms for implementing the present invention.

The approach of the present invention extends to ranking proximity of nodes that we wish to find ("find" nodes) based on the closeness in proximity to "near" nodes.

To increase the power of the proximity ranking, we enable results to be ranked according to either ascending or descending proximity. Rather than pre-compute distances, we propose a method of dynamically calculating distances as required.

Suppose, for example, we are interested in locating a "restaurant" in "Soho". As we are entering this request over the Internet, we have no idea of the underlying structure of the data. Indeed, it is very possible that the web site accesses data from many different sources with vastly differing structures, as illustrated in figure 1. For this reason we cannot easily issue a meaningful database query. However, if we can find "restaurants", ranked by proximity to the nearest occurrence of "Soho", we are likely to find what we are looking for.

Suppose further that we do not want any restaurant which serves seafood. We could refine the previous results by looking for all such "restaurants" not near "seafood". This raises the question, however, of how to determine what is meant by "near". We could decide that everything beneath some arbitrary threshold is "near", or decide upon some other means of distinguishing between "near" and "far". This, however, entails the problem of imposing an artificial and possibly incorrect structure on the data, with the increased likelihood of returning inappropriate results. If we instead rank everything in decreasing order of proximity, so that the "restaurant" furthest away from "seafood" is listed first, we are likely to find a restaurant we want without the risk of distorting the underlying data.

For structural proximity determination, we consider the XML database (such as a Lore database [9]; J. McHugh et al. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54-66, September 1997) as a general graph, comprised of one or more individual data sets. Whilst raw XML can always be represented as a tree, the interpreted structure (i.e. the structure obtained if links are materialised as edges) can be an arbitrary graph. As links generally indicate a "relatedness" between elements, we must consider this interpreted structure to maximise the effectiveness of our proximity searches. Our proximity search thus requires that we be able to find the distance between any two nodes in an arbitrary graph. This distance is defined in terms of the shortest path between two nodes. Such graphs may easily contain a huge number of nodes (easily O(10^6) nodes for a database of 1 GB). Any proximity search thus potentially contains a large number of such calculations, and a single query may be comprised of more than one proximity search. Given that the proximity search itself is only part of the entire query processing, the speed of such determination is especially crucial. Furthermore, as the user may require the results ranked according to either ascending or descending proximity, we need to be able to determine the precise distance between all nodes.

The embodiment of the present invention provides a method for encoding graphs, and a family of encoding schemes for representing this information in a compressed space. Our encoding schemes are specifically designed not only to be as small as possible, but to facilitate the direct calculation of proximity. We describe further optimisations to perform the comparisons very quickly (ranking 100,000 nodes according to their proximity to the closest element of a 100,000-node set in 0.44 seconds).

Conceptual Model

A well-defined and precise correspondence exists between a single XML document and its representation as a tree. This idea can easily be extended to a collection of documents by the addition of a root node, of which all documents are children. This, in turn, can likewise be extended to include multiple XML repositories. Again, all that is necessary is the inclusion of a new root node, of which all XML repositories are children.

A collection of XML documents can therefore be viewed as a single document (although very possibly with inconsistent structure). We are unconcerned, therefore, whether the desired information is contained in a single document, in multiple documents in the same database, or in multiple databases. For this reason, we assume that all the XML data of interest has a single, common root.
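As an illustrative sketch only, the common-root construction could be modelled in Python as follows. The `Node` class and the `join_under_root` helper are hypothetical names introduced for this example and are not part of the described embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal tree node; 'label' stands in for an XML element name."""
    label: str
    children: list = field(default_factory=list)


def join_under_root(trees, label="root"):
    """Return a new root node of which all the given trees are children."""
    return Node(label, list(trees))


# Two documents that encode the same information with opposite structure.
doc_a = Node("movies", [Node("comedy")])
doc_b = Node("comedy", [Node("movies")])

# A collection of documents becomes one tree, and a collection of
# repositories becomes one tree in exactly the same way.
repository = join_under_root([doc_a, doc_b], "repository")
everything = join_under_root([repository], "all-repositories")
```

The same helper is applied twice, which mirrors the observation that extending from documents to repositories needs nothing more than another root node.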

Although the raw XML structure is guaranteed to be a tree, its logical or interpreted structure may not be. XML documents may contain links to portions of the same or different documents. Such links indicate some close relationship (in the semantic or logical sense) between the two portions, as determined by the author of the document. As we are concerned with determining proximity in the semantic and logical sense, it makes sense to materialise these links, and consider them as actual directed edges. We thus consider the XML repository as a directed, possibly cyclic, graph.

The Problem

Informally, we are looking for all F near N, where F and N each represent a set of nodes. It is important to realise that F and N may be specified by some inexact criteria (for example, find all elements containing "ticket" near all elements containing "price"), and so may not be disjoint. We refer to F as the Find Set (i.e. what we want to find), and N the Near Set (i.e. what it is near).

The problem can now formally be stated as follows: we wish to return all elements of the Find Set, ranked by their proximity to the nearest element of the Near Set.

According to this definition, we must be able to rank elements of the Find Set in both ascending and descending order. Ideally, we would also like the option of returning a subset of the Find Set based on proximity (for example, return only the nearest 100 nodes, only return nodes with a proximity of less than 5, etc).
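To pin down what this problem statement asks for, the following Python sketch ranks a Find Set directly from the definition. It is a deliberately naive baseline that compares every Find Set element with every Near Set element, not the encoding-based method of the embodiment. The function name `rank_find_set`, its parameters, and the toy distance function are all assumptions made for illustration.

```python
def rank_find_set(find_set, near_set, distance,
                  descending=False, limit=None, max_distance=None):
    """Rank Find Set elements by distance to their nearest Near Set element.

    'distance' is any callable returning the proximity between two nodes.
    'limit' keeps only the first k results; 'max_distance' keeps only
    elements whose proximity is strictly less than the given value.
    """
    scored = []
    for f in find_set:
        nearest = min(distance(f, n) for n in near_set)
        if max_distance is None or nearest < max_distance:
            scored.append((nearest, f))
    # Stable sort on the distance only, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=descending)
    return scored if limit is None else scored[:limit]


# Toy stand-in for a graph distance: nodes are integers on a line.
toy_distance = lambda a, b: abs(a - b)

ascending = rank_find_set([1, 10, 4], [3, 9], toy_distance)
descending = rank_find_set([1, 10, 4], [3, 9], toy_distance, descending=True)
nearest_one = rank_find_set([1, 10, 4], [3, 9], toy_distance, limit=1)
within_two = rank_find_set([1, 10, 4], [3, 9], toy_distance, max_distance=2)
```

Each result is a list of `(distance, element)` pairs, so the same call covers ascending ranking, descending ranking, a top-k cut-off and a distance threshold.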

The definition also raises the question of the precise definition of "proximity". Proximity can be naturally defined as the shortest path between the two nodes. If we consider, however, that the purpose of the proximity search is to find nodes which are somehow semantically or conceptually related, it may make sense only to consider paths which pass through common ancestors. Common ancestors can be seen as concepts which include both target nodes. Proximity can then be defined as shortest path between two points, considering only paths through common ancestors. For the implementation of the present embodiment, we have chosen this second definition of proximity, with good results. Our mechanism works equally well for either definition of proximity, the former requiring slightly longer time when updating the index.

Fundamental Approach

When considering solutions to this problem, there are two fundamental approaches. On one hand, we could pre-compute all pairwise shortest distances, and then look these up as required. This method has the advantage of retrieving any distance in constant time. Algorithms which employ this method, however, necessarily involve O(|F| × |N|) comparisons. Furthermore, such pre-computed indexes are very large (|V|² in the worst case for a graph with V vertices), although methods have been proposed for minimising this problem [6]. Updating the index to reflect changes in the database is also expensive. The underlying database needs to be extensively examined to determine all shortest distances involving the single modified node.

The other approach is to calculate distances as required, using some form of graph algorithm. This has the advantage of virtually no overhead to reflect changes to the database, as well as much more reasonable space requirements (Θ(|V|)). However, as any graph algorithm requires arbitrary traversal through an arbitrary graph, such an algorithm could require O(|V| × (|F| + |N|)) random disk seeks in the worst case. Thus this solution is impractical for any real implementation.
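For concreteness, the "calculate as required" approach in its plain form might look like the breadth-first search below. This is a generic textbook traversal given only to show why every distance query has to walk the underlying graph; it is not the encoding-based method described next. The adjacency-dictionary representation, the function name `bfs_distance`, and the choice to list each edge in both directions (so that the search is effectively undirected) are assumptions of this sketch.

```python
from collections import deque


def bfs_distance(adjacency, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        # Every visited node may require a fresh lookup in the stored graph.
        for neighbour in adjacency.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# A small tree, with each edge listed in both directions.
graph = {
    "root": ["a", "b"],
    "a": ["root", "y"],
    "b": ["root", "x"],
    "y": ["a"],
    "x": ["b"],
}
hops = bfs_distance(graph, "y", "x")
```

Because the traversal touches the stored graph for every node it visits, repeating it for each element of the Find Set and Near Set is what leads to the large number of random accesses noted above.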

Our approach fundamentally falls into the second category, calculating the distances as required using a graph algorithm. Instead of directly examining the graph, however, we use a family of encoding schemes to represent the relevant subgraphs in a very small space (typically no more than 20 bytes for a single subgraph). The distance is then calculated by directly comparing these encodings. As the encodings are so small, the entire subgraph comparison can be performed in main memory, often utilising only the CPU cache. As the comparisons themselves heavily utilise bitwise comparisons and optimisations, distance calculations are performed very quickly. An index in accordance with an embodiment of the invention contains an entry for each node in the graph (in the implementation, nodes are indexed by OID and stored in a hash table). This entry contains the encoded subgraph containing all paths from the root to the given node. As many XML documents (or portions of these documents) are trees, many of these subgraphs will be a single path.

Proximity is determined using the following two phase approach:

[1] Obtain encoded subgraph containing all elements of Near Set

[2] Compare the encoded subgraph of each element of Find Set with the encoding obtained in step [1]

The encoded subgraph obtained in step [1] can either be generated dynamically or retrieved from a cache. Dynamic generation is achieved by retrieving the encoding for each element of Near Set and combining them into a single subgraph. Again, this process only involves the encodings themselves, and not the underlying graph. The encoding schemes have been specifically designed in such a way as to make this process of combination quick and efficient. This process again heavily utilises bitwise operations and optimisations, resulting in very fast performance. If dynamic generation of this encoding is required, this process approaches O(|N|) in practice. For each element of Find Set, step [2] involves a comparison with the encoded subgraph obtained in step [1]. Conceptually, this step involves "overlaying" the subgraph for the Find Set element with the subgraph of the Near Set, and seeing where they diverge. In practice, this overlaying is done using bitwise comparisons, and so many edges are typically compared in a single, cheap operation. In practice, this step tends towards O(|F|).

Utilising this two phase approach, we avoid the need for performing O(|F| × |N|) comparisons. In practice, this approach tends towards O(|F| + |N|) comparisons if the subgraph in step [1] must be dynamically generated, or O(|F|) comparisons if the subgraph in step [1] is retrieved from the cache.
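The two phases can be sketched in Python as follows, under simplifying assumptions that are ours rather than the embodiment's: every node is reached by a single path (a tree), paths are plain tuples of edge identifiers rather than bit patterns, proximity is measured through common ancestors, and the combined Near Set subgraph is held as a nested dictionary. The names `build_near_trie` and `nearest_distance` are hypothetical.

```python
INF = float("inf")


def build_near_trie(near_paths):
    """Phase [1]: merge all Near Set paths into one annotated structure.

    Each trie node stores 'min_down', the fewest edges from that node
    down to any Near Set element beneath it.
    """
    root = {"children": {}, "min_down": INF}
    for path in near_paths:
        node, remaining = root, len(path)
        node["min_down"] = min(node["min_down"], remaining)
        for edge in path:
            node = node["children"].setdefault(
                edge, {"children": {}, "min_down": INF})
            remaining -= 1
            node["min_down"] = min(node["min_down"], remaining)
    return root


def nearest_distance(trie, find_path):
    """Phase [2]: overlay one Find Set path and stop where it diverges."""
    node = trie
    best = len(find_path) + node["min_down"]
    for depth, edge in enumerate(find_path, start=1):
        node = node["children"].get(edge)
        if node is None:  # the paths diverge here
            break
        # Edges still to climb from the find node, plus the stored descent.
        best = min(best, len(find_path) - depth + node["min_down"])
    return best


near = [(1, 1, 3), (2, 5)]
trie = build_near_trie(near)                      # one pass over N
distances = {f: nearest_distance(trie, f)         # one pass over F
             for f in [(1, 1, 2), (2, 5, 1), (3,)]}
```

Building the structure visits each Near Set path once and each Find Set path is then walked once, so the number of path comparisons grows with |F| + |N| rather than |F| × |N|. If the structure is cached, only the second pass is repeated.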

Representing Subgraphs

The efficiency of our index relies largely on our encoding schemes. The emphasis of our encoding schemes is to represent a subgraph in a way that minimises space requirements whilst maximising the ability to calculate distance as efficiently as possible. Minimising space is important as it allows more of the index to be held in main memory. Looking ahead a little, our encoding scheme can represent all subgraphs from the root to each of 500,000 nodes in 3.8 MB. Given current systems, it is not unreasonable to hold this entirely in main memory.

In order to efficiently encode a subgraph, each edge in the main graph is assigned the smallest unused positive number which is unique only amongst all edges originating from a given node. This means that two edges can be assigned the same number as long as they originate from different nodes. This number is referred to as the edge identifier. This decision has the important consequence that all such numbers will be relatively small compared to the total number of edges/vertices in the graph. A typical XML document will frequently have one node (typically the root) with a large number of children, with all subsequent nodes having a relatively small number of children (typically less than 10). Whilst these examples are indicative only, in general they mean that most edges will have relatively small numbers used to identify them.
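A minimal Python sketch of this numbering rule is given below. The edge-list input format and the function name `assign_edge_identifiers` are assumptions made for illustration, and the sketch assumes at most one edge between any ordered pair of nodes.

```python
def assign_edge_identifiers(edges):
    """Give each (source, target) edge the smallest unused positive number
    amongst the edges that originate from the same source node."""
    used_by_source = {}
    identifiers = {}
    for source, target in edges:
        taken = used_by_source.setdefault(source, set())
        candidate = 1
        while candidate in taken:  # smallest number not yet used at this node
            candidate += 1
        taken.add(candidate)
        identifiers[(source, target)] = candidate
    return identifiers


edge_ids = assign_edge_identifiers([
    ("root", "a"), ("root", "b"),
    ("a", "c"), ("a", "d"),
    ("b", "e"),
])
```

Edges leaving different nodes reuse the same small numbers, which is what keeps the identifiers short regardless of the total size of the graph.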

Representing Single Paths

As previously mentioned, many XML documents, or portions of documents, are trees. As such, the subtree containing all paths from the root to a single node is frequently a single path. We first describe our mechanism, path encoding, for encoding a single path, and then extend that notion to encode more complex subgraphs.

Paths in the main graph are identified by the sequence of individual edge identifiers, which implicitly start from a (virtual) incoming edge to the root. Nodes are identified as being the terminus of one or more paths. This concept is illustrated in figure 2. The node y₁ is identified by the sequence of edge identifiers "1.1.2". Note that this sequence of edge identifiers both uniquely identifies the node itself and the path from the root to the node.

Our encoding scheme exploits the low numerical value of the edge identifiers, by only allocating twice the minimum space required to store the numbers. For example, as the number "1" is represented by 1 bit, and the number "2" by 2 bits, the path "1.1.2" is represented in only 8 bits (2 x 4 bits). This approach offers a great space saving over methods which typically use a 4 byte integer to represent each node (thus requiring 16 bytes instead of 1 to represent the previous path).
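The arithmetic in this example can be checked with a few lines of Python. The helper names are hypothetical, and the fixed-width figure assumes, as one reading of the 16-byte comparison, that the root and each of the three nodes along the path are stored as 4-byte integers.

```python
def path_encoding_bits(path):
    """Twice the minimum number of bits needed to write each edge identifier."""
    minimum_bits = sum(edge.bit_length() for edge in path)
    return 2 * minimum_bits


def fixed_width_bytes(path, bytes_per_node=4):
    """Cost of one fixed-width integer per node: the root plus one per edge."""
    return bytes_per_node * (len(path) + 1)


compact_bits = path_encoding_bits((1, 1, 2))   # 2 x (1 + 1 + 2) bits
baseline_bytes = fixed_width_bytes((1, 1, 2))  # 4 nodes x 4 bytes
```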

It is now possible to begin to see how the distance calculation works. As y₁ is encoded by "1.1.2" and xj is encoded by "1.1.3", the distance between them can be determined by observing that the paths are the same for the first two edges ("1.1"), and so this contributes nothing to the shortest path between them. This information is found using a single bitwise exclusive-or (XOR) operation. After the paths diverge, the path from the root to y₁ contains 1 edge, as does the path from the root to xj. This information is found by utilising a non-iterative bit counting algorithm. We can thus determine in constant time that the distance between these two nodes is 2.
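The logic of this calculation, stripped of the bit-level optimisation, can be sketched in Python using plain tuples of edge identifiers. The function name `path_distance` is an assumption of the example, and the loop stands in for the exclusive-or and bit counting operations that the embodiment applies to the packed encoding.

```python
def path_distance(path_a, path_b):
    """Distance through the deepest common ancestor of two root paths."""
    shared = 0
    for edge_a, edge_b in zip(path_a, path_b):
        if edge_a != edge_b:
            break
        shared += 1  # this edge lies on both paths and contributes nothing
    # Only the edges after the point of divergence count towards the distance.
    return (len(path_a) - shared) + (len(path_b) - shared)


example = path_distance((1, 1, 2), (1, 1, 3))
```

For the two paths in the text, the shared prefix "1.1" is discarded and one edge remains on each side, giving a distance of 2.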

Representing Multiple Paths

The method described above is extended to represent general subgraphs. Suppose we want to encode the subgraph containing all paths from the root to y₂. This must include not only the direct path to y₂, but the cycle from y₂ to itself. Obviously our method of listing edges sequentially is not sufficient when multiple paths are involved. To deal with multiple paths in a subgraph, we number nodes which contain more than 2 incoming or more than 2 outgoing edges, within a single subgraph. Note that we are not concerned about the total number of incoming and outgoing edges from a node. We are only concerned with the number of incoming and outgoing edges which are included in the subgraph of interest. This is illustrated in figure 2 by the nodes labeled "A" and "B". Note that even though many nodes in the graph have more than 2 incoming or outgoing edges, within the subgraph containing all paths from the root to y₂, there are only 2 such nodes.

Such nodes (referred to as common nodes) are numbered separately from the edge identifier numbering. In figure 2, the node y₂ is labeled "A" for clarity. In the encoding scheme, it is implemented as the number "1" with a marker bit set to indicate this number refers to a common node and not an edge identifier.

Common nodes are given numbers which are unique within the subgraph being encoded. This is generally substantially smaller than the total number of such nodes within the entire graph. (Thus, for example, a different common node may also be identified as "A" in a different subgraph). The mechanism for labeling multiple paths can now be seen.

1. The single-path method described above under "Representing Single Paths" (section 4.1) is used as long as the subgraph contains only a single path. When a common node is encountered, it is inserted into the pattern with a bit marker to indicate the path terminates at this common node.

2. When a common node is encountered, the common node is inserted into the pattern (with a bit marker indicating that this is a common node from which paths originate).

3. Each path originating from this common node is then included, using the method from section 4.1, until another common node is reached. When a common node is reached, the number of the common node which terminates the path is then inserted, with a bit marker to indicate the path terminates at this common node.

The entire encoding for the subgraph of all paths from the root to y₂ is therefore given by: 1.2.→A.A.1.1.→B.2.1.→B.B.1.→A

Given that each of these numbers is represented in the minimum possible space, the entire subgraph is represented in only 48 bits.
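One way to read such an encoding is as a flat token sequence that splits into path segments between common nodes. The Python sketch below follows that reading. It assumes a textual form in which a number is an edge identifier, a bare letter is a common node acting as a path origin, and `->X` marks a path terminating at common node X. This textual form, the `parse_multipath` name, and the use of "root" for the implicit starting point are our assumptions and do not reproduce the bit-level layout of the embodiment.

```python
def parse_multipath(encoding):
    """Split a dotted multipath encoding into (origin, edges, end) segments."""
    segments = []
    origin, edges = "root", []
    for token in encoding.split("."):
        if token.startswith("->"):
            # Terminating marker: close the current path at this common node.
            segments.append((origin, tuple(edges), token[2:]))
            edges = []  # further paths may start from the same origin
        elif token.isdigit():
            edges.append(int(token))  # ordinary edge identifier
        else:
            origin, edges = token, []  # common node as a new path origin
    return segments


segments = parse_multipath("1.2.->A.A.1.1.->B.2.1.->B.B.1.->A")
```

Read this way, the example subgraph consists of one path from the root to A, two paths from A to B, and one path from B back to A, which is the cycle through y₂ described above.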

Representing Multiple Elements

So far the encoding methods we have looked at have all had only a single node of interest (the terminal node of all the paths). However, if we wish to encode the subgraph containing the Near Set in such a way as to determine distance, we need to have some mechanism for representing a subgraph with multiple terminal nodes.

We achieve this with a very minor addition to the method described above, which additionally helps to make distance determination more efficient.

As we are interested in finding distances quickly, we do not want to have to always travel down all paths of a subgraph. Indeed, doing so would make this technique substantially less efficient than many others previously mentioned. To overcome this necessity, for each common node we pre-compute the minimum distance from that node to the terminal node, considering only forward edges. (If the definition of proximity chosen is shortest path between nodes, this pre-computed distance must consider all edges). We store this pre-computed value in the encoding, immediately after the occurrence of the common node as the path origin ("A" or "B" without the arrows in the example above). Note that this pre-computed distance is local to the encoded subgraph. Changes in the database only need to be reflected here if the encoded subgraph is involved. Even when this does occur, the pre-computed distance is local, and the value can be directly computed from the encoding. As such, re-computing this distance in response to database changes is not too expensive.

As mentioned above, the basic approach is to "overlay" subgraphs until they diverge, and then use the encoding to compute the remaining distance. Inclusion of this pre-computed value means that we only need to count individual nodes (using a non-iterative bit counting algorithm) until we reach the next common node, from which we can extract the minimum distance. Thus inclusion of this minimum distance significantly reduces the amount of processing required for determining proximity.

Pre-computed values are included for both the multipath encoding described above and the set encoding described here. Once we have this notion of including pre-computed distances, identifying multiple terminal nodes becomes easy. For multipath encoding, we only include the pre-computed value for common nodes. Where multiple terminal nodes are involved, we include pre-computed values for both common nodes and terminal nodes. A terminal node is therefore identified as any node with a pre-computed distance of zero. As both common nodes and terminal nodes have the shortest distance explicitly stored, they are collectively referred to as annotated nodes. Note that when dealing with multiple elements, common nodes or terminal nodes are assigned identifying numbers so as to be unique amongst enumerated nodes.
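The annotation step can be illustrated with a short Python sketch that computes, for every annotated node, the minimum forward distance to a terminal node. The segment format `(origin, number_of_edges, end)`, the function name `min_distance_to_terminals`, and the simple repeated-relaxation loop are assumptions chosen for brevity and are not taken from the embodiment.

```python
INF = float("inf")


def min_distance_to_terminals(segments, terminals):
    """Minimum number of forward edges from each node to any terminal node."""
    nodes = {origin for origin, _, _ in segments}
    nodes |= {end for _, _, end in segments}
    # Terminal nodes are exactly those annotated with a distance of zero.
    dist = {node: (0 if node in terminals else INF) for node in nodes}
    changed = True
    while changed:  # relax until stable; copes with cycles in the subgraph
        changed = False
        for origin, length, end in segments:
            if dist[end] + length < dist[origin]:
                dist[origin] = dist[end] + length
                changed = True
    return dist


# Segments of the example subgraph: root to A (2 edges), A to B twice
# (2 edges each), and B back to A (1 edge). Node A is the terminal node.
annotations = min_distance_to_terminals(
    [("root", 2, "A"), ("A", 2, "B"), ("A", 2, "B"), ("B", 1, "A")],
    terminals={"A"},
)
```

With these values stored next to each annotated node, a comparison that reaches such a node can add the stored distance instead of following the remaining paths.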

Compressed Arrays

In order to efficiently implement this encoding scheme, we utilise a data structure called a compressed array. Compressed arrays are designed to store numbers in a small amount of space in a way that supports our encoding mechanisms. Encoding mechanisms are supported by a structure which enables them to be self describing, facilitates the efficient determination of different types of numbers (or rather, efficient determination of numbers which represent different things), enables efficient comparisons between a series of numbers in a single operation, and allows efficient traversal through the data structure. Compressed arrays make heavy use of bitwise operations and optimisations for their most efficient usage. A compressed array is shown in figure 3, storing the numbers 1, 1 and 3.

Conceptually, a compressed array can be thought of as a pair of parallel bit patterns. One pattern, the identifier pattern, contains the bit patterns necessary for representing numbers . The bit patterns which represent two numbers typically follow directly on from one another, with no space between the most significant bit of one number and the least significant bit of the second. Sometimes, for reasons which will be discussed shortly, numbers are separated by 1 or 2 unset (ie. "zero") bits.

The second bit pattern, the boundary pattern, contains a set bit which denotes the boundary (or most significant bit) of the corresponding identifier pattern. This has the useful property that the number of elements in a compressed array can be determined by counting the number of set bits in the boundary pattern. If a non-iterative bit counting algorithm is used, this can provide a constant time determination of the number of set bits. Looking ahead a little, each set bit of a path encoding corresponds to an edge in the graph. Counting the number of set bits in the encoding therefore corresponds to counting the number of edges. Using this technique, we are therefore able to count the number of edges in an encoding in constant time.
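The pair of parallel bit patterns described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the names build_compressed_array and popcount are hypothetical, cells are packed from the least significant bit upwards at minimum width, and the string-based bit count merely stands in for a non-iterative bit counting algorithm.

```python
def build_compressed_array(numbers):
    """Pack positive integers into (ident, bound), least significant cell first.

    Assumption: every cell uses the minimum width, as for plain path encoding.
    """
    ident = 0
    bound = 0
    pos = 0  # index of the next free bit
    for n in numbers:
        if n < 1:
            raise ValueError("this sketch only packs positive integers")
        width = n.bit_length()            # minimum number of bits for n
        ident |= n << pos                 # identifier pattern holds the value
        bound |= 1 << (pos + width - 1)   # boundary bit marks the cell's top bit
        pos += width
    return ident, bound


def popcount(x):
    """Count set bits (stand-in for a non-iterative bit counting algorithm)."""
    return bin(x).count("1")


# The numbers 1, 1 and 3 mentioned for figure 3.
ident, bound = build_compressed_array([1, 1, 3])
print(bin(ident), bin(bound), popcount(bound))  # 0b1111 0b1011 3
```

Counting the set bits of the boundary pattern gives the number of cells, which for a path encoding is the number of edges.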

Conceptually, in an embodiment of the present invention, the boundary pattern can be used to indicate the level of a node in a graph (it acts as a "levelidentifier" - see later), and the identifier patterns can be used as the "edgeidentifier" (see later).

We can now see how the encodings are actually stored in the data structure. Figure 3 shows the path encoding for the path from the root to node x₁ in figure 2. Note that all cells are the minimum width (ie. all numbers are represented in the minimum space).

Above we mentioned that a special bit-marker was included to indicate that a path terminates at a common or terminal node. This is achieved by including a single leading zero. Thus, if a cell has a single leading zero, it indicates that the number identifies the annotated node where the path terminates. Figure 4 shows an example of multi-path encoding for the subgraph containing all paths from the root to y₂ in figure 2. In figure 4, each cell which represents the annotated node where the path ends is labeled "Path encoding segment terminates ...". Note that each of these cells has a most significant bit of zero. All such cells can be identified non-iteratively by obtaining a bit mask as follows:

path terminus cell bitmask = ca.bound AND NOT (ca.ident)

Similarly, we mentioned that all annotated nodes from which paths originate were specially marked. Such cells in a compressed array are indicated by a cell with value 0 and width 1. These are distinguished by their position in the encoding. For example, the cell which represents annotated node "3" will be the third annotated node included in the encoding. Thus, in figure 4 node "A" corresponds to annotated node "1", and node "B" corresponds to annotated node "2". Depending on the size of the encoding and frequency of use, these nodes can be either indexed for constant time retrieval of the i-th node, or iterated through (a typical encoding has relatively few annotated nodes).

Only cells which indicate annotated nodes may have a value of zero. This poses a problem, as previously we stated that distances of annotated nodes were stored, and that terminal nodes were identified by the value zero. To solve this problem, the value stored is actually 1 greater than the distance. This ensures that only cells indicating annotated nodes will have the value zero.

Annotated cells can similarly be identified by a non-iterative bitwise operation, as follows:

path originating cell bitmask = (ca.bound AND NOT (ca.ident)) AND ((ca.bound << 1) OR 1)
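The two bit masks can be tried out on a small hand-built example. The following Python sketch is illustrative only: pack_cells, zero_top_bit_mask and path_origin_mask are hypothetical names, and the cell layout (least significant cell first, explicit widths) is an assumption made for the example. Note that the first mask, as written, flags every cell whose top bit is unset, so in this example it also flags the width-1 zero cell; the second mask then isolates that cell.

```python
def pack_cells(cells):
    """Pack (value, width) cells, least significant cell first, into (ident, bound)."""
    ident = bound = pos = 0
    for value, width in cells:
        if width < 1 or value >> width:
            raise ValueError("value does not fit in the given width")
        ident |= value << pos
        bound |= 1 << (pos + width - 1)   # boundary bit at the cell's top bit
        pos += width
    return ident, bound


def zero_top_bit_mask(ident, bound):
    """ca.bound AND NOT(ca.ident): boundary bits of cells whose top bit is 0."""
    return bound & ~ident


def path_origin_mask(ident, bound):
    """Cells of value 0 and width 1: top bit is 0 and the cell directly
    follows another boundary bit (or sits at bit index 0)."""
    return (bound & ~ident) & ((bound << 1) | 1)


# Hypothetical layout: edge 1, an origin marker (0, width 1), edge 2,
# and the number 1 stored with one leading zero (width 2).
ident, bound = pack_cells([(1, 1), (0, 1), (2, 2), (1, 2)])
print(bin(zero_top_bit_mask(ident, bound)))  # 0b100010
print(bin(path_origin_mask(ident, bound)))   # 0b10
```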

Our encoding schemes are designed in such a way as to be easily self identifying, without the need for any additional storage. The method of indicating annotated nodes leads to the following corollaries:

For compressed array ca, an encoding scheme uses:

multi-path encoding iff ca.bound AND ca.ident ≠ ca.ident
path encoding iff ca.bound AND ca.ident = ca.ident

Figure 4: Compressed Array with Multiple Path Encoding (Boundary Pattern and Identifier Pattern shown from Bit Index 0 to Bit Index 12; path encoding segments originate from the incoming edge and from node "A"; the cells where path encoding segments terminate are marked)

Proximity Determination

We can now look at the algorithms that enable us to quickly and efficiently calculate distance, through exploitation of the encoding schemes and the compressed array data structure.

The most fundamental algorithm, described in section (a), enables us to determine the position where two paths (or two path encoding segments) diverge, in constant time.

This algorithm heavily utilises bitwise operations and optimisations, to ensure fast operation.

As is apparent, the algorithm described in section (b) contains a loop. In the worst case, this executes once for every edge in the subgraph for the element of Find Set. In practice, however, this rarely occurs. Wherever possible, the algorithm considers multiple edges in a single execution of a loop cycle. Furthermore, loops are only repeated if the path segment being considered is exactly the same in both Near Set, N, and the element from Find Set, F. Even if the worst case does occur, for a typical subgraph, G, with E edges, which represents all paths from the root to a single element of Find Set, |E| ≤ K ≪ …, for some constant K. Thus, even though in the worst case the algorithms are O(… × F) for the entire Find Set, in practice, performance tends to be O(F).

(a) Path Divergence

The path divergence algorithm provides a fast, constant time algorithm for determining where two paths diverge. These paths may terminate at the final node (for example, the element of Find Set) or any node in the encoding (if the path encoding segment has been extracted from a multi-path encoding, for example).

This algorithm is fundamental to the proximity determination algorithm. The divergence algorithm is described below.

Algorithm: Direct Path Comparison

Input: Two compressed arrays, ca₁ and ca₂, which contain the normalised path encoding segments terminating at Node₁ and Node₂ respectively

Output: DPC_index, the offset + 1 (in bits) from the start of ca₁ and ca₂ where the paths diverge (if they diverge). 0, if the paths do not diverge.

[1] MaxDiff = (ca₁.bound XOR ca₂.bound) OR (ca₁.ident XOR ca₂.ident)
[2] DPC_index = LowBit(MaxDiff) + 1

where LowBit(n) is a non-iterative algorithm which returns the index of the lowest set bit in n, or -1 if n = 0. Understanding the algorithm is much easier if we recall that for path encoding, every set bit in the boundary pattern of a compressed array corresponds to an edge in the graph. Step [1] finds the first bit where the compressed arrays differ, considering them from the start of the compressed array (which corresponds to the incoming root edge). For the purposes of this algorithm, we are only concerned with the first set bit of MaxDiff. Conceptually, this corresponds to the outgoing edges of the node where the paths diverge.

Consider using this algorithm to find the node where the paths from the root to x₁ and y diverge in figure 2. Step [1] corresponds to "blanking out" the incoming root edge and the incoming edge to w₁, as these edges are common to both paths. The first set bit will lie somewhere in the cell which contains either the incoming edge to x₁ or y (it does not matter which). What is important is only that common edges have been "blanked out".

Note that in step [1], we are unconcerned about the value of the number stored in the compressed array. From the perspective of this algorithm, the only purpose of unique edge identifiers is to guarantee different bit patterns where appropriate. By treating the compressed arrays as bit patterns, we can efficiently compare and count many edges in a single comparison.

Step [2] then takes the bit mask where the least significant set bit corresponds to the point of divergence, and returns one more than the index of this set bit. As LowBit returns -1 when no set bit exists, the result is 0 if the paths do not diverge.
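Steps [1] and [2] translate almost directly into code. The Python sketch below is a hedged illustration: low_bit and direct_path_comparison are hypothetical names, the inputs are assumed to be already normalised identifier and boundary patterns held in ordinary integers, and LowBit is realised here with the two's complement trick n AND -n, which is one common way to isolate the lowest set bit.

```python
def low_bit(n):
    """Index of the lowest set bit of n, or -1 if n == 0."""
    return (n & -n).bit_length() - 1


def direct_path_comparison(ident1, bound1, ident2, bound2):
    """Return DPC_index: offset + 1 of the first differing bit, or 0 if none."""
    max_diff = (bound1 ^ bound2) | (ident1 ^ ident2)   # step [1]
    return low_bit(max_diff) + 1                       # step [2]


# Paths with edge identifiers 1, 1, 3 and 1, 1, 2 share their first two edges
# (bits 0 and 1) and first differ at bit index 2.
print(direct_path_comparison(0b1111, 0b1011, 0b1011, 0b1011))  # 3
# Identical encodings do not diverge.
print(direct_path_comparison(0b1111, 0b1011, 0b1111, 0b1011))  # 0
```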

The input compressed arrays for this algorithm contain "normalised" path encoding segments. This means that the compressed arrays contain only path encoding segments, both starting at the least significant bit. In practice, these segments are often extracted from multi path encoding. For a path encoding segment starting at bit index m and ending at bit index n, this is easily achieved using the following bitwise operations.

ca_temp.bound = (ca.bound AND low_bit_mask(1 << n)) >> m
ca_temp.ident = (ca.ident AND low_bit_mask(1 << n)) >> m

where low_bit_mask() provides a bit mask which sets all bits less than or equal to the least significant set bit of n, and unsets all other bits. This allows us to unset all bits which are above the highest bit we are interested in. low_bit_mask() is easily implemented by:

low_bit_mask(n) = NOT (n XOR NOT(n - 1))
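The normalisation step can be checked with a short Python sketch. The function name extract_segment is hypothetical, and the example patterns are made up for illustration; Python integers behave as unbounded two's complement values, so NOT maps directly onto the ~ operator.

```python
def low_bit_mask(n):
    """Set all bits at or below the least significant set bit of n."""
    return ~(n ^ ~(n - 1))


def extract_segment(ident, bound, m, n):
    """Normalise the segment occupying bit indexes m..n so it starts at bit 0."""
    keep = low_bit_mask(1 << n)            # clears every bit above index n
    return (ident & keep) >> m, (bound & keep) >> m


print(low_bit_mask(8))   # 15
print(low_bit_mask(12))  # 7

# Hypothetical patterns: the cell in bits 2..3 holds the value 2 (binary 10).
seg_ident, seg_bound = extract_segment(0b011001, 0b101011, 2, 3)
print(bin(seg_ident), bin(seg_bound))  # 0b10 0b10
```

The expression given for low_bit_mask simplifies to n XOR (n - 1), since NOT (a XOR NOT b) equals a XOR b.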

(b) Multi Element Comparison

This algorithm finds the proximity of the element of Find Set to the nearest element of the encoded Near Set. The algorithm itself is shown below.

Input: A compressed array, ca_N, containing the multi-element encoding of the Near Set. A compressed array, ca_F, containing the multi-path encoding of all paths from the root to the specified element of the Find Set, E_F.

Output: dist, the shortest path from E_F to the closest element in Near Set

Notes: In step [3], we assume the path encoding segment found is bounded by annotated nodes A_i and A_j, and goes from bit indexes m to n.

In step [5], we assume the path encoding segment found is bounded by annotated nodes A_g and A_h, and goes from bit indexes o to p.

[1] Initialise Prev_Tmp_Dist_F = Prev_Tmp_Dist_N = MAXINT, PE_Start_F = PE_Start_N = 0
[2] Loop
[3]   Find PE_F, the maximal path encoding segment starting from PE_Start_F
[4]   Next_Tmp_Dist_F = MIN (Prev_Tmp_Dist_F + No of set bits in ca_F.bound between m and n, Stored distance for A_j)
[5]   Symmetrical step to [3] and [4] for ca_N
[6]   If Length (PE_F) > Length (PE_N)
[7]     DPC_Index = Index of path divergence + 1, using the DPC_Index algorithm, for corresponding portions of ca_F and ca_N.
[8]     If DPC_Index > 0
[9]       Find_Dist = MIN (Next_Tmp_Dist_F + No of set bits in ca_F.bound between (DPC_Index - 1) and n, Prev_Tmp_Dist_F + No of set bits in ca_F.bound between m and (DPC_Index - 2))
[10]      Symmetric step to [9] for ca_N
[11]      dist = MIN (dist, Find_Dist + Near_Dist)
[12]      If no paths in queue, exit loop. Otherwise, restore state from head of queue and continue loop.
[13]    Find_Dist = MIN (Prev_Tmp_Dist_F + No of set bits in ca_F.bound between m and m+p-o, Stored distance for A_j + No of set bits in ca_F.bound between m+p-o+1 and n)
[14]    Near_Dist = Next_Tmp_Dist_N
[15]    dist = MIN (dist, Find_Dist + Near_Dist)
[16]    Next_Edge_ID = element in ca_F starting at index m+p-o+1
[17]    If a path originating from A_j has initial edge identifier Next_Edge_ID
[18]      PE_Start_F = m + p - o + 1. Reset values to correspond to new path encoding segment.
[19]    else
[20]      If no paths in queue, exit loop. Otherwise, restore state from head of queue and continue loop.
[21]  else if Length (PE_F) < Length (PE_N), perform symmetric steps to [7] through [16]
[22]  else
[23]    Symmetric steps to [7] through [16]
[24]    If we have visited this annotated node before
[25]      If no paths in queue, exit loop. Otherwise, restore state from head of queue and continue loop.
[26]    For each path encoding segment originating from A_j with initial edge identifier Next_Edge_ID
[27]      If path with initial edge identifier Next_Edge_ID originates from A_h, add edge and current state to queue.
[28]    If no paths in queue, exit loop. Otherwise, restore state from head of queue and continue loop.

Conceptually, this algorithm works as follows.

(a) The path to the element from Find Set is "overlaid" on the subgraph containing the Near Set until the paths diverge, or an annotated node is reached.

(b) If the paths diverge, we determine if the point of divergence is closer to an annotated node in the Find Set or the Near Set .

(c) If we come to an annotated node, our actions depend on which graph the annotated node was in.

(d) If the annotated node is only in the encoding for Near Set, or if the point of divergence was closer to an annotated node in the Near Set, we store the minimum distance from this annotated node to the terminal node in Near Set. We then continue following the appropriate edge in Find Set, if it exists (if it doesn't exist, this means the paths diverge at this annotated node). If the next edge does not exist, we calculate the shortest distance for this path, and continue considering the other queued paths.

(e) If the annotated node is only in the encoding for the Find Set, or if the point of divergence was closer to an annotated node in the Find Set, we perform symmetrical steps to step (d).

(f) If we come to annotated nodes in both Find Set and Near Set, or if the point of divergence is equal distance from annotated nodes in both graphs, we also perform symmetric steps to step (d). However, in this situation we also have the possibility that more than a single path will be found in common. For that reason, we check each edge which leaves the annotated node in Find Set. If this edge also appears in Near Set, we add that edge to a queue. We also have the possibility that this node indicates a cycle. For this reason, we only add the edge to the queue if we have not visited this node before.

For the algorithm described in figure 7, steps [3] through [5] correspond to point (a) above. Steps [6], [21] and [22] correspond to points (b) and (c) above. Steps [7] through [20] correspond to point (d) above. Step [21] corresponds to point (e) above. Finally, steps [23] through [27] correspond to point (f) above. Step [28] ensures we consider all relevant edges.

Many of the steps above require us to store the minimum distance between the element of Find Set and the nearest element of Near Set. Strictly speaking, for each iteration of the loop, we progressively obtain the minimum distance out of all paths considered so far. This is a simple case of taking the smaller of the previously found minimum distance and the minimum distance for the path we are examining.

Calculating the minimum path for the path we are examining is therefore a very important procedure. As this operation is frequently performed, we need an efficient mechanism of doing so. Utilising the path divergence algorithm described in (a) and exploiting some bitwise optimization algorithms, we can do this in constant time.

When calculating distance, we are not sure if the shortest path passes through the previous annotated node we visited, or the next annotated node we would visit (or are visiting if the paths diverge at an annotated node) . We deal with this by calculating the minimum distance from the point of divergence to the element of Find Set, calculating the minimum distance from the point of divergence to the nearest element of Near Set, and then summing these distances. Each of these individual distance calculations is performed in constant time using a combination of the path divergence algorithm described in (a), the stored distances, and non-iterative bit counting algorithms .
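The choice between the previous and the next annotated node can be sketched in Python along the lines of step [9] of the algorithm. This is a simplified illustration only: count_set_bits and distance_from_divergence are hypothetical names, prev_stored and next_stored stand for the running distances associated with the previous and next annotated nodes, and the string-based bit count stands in for a non-iterative bit counting algorithm.

```python
def count_set_bits(pattern, lo, hi):
    """Number of set bits of pattern between bit indexes lo and hi inclusive."""
    if hi < lo:
        return 0
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return bin(pattern & mask).count("1")


def distance_from_divergence(bound, m, n, dpc_index, prev_stored, next_stored):
    """Smaller of the route via the next annotated node and the route via the
    previous one, for a segment occupying bit indexes m..n of bound."""
    via_next = next_stored + count_set_bits(bound, dpc_index - 1, n)
    via_prev = prev_stored + count_set_bits(bound, m, dpc_index - 2)
    return min(via_next, via_prev)


# Boundary pattern of a three-edge segment, diverging at bit index 2 (DPC_index 3).
bound = 0b1011
print(distance_from_divergence(bound, 0, 3, 3, prev_stored=0, next_stored=5))   # 2
print(distance_from_divergence(bound, 0, 3, 3, prev_stored=10, next_stored=0))  # 1
```

Each call touches only the boundary pattern and two stored values, which is why the cost does not depend on how many edges the segment contains.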

Examining this distance calculation step in more detail, we perform each distance calculation by first identifying the point of divergence using the constant time path divergence algorithm described in (a). We then calculate the minimum distance by taking the minimum of the path through the previous annotated node and the path through the next annotated node. As we store the minimum distance at each annotated node, this calculation is simply a combination of retrieving this stored distance and counting the additional edges from the point of divergence to the node using a non-iterative bit counting algorithm.

Tracing an Example

This algorithm can be more easily understood by following an example. Graph 1 in figure 5 shows an arbitrary sample graph. Suppose we want to find all x near y. Thus x nodes comprise the Find Set and y nodes comprise the Near Set. Graph 2 shows the subgraph which includes the Near Set. Graph 3 represents the subgraph containing all paths from the root to x₁. Graph 4 represents the subgraph containing all paths from the root to node x₂. We will trace this example with reference to the graph for purposes of clarity, and indicate the corresponding steps of the algorithm and discuss the actual bitwise operations which correspond to each step.

Considering Graph 3

First we consider comparing graph 3 (x₁) with graph 2 (the Near Set). As graph 3 is "overlaid" with graph 2, we first come to an annotated node (the parent of y₁). The algorithm determines this in steps [3] through [5]. We always start "overlaying" edges from the incoming root edge, which corresponds to bit index 0 for all encodings. Step [3] then calculates the longest path encoding segment for the encoding of graph 3, starting at the current beginning index. As path encoding is used, this value is 2, which corresponds to the fact that there are 2 edges in graph 3 and no annotated nodes. (Recall that as the encoding contains no annotated nodes, the node of interest is assumed to lie at the end of the listed edges.)

Similarly, step [5] determines that the maximum length of path encoding segment for graph 2 starting from the current beginning index is 1.

We now know that we are going down the branch which corresponds to steps [7] through [20]. At this point, since we have reached a common node in Near Set, we need to record the stored minimum distance from this node to the nearest element of Near Set (in this case 1, as y₁ is only 1 edge away from the annotated node). This is performed in step [4]. Retrieving this value is done using a series of simple bitwise operations on the underlying compressed array. Note that this method allows us to find the distance between two nodes in different paths, without the need of actually traversing both paths. This benefit is compounded if a number of other annotated nodes existed between the (current) parent of y₁ and y₁.

We now reset the appropriate index counters (in step [18] ) for the encoding of graph 2 and graph 3 separately, to indicate where in the graph comparison we are up to.

We now repeat the search for the node of divergence or annotated node (or the end of the Find Set encoding, as occurs here), starting from the bit index just calculated. Conceptually, this means that we start our new search from the annotated node we have just left (the parent of y₁). This time, this comparison leads us down the branch which corresponds to step [23].

We now find we have come to the end of the path encoding segment. It is now time to calculate the minimum distance. This is performed in step [13]. We next explain the rationale behind this step and why it works. Firstly, we retrieve the stored distance for the next annotated node (y₂; the value is 0 in this case). We calculate the forward distance for the Near Set by counting the number of edges from the current node (x₁) to the next annotated node (3 edges) and adding it to the stored distance we just retrieved. (This gives a total forward distance of 3.) Next, we calculate the shortest distance considering paths through the previous annotated node. We count the number of edges from the current node to the previous annotated node (y₁; this gives a distance of 1). We add this distance to the minimum value retrieved from this node (1), to give the total distance through the previous annotated node (2 in this case). In this case, there is no need to repeat previous operations for the encoding of graph 3, as the paths do not diverge. If they did, we would perform the same operations for the encoded Find Set.

Note that all the distance counting and stored distance retrieval are implemented using optimized bitwise operations, and are performed in constant time.

We have now determined the minimum distance from the first element of Find Set to the nearest node in the encoding of Near Set. Note that whilst the encoding of Near Set itself is not a tree, the subset of the graph we needed to directly consider was a tree.

Considering Graph 4

Graph 4 shows a more complex example for the graph comparison algorithm. The same basic ideas, however, are followed.

Once again, the encodings are compared to see which graph contains an annotated node first. In this case, they both come to the same annotated node. Once again, the shortest distance is computed. We now must check for the possibility that there is more than one common path exiting from this annotated node. To do this, we check the edge identifier of each outgoing edge in the encoding of Find Set. If this also appears as an outgoing edge identifier for the encoding of Near Set, the edge is added to the queue to be considered further. If not, the edge is not considered any further. (Note that any contributions discarded edges make to the minimum value are already taken into account in the stored value for the node in the encoding of Find Set.) In this case, both outgoing edges are considered. We next consider each such path individually. Note that the left path to y₂ contains no annotated nodes. This means that this segment corresponds to a path encoding segment in both encodings, so we can compare the entire path of 4 edges in a single, constant time operation. The right hand path requires 2 such operations, as the path is broken by the annotated node y₁.

This brings us to node y₂. Again, upon arriving we have already calculated the minimum distance to the nearest element of Near Set (which is 0), and the minimum distance to the element of Find Set (which is 1). Once again, we check whether any outgoing edges from this node in Find Set exist in Near Set. In this case, there are none. As the queue is empty, we return the current value of dist as the final distance. Note that we are guaranteed to have calculated the minimum distance, as each iteration of the loop, for any condition, calculates the minimum distance it has observed so far (this of course is frequently revised in subsequent iterations). This is done in steps [11] and [15], and the corresponding symmetrical steps implied in steps [21] and [22].

Updates

As for many Web databases, we consider that reading will occur more frequently than updates. Handling efficient updates, however, is vital for this scheme to be workable in practice.

Updates to the index are performed in the following steps :

1. Obtain encodings affected by the changes

2. Reflect changes in the encoding schemes

3. Re-calculate shortest distances (if necessary) .

Perhaps surprisingly, if no new mechanism is introduced, the most expensive step in this process is step 1. Steps 2 and 3 modify existing graph algorithms. In practice, most subgraphs are single paths, or possibly DAGs. This means that whilst the worst case performance may not be good, the average case performance is indeed very good. With the optimisations and increased speed provided by our encoding schemes and bitwise operations, these algorithms have increased performance in practice. We assume that retrieval is more frequent than updates. As such, our primary index is optimised for retrieval: to minimise retrieval time and space requirements, our encodings are stored in a hash table, with the OID as key. Unless we provide an alternative method for accessing encodings, this means that unless a node is the terminal node in an encoding, retrieval can be very expensive. To cope with this, we build an alternative index on the encodings, which allows encodings to be retrieved based on included nodes, rather than just on terminal nodes.
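The relationship between the primary hash table and the alternative index can be sketched as follows in Python. This is an assumed, simplified layout rather than the structure of the embodiment: the class EncodingIndex and its methods are hypothetical, encodings are treated as opaque values, and the caller is assumed to supply the list of nodes included in each encoding.

```python
from collections import defaultdict


class EncodingIndex:
    """Primary index keyed by terminal-node OID plus a secondary index that
    maps every included node to the OIDs of the encodings containing it."""

    def __init__(self):
        self.by_oid = {}                  # OID -> encoding (hash table)
        self.by_node = defaultdict(set)   # included node -> set of OIDs

    def add(self, oid, encoding, included_nodes):
        self.by_oid[oid] = encoding
        self.by_node[oid].add(oid)        # the terminal node is included too
        for node in included_nodes:
            self.by_node[node].add(oid)

    def get(self, oid):
        """Retrieval by terminal node: a single hash lookup."""
        return self.by_oid[oid]

    def containing(self, node):
        """All encodings affected by a change to the given node."""
        return {oid: self.by_oid[oid] for oid in self.by_node.get(node, ())}


index = EncodingIndex()
index.add("x1", (0b1111, 0b1011), ["root", "w1"])
index.add("y2", (0b1011, 0b1011), ["root", "w1", "A"])
print(sorted(index.containing("w1")))  # ['x1', 'y2']
```

With only the primary table, finding the encodings that contain "w1" would require scanning every stored encoding.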

Reflecting changes in the encodings

Obtaining the encodings affected by the changes is no different than evaluating the proximity queries, i.e., retrieving the encodings according to the OIDs in the result set returned by the query processor. Next, we need to update the changes into the encodings. From a structural perspective, there are 4 primitive operations we need to handle .

1. Inserting a new node:

This step covers the cases of inserting a new external node connected by a single edge, and inserting a node in the middle of an edge. This case is the easiest to reflect in our encoding scheme. If the added node, OID_new, is external (connected by a single new edge to OID_old), all that is required is to retrieve the encoding which represents OID_old, copy it, and insert the new edge identifier and encoding for OID_new.

If the node is inserted in the middle of an existing edge, we additionally need to retrieve the encodings which contain this edge. As the newly inserted node will (initially) have only a single outgoing edge (corresponding to the edge into which it is being inserted), all we need do is insert a cell representing the number 1 (the edge identifier of the outgoing edge from this new node). This is easily accomplished in a single sequence of bitwise operations.

2. Inserting a new edge amongst existing nodes:

A new edge may be inserted within an existing subgraph, or it may connect two previously disconnected subgraphs. If inserted within an existing subgraph, we obtain the encodings that contain the origin and destination of the new edge, and include the new edge in the encoding scheme .

If the edge connects two previously disconnected sub-graphs, we need to incorporate this information into the relevant encodings. As we consider all graphs to be directed, we are only interested in new paths which pass from the origin to the destination. For this reason, we retrieve the encoding corresponding to the origin, and all encodings which include the destination node. The encoding containing the origin node is then incorporated into all encodings containing the destination node.

3. Deleting a node (and all associated incident edges):

Deleting a node potentially results in disconnecting a subgraph from the root. If this occurs, it is possible that removal of one node will result in the deletion of many other nodes. We cater for this by again obtaining each encoding which contains this node. If the terminal node in each encoding is disconnected from the root as a result of the deletion, this indicates that this node has also been deleted. As such, we remove the encoding from our index. If the disconnected region does not contain the terminal node, we retain the encoding without the disconnected subgraph. If no graph is disconnected, the least we must do is delete associated incident edges.

4. Deleting an edge without removing any nodes:

Deletion of an edge may also potentially disconnect the subgraph. This is treated similarly to step 3 above.

Updates can be viewed as a combination of these 4 primitive operations. In order for the graph to be represented by our index, it must remain connected to the root. If any portion becomes disconnected from the root, we assume this subgraph has been deleted from the main graph and we no longer represent it in the index. We assume the database informs us of any such changes, and provides enough information for us to locate the appropriate encodings .

Re-calculating shortest distances

Whilst distances often need to be recalculated, this is not necessary all the time. If the new encoding uses path encoding, no distances need to be computed. As many encodings do use path encodings, frequently this step is not required at all.

If we are required to re-calculate minimum distances, it is imperative that we do so as efficiently as possible. A classical algorithm for computing the shortest distance between two points is Dijkstra's single-source shortest path algorithm (E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959). This algorithm is efficient for graphs in main memory. As we can always fit the subgraph in main memory, we have adapted Dijkstra's algorithm to work with our encoding scheme. We exploit our ability to consider multiple edges at once and our fast traversal between nodes using our bitwise operations to produce a very fast implementation of Dijkstra's algorithm.
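For reference, a plain textbook form of Dijkstra's algorithm for this task is sketched below in Python. It is not the adapted, bitwise version referred to above: it works on an ordinary edge list rather than on compressed arrays, assumes every edge has length 1, and the name distances_to_terminal is hypothetical. It computes, for every node that can reach the terminal node along forward edges, the minimum distance of the kind stored at annotated nodes.

```python
import heapq


def distances_to_terminal(edges, terminal):
    """Shortest forward distance from each node to terminal.

    edges: iterable of (src, dst) directed edges, each of length 1.
    The search runs from the terminal node over reversed edges.
    """
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, []).append(src)

    dist = {terminal: 0}
    heap = [(0, terminal)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry
        for pred in reverse.get(node, ()):
            nd = d + 1
            if nd < dist.get(pred, float("inf")):
                dist[pred] = nd
                heapq.heappush(heap, (nd, pred))
    return dist


edges = [("root", "A"), ("A", "b"), ("b", "t"), ("A", "t"), ("root", "c")]
print(distances_to_terminal(edges, "t"))
```

Nodes that cannot reach the terminal node, such as "c" in this example, are simply absent from the result.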

Two other factors contribute to lowering the cost of recomputing the shortest distance. The first is that we only need to compute this shortest distance for a subset of the nodes in the subgraph. A typical subgraph has very few annotated nodes (if any) which require pre-computation of distance. The second contributing factor is that a typical subgraph is relatively small, reducing the time taken by the algorithm. These three factors combine to make the distance computation efficient in practice.

Experimental Results

We have implemented this system on a Pentium III 800MHz processor with 128MB RAM running Red Hat Linux 7.0. We generated our XML data set by randomly generating XML elements with random topology. Multiple data sets were generated and average performance was obtained by repeatedly running the program with different data sets. Choosing the metric of database size is somewhat misleading for such examples, as our index considers only the structure itself. A document with 100 nodes can be as small as 1 KB or as large as 1MB. To give a fair indication, therefore, we consider only the number of nodes in the graph, without any reference to the size of the data stored at a single node.

Table 1 shows a comparison of the number of nodes and the size. We differentiate between the size of the encoding and the total size, which includes the space overhead of implementing the hash table. One reason for the precise linear nature of the space measurements is our implementation. In our implementation, we store the actual encoding in a multiple of 8 bytes (corresponding to 2 x 4 byte unsigned integers). This means that if the actual encoding only requires 24 bits, we still use 8 bytes for storage. This is done for performance considerations when calculating proximity. A result of this, however, is that whilst different graph topologies do have different space requirements, storing encodings in multiples of 8 bytes tends to "smooth" out this variation, resulting in the linear space measurements.

Table 2 shows the time taken to perform the proximity ranking for various sizes of Near Set and Find Set, when the Near Set is cached. As can be seen, the time is linear in practice for this case. It is worth noting that for any method which requires O(|F| × |N|) comparisons, comparing Find Set and Near Set of 100,000 nodes each would require O(10¹⁰) comparisons. We manage to achieve the same result in 0.44 seconds.

Table 3 shows the time taken to perform the proximity ranking for various sizes of Near Set and Find Set, when the Near Set is uncached. The increase in time is taken by the need to generate the single encoding of the Near Set. Once again, the time taken to generate the encoding for the Near Set is approximately linear. The larger jump which occurs as the size of Near Set and Find Set increases is due to the introduction of random disk accesses and paging, caused by the query processing. The overall performance, however, is still impressive in practice.

Table 1 - Number of Nodes vs Size

Table 2 - Number of Nodes vs Speed - Near Set Cached

Table 3 - Number of Nodes vs Speed - Near Set Uncached

The following description is of a particular implementation of an embodiment of the present invention.

Path Encoding

Path Encoding Schema

Contents: An encoded path, consisting of 2 bit patterns in this embodiment (see Figure 3 above): LevelIdentifier, a bit pattern representing the nodes existing in the encoded path; EdgeIdentifier, a bit pattern representing the edges followed in the encoded path. The following description applies only to path encoding, but it is extended to encode sets below. All paths are encoded assuming a (possibly virtual) incoming edge as the beginning of the path. The following terminology and symbols are used in this description.

• CurrentBit is a positive integer representing the set bit in LevelIdentifier currently under consideration. Numerically it is equal to the number of set bits between (and including) itself and the start of the bit pattern LevelIdentifier.

• BitPos[CurrentBit] is a non-negative integer and refers to the bit position of CurrentBit. BitPos[0] = 0.

• EdgeIdentifier(BitPos[CurrentBit]) refers to the bit position indicated by BitPos[CurrentBit] in the bit pattern EdgeIdentifier.

This encoding assumes that each outgoing edge from a given node has been assigned a unique number, EdgeID, and that this number remains constant for the lifetime of the encoding. To be compatible with set encoding, EdgeID should be non-zero.

An encoded path should be interpreted as follows :

1. Scan the bit pattern Levelldentifier from right to left.

2. For each set bit, CurrentBit, encountered in the bit pattern LevelIdentifier:

(a) EdgeIdentifier(BitPos[CurrentBit]) - EdgeIdentifier(BitPos[CurrentBit - 1] + 1) is the minimum number of bits needed to represent the EdgeID which indicates the CurrentBit-th edge in the path.

(b) The bit pattern in EdgeIdentifier bounded by EdgeIdentifier(BitPos[CurrentBit]) and EdgeIdentifier(BitPos[CurrentBit - 1] + 1), right shifted by BitPos[CurrentBit - 1], yields the unique EdgeID followed by the path.

(c) Assume that this edge is followed before considering the next CurrentBit.
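The interpretation steps above can be sketched in Python as follows. This is a minimal illustration only, assuming both bit patterns are held as non-negative integers with the rightmost bit at bit position 1; the function name decode_path is hypothetical and does not appear in the specification.

```python
def decode_path(level_identifier, edge_identifier):
    """Return the EdgeIDs of an encoded path, in the order they are followed."""
    edge_ids = []
    prev_pos = 0                      # BitPos[0] = 0
    remaining = level_identifier
    while remaining:
        lowest = remaining & -remaining        # isolate the next set bit (CurrentBit)
        pos = lowest.bit_length()              # BitPos[CurrentBit], counted from 1
        width = pos - prev_pos                 # number of bits holding this EdgeID
        # Take the bits between BitPos[CurrentBit - 1] + 1 and BitPos[CurrentBit],
        # right shifted by BitPos[CurrentBit - 1].
        edge_ids.append((edge_identifier >> prev_pos) & ((1 << width) - 1))
        prev_pos = pos
        remaining ^= lowest                    # move on to the next set bit
    return edge_ids
```

For example, the pair LevelIdentifier = 101001 and EdgeIdentifier = 111101 (binary) decodes to the EdgeIDs 1, 6 and 3.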

Although not specified in the encoding schema above, it is envisioned and highly recommended that the schema only be used to encode the minimum path, if any, which results in a cycle. Such a condition is built into the algorithm presented in example 2.

Step 2(b) involves obtaining only a subset of the set bits in EdgeIdentifier. One method of achieving this is described as follows.

If LevelIdentifier is implemented as a numeric type, obtain a copy of LevelIdentifier, LvlCopy, where the least significant set bit is BitPos[CurrentBit - 1]. This may be done by finding the least significant bit in a copy of LevelIdentifier and setting it to 0 when it is no longer needed, or by a variety of other methods.

A mask of the desired bits may be found by the following operations:

NextLvlCopy = (LvlCopy Bitwise_XOR Bitwise_NOT(LvlCopy - 1))

Mask = (LvlCopy Bitwise_XOR Bitwise_NOT(LvlCopy - 1)) Bitwise_XOR (NextLvlCopy Bitwise_XOR Bitwise_NOT(NextLvlCopy - 1))

The desired bits may then be found by

Desired-bits = EdgeIdentifier Bitwise_AND Mask
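The Mask expression can be illustrated with the following Python sketch. It rests on two assumptions that are not stated in the text: NextLvlCopy is taken to be LvlCopy with its least significant set bit cleared (the copy-and-clear approach described above), and the first edge, for which no earlier set bit exists, is handled by starting from a pattern of all ones. The names bits_above_lowest and edge_ids_by_mask are hypothetical.

```python
def bits_above_lowest(x):
    """x XOR NOT(x - 1): every bit strictly above the least significant set bit of x."""
    return x ^ ~(x - 1)

def edge_ids_by_mask(level_identifier, edge_identifier):
    """Extract each EdgeID with the XOR-built Mask, then right shift it into place."""
    edge_ids = []
    shift = 0            # BitPos[CurrentBit - 1]; 0 before the first edge
    prev_above = -1      # assumption: all ones stands in for "above bit position 0"
    lvl_copy = level_identifier
    while lvl_copy:
        above = bits_above_lowest(lvl_copy)
        mask = prev_above ^ above               # bits BitPos[CurrentBit-1]+1 .. BitPos[CurrentBit]
        desired_bits = edge_identifier & mask
        edge_ids.append(desired_bits >> shift)
        shift = (lvl_copy & -lvl_copy).bit_length()
        prev_above = above
        lvl_copy &= lvl_copy - 1                # clear the bit that is no longer needed
    return edge_ids
```

Python integers are unbounded, so the intermediate values are negative numbers standing for "all higher bits set"; a fixed-width implementation would see the same low-order bits.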

Desired-bits can then be right shifted appropriately to obtain the required number.

Path Encoding Algorithm

This algorithm is one possible method for encoding a single path in a graph to arrive at the above coding system.

Input: The root (virtual or actual) of the graph; some means of identifying the path to be encoded (subsequently referred to as "the path").

Output: LevelIdentifier, a bit pattern representing the nodes existing in the encoded path; EdgeIdentifier, a bit pattern representing the edges followed in the encoded path.

Together LevelIdentifier and EdgeIdentifier uniquely encode the path. The following algorithm assumes a bit encoding starting from the right of the bit pattern and growing towards the left.

The following algorithm only encodes the minimum path that results in a cycle. It assumes that less significant bits occur to the right of more significant bits, and bit positions are numbered from right to left, and that the right most bit position has a value of 1. However, the algorithm works for any other direction of growth and any order of bit significance, whether sequential or not.

[1] Let all bits of EdgeIdentifier and LevelIdentifier be 0

[2] Assume that the root (virtual or actual) has a single incoming edge (actual or virtual) with an associated number EdgeID

[3] For each edge, CurrentEdge, in the path, from the incoming root edge to the final edge in the path to be encoded:

[4] Let CurrentNode be the node containing CurrentEdge as an outgoing edge

[5] If CurrentEdge has not previously been assigned an identifying number, EdgeID,

[6] CurrentEdge is assigned a non-zero identifying number, EdgeID, which differentiates it from all other outgoing edges originating from CurrentNode.

[7] Let LevelLength be the bit position of the most significant set bit of LevelIdentifier (or 0 if no bit is set)

[8] Let EdgeIDLength be the smallest number of bits needed to represent the number EdgeID

[9] EdgeIdentifier = (EdgeIdentifier) Bitwise_OR (EdgeID LeftShift LevelLength)

[10] Set the bit in LevelIdentifier at bit position LevelLength + EdgeIDLength to 1

[11] If the node at the other end of CurrentEdge has already been visited, terminate algorithm
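Steps [7] to [10] can be sketched in Python as follows. This is an illustrative sketch only: it assumes the EdgeIDs along the path have already been assigned and are supplied as a list, so the EdgeID assignment of steps [4] to [6] and the cycle check of step [11] are left out. The function name encode_path is hypothetical.

```python
def encode_path(edge_ids):
    """Encode a path given as its EdgeIDs, starting with the incoming root edge."""
    level_identifier = 0
    edge_identifier = 0
    for edge_id in edge_ids:
        if edge_id <= 0:
            raise ValueError("EdgeID must be a positive integer")
        level_length = level_identifier.bit_length()   # step [7]; 0 if no bit is set
        edge_id_length = edge_id.bit_length()          # step [8]
        edge_identifier |= edge_id << level_length     # step [9]
        # Step [10]: positions are counted from 1, so subtract 1 for the shift amount.
        level_identifier |= 1 << (level_length + edge_id_length - 1)
    return level_identifier, edge_identifier
```

With the EdgeIDs 1, 6 and 3, the sketch yields LevelIdentifier = 101001 and EdgeIdentifier = 111101 (binary). These bit patterns are computed by the sketch and are not copied from any figure.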

The encoding algorithm assumes that the graph has a single root. If this is not the case, then the graph can be considered to have a virtual root whose children are all the roots, or arbitrary entry points if no natural set of roots exist, of the actual graph. In step [2] this root is considered to have a single incoming edge. If one or more roots of the actual graph have more than one incoming edge, as may be the case for multiple entry points referring to the same root, then once again a virtual root can be considered whose outgoing edges are the incoming edges of all the roots of the graph.

This algorithm only encodes the minimum path necessary to form a cycle, if any cycles exist in the path at all. This approach is sensible if one is encoding for the purposes of proximity detection. If one is encoding for other reasons, different criteria may be imposed.

Step [6] assigns a non-zero unique number to each outgoing edge. Note that this number need only be unique amongst all outgoing edges for a single node. The same number can be used to represent outgoing edges from different nodes. This unique number can be obtained in a variety of ways. One such way is described below, but it is by no means the only method for generating such unique numbers.

One method of obtaining a unique number is to keep a count of the number of EdgeIDs assigned for each node. A new edge would be assigned an EdgeID of (count + 1), with an accompanying increase in the count. This method does not guarantee that the graph can be reconstructed so that edges appear in the same order as in the original graph. This is not a problem if one is only using this encoding for determining proximity, but may have implications for other uses of this encoding method.

Such a method of generating unique EdgeIDs has the advantage of being efficient for reflecting updates, insertions (additions) and deletions in the underlying graph. Where updates involve changing the data that appears at a node, no change needs to be made to the encoding. As a result, structural changes can be considered a combination of insertions and deletions.

Inserting an outgoing edge to a node is reflected by assigning a new EdgeID (count + 1) to the new edge irrespective of where the new edge is positioned with respect to existing edges. This ensures that the new edge will have a unique ID, but does not guarantee that the original positioning of the edges can be recreated from the encoding. The new path(s) from the root, whether or not new nodes are added, can then be encoded.

Deleting a node merely requires the removal of the associated EdgeID as a valid EdgeID for the node in question. This may result in non-sequential EdgeIDs. This is not a problem for proximity determination for the reason stated above, but may be relevant for other uses of such encoding. Care must be taken to ensure the deletion is reflected in all paths which use the edge, but these are easy to determine from examining the original graph.
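The counting method of EdgeID assignment can be sketched as follows. The class and method names are hypothetical, and the sketch tracks only which EdgeIDs are valid for each node; re-encoding of affected paths is not shown.

```python
class EdgeIdAllocator:
    """Per-node EdgeID assignment by counting, as one possible embodiment."""

    def __init__(self):
        self._count = {}   # node -> number of EdgeIDs assigned so far
        self._valid = {}   # node -> set of EdgeIDs that are currently valid

    def insert_edge(self, node):
        """Assign EdgeID (count + 1) to a new outgoing edge of node."""
        new_id = self._count.get(node, 0) + 1
        self._count[node] = new_id
        self._valid.setdefault(node, set()).add(new_id)
        return new_id

    def delete_edge(self, node, edge_id):
        """Remove edge_id as a valid EdgeID for node; the count is not reduced."""
        self._valid.get(node, set()).discard(edge_id)

    def valid_ids(self, node):
        return sorted(self._valid.get(node, ()))
```

Because the count never decreases, an EdgeID is not reused after a deletion, which is why the EdgeIDs of a node may become non-sequential.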

The heart of the encoding method is found in steps [7] - [10]. These steps seek to represent the encoding in as small a space as possible. The smaller the EdgeID, the fewer bits required to represent the number, and so the less space required to encode the path. Figure 1 shows a generic Directed Acyclic Graph, with a particular path indicated. Figure 2 shows one possible encoding of this path using the algorithm above. Figure 2 shows that the first edge on the encoded path (which is the incoming edge to the root) has an EdgeID 1, the second edge followed in the path has an EdgeID 6, which in this case corresponds to the sixth outgoing edge from node N1 in Figure 1, and the final edge followed has EdgeID 3 (which in this case corresponds to the third outgoing edge from node N1 in Figure 1). The correspondence of EdgeID and the position of the outgoing edge is for illustrative purposes and is not guaranteed to be maintained over time.

The algorithm can easily be adapted to encode the path from the root to a given node, where it is passed an encoded path from the root to the given node. One such possible modification is given below for the purposes of illustration.

To modify the algorithm in the manner described, steps [1] to [3] become:

[1] Let EdgeIdentifier be the passed EdgeIdentifier, being one half of an encoded path from the root to GivenNode.

[2] Let LevelIdentifier be the passed LevelIdentifier, being the other half of an encoded path from the root to GivenNode.

[3] For each edge, CurrentEdge, in the path, from GivenNode to the final edge in the path to be encoded:

Encoding Sets of Paths

The encoding method can be extended depending on the encoding and decoding requirements and the precise topology of the graph.

Where multiple paths exist from the root to a given node, one solution is to use the path encoding as described above one time for each path. This has the advantage of being very fast to decode. But it also has the disadvantage of requiring, possibly significant, extra storage space. Depending on the number of paths and the precise topologies involved, this solution may therefore not be appropriate. This solution is most appropriate for situations where only a single path exists from the root to any given node, as is the case with trees, or in the case where multiple paths do not share common nodes.

In cases where multiple paths share common nodes, this encoding method can lead to a multiple number of encoded paths being stored. For example, in Figure 6 the three edges from N1 to N5, and three paths from N1 to N7 yield a total of nine different paths passing through N1, N5 and N7. Path encoding would encode each of these nine separate paths as individual instances. This can be inefficient with regard to the total space needed to encode the entire set of paths, and it increases the number of checks needed when the paths are decoded.

The path encoding scheme can be extended to encode sets of paths in a compact space. This encoded set would need to be expanded before distance calculation can be performed on it. As such expansion primarily involves bit masking and bit shifting, it is a relatively inexpensive process when compared with multiple disk seeks.

Set Encoding Schema

Contents: An encoded set of paths consisting of 2 or 3 bit patterns:

LevelIdentifier is a bit pattern representing the nodes existing in the encoded path; EdgeIdentifier is a bit pattern representing the edges followed and common nodes in the encoded set of paths; the optional NodeIdentifier is a bit pattern representing the terminal nodes in the set. NodeIdentifier is required if the set contains more than one terminal node.

Set encoding is used to encode a set of one or more paths.

The following description assumes a bit encoding starting from the right of the bit pattern and growing towards the left. It assumes that less significant bits occur to the right of more significant bits. This, once again, is for the purpose of illustration and not limitation. For explanatory purposes, this algorithm assumes that bit positions are numbered from right to left, and that the rightmost bit position has a value of 1.

The following explanations all assume that the encoded set is being scanned from right to left. The following encoding assumes a single root for the set of paths encoded. If this is not the case, a virtual root may be considered. This root may or may not have a single, virtual or actual, incoming edge.

The following terminology and symbols are used in this description:

CurrentBit is a positive integer representing the set bit in LevelIdentifier currently under consideration. Numerically it is equal to the number of set bits between, and including, itself and the start of the bit pattern LevelIdentifier.

BitPos[CurrentBit] is a non-negative integer and refers to the bit position of CurrentBit. BitPos[0] = 0.

EdgeIdentifier(BitPos[CurrentBit]) refers to the bit position indicated by BitPos[CurrentBit] in the bit pattern EdgeIdentifier.

CommonNode is any node in the graph which has more than one incoming edge or more than one outgoing edge, considering only edges from the paths to be encoded.

CurrentCommonNode is the common node currently being considered.

FromNode is the common node from which the next EdgeID is an outgoing edge. The first common node encountered in the encoding is always a FromNode. Whenever a FromNode is encountered, the next common node encountered will always be a ToNode (see below).

ToNode is a common node to which the following EdgeID is an incoming edge. The last common node encountered in the encoding is a ToNode. Whenever a ToNode is encountered, the next common node encountered will always be a FromNode. Each FromNode has a corresponding ToNode.

TerminalNode is a node at which a given path terminates.

CommonNodeID is a number which uniquely identifies each common node within the encoded set of paths.

• A non-zero CommonNodeID explicitly indicates the CommonNodeID of the common node under consideration.

• If the common node currently indicated is a FromNode and the CommonNodeID is 0, the actual CommonNodeID referred to is the CommonNodeID of the closest FromNode (scanning from right to left) that was explicitly indicated.

• If the common node currently indicated is a ToNode and the CommonNodeID is 0, the actual CommonNodeID referred to is the CommonNodeID of the closest ToNode (scanning from right to left) that was explicitly indicated.

A path encoded using set encoding should be interpreted as follows:

1. Scan the bit pattern LevelIdentifier from right to left.

2. For each set bit, CurrentBit, encountered in the bit pattern LevelIdentifier:

(a) If EdgeIdentifier(BitPos[CurrentBit]) is 1 // this indicates an EdgeID
    i. EdgeIdentifier(BitPos[CurrentBit]) - EdgeIdentifier(BitPos[CurrentBit - 1] + 1) is the minimum number of bits needed to represent the EdgeID which indicates the CurrentBit-th edge in the path.
    ii. The bit pattern in EdgeIdentifier bounded by EdgeIdentifier(BitPos[CurrentBit]) and EdgeIdentifier(BitPos[CurrentBit - 1] + 1), right shifted by BitPos[CurrentBit - 1], yields the unique EdgeID followed by the path.
    iii. If the EdgeID is immediately preceded by a FromNode or is immediately preceded by a ToNode which is immediately preceded by a FromNode
        A. The EdgeID is considered to indicate an outgoing edge from this FromNode, where the edge has the unique identifying number EdgeID.
    iv. If the EdgeID is immediately preceded by a ToNode
        A. The EdgeID is considered to indicate an outgoing edge from the node indicated by the preceding EdgeID, where the current edge has the unique identifying number EdgeID.
    v. If the EdgeID is the first ID in the bit pattern
        A. The EdgeID is considered to indicate the (virtual or actual) incoming edge to the encoded set.
    vi. If the EdgeID is immediately proceeded by another EdgeID
        A. The EdgeID is considered to indicate an outgoing edge to this ToNode.
    vii. Assume that this edge is followed before considering the next CurrentBit.

(b) otherwise // EdgeIdentifier(BitPos[CurrentBit]) is 0, which indicates a common node
    i. If this is the first common node encountered in this encoded set, this node is considered the "root" of this graph.
    ii. Determine if this is a FromNode or a ToNode. The first common node encountered is a FromNode. Subsequently common nodes alternate between ToNodes and FromNodes.
    iii. If CommonNodeID is not 0
        A. EdgeIdentifier(BitPos[CurrentBit]) - EdgeIdentifier(BitPos[CurrentBit - 1] + 1) is one more than the minimum number of bits needed to represent the CommonNodeID which indicates the CurrentNode under consideration. The CommonNodeID is stored in the range EdgeIdentifier(BitPos[CurrentBit - 1]) to EdgeIdentifier(BitPos[CurrentBit - 1] + 1), with EdgeIdentifier(BitPos[CurrentBit]) set to 0.
    iv. otherwise
        A. EdgeIdentifier(BitPos[CurrentBit]) - EdgeIdentifier(BitPos[CurrentBit - 1] + 1) is the minimum number of bits needed to represent the CommonNodeID which indicates the CurrentNode under consideration. This number of bits is 1 (as the number being represented is 0).
    v. The bit pattern in EdgeIdentifier bounded by EdgeIdentifier(BitPos[CurrentBit]) and EdgeIdentifier(BitPos[CurrentBit - 1] + 1), right shifted by BitPos[CurrentBit - 1], yields the CommonNodeID under consideration.
    vi. If the CommonNodeID is 0
        A. If this is a FromNode, CurrentCommonNode is the last FromNode that was explicitly referred to (i.e. with a CommonNodeID which was not 0).
        B. Else this is a ToNode, and CurrentCommonNode is the last ToNode that was explicitly referred to (i.e. with a CommonNodeID which was not 0).
    vii. otherwise // CommonNodeID is not 0
        A. CurrentCommonNode is the common node referred to by CommonNodeID.
    viii. If CurrentCommonNode is a FromNode
        A. The next EdgeID encountered is an outgoing edge from this node.
    ix. otherwise CurrentCommonNode is a ToNode.
        A. The next EdgeID encountered is an incoming edge to this node.

(c) If NodeIdentifier exists and if NodeIdentifier(BitPos[CurrentBit]) is 1 // this indicates a terminal node
    i. If EdgeIdentifier(BitPos[CurrentBit]) is 1 // indicating an EdgeID
        A. The node pointed to by the edge indicated by EdgeID is a terminal node.
    ii. otherwise // a CommonNodeID is indicated
        A. The node indicated by the CommonNodeID is a terminal node.

If Edgeidentif ier contains any set bits to the left of the left most bit of Levelldentifier, then this set encoding contains a virtual root which is different to the virtual root of the encoded graph. The positioning of the encoded set within a larger graph is user defined (perhaps a single incoming edge identifies where the encoded set falls within the larger graph, or any other method) . These trailing bits do not otherwise contribute to the expansion of the encoded set.

This encoding is designed to be effective when multiple paths share one or more common nodes . A common node is a node which has more than one incoming or outgoing edge, considering only edges that are part of the set to be encoded. This encoding scheme can encode instances involving multiple paths to a single node, or multiple paths to multiple nodes .

Note this encoding scheme does not assume a single root. However, if it is to be used in a consistent manner with path encoding, or to be consistently considered with other set encoded graphs, it is necessary to assume a single (virtual or actual) root. Note that this root is defined as the first common node encountered in the encoding.

This encoding scheme works by firstly identifying all the common nodes in the set to be encoded. When encoding, an initial pass of all nodes is performed and all common nodes identified. Each common node is then assigned a unique number, CommonNodeID. Note that this number need only be unique amongst the set of common nodes. It may be repeated for different sets. CommonNodeID may be any non-zero number, but for space efficiency it is recommended that all CommonNodeIDs be as small as possible (positive integers are recommended).

The encoded path is to be read from right to left, and is informally read as follows:

The path(s) exist (encoded as for path encoding described above) going from the most recent FromNode to the next ToNode. FromNodes and ToNodes are nodes selected from the pool of common nodes, and indicate the origin and destination of the encoded path. The ToNode is specified before the incoming edge so the bit pattern can be decoded in a sequential scan without any need for backtracking. The first common node encountered is always a FromNode. The common node encountered after a FromNode is always a ToNode. Similarly, the common node encountered after a ToNode is always a FromNode.

As common nodes will, by definition, be repeated a number of times, the encoding assumes a notion of CommonNodeID buffering. The most recent FromNode encountered can be considered to be stored in a FromNodeBuffer. Similarly, the most recent ToNode encountered can be considered to be stored in a ToNodeBuffer. Whenever a CommonNodeID of 0 is encountered, the real CommonNodeID is considered to be that stored in the appropriate buffer. Note that the type of node (FromNode or ToNode) can be determined by keeping track of the type of the previous common node encountered.

Figure 8 shows an example of set encoding for all paths between N1 and N7 in Figure 1. In this example, the common nodes are N1, N5 and N7. N1 has been assigned the CommonNodeID 1, N5 has been assigned the CommonNodeID 2 and N7 has been assigned the CommonNodeID 3. Note that all 10 paths (9 through N1 → N5 → N7 and one through N1 → N8 → N7) are encoded in 43 bits.

The bit pattern NodeIdentifier is designed to be used in situations where the set of paths being encoded do not all terminate at the same node. NodeIdentifier can be used in situations where all paths terminate at the same node, but such a use conveys no extra meaning. In this case, NodeIdentifier indicates which nodes are terminal nodes, that is, nodes at which each path terminates. Although not strictly required by the encoding scheme, if an implementation chooses to indicate each terminal node only once in NodeIdentifier, then the number of terminal nodes can easily be determined by counting the number of set bits in NodeIdentifier. If the set of paths encoded share a different virtual root from the graph as a whole, this is indicated in step [2]. This form of encoding is not recommended if it is necessary to be able to expand the set of paths such that the expansion can be merged with other path or set expansions to form a subgraph of the original graph (as is the case, for example, with proximity determination).

Relationship Between Path and Set Encoding

Path and set encoding are complementary. A graph can be encoded using a combination of set encoding and path encoding.

For any pair, LevelIdentifier and EdgeIdentifier, if: (LevelIdentifier Bitwise_AND EdgeIdentifier) equals LevelIdentifier AND (NodeIdentifier does not exist OR NodeIdentifier exists and has only one set bit), then path encoding has been used for this pair; otherwise set encoding has been used.
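The test above can be expressed as a short Python sketch. It assumes the bit patterns are non-negative integers and that an absent NodeIdentifier is represented by None; the function name is_path_encoding is hypothetical.

```python
def is_path_encoding(level_identifier, edge_identifier, node_identifier=None):
    """Return True if the pair was produced by path encoding, False for set encoding."""
    # In path encoding every set bit of LevelIdentifier coincides with the most
    # significant bit of an EdgeID, so that bit is also set in EdgeIdentifier.
    levels_covered = (level_identifier & edge_identifier) == level_identifier
    single_terminal = (
        node_identifier is None or bin(node_identifier).count("1") == 1
    )
    return levels_covered and single_terminal
```

A set bit of LevelIdentifier whose matching EdgeIdentifier bit is 0 marks a common node, which only set encoding produces, so such a pair is reported as set encoded.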

Encoding a graph

An algorithm can then be designed to traverse a graph to encode all paths to every node in it. Depending on the topology of the graph and the precise requirements, various algorithms will be appropriate. The common feature of such algorithms is that the paths must extend from the root (virtual or actual) to each given node. Depending on the purpose of the encoding, it may not be necessary to encode all paths to every node. It may be sufficient to merely encode an appropriate subset. For example, if a graph is being encoded for proximity detection, then paths need only be considered if they either do not contain any cycles, or if the last node in the path forms the only cycle in that path.

It is then possible to group paths as desired, possibly grouping all paths from the root to each node. These groups can then be encoded using set encoding if desired, and optionally be stored with some appropriate identifier, for example in a <node, set encoding of paths from root to this node> tuple.

Determining Proximity of Two Nodes

This invention includes a number of ways of determining proximity for a number of nodes.

In the case where two paths have been encoded using path encoding, the proximity of the two terminal nodes can be very quickly determined by directly examining the two path encodings. The next algorithm is an implementation of the distance determination for this case.

Path Encoding Proximity Determination Algorithm

Input: Two nodes, NodeA and NodeB, each encoded using path encoding. NodeA and NodeB are the last nodes on the two encoded paths.

Each encoded node has 2 bit patterns, EdgeIdentifier and LevelIdentifier, denoted by NodeX.EdgeIdentifier and NodeX.LevelIdentifier respectively.

Output: Dist, the number of edges between the two nodes, considering only the encoded paths.

The following algorithm assumes a bit encoding starting from the right of the bit pattern and growing towards the left . It assumes that less significant bits occur to the right of more significant bits. This algorithm works for any other direction of growth and any order of bit significance, whether sequential or not.

For explanatory purposes, this algorithm assumes that bit positions are numbered from right to left, and that the right most bit position has a value of 1.

This is for illustrative purposes only, and does not imply the only embodiment of this algorithm.

[1] max_diff = (NodeA.LevelIdentifier Bitwise_XOR NodeB.LevelIdentifier) Bitwise_OR (NodeA.EdgeIdentifier Bitwise_XOR NodeB.EdgeIdentifier)

[2] Let BitPos[LSSB] be the bit position of the Least Significant Set Bit in max_diff

[3] diff_mask = a bit pattern with: all bits with bit position ≥ BitPos[LSSB] set to 1 AND all bits with bit position < BitPos[LSSB] set to 0.

[4] CompA = NodeA.LevelIdentifier Bitwise_AND diff_mask

[5] CompB = NodeB.LevelIdentifier Bitwise_AND diff_mask

[6] XOR_amt = CompA Bitwise_XOR CompB

[7] AND_amt = CompA Bitwise_AND CompB

[8] Dist = 2 x (Number of set bits in AND_amt) + (Number of set bits in XOR_amt)

The algorithm works for the following reason. Considering a path from the root to the terminal nodes, while paths converge, their encodings (in both LevelIdentifier and EdgeIdentifier) will be identical. As such, an exclusive OR will yield 0 for all bits where the paths converge (step [1]). What is of interest is only the first instance where these paths do not converge. It is important to note that the first bit set by this exclusive OR indicates the first edge/node after the node of divergence. This is the node where the two paths, when considered from the root to the given nodes, diverge. Thus, considering the remaining set bits in both LevelIdentifiers yields the total number of nodes in the path between the two nodes less one, the node of divergence, which is not included in this node count. Considering the remaining bits in the LevelIdentifiers is achieved by generating (step [3]) and applying (steps [4] and [5]) a bit mask to the LevelIdentifiers. One embodiment of step [3] is the following:

[3] diff_mask = (max_diff Bitwise_XOR Bitwise_NOT(max_diff - 1)) Bitwise_OR max_diff

Step [6] finds the number of bits in the remainder of the LevelIdentifiers which do not coincide with each other, and thus must be included once each in the node count.

Step [7] finds the number of bits in the remainder of the LevelIdentifiers which do coincide with each other, and thus must be included twice each in the node count.

As previously stated, this count will give the number of nodes in the path between the two nodes, less one. However, as the number of edges in the path is also calculated as the number of nodes less one, this method finds the number of edges between the two nodes.

One embodiment of a use for this encoding is for determining proximity of one node to another. If the path from the root to node A and from the root to node B are encoded using path encoding, then the distance between these two nodes, along the encoded paths, can be determined using this algorithm.
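Steps [1] to [8], using the embodiment of step [3] given above, can be sketched in Python as follows. The sketch assumes each encoding is a pair of non-negative integers; the function name path_distance is hypothetical.

```python
def path_distance(level_a, edge_a, level_b, edge_b):
    """Number of edges between the terminal nodes of two path-encoded paths."""
    max_diff = (level_a ^ level_b) | (edge_a ^ edge_b)        # step [1]
    # Step [3]: all bits at or above the least significant set bit of max_diff.
    # For identical encodings max_diff is 0 and the mask is 0 as well.
    diff_mask = (max_diff ^ ~(max_diff - 1)) | max_diff
    comp_a = level_a & diff_mask                              # step [4]
    comp_b = level_b & diff_mask                              # step [5]
    xor_amt = comp_a ^ comp_b                                 # step [6]
    and_amt = comp_a & comp_b                                 # step [7]
    # Step [8]: coinciding bits count twice, non-coinciding bits once.
    return 2 * bin(and_amt).count("1") + bin(xor_amt).count("1")
```

For example, a path with EdgeIDs 1, 6, 3 (LevelIdentifier 101001, EdgeIdentifier 111101) and a path with EdgeIDs 1, 6, 2, 1 (LevelIdentifier 1101001, EdgeIdentifier 1101101) share their first two edges, so the terminal nodes are 1 + 2 = 3 edges apart.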

Determining Proximity between two Sets of Nodes

The proximity, or distance, of two nodes is defined as the minimum number of edges between the two nodes. The proximity of two sets of nodes, the "find" set and the "near" set, is defined as the set of tuples <node1, distance> where

1. node1 is an element of the "find" set

2. distance is the smallest number, distance2, from the set of tuples <node1, node2, distance2>, where

(a) node2 is an element of the "near" set

(b) one such tuple, <node1, node2, distance2>, exists for every element of the "near" set

AND one such tuple, <node1, distance>, exists for every element of the "find" set.

The desired set may be a subset of the full proximity, limited by some ordinance or radial criteria. For example, find only the "closest" occurrence or only the "furthest" occurrence, find all occurrences with a distance less than 12, find all occurrences with distance between 6 and 15, find all occurrences with distances greater than 23, etc.
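The definition above can be written directly as the following Python sketch. It is only an illustration: the node-to-node distance is supplied as a function dist, the "near" set is assumed to be non-empty, and the optional max_distance argument stands for one possible radial criterion. The names set_proximity, dist and max_distance are hypothetical.

```python
def set_proximity(find_set, near_set, dist, max_distance=None):
    """Map each node of the "find" set to its distance from the closest "near" node."""
    result = {}
    for node1 in find_set:
        # distance is the smallest distance2 over all <node1, node2, distance2> tuples
        distance = min(dist(node1, node2) for node2 in near_set)
        if max_distance is None or distance < max_distance:
            result[node1] = distance
    return result
```

As a toy usage, with nodes numbered along a simple chain so that the distance between nodes i and j is |i - j|, set_proximity({0, 5, 20}, {3, 9}, lambda a, b: abs(a - b)) gives {0: 3, 5: 2, 20: 11}.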

Path by Path Comparison

As seen from the above definition, the naive method of performing this calculation is to perform |Find set| x |Near set| calculations. In situations where the find set contains only one element, or possibly some other small number, and depending on the precise topology of the graph being encoded (for example, a tree or directed acyclic graph), this may in fact be the most efficient method. In this case, it is best to use the method for determining the proximity of two nodes if possible; that is, in situations when path encoding has been used for both nodes.

Annotated graph comparison

In other instances, it is most efficient to first obtain an annotated graph (described below) containing the nodes of the "near" set, representing a subgraph of the graph containing the "near" set. This can be dynamically generated at runtime or can be stored as a precompiled "annotated" graph for popular "near" sets. Next, each node from the "find" set is considered and "unrolled on top of" this annotated graph. This yields the proximity for each node in the "find" set.

If the search criteria call for a "find" set ordered by proximity, a bucket sort can then be used to immediately place each node at the appropriate position in the output set. One "bucket" is created for every potential valid distance in the answer set. Note that there are at most "diameter of the graph" buckets required for this bucket sort.

The annotated graph of the "near" set contains the subgraph of the original graph representing the "near" set. Each node is "annotated" by recording the distance from that node to the closest node in the "near" set, considering only "forward" edges. This means that the distance is calculated by considering all paths from the given node to each node in the "near" set, as opposed to counting the number of edges, which is a different measure.

One embodiment of an algorithm is:

Annotated Expanded Set Generation Algorithm

Input: An encoded set of paths, each consisting of 2 or 3 bit patterns:

LevelIdentifier, a bit pattern representing the nodes existing in the encoded path;

EdgeIdentifier, a bit pattern representing the edges followed and common nodes in the encoded set of paths;

The optional NodeIdentifier, a bit pattern representing the terminal nodes in the set, if more than one terminal node exists.

Output: A graph representing the encoded set, with each node "annotated" with the distance (considering only "forward" edges) to the closest terminal node.

The following algorithm assumes CurrentNode.distance is initialized to MAXINT.

[1] Expand the encoded set according to the schema given in example 3. During the expansion, record reverse pointers to allow movement "up" the graph. Keep track of all the terminal nodes in the expanded graph.

[2] For each terminal node in the expanded graph

[3] Append <this terminal node, 0> to NodesRemaining

[4] While NodesRemaining is not empty

[5] Remove <CurrentNode, Dist> from head of NodesRemaining

[6] If CurrentNode.distance > Dist

[7] CurrentNode.distance = Dist

[8] For each incoming edge, NewEdge, of CurrentNode

[9] If NewEdge has not been followed //cater for cycles

[10] Append <CurrentNode -> NewEdgeID, CurrentNode.distance + 1> to NodesRemaining

Note that dynamic annotated graph generation is more efficient for path encoding:

Annotated Individual Graph Generation Algorithm

Input: An encoded set of paths, consisting of 2 or 3 bit patterns:

LevelIdentifier, a bit pattern representing the nodes existing in the encoded path;

EdgeIdentifier, a bit pattern representing the edges followed and common nodes in the encoded set of paths;

The optional NodeIdentifier, a bit pattern representing the terminal nodes in the set, if more than one terminal node exists.

Output: A set of graphs, graphSet. Each graph represents all paths from the root to one terminal node. Each node in the graph is "annotated" with the distance (considering only "forward" edges) to the terminal node. One graph exists in the set for each terminal node.

The following algorithm assumes CurrentNode.distance is initialized to MAXINT.

[1] Expand the encoded set according to the set encoding schema. During the expansion, record reverse pointers to allow movement "up" the graph. Keep track of all the terminal nodes in the expanded graph.

[2] For each terminal node in the expanded graph

[3] Append <this terminal node, 0> to NodesRemaining

[4] While NodesRemaining is not empty

[5] Remove <CurrentNode, Dist> from head of NodesRemaining

[6] If CurrentNode does not exist in new graph, insert it

[7] If CurrentNode.distance > Dist

[8] CurrentNode.distance = Dist

[9] For each incoming edge, NewEdge, of CurrentNode

[10] If NewEdge does not already exist in graph being generated

[11] Insert CurrentNode -> NewEdgeID into the graph being generated

[12] If NewEdge has not been followed //cater for cycles

[13] Append <CurrentNode -> NewEdgeID, CurrentNode.distance + 1> to NodesRemaining

[14] Insert new graph into graphSet
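The two generation algorithms above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the encoded set has already been expanded into a plain list of (parent, child) edges, and the names `annotate_distances` and `annotate_per_terminal` are hypothetical. A missing dictionary entry plays the role of a distance still at MAXINT.

```python
from collections import deque


def annotate_distances(edges, terminals):
    """Annotate nodes with the forward-edge distance to the closest terminal."""
    # Reverse pointers allow movement "up" the graph from each terminal.
    incoming = {}
    for parent, child in edges:
        incoming.setdefault(child, []).append(parent)

    distance = {}  # a node absent from this dict is still at "MAXINT"
    nodes_remaining = deque((terminal, 0) for terminal in terminals)
    while nodes_remaining:
        node, dist = nodes_remaining.popleft()
        if node in distance and distance[node] <= dist:
            continue  # already reached by an equal or shorter path (handles cycles)
        distance[node] = dist
        for parent in incoming.get(node, []):
            nodes_remaining.append((parent, dist + 1))
    return distance


def annotate_per_terminal(edges, terminals):
    """One annotated graph per terminal node, keyed by that terminal."""
    return {t: annotate_distances(edges, [t]) for t in terminals}


# Hypothetical expanded graph: root -> a, root -> b, a -> c, b -> c, b -> d.
EDGES = [("root", "a"), ("root", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
combined = annotate_distances(EDGES, ["c", "d"])
individual = annotate_per_terminal(EDGES, ["c", "d"])
```

Seeding the queue with every terminal at once gives the "closest terminal" annotation of the expanded set algorithm, while seeding it with a single terminal gives one graph of the individual graph algorithm; nodes that cannot reach the terminal simply do not appear in that graph.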

Annotated Graph Generation Algorithm

Input: A set of encoded nodes, NodeSet. Nodes can be encoded using a combination of either path encoding or set encoding.

Output: A graph, with each node "annotated" by recording the minimum distance (considering only "forward" edges) from that node to the nearest node in the input set.

Note: The following algorithm assumes CurrentNode.distance is initialized to MAXINT.

[1] For each element of NodeSet

[2] If the element of NodeSet uses path encoding

[3] Let CurrentNode = the root of the graph

[4] Let NodeCount = the number of set bits in LevelIdentifier

[5] For each EdgeID indicated (as determined by the path encoding schema) from the second to the last

[6] CurrentNode.distance = min (CurrentNode.distance, NodeCount)

[7] NodeCount = NodeCount - 1

[8] If EdgeID does not already exist for CurrentNode

[9] Insert CurrentNode -> EdgeID into the graph

[10] CurrentNode = CurrentNode -> EdgeID

[11] Otherwise //element of NodeSet uses set encoding

[12] Create annotated graph from element of NodeSet (as per the algorithm for annotated expanded set generation)

[13] Let CurrentNode be the root of this annotated graph

[14] Assume CurrentNode corresponds to the root of the graph being generated.

[15] Append <CurrentNode, CurrentNode.distance> to NodesRemaining

[16] While NodesRemaining is not empty

[17] Remove <CurrentNode, Dist> from head of NodesRemaining

[18] Let graphNode be the node corresponding to CurrentNode in the graph being generated

[19] If graphNode.distance > Dist

[20] graphNode.distance = Dist

[21] For each outgoing edge, NewEdge, of CurrentNode

[22] If NewEdge does not already exist in graph being generated

[23] Insert CurrentNode -> NewEdgeID into the graph being generated

[24] If NewEdge has not been followed //cater for cycles

[25] Append <CurrentNode -> NewEdgeID, graphNode.distance + 1> to NodesRemaining

Annotated Graph Proximity Determination Algorithm

Input: An annotated graph, Neargraph, representing the "near" set; a set of encoded nodes, FindSet. Nodes can be encoded using a combination of either path encoding or set encoding.

Output: A set, AnswerSet, of tuples <FindNode, proximity>, where FindNode is a node from FindSet and proximity is the minimum number of edges between FindNode and the closest element in the "near" set.

[1] For each element of FindSet

[2] If the element of FindSet uses path encoding //the element indicates FindNode

[3] Let CurrentNode = the root of Neargraph, proximity = MAXINT

[4] Let NodeCount = the number of set bits in LevelIdentifier

[5] For each EdgeID indicated (as determined by the path encoding schema) from the second to the last

[6] proximity = min (proximity, CurrentNode.distance + NodeCount)

[7] If EdgeID does not exist for CurrentNode or this is the last EdgeID

[8] Insert <FindNode, proximity> into AnswerSet

[9] break

[10] NodeCount = NodeCount - 1

[10] CurrentNode = CurrentNode -> EdgeID

[11] Otherwise //element of FindSet uses set encoding

[12] Create set of annotated graphs, graphSet, from element of FindSet (as per the algorithm for annotated individual graph generation)

[13] For each annotated graph, Angraph, in graphSet

[14] Let CurrentNode be the root of this annotated graph, proximity = MAXINT

[15] Assume CurrentNode corresponds to the root of Neargraph

[16] Append <CurrentNode> to NodesRemaining

[17] While NodesRemaining is not empty

[18] Remove <CurrentNode> from head of NodesRemaining

[19] Let graphNode be the node corresponding to CurrentNode in Neargraph

[20] proximity = min (proximity, CurrentNode.distance + graphNode.distance)

[21] For each outgoing edge, NewEdge, of CurrentNode

[22] If NewEdge exists in Neargraph AND NewEdge has not been followed in Angraph

[23] Append <CurrentNode> to NodesRemaining

[24] Insert <terminal node of Angraph, proximity> into AnswerSet

Step [5] of the algorithm for annotated graph proximity determination considers only edges from the second edge indicated in the encoded path onwards. This is because the first edge is always a single pointer to the root. As the algorithm assumes the root of the graph and the root of the encoded path are the same, the first edge is immaterial.

Nodes from the "find" set are then "unrolled" on top of this annotated graph. The annotated graph proximity determination algorithm shows an embodiment of one such unrolling algorithm. Note that the output, a set of tuples <FindNode, proximity>, where proximity is the minimum number of edges between FindNode and the closest element in the "near" set, can easily be sorted using a bucket sort to achieve an answer set ordered by proximity.
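The unrolling of path-encoded nodes can be illustrated with the following Python sketch. It makes several simplifying assumptions that are not part of the specification: the graph is a tree, each path-encoded node is given as a plain tuple of edge identifiers from the root (the virtual edge into the root is omitted), and the remaining edge count is tracked directly rather than derived from the set bits of the LevelIdentifier. The names `Node`, `build_near_graph` and `proximity` are hypothetical.

```python
MAXINT = float("inf")


class Node:
    def __init__(self):
        self.children = {}      # edge identifier -> child Node
        self.distance = MAXINT  # forward edges down to the closest "near" node


def build_near_graph(near_paths):
    """Build the annotated graph for a "near" set of path-encoded nodes."""
    root = Node()
    for path in near_paths:
        node, remaining = root, len(path)
        node.distance = min(node.distance, remaining)
        for edge_id in path:
            node = node.children.setdefault(edge_id, Node())
            remaining -= 1
            node.distance = min(node.distance, remaining)
    return root


def proximity(near_root, find_path):
    """Unroll one "find" path on top of the annotated "near" graph."""
    node, remaining = near_root, len(find_path)
    best = node.distance + remaining
    for edge_id in find_path:
        node = node.children.get(edge_id)
        if node is None:
            break  # the find path leaves the annotated graph here
        remaining -= 1
        # remaining edges up from the find node, plus distance down to a near node
        best = min(best, node.distance + remaining)
    return best


# Hypothetical "near" set: the node reached by edges 1 then 2, and the node reached by edge 2.
near_graph = build_near_graph([(1, 2), (2,)])
answer_set = [(path, proximity(near_graph, path)) for path in [(1, 1), (1, 2), (3,)]]
```

At each node shared by the "find" path and the annotated graph, the candidate proximity is the number of edges still to be walked on the "find" path plus the annotation stored at that node; the minimum over all shared nodes is reported.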

As with the previous algorithm, this algorithm is more efficient for nodes using path encoding, and after that, most efficient for set encoding containing a single terminal node.
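The bucket sort used to order the answer set by proximity can be sketched as below. This is an illustrative Python version under the assumption that every proximity is an integer between 0 and the diameter of the graph; the function name `bucket_sort_by_proximity` and the sample tuples are hypothetical.

```python
def bucket_sort_by_proximity(answer_set, diameter):
    """Order <FindNode, proximity> tuples by proximity in linear time."""
    # One bucket per possible distance, including a bucket for distance zero.
    buckets = [[] for _ in range(diameter + 1)]
    for find_node, prox in answer_set:
        buckets[prox].append(find_node)
    return [(node, prox) for prox, bucket in enumerate(buckets) for node in bucket]


unsorted_answers = [("a", 2), ("b", 0), ("c", 2), ("d", 1)]
ordered = bucket_sort_by_proximity(unsorted_answers, diameter=3)
```

Because each tuple is dropped straight into the bucket for its distance, nodes with equal proximity keep their original relative order and no comparisons between tuples are needed.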

By identifying the proximal nodes as a relation name and using proximity to find the containment relationship of its tuples, a standard RDBMS query language such as SQL can be used, as shown below, to query the sample XML information presented above. Note that the output can be displayed as XML (as shown) or in a typical tabular form, as in an RDBMS, by using an appropriate XSL (XML Stylesheet Language) file.

It will be appreciated that there are many applications of this invention, particularly for searching information, and also more generally for determining the proximity between any nodes connected by a graph for any general purpose, for any information that can be represented by a graph including nodes and edges.

The general system for searching databases will now be described with reference to Figure 9. Reference numeral 10 indicates a database storing information. The database may be any type of database, which may be stored on a computing system which may include a computer network such as the Internet (where the information may be distributed over many computers).

In order for the present invention to be applied, information from this database must be represented in graph form, including nodes connected by edges. Converters for converting any information to graph form are known in the prior art. The system of the present invention, generally designated by reference numeral 100 in this embodiment, includes translator 101 for converting incoming information into graph format. It also includes a proximity engine 102. The proximity engine 102 is arranged to encode a graph as discussed above, in accordance with embodiments of the present invention. The coding may be stored in any convenient manner as a searchable index. Preferably, it is stored as a hash index. As discussed above, coding may be dynamic (as searching is required) or a large index may be stored in memory. The system also comprises a query means utilising simple syntax to enable a query to be entered to return a search result.

It will be appreciated that the system in Figure 9 can be implemented by any convenient computing system, including computer hardware and/or software. It will also be appreciated that the encoding scheme and proximity search discussed above will be implemented by suitable software/hardware in a computing system.

The system of the present invention can search any type of information that can be represented graphically (which is all information). The database 10 in the Figure 9 embodiment would generally include free-format information and may be entirely free-form information, which is then represented in graph form by the translator 101. There are, however, in existence already, many organised databases which require specialised languages such as SQL to search them. One of the problems with searching such databases is that the input language needs to be absolutely correct, otherwise an incorrect result or no result will be returned. The present invention also has application in facilitating simple language searches of databases that are organised.

Referring to Figure 9, reference numeral 20 indicates an ordered database containing information in ordered form. This would normally be searched directly using complex language such as SQL. Utilising the present invention, however, the system in accordance with a further embodiment of the present invention comprises translator 201 for translating information from the ordered database 20 into graph form and proximity engine 202 for encoding the graph in accordance with an embodiment of the present invention. Simple queries can then be made using the proximity engine and using simple SQL syntax.

SQL query: select * from restaurant;

Answer:

<name> ABC Seafood Restaurant </name> <phone> ... <address> ... <rating> ...

<name> Great Thai </name> <phone> ...

<name> Quick n Good </name>

SQL query: select * from dining;

Answer:

<restaurant> ... <restaurant> ... <restaurant> ... <cafe> ...

SQL query: select cafe from dining;

Answer:

<cafe> ...

SQL query: select restaurant from dining where name like "Great%";

Answer:

<name> Great Thai </name>

SQL query: select name from (select restaurant from dining);

Answer:

<name> ABC Seafood Restaurant </name> <name> Great Thai </name> <name> Quick n Good </name>

SQL query: select name from dining;

Answer:

<name> ABC Seafood Restaurant </name> <name> Great Thai </name> <name> Quick n Good </name> <name> Snack House </name>
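The containment semantics of these queries can be illustrated with a small Python sketch using the standard `xml.etree.ElementTree` module. It is only a stand-in: it walks the XML tree directly instead of using the coded proximity index, and the `SAMPLE` document is a hypothetical reduction of the sample data to its names. The helper names `select` and `name_like` are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced stand-in for the sample document (names only).
SAMPLE = """
<dining>
  <restaurant><name>ABC Seafood Restaurant</name></restaurant>
  <restaurant><name>Great Thai</name></restaurant>
  <restaurant><name>Quick n Good</name></restaurant>
  <cafe><name>Snack House</name></cafe>
</dining>
"""


def select(tag, source, root):
    """Return every `tag` element contained in a `source` element."""
    results = []
    for container in root.iter(source):
        for elem in container.iter(tag):
            if elem is not container:
                results.append(elem)
    return results


def name_like(elem, prefix):
    """Rough equivalent of: where name like "<prefix>%"."""
    return any((n.text or "").startswith(prefix) for n in elem.iter("name"))


root = ET.fromstring(SAMPLE.strip())

# select name from dining;
all_names = [e.text for e in select("name", "dining", root)]

# select name from (select restaurant from dining);
restaurant_names = [n.text for r in select("restaurant", "dining", root)
                    for n in r.iter("name")]

# select restaurant from dining where name like "Great%";
great = [r for r in select("restaurant", "dining", root) if name_like(r, "Great")]
```

The element named after `from` acts as the relation, and the selected tag is looked up among the elements it contains, which is why `select name from dining` also returns the cafe's name while the nested query over `restaurant` does not.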

Other applications of this invention include searches in environments such as mobile phone environments. A mobile telephone can utilise voice or SMS based input in order to search a database such as, e.g., the yellow pages, or the Internet. This type of searching is not often used at present, because of the necessity to have organised databases and the time that such searches take. The present invention, however, can be applied in this area and generally for any search system for semistructured data that requires intelligent matching based on proximity, or is adapted so that it can do so. It can also be applied in the natural language query environment for other text or semistructured data.

Modifications and variations as would be apparent to a skilled addressee are deemed to be within the scope of the present invention.

Due to the primitive user interface of mobile devices (small display and keypad), short messages are popular and important for sending messages or commands from these devices. SMS is a messaging application for mobile phones that allows users to send a short message in an asynchronous manner. To facilitate information search services effectively using SMS, intelligently finding what the user wants to ask is crucial. By utilizing proximity search on the semistructured data, users can query information in a very precise way and yet the search result is likely to be highly accurate and precisely ranked. For example, a user may ask "Thai restaurant" so that the system will return the information about "Great Thai" by finding the closest proximity between the word "Thai" and the element tag "restaurant". This is different from traditional text proximity search, which does not consider the topological structure of the semistructured data, even though this structure in general captures the semantic meaning of the information, such as its catalogical or ontological information.

Searching information based on natural language can also be supported by combining proximity search with the traditional stemming and stoplist techniques from information retrieval. For example, a query "Find me a restaurant that has good Thai food near Kensington" can be evaluated by proximity search. First, the query might become "Find restaurant good Thai food near Kensington" after the stemming and stoplist phases. The keywords of this sentence will then be matched against the document structure (such as the DTD of an XML document) of the semistructured data based on proximity search. These keywords will then be matched with the actual data by proximity search and can optionally be relaxed by an electronic thesaurus. Using the sample XML document above, only "restaurant", "Thai" and "Kensington" will match the vocabulary appearing in the data or its DTD. These keywords will be used to perform the proximity search and hence the "Great Thai" information will be returned.
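The keyword preparation described above can be sketched in Python as follows. The stoplist and the vocabulary are assumptions chosen to reproduce the example query; stemming and thesaurus relaxation are omitted, and the function names `keywords` and `match_vocabulary` are hypothetical.

```python
# Assumed stoplist, chosen so that the example query reduces as described.
STOPLIST = {"me", "a", "that", "has"}

# Assumed vocabulary: element tags from the DTD plus some words from the data.
VOCABULARY = {"dining", "restaurant", "cafe", "name", "phone", "address",
              "rating", "Thai", "Kensington"}


def keywords(query, stoplist=STOPLIST):
    """Drop stoplist words, keeping the remaining words in order."""
    return [word for word in query.split() if word.lower() not in stoplist]


def match_vocabulary(words, vocabulary=VOCABULARY):
    """Keep only the words that occur in the data or its document structure."""
    known = {entry.lower() for entry in vocabulary}
    return [word for word in words if word.lower() in known]


query = "Find me a restaurant that has good Thai food near Kensington"
reduced = keywords(query)
matched = match_vocabulary(reduced)
```

The surviving keywords would then be handed to the proximity search, so that the record in which "restaurant", "Thai" and "Kensington" lie closest together in the graph is ranked first.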

Claims


1. A method of searching information, the information being representable in graph form and including nodes representing objects and edges representing relationships between nodes, the method comprising the steps of encoding a representation of the graph, the code identifying the position of the nodes within the graph, and providing the codes as an index arranged to facilitate a determination of proximity between the positions of nodes within the graph, whereby to facilitate searching of the information represented by the nodes.

2. A method in accordance with claim 1, wherein the code is arranged to represent the pathway to the nodes via interconnecting edges connecting the node(s) and a root node.

3. A method in accordance with claim 2, wherein the step of encoding comprises the step of assigning edge identifiers to each edge, the edge identifier for a particular edge being the smallest unused positive number which is unique only amongst all edges originating from a given node.

4. A method in accordance with claim 3, wherein the step of encoding includes assigning a code to each node that corresponds to the sequence of edge identifiers of the edges in the pathway to that node through connecting nodes from the root node.

5. A method in accordance with claim 4, wherein the step of encoding includes the step of assigning a virtual incoming edge to the root node, assigning an edge identifier to the virtual incoming edge, and utilising the edge identifier in the assigning of code.

6. A method in accordance with claim 4 or claim 5, wherein the graph includes common nodes, and the method of encoding includes the step of assigning a marker to each common node and including the marker in the code identifying each node connected in the pathway to the common node.

7. A method in accordance with any one of the preceding claims, comprising the further step of storing the codes as bit patterns in a computing system.

8. A method in accordance with claim 7, wherein the bit patterns are stored as compressed arrays.

9. A method in accordance with claim 8, wherein the boundary pattern of the compressed array is representative of a level identifier of the graph and the identifier of the pattern of the compressed array is representative of the edge identifier.

10. A method in accordance with any one of the preceding claims, wherein the method of encoding is carried out dynamically as searching is required.

11. A method in accordance with any one of the preceding claims, wherein the method of encoding is carried out as a separate process from the searching and the index is stored for subsequent searching.

12. A method in accordance with any one of the preceding claims, wherein the information representable in graph form is information stored as data accessible by a computing system.

13. A method in accordance with claim 12, wherein the computing system is the Internet.

14. A method in accordance with claim 12 or 13, wherein the information is stored as XML data.

15. A method in accordance with any one of the preceding claims, comprising the searching step of utilising the code to determine the positional proximity within the graph of a first node (termed a "find" node) with respect to a second node (termed a "near" node).

16. A method in accordance with claim 15, wherein the step of determining the positional proximity includes the step of comparing the encoded subgraph of all nodes and edges of the find node with the encoded subgraph of all nodes and edges of the near node.

17. A method in accordance with claim 16, the codes being stored as bit patterns in accordance with claims 7 or 8, the step of comparing the encoded subgraphs comprising the steps of comparing the bit patterns.

18. A method in accordance with claim 17, wherein the step of comparing the bit patterns comprises the step of subtracting the bit patterns from one another, and counting the remaining bits.

19. A method of searching information, comprising the step of translating the information into graph form, the graph form including nodes representing objects and edges representing relationships between nodes, preparing a search index in accordance with the steps of any one of claims 1 to 14, and searching the index in accordance with any one of claims 15 to 18.

20. A method of searching an ordered database, comprising the steps of translating the information from the ordered database into a graph form, the graph form including nodes representing objects and edges representing relationships between nodes, applying the method steps of any one of claims 1 to 14 to prepare a coded index, and searching the coded index in accordance with the method steps of any one of claims 15 to 18.

21. A method in accordance with claim 20, wherein a simple query language is utilised to instruct the search.

22. A method in accordance with claim 19 or 20, wherein the search is instructed from a mobile device, such as a mobile telephone.

23. A method of determining proximity between computer represented graphs which involves nodes extending from a root to a terminal node, the nodes being interconnected by edges, the method including operating a computer in which the graphs are represented, by the steps of: inputting a first coded set of nodes representing the "near" set; inputting a second coded set of nodes representing the "find" set; inputting a proximity criteria; unrolling nodes from the "find" set on top of the "near" set; outputting an answer set of nodes and respective proximities, where the nodes of the answer set are from the "find" set and the respective proximities are the minimum number of edges between that node and the closest element in the "near" set.

24. A method of determining proximity between nodes in a graph comprising nodes connected by edges, the method comprising the steps of encoding the graph, the code identifying the position of nodes within the graph, the step of encoding comprising the step of assigning edge identifiers to each edge, the edge identifier for a particular edge being the smallest unused positive number which is unique only amongst all edges originating from a given node.

25. A searchable index produced in accordance with the method steps of any one of claims 1 to 14.

26. A searchable index in accordance with claim 25, stored in a computing system.

27. An apparatus for searching information, the information being representable in graph form including nodes representing objects and edges representing relationships between nodes, the apparatus comprising an encoding means for encoding the representation of the graph, the encoding means producing code identifying position of the nodes within the graph, and storage means for storing the codes as an index arranged to facilitate a determination of proximity between the positions of nodes within the graph, whereby to facilitate searching of the information represented by the nodes.

28. An apparatus in accordance with claim 27, wherein the code is arranged to represent the pathway to a node via interconnecting edges connecting the node(s) and a root node.

29. An apparatus in accordance with claim 28, wherein the encoding means is arranged to assign edge identifiers to each edge, the edge identifier for a particular edge being the smallest unused positive number which is unique only amongst all edges originating from a given node.

30. An apparatus in accordance with claim 29, wherein the encoding means is arranged to assign a code to each node that corresponds to the sequence of edge identifiers of the edges and the pathway to that node through connecting nodes from the root node.

31. An apparatus in accordance with claim 30, wherein the encoding means is arranged to assign a virtual incoming edge to the root node, assign an edge identifier to the virtual incoming edge, and utilise the edge identifier in the assigning of code.

32. An apparatus in accordance with claim 30 or claim 31, wherein the graph includes common nodes, and the encoding means is arranged to assign a marker to each common node and include the marker in the code identifying each node connected in the pathway to the common node.

33. An apparatus in accordance with any one of claims 27 to 32, the storage means storing the codes as bit patterns.

34. An apparatus in accordance with claim 33, the storage means storing the bit patterns as compressed arrays.

35. An apparatus in accordance with claim 34, wherein the boundary pattern of the compressed array is representative of a level identifier of the graph and the identifier of the pattern of the compressed array is representative of the edge identifier.

36. An apparatus in accordance with any one of claims 27 to 35, wherein the encoding means is arranged to encode dynamically as searching is required.

37. An apparatus in accordance with any one of claims 27 to 36, wherein the encoding means is arranged to encode as a separate process from the searching and an index is stored in the storage means for subsequent searching.

38. An apparatus in accordance with any one of claims 27 to 37, comprising a searching means, the searching means being arranged to utilise the code to determine the positional proximity within the graph of a first node (termed a "find" node) with respect to a second node (termed a "near" node).

39. An apparatus in accordance with claim 38, wherein the searching means is arranged to determine the positional proximity by comparing an encoded sub-graph of all nodes and edges of the find node with an encoded sub-graph of the nodes and edges of the near node.

40. An apparatus in accordance with claim 39, wherein the codes are stored as bit patterns in accordance with claims 32 and 33, and wherein the searching means is arranged to compare the encoded sub-graphs by comparing the bit patterns.

41. An apparatus in accordance with claim 40, wherein the searching means is arranged to compare the bit patterns by subtracting the bit patterns from one another, and counting remaining bits.

42. An apparatus for searching information, comprising a translation means for translating the information into graph form, the graph form including nodes representing objects and edges representing relationships between nodes, a search index preparation means for preparing a search index in accordance with any one of the method claims of claims 1 to 14, and searching means for searching the index in accordance with the method steps of any one of claims 15 to 18.

43. An apparatus for searching a structured database, comprising a translation means for translating the information from the structured database into a graph form, the graph form including nodes representing objects and edges representing relationships between nodes, and encoding means arranged to apply the method steps from any one of claims 1 to 14 to prepare a coded index, and searching means arranged to search the index by applying the steps of any one of the method claims 15 to 18.

44. A mobile telephone or personal digital assistant including means for interfacing with an apparatus in accordance with claim 20 or claim 19, whereby to instruct a search and receive results of the search.

45. A computer system, comprising: a memory in which graphs are represented by nodes extending from a root to a terminal node, the nodes being interconnected by edges; data input means to input a coded set of nodes representing graphs, and proximity criteria; computation means to unroll nodes from a "find" set on top of a "near" set; an output means to output an answer set of nodes in respective proximities, where the nodes of the answer set are from the "find" set and the respective proximities are the minimum number of edges between that node and the closest element in the "near" set.

46. A computer program including instructions for controlling the computer to implement the method of any one of claims 1 to 23.

47. A computer readable medium including a computer program in accordance with claim 46.