US20130254219A1 - Processing structured data - Google Patents

Processing structured data

Info

Publication number
US20130254219A1
Authority
US
United States
Prior art keywords
structured data
file
record
data file
bmf
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/894,118
Inventor
Zhengyu Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XimpleWare Inc
Original Assignee
XimpleWare Inc
Family has litigation
Priority claimed from US10/272,077 (US7133857B1)
Application filed by XimpleWare Inc
Priority to US13/894,118 (US20130254219A1)
Publication of US20130254219A1
Assigned to XIMPLEWARE, INC. Assignment of assignors interest (see document for details). Assignors: ZHANG, ZHENGYU
Priority to US15/393,481 (US9990364B2)
Priority to US15/970,775 (US10698861B2)
Priority to US16/821,307 (US20200285607A1)

Classifications

    • G06F17/2705
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Definitions

  • the present invention relates to the field of structured data files in computer systems. More specifically, the present invention relates to the processing of structured data in an efficient manner.
  • Structured data represents a large portion of the information accessed on the Internet and other computer networks. There are several reasons why structured data is so popular.
  • ASCII American Standard Code for Information Interchange
  • UTF-8 and UTF-16 are among the most common standard encoding formats.
  • Text encoding puts information into a format that is easily readable by a human, thus it is easy for programmers to develop and debug applications.
  • textual encoding is extensible and adding new information may be as simple as adding a new key-value pair.
  • XML Extensible Markup Language
  • HTML Hypertext Markup Language
  • tags are used to instruct a web browser how to render data
  • the tags are designed to describe the data fields themselves.
  • XML, therefore, provides a facility to define tags and the structural relationships between them. This allows a great deal of flexibility in defining markup languages for using information. Because XML is not designed to do anything other than describe what the data is, it serves as the perfect data interchange format.
  • XML is not without its drawbacks. Compared with other data formats, XML can be very verbose. Processing an XML file can be very CPU and memory intensive, severely degrading overall application performance. Additionally, XML suffers many of the same problems that other software-based text-based processing methods have. Modern processors prefer binary data representations, particularly ones that fit the width of the registers, over text-based representations. Furthermore, the architecture of many general-purpose processors trades performance for programmability, thus making them ill-suited for text processing. Lastly, the efficient parsing of structured text, no matter the format, can present a challenge because of the added steps required to handle the structural elements.
  • XML parsers are software-based solutions that follow either the Document Object Model (DOM) or Simple API for XML (SAX) technologies.
  • DOM parsers convert an XML document into an in-memory hierarchical representation (known as a DOM tree), which can later be accessed and manipulated by programmers through a standard interface.
  • SAX parsers treat an XML document as a stream of characters. SAX is event-driven, meaning that the programmer specifies an event that may happen, and if that event occurs, SAX gets control and handles the situation.
  • DOM and SAX are complementary, not competing, XML processing models, each with its own benefits and drawbacks.
  • DOM programming is programmer-friendly, as the processing phase is separate from application logic. Additionally, because the data resides in the memory, repetitive access is fast and flexible. However, DOM requires that the entire document data structure, usually occupying 7-10 times the size of the original XML document, be loaded into the memory, thus making it impractical for large XML documents.
  • SAX can be efficient in parsing large XML documents (at least when only small amounts of information need to be processed at once), but it maintains little of the structural information of the XML data, putting more of a burden on programmers and resulting in code that is hardwired, bulky, and difficult to maintain.
  • API application program interface
  • the present invention provides a fast and efficient way of processing structured data by utilizing an intermediate file to store the structural information.
  • the structured data may be processed into a Binary mask Format (BMF) file which may serve as a starting point for post-processing.
  • BMF Binary mask Format
  • a tree structure built on top of the BMF file may be constructed very quickly, and also takes up less space than a DOM tree.
  • BMF records may reside entirely in the memory and contain structural information, allowing SAX-like sequential data access.
  • FIG. 1 is a block diagram illustrating a layer view of an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention.
  • FIG. 3 is a timing diagram illustrating the operation of the hardware in accordance with an embodiment of the present invention.
  • FIG. 4 is a flow diagram illustrating a method for modifying the content of a target string in a BMF file from an old string to a new string in accordance with an embodiment of the present invention.
  • FIG. 5 is a diagram illustrating a BMF record format in accordance with an embodiment of the present invention.
  • FIG. 6 is a flow diagram illustrating a method for efficiently processing a structured data file, the structured data file including one or more pieces of content, in accordance with an embodiment of the present invention.
  • the components, process steps, and/or data structures may be implemented using various types of operating systems, computing platforms, computer programs, and/or general purpose machines.
  • devices of a less general purpose nature such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein.
  • a structured data file is any file containing content as well as some information regarding the structural organization of the content.
  • While this document will describe advantages that the present invention provides over DOM or SAX, one of ordinary skill in the art will recognize that the present invention need not be limited to replacing DOM or SAX, and can be expanded to non-XML type processing.
  • FIG. 1 is a block diagram illustrating a layer view of an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention.
  • the apparatus may comprise three layers.
  • a hardware text processing accelerator 100 occupying the lowest layer, may offer the horsepower necessary to relieve the central processing unit (CPU) from the most processor intensive part of the task.
  • On top of the hardware text processing accelerator 100 may lie a device driver layer 102 that is responsible for the communication between the hardware text processing accelerator 100 and a software layer 104 .
  • the software layer 104 may be designed to offer maximum flexibility and further improve the performance. It may export APIs that are standard-compliant.
  • FIG. 2 is a block diagram illustrating an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention.
  • a text processor 200 may be the core of the accelerator. It may be composed of multiple Finite State Machines (FSMs) that process an incoming document in parallel. The output may be the BMF file. It also may set several result registers (not pictured).
  • a PCI interface 202 may handle all handshaking between the hardware and a server PCI bus 204 .
  • a memory controller 206 may receive commands from the PCI interface 202 and convert the PCI address to on-board memory address space. It also may access the board memory accordingly.
  • Configuration registers 208 may determine the configuration of the text processing pipeline and the organization of the memory controller 206 . It may load default values from configuration ROM 210 . Some of the values may be modified by software through the PCI interface 202 .
  • the Configuration ROM 210 may store the default setting of the text processor configuration. It also may store the configuration map of FPGAs.
  • a document buffer 212 may store the incoming document. This may be a Synchronous Dynamic Random Access Memory (SDRAM). Paging may be utilized if the incoming document is larger than the total buffer size.
  • SDRAM Synchronous Dynamic Random Access Memory
  • a BMF buffer 214 may store the output BMF files, together with several other text processor result register values. This may be a separate SDRAM, although in one embodiment of the present invention it may share a single SDRAM with the document buffer.
  • a string cache 216 may buffer the incoming data to smooth out and speed up SDRAM access.
  • a DMA engine 218 may grab the frame data from server main memory and send back the BMF file.
  • FIG. 3 is a timing diagram illustrating the operation of the hardware in accordance with an embodiment of the present invention.
  • a reset 300 may be sent out by the host computer system, or by a specific application. While reset is asserted, the configuration may be loaded from ROM to the configuration registers. The text processor then may be set to a default state according to the configuration registers.
  • a start signal 302 may be sent through the PCI to indicate the beginning of a document processing cycle.
  • the PCI master may assert a frame number 304 to indicate the beginning of a write transaction.
  • the PCI master may then drive the address/data 306 to the PCI bus.
  • the PCI target interface may respond, causing the DMA to read the document into the SDRAM document buffer 308 .
  • the memory controller may be activated by PCI command. It may start processing data in the string buffer 310 . It also may send sync signals to the memory controller. The transferring and processing may be repeated.
  • the PCI target may sense a valid window to send data. Then the PCI master may assert the frame number to indicate the beginning of a read transaction 312 . The PCI target holds the bus. The DMA engine may then transfer the BMF and result register data to main memory 314 . When all the data is transferred, the PCI target interface may send an end signal to the device driver 316 . The next document processing cycle may start again with a start signal from the device driver 318 .
  • the output of the hardware is a BMF.
  • the BMF defines a binary record format that is used to describe various fields in a structured data file. It can be viewed as a two-dimensional field of bits. Each piece of useful information in the structured data file may correspond to a record in the BMF file.
  • a record may comprise a starting offset and length of a target string in the structured data file. It may also comprise the depth value, node type, and bit-wide flags. These will be discussed in more detail below.
  • the total length of a record may be an integer multiple of a 32-bit word—the width of the memory bus in most commercial architectures.
  • Two record types may be defined: a full version of 12 bytes in length, and a compressed version of 8 bytes in length. The full version may be based on the assumption that both the string length and the starting offset are 32 bits wide, whereas the compressed version may assume a 16-bit field, which translates to a maximum length of 65536 bytes.
  • FIG. 4 is a flow diagram illustrating a method for modifying the content of a target string in a BMF file from an old string to a new string in accordance with an embodiment of the present invention.
  • a piece of memory of the length of the new string may be allocated.
  • the new string may be filled into the memory.
  • a record in the BMF file corresponding to the old string may be located.
  • a corresponding bit flag for the record may be changed from relative offset to absolute offset.
  • relative offsets may be used. However, as soon as a modification is made to a specific string, it can often be difficult if not impossible to continue to track the relative offset for that string.
  • absolute offsets may be utilized for all modified strings.
  • an offset value in the BMF record for the old string may be replaced with a pointer value of the new string in memory.
  • a length field in the BMF record for the old string may be replaced with the length of the new string.
  • There are at least three types of possible BMF file modes: read-only mode, read-modify mode, and read-modify-add mode.
  • In read-only mode, records representing various types of nodes may be placed sequentially into a BMF file, leaving no empty records.
  • a leaf-level element may be represented as a record for the starting tag, 2n records (one for the property name and one for the property value) for n properties, one record for the text of the element, and finally one record for the ending tag name. The presence of the ending tag record may be used for document validation.
  • the read-modify mode may be identical to read-only mode except each record allows for limited write-access, meaning content can be altered, but not added.
  • the read-modify-add mode allows complete write-access, which is done by embedding empty records into the file.
  • the record format may be picked to efficiently represent the necessary information of the original data structure. It may also be made less efficient on purpose to speed up downstream processing.
  • FIG. 5 is a diagram illustrating a BMF record format in accordance with an embodiment of the present invention.
  • a control word 500 may be thirty-two bits in total in this embodiment. This may include a depth value 502 of sixteen bits. The depth value may indicate the depth of a tag in the hierarchy of tags in the structured data file. Thus, the first tag in a file will have a depth of zero, whereas if another starting tag appears before an ending tag for the first tag, that second starting tag will have a depth of one.
  • a content type 504 may be provided, which indicates what type of information the content is. In an embodiment of the present invention, the following value/content type pairs may be used in this field:
  • a modification indicator 506 may also be provided, which indicates whether or not the record has been modified. This is important because, as described above, if the record has been modified, then the offset field will contain the real pointer value, not a relative offset.
  • An insertion indicator 508 may indicate that data was inserted in between two existing records. Once again, this is important in determining how to utilize the offset field. If the insertion indicator is set to 1, it indicates that the offset field contains a pointer to an external piece of memory, one that can be used to add child nodes to the current node.
  • An end of document indicator 510 may indicate whether the tag is the last one in the document. This can be important because in some embodiments, ending tags may be ignored when encoding the BMF file in order to save space. Therefore, the last tag in the BMF file may not correspond to the last tag in the structured data file.
  • a current record in use field 512 may be used to indicate that a record has been deleted. If the field is set to 0, the record may be safely ignored because it has been deleted.
  • a reference bit 516 may indicate when there is an external reference, such as an “&” in a text string.
  • a length field 520 may indicate the length of the content.
  • the BMF file together with the original data in memory, completely describes the original data and its inherent data structure. Traversing the data structure may be easily accomplished using the BMF records. Higher level applications and processing are therefore facilitated by using the BMF. To make it easily accessible and readily integrated to higher level application and processing, device drivers and an application programming interface (API) may be built on top of the BMF.
  • API application programming interface
  • FIG. 6 is a flow diagram illustrating a method for efficiently processing a structured data file, the structured data file including one or more pieces of content, in accordance with an embodiment of the present invention.
  • the structured data file may be an extensible markup language file.
  • the process loops through each piece of content. In another embodiment of the present invention, the process loops through each relevant piece of content. Relevancy can be determined by the programmer and may be chosen so as to minimize the amount of space used for a BMF file.
  • a BMF record is created in a BMF file, the BMF record corresponding to the piece of content.
  • an offset may be stored in the BMF record indicating a starting position for the piece of content relative to the beginning of the structured data file.
  • a depth of the piece of content may be stored in the BMF record, the depth indicating a level in a hierarchy of tags in the structured data file.
  • a content type of the piece of content may be stored in the BMF record, the content type indicating a type of information for the piece of content.
  • the content type may take many forms, such as a starting tag, ending tag, property name, property value, text, comment, processing instruction, markup declaration name, markup declaration value, external reference, property name pair, etc.
  • a length may be stored for the piece of content in the BMF record.
  • a modification indicator for the piece of content may be stored in the BMF record, the modification indicator indicating if the BMF record has been modified and the modification indicator initially set to indicate that no modification has been made.
  • an insertion indicator for the piece of content may be stored in the BMF record, the insertion indicator indicating if the BMF record has been inserted between two existing BMF records and the insertion indicator initially set to indicate that the BMF record has not been inserted between two existing BMF records.
  • an end of document indicator for the piece of content may be stored in the BMF record, the end of document indicator indicating if the BMF record corresponds to a last piece of content in the structured data file.
  • a current record in use field may be stored for the piece of content in the BMF record, the current record in use field indicating whether the piece of content has been deleted.
  • DOM a W3C standard
  • DOM represents an XML document as a tree structure, with the elements, attributes, and text defined as nodes.
  • a node may have a single parent node, sibling nodes and child nodes.
  • the node named “B1” has a parent node named “A.” It also has two child nodes, respectively named “C1” and “C2.”
  • the “C1” node is the first child node as it appears before the “C2” node in the XML text.
  • the “B1” node also has sibling nodes named “text0”, “B2” and “B3” respectively.
  • the text node named “Text 0” is the previous sibling of the node “B1.”
  • the “B2” node is the next sibling of the “B1” node, as it appears before the “B3” node.
  • the “B3” node is the next sibling of the “B2” node.
  • the first and only child of the “C1” node is a text node named “text.”
  • DOM node types have their equivalent BMF types. For example, an element type in DOM corresponds to the starting tag. DOM, however, does not have a node type corresponding to BMF's ending tag type.
  • Without ending tags, it will be difficult to determine whether b2 is the sibling, or child, of b1. With ending tags, one can clearly tell the relationship between b1 and b2 in the above examples.
  • Example 1 the token types are:
  • Example 2 the token types are:
  • b1 is a sibling, or child, of b2
  • b1 and b2 both have the same depth value so they are siblings.
  • b1 and b2 have depth value of 1 and 2 respectively, so b2 is the child of b1.
  • ending tags can be ignored to save space.
  • the first option is to assign the starting tag of an empty element a content type that is different from a starting tag of a non-empty element.
  • XML <root><element/></root>.
  • Its corresponding BMF file may contain three BMF records: the first one for “root” as a non-empty starting tag, the second for “element” as an empty starting tag, and the third for the ending tag for “</root>.”
  • the BMF records choose to include a depth value, the ending tag's BMF record may be ignored.
  • the second option is to use the same content type for both empty and non-empty starting tag, and to insert a BMF record of a “dummy” ending tag for an empty starting tag.
  • the dummy tag can take various forms, all aiming to preserve the structural integrity of the BMF file. For example, one can insert a BMF record corresponding to a zero-length ending tag at the end of an empty element. Alternatively, one can add a BMF record for “>” or “/>” to emulate the ending tag. It is called a “dummy” because it doesn't represent a real ending tag.
  • a BMF record can contain a 32-bit descriptor which contains the reference in various forms, such as the relative index value, absolute index value or memory address, of the next sibling or first child, but not both, as there is additional storage overhead for having such descriptors.
  • the reference to the next sibling makes it possible to jump to the next sibling without scanning the BMF records between the current record and its next sibling.
  • the reference to the first child record makes it possible to jump to the first child without scanning the BMF records between the current record and first child record.
  • Some of the other possible references a BMF record can have are parent, root, previous sibling, and last child. It should be noted that the reference to a child node is actually a reference to a record corresponding to the child node, as the nodes are represented in the intermediate file as records. Likewise, the reference to a next sibling node is actually a reference to a record corresponding to the next sibling node.
  • JSON JavaScript Object Notation
  • XML and JSON are similar as both represent tree-structure and are human readable.
  • the basic textual content types in JSON are keys and values.
  • the ‘{’ and ‘}’ in JSON delimit a new level of nesting.
  • the key “popup” has a value following “:” and because there is a ‘{,’ it just indicates that there is a next level of nesting, potentially consisting of a new key value pair.
  • To extend BMF for JSON processing there needs to be content types “left brace” and “right brace” respectively corresponding to ‘{’ and ‘}.’
  • a BMF record for ‘{’ whose content type is “left brace,” a BMF record for “menu” whose content type is “key,” a BMF record for “{” whose content type is “left brace,” a BMF record for “id” whose content type is “key,” a BMF record for “file” whose content type is “value,” a BMF record for “}” whose content type is “right brace,” and a BMF record for “}” whose content type is “right brace.”
  • Table 2 summarizes the content types for the above example.
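
The record sequence listed for the JSON example can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in this application: the input text `{"menu": {"id": "file"}}` is reconstructed from the listed records, the names `Record` and `json_to_records` are hypothetical, and the scanner handles only nested objects whose keys and values are double-quoted strings without escape sequences.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """Hypothetical, simplified stand-in for a BMF record."""
    offset: int        # start of the token in the source text
    length: int        # token length in characters
    content_type: str  # "left brace", "right brace", "key", or "value"


def json_to_records(text: str) -> List[Record]:
    """Emit one record per brace, key, and string value.

    Assumption: only objects with double-quoted string keys and values
    (no escapes); arrays, numbers, and literals are out of scope.
    """
    records: List[Record] = []
    i = 0
    expect_key = False  # True right after '{' or ',', when a key comes next
    while i < len(text):
        ch = text[i]
        if ch == "{":
            records.append(Record(i, 1, "left brace"))
            expect_key = True
            i += 1
        elif ch == "}":
            records.append(Record(i, 1, "right brace"))
            expect_key = False
            i += 1
        elif ch == '"':
            end = text.index('"', i + 1)  # position of the closing quote
            kind = "key" if expect_key else "value"
            records.append(Record(i + 1, end - i - 1, kind))
            expect_key = False
            i = end + 1
        elif ch == ",":
            expect_key = True
            i += 1
        else:
            i += 1  # skip ':' and whitespace
    return records


source = '{"menu": {"id": "file"}}'
records = json_to_records(source)
types = [r.content_type for r in records]
tokens = [source[r.offset:r.offset + r.length] for r in records]
```

Each record stores only an offset and a length into the unchanged source text, so the token strings in `tokens` are recovered by slicing `source` rather than by copying content into the records.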

Abstract

The present invention provides a fast and efficient way of processing structured data by utilizing an intermediate file to store the structural information. The structured data may be processed into a Binary mask Format (BMF) file which may serve as a starting point for post-processing. A tree structure built on top of the BMF file may be constructed very quickly, and also takes up less space than a DOM tree. Additionally, BMF records may reside entirely in the memory and contain structural information, allowing SAX-like sequential data access.
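
As a rough illustration of the idea summarized above, the following Python sketch scans a small XML string and emits one record per starting tag, ending tag, and text run, each holding an offset, a length, a depth, and a content type. It is a software sketch under stated assumptions, not the hardware text processor the application describes: the names `Record` and `scan` are hypothetical, attributes, comments, and empty-element tags are not handled, and the depth given to text and ending-tag records is a choice made here for illustration.

```python
from typing import List, NamedTuple

START_TAG, END_TAG, TEXT = "starting tag", "ending tag", "text"


class Record(NamedTuple):
    """Hypothetical, simplified stand-in for a BMF record."""
    offset: int        # position relative to the beginning of the document
    length: int        # length of the tag name or text run
    depth: int         # level in the hierarchy of tags (first tag is 0)
    content_type: str


def scan(doc: str) -> List[Record]:
    """Single sequential pass over the text; the source is never copied."""
    records: List[Record] = []
    depth = 0  # depth that the next starting tag will receive
    i = 0
    while i < len(doc):
        if doc[i] == "<":
            close = doc.index(">", i)
            if doc[i + 1] == "/":
                # Assumption: an ending tag shares the depth of its starting tag.
                depth -= 1
                records.append(Record(i + 2, close - i - 2, depth, END_TAG))
            else:
                records.append(Record(i + 1, close - i - 1, depth, START_TAG))
                depth += 1
            i = close + 1
        else:
            nxt = doc.find("<", i)
            if nxt == -1:
                nxt = len(doc)
            if doc[i:nxt].strip():  # ignore whitespace-only runs
                # Assumption: text sits one level below its enclosing element.
                records.append(Record(i, nxt - i, depth, TEXT))
            i = nxt
    return records


doc = "<a><b1>hi</b1><b2></b2></a>"
records = scan(doc)
names = [doc[r.offset:r.offset + r.length] for r in records]
depths = [r.depth for r in records]
```

Iterating over `records` in order gives SAX-like sequential access, while the stored depth values keep enough structure to tell siblings from children: here the starting tags for `b1` and `b2` both have depth 1, so they are siblings under `a`.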

Description

  • CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 11/777,110, filed Jul. 12, 2007, entitled “PROCESSING STRUCTURED DATA,” which is a continuation-in-part of U.S. patent application Ser. No. 11/581,211, filed Oct. 13, 2006, now U.S. Pat. No. 7,761,459, entitled “PROCESSING STRUCTURED DATA,” which is a continuation-in-part of U.S. patent application Ser. No. 10/272,077, filed Oct. 15, 2002, now U.S. Pat. No. 7,133,857, issued Nov. 7, 2006, entitled “PROCESSING STRUCTURED DATA,” all of which are hereby incorporated by reference.
  • FIELD OF THE INVENTION
  • The present invention relates to the field of structured data files in computer systems. More specifically, the present invention relates to the processing of structured data in an efficient manner.
  • BACKGROUND OF THE INVENTION
  • Structured data represents a large portion of the information accessed on the Internet and other computer networks. There are several reasons why structured data is so popular. American Standard Code for Information Interchange (ASCII) and its extensions, such as Unicode Transformation Formats UTF-8 and UTF-16, are among the most common standard encoding formats. Text encoding puts information into a format that is easily readable by a human, so it is easy for programmers to develop and debug applications. Lastly, textual encoding is extensible, and adding new information may be as simple as adding a new key-value pair.
  • Recently, Extensible Markup Language (XML) has been growing in popularity. XML is a markup language for documents containing structured information. Unlike its predecessor, Hypertext Markup Language (HTML), where tags are used to instruct a web browser how to render data, in XML the tags are designed to describe the data fields themselves. XML, therefore, provides a facility to define tags and the structural relationships between them. This allows a great deal of flexibility in defining markup languages for using information. Because XML is not designed to do anything other than describe what the data is, it serves as the perfect data interchange format.
  • XML, however, is not without its drawbacks. Compared with other data formats, XML can be very verbose. Processing an XML file can be very CPU and memory intensive, severely degrading overall application performance. Additionally, XML suffers many of the same problems that other software-based text-based processing methods have. Modern processors prefer binary data representations, particularly ones that fit the width of the registers, over text-based representations. Furthermore, the architecture of many general-purpose processors trades performance for programmability, thus making them ill-suited for text processing. Lastly, the efficient parsing of structured text, no matter the format, can present a challenge because of the added steps required to handle the structural elements.
  • Most current XML parsers are software-based solutions that follow either the Document Object Model (DOM) or Simple API for XML (SAX) technologies. DOM parsers convert an XML document into an in-memory hierarchical representation (known as a DOM tree), which can later be accessed and manipulated by programmers through a standard interface. SAX parsers, on the other hand, treat an XML document as a stream of characters. SAX is event-driven, meaning that the programmer specifies an event that may happen, and if that event occurs, SAX gets control and handles the situation.
  • In general, DOM and SAX are complementary, not competing, XML processing models, each with its own benefits and drawbacks. DOM programming is programmer-friendly, as the processing phase is separate from application logic. Additionally, because the data resides in the memory, repetitive access is fast and flexible. However, DOM requires that the entire document data structure, usually occupying 7-10 times the size of the original XML document, be loaded into the memory, thus making it impractical for large XML documents. SAX, on the other hand, can be efficient in parsing large XML documents (at least when only small amounts of information need to be processed at once), but it maintains little of the structural information of the XML data, putting more of a burden on programmers and resulting in code that is hardwired, bulky, and difficult to maintain.
  • What is needed is an application program interface (API) that combines the best attributes of both DOM and SAX parsing.
  • BRIEF DESCRIPTION OF THE INVENTION
  • The present invention provides a fast and efficient way of processing structured data by utilizing an intermediate file to store the structural information. The structured data may be processed into a Binary mask Format (BMF) file which may serve as a starting point for post-processing. A tree structure built on top of the BMF file may be constructed very quickly, and also takes up less space than a DOM tree. Additionally, BMF records may reside entirely in the memory and contain structural information, allowing SAX-like sequential data access.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate one or more embodiments of the present invention and, together with the detailed description, serve to explain the principles and implementations of the invention.
  • In the drawings:
  • FIG. 1 is a block diagram illustrating a layer view of an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention.
  • FIG. 3 is a timing diagram illustrating the operation of the hardware in accordance with an embodiment of the present invention.
  • FIG. 4 is a flow diagram illustrating a method for modifying the content of a target string in a BMF file from an old string to a new string in accordance with an embodiment of the present invention.
  • FIG. 5 is a diagram illustrating a BMF record format in accordance with an embodiment of the present invention.
  • FIG. 6 is a flow diagram illustrating a method for efficiently processing a structured data file, the structured data file including one or more pieces of content, in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • Embodiments of the present invention are described herein in the context of a system of computers, servers, and software. Those of ordinary skill in the art will realize that the following detailed description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the present invention will readily suggest themselves to such skilled persons having the benefit of this disclosure. Reference will now be made in detail to implementations of the present invention as illustrated in the accompanying drawings. The same reference indicators will be used throughout the drawings and the following detailed description to refer to the same or like parts.
  • In the interest of clarity, not all of the routine features of the implementations described herein are shown and described. It will, of course, be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with application- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of this disclosure.
  • In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, computing platforms, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein.
  • For purposes of this disclosure, a structured data file is any file containing content as well as some information regarding the structural organization of the content. The present invention provides a fast and efficient way of processing structured data by utilizing an intermediate file to store the structural information. The structured data may be processed into a Binary mask Format (BMF) file which may serve as a starting point for post-processing. A tree structure built on top of the BMF file may be constructed very quickly, and also takes up less space than a DOM tree. Additionally, BMF records may reside entirely in the memory and contain structural information, allowing SAX-like sequential data access. However, while this document will describe advantages that the present invention provides over DOM or SAX, one of ordinary skill in the art will recognize that the present invention need not be limited to replacing DOM or SAX, and can be expanded to non-XML type processing.
  • FIG. 1 is a block diagram illustrating a layer view of an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention. The apparatus may comprise three layers. A hardware text processing accelerator 100, occupying the lowest layer, may offer the horsepower necessary to relieve the central processing unit (CPU) from the most processor intensive part of the task. On top of the hardware text processing accelerator 100 may lie a device driver layer 102 that is responsible for the communication between the hardware text processing accelerator 100 and a software layer 104. The software layer 104 may be designed to offer maximum flexibility and further improve the performance. It may export APIs that are standard-compliant.
  • The hardware may be designed such that it may quickly match multiple patterns against an incoming data stream. FIG. 2 is a block diagram illustrating an apparatus for efficiently processing structured data in accordance with an embodiment of the present invention. A text processor 200 may be the core of the accelerator. It may be composed of multiple Finite State Machines (FSMs) that process an incoming document in parallel. The output may be the BMF file. It also may set several result registers (not pictured). A PCI interface 202 may handle all handshaking between the hardware and a server PCI bus 204. A memory controller 206 may receive commands from the PCI interface 202 and convert the PCI address to on-board memory address space. It also may access the board memory accordingly. Configuration registers 208 may determine the configuration of the text processing pipeline and the organization of the memory controller 206. It may load default values from configuration ROM 210. Some of the values may be modified by software through the PCI interface 202. The Configuration ROM 210 may store the default setting of the text processor configuration. It also may store the configuration map of FPGAs.
  • A document buffer 212 may store the incoming document. This may be a Synchronous Dynamic Random Access Memory (SDRAM). Paging may be utilized if the incoming document is larger than the total buffer size. A BMF buffer 214 may store the output BMF files, together with several other text processor result register values. This may be a separate SDRAM, although in one embodiment of the present invention it may share a single SDRAM with the document buffer. A string cache 216 may buffer the incoming data to smooth out and speed up SDRAM access. A DMA engine 218 may grab the frame data from server main memory and send back the BMF file.
  • FIG. 3 is a timing diagram illustrating the operation of the hardware in accordance with an embodiment of the present invention. A reset 300 may be sent out by the host computer system, or by a specific application. While reset is asserted, the configuration may be loaded from ROM to the configuration registers. The text processor then may be set to a default state according to the configuration registers. When software calls the device driver, a start signal 302 may be sent through the PCI to indicate the beginning of a document processing cycle. Then the PCI master may assert a frame number 304 to indicate the beginning of a write transaction. The PCI master may then drive the address/data 306 to the PCI bus. The PCI target interface may respond, causing the DMA to read the document into the SDRAM document buffer 308. There may also be certain PCI commands reserved to update the configuration registers. The memory controller may be activated by PCI command. It may start processing data in the string buffer 310. It also may send sync signals to the memory controller. The transferring and processing may be repeated.
  • The PCI target may sense a valid window to send data. Then the PCI master may assert the frame number to indicate the beginning of a read transaction 312. The PCI target holds the bus. The DMA engine may then transfer the BMF and result register data to main memory 314. When all the data is transferred, the PCI target interface may send an end signal to the device driver 316. The next document processing cycle may start again with a start signal from the device driver 318.
  • The output of the hardware is a BMF file. In one embodiment of the present invention, the BMF defines a binary record format that is used to describe various fields in a structured data file. It can be viewed as a two-dimensional field of bits. Each piece of useful information in the structured data file may correspond to a record in the BMF file. A record may comprise a starting offset and length of a target string in the structured data file. It may also comprise the depth value, node type, and bit-wide flags. These will be discussed in more detail below. The total length of a record may be an integer multiple of a 32-bit word, the width of the memory bus in most commercial architectures. Two record types may be defined: a full version of 12 bytes in length, and a compressed version of 8 bytes in length. The full version may be based on the assumption that both the string length and the starting offset are 32 bits wide, whereas the compressed version may assume a 16-bit field, which translates to a maximum length of 65,535 bytes.
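  • The two record sizes can be illustrated with a minimal Python sketch. The field order (offset, length, then a 32-bit word holding depth, type, and flags), the little-endian byte order, and the reading that both offset and length shrink to 16 bits in the compressed version are assumptions made for illustration; the description above fixes only the overall sizes of 12 and 8 bytes.

```python
import struct

# Assumed field order: offset, length, then a 32-bit control word.
FULL = struct.Struct("<III")        # 32-bit offset and length: 12 bytes
COMPRESSED = struct.Struct("<HHI")  # 16-bit offset and length: 8 bytes


def pack_record(offset, length, control):
    """Use the compressed layout when both offset and length fit in 16 bits."""
    if offset <= 0xFFFF and length <= 0xFFFF:
        return COMPRESSED.pack(offset, length, control)
    return FULL.pack(offset, length, control)


small = pack_record(42, 18, 0)
large = pack_record(70_000, 18, 0)
print(len(small), len(large))
```

    Both sizes are integer multiples of a 32-bit word. A real file would also need some way to signal which record type is in use, which this sketch leaves out.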
  • FIG. 4 is a flow diagram illustrating a method for modifying the content of a target string in a BMF file from an old string to a new string in accordance with an embodiment of the present invention. At 400, a piece of memory of the length of the new string may be allocated. At 402, the new string may be filled into the memory. At 404, a record in the BMF file corresponding to the old string may be located. At 406, a corresponding bit flag for the record may be changed from relative offset to absolute offset. The first time a file is converted to BMF form, relative offsets may be used. However, as soon as a modification is made to a specific string, it can often be difficult if not impossible to continue to track the relative offset for that string. Therefore, absolute offsets may be utilized for all modified strings. At 408, an offset value in the BMF record for the old string may be replaced with a pointer value of the new string in memory. At 410, a length field in the BMF record for the old string may be replaced with the length of the new string.
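  • Steps 400 through 410 can be sketched in Python as follows. Python has no raw pointers, so a list named heap stands in for allocated memory and a list index stands in for the pointer value; the Record class, the modify and text_of helpers, and the sample document are all hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Record:
    offset: int             # relative offset into the document, or a heap index once modified
    length: int
    modified: bool = False  # False: relative offset; True: absolute reference


def modify(records, heap, document, old, new):
    """Apply steps 400-410 to the first unmodified record whose string equals `old`."""
    heap.append(new)                       # 400/402: allocate memory and fill in the new string
    for rec in records:                    # 404: locate the record for the old string
        if not rec.modified and document[rec.offset:rec.offset + rec.length] == old:
            rec.modified = True            # 406: flag switches from relative to absolute
            rec.offset = len(heap) - 1     # 408: offset now refers to the new string
            rec.length = len(new)          # 410: length of the new string
            return rec
    heap.pop()                             # nothing matched; release the allocation
    raise KeyError(old)


def text_of(rec, heap, document):
    """Resolve a record to its string, honoring the modification flag."""
    if rec.modified:
        return heap[rec.offset]
    return document[rec.offset:rec.offset + rec.length]


document = "<a>old</a>"
records = [Record(1, 1), Record(3, 3)]  # the tag name "a" and the text "old"
heap = []
modify(records, heap, document, "old", "brand new")
print(text_of(records[1], heap, document))
```

    The original document is never rewritten; only the record changes, which is why the flag is needed to interpret the offset field.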
  • There are at least three types of possible BMF file modes: read-only mode, read-modify mode, and read-modify-add mode. In read-only mode, records representing various types of nodes may be placed sequentially into a BMF file, leaving no empty records. For example, a leaf-level element may be represented as a record for the starting tag, 2n records (one for the property name and one for the property value) for n properties, one record for the text of the element, and finally one record for the ending tag name. The presence of the ending tag record may be used for document validation.
  • The read-modify mode may be identical to read-only mode except each record allows for limited write-access, meaning content can be altered, but not added.
  • The read-modify-add mode allows complete write-access, which is done by embedding empty records into the file.
  • The record format may be picked to efficiently represent the necessary information of the original data structure. It may also be made less efficient on purpose to speed up downstream processing.
  • FIG. 5 is a diagram illustrating a BMF record format in accordance with an embodiment of the present invention. A control word 500 may be thirty-two bits in total in this embodiment. This may include a depth value 502 of sixteen bits. The depth value may indicate the depth of a tag in the hierarchy of tags in the structured data file. Thus, the first tag in a file will have a depth of zero, whereas if another starting tag appears before an ending tag for the first tag, that second starting tag will have a depth of one. A content type 504 may be provided, which indicates what type of information the content is. In an embodiment of the present invention, the following value/content type pairs may be used in this field:
  • TABLE 1
    Content Types and Corresponding Values
    Content Type Value  Content name                    Example
    0                   Starting Tag                    <example>
    1                   Ending Tag                      </example>
    2                   Property Name                   <example property1="this">
    3                   Property Value                  <example property2="that">
    4                   Text                            <example> tasty fruit </example>
    5                   Comment                         <!-- this is a comment -->
    6                   Processing Instruction          <? ...... ?>
    7                   Markup declaration I name       <![CDATA[...<<<>>>...]]>
    8                   Markup declaration I value      <![CDATA[...<<<>>>...]]>
    9                   Markup declaration II name      <!ENTITY ...>
    10                  Markup declaration II value     <!ENTITY ...>
    11                  Entity reference                &example.bib;
    12                  Property Name Value Pair        <example property1="this">
    13                  Starting tag for empty element  <example/>
  • A modification indicator 506 may also be provided, which indicates whether or not the record has been modified. This is important because, as described above, if the record has been modified, then the offset field will contain the real pointer value, not a relative offset. An insertion indicator 508 may indicate that data was inserted in between two existing records. Once again, this is important in determining how to utilize the offset field. If the insertion indicator is set to 1, it indicates that the offset field contains a pointer to an external piece of memory, one that can be used to add child nodes to the current node. An end of document indicator 510 may indicate whether the tag is the last one in the document. This can be important because in some embodiments, ending tags may be ignored when encoding the BMF file in order to save space. Therefore, the last tag in the BMF file may not correspond to the last tag in the structured data file.
  • A current record in use field 512 may be used to indicate that a record has been deleted. If the field is set to 0, the record may be safely ignored because it has been deleted. A name space indicator 514 may indicate whether or not there is a name space within the token (which may be represented by a “:” sign). A reference bit 516 may indicate when there is an external reference, such as an “&” in a text string.
  • There may be one or more reserved bits 518, which are set aside for future uses. Lastly, a length field 520 may indicate the length of the content.
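  • The fields described above can be packed into a single 32-bit word, as in the Python sketch below. The bit widths are an assumption borrowed from the column widths of the worked example later in this description (5-bit depth, 4-bit type, seven 1-bit fields, 16-bit length) rather than from FIG. 5, and the field names and bit ordering are hypothetical.

```python
FIELDS = [  # (name, width in bits), most significant first; widths are assumed
    ("depth", 5), ("content_type", 4), ("modified", 1), ("inserted", 1),
    ("end_of_document", 1), ("in_use", 1), ("namespace", 1),
    ("reference", 1), ("reserved", 1), ("length", 16),
]


def pack_control(**values):
    """Pack named field values into one integer, rejecting values that overflow."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_control(word):
    """Split a packed word back into a dict of named fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


word = pack_control(depth=1, content_type=0, in_use=1, namespace=1, length=16)
fields = unpack_control(word)
print(f"{word:032b}", fields["depth"], fields["length"])
```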
  • The BMF file, together with the original data in memory, completely describes the original data and its inherent data structure. Traversing the data structure may be easily accomplished using the BMF records. Higher level applications and processing are therefore facilitated by using the BMF. To make it easily accessible and readily integrated to higher level application and processing, device drivers and an application programming interface (API) may be built on top of the BMF.
  • FIG. 6 is a flow diagram illustrating a method for efficiently processing a structured data file, the structured data file including one or more pieces of content, in accordance with an embodiment of the present invention. The structured data file may be an extensible markup language file. The process loops through each piece of content. In another embodiment of the present invention, the process loops through each relevant piece of content. Relevancy can be determined by the programmer and may be chosen so as to minimize the amount of space used for a BMF file. At 600, a BMF record is created in a BMF file, the BMF record corresponding to the piece of content. At 602, an offset may be stored in the BMF record indicating a starting position for the piece of content relative to the beginning of the structured data file. At 604, a depth of the piece of content may be stored in the BMF record, the depth indicating a level in a hierarchy of tags in the structured data file. At 606, a content type of the piece of content may be stored in the BMF record, the content type indicating a type of information for the piece of content. The content type may take many forms, such as a starting tag, ending tag, property name, property value, text, comment, processing instruction, markup declaration name, markup declaration value, external reference, property name pair, etc. At 608, a length may be stored for the piece of content in the BMF record.
  • At 610, a modification indicator for the piece of content may be stored in the BMF record, the modification indicator indicating if the BMF record has been modified and the modification indicator initially set to indicate that no modification has been made. At 612, an insertion indicator for the piece of content may be stored in the BMF record, the insertion indicator indicating if the BMF record has been inserted between two existing BMF records and the insertion indicator initially set to indicate that the BMF record has not been inserted between two existing BMF records. At 614, an end of document indicator for the piece of content may be stored in the BMF record, the end of document indicator indicating if the BMF record corresponds to a last piece of content in the structured data file. At 616, a current record in use field may be stored for the piece of content in the BMF record, the current record in use field indicating whether the piece of content has been deleted.
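  • A minimal sketch of steps 600 through 608 is given below in Python. It handles only a small assumed subset of XML (tags without attributes, plus text), assigns text the depth of its enclosing element, and omits the indicators of steps 610 through 616; the Record type, the TOKEN pattern, and the scan function are hypothetical names, and the content type values are those of Table 1.

```python
import re
from typing import NamedTuple

START, END, TEXT = 0, 1, 4  # content type values taken from Table 1


class Record(NamedTuple):
    offset: int  # 602: position relative to the beginning of the file
    depth: int   # 604: level in the hierarchy of tags
    ctype: int   # 606: content type
    length: int  # 608: length of the content


# A tag without attributes, or a run of text up to the next "<".
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>|([^<]+)")


def scan(doc):
    records, depth, pos = [], 0, 0
    while pos < len(doc):
        m = TOKEN.match(doc, pos)
        if m is None:
            raise ValueError(f"unsupported markup at offset {pos}")
        if m.group(3) is not None:              # text; whitespace-only runs are skipped
            if m.group(3).strip():
                records.append(Record(m.start(3), max(depth - 1, 0), TEXT, len(m.group(3))))
        elif m.group(1):                        # ending tag
            depth -= 1
            records.append(Record(m.start(2), depth, END, len(m.group(2))))
        else:                                   # starting tag
            records.append(Record(m.start(2), depth, START, len(m.group(2))))
            depth += 1
        pos = m.end()
    return records


doc = "<a><b1>hi</b1><b2></b2></a>"
for rec in scan(doc):
    print(rec, doc[rec.offset:rec.offset + rec.length])
```

    Each record stores only an offset and a length, so the content itself stays in the original document and is recovered by slicing.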
  • The following example may be used to illustrate an embodiment of the present invention. One of ordinary skill in the art will recognize that this is merely an example and should not be read to be limiting in any way. Suppose an XML file as follows:
  • <?xml version="1.0" encoding="US-ASCII"?>
    <benchmark:database
    xmlns:benchmark="http://example.com/xml/benchmark">
    <benchmark:person id="012345">
    <benchmark:email name="Name012345" />
    <!--Edited with XML spy v4.2 -->
    <benchmark:line1>L i n e 1 012345 012345</benchmark:line1>
    </benchmark:person>
    </benchmark:database>
  • An embodiment of the present invention may ignore ending tags and produce the following BMF file:
  • Starting  Depth  Token  Modify     Insertion  End of document  Current record  Name space  Reference  Unused  Length
    offset           type   indicator  indicator  indicator        in use          indicator   indicator
    32 bit    5 bit  4 bit  1 bit      1 bit      1 bit            1 bit           1 bit                          16 bit
    2         0      6      0          0          0                1               0           0          0       38
    42        0      0      0          0          0                1               1           0          0       18
    61        0      2      0          0          0                1               1           0          0       14
    78        0      3      0          0          0                1               0           0          0       35
    116       1      0      0          0          0                1               1           0          0       16
    133       1      2      0          0          0                1               0           0          0       2
    137       1      3      0          0          0                1               0           0          0       6
    147       2      0      0          0          0                1               1           0          0       15
    163       2      2      0          0          0                1               0           0          0       4
    169       2      3      0          0          0                1               0           0          0       10
    185       1      5      0          0          0                1               0           0          0       25
    218       2      0      0          0          0                1               1           0          0       15
    234       2      4      0          0          0                1               0           0          0       23
    0         0      0      0          0          1                1               0           0          0       0

    The packed BMF records are:
  • 00000000000000000000000000000010 00000 0110 0 0 0 1 0 0 0 00100110
    00000000000000000000000000101010 00000 0000 0 0 0 1 1 0 0 00010010
    00000000000000000000000000111101 00000 0010 0 0 0 1 1 0 0 00001110
    00000000000000000000000001001110 00001 0011 0 0 0 1 0 0 0 00100011
    00000000000000000000000001110100 00001 0000 0 0 0 1 1 0 0 00010000
    00000000000000000000000010000101 00001 0010 0 0 0 1 0 0 0 00000010
    00000000000000000000000010001001 00010 0011 0 0 0 1 0 0 0 00000110
    00000000000000000000000010010011 00010 0000 0 0 0 1 1 0 0 00001111
    00000000000000000000000010100011 00010 0010 0 0 0 1 0 0 0 00000100
    00000000000000000000000010101001 00001 0011 0 0 0 1 0 0 0 00001010
    00000000000000000000000010111001 00010 0101 0 0 0 1 0 0 0 00011001
    00000000000000000000000011011010 00010 0000 0 0 0 1 1 0 0 00001111
    00000000000000000000000011101010 00010 0100 0 0 0 1 0 0 0 00010111
    00000000000000000000000000000000 00000 0000 0 1 1 1 0 0 0 00000000
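  • A line of the listing above can be read back into named fields with a few lines of Python. The field names are hypothetical labels for the eleven columns of the preceding table, and the sketch decodes the printed bit groups as they appear rather than a real binary file.

```python
COLUMN_NAMES = ["offset", "depth", "content_type", "modified", "inserted",
                "end_of_document", "in_use", "namespace", "reference",
                "unused", "length"]


def decode(line):
    """Turn one space-separated line of bit groups into named integer fields."""
    groups = line.split()
    if len(groups) != len(COLUMN_NAMES):
        raise ValueError("expected 11 bit groups")
    return {name: int(bits, 2) for name, bits in zip(COLUMN_NAMES, groups)}


# Second record of the listing: the starting tag "benchmark:database".
rec = decode("00000000000000000000000000101010 00000 0000 0 0 0 1 1 0 0 00010010")
print(rec["offset"], rec["length"], rec["namespace"])
```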
  • Currently, DOM (a W3C standard) is well-defined and the most widely used representation of XML's inherent hierarchy. DOM represents an XML document as a tree structure, with the elements, attributes, and text defined as nodes. A node may have a single parent node, sibling nodes and child nodes. For example, consider the following XML snippet:
  • <A>text0 <B1 attrName="val"><C1>text1</C1><C2>text2
    </C2></B1><B2></B2><B3></B3></A>
  • The node named “B1” has a parent node named “A.” It also has two child nodes, respectively named “C1” and “C2.” The “C1” node is the first child node as it appears before the “C2” node in the XML text. The “B1” node also has sibling nodes named “text0,” “B2,” and “B3” respectively. The text node named “text0” is the previous sibling of the “B1” node. The “B2” node is the next sibling of the “B1” node, as it appears before the “B3” node. By the same token, the “B3” node is the next sibling of the “B2” node. Also, the first and only child of the “C1” node is a text node named “text1.”
  • DOM treats attribute nodes differently. In the XML snippet shown above, the “B1” node doesn't treat its attribute named “attrName” as its child.
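  • These relationships can be checked with Python's xml.dom.minidom, used here only as one convenient DOM implementation. The snippet is the one shown above with the line break inside “C2” removed, and the variable names are illustrative.

```python
import xml.dom.minidom

SNIPPET = ('<A>text0 <B1 attrName="val"><C1>text1</C1><C2>text2</C2></B1>'
           '<B2></B2><B3></B3></A>')

dom = xml.dom.minidom.parseString(SNIPPET)
b1 = dom.getElementsByTagName("B1")[0]

parent = b1.parentNode.nodeName                  # the "A" element
children = [c.nodeName for c in b1.childNodes]   # C1 and C2 only
previous = b1.previousSibling.data               # the text node before B1
following = b1.nextSibling.nodeName              # B2
attr_names = list(b1.attributes.keys())          # attributes live outside childNodes
print(parent, children, repr(previous), following, attr_names)
```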
  • Many DOM node types have their equivalent BMF types. For example, an element type in DOM corresponds to the starting tag. DOM, however, does not have a node type corresponding to BMF's ending tag type.
  • A BMF file completely describes the inherent structure of the data file: one can navigate the document by scanning across the BMF records and keeping track of their token types. The records do not need any additional descriptors to identify siblings, children, or parents. The inclusion of the ending tag as a type is important. DOM resorts to various pointers and complex data structures to maintain the hierarchical information of XML, and does not have a node type corresponding to the ending tag. SAX returns ending tags of XML, but discards them by default. In contrast, a BMF file maintains the ending tag in memory as a record so the structure information of an XML file is unambiguous. Consider the following examples:
  • Example 1:
    • <a><b1></b1><b2></b2></a>
  • Example 2:
    • <a><b1><b2></b2></b1></a>
  • If the ending tags are missing, the corresponding BMF files have identical record types:
    • Starting tag for a
    • Starting tag for b1
    • Starting tag for b2
  • Without ending tags, it will be difficult to determine whether b2 is the sibling, or child, of b1. With ending tags, one can clearly tell the relationship between b1 and b2 in the above examples.
  • In Example 1, the token types are:
    • Starting tag for a
    • Starting tag for b1
    • Ending tag for b1
    • Starting tag for b2
    • Ending tag for b2
    • Ending tag for a
  • In Example 2, the token types are:
    • Starting tag for a
    • Starting tag for b1
    • Starting tag for b2
    • Ending tag for b2
    • Ending tag for b1
    • Ending tag for a
  • To tell whether b2 is a sibling, or child, of b1, one can calculate the depth value of each tag. In Example 1, b1 and b2 have the same depth value, so they are siblings. In Example 2, b1 and b2 have depth values of 1 and 2 respectively, so b2 is the child of b1.
  • When the depth value is included in the BMF records, ending tags can be ignored to save space.
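  • The following Python sketch shows why the depth value alone is sufficient: given only the depths of the starting-tag records in document order, the parent of every record can be recovered. The function name and the use of -1 for the root are assumptions made for this illustration.

```python
def parents_from_depths(depths):
    """Return, for each record, the index of its parent record (-1 for the root).

    `depths` lists the depth value of each starting-tag record in document order.
    """
    parents, last_at_depth = [], {}
    for index, depth in enumerate(depths):
        # The parent is the most recent record one level up.
        parents.append(last_at_depth.get(depth - 1, -1))
        last_at_depth[depth] = index
    return parents


example1 = parents_from_depths([0, 1, 1])  # a, b1, b2 with b1 and b2 as siblings
example2 = parents_from_depths([0, 1, 2])  # a, b1, b2 with b2 nested in b1
print(example1, example2)
```

    In Example 1 both b1 and b2 report a as their parent; in Example 2 the parent of b2 is b1, even though no ending-tag records were consulted.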
  • For BMF records to maintain structural information of XML documents containing empty elements (elements having no content, denoted by a specially defined starting tag that indicates an empty element), there are at least two options.
  • The first option is to assign the starting tag of an empty element a content type that is different from a starting tag of a non-empty element. Consider the following XML: <root><element/></root>. Its corresponding BMF file may contain three BMF records: the first one for “root” as a non-empty starting tag, the second for “element” as an empty starting tag, and the third for the ending tag for “</root>.” When the BMF records choose to include a depth value, the ending tag's BMF record may be ignored.
  • The second option is to use the same content type for both empty and non-empty starting tags, and to insert a BMF record of a “dummy” ending tag for an empty starting tag. The dummy tag can take various forms, all aiming to preserve the structural integrity of the BMF file. For example, one can insert a BMF record corresponding to a zero-length ending tag at the end of an empty element. Alternatively, one can add a BMF record for “>” or “/>” to emulate the ending tag. It is called a “dummy” because it doesn't represent a real ending tag.
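  • The two options can be related by a short Python sketch that rewrites records of the first form into the second. Records are reduced to hypothetical (content type, offset, length) tuples, the type values follow Table 1, and placing the zero-length dummy record immediately after the element name is one arbitrary choice among the forms mentioned above.

```python
START, END, EMPTY_START = 0, 1, 13  # content type values from Table 1


def add_dummy_end_tags(records):
    """Rewrite (ctype, offset, length) records from the first option to the second."""
    out = []
    for ctype, offset, length in records:
        if ctype == EMPTY_START:
            out.append((START, offset, length))
            out.append((END, offset + length, 0))  # zero-length dummy ending tag
        else:
            out.append((ctype, offset, length))
    return out


# Records for <root><element/></root> under the first option.
option1 = [(START, 1, 4), (EMPTY_START, 7, 7), (END, 18, 4)]
option2 = add_dummy_end_tags(option1)
print(option2)
```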
  • In some cases, it would be beneficial to have some additional way to speed up the traversal of document structure. For example, a BMF record can contain a 32-bit descriptor which contains the reference in various forms, such as the relative index value, absolute index value or memory address, of the next sibling or first child, but not both, as there is additional storage overhead for having such descriptors.
  • The reference to the next sibling makes it possible to jump to the next sibling without scanning the BMF records between the current record and its next sibling. The reference to the first child record makes it possible to jump to the first child without scanning the BMF records between the current record and first child record. Some of the other possible references a BMF record can have are parent, root, previous sibling, and last child. It should be noted that the reference to a child node is actually a reference to a record corresponding to the child node, as the nodes are represented in the intermediate file as records. Likewise, the reference to a next sibling node is actually a reference to a record corresponding to the next sibling node.
  • When a record does not have a sibling, it is convenient to use some constant value to denote the absence of the sibling. That constant value can be thought of as a special reference value. For example, a constant value of zero in the descriptor field could be interpreted to mean that there is no sibling or child, depending on the actual usage of the descriptor.
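  • A next-sibling descriptor of this kind can be computed from depth values alone, as in the Python sketch below. It assumes the records are starting-tag records in document order, uses absolute index values as the reference form, and uses zero as the constant for “no sibling” (safe here because record 0 is the root and can never be a next sibling). The function names are hypothetical.

```python
NO_SIBLING = 0  # constant denoting the absence of a next sibling


def next_sibling_refs(depths):
    """Return the absolute index of each record's next sibling, or NO_SIBLING."""
    refs = [NO_SIBLING] * len(depths)
    open_at_depth = {}  # depth -> most recent record still awaiting a sibling
    for index, depth in enumerate(depths):
        # Records deeper than the current one belong to a subtree that has ended.
        for deeper in [d for d in open_at_depth if d > depth]:
            del open_at_depth[deeper]
        if depth in open_at_depth:
            refs[open_at_depth[depth]] = index
        open_at_depth[depth] = index
    return refs


def children(depths, refs, parent):
    """List a record's children by jumping along sibling references."""
    first = parent + 1
    if first >= len(depths) or depths[first] != depths[parent] + 1:
        return []
    out = [first]
    while refs[out[-1]] != NO_SIBLING:
        out.append(refs[out[-1]])
    return out


# <a><b1><c/></b1><b2><c/></b2></a>
depths = [0, 1, 2, 1, 2]
refs = next_sibling_refs(depths)
print(refs, children(depths, refs, 0))
```

    Listing the children of the root visits only records 1 and 3; the nested records in between are skipped rather than scanned.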
  • The concept outlined in this specification can also be applied to processing JSON (JavaScript Object Notation). JSON was invented to allow web browsers to exchange data structures easily, as a JSON string is supported by browsers by default, for example through JavaScript's eval( ).
  • XML and JSON are similar as both represent tree-structure and are human readable. The basic textual content types in JSON are keys and values. Consider the following XML file:
  • <menu id="file" value="File">
    <popup>
    <menuitem><value>New</value><onclick>CreateNewDoc( )</onclick>
    </menuitem>
    <menuitem><value>Open</value><onclick>OpenDoc( )</onclick> </menuitem>
    <menuitem><value>Close</value><onclick>CloseDoc( )</onclick></menuitem>
    </popup>
    </menu>
  • The equivalent JSON representation is shown below:
  • {"menu": {
    "id": "file",
    "value": "File",
    "popup": {
    "menuitem": [
    {"value": "New", "onclick": "CreateNewDoc( )"},
    {"value": "Open", "onclick": "OpenDoc( )"},
    {"value": "Close", "onclick": "CloseDoc( )"}
    ]
    }
    }}
  • The ‘{’ and ‘}’ in JSON delimit a new level of nesting. For example, in the JSON example above, the key “popup” is followed by “:” and then a ‘{,’ which indicates that there is a next level of nesting, potentially consisting of new key-value pairs. To extend BMF for JSON processing, there need to be content types “left brace” and “right brace,” corresponding respectively to ‘{’ and ‘}.’ Consider the JSON file below: {"menu": {"id": "file"}}
  • To create its corresponding BMF file, one inserts a BMF record for ‘{’ whose content type is left brace, a BMF record for the “menu” whose content type is “key,” a BMF record for “{” whose content type is “left brace,” a BMF record for “id” whose content type is “key,” a BMF record for “file” whose content type is “value,” a BMF record for “}” whose content type is “right brace,” and a BMF record for “}” whose content type is “right brace.” Table 2 summarizes the content types for the above example.
  • TABLE 2
    Content types and values for the JSON files
    JSON Content Value  JSON Content Name  Example
    0                   Left brace         {"id": "file"}
    1                   Right brace        {"id": "file"}
    2                   Key                {"id": "file"}
    3                   Value              {"id": "file"}
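  • The record sequence described above can be reproduced with a small Python scanner. It is a sketch for an assumed subset of JSON (nested objects whose keys and values are strings; arrays, numbers, and literals are rejected), and the scan_json name and the (content type, offset, length) tuple form are hypothetical. The content type values are those of Table 2.

```python
import re

LEFT_BRACE, RIGHT_BRACE, KEY, VALUE = 0, 1, 2, 3  # values from Table 2

STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')  # a double-quoted string with escapes


def scan_json(text):
    """Emit (content_type, offset, length) records for nested objects of strings."""
    records, pos = [], 0
    expect_key = False
    while pos < len(text):
        ch = text[pos]
        if ch == "{":
            records.append((LEFT_BRACE, pos, 1))
            expect_key = True          # the first string inside an object is a key
            pos += 1
        elif ch == "}":
            records.append((RIGHT_BRACE, pos, 1))
            pos += 1
        elif ch == '"':
            m = STRING.match(text, pos)
            if m is None:
                raise ValueError(f"unterminated string at offset {pos}")
            ctype = KEY if expect_key else VALUE
            records.append((ctype, m.start(1), len(m.group(1))))
            pos = m.end()
        elif ch == ":":
            expect_key = False         # what follows a colon is a value
            pos += 1
        elif ch == ",":
            expect_key = True          # what follows a comma is the next key
            pos += 1
        elif ch.isspace():
            pos += 1
        else:
            raise ValueError(f"unsupported JSON at offset {pos}")
    return records


text = '{"menu": {"id": "file"}}'
json_records = scan_json(text)
print([ctype for ctype, _, _ in json_records])
```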
  • While embodiments and applications of this invention have been shown and described, it would be apparent to those skilled in the art having the benefit of this disclosure that many more modifications than mentioned above are possible without departing from the inventive concepts herein. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

Claims (21)

1.-20. (canceled)
21. A method for efficiently processing a structured data file, the method comprising:
receiving the structured data file;
creating an intermediate file, wherein the intermediate file is a binary file having a plurality of cells organized into groupings, wherein each of the groupings of cells constitutes a record;
parsing the structured data file by:
creating a first record in an intermediate file for an element in the structured data file, wherein the first record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the element;
creating a second record in the intermediate file for an attribute name in the structured data file, wherein the second record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the attribute name; and
creating a third record in the intermediate file for an attribute value in the structured data file, wherein the third record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the attribute value; and
transmitting the intermediate file and the structured data file to a component so that the component accesses data from the structured data file using both the intermediate file and the structured data file together.
22. The method of claim 21, wherein the creating a first record comprises:
creating a binary mask format (BMF) record in the intermediate file, the BMF record corresponding to the element; and
storing an offset in the BMF record indicating a position for the starting tag relative to a point in the structured data file.
23. The method of claim 21, wherein the intermediate file is a BMF file.
24. The method of claim 22, wherein the creating a first record in an intermediate file further comprises:
storing a depth of the element in the BMF record, the depth indicating a level in a hierarchy of tags in the structured data file.
25. The method of claim 22, wherein the creating a first record in an intermediate file further comprises:
storing a content type of content in the BMF record, the content type indicating a type of information for the content.
26. The method of claim 25, wherein the creating a first record in an intermediate file further comprises:
storing a length for the content in the BMF record.
27. The method of claim 22, wherein the offset indicates a starting position for the starting tag relative to a beginning of the structured data file.
28. An apparatus for efficiently processing a structured data file, the structured data file including a starting tag, an attribute name, and content, the apparatus comprising:
a peripheral component interface (PCI) interface;
a direct memory access (DMA) engine coupled to the PCI interface;
a text processor coupled to the PCI interface, the text processor configured to:
receive the structured data file;
create an intermediate file, wherein the intermediate file is a binary file having a plurality of cells organized into groupings, wherein each of the groupings of cells constitutes a record;
parse the structured data file by:
creating a first record in an intermediate file for an element in the structured data file, wherein the first record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the element;
creating a second record in the intermediate file for an attribute name in the structured data file, wherein the second record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the attribute name; and
creating a third record in the intermediate file for an attribute value in the structured data file, wherein the third record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the attribute value; and
transmit the intermediate file and the structured data file to a component so that the component accesses data from the structured data file using both the intermediate file and the structured data file together;
configuration memory coupled to the text processor and to the PCI interface;
a memory controller coupled to the PCI interface;
BMF memory coupled to the DMA engine, the memory controller, and the text processor;
a document buffer coupled to the DMA engine, the memory controller, and the text processor; and
a string cache coupled to the DMA engine, the memory controller, and the text processor.
29. The apparatus of claim 28, wherein the configuration memory includes:
one or more configuration registers; and
configuration read-only-memory coupled to the one or more configuration registers.
30. The apparatus of claim 28, wherein the PCI interface is configured to handle all handshaking between the apparatus and a server PCI bus.
31. The apparatus of claim 28, wherein the memory controller is configured to receive commands from the PCI interface and convert a PCI address to on-board memory address space.
32. The apparatus of claim 31, wherein the memory controller is further configured to access board memory according to the PCI address.
33. The apparatus of claim 31, wherein the configuration register contains a configuration of a text processing pipeline and organization of the memory controller.
34. The apparatus of claim 31, wherein the memory buffer is configured to store an incoming document.
35. The apparatus of claim 31, wherein the memory buffer is Synchronous Dynamic Random Access memory (SDRAM).
36. The apparatus of claim 31, wherein the DMA engine is configured to grab frame data from server main memory and send back a BMF file.
37. The apparatus of claim 31, wherein the formatting by the text processor includes formatting the record in a way that allows data to be accessed using both the intermediate file and the structured data file without traversing the entire structured data file to determine the depth value.
38. An apparatus for efficiently processing a structured data file, the apparatus comprising:
means for receiving the structured data file;
means for creating an intermediate file, wherein the intermediate file is a binary file having a plurality of cells organized into groupings, wherein each of the groupings of cells constitutes a record;
means for parsing the structured data file by:
creating a first record in an intermediate file for an element in the structured data file, wherein the first record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the element;
creating a second record in the intermediate file for an attribute name in the structured data file, wherein the second record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the attribute name; and
creating a third record in the intermediate file for an attribute value in the structured data file, wherein the third record further contains one or more descriptors including an offset value identifying a location, within the structured data file, of the attribute value; and
means for transmitting the intermediate file and the structured data file to a component so that the component accesses data from the structured data file using both the intermediate file and the structured data file together.
39. The apparatus of claim 38, wherein the means for creating a first record comprises:
means for creating a binary mask format (BMF) record in the intermediate file, the BMF record corresponding to the element; and
means for storing an offset in the BMF record indicating a position for the starting tag relative to a point in the structured data file.
40. The apparatus of claim 38, wherein the intermediate file is a BMF file.
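The transmitting step shared by the independent claims has the receiving component read data through both files together. A rough sketch of such a component follows, as an informal illustration that is not part of the claims. It assumes a 10-byte record layout (type, depth, offset, length) that the claims do not prescribe, and the names `iter_records` and `resolve` are hypothetical.

```python
import struct
from typing import Iterator, List, Tuple

# Assumed layout, for illustration only: type (1 byte), depth (1 byte),
# offset (4 bytes), length (4 bytes), big-endian.
RECORD = struct.Struct(">BBII")


def iter_records(intermediate: bytes) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (type, depth, offset, length) for each fixed-size record."""
    for pos in range(0, len(intermediate), RECORD.size):
        yield RECORD.unpack_from(intermediate, pos)


def resolve(intermediate: bytes, document: bytes) -> List[Tuple[int, int, bytes]]:
    """Pair each record with the bytes it points to in the original document.

    The text is read from the structured data file by slicing at the
    recorded offset; the intermediate file itself holds no strings.
    """
    return [
        (kind, depth, document[offset:offset + length])
        for kind, depth, offset, length in iter_records(intermediate)
    ]


# Usage with hand-built records for a tiny document.
document = b'<a x="1"><b/></a>'
intermediate = b"".join([
    RECORD.pack(0, 0, 1, 1),   # element "a"
    RECORD.pack(1, 0, 3, 1),   # attribute name "x"
    RECORD.pack(2, 0, 6, 1),   # attribute value "1"
    RECORD.pack(0, 1, 10, 1),  # element "b", one level deeper
])
for kind, depth, text in resolve(intermediate, document):
    print(kind, depth, text.decode())
```

Since the depth is stored in every record, the component can tell how deeply an item is nested without re-scanning the structured data file from its beginning.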
US13/894,118 2002-10-15 2013-05-14 Processing structured data Abandoned US20130254219A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US13/894,118 US20130254219A1 (en) 2002-10-15 2013-05-14 Processing structured data
US15/393,481 US9990364B2 (en) 2002-10-15 2016-12-29 Processing structured data
US15/970,775 US10698861B2 (en) 2002-10-15 2018-05-03 Processing structured data
US16/821,307 US20200285607A1 (en) 2002-10-15 2020-03-17 Processing structured data

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US10/272,077 US7133857B1 (en) 2002-10-15 2002-10-15 Processing structured data
US11/581,211 US7761459B1 (en) 2002-10-15 2006-10-13 Processing structured data
US77711007A 2007-07-12 2007-07-12
US201113099237A 2011-05-02 2011-05-02
US13/894,118 US20130254219A1 (en) 2002-10-15 2013-05-14 Processing structured data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US201113099237A Continuation 2002-10-15 2011-05-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/393,481 Continuation US9990364B2 (en) 2002-10-15 2016-12-29 Processing structured data

Publications (1)

Publication Number Publication Date
US20130254219A1 (en) 2013-09-26

Family

ID=42332718

Family Applications (5)

Application Number Title Priority Date Filing Date
US11/581,211 Expired - Lifetime US7761459B1 (en) 2002-10-15 2006-10-13 Processing structured data
US13/894,118 Abandoned US20130254219A1 (en) 2002-10-15 2013-05-14 Processing structured data
US15/393,481 Expired - Lifetime US9990364B2 (en) 2002-10-15 2016-12-29 Processing structured data
US15/970,775 Expired - Lifetime US10698861B2 (en) 2002-10-15 2018-05-03 Processing structured data
US16/821,307 Abandoned US20200285607A1 (en) 2002-10-15 2020-03-17 Processing structured data

Country Status (1)

Country Link
US (5) US7761459B1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10572824B2 (en) 2003-05-23 2020-02-25 Ip Reservoir, Llc System and method for low latency multi-functional pipeline with correlation logic and selectively activated/deactivated pipelined data processing engines
JP2006526227A (en) 2003-05-23 2006-11-16 ワシントン ユニヴァーシティー Intelligent data storage and processing using FPGA devices
US7921046B2 (en) 2006-06-19 2011-04-05 Exegy Incorporated High speed processing of financial information using FPGA devices
US7840482B2 (en) 2006-06-19 2010-11-23 Exegy Incorporated Method and system for high speed options pricing
US7660793B2 (en) 2006-11-13 2010-02-09 Exegy Incorporated Method and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US8326819B2 (en) 2006-11-13 2012-12-04 Exegy Incorporated Method and system for high performance data metatagging and data indexing using coprocessors
US8260790B2 (en) * 2007-04-27 2012-09-04 Hewlett-Packard Development Company, L.P. System and method for using indexes to parse static XML documents
US10229453B2 (en) 2008-01-11 2019-03-12 Ip Reservoir, Llc Method and system for low latency basket calculation
US20120095893A1 (en) 2008-12-15 2012-04-19 Exegy Incorporated Method and apparatus for high-speed processing of financial market depth data
US10037568B2 (en) 2010-12-09 2018-07-31 Ip Reservoir, Llc Method and apparatus for managing orders in financial markets
CN102650992B (en) * 2011-02-25 2014-07-30 国际商业机器公司 Method and device for generating binary XML (extensible markup language) data and locating nodes of the binary XML data
US9990393B2 (en) 2012-03-27 2018-06-05 Ip Reservoir, Llc Intelligent feed switch
US11436672B2 (en) 2012-03-27 2022-09-06 Exegy Incorporated Intelligent switch for processing financial market data
US10121196B2 (en) 2012-03-27 2018-11-06 Ip Reservoir, Llc Offload processing of data packets containing financial market data
US10650452B2 (en) 2012-03-27 2020-05-12 Ip Reservoir, Llc Offload processing of data packets
US9760549B2 (en) 2012-07-18 2017-09-12 Software Ag Usa, Inc. Systems and/or methods for performing atomic updates on large XML information sets
US10515141B2 (en) 2012-07-18 2019-12-24 Software Ag Usa, Inc. Systems and/or methods for delayed encoding of XML information sets
US9922089B2 (en) 2012-07-18 2018-03-20 Software Ag Usa, Inc. Systems and/or methods for caching XML information sets with delayed node instantiation
US10803413B1 (en) * 2016-06-23 2020-10-13 Amazon Technologies, Inc. Workflow service with translator
EP3560135A4 (en) 2016-12-22 2020-08-05 IP Reservoir, LLC Pipelines for hardware-accelerated machine learning
CN107092656B (en) * 2017-03-23 2019-12-03 中国科学院计算技术研究所 A kind of tree data processing method and system
CN107977440B (en) * 2017-12-07 2020-11-27 网宿科技股份有限公司 Method, device and system for analyzing data file

Citations (17)

Publication number Priority date Publication date Assignee Title
US5748953A (en) * 1989-06-14 1998-05-05 Hitachi, Ltd. Document search method wherein stored documents and search queries comprise segmented text data of spaced, nonconsecutive text elements and words segmented by predetermined symbols
US5892924A (en) * 1996-01-31 1999-04-06 Ipsilon Networks, Inc. Method and apparatus for dynamically shifting between routing and switching packets in a transmission network
US6209124B1 (en) * 1999-08-30 2001-03-27 Touchnet Information Systems, Inc. Method of markup language accessing of host systems and data using a constructed intermediary
US6336124B1 (en) * 1998-10-01 2002-01-01 Bcl Computers, Inc. Conversion data representing a document to other formats for manipulation and display
US20020029229A1 (en) * 2000-06-30 2002-03-07 Jakopac David E. Systems and methods for data compression
US20020038363A1 (en) * 2000-09-28 2002-03-28 Maclean John M. Transaction management system
US20020038319A1 (en) * 2000-09-28 2002-03-28 Hironori Yahagi Apparatus converting a structured document having a hierarchy
US20020087596A1 (en) * 2000-12-29 2002-07-04 Steve Lewontin Compact tree representation of markup languages
US6418446B1 (en) * 1999-03-01 2002-07-09 International Business Machines Corporation Method for grouping of dynamic schema data using XML
US20020103970A1 (en) * 2000-08-15 2002-08-01 Gut Ron Abraham Cache system and method for generating uncached objects from cached and stored object components
US20020143521A1 (en) * 2000-12-15 2002-10-03 Call Charles G. Methods and apparatus for storing and manipulating variable length and fixed length data elements as a sequence of fixed length integers
US20030041304A1 (en) * 2001-08-24 2003-02-27 Fuji Xerox Co., Ltd. Structured document management system and structured document management method
US6567612B2 (en) * 1996-04-05 2003-05-20 Pioneer Electronic Corporation Information record medium, apparatus for recording the same and apparatus for reproducing the same
US6947932B2 (en) * 2001-01-23 2005-09-20 Xpriori, Llc Method of performing a search of a numerical document object model
US7127467B2 (en) * 2002-05-10 2006-10-24 Oracle International Corporation Managing expressions in a database system
US7133857B1 (en) * 2002-10-15 2006-11-07 Ximpleware, Inc. Processing structured data
US7321900B1 (en) * 2001-06-15 2008-01-22 Oracle International Corporation Reducing memory requirements needed to represent XML entities

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US6886130B1 (en) * 1997-11-26 2005-04-26 International Business Machines Corporation Compiled structure for efficient operation of distributed hypertext
US7761459B1 (en) 2002-10-15 2010-07-20 Ximpleware, Inc. Processing structured data
US8316034B2 (en) * 2010-10-20 2012-11-20 International Business Machines Corporation Analyzing binary data streams to identify embedded record structures

Non-Patent Citations (10)

Title
"Algorithms and Programming Models for Efficient Representation of XML for Internet Applications", by Sundaresan et al., WWW10, May 1-5, 2001, Hong Kong. *
"Binary XML Content Format Specification", Version 1.3, 25 July 2001, Wireless Application Protocol WAP-192-WBXML-20010725-a. *
"BitCube: A Three-Dimensional Bitmap Indexing for XML Documents", by Yoon et al., Journal of Intelligent Information Systems, 17:2/3,241-254, 2001. *
"Efficient Storage of XML Data" by Kanne et al., June 16, 1999 (Abstract). *
"Generic Programming for XML Tools", by Jeuring et al., Institute of Information and Computer Sciences, Utrecht University, The Netherlands, May 27, 2002. *
"Millau: An Encoding Format for Efficient Representation and Exchange of XML over the Web", by Girardot et al., Computer Networks: The International Journal of Computer and Telecommunications Networking, Volume 33, Issue 1-6 (June 2000), Pages 747-765. *
"Schema Extraction for Multimedia XML Document Retrieval", by Yoon et al., Proceedings of the First International Conference on Web Information Systems Engineering (WISE'00)- Volume 2, Year of Publication: 2000, ISBN: 0-7695-0577-5-2. *
H. Blume et al., "Integration of High-Performance ASICs into Reconfigurable Systems Providing Additional Multimedia Functionality", 2002, IEEE, 10 pgs. *
Holger Blume, Hans-Martin, Christiane Henning, and Patrick Osterloh. 2000. Integration of High-Performance ASICs into Reconfigurable Systems Providing Additional Multimedia Functionality. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP '00). *
M. Morris Mano, "Computer System Architecture", 1993. *

Cited By (6)

Publication number Priority date Publication date Assignee Title
US9990364B2 (en) 2002-10-15 2018-06-05 Ximpleware, Inc. Processing structured data
US10698861B2 (en) 2002-10-15 2020-06-30 Ximpleware, Inc. Processing structured data
CN104111987A (en) * 2014-07-01 2014-10-22 西安交通大学 Tax intermediate index extraction method based on subtree pattern mining
US20190155896A1 (en) * 2015-08-31 2019-05-23 Ayla Networks, Inc. Compact schedules for resource-constrained devices
US10949255B2 (en) * 2015-08-31 2021-03-16 Ayla Networks, Inc. Compact schedules for resource-constrained devices
CN108874985A (en) * 2018-06-12 2018-11-23 东方电子股份有限公司 The distributed parsing configuration method of intelligent substation SCD file

Also Published As

Publication number Publication date
US9990364B2 (en) 2018-06-05
US7761459B1 (en) 2010-07-20
US20180253436A1 (en) 2018-09-06
US20200285607A1 (en) 2020-09-10
US10698861B2 (en) 2020-06-30
US20170109359A1 (en) 2017-04-20

Similar Documents

Publication Publication Date Title
US10698861B2 (en) Processing structured data
US7620652B2 (en) Processing structured data
US7356764B2 (en) System and method for efficient processing of XML documents represented as an event stream
US7802180B2 (en) Techniques for serialization of instances of the XQuery data model
US6981212B1 (en) Extensible markup language (XML) server pages having custom document object model (DOM) tags
Wood et al. Document object model (dom) level 1 specification
US8739022B2 (en) Parallel approach to XML parsing
US7665015B2 (en) Hardware unit for parsing an XML document
US7581172B2 (en) Method and apparatus for efficient management of XML documents
Lam et al. XML document parsing: Operational and performance characteristics
US20150026565A1 (en) Xml streaming transformer (xst)
US20060167869A1 (en) Multi-path simultaneous Xpath evaluation over data streams
US20030172348A1 (en) Streaming parser API
Dai et al. A 1 cycle-per-byte XML parsing accelerator
US6981211B1 (en) Method for processing a document object model (DOM) tree using a tagbean
US7266766B1 (en) Method for developing a custom tagbean
US20060167907A1 (en) System and method for processing XML documents
WO2002103554A1 (en) Data processing method, data processing program, and data processing apparatus
US6772395B1 (en) Self-modifying data flow execution architecture
Esposito Applied XML programming for Microsoft. NET
JP2007501464A (en) Method and system for probability-based verification of XML documents
Li XML Parsing, SAX/DOM.
JP2006505044A (en) Validation parser accelerated by hardware
Zhang Embedding parallel bit stream technology into Expat
Ozden A Binary Encoding for Efficient XML Processing

Legal Events

Date Code Title Description
AS Assignment

Owner name: XIMPLEWARE, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ZHANG, ZHENGYU;REEL/FRAME:031970/0430

Effective date: 20021014

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION