US20030126144A1 - Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects - Google Patents

Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects Download PDF

Info

Publication number
US20030126144A1
US20030126144A1 US10/217,492 US21749202A US2003126144A1 US 20030126144 A1 US20030126144 A1 US 20030126144A1 US 21749202 A US21749202 A US 21749202A US 2003126144 A1 US2003126144 A1 US 2003126144A1
Authority
US
United States
Prior art keywords
blocking
dataset
variables
variable
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/217,492
Inventor
Sharyn O'Halloran
David Epstein
Martin Betz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US10/217,492 priority Critical patent/US20030126144A1/en
Publication of US20030126144A1 publication Critical patent/US20030126144A1/en
Priority to US11/056,636 priority patent/US20050154727A1/en
Priority to US12/169,477 priority patent/US8171050B2/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2428Query predicate definition using graphical user interfaces, including menus and forms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing

Definitions

  • This invention relates to computer information systems, and more particularly to the storage and manipulation of metadata for data sources.
  • An organization's mission-critical information can be housed in a central server, updated on a continual basis via an online transaction processing (OLTP) system or an enterprise resource planning (ERP) system, using relational databases like Oracle, Sybase, or Microsoft Access.
  • OTP online transaction processing
  • ERP enterprise resource planning
  • Other information can be housed remotely on servers and downloaded via specialized software.
  • Still other information can be stored on Compact Discs and updated periodically with new releases.
  • the invention features a computer-based method of representing a subset of a dataset table of rows and columns.
  • the method includes selecting a set of blocking variables corresponding to blocking columns of the dataset table. For each row in the dataset table, a tuple of values for the blocking columns uniquely identifies the row within the dataset table.
  • the method also includes selecting a set of non-blocking variables that correspond to columns of the subset. The set of non-blocking variables does not intersect the set of blocking variables.
  • the method also includes creating a block information structure that includes both the set of non-blocking variables and, for each blocking variable in the set of blocking variables, a set of values.
  • Preferred embodiments include the following. For each row in the subset that has a tuple of values for the blocking columns, the values of the tuple are included in the corresponding sets of blocking-variable values.
  • the subset of the dataset table includes the entire dataset table.
  • the number of elements in the set of non-blocking columns, plus the number of elements in the sets of values for the blocking columns, is linearly proportionate to an upper bound on the binary storage requirements of the block information structure, particularly when such sets are arbitrarily large.
  • the block information structure is stored on a machine-readable medium.
  • this aspect of the invention provides a method for combining, selecting, and delivering data from heterogeneous databases.
  • the method allows for the creation and manipulation of metadata entities called blocksets.
  • Blocksets contain summary information about sets of data, and can therefore be manipulated quickly, flexibly, and efficiently in place of the datasets themselves.
  • Blocksets contain metadata about the datasets, allowing a user to choose combinations of the datasets for viewing without having to access the datasets directly.
  • the invention features a computer-based method of accessing information in heterogeneous databases.
  • the method includes presenting a graphical user interface, with controls representing a data cart and a plurality of datasets.
  • the method also includes receiving user input that selects a dataset to add to the data cart.
  • the method also includes generating a block information structure that specifies the dataset, and adding the block information structure to the data cart.
  • Preferred embodiments include the following.
  • the method incorporates into the block information structure a set of non-blocking variables, a set of blocking variables, and for each such blocking variable, a set of values.
  • the dataset includes a plurality of rows, each identified by a corresponding tuple of values from the sets of values for the blocking variables.
  • the method further includes presenting enumeration controls in the graphical user interface. Each enumeration control corresponds to an existing value in the dataset for the blocking variable.
  • the method further includes collecting user input that specifies a subset of the dataset, and includes representing the subset in corresponding block information structure.
  • the method further includes saving the data cart to a persistent storage medium.
  • the method further includes adding a second block information structure to the data cart, in response to user input.
  • the controls representing a data cart include a symbol of a shopping cart.
  • the graphical user interface allows a user to construct blocksets known as data carts.
  • the method further includes collaboration features such as the ability to save and comment on blocksets, and the option of peer-to-peer systems that accommodate geographically dispersed data sources.
  • the invention features a computer-based method of retrieving information represented by a blockset.
  • the method includes connecting to databases, wherein each database corresponds to a block in the blockset. Each such block specifies a subset of a dataset stored in the corresponding database.
  • the blockset has a plurality of blocking variables.
  • the blocks each include a set of non-blocking variables, and have a set of values for each blocking variable in the plurality of blocking variables.
  • the method includes adding a blocking column to a derived table, once for each blocking variable in the set of blocking variables.
  • the method also includes adding to the derived table a non-blocking column for each element in a union of the non-blocking variables in the plurality of blocks.
  • the method includes adding a row to the derived table.
  • the row includes a cell for each column in the derived table.
  • the row is uniquely identified by a tuple of values from the sets of values for the blocking variables.
  • the method also includes populating a cell of a non-blocking column in the row, using a value retrieved from the database corresponding to the block.
  • the block contains the non-blocking variable corresponding to the non-blocking column.
  • Preferred embodiments include the following.
  • the method further includes adding a row for each tuple of values from the sets of values for the blocking variables, provided the tuple occurs in at least one dataset corresponding to a block in the plurality of blocks.
  • the method includes, when connecting, using each block as a basis for a database query that specifies the corresponding subset.
  • the database query uses Structured Query Language.
  • the invention features a computer-based method of representing a table derived from a blockset, including outputting blockset metadata that describes the blockset.
  • the blockset metadata includes fields for a blockset title and a blockset description.
  • the method also includes outputting column metadata for a column in the table, such that the column metadata describes a variable associated with the column.
  • the variable is associated with an underlying dataset that provides data to the table in the blockset.
  • the column metadata includes fields for a title of the variable and for a title of the underlying dataset.
  • the invention features a computer-based method of collecting metadata for a dataset, including prompting a user to provide a database name.
  • the method also includes confirming that the database name represents a database, displaying a list of tables in the database to the user, receiving user input specifying a table in the list of tables, and prompting the user to confirm that a list of blocking variables and a list of non-blocking variables are correct for the database.
  • the method includes prompting the user to confirm metadata for the dataset and for the list of non-blocking variables. If the user confirms the list of blocking variables, the list of non-blocking variables, and the metadata, the method also includes adding a dataset corresponding to the table to a collection of datasets.
  • Preferred embodiments include the following.
  • the metadata includes a title for the dataset.
  • the metadata includes a description for the dataset.
  • the metadata includes a title for a non-blocking variable in the list of non-blocking variables.
  • the metadata includes a description for a non-blocking variable in the list of non-blocking variables.
  • FIG. 1A is a schematic diagram of an information processing system.
  • FIG. 1B is a schematic diagram of physical components of an information processing system.
  • FIG. 1C is a diagram of a database schema for a data group example.
  • FIG. 2 is a diagram is a schematic diagram of relationships between datasets, blocks, blocksets, and blockset derived tables.
  • FIG. 3A is a diagram of block creation from datasets.
  • FIG. 3B is a formula expressing a block.
  • FIG. 4 is a schematic diagram of relationships between blocks, blocksets, and blockset derived tables, including n-tuples.
  • FIG. 5A is a flowchart of a derivation process.
  • FIG. 5B is a formula for a database query to retrieve a dataset table underlying a block.
  • FIG. 5C is a codebook table.
  • FIGS. 7A through 7E are snapshots of a user interface for a browsing environment.
  • FIG. 9 is a snapshot of a user interface for a sample codebook.
  • FIG. 10 is a snapshot of a user interface for a data cart information entry page.
  • FIGS. 12A through 12F are snapshots of a user interface for an administrator page.
  • FIG. 12G is a flowchart for an upload process.
  • Server 12 includes software components such as a web server engine 12 a , databases 18 , and an application database 20 .
  • Web server engine 12 a serves out web pages 12 b over the network.
  • Web server 12 a connects to database tables 18 a of databases 18 via a database connection protocol, such as ODBC (Open DataBase Connectivity) 12 c .
  • Web server 12 a also connects to application database 20 , which includes metadata tables and user information tables.
  • web server engine 12 a encodes procedures for creating and manipulating metadata objects called blocksets and related entities, as will be explained in more detail with regard to FIG. 1C.
  • a user 22 interacts with information processing system 10 via client machine 14 .
  • client machine 14 includes a processor 14 a that operates on data stored in memory 14 b and executes machine-readable instructions stored in storage 14 e and in memory 14 b .
  • Client machine 14 renders visual information on a display device 14 c and receives input from a user 22 via one or more input devices 14 d , such as a mouse and keyboard.
  • Storage 14 e includes an operating system that uses a point-and-click GUI.
  • Client machine 14 also runs a web browser application, not shown, that logically connects to web server engine 12 a (FIG. 1A).
  • Network interface 14 f connects standalone device 14 to network 16 .
  • Bus 14 g carries information between components of standalone device 14 .
  • processor 12 h executes machine-readable instructions that encode web server engine 12 a , databases 18 , and application database 20 .
  • logical entities used in system 10 include a dataset 24 , a block 26 , a blockset 28 , and a blockset derived table 30 .
  • a dataset 24 represents a table of data.
  • a block 26 summarizes a portion of a dataset (including perhaps the entire dataset); the summarized dataset is said to “underlie” the block.
  • a blockset 28 collects one or more blocks.
  • a blockset derived table 30 is a table of data for a blockset, collecting, for the blocks in the blockset, the corresponding portions of the underlying datasets.
  • dataset 24 is an information structure that encodes a series of observations on a given set of variables.
  • the observations are stored in data sources 18 a (FIG. 1).
  • This description refers to elements of dataset 24 in the standard manner, with the data arranged in a table whose columns 24 a represent variables and whose rows 24 b represent observations.
  • a variable i that includes an M by 1 column vector Xi, where “M” is the number of rows
  • an M by N dataset is a matrix of the form [X 1 X 2 . . . X n ].
  • This matrix contains blocking variables 24 c which, collectively, uniquely identify a given observation.
  • Other variables in the dataset are non-blocking variables 24 d . For instance, a dataset on cars might uniquely identify each car by its make, model and year. Make, model and year would be encoded as blocking variables of the corresponding dataset.
  • Non-blocking variables might include the car's size, price and gas mileage.
  • a financial dataset might identify stocks by the ticker symbol and the day from which the closing price was taken.
  • the non-blocking variables could include the stock price, volume traded, and any splits that might have occurred on that day.
  • Each dataset contains at least one non-blocking variable.
  • the necessary metadata elements are the database name, the blocking variable list, and the non-blocking variable list.
  • the other metadata are optional, and may be used to describe the data set in the user interface, described with regard to FIGS. 6 through 12F, below.
  • Other corresponding tables are used in conjunction with the dataset metadata, such as a Variables table with variable descriptions, a Categories table with textual descriptions of the category codes, or Authors and Citations tables with more extensive bibliographic information.
  • a data group 34 is a collection of datasets 24 having identical blocking variables.
  • datasets within a data group are logically connected to each other by referencing the same basic type of data.
  • WorldInfo example of Table 1
  • all datasets have blocking variables for Country and Year.
  • they form a data group describing countries' economic, political, demographic, social, and geographic conditions on a year-to-year basis.
  • a single data set could logically belong to more than one group; for example, a data set containing daily stock prices could also belong to a group of yearly corporate data, simply by averaging daily information over each one-year period.
  • various data groups can be combined together, for instance via the use of composite primary keys and intermediary translation tables.
  • an example application database 20 encodes a single data group 34 conforming to the “WorldInfo” example. That is, the blocking variables for each dataset are Country and Year.
  • Application database 20 includes a dataset table 36 , a variable table 38 , an author table 40 , an author-dataset table 40 p , a category table 42 , a category-dataset table 42 f , a user table 44 , a block table 46 , and a blockset table 48 .
  • Dataset table 36 stores information for a dataset 24 .
  • Each row of dataset table 36 includes a dataset key 36 a , which uniquely identifies rows within dataset table 36 .
  • Such a row also includes metadata fields for presentation to a user, such as a title 36 b , which can be used as a caption in a user interface; description 36 f , which can store phrases describing dataset 24 in detail; and codebook name 36 g .
  • Each row also includes metadata for locating the dataset 24 in a data source 18 a , such as a table name 36 c and a database name 36 d .
  • Fields for variable list 36 k , country list 36 m , and year list 36 n store comma-delimited lists of values from the corresponding data source 18 a . These lists cache data that specifies a block, reducing the need to connect directly to data source 18 a .
  • country list 36 stores the distinct values in the underlying data for the blocking variable Country in the WorldInfo example.
  • year list 36 n stores distinct values for Year.
  • Other bookkeeping data in dataset table 36 includes a first version 36 e and a creation date 36 p.
  • Variable table 38 stores information about non-blocking variables associated with a dataset 24 .
  • Each row of variable table 38 describes a variable.
  • One row of dataset table 36 can be associated with many rows of variable table 38 via the foreign key field variable dataset ID 38 b .
  • Each row includes a title 38 c for the variable, a units notation 38 f of the values the variable measures, a list of keywords 36 d associated with variable for searching, and a description 36 e , which can store phrases describing the variable in detail.
  • Name 38 g indicates the corresponding column name for the variable in the underlying table.
  • Rows also include variable key 38 a , which uniquely identifies the rows within variable table 38 .
  • Author table 40 and category table 42 provide supplementary metadata for a dataset 24 , allowing many-to-many relationships between dataset table 36 and author table 40 , and between dataset table 36 and category table 42 .
  • a given author can be associated with one or more datasets, and vice versa.
  • categories can be used to describe one or more datasets, and a given dataset can belong to one or more categories.
  • Each row of author table 40 represents an author that can be associated with data in a dataset 24 .
  • Each such row includes fields for a first name 40 b , a last name 40 c , an email address 40 d , a website address 40 e , and an affiliation 40 f .
  • Rows also include an author key 38 a , which uniquely identifies the rows within author table 40 .
  • Author-dataset table 40 p implements the many-to-many relationship between dataset table 36 and author table 40 .
  • Link key 40 q uniquely identifies the rows within author-dataset table 40 p .
  • Author FK 40 r is a foreign key referencing author key 40 a .
  • Dataset FK 40 s is a foreign key referencing dataset key 36 a .
  • Each row of category table 42 represents a category that can be associated with data in a dataset 24 .
  • Rows include a category key 42 a , which uniquely identifies the rows within author table 40 .
  • Each such row includes fields for a category name 42 b and a description 42 d , which can store phrases describing the category in detail.
  • a field for parent category 42 c is a reflexive foreign key, i.e., one that references category key 42 a of its own table, allowing category table 42 to store nested hierarchies of categories.
  • Category-dataset table 42 f implements the many-to-many relationship between dataset table 36 and category table 42 .
  • Link key 42 g uniquely identifies rows within author-dataset table 42 f .
  • Category FK 42 h is a foreign key referencing category key 42 a .
  • Dataset FK 42 k is a foreign key referencing dataset key 36 a.
  • Each row of user table 44 stores persistent information about a system account for a human user 22 , including a user name 44 b , a login name 44 c , a password 44 d , and admin level 44 m , which specifies a level of system privileges granted to the user.
  • Each row also stores metadata about the user, such as title 44 e , an email address 44 f , a website 44 g , and a description 44 h .
  • a field for affiliation 44 k indicates an organization or domain for the user.
  • Latest dataset 44 n indicates the dataset 24 last used by the user, allowing the user to return to this dataset in a subsequent session.
  • Each row of block table 46 represents a block 26 (FIG. 2).
  • One row of dataset table 36 can be associated with many rows of block table 46 via the foreign key field dataset FK 46 b .
  • one row of block table 46 references one row of dataset table 36 . That is, each block 26 has one dataset 24 , but one dataset 24 can have many blocks 26 .
  • One row of blockset table 48 can be associated with many rows of block table 46 via the foreign key field blockset FK 46 c .
  • Each block 26 has one blockset 28 , but one blockset 28 can have many blocks 26 .
  • Each row of block table 46 in the described example has a list 46 d of non-blocking variables associated with the block 26 .
  • Each row also has a set of values for the blocking variables Country and Year in the fields countries 46 e and years 46 f , respectively.
  • Rows also include a block key 46 a , which uniquely identifies the rows within block table 46 .
  • Each row of blockset table 48 represents a blockset 28 .
  • Each row has metadata describing the blockset 28 for presentation to a user, including fields for title 48 e , description 48 c , authors 48 m , citations 48 n , and categories 48 p .
  • Each row also includes bookkeeping fields such as date 48 d , version 48 g , published 48 h , submitted 48 k , and deleted 48 q .
  • a foreign key field user FK 48 b references a row in user table 44 , indicating the user 22 who owns the blockset 28 .
  • a field for public 48 f indicates whether the blockset 28 should be shared with other users.
  • Rows also include a blockset key 48 a , which uniquely identifies the rows within blockset table 48 .
  • a country table maps three-letter abbreviations to country names. For example, an entry in the table contains an abbreviation value of “ITA” and a country name value of “Italy”.
  • blocks refer to sets of data via metadata information.
  • a block summarizes a set of data by including blocking-variable values that specify the set of data, together with metadata about additional, non-blocking-variable columns to associate with the block.
  • a block 26 describes a subset of data from one dataset 24 .
  • the subset is defined by specifying a set of non-blocking variables 24 d from the dataset and a set of values from each of the blocking variables 24 c .
  • a block B from a dataset having m blocking variables and n non-blocking variables is represented by a (m+1)-tuple, shown in FIG. 3B.
  • the V i are non-blocking variables and the X 1 are legal subsets of values from the m blocking variables.
  • a corresponding block definition includes: [ ⁇ Non-blocking vars ⁇ , Country set, Year set].
  • VariableSet ⁇ 23, 45, 73, 215 ⁇ .
  • CountrySet ⁇ GER, USA, SWE, AUS, CAN, JAP ⁇ ;
  • YearSet ⁇ 1980, 1985, 1990, 1995, 2000 ⁇ ;
  • Table 2 gives a sample block definition for the WorldInfo example.
  • the specified values for the Country blocking variable are codes for Germany, USA, Sweden, Austria, Canada, and Japan.
  • the specified values for the Year blocking variable are 1980, 1985, 1990, 1995, 2000.
  • Non-blocking variable are referred to by their key values 38 a (FIG. 1C).
  • block 26 can represent an entire dataset 24 .
  • dataset 24 contains blocking variables 24 c labeled BV1 and BV2 and non-blocking variables 24 d labeled NBV1 and NBV2.
  • the block 26 will therefore include the values 26 a that BV1 and BV2 take on within the desired subset. For instance, a year set might consist of the set ⁇ 1991, 1992, 1993, 1994, 1995, 1996 ⁇ .
  • the block 26 also includes the names or labels of the non-blocking variables NBV1 and NBV2, which are stored in a list in field 26 b . Notice that the names themselves or pointers to name identification fields are sufficient; the actual data within these fields need not be stored in the block itself.
  • Blocks require relatively little storage memory to reference large collections of data. For instance, a block containing i variables for j countries and k years can reference as many as i*j*k rows from the corresponding dataset, but the block can describe these rows using only i+j+k elements—a significant improvement for large values of i, j, or k.
  • the number of elements in the set of non-blocking variables and in the sets of values for the blocking columns provides an approximate upper bound on binary storage requirements of the block, subject to a linear scaling factor. As the sets grow arbitrarily large, the binary storage requirements of the block are on the order of the sum of the sizes of these sets. In other words, the binary storage requirements are linearly proportionate to the cardinality of these sets.
  • a blockset 28 is an unordered set of one or more blocks 26 from a single data group. For every block, there is a corresponding simple blockset with only one block.
  • a user 22 interacts with system 10 in the context of one or more sessions 52 on client machine 14 . Operations in one session 52 are independent of other sessions 52 .
  • a data cart 50 is a blockset 28 corresponding to the set of blocks currently selected in a session. Each session has either zero or one data cart 50 . When initialized, a data cart is empty.
  • blocksets include collections of blocks, in general, they place minimal demands on memory.
  • a blockset 28 is encoded in software as a blockset object 54 .
  • blockset objects contain both properties 54 a and methods 54 b ; that is, data and operations.
  • Properties 54 a include the set 54 c of blocks associated with the blockset 28 , such as the vector myBlocks in the example of Table 3.
  • Blockset objects 54 can also possess methods 54 b which contain instructions on how to perform certain operations on blockset objects 54 .
  • a blockset object 54 can display summary information about itself by counting the number of variable entries in each of its constituent blocks. It can also associate a new block to itself by adding a reference to that block to its collection 54 c .
  • a blockset object 54 can merge itself with another blockset object 54 via a set union operation.
  • System 10 can download blockset data to a client machine via a number of standard output formats.
  • a derived table 30 relates to a blockset 28 in the following way.
  • Table 30 has blocking variable columns 30 a , one for each blocking variable in the data group.
  • the other columns form a set 30 b , which is the union of the non-blocking variables from the constituent blocks 26 of blockset 28 .
  • the derived table contains only one column for each overlapping variable.
  • Table 30 has rows 30 c for each of the blocking variable values specified in some block from the blockset. Specifically, if there are N blocking variables, the new table will have a row for every N-tuple 30 d of blocking variable values, [valBV 1 , valBV 2 , . . . , val N ], where val X is one of the values specified for variable X in one of the blockset's blocks.
  • Table 30 has two types of columns: blocking variable columns 30 a , and data columns 30 b .
  • Blocking variable columns have cells which contain the appropriate blocking variable values. That is, for a row associated to [val 1 , val 2 , . . . , val N ], the value for blocking variable i will be val 1 .
  • the data columns have cells that can be referenced by a row identifier, [val 1 , val 2 , . . . , val N ], and a non-blocking variable from one of the blocks in the defining blockset, e.g. NBvar from block K.
  • val 1 , val 2 , . . . , val N is an allowed blocking variable combination
  • NBvar is an included variable
  • this cell is filled in with the value from that block's dataset. Note that this value is the same if there are 2 such blocks in the blockset. If, for all such blocks, that cell does not exist, then this cell is empty in the new table. If ⁇ [val 1 , val 2 , . . . , val N ], Nbvar ⁇ is not an allowed combination for any block included in the dataset, then the cell is empty in the new table. Hence this algorithm produces an outer join of the multiple dataset tables referenced in the blockset's constituent blocks.
  • a derivation process 56 generates a blockset derived table 30 from a given blockset 28 .
  • This discussion assumes the dataset tables are stored in an SQL database, or some other container such that the data can be accessed via an SQL-like query language.
  • Derivation process 56 sequentially returns column values and table rows of a blockset derived table 30 . Each returned row is represented as a list of values. The first row is header row, listing column names.
  • derivation process 56 defines an order for the blocking variable set 30 a (step 56 a ), either arbitrarily or by user response. Derivation process 56 also defines an order for the blocks in the blockset.
  • Derivation process 56 generates a hashtable of query result sets (step 56 b ).
  • Most standard query engines allow a client to define a result set and retrieve the data row by row. This allows the client to make only one query per block, and then use the natural ordering of the returned data to progressively fill in the newly generated table.
  • the result sets are indexed by a block id, where the id reflects the block ordering defined above.
  • Derivation process 56 therefore generates a hashtable which maps integers to data result sets from which data can be incrementally retrieved.
  • Derivation process 56 then connects to each of the data sources 18 a (FIG. 1A) that provide the underlying datasets for the blocks in the blockset (step 56 c ). Derivation process 56 returns the first row (step 56 d ), generating the row incrementally, starting with the blocking variable names in their specified order. Derivation process 56 completes the row by going through the ordered blocks, appending each block's variable list.
  • Derivation process 56 traverses the result sets (loop bounded by 56 e and 56 n ) in the same order as the rows that will be returned to the user. This way, no further requests need be made to the query engine.
  • This ordering is created in the following way. For each dataset from which a block has been defined, generate a variable list V (FIG. 5B) which includes both blocking and non-blocking variables. Also generate the corresponding database table name D. Label the blocking variables BV1, BV2, . . . , and denote the union of values that BV1 takes on in all blocks with the set ⁇ A, B, C, . . .
  • derivation process 56 submits a query of the type shown in FIG. 5B, during connection to the underlying datasets (step 56 c ).
  • Derivation process 56 returns data for the blockset derived table 30 one row at a time, creating one row for each distinct n-tuple (step 56 f ). Derivation process 56 proceeds from one n-tuple to the next in the same order that the SQL engine uses when it returns data via “ORDER BY BV1, BV2, . . . .” In particular, derivation process 56 progresses through the blocking variable values alphabetically, as if the n-tuple was concatenated into one long string. For example, this can be done by first looping through the values of the nth-blocking variable, returning it to its starting point, incrementing the n-1st variable value, and so on. In this way, every row of the resulting new data table is returned.
  • derivation process 56 For each combination of allowed blocking variable values, i.e., for each n-tuple, derivation process 56 goes through the blocks in the block set (loop bounded by 56 g and 56 k ). For each block, derivation process 56 retrieves a row of data from the corresponding result set, comparing the blocking variable values of that row to the current n-tuple. If it is a match, derivation process 56 adds the retrieved non-blocking variable values to the data row being created for table 30 (step 56 h ).
  • derivation process 56 After all the blocks have been checked for a given n-tuple, derivation process 56 returns the data row (step 56 m ). Often, some of the data values will be empty. These may be represented by a “.” in the returned list.
  • the resulting dataset can be further modified via standard techniques to be read into various programs; e.g., XML output, spreadsheet programs such as Excel or Quattro Pro, database programs such as Oracle, Access, and SQL Server, statistical programs such as Stata, SPSS, and SAS, and so on.
  • a login web page 60 contains a login section 60 a , a set of links 60 b to selected datasets, a set of instructions 60 c on how to use the page, and a set of quick links 60 d to other sections of the site.
  • the system assigns a unique client identifier 44 a (FIG. 1C) to each user 22 , which is then used to access preferences and previous saved data sets for that user.
  • the user need only be able to enter the system by supplying his or her login information, and be given access to the datasets and previously stored data carts.
  • a main browsing section 62 contains a link bar 62 a , which may be identical to 60 d or contain different options.
  • Section 62 also contains a frame for a list of all available datasets 62 b and a browsing area 62 c .
  • the datasets listed in 62 b may be identical for all users, or they can vary depending on the user's identification and level of access privileges, stored for example in admin level 44 m (FIG. 1C).
  • the datasets area 62 b also displays a Search option, which implements a search routine on all dataset titles, variable names, and descriptive metadata. If the system 10 hosts only one type of data, such as data grouped by country and year, then this browse page can be accessed directly.
  • an intermediate selection stage allows the user to choose which type of data she wants to view, according to the data's blocking variables.
  • These data groups could include, for instance, country-year, firm-year, firm-quarter, legislative district, stock-day, and so on.
  • the system examines the metadata for the data sets currently available and selects those data sets of the currently selected data group. The browsing page for that data group is then dynamically generated and displayed to the user, and only those data sets in the system of the selected group are shown in area 62 b of the resulting page.
  • FIG. 7B shows the main browsing section 62 with the browsing area 62 c replaced by the data cart summary 62 d , the dataset summary 62 e , and the data selection area 62 f .
  • This screen results when the user clicks on one of the datasets listed in 62 b .
  • the data shown in the data cart summary 62 d summarizes the current blockset.
  • Data cart summary 62 d includes an icon 63 that has an image of a physical shopping cart.
  • the dataset summary 62 e is constructed from dataset metadata stored in application database 20 .
  • the cart summary contains descriptive information on the current data cart, described more fully below.
  • the dataset summary contains summary information on a given dataset and a Quick Download option, in which the entire dataset is downloaded as a single blockset, using derivation process 56 (FIG. 5A).
  • the Add Data to Cart option adds the blockset consisting of the entire dataset to the current cart.
  • the data selection area contains a number of tabs: one for each blocking variable and one for all other non-blocking variables.
  • the blocking variables are country and year, while the non-blocking variables are SYSTEM, YRSOFFICE, FINITTRM, etc.
  • the system allows the user to choose one or more variables; in the example, this is done through checkboxes. It also allows for the immediate download of a subset via the Download Subset link, which downloads the currently selected blockset as defined by the choices made on the tabs. Various default options are available if the Download Subset link is pressed before all the tabs have been filled out; unused tabs may be assumed to be empty, for instance, or have all possible choices filled.
  • the Add Subset to Cart option adds the subset defined by the tabs to the current cart. Not all the variables in a dataset need be available to all users; dataset access, variable access, and data point access can all be limited via security clearance codes and data filtering. Each variable can also have one or more comments associated with it, stored in description 44 h (FIG. 1C). The user can access controls to create comments by clicking on the comment link next to any variable. These comments can be emailed to the system administrator and other users via standard mailing programs.
  • FIG. 7C shows main browsing section 62 with the years tab 62 g selected.
  • the user can select all years, select individual years one by one, or select years in groups decade by decade. Individual controls exist for each value the variable can adopt.
  • FIG. 7D shows the same view, but with the countries tab 62 h selected. In the example, countries can be selected one at a time, or in groups by continent or international affiliations.
  • an example data cart screen 66 results from the user's clicking on the Add Data to Cart button or the Add Subset to Cart button, shown in FIG. 7A.
  • the cart summary area 66 a displays summary information about the current cart, while the list of variables 66 b contains a list of the variables currently included in the data cart 50 (FIG. 2), and any relevant metadata information.
  • the dataset listing 62 b and data cart summary information 62 d remain, as in browsing area 62 .
  • data cart screen 66 allows a user to view and generate the codebook 66 c associated with the data cart 50 .
  • Codebook 66 c is a custom-generated list of the metadata associated with the variables in the current blockset.
  • a sample codebook window 68 includes the metadata 68 a for a sample variable.
  • the process of selecting data in system 10 allows users to place variables in their data cart 50 , just as online shopping providers use a shopping cart for goods and services.
  • the process of selecting data for data cart 50 is analogous: users place variables in their data cart.
  • the cart analogy makes system navigation easy and intuitive; the users need only select the variables that they want and put them into a cart. They can combine variables from more than one dataset as long as all datasets belong to the same data group. When all desired variables have been added, the user can “check out” the cart by downloading the data, as described below.
  • This construction also allows for asynchronous data selection; users can build their custom-made data sets a little at a time, as opposed to systems in which the variable selection must be made all at once.
  • Data carts can be created and modified quickly due to their underlying blockset construction, whereby variables can be added and subtracted from blocksets via the manipulation of their metadata only, allowing the browsing process to occur without noticeable system delays.
  • the currently defined Block is added to the list of blocks making up the current data cart blockset for that user.
  • a user can view a web page which shows a list of the blocks contained in their current Datacart. This page is generated by iterating through the distinct Blocks in the Block list which defines the users datacart object. From this page, the user can choose to “Remove” individual Blocks from their cart. When the remove operation is requested, the system shortens the list of blocks in the user's datacart by one item, and the specified block is no longer referenced by the datacart object.
  • a user session can continue, with the user defining, adding, and removing Blocks from their dataset multiple times. If and when they decide they want to “Save” this cart, the system prompts for information about the cart.
  • the user provides a title 48 , a description 48 c , and a public flag 48 f for the cart.
  • the entered title and description are combined with the privacy flag, the user's ID 44 a , the current date, and a unique ID (blockset key 48 a ).
  • This information is saved into application database 20 .
  • a description of each Block in the datacart's list of Blocks is saved as a separate row in block table 46 . Rows in this table contain a blockset FK 46 c column which references back to this cart's ID. In this way, blocks for a specified cart are uniquely identified.
  • the data cart information form 70 a can contain any of a number of fields relating to the summary cart information.
  • the cart may also be saved as public or private. If the former, then any other registered user may view that data cart; if private, then only the user herself may view the cart in the Archive, described below. These choices need not be dichotomous; intermediate levels of access can be specified as well, depending on the particular organization's needs. For instance, only registered users with a certain clearance level or above may have access to certain data and saved carts.
  • the user can return to browsing the data sets, as illustrated in FIG. 7E. Any other dataset can be selected, and its variables can be added to the variables currently in the data cart.
  • the system displays the choices on the blocking variables made on the previous blockset added to the cart.
  • the previous countries and years are both automatically selected and highlighted, the countries tab being shown as 62 h , with highlighted previous choices as 72 i.
  • the list of variables in the cart display 66 f is expanded to include the variables from all data sets added to the cart. Items can be removed from the cart either one at a time via 66 g , or in their entirety via 66 h . Summary information from the cart is available via 66 i . As above, the entire blockset can be viewed with control 66 j or “checked out”—downloaded—with control 66 k.
  • archive section 72 the user accesses archive section 72 through the standard toolbar 60 d .
  • One area of archive section 72 shows saved carts 72 a , as well as carts 72 b from other users who have placed public carts on the system. The user can delete her own carts. When another user's name is selected, then their publicly saved carts appear. Clicking on any saved cart brings that cart's information into the viewing area 72 c , as illustrated in FIG 11 B. This shows the variables in the saved cart 72 d , which can then be manipulated as any other blockset.
  • the archive section 72 allows for collaboration by geographically dispersed users. Datasets can be saved, edited, and then saved again online, by manipulating the metadata of the saved blocksets. This makes dataset storage inexpensive from the viewpoint of system resources, and it makes the saving and retrieval of data carts quick and efficient. As with the data cart, the archive area also allows the user to select variables asynchronously. Not only can data be added to a cart a little at a time over a single login session, but saved carts add the possibility of stretching the dataset creation process across multiple sessions without having to rebuild the dataset from scratch every time.
  • an administrator screen 74 is accessible to users with administrative privileges.
  • the left-hand pane 74 b in FIG. 12A lists administrative options, including List All Users, and Add datasets 74 n .
  • the system initiates a wizard interface as illustrated in the right-hand pane 74 a .
  • the databases to be loaded into the system are first located within a single directory on the server; hence only the database name need be entered.
  • a fully qualified directory path or URL could be entered, requiring only that the server have access privileges to the specified location.
  • the database name is a value to be stored in database name field 36 d of dataset table 36 (FIG. 1C).
  • Dataset 74 d is a control that specifies a data source table 84 (FIG. 12H) of data that will be the basis for a dataset 24 .
  • the table is identified within its database by a data source table name 84 d .
  • Codebook 74 c is a control that allows the user to specify a codebook table 82 , which system 10 uses as a source of metadata to describe the dataset 24 to upload.
  • Codebook table 82 is identified within its database by codebook table name 82 f . If the dataset uploads successfully, system 10 stores this value in the field codebook name 36 g of the corresponding row in dataset table 36 (FIG. 12H).
  • a codebook table 82 stores information used to identify a dataset 24 , such as the dataset name, the set of blocking variables, and the names of the non-blocking variables. Other useful metadata can be added as well, such as author, variable descriptions, and coding rules.
  • Each row of the codebook table 82 represents information applicable either to an entire dataset or to a variable within a dataset.
  • a row includes the keyword “dataset” in code field 82
  • the row represents a dataset 24 .
  • code field 82 gives the name of a non-blocking variable field 84 b in data source table 84 .
  • Title 82 c stores a name for the corresponding variable.
  • Description 82 d stores phrases describing dataset 24 in detail.
  • Author FK 82 e is a foreign key referencing author table 40 , which allows a dataset or variable to be associated with a particular author.
  • each non-blocking variable has a corresponding row in codebook table 82 , which serves as the basis for a new corresponding row in variable table 38 .
  • title 82 c maps to title 38 c , description 82 d to description 38 e , and code 82 b to name 38 g.
  • the row in codebook table 82 with “dataset” in code field 82 b corresponds to a new row in dataset table 36 .
  • description 82 d maps to description 36 f.
  • FIG. 12D An illustration of the latter is provided in FIG. 12D, where the administrator is presented with a form 74 g through which she can modify a dataset's author and category information.
  • the author editing screen is illustrated in FIG. 12E, which displays the current author information 74 h , the list of potential authors currently in the system 74 i , and a form 74 j for adding a new author.
  • FIG. 12F offers the possibility of adding or amending category information.
  • the current set of categories is provided in 74 k , possible additional categories already in the system are provided in 74 m , while new categories can be added via 74 p.
  • an upload process 80 guides a user through a process of adding a dataset 24 to the system, verifying that the necessary metadata is in place.
  • Upload process 80 receives a data group 34 , either explicitly as a passed value or reference, or implicitly by a default value (step 80 a ).
  • the data group 34 has a set of blocking variables.
  • Upload process 80 presents a user interface that prompts a user for a database name (step 80 b ). Upload process 80 then receives user input specifying database name, which upload process 80 stores (step 80 c ). Upload process 80 compares the database name to its current set of databases 18 (FIG. 1A) and determines whether the database name specifies a database 18 that system 10 can connect to, for example using an ODBC connection (step 80 d ). If so (result 80 e ), upload process 80 connects to the database and retrieves a list of tables in the database to present to the user (step 80 g ). Otherwise, if the database is not available (result 80 f ), the process prompts again for database name (step 80 b ).
  • the user can choose a data source table 84 from the list of tables, as well as a codebook table 82 (FIG. 12H).
  • upload process 80 Upon receiving input specifying the user's choice ( 80 h ), upload process 80 prompts the user to confirm blocking and non-blocking variables in the data source table (step 80 i ).
  • Upload process 80 tests whether the user input confirms that the blocking and non-blocking variables are identified correctly (step 80 j ). With confirmation (result 80 k ), upload process 80 retrieves rows from codebook table 82 that describe the dataset and variables (step 80 p ). If the user does not provide confirmation (result 80 m ), upload process 80 returns failure (step 80 n ) and terminates without adding a dataset 24 to the system.
  • Upload process 80 verifies that the rows in codebook table 82 correspond to the dataset and variables of data source table 84 (step 80 q ). If any row is missing (result 80 r ), upload process 80 returns failure (step 80 n ) and terminates without adding a dataset 24 to the system. Otherwise (result 80 s ), upload process 80 prompts the user and receives input to confirm the metadata for the prospective dataset 24 itself, as opposed to the metadata for the variables (step 80 t ).
  • metadata for the dataset includes its storage name, its name in presentation to users (i.e., a caption), a textual description, and its location.
  • upload process 80 loops to confirm each of the variables (loop bounded by 80 w and 80 ab ). Otherwise, if the dataset metadata is not confirmed (result 80 v ), the process returns failure (step 80 n ) and terminates without adding a dataset 24 to the system.
  • Upload process 80 iterates over each variable to prompt the user with the associated metadata (step 80 x ), for example the storage name, the title, a textual description, associated keywords, and units. Upload process 80 tests the user's response (step 80 y ). If the user rejects any variable's metadata (result 80 aa ), the process returns failure (step 80 n ) and terminates without adding a dataset 24 to the system. Otherwise (result 80 z ), upload process 80 commits the dataset 24 to application database 20 and adds a corresponding database 18 as the provider of the underlying data.
  • the associated metadata for example the storage name, the title, a textual description, associated keywords, and units.
  • Upload process 80 tests the user's response (step 80 y ). If the user rejects any variable's metadata (result 80 aa ), the process returns failure (step 80 n ) and terminates without adding a dataset 24 to the system. Otherwise (result 80 z ), upload process 80 commits the dataset
  • the system When a user asks to download the codebook information about a specified dataset or saved datacart, the system iterates through each block of the datacart. For each block, it retrieves all codebook records which reference either the dataset containing that block, or a variable within that block's definition. This codebook information is stored in memory and formatted for display in a web page, or for printing. When a secondary table row is referenced, e.g. when an author is specified, the author table is queried, the information is retrieved and then formatted.
  • a codebook process 86 receives a request for a codebook for a table derived from a blockset (step 86 a ).
  • Codebook process 86 retrieves metadata for blockset from application database 20 , for example title 48 e and description 48 c from blockset table 48 (step 86 b ).
  • Codebook process 86 outputs the metadata for blockset (step 86 c ), then loops over the variables of the derived table (loop bounded by 86 d and 86 g ), as encoded in the rows in variable table 38 .
  • codebook process 86 retrieves metadata from the application database (step 86 e ).
  • metadata includes a variable title 38 c , a description 38 e , and the title 36 b of the associated dataset in dataset table 36 .
  • databases 18 can be located on other machines than server 12 , and can be connected via a database server.
  • server 12 can include multiple physical machines distributed across a network.
  • each block 26 has one blockset 28 .
  • This provides an administrative advantage, in that edits to the block of a first blockset 28 cannot affect other blocksets, since blocks are not shared. In alternative embodiments, however, blocks could be shared by blocksets.
  • codebook table 82 includes basic text information like descriptions and comments.
  • codebook table 82 could also reference rows in other tables of application database 20 , such as citation records or categories.

Abstract

A computer-based method represents a subset of a dataset table of rows and columns. The method includes selecting a set of blocking variables corresponding to blocking columns of the dataset table. For each row in the dataset table, a tuple of values for the blocking columns uniquely identifies the row within the dataset table. The method also includes selecting a set of non-blocking variables that correspond to columns of the subset. The set of non-blocking variables does not intersect the set of blocking variables. The method also includes creating a block information structure that includes both the set of non-blocking variables and, for each blocking variable in the set of blocking variables, a set of values.

Description

    CLAIM OF PRIORITY
  • This application claims priority under 35 USC §[0001] 119(e) to U.S. Patent Application Serial No. 60/311,495, filed on Aug. 10, 2001, the entire contents of which are hereby incorporated by reference.
  • TECHNICAL FIELD
  • This invention relates to computer information systems, and more particularly to the storage and manipulation of metadata for data sources. [0002]
  • BACKGROUND
  • In a complex technological environment, information is commonly stored in heterogeneous databases on a wide range of media. An organization's mission-critical information can be housed in a central server, updated on a continual basis via an online transaction processing (OLTP) system or an enterprise resource planning (ERP) system, using relational databases like Oracle, Sybase, or Microsoft Access. Other information can be housed remotely on servers and downloaded via specialized software. Still other information can be stored on Compact Discs and updated periodically with new releases. [0003]
  • One traditional approach to integrating disparate data sources is built around the notion of a “data cube,” or dimensional database, which employs a star schema to organize the constituent data sets. This technology can place high demands on system resources when the entire data cube must be rebuilt after a single data set or data point changes. Also, since the entire data cube must be traversed when merging, subsetting, or analyzing data, this process can be slow, creating system bottlenecks. [0004]
  • SUMMARY
  • In general, in one aspect, the invention features a computer-based method of representing a subset of a dataset table of rows and columns. The method includes selecting a set of blocking variables corresponding to blocking columns of the dataset table. For each row in the dataset table, a tuple of values for the blocking columns uniquely identifies the row within the dataset table. The method also includes selecting a set of non-blocking variables that correspond to columns of the subset. The set of non-blocking variables does not intersect the set of blocking variables. The method also includes creating a block information structure that includes both the set of non-blocking variables and, for each blocking variable in the set of blocking variables, a set of values. [0005]
  • Preferred embodiments include the following. For each row in the subset that has a tuple of values for the blocking columns, the values of the tuple are included in the corresponding sets of blocking-variable values. The subset of the dataset table includes the entire dataset table. The number of elements in the set of non-blocking columns, plus the number of elements in the sets of values for the blocking columns, is linearly proportionate to an upper bound on the binary storage requirements of the block information structure, particularly when such sets are arbitrarily large. The block information structure is stored on a machine-readable medium. [0006]
  • Among other advantages, this aspect of the invention provides a method for combining, selecting, and delivering data from heterogeneous databases. The method allows for the creation and manipulation of metadata entities called blocksets. Blocksets contain summary information about sets of data, and can therefore be manipulated quickly, flexibly, and efficiently in place of the datasets themselves. Blocksets contain metadata about the datasets, allowing a user to choose combinations of the datasets for viewing without having to access the datasets directly. [0007]
  • In general, in another aspect, the invention features a computer-based method of accessing information in heterogeneous databases. The method includes presenting a graphical user interface, with controls representing a data cart and a plurality of datasets. The method also includes receiving user input that selects a dataset to add to the data cart. The method also includes generating a block information structure that specifies the dataset, and adding the block information structure to the data cart. [0008]
  • Preferred embodiments include the following. The method incorporates into the block information structure a set of non-blocking variables, a set of blocking variables, and for each such blocking variable, a set of values. The dataset includes a plurality of rows, each identified by a corresponding tuple of values from the sets of values for the blocking variables. For a blocking variable in the set of blocking variables, the method further includes presenting enumeration controls in the graphical user interface. Each enumeration control corresponds to an existing value in the dataset for the blocking variable. The method further includes collecting user input that specifies a subset of the dataset, and includes representing the subset in corresponding block information structure. The method further includes saving the data cart to a persistent storage medium. The method further includes adding a second block information structure to the data cart, in response to user input. The controls representing a data cart include a symbol of a shopping cart. [0009]
  • The graphical user interface allows a user to construct blocksets known as data carts. The method further includes collaboration features such as the ability to save and comment on blocksets, and the option of peer-to-peer systems that accommodate geographically dispersed data sources. [0010]
  • In general, in still another aspect, the invention features a computer-based method of retrieving information represented by a blockset. The method includes connecting to databases, wherein each database corresponds to a block in the blockset. Each such block specifies a subset of a dataset stored in the corresponding database. The blockset has a plurality of blocking variables. The blocks each include a set of non-blocking variables, and have a set of values for each blocking variable in the plurality of blocking variables. The method includes adding a blocking column to a derived table, once for each blocking variable in the set of blocking variables. The method also includes adding to the derived table a non-blocking column for each element in a union of the non-blocking variables in the plurality of blocks. Furthermore, the method includes adding a row to the derived table. The row includes a cell for each column in the derived table. The row is uniquely identified by a tuple of values from the sets of values for the blocking variables. The method also includes populating a cell of a non-blocking column in the row, using a value retrieved from the database corresponding to the block. The block contains the non-blocking variable corresponding to the non-blocking column. [0011]
  • Preferred embodiments include the following. The method further includes adding a row for each tuple of values from the sets of values for the blocking variables, provided the tuple occurs in at least one dataset corresponding to a block in the plurality of blocks. The method includes, when connecting, using each block as a basis for a database query that specifies the corresponding subset. The database query uses Structured Query Language. [0012]
  • In general, in yet another aspect, the invention features a computer-based method of representing a table derived from a blockset, including outputting blockset metadata that describes the blockset. The blockset metadata includes fields for a blockset title and a blockset description. The method also includes outputting column metadata for a column in the table, such that the column metadata describes a variable associated with the column. The variable is associated with an underlying dataset that provides data to the table in the blockset. The column metadata includes fields for a title of the variable and for a title of the underlying dataset. [0013]
  • In general, in another aspect still, the invention features a computer-based method of collecting metadata for a dataset, including prompting a user to provide a database name. The method also includes confirming that the database name represents a database, displaying a list of tables in the database to the user, receiving user input specifying a table in the list of tables, and prompting the user to confirm that a list of blocking variables and a list of non-blocking variables are correct for the database. In addition, the method includes prompting the user to confirm metadata for the dataset and for the list of non-blocking variables. If the user confirms the list of blocking variables, the list of non-blocking variables, and the metadata, the method also includes adding a dataset corresponding to the table to a collection of datasets. [0014]
  • Preferred embodiments include the following. The metadata includes a title for the dataset. The metadata includes a description for the dataset. The metadata includes a title for a non-blocking variable in the list of non-blocking variables. The metadata includes a description for a non-blocking variable in the list of non-blocking variables. [0015]
  • The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.[0016]
  • DESCRIPTION OF DRAWINGS
  • FIG. 1A is a schematic diagram of an information processing system. [0017]
  • FIG. 1B is a schematic diagram of physical components of an information processing system. [0018]
  • FIG. 1C is a diagram of a database schema for a data group example. [0019]
  • FIG. 2 is a diagram is a schematic diagram of relationships between datasets, blocks, blocksets, and blockset derived tables. [0020]
  • FIG. 3A is a diagram of block creation from datasets. [0021]
  • FIG. 3B is a formula expressing a block. [0022]
  • FIG. 4 is a schematic diagram of relationships between blocks, blocksets, and blockset derived tables, including n-tuples. [0023]
  • FIG. 5A is a flowchart of a derivation process. [0024]
  • FIG. 5B is a formula for a database query to retrieve a dataset table underlying a block. [0025]
  • FIG. 5C is a codebook table. [0026]
  • FIG. 6 is a snapshot of a user interface for a login page. [0027]
  • FIGS. 7A through 7E are snapshots of a user interface for a browsing environment. [0028]
  • FIGS. 8A and 8B are snapshots of a user interface for a data cart. [0029]
  • FIG. 9 is a snapshot of a user interface for a sample codebook. [0030]
  • FIG. 10 is a snapshot of a user interface for a data cart information entry page. [0031]
  • FIGS. 11A and 11B are snapshots of a user interface for a dataset archive. [0032]
  • FIGS. 12A through 12F are snapshots of a user interface for an administrator page. [0033]
  • FIG. 12G is a flowchart for an upload process. [0034]
  • FIG. 12H is a diagram of a database schema for a codebook table. [0035]
  • FIG. 13 is a flowchart of a codebook process.[0036]
  • DETAILED DESCRIPTION
  • Referring to FIG. 1A, physical elements of an [0037] information processing system 10 include a server 12 connected to client machines 14 such as workstations, laptops, or desktop computers via a network 16. For example, network 16 could include an Internet connection, an intranet within a single organization, or an extranet connecting an organization and its associates.
  • [0038] Server 12 includes software components such as a web server engine 12 a, databases 18, and an application database 20. Web server engine 12 a serves out web pages 12 b over the network. Web server 12 a connects to database tables 18 a of databases 18 via a database connection protocol, such as ODBC (Open DataBase Connectivity) 12 c. Web server 12 a also connects to application database 20, which includes metadata tables and user information tables. Furthermore, web server engine 12 a encodes procedures for creating and manipulating metadata objects called blocksets and related entities, as will be explained in more detail with regard to FIG. 1C.
  • A user [0039] 22 interacts with information processing system 10 via client machine 14.
  • Referring to FIG. 1B, [0040] client machine 14 includes a processor 14 a that operates on data stored in memory 14 b and executes machine-readable instructions stored in storage 14 e and in memory 14 b. Client machine 14 renders visual information on a display device 14 c and receives input from a user 22 via one or more input devices 14 d, such as a mouse and keyboard. Storage 14 e includes an operating system that uses a point-and-click GUI. Client machine 14 also runs a web browser application, not shown, that logically connects to web server engine 12 a (FIG. 1A). Network interface 14 f connects standalone device 14 to network 16. Bus 14 g carries information between components of standalone device 14.
  • Components of [0041] server 12 are comparable in general structure and function to their like-named counterparts in client machine 14. In particular, processor 12 h executes machine-readable instructions that encode web server engine 12 a, databases 18, and application database 20.
  • A [0042] human administrator 22 a interacts with server 12. Administrator 22 a typically has advanced privileges on server 12, while typical users 22 do not. In this discussion, however, unless specified otherwise, the term “user” refers to both users 22 and to administrators 22 a. That is, administrators 22 a are a subset of users 22.
  • Referring to FIG. 2, logical entities used in [0043] system 10 include a dataset 24, a block 26, a blockset 28, and a blockset derived table 30. Broadly, and as will be explained in more detail, a dataset 24 represents a table of data. A block 26 summarizes a portion of a dataset (including perhaps the entire dataset); the summarized dataset is said to “underlie” the block. A blockset 28 collects one or more blocks. A blockset derived table 30 is a table of data for a blockset, collecting, for the blocks in the blockset, the corresponding portions of the underlying datasets.
  • Dataset [0044]
  • Referring to FIG. 3A, [0045] dataset 24 is an information structure that encodes a series of observations on a given set of variables. The observations are stored in data sources 18 a (FIG. 1).
  • This description refers to elements of [0046] dataset 24 in the standard manner, with the data arranged in a table whose columns 24 a represent variables and whose rows 24 b represent observations. For a variable i that includes an M by 1 column vector Xi, where “M” is the number of rows, then an M by N dataset is a matrix of the form [X1 X2 . . . Xn]. This matrix contains blocking variables 24 c which, collectively, uniquely identify a given observation. Other variables in the dataset are non-blocking variables 24 d. For instance, a dataset on cars might uniquely identify each car by its make, model and year. Make, model and year would be encoded as blocking variables of the corresponding dataset. Non-blocking variables might include the car's size, price and gas mileage. As another example, a financial dataset might identify stocks by the ticker symbol and the day from which the closing price was taken. The non-blocking variables could include the stock price, volume traded, and any splits that might have occurred on that day. Each dataset contains at least one non-blocking variable.
    TABLE 1
    WorldInfo dataset definition and metadata:
    DsetID = 345;
    DsetBlockingVars = Year, Country;
    DsetNonblockingVars = Party, Tenure, ElexnMode;
    DsetTitle = “Political Indicators”;
    DsetDatabaseName = PolInd.mdb;
    DsetTableName = “Indicators Data”;
    DsetCodebookName = PolIndCodebook;
    DsetDescription = “Various political indicators from IFC”;
    DsetCategoryID = 3;
    DsetCitation = “IFC Annual Report, 1998”.
  • In the example of Table 1, the necessary metadata elements are the database name, the blocking variable list, and the non-blocking variable list. The other metadata are optional, and may be used to describe the data set in the user interface, described with regard to FIGS. 6 through 12F, below. Other corresponding tables are used in conjunction with the dataset metadata, such as a Variables table with variable descriptions, a Categories table with textual descriptions of the category codes, or Authors and Citations tables with more extensive bibliographic information. [0047]
  • Data Group [0048]
  • A [0049] data group 34 is a collection of datasets 24 having identical blocking variables. Typically, datasets within a data group are logically connected to each other by referencing the same basic type of data. For instance, in the “WorldInfo” example of Table 1, all datasets have blocking variables for Country and Year. Thus they form a data group describing countries' economic, political, demographic, social, and geographic conditions on a year-to-year basis. Note that a single data set could logically belong to more than one group; for example, a data set containing daily stock prices could also belong to a group of yearly corporate data, simply by averaging daily information over each one-year period. Furthermore, one practiced in the art would recognize that various data groups can be combined together, for instance via the use of composite primary keys and intermediary translation tables.
  • Database Example [0050]
  • Referring now to FIG. 1C, an [0051] example application database 20 encodes a single data group 34 conforming to the “WorldInfo” example. That is, the blocking variables for each dataset are Country and Year.
  • [0052] Application database 20 includes a dataset table 36, a variable table 38, an author table 40, an author-dataset table 40 p, a category table 42, a category-dataset table 42 f, a user table 44, a block table 46, and a blockset table 48.
  • Dataset table [0053] 36 stores information for a dataset 24. Each row of dataset table 36 includes a dataset key 36 a, which uniquely identifies rows within dataset table 36. Such a row also includes metadata fields for presentation to a user, such as a title 36 b, which can be used as a caption in a user interface; description 36 f, which can store phrases describing dataset 24 in detail; and codebook name 36 g. Each row also includes metadata for locating the dataset 24 in a data source 18 a, such as a table name 36 c and a database name 36 d. Fields for variable list 36 k, country list 36 m, and year list 36 n store comma-delimited lists of values from the corresponding data source 18 a. These lists cache data that specifies a block, reducing the need to connect directly to data source 18 a. In particular, country list 36 stores the distinct values in the underlying data for the blocking variable Country in the WorldInfo example. Similarly, year list 36 n stores distinct values for Year. Other bookkeeping data in dataset table 36 includes a first version 36 e and a creation date 36 p.
  • Variable table [0054] 38 stores information about non-blocking variables associated with a dataset 24. Each row of variable table 38 describes a variable. One row of dataset table 36 can be associated with many rows of variable table 38 via the foreign key field variable dataset ID 38 b. Each row includes a title 38 c for the variable, a units notation 38 f of the values the variable measures, a list of keywords 36 d associated with variable for searching, and a description 36 e, which can store phrases describing the variable in detail. Name 38 g indicates the corresponding column name for the variable in the underlying table. Rows also include variable key 38 a, which uniquely identifies the rows within variable table 38.
  • Author table [0055] 40 and category table 42 provide supplementary metadata for a dataset 24, allowing many-to-many relationships between dataset table 36 and author table 40, and between dataset table 36 and category table 42. In general, a given author can be associated with one or more datasets, and vice versa. Likewise, categories can be used to describe one or more datasets, and a given dataset can belong to one or more categories.
  • Each row of author table [0056] 40 represents an author that can be associated with data in a dataset 24. Each such row includes fields for a first name 40 b, a last name 40 c, an email address 40 d, a website address 40 e, and an affiliation 40 f. Rows also include an author key 38 a, which uniquely identifies the rows within author table 40.
  • Author-dataset table [0057] 40 p implements the many-to-many relationship between dataset table 36 and author table 40. Link key 40 q uniquely identifies the rows within author-dataset table 40 p. Author FK 40 r is a foreign key referencing author key 40 a. Dataset FK 40 s is a foreign key referencing dataset key 36 a.
  • Each row of category table [0058] 42 represents a category that can be associated with data in a dataset 24. Rows include a category key 42 a, which uniquely identifies the rows within author table 40. Each such row includes fields for a category name 42 b and a description 42 d, which can store phrases describing the category in detail. A field for parent category 42 c is a reflexive foreign key, i.e., one that references category key 42 a of its own table, allowing category table 42 to store nested hierarchies of categories.
  • Category-dataset table [0059] 42 f implements the many-to-many relationship between dataset table 36 and category table 42. Link key 42 g uniquely identifies rows within author-dataset table 42 f. Category FK 42 h is a foreign key referencing category key 42 a. Dataset FK 42 k is a foreign key referencing dataset key 36 a.
  • Each row of user table [0060] 44 stores persistent information about a system account for a human user 22, including a user name 44 b, a login name 44 c, a password 44 d, and admin level 44 m, which specifies a level of system privileges granted to the user. Each row also stores metadata about the user, such as title 44 e, an email address 44 f, a website 44 g, and a description 44 h. A field for affiliation 44 k indicates an organization or domain for the user. Latest dataset 44 n indicates the dataset 24 last used by the user, allowing the user to return to this dataset in a subsequent session.
  • Each row of block table [0061] 46 represents a block 26 (FIG. 2). One row of dataset table 36 can be associated with many rows of block table 46 via the foreign key field dataset FK 46 b. Conversely, one row of block table 46 references one row of dataset table 36. That is, each block 26 has one dataset 24, but one dataset 24 can have many blocks 26.
  • One row of blockset table [0062] 48 can be associated with many rows of block table 46 via the foreign key field blockset FK 46 c. Each block 26 has one blockset 28, but one blockset 28 can have many blocks 26.
  • Each row of block table [0063] 46 in the described example has a list 46 d of non-blocking variables associated with the block 26. Each row also has a set of values for the blocking variables Country and Year in the fields countries 46 e and years 46 f, respectively. Rows also include a block key 46 a, which uniquely identifies the rows within block table 46.
  • Each row of blockset table [0064] 48 represents a blockset 28. Each row has metadata describing the blockset 28 for presentation to a user, including fields for title 48 e, description 48 c, authors 48 m, citations 48 n, and categories 48 p. Each row also includes bookkeeping fields such as date 48 d, version 48 g, published 48 h, submitted 48 k, and deleted 48 q. A foreign key field user FK 48 b references a row in user table 44, indicating the user 22 who owns the blockset 28. A field for public 48 f indicates whether the blockset 28 should be shared with other users. Rows also include a blockset key 48 a, which uniquely identifies the rows within blockset table 48.
  • In the WorldInfo example, a country table, not shown, maps three-letter abbreviations to country names. For example, an entry in the table contains an abbreviation value of “ITA” and a country name value of “Italy”. [0065]
  • Blocks [0066]
  • In general, blocks refer to sets of data via metadata information. A block summarizes a set of data by including blocking-variable values that specify the set of data, together with metadata about additional, non-blocking-variable columns to associate with the block. [0067]
  • Referring to FIG. 3A, a [0068] block 26 describes a subset of data from one dataset 24. The subset is defined by specifying a set of non-blocking variables 24 d from the dataset and a set of values from each of the blocking variables 24 c. Thus a block B from a dataset having m blocking variables and n non-blocking variables is represented by a (m+1)-tuple, shown in FIG. 3B. The Vi are non-blocking variables and the X1 are legal subsets of values from the m blocking variables. For the example WorldInfo implementation, a corresponding block definition includes: [{Non-blocking vars}, Country set, Year set].
    TABLE 2
    WorldInfo block definition:
    VariableSet = {23, 45, 73, 215}.
    CountrySet = {GER, USA, SWE, AUS, CAN, JAP};
    YearSet = {1980, 1985, 1990, 1995, 2000};
  • Table 2 gives a sample block definition for the WorldInfo example. The specified values for the Country blocking variable are codes for Germany, USA, Sweden, Austria, Canada, and Japan. The specified values for the Year blocking variable are 1980, 1985, 1990, 1995, 2000. Non-blocking variable are referred to by their key values [0069] 38 a (FIG. 1C).
  • Referring again to FIG. 2, as an example of a process of constructing a block, block [0070] 26 can represent an entire dataset 24. In this case, dataset 24 contains blocking variables 24 c labeled BV1 and BV2 and non-blocking variables 24 d labeled NBV1 and NBV2. The block 26 will therefore include the values 26 a that BV1 and BV2 take on within the desired subset. For instance, a year set might consist of the set {1991, 1992, 1993, 1994, 1995, 1996}. The block 26 also includes the names or labels of the non-blocking variables NBV1 and NBV2, which are stored in a list in field 26 b. Notice that the names themselves or pointers to name identification fields are sufficient; the actual data within these fields need not be stored in the block itself.
  • Blocks require relatively little storage memory to reference large collections of data. For instance, a block containing i variables for j countries and k years can reference as many as i*j*k rows from the corresponding dataset, but the block can describe these rows using only i+j+k elements—a significant improvement for large values of i, j, or k. In general, the number of elements in the set of non-blocking variables and in the sets of values for the blocking columns provides an approximate upper bound on binary storage requirements of the block, subject to a linear scaling factor. As the sets grow arbitrarily large, the binary storage requirements of the block are on the order of the sum of the sizes of these sets. In other words, the binary storage requirements are linearly proportionate to the cardinality of these sets. [0071]
  • Blocksets [0072]
  • Referring to FIGS. 2 and 4, a [0073] blockset 28 is an unordered set of one or more blocks 26 from a single data group. For every block, there is a corresponding simple blockset with only one block. A user 22 interacts with system 10 in the context of one or more sessions 52 on client machine 14. Operations in one session 52 are independent of other sessions 52. A data cart 50 is a blockset 28 corresponding to the set of blocks currently selected in a session. Each session has either zero or one data cart 50. When initialized, a data cart is empty.
    TABLE 3
    WorldInfo blockset definition and metadata:
    MyBlocks = {43, 44, 45};
    CreatorId = “joseph42”;
    Title = “Political Economy Dataset”;
    Description = “War, GDP, and famine data for Africa”;
    Version = “1.3”;
    Submitted = “Aug. 4, 2001”;
    Authors = “Joseph Kalt”;
    Categories = “Politics, Economics, Development”;
    IsPublic = true.
  • Of the fields in the example of Table 3, only the first (MyBlocks, expressing the collection of blocks) is strictly necessary to define the blockset; all others are examples of blockset metadata that is useful in storage and user display functions. In particular, keeping track of the blockset creator in the creatorId field allows blocksets to be saved and shared with other users. [0074]
  • Since blocksets include collections of blocks, in general, they place minimal demands on memory. Furthermore, in [0075] system 10, a blockset 28 is encoded in software as a blockset object 54. Thus, blockset objects contain both properties 54 a and methods 54 b; that is, data and operations. Properties 54 a include the set 54 c of blocks associated with the blockset 28, such as the vector myBlocks in the example of Table 3. Blockset objects 54 can also possess methods 54 b which contain instructions on how to perform certain operations on blockset objects 54. For instance, a blockset object 54 can display summary information about itself by counting the number of variable entries in each of its constituent blocks. It can also associate a new block to itself by adding a reference to that block to its collection 54 c. A blockset object 54 can merge itself with another blockset object 54 via a set union operation.
  • Blockset Derived Tables [0076]
  • [0077] System 10 can download blockset data to a client machine via a number of standard output formats.
  • Referring to FIG. 4, a derived table [0078] 30 relates to a blockset 28 in the following way. Table 30 has blocking variable columns 30 a, one for each blocking variable in the data group. The other columns form a set 30 b, which is the union of the non-blocking variables from the constituent blocks 26 of blockset 28. When distinct blocks are from the same dataset, they may include overlapping variables. In this case, the derived table contains only one column for each overlapping variable.
  • Table [0079] 30 has rows 30 c for each of the blocking variable values specified in some block from the blockset. Specifically, if there are N blocking variables, the new table will have a row for every N-tuple 30 d of blocking variable values, [valBV1, valBV2, . . . , valN], where valX is one of the values specified for variable X in one of the blockset's blocks.
  • Table [0080] 30 has two types of columns: blocking variable columns 30 a, and data columns 30 b. Blocking variable columns have cells which contain the appropriate blocking variable values. That is, for a row associated to [val1, val2, . . . , valN], the value for blocking variable i will be val1. The data columns have cells that can be referenced by a row identifier, [val1, val2, . . . , valN], and a non-blocking variable from one of the blocks in the defining blockset, e.g. NBvar from block K.
  • Consider the cell of the new table characterized by [val[0081] 1, val2, . . . , valN] and NBvar. If there is a block such that:
  • 1. [val[0082] 1, val2, . . . , valN] is an allowed blocking variable combination, and
  • 2. NBvar is an included variable, [0083]
  • then this cell is filled in with the value from that block's dataset. Note that this value is the same if there are 2 such blocks in the blockset. If, for all such blocks, that cell does not exist, then this cell is empty in the new table. If {[val[0084] 1, val2, . . . , valN], Nbvar} is not an allowed combination for any block included in the dataset, then the cell is empty in the new table. Hence this algorithm produces an outer join of the multiple dataset tables referenced in the blockset's constituent blocks.
  • Derivation Process [0085]
  • Referring to FIGS. 5A and 5B, a [0086] derivation process 56 generates a blockset derived table 30 from a given blockset 28. This discussion assumes the dataset tables are stored in an SQL database, or some other container such that the data can be accessed via an SQL-like query language.
  • [0087] Derivation process 56 sequentially returns column values and table rows of a blockset derived table 30. Each returned row is represented as a list of values. The first row is header row, listing column names.
  • Initially, [0088] derivation process 56 defines an order for the blocking variable set 30 a (step 56 a), either arbitrarily or by user response. Derivation process 56 also defines an order for the blocks in the blockset.
  • [0089] Derivation process 56 generates a hashtable of query result sets (step 56 b). Most standard query engines allow a client to define a result set and retrieve the data row by row. This allows the client to make only one query per block, and then use the natural ordering of the returned data to progressively fill in the newly generated table. The result sets are indexed by a block id, where the id reflects the block ordering defined above. Derivation process 56 therefore generates a hashtable which maps integers to data result sets from which data can be incrementally retrieved.
  • [0090] Derivation process 56 then connects to each of the data sources 18 a (FIG. 1A) that provide the underlying datasets for the blocks in the blockset (step 56 c). Derivation process 56 returns the first row (step 56 d), generating the row incrementally, starting with the blocking variable names in their specified order. Derivation process 56 completes the row by going through the ordered blocks, appending each block's variable list.
  • [0091] Derivation process 56 traverses the result sets (loop bounded by 56 e and 56 n) in the same order as the rows that will be returned to the user. This way, no further requests need be made to the query engine. This ordering is created in the following way. For each dataset from which a block has been defined, generate a variable list V (FIG. 5B) which includes both blocking and non-blocking variables. Also generate the corresponding database table name D. Label the blocking variables BV1, BV2, . . . , and denote the union of values that BV1 takes on in all blocks with the set {A, B, C, . . . }, denote the union of values that BV2 takes on in all blocks with the set {X, Y, Z . . . .}, and so forth. Then for each block, derivation process 56 submits a query of the type shown in FIG. 5B, during connection to the underlying datasets (step 56 c).
  • Alternatively, if the number of allowed values for a blocking variable is so large that the resulting query is unmanageable, the above query can be run without imposing any restriction on the value of blocking variables. The resulting query would then return data for blocking variable values which [0092] derivation process 56 would simply ignore. One benefit of this strategy is that, at this point in the process, several queries have been issued to query engines, and several connections are open. These connections are subsequently left open simultaneously, and data is retrieved incrementally, as needed. This enables the query process to run quickly and efficiently, more so than, for instance, standard queries against data cubes.
  • [0093] Derivation process 56 returns data for the blockset derived table 30 one row at a time, creating one row for each distinct n-tuple (step 56 f). Derivation process 56 proceeds from one n-tuple to the next in the same order that the SQL engine uses when it returns data via “ORDER BY BV1, BV2, . . . .” In particular, derivation process 56 progresses through the blocking variable values alphabetically, as if the n-tuple was concatenated into one long string. For example, this can be done by first looping through the values of the nth-blocking variable, returning it to its starting point, incrementing the n-1st variable value, and so on. In this way, every row of the resulting new data table is returned.
  • For each combination of allowed blocking variable values, i.e., for each n-tuple, [0094] derivation process 56 goes through the blocks in the block set (loop bounded by 56 g and 56 k). For each block, derivation process 56 retrieves a row of data from the corresponding result set, comparing the blocking variable values of that row to the current n-tuple. If it is a match, derivation process 56 adds the retrieved non-blocking variable values to the data row being created for table 30 (step 56 h).
  • After all the blocks have been checked for a given n-tuple, [0095] derivation process 56 returns the data row (step 56 m). Often, some of the data values will be empty. These may be represented by a “.” in the returned list. The resulting dataset can be further modified via standard techniques to be read into various programs; e.g., XML output, spreadsheet programs such as Excel or Quattro Pro, database programs such as Oracle, Access, and SQL Server, statistical programs such as Stata, SPSS, and SAS, and so on.
  • User Interface [0096]
  • Referring to FIG. 6, a [0097] login web page 60 contains a login section 60 a, a set of links 60 b to selected datasets, a set of instructions 60 c on how to use the page, and a set of quick links 60 d to other sections of the site. In this embodiment, the system assigns a unique client identifier 44 a (FIG. 1C) to each user 22, which is then used to access preferences and previous saved data sets for that user. In general, the user need only be able to enter the system by supplying his or her login information, and be given access to the datasets and previously stored data carts.
  • Browsing Area [0098]
  • Referring to FIGS. [0099] 7A-7E, a main browsing section 62 contains a link bar 62 a, which may be identical to 60 d or contain different options. Section 62 also contains a frame for a list of all available datasets 62 b and a browsing area 62 c. The datasets listed in 62 b may be identical for all users, or they can vary depending on the user's identification and level of access privileges, stored for example in admin level 44 m (FIG. 1C). The datasets area 62 b also displays a Search option, which implements a search routine on all dataset titles, variable names, and descriptive metadata. If the system 10 hosts only one type of data, such as data grouped by country and year, then this browse page can be accessed directly.
  • If the [0100] system 10 has more than one type of data, then an intermediate selection stage allows the user to choose which type of data she wants to view, according to the data's blocking variables. These data groups could include, for instance, country-year, firm-year, firm-quarter, congressional district, stock-day, and so on. Once a type of data is selected, the system examines the metadata for the data sets currently available and selects those data sets of the currently selected data group. The browsing page for that data group is then dynamically generated and displayed to the user, and only those data sets in the system of the selected group are shown in area 62 b of the resulting page.
  • FIG. 7B shows the [0101] main browsing section 62 with the browsing area 62 c replaced by the data cart summary 62 d, the dataset summary 62 e, and the data selection area 62 f. This screen results when the user clicks on one of the datasets listed in 62 b. The data shown in the data cart summary 62 d summarizes the current blockset. Data cart summary 62 d includes an icon 63 that has an image of a physical shopping cart. The dataset summary 62 e is constructed from dataset metadata stored in application database 20. The cart summary contains descriptive information on the current data cart, described more fully below. The dataset summary contains summary information on a given dataset and a Quick Download option, in which the entire dataset is downloaded as a single blockset, using derivation process 56 (FIG. 5A). The Add Data to Cart option adds the blockset consisting of the entire dataset to the current cart.
  • The data selection area contains a number of tabs: one for each blocking variable and one for all other non-blocking variables. In the present example, the blocking variables are country and year, while the non-blocking variables are SYSTEM, YRSOFFICE, FINITTRM, etc. The system allows the user to choose one or more variables; in the example, this is done through checkboxes. It also allows for the immediate download of a subset via the Download Subset link, which downloads the currently selected blockset as defined by the choices made on the tabs. Various default options are available if the Download Subset link is pressed before all the tabs have been filled out; unused tabs may be assumed to be empty, for instance, or have all possible choices filled. Similarly, the Add Subset to Cart option adds the subset defined by the tabs to the current cart. Not all the variables in a dataset need be available to all users; dataset access, variable access, and data point access can all be limited via security clearance codes and data filtering. Each variable can also have one or more comments associated with it, stored in [0102] description 44 h (FIG. 1C). The user can access controls to create comments by clicking on the comment link next to any variable. These comments can be emailed to the system administrator and other users via standard mailing programs.
  • FIG. 7C shows [0103] main browsing section 62 with the years tab 62 g selected. In the illustration, the user can select all years, select individual years one by one, or select years in groups decade by decade. Individual controls exist for each value the variable can adopt. Similarly, FIG. 7D shows the same view, but with the countries tab 62 h selected. In the example, countries can be selected one at a time, or in groups by continent or international affiliations.
  • Data Cart [0104]
  • Referring to FIG. 8A, an example [0105] data cart screen 66 results from the user's clicking on the Add Data to Cart button or the Add Subset to Cart button, shown in FIG. 7A. The cart summary area 66 a displays summary information about the current cart, while the list of variables 66 b contains a list of the variables currently included in the data cart 50 (FIG. 2), and any relevant metadata information. In data cart screen 66, the dataset listing 62 b and data cart summary information 62 d remain, as in browsing area 62. Additionally, data cart screen 66 allows a user to view and generate the codebook 66 c associated with the data cart 50. Codebook 66 c is a custom-generated list of the metadata associated with the variables in the current blockset. Referring to FIG. 9, a sample codebook window 68 includes the metadata 68 a for a sample variable.
  • Thus the process of selecting data in [0106] system 10 allows users to place variables in their data cart 50, just as online shopping providers use a shopping cart for goods and services. The process of selecting data for data cart 50 is analogous: users place variables in their data cart. The cart analogy makes system navigation easy and intuitive; the users need only select the variables that they want and put them into a cart. They can combine variables from more than one dataset as long as all datasets belong to the same data group. When all desired variables have been added, the user can “check out” the cart by downloading the data, as described below. This construction also allows for asynchronous data selection; users can build their custom-made data sets a little at a time, as opposed to systems in which the variable selection must be made all at once. Data carts can be created and modified quickly due to their underlying blockset construction, whereby variables can be added and subtracted from blocksets via the manipulation of their metadata only, allowing the browsing process to occur without noticeable system delays.
  • As the user browses the website, she can use web forms to define data Blocks by choosing a dataset and specifying year and country subsets. The most recent block definition is stored with the user's session information. This block is used when new datasets are browsed; their most recent year and country selections are pre-filled into the forms associated to each dataset viewed. [0107]
  • If, while using a dataset Block definition form, a user chooses the “Add to Cart” feature, the currently defined Block is added to the list of blocks making up the current data cart blockset for that user. At any time, a user can view a web page which shows a list of the blocks contained in their current Datacart. This page is generated by iterating through the distinct Blocks in the Block list which defines the users datacart object. From this page, the user can choose to “Remove” individual Blocks from their cart. When the remove operation is requested, the system shortens the list of blocks in the user's datacart by one item, and the specified block is no longer referenced by the datacart object. [0108]
  • A user session can continue, with the user defining, adding, and removing Blocks from their dataset multiple times. If and when they decide they want to “Save” this cart, the system prompts for information about the cart. In particular, referring to FIG. 1C, the user provides a [0109] title 48, a description 48 c, and a public flag 48 f for the cart. The entered title and description are combined with the privacy flag, the user's ID 44 a, the current date, and a unique ID (blockset key 48 a). This information is saved into application database 20. A description of each Block in the datacart's list of Blocks is saved as a separate row in block table 46. Rows in this table contain a blockset FK 46 c column which references back to this cart's ID. In this way, blocks for a specified cart are uniquely identified.
  • Data Cart Information [0110]
  • Referring again to FIG. 8A, the user has the option of editing her data cart either by using the [0111] edit button 66 d or the Save Cart feature 66 e. Each of these leads to a data cart editing screen 70, shown in FIG. 10. The data cart information form 70 a can contain any of a number of fields relating to the summary cart information. The cart may also be saved as public or private. If the former, then any other registered user may view that data cart; if private, then only the user herself may view the cart in the Archive, described below. These choices need not be dichotomous; intermediate levels of access can be specified as well, depending on the particular organization's needs. For instance, only registered users with a certain clearance level or above may have access to certain data and saved carts.
  • After variables have been added to the data cart, the user can return to browsing the data sets, as illustrated in FIG. 7E. Any other dataset can be selected, and its variables can be added to the variables currently in the data cart. In the illustrated embodiment, the system displays the choices on the blocking variables made on the previous blockset added to the cart. In the illustration, the previous countries and years are both automatically selected and highlighted, the countries tab being shown as [0112] 62 h, with highlighted previous choices as 72 i.
  • Referring to FIG. 8B, if the user adds more variables to their data cart, then the list of variables in the [0113] cart display 66 f is expanded to include the variables from all data sets added to the cart. Items can be removed from the cart either one at a time via 66 g, or in their entirety via 66 h. Summary information from the cart is available via 66 i. As above, the entire blockset can be viewed with control 66 j or “checked out”—downloaded—with control 66 k.
  • Archive [0114]
  • Referring now to FIGS. 11A and 11B, the user accesses [0115] archive section 72 through the standard toolbar 60 d. One area of archive section 72 shows saved carts 72 a, as well as carts 72 b from other users who have placed public carts on the system. The user can delete her own carts. When another user's name is selected, then their publicly saved carts appear. Clicking on any saved cart brings that cart's information into the viewing area 72 c, as illustrated in FIG 11B. This shows the variables in the saved cart 72 d, which can then be manipulated as any other blockset.
  • The [0116] archive section 72 allows for collaboration by geographically dispersed users. Datasets can be saved, edited, and then saved again online, by manipulating the metadata of the saved blocksets. This makes dataset storage inexpensive from the viewpoint of system resources, and it makes the saving and retrieval of data carts quick and efficient. As with the data cart, the archive area also allows the user to select variables asynchronously. Not only can data be added to a cart a little at a time over a single login session, but saved carts add the possibility of stretching the dataset creation process across multiple sessions without having to rebuild the dataset from scratch every time.
  • Dataset Upload [0117]
  • Referring to FIG. 12A and FIG. 12H, an [0118] administrator screen 74 is accessible to users with administrative privileges. The left-hand pane 74 b in FIG. 12A lists administrative options, including List All Users, and Add datasets 74 n. When the user selects the control 74 n to add a dataset, the system initiates a wizard interface as illustrated in the right-hand pane 74 a. In the example, the databases to be loaded into the system are first located within a single directory on the server; hence only the database name need be entered. In alternative embodiments, a fully qualified directory path or URL could be entered, requiring only that the server have access privileges to the specified location. The database name is a value to be stored in database name field 36 d of dataset table 36 (FIG. 1C).
  • Upon entering this information, the user is taken to the [0119] screen 74 illustrated in FIG. 12B, where the user selects metadata codebook 74 c and dataset 74 d. Dataset 74 d is a control that specifies a data source table 84 (FIG. 12H) of data that will be the basis for a dataset 24. The table is identified within its database by a data source table name 84 d. Codebook 74 c is a control that allows the user to specify a codebook table 82, which system 10 uses as a source of metadata to describe the dataset 24 to upload. Codebook table 82 is identified within its database by codebook table name 82 f. If the dataset uploads successfully, system 10 stores this value in the field codebook name 36 g of the corresponding row in dataset table 36 (FIG. 12H).
  • In general, a codebook table [0120] 82 stores information used to identify a dataset 24, such as the dataset name, the set of blocking variables, and the names of the non-blocking variables. Other useful metadata can be added as well, such as author, variable descriptions, and coding rules.
  • Each row of the codebook table [0121] 82 represents information applicable either to an entire dataset or to a variable within a dataset. When a row includes the keyword “dataset” in code field 82, the row represents a dataset 24. Otherwise, the row represents a variable—in particular, code field 82 gives the name of a non-blocking variable field 84 b in data source table 84. Title 82 c stores a name for the corresponding variable. Description 82 d stores phrases describing dataset 24 in detail. Author FK 82 e is a foreign key referencing author table 40, which allows a dataset or variable to be associated with a particular author.
  • Referring to FIG. 12C and FIG. 1C, after entering the dataset and codebook locators, the user is taken to screen [0122] 74, which lists the non-blocking variable names 74 e as well as the actual values of the blocking variables 74 f. Upon user confirmation, the data can be added to the system. Each non-blocking variable has a corresponding row in codebook table 82, which serves as the basis for a new corresponding row in variable table 38. In particular, title 82 c maps to title 38 c, description 82 d to description 38 e, and code 82 b to name 38 g.
  • The row in codebook table [0123] 82 with “dataset” in code field 82 b corresponds to a new row in dataset table 36. In this case, description 82 d maps to description 36 f.
  • Other administrative features can be added to the system as well, including facilities for maintaining user accounts, assigning privileges, and editing metadata. An illustration of the latter is provided in FIG. 12D, where the administrator is presented with a [0124] form 74 g through which she can modify a dataset's author and category information. The author editing screen is illustrated in FIG. 12E, which displays the current author information 74 h, the list of potential authors currently in the system 74 i, and a form 74 j for adding a new author. Similarly, FIG. 12F offers the possibility of adding or amending category information. The current set of categories is provided in 74 k, possible additional categories already in the system are provided in 74 m, while new categories can be added via 74 p.
  • Referring to FIG. 12G, an upload process [0125] 80 guides a user through a process of adding a dataset 24 to the system, verifying that the necessary metadata is in place. Upload process 80 receives a data group 34, either explicitly as a passed value or reference, or implicitly by a default value (step 80 a). The data group 34 has a set of blocking variables.
  • Upload process [0126] 80 presents a user interface that prompts a user for a database name (step 80 b). Upload process 80 then receives user input specifying database name, which upload process 80 stores (step 80 c). Upload process 80 compares the database name to its current set of databases 18 (FIG. 1A) and determines whether the database name specifies a database 18 that system 10 can connect to, for example using an ODBC connection (step 80 d). If so (result 80 e), upload process 80 connects to the database and retrieves a list of tables in the database to present to the user (step 80 g). Otherwise, if the database is not available (result 80 f), the process prompts again for database name (step 80 b).
  • The user can choose a data source table [0127] 84 from the list of tables, as well as a codebook table 82 (FIG. 12H). Upon receiving input specifying the user's choice (80 h), upload process 80 prompts the user to confirm blocking and non-blocking variables in the data source table (step 80 i). Upload process 80 tests whether the user input confirms that the blocking and non-blocking variables are identified correctly (step 80 j). With confirmation (result 80 k), upload process 80 retrieves rows from codebook table 82 that describe the dataset and variables (step 80 p). If the user does not provide confirmation (result 80 m), upload process 80 returns failure (step 80 n) and terminates without adding a dataset 24 to the system.
  • Upload process [0128] 80 verifies that the rows in codebook table 82 correspond to the dataset and variables of data source table 84 (step 80 q). If any row is missing (result 80 r), upload process 80 returns failure (step 80 n) and terminates without adding a dataset 24 to the system. Otherwise (result 80 s), upload process 80 prompts the user and receives input to confirm the metadata for the prospective dataset 24 itself, as opposed to the metadata for the variables (step 80 t). Such metadata for the dataset includes its storage name, its name in presentation to users (i.e., a caption), a textual description, and its location.
  • If the dataset metadata is confirmed (result [0129] 80 u), upload process 80 loops to confirm each of the variables (loop bounded by 80 w and 80 ab). Otherwise, if the dataset metadata is not confirmed (result 80 v), the process returns failure (step 80 n) and terminates without adding a dataset 24 to the system.
  • Upload process [0130] 80 iterates over each variable to prompt the user with the associated metadata (step 80 x), for example the storage name, the title, a textual description, associated keywords, and units. Upload process 80 tests the user's response (step 80 y). If the user rejects any variable's metadata (result 80 aa), the process returns failure (step 80 n) and terminates without adding a dataset 24 to the system. Otherwise (result 80 z), upload process 80 commits the dataset 24 to application database 20 and adds a corresponding database 18 as the provider of the underlying data.
  • Codebook Download [0131]
  • When a user asks to download the codebook information about a specified dataset or saved datacart, the system iterates through each block of the datacart. For each block, it retrieves all codebook records which reference either the dataset containing that block, or a variable within that block's definition. This codebook information is stored in memory and formatted for display in a web page, or for printing. When a secondary table row is referenced, e.g. when an author is specified, the author table is queried, the information is retrieved and then formatted. [0132]
  • Referring to FIG. 13 and FIG. 1C, a codebook process [0133] 86 receives a request for a codebook for a table derived from a blockset (step 86 a). Codebook process 86 retrieves metadata for blockset from application database 20, for example title 48 e and description 48 c from blockset table 48 (step 86 b). Codebook process 86 outputs the metadata for blockset (step 86 c), then loops over the variables of the derived table (loop bounded by 86 d and 86 g), as encoded in the rows in variable table 38. For each such variable, codebook process 86 retrieves metadata from the application database (step 86 e). Such metadata includes a variable title 38 c, a description 38 e, and the title 36 b of the associated dataset in dataset table 36.
  • Alternative Embodiments [0134]
  • A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, [0135] databases 18 can be located on other machines than server 12, and can be connected via a database server. Thus the server 12 can include multiple physical machines distributed across a network.
  • In the described embodiment, each [0136] block 26 has one blockset 28. This provides an administrative advantage, in that edits to the block of a first blockset 28 cannot affect other blocksets, since blocks are not shared. In alternative embodiments, however, blocks could be shared by blocksets.
  • In the described embodiment, codebook table [0137] 82 includes basic text information like descriptions and comments. In alternative embodiments, codebook table 82 could also reference rows in other tables of application database 20, such as citation records or categories.
  • Accordingly, other embodiments are within the scope of the following claims.[0138]

Claims (25)

What is claimed is:
1. A computer-based method of representing a subset of a dataset table of rows and columns, the method comprising:
selecting a set of blocking variables corresponding to blocking columns of the dataset table such that, for each row in the dataset table, a tuple of values for the blocking columns uniquely identifies the row within the dataset table;
selecting a set of non-blocking variables corresponding to columns of the subset, the set of non-blocking variables disjoint from the set of blocking variables; and
creating a block information structure that includes the set of non-blocking variables and that includes, for each blocking variable in the set of blocking variables, a set of values.
2. The method of claim 1, wherein for each row in the subset having a tuple of values for the blocking columns, the values of the tuple are included in the corresponding sets of values for the blocking variables.
3. The method of claim 1, wherein the subset of the dataset table includes the entire dataset table.
4. The method of claim 1, wherein a sum of the number of elements in the set of non-blocking columns and in the sets of values for the blocking columns is linearly proportionate to an upper bound on binary storage requirements of the block information structure.
5. The method of claim 1, further comprising:
storing the block information structure on a machine-readable medium.
6. A computer-based method of accessing information in heterogeneous databases, the method comprising:
presenting a graphical user interface with controls representing a data cart and a plurality of datasets;
receiving user input that selects a dataset to add to the data cart from the plurality of datasets;
generating a block information structure specifying the dataset; and
adding the block information structure to the data cart.
7. The method of claim 6, wherein specifying the dataset includes incorporating into the block information structure a set of non-blocking variables, a set of blocking variables, and for each such blocking variable, a set of values.
8. The method of claim 7, wherein the dataset includes a plurality of rows, each identified by a corresponding tuple of values from the sets of values for the blocking variables.
9. The method of claim 7, further comprising:
for a blocking variable in the set of blocking variables, presenting enumeration controls in the graphical user interface, each corresponding to an existing value in the dataset for the blocking variable.
10. The method of claim 6, further comprising:
collecting user input specifying a subset of the dataset, and representing the subset in corresponding block information structure.
11. The method of claim 6, further comprising:
saving the data cart to a persistent storage medium.
12. The method of claim 6, further comprising:
adding a second block information structure to the data cart, in response to user input.
13. The method of claim 6, wherein the controls representing a data cart include a symbol of a shopping cart.
14. A computer-based method of retrieving information represented by a blockset, the method comprising:
for each of a plurality of blocks in the blockset in which each block specifies a subset of a dataset stored in a database, connecting to the corresponding database, wherein the blockset has a plurality of blocking variables, and the blocks each include a set of non-blocking variables, and specifying includes having a set of values for each blocking variable in the plurality of blocking variables;
adding to a derived table a blocking column for each blocking variable in the set of blocking variables;
adding to the derived table a non-blocking column for each element in a union of the non-blocking variables in the plurality of blocks;
adding a row to the derived table including a cell for each column in the derived table, the row uniquely identified by a tuple of values from the sets of values for the blocking variables; and
populating a cell of a non-blocking column in the row, using a value retrieved from the database corresponding to the block, wherein the block contains the non-blocking variable corresponding to the non-blocking column.
15. The method of claim 14, further comprising:
adding a row for each tuple of values from the sets of values for the blocking variables, wherein the tuple occurs in at least one dataset corresponding to a block in the plurality of blocks.
16. The method of claim 14, wherein connecting includes using each block as a basis for a database query that specifies the corresponding subset.
17. The method of claim 16, wherein the database query uses Structured Query Language.
18. A computer-based method of representing a table derived from a blockset, the method comprising:
outputting blockset metadata describing the blockset, including fields for a blockset title and a blockset description;
for a column in the table, outputting column metadata describing a variable associated with the column, wherein the variable is associated with an underlying dataset that provides data to the table in the blockset, and the column metadata includes fields for a title of the variable and for a title of the underlying dataset.
19. A computer-based method of collecting metadata for a dataset, the method comprising:
prompting a user to provide a database name;
confirming that the database name represents a database;
displaying a list of tables in the database to the user;
receiving user input specifying a table in the list of tables;
prompting the user to confirm that a list of blocking variables and a list of non-blocking variables are correct for the database;
prompting the user to confirm metadata for the dataset and for the list of non-blocking variables;
if the user confirms the list of blocking variables, the list of non-blocking variables, and the metadata, adding a dataset corresponding to the table to a collection of datasets.
20. The method of claim 19, wherein the metadata includes a title for the dataset.
21. The method of claim 19, wherein the metadata includes a description for the dataset.
22. The method of claim 19, wherein the metadata includes a title for a non-blocking variable in the list of non-blocking variables.
23. The method of claim 19, wherein the metadata includes a description for a non-blocking variable in the list of non-blocking variables.
24. An article comprising a machine-readable storage medium that stores executable instructions to represent a subset of a dataset table of rows and columns, the instructions causing a machine to:
select a set of blocking variables corresponding to blocking columns of the dataset table such that, for each row in the dataset table, a tuple of values for the blocking columns uniquely identifies the row within the dataset table;
select a set of non-blocking variables corresponding to columns of the subset, the set of non-blocking variables disjoint from the set of blocking variables; and
create a block information structure that includes the set of non-blocking variables and that includes, for each blocking variable in the set of blocking variables, a set of values.
25. An article comprising a machine-readable storage medium that stores executable instructions to represent a subset of a dataset table of rows and columns, the instructions causing a machine to:
present a graphical user interface with controls representing a data cart and a plurality of datasets;
receive user input that selects a dataset to add to the data cart from the plurality of datasets;
generate a block information structure specifying the dataset; and
add the block information structure to the data cart.
US10/217,492 2001-08-10 2002-08-12 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects Abandoned US20030126144A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US10/217,492 US20030126144A1 (en) 2001-08-10 2002-08-12 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects
US11/056,636 US20050154727A1 (en) 2001-08-10 2005-02-11 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects
US12/169,477 US8171050B2 (en) 2001-08-10 2008-07-08 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US31149501P 2001-08-10 2001-08-10
US10/217,492 US20030126144A1 (en) 2001-08-10 2002-08-12 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US11/056,636 Division US20050154727A1 (en) 2001-08-10 2005-02-11 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects

Publications (1)

Publication Number Publication Date
US20030126144A1 true US20030126144A1 (en) 2003-07-03

Family

ID=23207146

Family Applications (3)

Application Number Title Priority Date Filing Date
US10/217,492 Abandoned US20030126144A1 (en) 2001-08-10 2002-08-12 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects
US11/056,636 Abandoned US20050154727A1 (en) 2001-08-10 2005-02-11 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects
US12/169,477 Expired - Fee Related US8171050B2 (en) 2001-08-10 2008-07-08 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects

Family Applications After (2)

Application Number Title Priority Date Filing Date
US11/056,636 Abandoned US20050154727A1 (en) 2001-08-10 2005-02-11 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects
US12/169,477 Expired - Fee Related US8171050B2 (en) 2001-08-10 2008-07-08 Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects

Country Status (3)

Country Link
US (3) US20030126144A1 (en)
EP (1) EP1423804A4 (en)
WO (1) WO2003014977A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030212667A1 (en) * 2002-05-10 2003-11-13 International Business Machines Corporation Systems, methods, and computer program products to browse database query information
US20040122646A1 (en) * 2002-12-18 2004-06-24 International Business Machines Corporation System and method for automatically building an OLAP model in a relational database
US20040139061A1 (en) * 2003-01-13 2004-07-15 International Business Machines Corporation Method, system, and program for specifying multidimensional calculations for a relational OLAP engine
US20050278290A1 (en) * 2004-06-14 2005-12-15 International Business Machines Corporation Systems, methods, and computer program products that automatically discover metadata objects and generate multidimensional models
US20050283488A1 (en) * 2004-06-22 2005-12-22 International Business Machines Corporation Model based optimization with focus regions
US20050283494A1 (en) * 2004-06-22 2005-12-22 International Business Machines Corporation Visualizing and manipulating multidimensional OLAP models graphically
US20060161545A1 (en) * 2005-01-18 2006-07-20 Agate Lane Services Inc. Method and apparatus for ordering items within datasets
US20080320058A1 (en) * 2007-06-19 2008-12-25 Linda Van Patten Benhase Apparatus, system, and method for maintaining dynamic persistent data
US20090113283A1 (en) * 2007-10-31 2009-04-30 Microsoft Corporation Method for capturing design-time and run-time formulas associated with a cell
US20090198693A1 (en) * 2005-01-18 2009-08-06 Richard Alexander Stephen Pura Method and apparatus for ordering items within datasets
US7895191B2 (en) 2003-04-09 2011-02-22 International Business Machines Corporation Improving performance of database queries
US8490058B2 (en) * 2011-12-13 2013-07-16 Microsoft Corporation Time-based navigation within resource utilization data
US20130268531A1 (en) * 2012-04-10 2013-10-10 Microsoft Corporation Finding Data in Connected Corpuses Using Examples
US20150278332A1 (en) * 2014-03-31 2015-10-01 International Business Machines Corporation Parallel bootstrap aggregating in a data warehouse appliance
US20160085807A1 (en) * 2014-09-24 2016-03-24 International Business Machines Corporation Deriving a Multi-Pass Matching Algorithm for Data De-Duplication
US11379453B2 (en) * 2017-06-02 2022-07-05 Palantir Technologies Inc. Systems and methods for retrieving and processing data
US11687514B2 (en) * 2020-07-15 2023-06-27 International Business Machines Corporation Multimodal table encoding for information retrieval systems

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10329252A1 (en) * 2003-06-25 2005-01-13 Rolls-Royce Deutschland Ltd & Co Kg Control system for an aircraft engine
US7702622B2 (en) * 2007-06-29 2010-04-20 Microsoft Corporation Advanced techniques for SQL generation of performancepoint business rules
US8620845B2 (en) * 2008-09-24 2013-12-31 Timothy John Stoakes Identifying application metadata in a backup stream
US10127299B2 (en) * 2009-09-14 2018-11-13 International Business Machines Corporation Analytics information directories within a comprehensive framework for composing and executing analytics applications in business level languages
US10296596B2 (en) 2010-05-27 2019-05-21 Varonis Systems, Inc. Data tagging
US8538934B2 (en) * 2011-10-28 2013-09-17 Microsoft Corporation Contextual gravitation of datasets and data services
US11231840B1 (en) * 2014-10-05 2022-01-25 Splunk Inc. Statistics chart row mode drill down
US10303344B2 (en) 2014-10-05 2019-05-28 Splunk Inc. Field value search drill down

Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5088052A (en) * 1988-07-15 1992-02-11 Digital Equipment Corporation System for graphically representing and manipulating data stored in databases
US5634053A (en) * 1995-08-29 1997-05-27 Hughes Aircraft Company Federated information management (FIM) system and method for providing data site filtering and translation for heterogeneous databases
US5745681A (en) * 1996-01-11 1998-04-28 Sun Microsystems, Inc. Stateless shopping cart for the web
US5897622A (en) * 1996-10-16 1999-04-27 Microsoft Corporation Electronic shopping and merchandising system
US5956709A (en) * 1997-07-28 1999-09-21 Xue; Yansheng Dynamic data assembling on internet client side
US5999993A (en) * 1996-02-28 1999-12-07 Nec Corporation Data transfer system having function of preventing error caused by suspension of data transfer to immediately service transfer interrupted request
US20020010654A1 (en) * 1996-09-30 2002-01-24 Kay Yonezawa Shopping basket presentation method for an online shopping system
US6349310B1 (en) * 1999-07-06 2002-02-19 Compaq Computer Corporation Database management system and method for accessing rows in a partitioned table
US6363387B1 (en) * 1998-10-20 2002-03-26 Sybase, Inc. Database system providing methodology for enhancing concurrency using row update bit and deferred locking
US20020069114A1 (en) * 2000-12-01 2002-06-06 Charette Phillip Carl Method and system for placing a purchase order over a communications network
US20020138370A1 (en) * 2001-03-22 2002-09-26 International Business Machines Corporation Method and apparatus for meta-shopping and dynamic setup of B2B eCommerce
US6466918B1 (en) * 1999-11-18 2002-10-15 Amazon. Com, Inc. System and method for exposing popular nodes within a browse tree
US20020156685A1 (en) * 2001-02-10 2002-10-24 International Business Machines Corporation System and method for automating electronic commerce transactions using a virtual shopping cart
US20020156688A1 (en) * 2001-02-21 2002-10-24 Michel Horn Global electronic commerce system
US20020174034A1 (en) * 2001-05-17 2002-11-21 International Business Machines Corporation Method and apparatus for a distributed web commerce system
US6523028B1 (en) * 1998-12-03 2003-02-18 Lockhead Martin Corporation Method and system for universal querying of distributed databases
US6625499B2 (en) * 2000-01-07 2003-09-23 Mitel Knowledge Corporation Tabular range editing mechanism
US6629079B1 (en) * 1998-06-25 2003-09-30 Amazon.Com, Inc. Method and system for electronic commerce using multiple roles
US20040039657A1 (en) * 2000-09-01 2004-02-26 Behrens Clifford A. Automatic recommendation of products using latent semantic indexing of content
US6704721B1 (en) * 1999-04-02 2004-03-09 International Business Machines Corporation Systems and methods for automated navigation between dynamic data with dissimilar structures
US6850917B1 (en) * 2000-10-02 2005-02-01 Oracle International Corporation Methods and systems for sharing an online shopping cart
US20060036510A1 (en) * 2003-06-02 2006-02-16 W.W. Grainger, Inc. System and method for directing a customer to additional purchasing opportunities
US20070100842A1 (en) * 2005-11-02 2007-05-03 Click Commerce, Inc. System and Method for Storing Item Attributes in an Electronic Catalog

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5359724A (en) * 1992-03-30 1994-10-25 Arbor Software Corporation Method and apparatus for storing and retrieving multi-dimensional data in computer memory
US5999933A (en) * 1995-12-14 1999-12-07 Compaq Computer Corporation Process and apparatus for collecting a data structure of a memory dump into a logical table
US6336117B1 (en) * 1999-04-30 2002-01-01 International Business Machines Corporation Content-indexing search system and method providing search results consistent with content filtering and blocking policies implemented in a blocking engine
WO2000075849A2 (en) * 1999-06-08 2000-12-14 Brio Technology, Inc. Method and apparatus for data access to heterogeneous data sources
US6750864B1 (en) * 1999-11-15 2004-06-15 Polyvista, Inc. Programs and methods for the display, analysis and manipulation of multi-dimensional data implemented on a computer
WO2001057711A1 (en) * 2000-02-02 2001-08-09 Searchlogic.Com Corporation Combinatorial query generating system and method
US20020019778A1 (en) 2000-04-04 2002-02-14 Larry Isaacson System and method for placing on-line orders
US20010054012A1 (en) * 2000-06-14 2001-12-20 Wildform, Inc. Client-based shopping cart
US6581064B1 (en) * 2000-06-21 2003-06-17 Databased Design & Decisions, Inc. Software tool for creating databases
US20030018536A1 (en) * 2001-07-20 2003-01-23 International Business Machines Corporation Reorder and default order mechanisms for a shopping cart of an e-commerce website

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5088052A (en) * 1988-07-15 1992-02-11 Digital Equipment Corporation System for graphically representing and manipulating data stored in databases
US5634053A (en) * 1995-08-29 1997-05-27 Hughes Aircraft Company Federated information management (FIM) system and method for providing data site filtering and translation for heterogeneous databases
US5745681A (en) * 1996-01-11 1998-04-28 Sun Microsystems, Inc. Stateless shopping cart for the web
US5999993A (en) * 1996-02-28 1999-12-07 Nec Corporation Data transfer system having function of preventing error caused by suspension of data transfer to immediately service transfer interrupted request
US20020010654A1 (en) * 1996-09-30 2002-01-24 Kay Yonezawa Shopping basket presentation method for an online shopping system
US5897622A (en) * 1996-10-16 1999-04-27 Microsoft Corporation Electronic shopping and merchandising system
US5956709A (en) * 1997-07-28 1999-09-21 Xue; Yansheng Dynamic data assembling on internet client side
US6629079B1 (en) * 1998-06-25 2003-09-30 Amazon.Com, Inc. Method and system for electronic commerce using multiple roles
US6363387B1 (en) * 1998-10-20 2002-03-26 Sybase, Inc. Database system providing methodology for enhancing concurrency using row update bit and deferred locking
US6523028B1 (en) * 1998-12-03 2003-02-18 Lockhead Martin Corporation Method and system for universal querying of distributed databases
US6704721B1 (en) * 1999-04-02 2004-03-09 International Business Machines Corporation Systems and methods for automated navigation between dynamic data with dissimilar structures
US6349310B1 (en) * 1999-07-06 2002-02-19 Compaq Computer Corporation Database management system and method for accessing rows in a partitioned table
US6466918B1 (en) * 1999-11-18 2002-10-15 Amazon. Com, Inc. System and method for exposing popular nodes within a browse tree
US6625499B2 (en) * 2000-01-07 2003-09-23 Mitel Knowledge Corporation Tabular range editing mechanism
US20040039657A1 (en) * 2000-09-01 2004-02-26 Behrens Clifford A. Automatic recommendation of products using latent semantic indexing of content
US6850917B1 (en) * 2000-10-02 2005-02-01 Oracle International Corporation Methods and systems for sharing an online shopping cart
US20020069114A1 (en) * 2000-12-01 2002-06-06 Charette Phillip Carl Method and system for placing a purchase order over a communications network
US20020156685A1 (en) * 2001-02-10 2002-10-24 International Business Machines Corporation System and method for automating electronic commerce transactions using a virtual shopping cart
US20020156688A1 (en) * 2001-02-21 2002-10-24 Michel Horn Global electronic commerce system
US20020138370A1 (en) * 2001-03-22 2002-09-26 International Business Machines Corporation Method and apparatus for meta-shopping and dynamic setup of B2B eCommerce
US20020174034A1 (en) * 2001-05-17 2002-11-21 International Business Machines Corporation Method and apparatus for a distributed web commerce system
US20060036510A1 (en) * 2003-06-02 2006-02-16 W.W. Grainger, Inc. System and method for directing a customer to additional purchasing opportunities
US20070100842A1 (en) * 2005-11-02 2007-05-03 Click Commerce, Inc. System and Method for Storing Item Attributes in an Electronic Catalog

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7447687B2 (en) 2002-05-10 2008-11-04 International Business Machines Corporation Methods to browse database query information
US20030212667A1 (en) * 2002-05-10 2003-11-13 International Business Machines Corporation Systems, methods, and computer program products to browse database query information
US7873664B2 (en) 2002-05-10 2011-01-18 International Business Machines Corporation Systems and computer program products to browse database query information
US20080133582A1 (en) * 2002-05-10 2008-06-05 International Business Machines Corporation Systems and computer program products to browse database query information
US20040122646A1 (en) * 2002-12-18 2004-06-24 International Business Machines Corporation System and method for automatically building an OLAP model in a relational database
US7716167B2 (en) 2002-12-18 2010-05-11 International Business Machines Corporation System and method for automatically building an OLAP model in a relational database
US20040139061A1 (en) * 2003-01-13 2004-07-15 International Business Machines Corporation Method, system, and program for specifying multidimensional calculations for a relational OLAP engine
US7953694B2 (en) 2003-01-13 2011-05-31 International Business Machines Corporation Method, system, and program for specifying multidimensional calculations for a relational OLAP engine
US7895191B2 (en) 2003-04-09 2011-02-22 International Business Machines Corporation Improving performance of database queries
US7707143B2 (en) 2004-06-14 2010-04-27 International Business Machines Corporation Systems, methods, and computer program products that automatically discover metadata objects and generate multidimensional models
US20050278290A1 (en) * 2004-06-14 2005-12-15 International Business Machines Corporation Systems, methods, and computer program products that automatically discover metadata objects and generate multidimensional models
US20050283494A1 (en) * 2004-06-22 2005-12-22 International Business Machines Corporation Visualizing and manipulating multidimensional OLAP models graphically
US20050283488A1 (en) * 2004-06-22 2005-12-22 International Business Machines Corporation Model based optimization with focus regions
US20090198693A1 (en) * 2005-01-18 2009-08-06 Richard Alexander Stephen Pura Method and apparatus for ordering items within datasets
US20060161545A1 (en) * 2005-01-18 2006-07-20 Agate Lane Services Inc. Method and apparatus for ordering items within datasets
US20080320058A1 (en) * 2007-06-19 2008-12-25 Linda Van Patten Benhase Apparatus, system, and method for maintaining dynamic persistent data
US7814071B2 (en) * 2007-06-19 2010-10-12 International Business Machines Corporation Apparatus, system, and method for maintaining dynamic persistent data
US20090113283A1 (en) * 2007-10-31 2009-04-30 Microsoft Corporation Method for capturing design-time and run-time formulas associated with a cell
US8341512B2 (en) 2007-10-31 2012-12-25 Microsoft Corporation Method for capturing design-time and run-time formulas associated with a cell
US8490058B2 (en) * 2011-12-13 2013-07-16 Microsoft Corporation Time-based navigation within resource utilization data
US8983954B2 (en) * 2012-04-10 2015-03-17 Microsoft Technology Licensing, Llc Finding data in connected corpuses using examples
US20130268531A1 (en) * 2012-04-10 2013-10-10 Microsoft Corporation Finding Data in Connected Corpuses Using Examples
US20150193533A1 (en) * 2012-04-10 2015-07-09 Microsoft Technology Licensing, Llc Finding Data in Connected Corpuses Using Examples
US10140366B2 (en) * 2012-04-10 2018-11-27 Microsoft Technology Licensing, Llc Finding data in connected corpuses using examples
US20150278332A1 (en) * 2014-03-31 2015-10-01 International Business Machines Corporation Parallel bootstrap aggregating in a data warehouse appliance
US9613113B2 (en) * 2014-03-31 2017-04-04 International Business Machines Corporation Parallel bootstrap aggregating in a data warehouse appliance
US10248710B2 (en) 2014-03-31 2019-04-02 International Business Machines Corporation Parallel bootstrap aggregating in a data warehouse appliance
US10372729B2 (en) 2014-03-31 2019-08-06 International Business Machines Corporation Parallel bootstrap aggregating in a data warehouse appliance
US11120050B2 (en) 2014-03-31 2021-09-14 International Business Machines Corporation Parallel bootstrap aggregating in a data warehouse appliance
US20160085807A1 (en) * 2014-09-24 2016-03-24 International Business Machines Corporation Deriving a Multi-Pass Matching Algorithm for Data De-Duplication
US10169418B2 (en) * 2014-09-24 2019-01-01 International Business Machines Corporation Deriving a multi-pass matching algorithm for data de-duplication
US11379453B2 (en) * 2017-06-02 2022-07-05 Palantir Technologies Inc. Systems and methods for retrieving and processing data
US11687514B2 (en) * 2020-07-15 2023-06-27 International Business Machines Corporation Multimodal table encoding for information retrieval systems

Also Published As

Publication number Publication date
US8171050B2 (en) 2012-05-01
WO2003014977A1 (en) 2003-02-20
US20080270456A1 (en) 2008-10-30
EP1423804A1 (en) 2004-06-02
EP1423804A4 (en) 2006-11-08
US20050154727A1 (en) 2005-07-14

Similar Documents

Publication Publication Date Title
US8171050B2 (en) Method and apparatus for access, integration, and analysis of heterogeneous data sources via the manipulation of metadata objects
US8914414B2 (en) Integrated repository of structured and unstructured data
US7043472B2 (en) File system with access and retrieval of XML documents
US9009201B2 (en) Extended database search
CA2535093C (en) Generating end-user presentations from structured data
US8589777B1 (en) Method and system for calculating cost of a compilation of content
US7895243B1 (en) Method and system for moving content in a content object stored in a data repository
US7340481B1 (en) Method and system for adding user-provided content to a content object stored in a data repository
US10235429B2 (en) System and method for organizing data in a dynamic user-customizable interface for search and display
US8073840B2 (en) Querying joined data within a search engine index
US7089239B1 (en) Method and system for preventing mutually exclusive content entities stored in a data repository to be included in the same compilation of content
US6986102B1 (en) Method and configurable model for storing hierarchical data in a non-hierarchical data repository
CN102027471B (en) Improved search engine
US7401097B1 (en) System and method for creating compilations of content
US20060265377A1 (en) Personalizable information networks
US20060265396A1 (en) Personalizable information networks
US20020038308A1 (en) System and method for creating a virtual data warehouse
US20060265394A1 (en) Personalizable information networks
US20090216724A1 (en) Network resources structuring system
US20080195495A1 (en) Notebook system
US20060265395A1 (en) Personalizable information networks
Abramowicz et al. Filtering the Web to feed data warehouses
US7356766B1 (en) Method and system for adding content to a content object stored in a data repository
US8346739B1 (en) Segmenting documents among multiple data repositories
US20050125387A1 (en) Method of joining data and its metadata using dynamic metadata in relational database

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION