WO2001042986A1 - Method and system for indexing documents using connectivity and selective content analysis - Google Patents
Method and system for indexing documents using connectivity and selective content analysis
- Publication number
- WO2001042986A1 (application PCT/US2000/033340)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- references
- words
- capturing
- parsing
- pages
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- This invention relates generally to computerized information
- World Wide Web
- locate Web pages having specific content of interest to end-users.
- Each Web site may contain several Web pages or
- a search engine indexes hundreds of millions of Web pages
- queries usually comprised of one or more key terms ("key words") and the
- search engine identifies pages that match the key words, e.g., pages that
- the "vicinity" of a Web page is defined by the hyperlinks that
- a Web page can point to other Web pages, and the retrieved Web
- This connectivity can be expressed as a graph where nodes
- the page receives a large y value and is designated as a good hub.
- the single site has undue
- Hypernews system which turns Usenet News articles into
- the present invention provides a method and system for
- references containing links to the document to be indexed are captured.
- captured references are further parsed to collect their content.
- a system wherein a processor is configured to
- the processor is further configured to parse
- Figure 1 is a block diagram illustrating the hyperlinked
- Figure 2 is a flow diagram illustrating a process for capturing
- Figure 3 is a flow diagram illustrating a process for selective
- Figure 4 is a diagram illustrating a tokenizing process of Figure 3
- Figure 5 is a diagram illustrating a system according to another
- Figure 1 shows a distributed network of computers 100 of the
- Client computers 110 and server computers 120 (hosts)
- a network 130, for example, the WorldNet
- the network 130 includes an application level interface called
- the Web allows the client computers 110 to access documents
- multi-media Web pages 121 maintained by the server 120.
- The location or address of each page 121 is indicated by an associated
- hyperlinks 123 to other pages.
- the hyperlinks are also in the form of
- search engine 140 returns a result set 112 which satisfies the terms (key
- the order in which the result set 112 is presented to the client 110 is indicative of the
- the method 200 can be implemented by one of the
- the method of the present invention uses
- a user inputs a query at 202 and a full URI is sent to a web
- server 120 and the received data stream is transferred to a parser 201 for
- the parser 201 can be
- Parser 201 reads
- the parser 201 looks for the
- parser checks if the hyperlink is a full URI 212. If not, the full URI has to
- Parser 201 makes every relative URI become a full
- the protocol of the hyperlink is checked 216. If it is other than an HTTP
- the link is dumped 218. Also, the hyperlink is checked to see if it
- web browsers can parse HTML syntax errors to a certain extent, this makes
- any white spaces in the value are
- protocol name begins with a new line without leading white spaces
- the present method calculates the
- the base URI is set by the BASE element, then the base URI is given
- the base URI is that of the current document.
- fragment identifiers have a similar function as a relative
- identifier is an anchor within a page and can be referred or linked in the
- An anchor is any object which is fixed so that its position
- the present method strips the anchor and gets the links 224 by
- characters, except space, may be encoded as a percent sign and two digital
- the method detects any networking and log problems at
- the present method parses this error by reading the HTML title to see
- the method first checks the trailing part of a
- dumped 236. If the resource is readable, the method starts to obtain it.
- indexed page as described above are stored in a database 246.
- Hypersonic-SQL database which stores all the words.
- the in-memory
- Java containers such as HashMap and
- HashMap can be used for maintaining the linking structure
- HashSet can be used for storing the hyperlinks. Every node in the linking
- Hypersonic-SQL is a small, high-performance database
- Hypersonic-SQL has JDBC
- Threshold value should be a function of the total number of words in the
- sets may involve conjunction of the sets or use frequency of appearance of a
- the processor system such as a
- CPU (central processing unit)
- I/O devices 308, 310 over a bus 316.
- the computer
- RAM (random access memory)
- ROM (read only memory)
- peripheral devices such as a floppy disk drive
- Memory 312 can be configured to store the
- parsed information as described above in Figures 1-4. It may also be
- it is desirable to integrate the processor 302 and memory 312 on a single
- the present invention provides a method and system for
- the method uses a structure of words used by the linked pages, whereas the prior art methods use only the content of the indexed page and/or the
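
The link-capture steps listed above (resolving a relative URI against the base URI, dropping links whose protocol is not HTTP, stripping the anchor, and keeping the linking structure in a HashMap of HashSets) might be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the application: the class `LinkCapture`, its method names, and the example.com addresses are hypothetical, and keeping only `http` links and trimming surrounding white space are assumptions drawn from the truncated text.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper; names are illustrative and do not come from the application. */
class LinkCapture {
    // Linking structure: page URI -> set of full URIs that the page links to.
    private final Map<String, Set<String>> outLinks = new HashMap<>();

    /**
     * Turns a hyperlink into a full URI relative to the base URI, keeps it only
     * if its protocol is HTTP (assumption), and strips any fragment identifier.
     */
    static Optional<String> normalize(String base, String href) {
        try {
            // Assumption: leading and trailing white space is simply trimmed.
            URI resolved = new URI(base).resolve(new URI(href.trim()));
            if (!"http".equalsIgnoreCase(resolved.getScheme())) {
                return Optional.empty(); // non-HTTP link is dropped
            }
            // In a valid URI, '#' appears only as the fragment delimiter.
            String full = resolved.toString();
            int hash = full.indexOf('#');
            return Optional.of(hash >= 0 ? full.substring(0, hash) : full);
        } catch (URISyntaxException e) {
            return Optional.empty(); // malformed link is dropped
        }
    }

    /** Records that page 'from' contains the hyperlink 'href', if the link survives normalization. */
    void addLink(String from, String href) {
        normalize(from, href).ifPresent(to ->
                outLinks.computeIfAbsent(from, k -> new HashSet<>()).add(to));
    }

    /** Returns the pages that contain a link to 'target', i.e. the references to capture. */
    Set<String> referencesTo(String target) {
        Set<String> refs = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : outLinks.entrySet()) {
            if (e.getValue().contains(target)) {
                refs.add(e.getKey());
            }
        }
        return refs;
    }
}
```

For example, `LinkCapture.normalize("http://example.com/a/b.html", "../c.html#top")` yields `http://example.com/c.html`, while a `mailto:` link yields an empty result. Scanning every entry in `referencesTo` keeps the sketch short; a reverse map from target to referrers would avoid the scan on a large crawl.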
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU29070/01A AU2907001A (en) | 1999-12-09 | 2000-12-08 | Method and system for indexing documents using connectivity and selective content analysis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16988999P | 1999-12-09 | 1999-12-09 | |
US60/169,889 | 1999-12-09 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001042986A1 (en) | 2001-06-14 |
WO2001042986A9 (en) | 2002-07-25 |
Family
ID=22617633
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2000/033340 WO2001042986A1 (en) | 1999-12-09 | 2000-12-08 | Method and system for indexing documents using connectivity and selective content analysis |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2907001A (en) |
WO (1) | WO2001042986A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5835905A (en) * | 1997-04-09 | 1998-11-10 | Xerox Corporation | System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents |
US5838906A (en) * | 1994-10-17 | 1998-11-17 | The Regents Of The University Of California | Distributed hypermedia method for automatically invoking external application providing interaction and display of embedded objects within a hypermedia document |
US5940624A (en) * | 1991-02-01 | 1999-08-17 | Wang Laboratories, Inc. | Text management system |
2000
- 2000-12-08 WO PCT/US2000/033340 patent/WO2001042986A1/en active Application Filing
- 2000-12-08 AU AU29070/01A patent/AU2907001A/en not active Abandoned
Also Published As
Publication number | Publication date |
---|---|
WO2001042986A9 (en) | 2002-07-25 |
AU2907001A (en) | 2001-06-18 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
JP4857075B2 (en) | Method and computer program for efficiently retrieving dates in a collection of web documents | |
KR100953238B1 (en) | Content information analysis method, system and recording medium | |
US6321220B1 (en) | Method and apparatus for preventing topic drift in queries in hyperlinked environments | |
US6691105B1 (en) | System and method for geographically organizing and classifying businesses on the world-wide web | |
US8332422B2 (en) | Using text search engine for parametric search | |
US7065523B2 (en) | Scoping queries in a search engine | |
US7383299B1 (en) | System and method for providing service for searching web site addresses | |
US7860853B2 (en) | Document matching engine using asymmetric signature generation | |
US8266134B1 (en) | Distributed crawling of hyperlinked documents | |
US8949256B2 (en) | System and method for identifying an owner of a web page on the World-Wide Web | |
US7464076B2 (en) | System and method and computer program product for ranking logical directories | |
US9031942B2 (en) | Method and system for indexing information and providing results for a search including objects having predetermined attributes | |
US20030046276A1 (en) | System and method for modular data search with database text extenders | |
WO2001042986A1 (en) | Method and system for indexing documents using connectivity and selective content analysis | |
Svátek et al. | Rainbow-multiway semantic analysis of Web sites | |
Szymanski et al. | A method for indexing Web pages using Web bots | |
KR100503950B1 (en) | System of making customizing classification dictionary using internet search engine and method thereof | |
Chien | Keynote 2 NoC's at the center of chip architecture: Urgent needs (today) and what they must become (future) | |
Chang et al. | Semi-Structured Information Extraction Applying Automatic Pattern Discovery | |
Eskicioğlu | A Search Engine for Turkish with Stemming |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
AK | Designated states | Kind code of ref document: C2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: C2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
COP | Corrected version of pamphlet | Free format text: PAGES 1/5-5/5, DRAWINGS, REPLACED BY NEW PAGES 1/5-5/5; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
REG | Reference to national code | Ref country code: DE Ref legal event code: 8642 |
122 | Ep: pct application non-entry in european phase | |
NENP | Non-entry into the national phase | Ref country code: JP |