CN1536483A - Method for extracting and processing network information and its system - Google Patents

Method for extracting and processing network information and its system

Publication number
CN1536483A
Authority
CN
China
Prior art keywords
classification
feature
speech
news
file
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA031093388A
Other languages
Chinese (zh)
Inventor
陈文中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual

Abstract

The invention relates to a method for extracting and processing network information. Using artificial intelligence and natural language processing techniques, it automatically downloads the latest news and information each day from designated websites, performs content extraction, classification, and automatic summarization to condense the full text, stores the full text, and then indexes it so that efficient full-text retrieval can be performed later.

Description

Method and system for network information extraction and processing
Technical field
The present invention relates to a data processing method and system, and more particularly to a method and system for extracting and processing information on a computer network, especially online news.
Background art
This is an era of information explosion. With the rapid development of the Internet, more and more people obtain the latest news and information over the network.
Almost everyone is in the habit of reading newspapers, and individuals and enterprises with a more urgent need for information must gather what they need from many newspapers. Almost all news can now be found on the network, and many people get the latest news online. Reading news online does not by itself reduce the time required, however: we still have to read a long article through to learn what it says, or browse many web pages before finding the information we need. Moreover, online news is fleeting. Many people need to look up news from many days, several months, or even a year ago, and in that case the network alone cannot meet the need.
Traditional statistics-based automatic summarization generally uses mathematical statistics to assign a weight to every word in a document, usually computed from the frequency with which the word occurs in the article. A word that occurs more often receives a higher weight, and a word with a high weight is taken to be central to the article.
Sentences are weighted according to the weights of their words: once the words have been weighted, the weight of each sentence can be computed, and the higher a sentence's weight, the better it represents the central idea of the article. A summary can be produced directly from the highest-weighted sentences.
This method generates a summary very quickly. However, frequently occurring words are not necessarily the central idea of the article, and no grammatical analysis is performed, so a summary pieced together from high-weight sentences reads poorly.
Acceptable results can nevertheless be reached by improving how weights are assigned and how central sentences are selected.
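The statistics-based approach described above can be sketched in a few lines of Java. This is a minimal illustration, not the patent's implementation: the class name FrequencySummarizer, the method summarize, and the assumption that the input is already split into sentences and words are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of frequency-weighted extractive summarization (all names are illustrative). */
final class FrequencySummarizer {

    /** Returns the k highest-weighted sentences, kept in their original order. */
    static List<String> summarize(List<List<String>> sentences, int k) {
        // 1. Weight of a word = its frequency in the whole document.
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> s : sentences)
            for (String w : s) freq.merge(w, 1, Integer::sum);

        // 2. Weight of a sentence = sum of the weights of its words.
        final double[] weight = new double[sentences.size()];
        Integer[] order = new Integer[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            order[i] = i;
            for (String w : sentences.get(i)) weight[i] += freq.get(w);
        }

        // 3. Pick the k heaviest sentences, then restore document order.
        Arrays.sort(order, (a, b) -> Double.compare(weight[b], weight[a]));
        List<Integer> chosen =
            new ArrayList<>(Arrays.asList(order).subList(0, Math.min(k, order.length)));
        Collections.sort(chosen);
        List<String> out = new ArrayList<>();
        for (int i : chosen) out.add(String.join(" ", sentences.get(i)));
        return out;
    }
}
```

Because the score is a plain sum, long sentences and common words dominate, which is the weakness the text points out.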
Automatic Chinese word segmentation is a necessary step in building a full-text index. Segmentation means dividing a sentence or an article into its individual words. Unlike English, Chinese has no explicit word delimiters. Word lengths vary, definitions of a word differ, and there are polysemy, synonymy and similar phenomena, so automatic Chinese word segmentation is very difficult.
The segmentation methods in common use today are mainly the following:
Forward maximum matching: the earliest segmentation method proposed. At each step the longest candidate string (for example, 6 characters) is cut from the current position and matched against the dictionary. If the match succeeds, segmentation continues from the following character; otherwise the last character is dropped and matching is retried.
High-frequency-first method: this method was proposed on the basis of word-frequency statistics, the regularities with which characters combine, and phenomena such as ambiguous segmentation. It improves segmentation efficiency but cannot resolve ambiguity, and the error rate is not reduced.
Neural network segmentation: works by building a numerical model that simulates the parallel, distributed processing of the human brain. Segmentation knowledge is stored implicitly and in distributed form inside the network, and the internal weights are adjusted through self-learning and training in order to reach better segmentation results.
Expert system segmentation: from the viewpoint of expert systems, this method separates the segmentation knowledge (common-sense segmentation knowledge plus heuristic disambiguation knowledge, that is, ambiguity segmentation rules) from the inference engine that carries out the segmentation, so that the knowledge base can be maintained independently of the inference engine. It can also detect overlapping ambiguities and combination ambiguities, and has some self-learning capability.
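Of the methods listed above, forward maximum matching is the simplest to show in code. The following Java sketch assumes an in-memory dictionary and a fixed maximum word length. The class name ForwardMaxMatch and the decision to emit a single character when nothing matches are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch of forward maximum matching segmentation (names are illustrative). */
final class ForwardMaxMatch {

    /** Segments text greedily, always trying the longest dictionary word first. */
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.min(maxLen, text.length() - pos);
            // Drop the last character until the candidate is in the dictionary
            // or only a single character is left.
            while (len > 1 && !dict.contains(text.substring(pos, pos + len))) len--;
            words.add(text.substring(pos, pos + len));
            pos += len;
        }
        return words;
    }
}
```

With a dictionary containing "ab", "abcd" and "ef", the input "abcdefg" is cut into "abcd", "ef", "g": the longer entry "abcd" wins over "ab", and the unmatched "g" is emitted on its own.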
Current full-text indexes generally use an inverted file as the index mechanism; the inverted file stores, for each term, the list of identifiers of the documents that contain it.
For text retrieval the most effective index structure is the inverted file: it is a collection of lists in which the record for each term t lists the identifiers of all documents d that contain that term.
An inverted file can be regarded as the transpose of the document-term frequency matrix, converting (d, t) into (t, d), because row-major access is more efficient than column-major access.
The index consists of three parts: a dictionary (invf.dict), an inverted file (invf), and a mapping file between the two (invf.idx). The index file structure is shown in Fig. 2.
Dictionary (invf.dict): for each distinct term t, stores the term string t, the number of documents f_t that contain t, and the total number of occurrences F_t of t in the whole document collection.
Mapping file (invf.idx): for each distinct term t, stores a pointer to the start address of the corresponding inverted list.
Inverted file (invf): for each distinct term t, stores the identifier d (an ordinal number) of every document that contains t and the frequency f_{d,t} of t in each such document d, as a list of <d, f_{d,t}> pairs.
Together with the weight array W_d, this is sufficient to support Boolean queries and ranked queries.
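The inverted-file structure described above can be illustrated with a small in-memory Java sketch. It keeps, for each term t, a sorted map from document identifier d to f_{d,t}, and derives f_t and F_t from it. The three on-disk files (invf.dict, invf, invf.idx) and the weight array W_d are not modelled, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory sketch of an inverted file (names are illustrative). */
final class InvertedIndex {
    /** term t -> (document id d -> f_{d,t}); TreeMap keeps ids in ascending order. */
    private final Map<String, TreeMap<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, List<String> terms) {
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new TreeMap<>()).merge(docId, 1, Integer::sum);
    }

    /** f_t: number of documents that contain t. */
    int documentFrequency(String t) {
        return postings.containsKey(t) ? postings.get(t).size() : 0;
    }

    /** F_t: total number of occurrences of t in the whole collection. */
    int collectionFrequency(String t) {
        int total = 0;
        if (postings.containsKey(t))
            for (int f : postings.get(t).values()) total += f;
        return total;
    }

    /** Boolean AND query: ids of the documents that contain every term. */
    SortedSet<Integer> and(String... terms) {
        SortedSet<Integer> result = null;
        for (String t : terms) {
            Set<Integer> ids = postings.containsKey(t)
                ? postings.get(t).keySet() : Collections.<Integer>emptySet();
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}
```

A Boolean query only needs the document identifiers in each list; a ranked query would additionally use the stored frequencies f_{d,t} together with per-document weights.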
Summary of the invention
The object of the invention is to provide a method and system for network information extraction and processing which, using computer technology and natural language processing, automatically downloads the latest news each day from each designated website, performs content extraction, classification, and automatic summarization to condense the full text, stores the full text in the system, and indexes the text so that efficient full-text retrieval can be performed later.
To achieve this object, the technical solution of the invention is as follows:
A method for network information extraction and processing comprises the following steps:
One. News download step, comprising the following steps:
URL analysis step: the system is given certain URLs, and the program automatically derives the final content URLs of the news from them, with no need to build a specific URL module for each news website. Using URL statistics and URL correlation analysis, a web page that contains links to final news content is statistically analyzed to find the useful final URLs;
Automatic news page fetching step: all pages linked from the target address that match the URL format are downloaded;
Junk filtering step: the fetched news content pages are filtered to remove HTML tags and useless Chinese text, finally yielding Chinese vector information;
Information extraction step: information is extracted from the Chinese vectors obtained above. In the early stage, the title and content are extracted; in the later stage, feature extraction, correlation analysis, document classification, duplicate removal and so on are performed on the web news content;
Two. Automatic summary generation step: word segmentation, feature word analysis and sentence importance analysis are performed, and a summary is generated and output;
Three. Full-text index generation step: all news content files that have been downloaded and had their content extracted are full-text indexed, comprising the following steps:
Input step: the next file name is passed in;
Index decision step: it is determined whether the file has already been indexed; if so, return to the input step, otherwise proceed to the next step;
Filtering step: all junk and meaningless words are filtered out;
Matching segmentation step: dictionary-matching word segmentation is performed;
N-gram segmentation step: n-gram segmentation is performed, so that words the dictionary segmentation fails to separate are not missed;
Update step: the relevant index files are updated for each word, including the keyword, date and category indexes;
Four. Hierarchical text classification step: a new document is assigned to one class in a given hierarchical taxonomy. Each document can belong to only one class. Each class in the taxonomy is associated with many words and terms; a given term has a large weight at one level of the hierarchy and acts as a stopword at another level. In this system the feature words of the extracted documents (financial news) are used as the terms and vocabulary. The step comprises a hierarchy training step and a document classification step;
Hierarchy training is the preprocessing for document classification: before classification, the category hierarchy is trained. Training collects a set of features (feature words) from the training documents and then assigns feature weights to each node (category) in the hierarchy. In the document classification algorithm, the feature weights are used to compute the category rank of a new document;
Document classification step: once the hierarchy has been trained, a document can be classified into a category. Classification starts from the root category; every subcategory of the root is assigned a rank, computed by the equation:
$$R_{cd} = \sum_{f} N_{fd} W_{fc}$$
where c is a category, d is a document, f is a feature in d, R_cd is the rank of c for d, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c;
If the ranks of all subcategories are zero or negative, d is left at the root category. If one subcategory has a unique positive maximum rank, that category is selected. If the selected category is a leaf category, document d is assigned to it; if it is not a leaf, the calculation continues among its subcategories. A document d can therefore be assigned either to a leaf category or to an internal category.
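The rank computation and the top-down descent described above can be sketched in Java as follows. The class name CategoryNode, its fields, and the tie-breaking rule (the first child with the highest rank wins) are assumptions made for the illustration; the text only specifies the case of a unique positive maximum, and training of the weights W_fc is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of rank-based hierarchical classification (names are illustrative). */
final class CategoryNode {
    final String name;
    final Map<String, Double> weights = new HashMap<>();    // W_fc for this category c
    final List<CategoryNode> children = new ArrayList<>();

    CategoryNode(String name) { this.name = name; }

    /** R_cd = sum over features f of N_fd * W_fc; counts maps f to N_fd. */
    double rank(Map<String, Integer> counts) {
        double r = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            r += e.getValue() * weights.getOrDefault(e.getKey(), 0.0);
        return r;
    }

    /** Descends from root while some child has a positive rank. */
    static CategoryNode classify(CategoryNode root, Map<String, Integer> counts) {
        CategoryNode current = root;
        while (!current.children.isEmpty()) {
            CategoryNode best = null;
            double bestRank = 0;
            for (CategoryNode c : current.children) {
                double r = c.rank(counts);
                if (r > bestRank) { bestRank = r; best = c; }
            }
            if (best == null) break;   // all ranks zero or negative: stay at this node
            current = best;
        }
        return current;
    }
}
```

A document whose features carry no positive weight in any child stays at the node it has reached, which is how it can end up in an internal category rather than a leaf.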
A system for network information extraction and processing comprises the following devices:
One. News download device, comprising the following devices:
URL analysis device: the system is given certain URLs, and the program automatically derives the final content URLs of the news from them, with no need to build a specific URL module for each news website. Using URL statistics and URL correlation analysis, a web page that contains links to final news content is statistically analyzed to find the useful final URLs;
Automatic news page fetching device: all pages linked from the target address that match the URL format are downloaded;
Junk filtering device: the fetched news content pages are filtered to remove HTML tags and useless Chinese text, finally yielding Chinese vector information;
Information extraction device: information is extracted from the Chinese vectors obtained above. In the early stage, the title and content are extracted; in the later stage, feature extraction, correlation analysis, document classification, duplicate removal and so on are performed on the web news content;
Two. Automatic summary generation device: word segmentation, feature word analysis and sentence importance analysis are performed, and a summary is generated and output;
Three. Full-text index generation device: all news content files that have been downloaded and had their content extracted are full-text indexed, comprising the following devices:
Input device: the next file name is passed in;
Index decision device: it is determined whether the file has already been indexed; if so, control returns to the input device, otherwise processing proceeds to the next step;
Filtering device: all junk and meaningless words are filtered out;
Matching segmentation device: dictionary-matching word segmentation is performed;
N-gram segmentation device: n-gram segmentation is performed, so that words the dictionary segmentation fails to separate are not missed;
Update device: the relevant index files are updated for each word, including the keyword, date and category indexes;
Four. Hierarchical document classification device: a new document is assigned to one class in a given hierarchical taxonomy. Each document can belong to only one class. Each class in the taxonomy is associated with many words and terms; a given term has a large weight at one level of the hierarchy and acts as a stopword at another level. In this system the feature words of the extracted documents (financial news) are used as the terms and vocabulary. The device comprises a hierarchy training device and a document classification device;
The hierarchy training device performs the preprocessing for document classification: before classification, the category hierarchy is trained. Training collects a set of features (feature words) from the training documents and then assigns feature weights to each node (category) in the hierarchy. In the document classification algorithm, the feature weights are used to compute the category rank of a new document;
Document classification device: once the hierarchy has been trained, a document can be classified into a category. Classification starts from the root category; every subcategory of the root is assigned a rank, computed by the equation:
$$R_{cd} = \sum_{f} N_{fd} W_{fc}$$
where c is a category, d is a document, f is a feature in d, R_cd is the rank of c for d, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c;
If the ranks of all subcategories are zero or negative, d is left at the root category. If one subcategory has a unique positive maximum rank, that category is selected. If the selected category is a leaf category, document d is assigned to it; if it is not a leaf, the calculation continues among its subcategories. A document d can therefore be assigned either to a leaf category or to an internal category.
With the above method and system, the latest news page source is downloaded automatically every day from the designated sections of the designated websites; the downloaded HTML code is analyzed to obtain the valuable news content; the extracted content is automatically summarized and condensed; the extracted content is segmented and indexed for retrieval; and the extracted content is classified automatically.
Brief description of the drawings
Fig. 1 is a system structure diagram of an existing method and program for automatically downloading network information;
Fig. 2 is the index file structure diagram of an existing network information processing method;
Fig. 3 is a flowchart of the news download step in the network information extraction and processing method of the invention;
Fig. 4 shows the news list page of the People's Net news center;
Fig. 5 is a flowchart of the method for obtaining the token stream by analysis;
Fig. 6 shows the China.com finance channel page;
Fig. 7 shows the page http://www.chinahd.com/news/stock/2002-3/161628.htm;
Fig. 8 shows the source code of the page in Fig. 7;
Fig. 9 shows a news page from the china.com finance channel;
Fig. 10 shows the content information obtained from the news page of Fig. 9 by content analysis;
Fig. 11 is a flowchart of the automatic summary generation method;
Fig. 12 is an analysis diagram of the automatic summary generation method;
Fig. 13 shows the stored original text of an example article;
Fig. 14 shows the summary generated automatically according to the invention;
Fig. 15 is a flowchart of the full-text index generation step of the invention;
Fig. 16 is a flowchart of the news query step of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and embodiments:
We consider only the process of automatic download and content analysis and do not build a matching model for each website. For news websites of this kind we have implemented a general algorithm that determines exactly which part of a page is the news content from the frequency with which Chinese text occurs and the frequency and position of the HTML tags that are closely associated with content. This is described in detail in the implementation below.
Because we need content of high accuracy, which is then extracted and passed on to the end user, the robot does not need to perform deep recursive crawling. The specific method of automatic download is described later.
For the sake of generality, the web-page features of the text are not considered; the automatic summarization considered here operates purely on content and is based on a background information base.
A method for network information extraction and processing comprises the following steps:
One. News download step: as shown in Fig. 3, automatic news download is divided into two parts, URL analysis and source code fetching. Thanks to Java's network programming facilities, we can connect to any resource on the network and open it as a stream, and then operate on the network resource as if it were a local file.
1. URL analysis step:
The system is given certain URLs, and the program automatically derives the final content URLs of the news from them, with no need to build a specific URL module for each news website.
Using URL statistics and URL correlation analysis, a web page that contains links to final news content is statistically analyzed to find the final URLs that are useful to us. For example, the program is given a number of URLs that have already been classified. Each such URL should be a news list page, that is, a page on which clicking a news link opens the news content page.
Taking People's Net as an example: the page shown in Fig. 4 is the news list page of the People's Net news center.
By analyzing this page we find that the URL format of the final pages is http://www.people.com.cn/GB/guoii/25/96/20020312/*.html, which is stored in the corresponding final-URL format file.
Token analysis of HTML:
Making full use of Java's object-oriented design, we treat each HTML source file as an object and define a class named token, which describes a meaningful string in the HTML. A urltoken class is derived from token and describes tokens whose features match the URL format.
When analyzing HTML source, we thus treat each file as an object, and we treat each HTML tag in the file, and each string between two HTML tags, as a token.
Attributes of each token:
String tokenstr = null;      // the string value of this token
int tokenloc = 0;            // the position of this token in the original text
int gbnum = 0;               // the number of Chinese characters in this token
boolean iskeentag = false;   // whether this token is closely associated with content
float keenvalue = 0;         // the degree of closeness to content
A more specialized method of token:
public boolean ishref() {
    String flag1 = "href=";
    int flag2 = -1;
    // indexOf returns -1 when tokenstr does not contain "href="
    if (tokenstr.indexOf(flag1) == flag2)
        return false;
    else
        return true;
}
This method determines whether the token is an HTML tag that carries a URL.
In practice, applying object-oriented design to HTML source analysis and borrowing Java's notion of streams, we built a token stream. The results show that this works well:
1. The program structure is very clear, and the object-oriented design is clearly reflected.
2. The analysis works well and reaches a high accuracy.
3. No special analysis stop markers or the like need to be defined for each website.
4. Any standard HTML code can be processed normally.
The method for obtaining the token stream is shown in Fig. 5.
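One way to build such a token stream is sketched below in Java. The class name HtmlToken, the tokenize method, and the use of the CJK Unified Ideographs range (U+4E00 to U+9FFF) to count Chinese characters are assumptions for the illustration; the flow actually used is the one in Fig. 5, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of an HTML token and a simple tokenizer (names are illustrative). */
final class HtmlToken {
    final String text;   // the tag, or the text between two tags
    final int offset;    // position of the token in the original source
    final int gbnum;     // number of Chinese characters in the token

    HtmlToken(String text, int offset) {
        this.text = text;
        this.offset = offset;
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x4E00 && c <= 0x9FFF) n++;   // CJK Unified Ideographs
        }
        this.gbnum = n;
    }

    boolean isTag()  { return text.startsWith("<"); }
    boolean isHref() { return isTag() && text.contains("href="); }

    /** Splits HTML into alternating tag tokens and text tokens, skipping blank text. */
    static List<HtmlToken> tokenize(String html) {
        List<HtmlToken> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int end;
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                end = close < 0 ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', pos);
                end = open < 0 ? html.length() : open;
            }
            String piece = html.substring(pos, end);
            if (!piece.trim().isEmpty()) tokens.add(new HtmlToken(piece, pos));
            pos = end;
        }
        return tokens;
    }
}
```

Comments, scripts and malformed markup are not handled; the sketch only shows how tags and the text between them become a flat sequence of tokens carrying the position and gbnum attributes.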
For each news section of each website we define the following feature items:
The category of the section, such as politics, industry or sports. These categories are also defined by the administration module;
The server address of the section, for example news.sina.com.cn;
The current directory of the section (on a typical well-organized website, all the news of one section sits under one directory);
The path attribute of the section's list page, that is, absolute or relative path.
URL analysis is implemented mainly by two classes, urlanalyse.class and contentanalyse.class, which mainly perform the analysis of the token stream.
Main analysis method: urlanalyse.class has a method geturl(String filename), which first converts the source code into a token stream and reads it in, then adds to a cached hashmap every token that matches the URL format together with the token following it whose gbnum is not 0. In general, the token with nonzero gbnum that follows a URL is the news title.
For example, consider the China.com finance channel page shown in Fig. 6.
After URL analysis we obtain the following hashmap:
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255883.html
Chongqing to invest 600 billion over 10 years to build an international metropolis
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255882.html
1 ton of oil is paid for four or five hundred yuan of tax control machines and is rejected cooking oil coupon receipts cash
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255881.html
One thread spring breeze of Hong Kong tourist industry---economic recovery
Once these are obtained, automatic fetching begins: the source code of every web page whose URL was found by the analysis is downloaded.
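The URL-to-title map shown above can be approximated with the following Java sketch. Note the swapped-in technique: it uses a regular expression over anchor tags instead of the token stream that geturl works on. The class name UrlAnalyser, the LINK pattern and the urlFormat parameter are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: collect (final URL -> title) pairs from a news list page. */
final class UrlAnalyser {
    // Matches <a ... href="URL" ...>TEXT< ; a simplification of real HTML parsing.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>([^<]*)<", Pattern.CASE_INSENSITIVE);

    /** Keeps links whose URL matches urlFormat and whose text contains Chinese. */
    static Map<String, String> extract(String html, Pattern urlFormat) {
        Map<String, String> found = new LinkedHashMap<>();   // preserves page order
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            String title = m.group(2).trim();
            if (urlFormat.matcher(url).matches() && containsChinese(title))
                found.put(url, title);
        }
        return found;
    }

    static boolean containsChinese(String s) {
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) >= 0x4E00 && s.charAt(i) <= 0x9FFF) return true;
        return false;
    }
}
```

The Chinese-text check plays the role of the "gbnum is not 0" condition: navigation links and image links usually have no Chinese text directly after the URL and are dropped.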
2. Automatic news page fetching step:
Each time the program starts, all pages linked from the target address that match the URL format are downloaded. No information extraction or other correlation analysis is performed during download, so as not to increase the load and slow the download. Pages that have already been downloaded are not downloaded again. The download must take account of different encodings such as GB and Big5.
3. Junk filtering module:
This step filters junk from the fetched news content pages, removing HTML tags and useless Chinese text to finally obtain Chinese vector information. It must run in a background thread while downloading is in progress. At a later stage, weights and other related information may be added to the Chinese vectors. (Weights are determined by the position of the text, the surrounding HTML tags and so on, and require a certain number of documents for learning and training.)
4. Information extraction module:
Information is extracted from the Chinese vectors obtained above. In the early stage, the title and content are extracted. In the later stage, feature extraction, correlation analysis, document classification, duplicate removal and so on are performed on the web news content. Generality and a high accuracy must both be ensured. The early-stage functions can be implemented by simple methods (for example, the number of times the words of a*** occur in the content b**c*d*). Which block is the content can be judged from the distance between sentences and the surrounding HTML tags (each tag carries a certain weight).
Consider the page shown in Fig. 7 (source: http://www.chinahd.com/news/stock/2002-3/161628.htm), whose source code is shown in Fig. 8. The pieces of content lie very close to one another, and the HTML tags between them are generally <p>, &nbsp; and <br> (paragraph, space, line break) and the like. We can therefore locate the content from the distances and the particular tags.
News content extraction
Unlike traditional content extraction methods, we do not build a model for each website. In the program, extraction is implemented mainly by contentanalyse.class, token.class and related classes.
The specific method is as follows:
1. First convert the file whose content is to be extracted into a token stream;
2. Compute the content closeness over the token stream;
3. Take out the continuous set of tokens in which the Chinese characters (gb count) are most concentrated and the closeness is at the same time highest;
4. If the gb count and the closeness cannot both meet the above requirement, discard the file.
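The four steps above leave the closeness measure unspecified, so the following Java sketch substitutes a deliberately simple rule: a run of text tokens counts as "close" as long as it is interrupted only by <p>, </p> and <br> tags, and the run with the most Chinese characters is taken as the content. The class name ContentExtractor, the FRIENDLY tag set and the minChars threshold are all assumptions, not the patent's definition.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: pick the densest continuous run of Chinese text from a token list. */
final class ContentExtractor {
    // Tags assumed to sit inside body text; any other tag ends the current run.
    private static final Set<String> FRIENDLY =
        new HashSet<>(Arrays.asList("<p>", "</p>", "<br>", "<br/>"));

    static int chineseCount(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) >= 0x4E00 && s.charAt(i) <= 0x9FFF) n++;
        return n;
    }

    /** Returns the text of the best run, or "" if it has fewer than minChars Chinese characters. */
    static String extract(List<String> tokens, int minChars) {
        StringBuilder best = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int bestCount = 0, runCount = 0;
        for (int i = 0; i <= tokens.size(); i++) {
            // A sentinel tag after the last token closes the final run.
            String t = i < tokens.size() ? tokens.get(i) : "<end>";
            boolean tag = t.startsWith("<");
            if (tag && FRIENDLY.contains(t.toLowerCase())) continue;   // stays inside the run
            if (!tag) { run.append(t); runCount += chineseCount(t); continue; }
            // Any other tag closes the current run.
            if (runCount > bestCount) { bestCount = runCount; best = run; }
            run = new StringBuilder();
            runCount = 0;
        }
        return bestCount >= minChars ? best.toString() : "";
    }
}
```

Returning an empty string when the best run is too small mirrors step 4: a page without a sufficiently dense block of Chinese text is discarded rather than guessed at.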
For example, consider a news page from the china.com finance channel, shown in Fig. 9.
Because china.com pages are fairly standard, content analysis generally reaches a very high accuracy; the concrete test data are explained in detail later.
The content information obtained by content analysis is shown in Fig. 10.
When storing, we save all five parts of each news item: source, category, downloadtime, title and content. These serve as the basis for building the keyword, date, source and other indexes, and are also the source for the summary.
5. Management step: manages the news data stored on the local machine, for example deletion and updating.
Two. Automatic summary generation step: the original document is first preprocessed; word segmentation, feature word analysis and sentence importance analysis are then performed, and a summary is generated and output.
The automatic summarization step can be a standalone step; its only API to the outside is get summary ion. The interface prototype is:
public String get summary ion (String FileName, boolean FileMode, int Ratio)
The meaning of the FileName parameter depends on FileMode:
if FileMode = true, FileName is a file name;
otherwise, it is the text of the document to be summarized itself.
FileMode is the mode parameter.
Ratio is the extraction ratio; only integers between 0 and 100 are allowed.
Automatic summary generation is an independent step with its own log and transaction module, and whether summarization has finished does not affect downloading or indexing.
The system flow of the automatic summarization system is shown in Fig. 11.
Word segmentation uses a dictionary-free method based on word frequency. The new algorithm follows the same idea as the old one, with only some improvements to speed up segmentation. The weight of a string measures how likely it is to be a word:
$$P(w) = \begin{cases} F(w) \cdot L(w)^{c} & \text{if } F(w) > minFreq \text{ and } L(w) > minLen \\ 0 & \text{otherwise} \end{cases}$$
where F(w) is the frequency of the string w and L(w) is its length. minFreq is the preset minimum frequency of a word, usually ≥ 2; it reduces the number of strings that are not words. minLen is the preset minimum word length, usually ≥ 1; it ensures that low-frequency words are not split. c is a preset constant, usually ≥ 4; it ensures that long words are not split.
Procedure: the whole text is treated as one string; substrings are taken from the beginning, every substring is weighted, and those with high weights are taken as words (this involves a great deal of useless scanning). The system takes one string at a time and uses all files as background, so the scan takes a long time.
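The substring weighting described above can be sketched in Java as follows, assuming the weight has the form P(w) = F(w) · L(w)^c with strict thresholds on frequency and length. The class name SubstringWeights and the maxLen cap on substring length are additions made to keep the example small.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of dictionary-free word weighting by substring frequency. */
final class SubstringWeights {

    /** Counts every substring of length 1..maxLen and returns its weight P(w). */
    static Map<String, Double> weights(String text, int maxLen,
                                       int minFreq, int minLen, double c) {
        // F(w): number of (possibly overlapping) occurrences of each substring.
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i < text.length(); i++)
            for (int len = 1; len <= maxLen && i + len <= text.length(); len++)
                freq.merge(text.substring(i, i + len), 1, Integer::sum);

        Map<String, Double> p = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int f = e.getValue();
            int l = e.getKey().length();
            // P(w) = F(w) * L(w)^c when both thresholds are exceeded, otherwise 0.
            p.put(e.getKey(), (f > minFreq && l > minLen) ? f * Math.pow(l, c) : 0.0);
        }
        return p;
    }
}
```

For the text "abcabcabc" with minFreq = 2, minLen = 1 and c = 4, the string "abc" occurs 3 times and gets weight 3 · 3^4 = 243, while "ab" gets 3 · 2^4 = 48, so the exponent c favours keeping the longer string whole. Enumerating every substring is what makes the scan expensive.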
Feature word extraction is based on word frequency, and also takes into account word-frequency statistics from the background knowledge base.
Algorithm:
$$P(w) = F_i(w) \cdot \left( \frac{numdoc}{advnumdoc} \right) \cdot \left( L(w) - D \right)^{2}$$
F_i(w) is the frequency with which the word occurs
L(w) is the length of the word
numdoc is the number of times the word occurs in this article
advnumdoc is the average number of times it occurs across all documents
D is the preset minimum word length
The algorithm was revised for two reasons:
1. The original algorithm has to use a large background corpus (BWID), which costs the system considerable time and space; the new algorithm relies on statistics of the number of occurrences within the corpus itself.
2. The new algorithm is also theoretically sound. Because the background corpus is broad, common words occur frequently everywhere, so for them numdoc and advnumdoc are roughly equal. A feature word, by contrast, usually occurs often in the current article but not so often in BWID, so on average numdoc/advnumdoc is large and the feature word receives a large weight. See Fig. 12 for details.
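A small Java sketch makes the effect of the ratio numdoc/advnumdoc concrete. It assumes the weight has the form P(w) = F(w) · (numdoc / advnumdoc) · (L(w) − D)^2; the class name FeatureWordWeight and the choice to return 0 when no background statistics are available are illustrative assumptions.

```java
/** Sketch of the feature word weight (names and the zero fallback are illustrative). */
final class FeatureWordWeight {

    /**
     * @param freq       F(w), frequency of the word
     * @param numdoc     occurrences of the word in this article
     * @param advnumdoc  average occurrences of the word across all documents
     * @param length     L(w), length of the word
     * @param minLength  D, the preset minimum word length
     */
    static double weight(int freq, double numdoc, double advnumdoc,
                         int length, int minLength) {
        if (advnumdoc <= 0) return 0;              // no background statistics available
        double ratio = numdoc / advnumdoc;         // near 1 for common words
        double lengthTerm = Math.pow(length - minLength, 2);
        return freq * ratio * lengthTerm;
    }
}
```

Two words with the same frequency 4 and length 3 (with D = 1) illustrate the argument: a common word with numdoc = advnumdoc = 4 scores 4 · 1 · 4 = 16, while a feature word with numdoc = 4 and advnumdoc = 1 scores 4 · 4 · 4 = 64.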
Sentence importance and summary generation:
$$T(s) = \frac{\sum_i T_i}{s_0 \cdot s_1 \cdot s_2 \cdot m}$$
The weight of each sentence is calculated with this formula.
Ti is the weight of the i-th word in the sentence.
s0 is the total number of words in the sentence.
s1 is the number of phrases in the sentence.
s2 is the number of numerals in the sentence.
m is an integer constant, generally 1.
The stored original text is shown in Figure 13.
The article after summarization is shown in Figure 14.
Three, full-text index generation step:
This step builds a full-text index over all news content files that have been downloaded and have completed content extraction. The indexing process runs in the background and builds the index in real time. The step can also run independently; the only interface parameter it requires is a file name.
The flow of the full-text index generation step is shown in Figure 15 and comprises the following steps:
Input step: pass in the next file name;
Index check step: determine whether the file has already been indexed; if so, return to the input step, otherwise proceed to the next step;
Filtering step: filter out all junk and meaningless words;
Dictionary segmentation step: segment the text by dictionary matching;
N-gram segmentation step: segment the text into n-grams, so that words the dictionary segmentation fails to separate are still captured;
Update step: for each word, update the related index files, including the keyword, date, and category indexes.
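The flow above can be sketched as a small in-memory index. Everything named here (FullTextIndexSketch, forward maximum matching for the dictionary pass, bigrams for the n-gram pass, splitting on punctuation, filtering per term) is an assumption chosen for illustration; the original does not specify these details, and the date and category indexes are omitted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory sketch of the indexing flow; not the patent's implementation. */
final class FullTextIndexSketch {
    private final Set<String> indexedFiles = new HashSet<>();          // files already indexed
    private final Map<String, Set<String>> postings = new HashMap<>(); // term -> file names
    private final Set<String> dictionary;
    private final Set<String> stopWords;
    private final int maxWordLength;

    FullTextIndexSketch(Set<String> dictionary, Set<String> stopWords) {
        this.dictionary = dictionary;
        this.stopWords = stopWords;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /** Indexes one file; returns false if the file was indexed before (index check step). */
    boolean index(String fileName, String content) {
        if (!indexedFiles.add(fileName)) {
            return false;
        }
        // Split on Unicode punctuation and whitespace so n-grams do not span delimiters.
        for (String segment : content.split("[\\p{P}\\s]+")) {
            List<String> terms = new ArrayList<>(dictionaryTerms(segment));
            terms.addAll(bigrams(segment));
            for (String term : terms) {
                if (!stopWords.contains(term)) { // filtering, applied per term here
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(fileName);
                }
            }
        }
        return true;
    }

    /** Dictionary pass: forward maximum matching; unmatched characters are skipped. */
    List<String> dictionaryTerms(String text) {
        List<String> terms = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            for (int len = Math.min(maxWordLength, text.length() - i); len >= 1; len--) {
                if (dictionary.contains(text.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                terms.add(text.substring(i, i + matched));
                i += matched;
            } else {
                i++;
            }
        }
        return terms;
    }

    /** N-gram pass with n = 2, catching words the dictionary pass misses. */
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    /** Returns the names of the files containing the term, or an empty set. */
    Set<String> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}
```

Running both passes trades a larger index for recall: a query term missing from the dictionary can still be found through its bigrams.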
Four, hierarchical text classification step: this step assigns a new document to one category in a given classification hierarchy. Each document can be assigned to only one category. Each category in the hierarchy is associated with a set of vocabulary and terms, and the classification algorithm works iteratively through the levels of the hierarchy. As a result, a given term may carry a large weight at one level of the hierarchy and be a stop word at another. The feature words of the extracted documents (financial news) serve as the terms and vocabulary in this system.
The step has two parts: a hierarchy training step and a document classification step. Hierarchy training is the preprocessing for document classification: before classifying, the category hierarchy is trained.
1. Hierarchy training
Training collects a set of features (feature words) from the training documents and then assigns feature weights to each node (category) in the hierarchy. In the document classification algorithm, the feature weights are used to compute category ranks for a new document.
Training comprises four steps:
1) Collect feature words from the leaf categories;
For the training documents (news) of each leaf category in the hierarchy, only feature words that occur more than 2 times in a single training document, or more than 10 times in the training document set, are collected; these words ultimately appear in the summary. The collected feature words characterize the leaf category. When a leaf category has a set of training documents, its parent category includes the features of that leaf category. The features of a non-leaf category comprise all the features of its child nodes, each with a frequency equal to the sum of its occurrence frequencies over all child nodes.
2) Hierarchy optimization step
This step resolves competition between a category node and its parent category. Because a document (news item) can be assigned to only one category in the hierarchy, the algorithm must decide on the appropriate category for the document when categories compete.
It comprises the following steps:
Collection step: collect all features in the category;
Feature check step: determine whether the feature's frequency in the parent is greater than in this category; if so, proceed to the next step, otherwise do nothing;
Child lookup step: look through the feature lists of the child categories and find the highest and lowest frequencies of the feature among them;
Ratio check step: determine whether the ratio of the difference between the highest and lowest frequencies to the highest frequency is greater than a threshold; if so, proceed to the next step, otherwise delete the feature from all child categories so that only the parent holds it;
Deletion step: delete the feature from every child category except the one that has the highest frequency for it.
The rule above identifies common features. For a parent category, its children hold the feature together with its frequency. When the feature's frequency is distributed unevenly among the children, that is, when the highest and lowest frequencies among the children differ markedly, the common feature is deleted from all children except the one with the highest frequency. As a result, the features and frequencies of all leaf categories are propagated upward to every category above the leaves except the root category; the root category does not take part in any document rank calculation.
When a child category holds a feature, the algorithm cannot delete that feature directly from the parent category. This is because the feature may be needed to route a document to the parent category; a document that cannot be routed to the parent category cannot be routed to the child category either. Thus the distinctions among lower-level categories (child categories) are propagated upward to the level above (the parent category).
3) Category feature weight assignment step: a weight is assigned to each feature of a category; a feature with a higher weight is more important to that category. Every feature in each category is assigned a weight defined by the following formula:
$$W_{fc} = \lambda + (1 - \lambda) \cdot \frac{N_{fc}}{M_c}$$
where f is a feature present in the category, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fc is the number of times f occurs in c, and M_c is the maximum frequency of any feature in c.
When a feature does not occur in c itself but occurs only in its sibling categories, it is assigned a negative weight, and the feature with its negative weight is added to the feature list of c. The negative weight is defined by the following formula:
$$W_{fc} = -\left(\lambda + (1 - \lambda) \cdot \frac{N_{fp}}{M_p}\right)$$
where f is the feature, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fp is the number of times f occurs in the parent category of c, and M_p is the maximum frequency of any feature in the parent category of c.
4) Filter the feature list of each category. Only the top 200 positive features and the top 200 negative features are kept in a category's final feature list, whether it is a parent category or a leaf category; all other features are discarded. Limiting the number of features reduces the computational complexity of classifying a document.
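The weight assignment and filtering steps can be illustrated with the following Java sketch. The names are hypothetical, and ranking features by the magnitude of their weight before truncation is an assumption, since the original only says that the first 200 positive and first 200 negative features are kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of category feature weights and list filtering. */
final class CategoryWeightSketch {
    static final double LAMBDA = 0.4; // value stated in the text

    /** W_fc = lambda + (1 - lambda) * N_fc / M_c for a feature that occurs in c. */
    static double positiveWeight(int nFc, int mC) {
        return LAMBDA + (1 - LAMBDA) * nFc / mC;
    }

    /** W_fc = -(lambda + (1 - lambda) * N_fp / M_p) for a feature found only in siblings of c. */
    static double negativeWeight(int nFp, int mP) {
        return -(LAMBDA + (1 - LAMBDA) * nFp / mP);
    }

    /** Keeps at most `limit` positive and `limit` negative features, strongest first (assumed order). */
    static Map<String, Double> filter(Map<String, Double> weights, int limit) {
        List<Map.Entry<String, Double>> positives = new ArrayList<>();
        List<Map.Entry<String, Double>> negatives = new ArrayList<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            if (e.getValue() > 0) {
                positives.add(e);
            } else if (e.getValue() < 0) {
                negatives.add(e);
            }
        }
        positives.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        negatives.sort((a, b) -> Double.compare(a.getValue(), b.getValue()));
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : positives.subList(0, Math.min(limit, positives.size()))) {
            kept.put(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, Double> e : negatives.subList(0, Math.min(limit, negatives.size()))) {
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }
}
```

With λ = 0.4, every positive weight lies in (0.4, 1.0] and every negative weight in [-1.0, -0.4), so even a rare feature contributes a noticeable amount to a category's rank.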
2. Document classification method: once the hierarchy has been trained, a document can be classified into a category. Classification starts from the root category. Every child category of the root is assigned a rank computed by the equation:
$$R_{cd} = \sum_{f} N_{fd} \, W_{fc}$$
where c is a category, d is a document, f is a feature in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c.
If the ranks of all child categories are zero or negative, d stays at the root category. If one child category has a distinct maximum positive rank, that category is selected. If the selected category is a leaf category, document d is assigned to it. If it is not a leaf category, the calculation continues among its child categories. Document d can therefore be assigned to either a leaf category or an internal category.
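The top-down descent can be sketched as follows. The Category type and method names are hypothetical, and the handling of ties (the first child with the maximum rank wins) is an assumption, because the original only covers the case of a distinct maximum.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of top-down hierarchical classification. */
final class HierarchyClassifierSketch {

    /** A node of the category hierarchy with its trained feature weights W_fc. */
    static final class Category {
        final String name;
        final Map<String, Double> weights = new HashMap<>();
        final List<Category> children = new ArrayList<>();

        Category(String name) {
            this.name = name;
        }
    }

    /** R_cd = sum over features f in d of N_fd * W_fc; features unknown to c contribute 0. */
    static double rank(Category c, Map<String, Integer> doc) {
        double r = 0.0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Double w = c.weights.get(e.getKey());
            if (w != null) {
                r += e.getValue() * w;
            }
        }
        return r;
    }

    /** Descends from the root while some child has a positive rank; otherwise stays put. */
    static Category classify(Category root, Map<String, Integer> doc) {
        Category current = root;
        while (!current.children.isEmpty()) {
            Category best = null;
            double bestRank = 0.0;
            for (Category child : current.children) {
                double r = rank(child, doc);
                if (r > bestRank) { // strictly positive and larger than any earlier child
                    best = child;
                    bestRank = r;
                }
            }
            if (best == null) {
                break; // all ranks zero or negative: the document stays at this category
            }
            current = best;
        }
        return current;
    }
}
```

A document with no recognized features stays at the root, and one that matches an internal category but none of its children stays at that internal category, matching the two outcomes described above.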
Five, news query step: as shown in Figure 16, it comprises the following steps:
Submission step: submit the query conditions;
Search step: search the index to obtain the result set;
Return step: return the results to the user.
The preceding steps implement automatic downloading, automatic summarization, and index building in the background. The news query subsystem handles interaction with the user: in the foreground it lets the user run news queries, including keyword queries, category queries, date queries, and news source queries.
Six, logging and transaction processing step:
A running program often terminates abnormally, for example through an unexpected crash or a sudden power failure.
In such cases the integrity of the back-end data must be guaranteed. In particular, the index must remain complete: even if the program stops halfway, the next run must be able to recover the existing index results and resume indexing from the point of failure.
In addition, download and summarization work must also be logged, to avoid repeated work and save time.
Functions of the log file system:
1. When analyzing URLs, the URL analysis module of the download thread first reads the log files, loading the two most recent ones, to determine whether a URL needs to be downloaded.
2. Whenever a news content page is downloaded, the corresponding URL is stored in the latest log file.
3. During indexing, the index position information is read first, followed by the log file entries that still need to be indexed. The corresponding content files are then indexed, and the index position information in the index log file is updated at the same time.
4. During summarization, the summary position information is read first, followed by the log file entries that still need to be summarized. The corresponding content files are then summarized, and the summary position information in the summary log file is updated at the same time.
5. Each time a file's source code has been downloaded, its content analyzed, its summary completed, or its indexing completed, the work is logged. This prevents unrecoverable states after an accident and avoids repeated work.
6. The download, summarization, and indexing threads never stop. Even when a task is finished, for example when summarization is complete, the thread reloads the summary log file and starts summarizing again.
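One way to picture the log functions above is a small record of downloaded URLs and per-task positions that can be saved and reloaded. This is only a sketch under assumptions: the class WorkLogSketch, its key naming, and the use of java.util.Properties as the storage format are illustrative choices, not details given in the original.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical work log: downloaded URLs plus a resume position per task. */
final class WorkLogSketch {
    private final Properties state = new Properties();

    /** Records that a URL has been downloaded (function 2 in the list above). */
    void recordUrl(String url) {
        state.setProperty("url." + url, "done");
    }

    /** Tells the download thread whether a URL can be skipped (function 1). */
    boolean isDownloaded(String url) {
        return state.containsKey("url." + url);
    }

    /** Stores how far a task such as "index" or "summary" has progressed (functions 3 and 4). */
    void recordPosition(String task, int position) {
        state.setProperty(task + ".position", Integer.toString(position));
    }

    /** Returns the saved position for a task, or 0 if the task has never run. */
    int position(String task) {
        return Integer.parseInt(state.getProperty(task + ".position", "0"));
    }

    /** Serializes the log; a real system would write this text to a log file. */
    String save() {
        try {
            StringWriter out = new StringWriter();
            state.store(out, null);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores a log after a restart so work resumes from the saved positions. */
    static WorkLogSketch load(String saved) {
        try {
            WorkLogSketch log = new WorkLogSketch();
            log.state.load(new StringReader(saved));
            return log;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a crash, a thread would call load on the last saved text, read position for its task, and continue from there instead of repeating finished work.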
Seven, management step:
The management step mainly handles local data management, category management, news source management, index updates after data deletion, log updates, and so on.

Claims (24)

1, the method for a kind of network information extraction and processing comprises the steps:
One. news download step: comprise the steps
The url analytical procedure: system specifies certain url, program can analyze the final content url of news automatically from these url, and need not do a specific url module to each news website, employing gives url statistics and the method for url being carried out correlation analysis, at a webpage that contains final content news link address, carry out statistics and analysis, find useful final url address;
Automatically grasp the news web page step: all pages that meet the url form are downloaded with the link page in the destination address;
Rubbish filtering step: realize carrying out rubbish filtering, remove html label wherein and Chinese that some are useless, finally obtain Chinese vector information to grabbing the news content webpage that gets off;
The information extraction step: the above Chinese vector that obtains is carried out information extraction, realize early stage extracting title and content, the later stage realization is carried out feature extraction to the web news content, correlation analysis, and document classification, row heavily handles or the like;
Two. generate the summary step automatically: carry out participle, the analysis of feature speech, sentence important analysis, generate summary, and the output summary;
Three. generate the full-text index step: all news content files of having downloaded and having finished content extraction are carried out full-text index, comprise the steps:
Import step into, import next filename into;
The index determining step judges whether index mistake, is then to get back to import step into, otherwise enters next step;
Filtration step filters wherein all rubbish and insignificant speech;
Coupling participle step is carried out the dictionary matching participle;
Ngram participle step is carried out the ngram participle, in order to avoid the dictionary participle has the speech of failing to branch away fully;
Step of updating is all upgraded relevant index file to each speech, comprises key word and date, the classification index;
Four. Hierarchical text classification step: this step assigns a new document to one category in a given classification hierarchy; each document can be assigned to only one category; each category in the hierarchy is associated with a set of vocabulary and terms, so that a given term may carry a large weight at one level of the hierarchy and be a stop word at another; the feature words of the extracted documents (financial news) serve as the terms and vocabulary in this system; the step comprises a hierarchy training step and a document classification step;
Hierarchy training is the preprocessing for document classification: before classifying, the category hierarchy is trained; training collects a set of features (feature words) from the training documents and then assigns feature weights to each node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document;
In the document classification step, once the hierarchy has been trained, a document can be classified into a category; classification starts from the root category, and every child category of the root is assigned a rank computed by the equation:
$$R_{cd} = \sum_{f} N_{fd} \, W_{fc}$$
where c is a category, d is a document, f is a feature in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c;
If the ranks of all child categories are zero or negative, d stays at the root category; if one child category has a distinct maximum positive rank, that category is selected; if the selected category is a leaf category, document d is assigned to it; if it is not a leaf category, the calculation continues among its child categories; document d can therefore be assigned to either a leaf category or an internal category.
2, the method for network information extraction according to claim 1 and processing is characterized in that described news download step also comprises management process, realizes the news data of this machine storage is managed, and as deletion, upgrades etc.
3, the method for network information extraction according to claim 1 and processing is characterized in that described method also comprises the news query steps, comprises the steps:
Submit step to, the submit queries condition;
Search step carries out search operation to index, obtains result set;
Return step, the result is returned to the user.
4, network information extraction according to claim 1 and disposal route, it is characterized in that described method also comprises daily record and issued transaction step, even carrying out half has stopped, next time, operation still can recover original indexed results, and begin to note down for work such as download and summaries from newly carrying out indexing service from the position of failure; The url analysis module of download thread just reads in accounting file earlier, and is written into two up-to-date journal files when analyzing url, downloads in order to judging whether; Whenever downloading a news content webpage, just the url that storage is relevant is to up-to-date journal file; In the process of index, must read in the positional information of index earlier, read in the journal file information of necessary index then; Then corresponding content file is carried out index, upgrade the index position information in the index journal file simultaneously; In the process of summary, must read in the positional information of summary earlier, read in the journal file information that must make a summary then; Then corresponding content file is made a summary, upgrade the summary positional information in the summary journal file simultaneously; Source code whenever having downloaded a file analyzes content, finishes summary, finishes index and all will note down this work; Download, summary, three threads of index never stop, even finished a certain work, finishing such as summary, then download the journal file of summary again, begin summary.
5, the method for network information extraction according to claim 1 and processing is characterized in that described method also comprises management process, and management process is mainly realized the data management to this machine, category management, the news sources management, the index upgrade of data deletion, daily record renewal etc.
6, the method for network information extraction according to claim 1 and processing is characterized in that described autoabstract step can be an independent step, need have only a get summary ion with the api interface of external interface, and its interface prototype is
public String get summary ion (String FileName, boolean FileMode, int Ratio)
The meaning of the FileName parameter depends on FileMode: if FileMode = true, FileName is a file name; otherwise it is the text of the document to be summarized; the FileMode parameter is a mode parameter; Ratio is the extraction ratio and must be an integer between 0 and 100.
7, the method for network information extraction according to claim 1 and processing is characterized in that described generation full-text index step can be an independent step, and the required interface parameters that provides is a filename.
8, the method for network information extraction according to claim 1 and processing is characterized in that adopting in the described news download step token analytical approach to html; Fully use the OO thought among the java, regard each html source code file as an object, set up the class of a token by name simultaneously, token is used for describing significant character string among the html, and inherit out the urltoken class by token, urltoken is used for describing the token that feature meets the url form;
When carrying out the html source code analysis, regard each file as an object, simultaneously with regard to the character string between each html tag in this document and each the html tag, all it is regarded as a character string;
The attributes of each token are:
    String tokenstr = null;      // string value of this token
    int tokenloc = 0;            // position of this token in the original text
    int gbnum = 0;               // number of Chinese characters in this token
    boolean iskeentag = false;   // whether this token is closely related to the content
    float keenvalue = 0;         // degree of closeness to the content
A more distinctive method of Token is:
    public boolean ishref() {
        String flag1 = "href=";
        int flag2 = -1;
        if (tokenstr.indexOf(flag1) == flag2) return false;
        else return true;
    }
This method determines whether the token is a URL html tag;
URL analysis is implemented mainly by two classes, urlanalyse.class and contentanalyse.class, which analyze the token stream;
Main analysis method: urlanalyse.class has a method geturl(String filename) that first converts the source code into a token stream and reads it in, then adds to a cached HashMap each token that matches the URL format, together with the token following that URL whose gbnum is not 0; in general, the token following a URL whose gbnum is not 0 is the news title.
9, the method for network information extraction according to claim 1 and processing is characterized in that participle adopts " no dictionary " segmenting method in the described automatic generation summary step, adopts word frequency, and speech is heavy---measurement is the algorithmic formula of the possibility of speech:
P (w)=F (w) * L (w) cWhen (F (w)>minFreq, L (w)>minLen) otherwise P (w) minFreq are the appearance minimum frequencies of the speech preset; Usually 〉=2; Reduce is not that the string minLen of speech is that the shortest speech of the speech preset is long; Usually 〉=1; Guarantee that it is a normal value of presetting that low-frequency word is not separated c; Usually 〉=4; Guarantee that long word is not separated;
Flow process is as follows: a character string be used as in whole literary composition, starts anew to ask substring, and all substrings are asked power, and the high person of weighting is as speech (too many useless scanning), and system value is got a string, adopts All Files as a setting.
10, the method for network information extraction according to claim 1 and processing is characterized in that the extraction of feature speech in the described automatic generation summary step, based on the frequency of speech and want to add up for the word frequency in background knowledge storehouse,
P ( w ) = F i ( w ) · ( numdoc advnumdoc ) · ( L ( w ) - D ) c
The frequency that F (w) occurs for speech, L (w) is the length of speech, and numdoc is the occurrence number in this article of this speech, and advnumdoc average time occurs in all documents, and the shortest speech that D presets is long.
11, the method for network information extraction according to claim 1 and processing is characterized in that the relation of the generation of the importance of sentence in the described automatic generation summary step and summary:
T ( s ) · ΣTi s 0 * s 1 * s 2 * m
Each sentence is calculated their weight by this formula;
Ti is the weight of the speech of sentence composition, and S0 is total speech number of sentence, and S1 is the words and expressions number of sentence,
S2 is the number of number, and m is that integer often is worth, and is generally 1.
12, the method for network information extraction according to claim 1 and processing is characterized in that described level training step comprises 4 steps:
1) collects from the feature speech of leaf class: in the level, feature speech for the training document (news) of each leaf class, have only those in single training document, to occur just being collected more than 2 times or at the feature speech of training document sets appearance more than 10 times, these speech occur in summary at last, the feature vocabulary of these collections has shown the feature of leaf class, when a leaf class belongs to some training document sets, parent will comprise the feature of this leaf class, the feature of non-leaf class comprise it child nodes all features and in all child nodes the summation of feature occurrence frequency;
2) level optimization step: the optimization competition that solves between classification joint and its father and mother's classification, because a file (news) can only be designated as a classification in the hierarchical organization of classification, when between classification competitive the time, algorithm should determine suitable classification for file, comprises the steps:
Acquisition step is captured in all features in the classification;
The feature determining step judges whether that the characteristic frequency ratio in father and mother is big in this classification, is then to arrive next step, otherwise not operation;
Look into the step that continues, look into successor's feature catalogue, find out the feature of successor's high-frequency and minimum frequency;
The ratio determining step judges whether in the difference of high frequency and minimum frequency greatlyyer than threshold value with the ratio of highest frequency, is then to arrive next step, otherwise deletes this feature from all successors.Have only father and mother to possess this feature;
The deletion step is unless delete the highest frequency that this feature successor has this feature from the successor;
3) Category feature weight assignment step: a weight is assigned to each feature of a category, and a feature with a higher weight is more important to that category; every feature in each category is assigned a weight defined by the following formula:
$$W_{fc} = \lambda + (1 - \lambda) \cdot \frac{N_{fc}}{M_c}$$
where f is a feature present in the category, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fc is the number of times f occurs in c, and M_c is the maximum frequency of any feature in c;
When a feature does not occur in c itself but occurs only in its sibling categories, it is assigned a negative weight, and the feature with its negative weight is added to the feature list of c; the negative weight is defined by the following formula:
$$W_{fc} = -\left(\lambda + (1 - \lambda) \cdot \frac{N_{fp}}{M_p}\right)$$
where f is the feature, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fp is the number of times f occurs in the parent category of c, and M_p is the maximum frequency of any feature in the parent category of c;
4) Filter the feature list of each category: only the top 200 positive features and the top 200 negative features are kept in a category's final feature list, whether it is a parent category or a leaf category, and all other features are discarded. Limiting the number of features reduces the computational complexity of classifying a document.
13, the system of a kind of network information extraction and processing is characterized in that: comprise as lower device:
One. news download apparatus: comprise as lower device
The url analytical equipment: system specifies certain url, program can analyze the final content url of news automatically from these url, and need not do a specific url module to each news website, employing gives url statistics and the method for url being carried out correlation analysis, at a webpage that contains final content news link address, carry out statistics and analysis, find useful final url address;
Automatically grasp the news web page device: all pages that meet the url form are downloaded with the link page in the destination address;
Rubbish filtering device: realize carrying out rubbish filtering, remove html label wherein and Chinese that some are useless, finally obtain Chinese vector information to grabbing the news content webpage that gets off;
Information extracting device: the above Chinese vector that obtains is carried out information extraction, realize early stage extracting title and content, the later stage realization is carried out feature extraction to the web news content, correlation analysis, and document classification, row heavily handles or the like;
Two. generate summarization device automatically: carry out participle, the analysis of feature speech, sentence important analysis, generate summary, and the output summary;
Three. generate the full-text index device: all news content files of having downloaded and having finished content extraction are carried out full-text index, comprise as lower device:
Import device into, import next filename into;
The index judgment means judges whether index mistake, is then to get back to import device into, otherwise enters next step;
Filtration unit filters wherein all rubbish and insignificant speech;
Coupling participle device carries out the dictionary matching participle;
Ngram participle device carries out the ngram participle, in order to avoid the dictionary participle has the speech of failing to branch away fully;
Updating device all upgrades relevant index file to each speech, comprises key word and date, the classification index;
Four. level document sorting apparatus: be that a new document is included into sorter in the class in the given stratigraphic classification; Every part of document can only be included in the class, have on the big level of given term of weight in level in that each class in the stratigraphic classification and many vocabulary are relevant with term, and stopword to be on another level. the feature speech of the document of being taken passages (news of finance) is taken as term in this system and glossary uses; Comprise level trainer and document classification device;
The level trainer is the pre-service to document classification, before classification, earlier the level of classification is trained; The function of training is the stack features (feature speech) of self-training document of will collecting, and is each node (classification) assigned characteristics weight in level then, and in the document classification algorithm, feature weight is to be used for being the new document calculations classification grade of portion;
Device for sorting document is that present a file can be classified into a classification after by the training hierarchy, and file classifying method is from the root classification, and all subclass of root classification are assigned with grade, and it is calculated by equation:
$$R_{cd} = \sum_{f} N_{fd} \, W_{fc}$$
where c is a category, d is a document, f is a feature in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c;
If the grade of all subclass all be zero or negative, d is left on the root classification; If the classification of the grade of definite positive maximum is arranged in subclass, then this classification is selected; If this classification is a leaf classification, file d is assigned to this classification; If selecteed classification is not the leaf classification, then in such other subclass, proceed to calculate; Therefore, file d can assign to leaf classification or internal sort.
14, the system of network information extraction according to claim 13 and processing is characterized in that described news download apparatus also comprises management devices, realizes the news data of this machine storage is managed, and as deletion, upgrades etc.
15, network information extraction according to claim 13 and disposal system is characterized in that described system also comprises the news inquiry unit, comprise as lower device:
Submit device to, the submit queries condition;
Searcher carries out search operation to index, obtains result set;
Return mechanism returns to the user with the result.
16, network information extraction according to claim 13 and disposal system, it is characterized in that described system also comprises daily record and transacter, even carrying out half has stopped, next time, operation still can recover original indexed results, and begin to note down for work such as download and summaries from newly carrying out indexing service from the position of failure; The url analysis module of download thread just reads in accounting file earlier, and is written into two up-to-date journal files when analyzing url, downloads in order to judging whether; Whenever downloading a news content webpage, just the url that storage is relevant is to up-to-date journal file; In the process of index, must read in the positional information of index earlier, read in the journal file information of necessary index then; Then corresponding content file is carried out index, upgrade the index position information in the index journal file simultaneously; In the process of summary, must read in the positional information of summary earlier, read in the journal file information that must make a summary then.Then corresponding content file is made a summary, upgrade the summary positional information in the summary journal file simultaneously; Source code whenever having downloaded a file analyzes content, finishes summary, finishes index and all will note down this work; Download, summary, three threads of index never stop, even finished a certain work, finishing such as summary, then download the journal file of summary again, begin summary.
17. The network information extraction and processing system according to claim 13, characterized in that the system further comprises a management device, which mainly manages the data on the local machine, including category management, news source management, data deletion, index updating, and log updating.
18. The network information extraction and processing system according to claim 13, characterized in that the automatic summarization device can be an independent device; the only API it needs to expose to external callers is get summary ion, whose interface prototype is
public String get summary ion (String FileName, boolean FileMode, int Ratio)
The meaning of the FileName parameter depends on FileMode: if FileMode = true, FileName is a file name; otherwise, FileName is the text of the document to be summarized itself. FileMode is the mode parameter. Ratio is the extraction ratio and must be an integer between 0 and 100.
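The parameter contract of claim 18 can be sketched in Java as follows. The method name `extractSummary` and the class name are hypothetical, and the selection step is only a placeholder that keeps the leading sentences; it is an assumption standing in for the weighted sentence selection of the actual summarization device.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class SummaryApiSketch {
    /**
     * @param fileName a file path when fileMode is true, otherwise the document text itself
     * @param fileMode the mode parameter described in the claim
     * @param ratio    extraction ratio in percent, 0 to 100 inclusive
     */
    static String extractSummary(String fileName, boolean fileMode, int ratio) {
        if (ratio < 0 || ratio > 100) {
            throw new IllegalArgumentException("Ratio must be between 0 and 100: " + ratio);
        }
        String text;
        if (fileMode) {
            try {
                text = new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            text = fileName; // the argument already holds the document text
        }
        // Split after Western or Chinese sentence-ending punctuation, keeping the punctuation.
        String[] sentences = text.split("(?<=[.!?\u3002\uff01\uff1f])");
        // Placeholder selection (assumption): keep the first ratio percent of the sentences.
        int keep = (int) Math.ceil(sentences.length * ratio / 100.0);
        StringBuilder summary = new StringBuilder();
        for (int i = 0; i < keep; i++) {
            summary.append(sentences[i]);
        }
        return summary.toString();
    }
}
```

Calling `SummaryApiSketch.extractSummary("A. B. C. D.", false, 50)` treats the first argument as the document text and keeps two of its four sentences.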
19. The network information extraction and processing system according to claim 13, characterized in that the full-text index generation device can be an independent device, and the only interface parameter it requires is a file name.
20. The network information extraction and processing system according to claim 13, characterized in that the news download device uses a token analysis method for html. Making full use of the object-oriented approach of java, each html source file is regarded as an object, and a class named token is created; token describes the meaningful character strings in the html, and a urltoken class is derived from token; urltoken describes tokens whose features match the url format;
When html source code is analyzed, each file is regarded as an object, and each html tag in the file, as well as each run of text between html tags, is regarded as a single character string;
Each token has the following attributes:
String tokenstr = null;     // the string value of this token
int tokenloc = 0;           // the position of this token in the original text
int gbnum = 0;              // the number of Chinese characters in this token
boolean iskeentag = false;  // whether this token is entirely content-related
float keenvalue = 0f;       // how closely this token is related to the content
Token also has one more special method:
public boolean ishref() {
    String flag1 = "href=";
    int flag2 = -1;
    // indexOf returns -1 when "href=" does not occur in the token
    if (tokenstr.indexOf(flag1) == flag2) return false;
    else return true;
}
This method is used to determine whether the token is a url html tag;
Url analysis is mainly implemented by two classes, urlanalyse.class and contentanalyse.class, which mainly implement the analysis of the token stream;
The main analysis method: urlanalyse.class has a method geturl (String filename) that first converts the source code into a token stream and reads it in, then adds each token that matches the url format, together with the token following that url whose gbnum is not equal to 0, to a cached hashmap; in general, the token following a url whose gbnum is not equal to 0 is the title of a news item.
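The pairing of url tokens with title tokens described in claim 20 might look like the following Java sketch. The class name, the regular expressions, and the rule that the search for a title stops at the next link are assumptions; unlike the claimed geturl, this version takes the html text directly instead of a file name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlTitleSketch {
    // A token is either one html tag or one run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");
    private static final Pattern HREF = Pattern.compile("href=[\"']?([^\"'\\s>]+)");

    /** Splits html source into a stream of tag tokens and text tokens. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String t = m.group().trim();
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Counts CJK ideographs in a token, playing the role of gbnum. */
    static int gbnum(String token) {
        int n = 0;
        for (int i = 0; i < token.length(); i++) {
            char ch = token.charAt(i);
            if (ch >= '\u4e00' && ch <= '\u9fff') n++;
        }
        return n;
    }

    /** Maps each url to the first following text token that contains Chinese characters. */
    static Map<String, String> geturl(String html) {
        Map<String, String> cache = new LinkedHashMap<>();
        List<String> tokens = tokenize(html);
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            Matcher href = HREF.matcher(token);
            if (!token.startsWith("<") || !href.find()) continue; // not a url token
            for (int j = i + 1; j < tokens.size(); j++) {
                String next = tokens.get(j);
                if (next.startsWith("<") && next.contains("href=")) break; // the next link begins
                if (!next.startsWith("<") && gbnum(next) != 0) {
                    cache.put(href.group(1), next); // url followed by its probable title
                    break;
                }
            }
        }
        return cache;
    }
}
```

A link that wraps only an image has no following text token with Chinese characters before the next link, so it is left out of the map.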
21. The network information extraction and processing system according to claim 13, characterized in that word segmentation in the automatic summarization device uses a dictionary-free ("no dictionary") segmentation method based on word frequency, in which a candidate string w is weighted as $P(w) = F(w) \cdot L(w)^{c}$ when $F(w) > \mathrm{minFreq}$ and $L(w) > \mathrm{minLen}$, and a string that fails either condition does not receive this weight. minFreq is the preset minimum frequency of occurrence of a word, usually ≥ 2, and reduces the number of strings that are not words; minLen is the preset minimum word length, usually ≥ 1, and ensures that low-frequency words are not split; c is a preset constant, usually ≥ 4, and ensures that long words are not split;
The flow is as follows: the whole text is treated as one character string; starting from the beginning, substrings are taken and a weight is computed for every substring, and the substrings with high weights are taken as words (this involves many useless scans); the system value is taken from one string, and all files are used as the background;
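The weighting in claim 21 can be tried out with a short Java sketch. The class and method names are hypothetical, the cap `maxLen` on substring length is an added assumption to keep the scan bounded, and returning 0 for a string that fails the thresholds is likewise an assumption, since the claim text does not preserve that value.

```java
import java.util.HashMap;
import java.util.Map;

final class SubstringWeightSketch {
    /** Counts every substring of length 1..maxLen; the count plays the role of F(w). */
    static Map<String, Integer> countSubstrings(String text, int maxLen) {
        Map<String, Integer> freq = new HashMap<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = 1; len <= maxLen && start + len <= text.length(); len++) {
                freq.merge(text.substring(start, start + len), 1, Integer::sum);
            }
        }
        return freq;
    }

    /** P(w) = F(w) * L(w)^c when F(w) > minFreq and L(w) > minLen. */
    static double weight(int frequency, int length, int minFreq, int minLen, int c) {
        if (frequency > minFreq && length > minLen) {
            return frequency * Math.pow(length, c);
        }
        return 0.0; // assumption: strings below either threshold get no weight
    }
}
```

With c = 4 the length term dominates: a two-character string seen three times scores 3 · 2⁴ = 48, which shows how the exponent favors longer candidates over their own shorter fragments.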
22. The network information extraction and processing system according to claim 13, characterized in that the extraction of feature words in the automatic summarization device is based on word frequency and on word-frequency statistics relative to the background knowledge base:
$P(w) = F_i(w) \cdot \left( \frac{\mathrm{numdoc}}{\mathrm{advnumdoc}} \right) \cdot \left( L(w) - D \right)^{c}$
where $F_i(w)$ is the frequency with which the word occurs, $L(w)$ is the length of the word, numdoc is the number of times the word occurs in this article, advnumdoc is the average number of times it occurs over all documents, and D is the preset minimum word length.
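As a worked illustration of the feature-word formula in claim 22, the following Java sketch evaluates it for given inputs. The class name and parameter names are hypothetical, and the exponent c is passed in as a parameter because this claim does not state its value.

```java
final class FeatureWordWeightSketch {
    /**
     * P(w) = F(w) * (numdoc / advnumdoc) * (L(w) - D)^c
     *
     * @param frequency  F(w), how often the word occurs
     * @param numdoc     occurrences of the word in this article
     * @param advnumdoc  average occurrences of the word over all background documents
     * @param length     L(w), the length of the word
     * @param d          D, the preset minimum word length
     * @param c          exponent (assumed to be supplied by the caller)
     */
    static double weight(double frequency, double numdoc, double advnumdoc, int length, int d, int c) {
        if (advnumdoc <= 0) {
            throw new IllegalArgumentException("advnumdoc must be positive");
        }
        return frequency * (numdoc / advnumdoc) * Math.pow(length - d, c);
    }
}
```

For a word of length 3 that occurs 6 times in the article against a background average of 1.5, with D = 1 and c = 2, the weight is 6 · (6 / 1.5) · (3 − 1)² = 96; the ratio term raises words that are unusually frequent in this article relative to the background.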
23. The network information extraction and processing system according to claim 13, characterized in that the importance of a sentence in the automatic summarization device is related to the generation of the summary as follows:
$T(s) = \frac{\sum T_i}{S_0 \cdot S_1 \cdot S_2 \cdot m}$
The weight of each sentence is calculated by this formula;
$T_i$ is the weight of each word that makes up the sentence, $S_0$ is the total number of words in the sentence, $S_1$ is the number of words and expressions in the sentence, $S_2$ is the number of numerals in the sentence, and m is an integer constant, generally 1.
24. The network information extraction and processing system according to claim 13, characterized in that the hierarchy training device comprises 4 devices:
1) A collection device, which collects feature words from the leaf categories. In the hierarchy, among the feature words of the training documents (news) of each leaf category, only those that occur more than 2 times in a single training document, or more than 10 times in the training document set, are collected; these words ultimately appear in the summary. The collected feature words represent the features of the leaf category. When a leaf category belongs to a training document set, its parent category includes the features of that leaf category; the features of a non-leaf category comprise all features of its child nodes, each with the sum of its occurrence frequencies over all child nodes;
2) A hierarchy optimization device, which resolves the competition between a category node and its parent category. Because a file (news item) can be assigned to only one category in the category hierarchy, when categories compete the algorithm must determine the appropriate category for the file. It comprises the following devices:
A collection device, which collects all features in the category;
A feature judgment device, which determines whether the frequency of the feature in this category is greater than its frequency in the parent; if so, processing continues with the next device, otherwise no action is taken;
A successor lookup device, which looks up the feature lists of the successors and finds the highest and lowest frequencies of the feature among the successors;
A ratio judgment device, which determines whether the ratio of the difference between the highest and lowest frequencies to the highest frequency is greater than a threshold; if so, processing continues with the next device; otherwise the feature is deleted from all successors, so that only the parent possesses it;
A deletion device, which deletes the feature from every successor except the one in which the feature has the highest frequency;
3) A category feature weight assignment device, which assigns a weight to each feature of a category; a feature with a higher weight is more important to the category. Every feature in each category is assigned a weight defined by the following formula: $W_{fc} = \lambda + (1 - \lambda) \cdot N_{fc} / M_c$, where f is a feature present in the category, c is the category, $W_{fc}$ is the weight assigned to the feature, $\lambda$ is a parameter currently set to 0.4, $N_{fc}$ is the number of times f occurs in c, and $M_c$ is the maximum frequency of any feature in c;
When a feature occurs only in sibling categories and not in c itself, it is assigned a negative weight, and features with negative weights are added to the feature list of c. The negative weight is defined by the following formula: $W_{fc} = -(\lambda + (1 - \lambda) \cdot N_{fp} / M_p)$, where f is the feature, c is the category, $W_{fc}$ is the weight assigned to the feature, $\lambda$ is a parameter currently set to 0.4, $N_{fp}$ is the number of times f occurs in the parent category of c, and $M_p$ is the maximum frequency of any feature in the parent category of c;
4) A filtering device, which filters the feature list of each category. Whether the category is a parent category or a leaf category, only the top 200 positive features and the top 200 negative features are kept in its final feature list, and all other features are discarded. Limiting the number of features reduces the computational complexity of classifying a file.
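The weight assignment and filtering steps of claim 24 can be sketched in Java as follows. The class and method names are hypothetical, and ranking features by the absolute value of their weight when choosing the top entries is an assumption; the claim only states that the first 200 positive and the first 200 negative features are kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CategoryWeightSketch {
    static final double LAMBDA = 0.4; // value given in the claim

    /** W_fc = lambda + (1 - lambda) * N_fc / M_c for every feature present in category c. */
    static Map<String, Double> positiveWeights(Map<String, Integer> countsInCategory) {
        Map<String, Double> weights = new LinkedHashMap<>();
        if (countsInCategory.isEmpty()) return weights;
        int max = Collections.max(countsInCategory.values()); // M_c
        for (Map.Entry<String, Integer> e : countsInCategory.entrySet()) {
            weights.put(e.getKey(), LAMBDA + (1 - LAMBDA) * e.getValue() / max);
        }
        return weights;
    }

    /** W_fc = -(lambda + (1 - lambda) * N_fp / M_p) for a feature found only in sibling categories. */
    static double negativeWeight(int countInParent, int maxCountInParent) {
        return -(LAMBDA + (1 - LAMBDA) * countInParent / maxCountInParent);
    }

    /** Keeps at most `limit` positive and `limit` negative features, strongest first (assumed order). */
    static Map<String, Double> filter(Map<String, Double> weights, int limit) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(weights.entrySet());
        sorted.sort((a, b) -> Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())));
        Map<String, Double> kept = new LinkedHashMap<>();
        int positives = 0;
        int negatives = 0;
        for (Map.Entry<String, Double> e : sorted) {
            if (e.getValue() >= 0 && positives < limit) {
                kept.put(e.getKey(), e.getValue());
                positives++;
            } else if (e.getValue() < 0 && negatives < limit) {
                kept.put(e.getKey(), e.getValue());
                negatives++;
            }
        }
        return kept;
    }
}
```

With λ = 0.4 every positive weight falls between 0.4 and 1.0, so even a rare feature keeps a floor weight, while the most frequent feature of the category always receives 1.0. Calling `filter(weights, 200)` applies the limit stated in the claim.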

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA031093388A CN1536483A (en) 2003-04-04 2003-04-04 Method for extracting and processing network information and its system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA031093388A CN1536483A (en) 2003-04-04 2003-04-04 Method for extracting and processing network information and its system

Publications (1)

Publication Number Publication Date
CN1536483A true CN1536483A (en) 2004-10-13

Family

ID=34319301

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA031093388A Pending CN1536483A (en) 2003-04-04 2003-04-04 Method for extracting and processing network information and its system

Country Status (1)

Country Link
CN (1) CN1536483A (en)

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100336056C (en) * 2005-01-07 2007-09-05 清华大学 Technological term extracting, law-analysing and reusing method based no ripe technogical file
CN100399330C (en) * 2005-03-23 2008-07-02 腾讯科技(深圳)有限公司 System for managing world wide web media in world wide web page and implementing method thereof
WO2008131597A1 (en) * 2007-04-29 2008-11-06 Haitao Lin Search engine and method for filtering agency information
CN100433018C (en) * 2007-03-13 2008-11-12 白云 Method for criminating electronci file and relative degree with certain field and application thereof
CN100444591C (en) * 2006-08-18 2008-12-17 北京金山软件有限公司 Method for acquiring front-page keyword and its application system
CN100462972C (en) * 2005-12-08 2009-02-18 国际商业机器公司 Document-based information and uniform resource locator (URL) management method and device
WO2009021429A1 (en) * 2007-08-13 2009-02-19 Tencent Technology (Shenzhen) Company Limited Method and device for dealing with the instant messaging information
CN101035128B (en) * 2007-04-18 2010-04-21 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
CN101203847B (en) * 2005-03-11 2010-05-19 雅虎公司 System and method for managing listings
CN101231641B (en) * 2007-01-22 2010-05-19 北大方正集团有限公司 Method and system for automatic analysis of hotspot subject propagation process in the internet
CN1786965B (en) * 2005-12-21 2010-05-26 北大方正集团有限公司 Method for acquiring news web page text information
CN1858737B (en) * 2006-01-25 2010-06-02 华为技术有限公司 Method and system for data searching
CN101196935B (en) * 2008-01-03 2010-06-09 中兴通讯股份有限公司 System and method for creating index database
CN101192220B (en) * 2006-11-21 2010-09-15 财团法人资讯工业策进会 Label construction method and system adapting to resource searching
CN101140578B (en) * 2006-09-06 2010-12-08 鸿富锦精密工业(深圳)有限公司 Method and system for multithread analyzing web page data
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
US7925621B2 (en) 2003-03-24 2011-04-12 Microsoft Corporation Installing a solution
CN101128819B (en) * 2004-12-30 2011-06-22 谷歌公司 Local item extraction
CN102117317A (en) * 2010-12-28 2011-07-06 北京航空航天大学 Blind person Internet system based on voice technology
CN102118400A (en) * 2009-12-31 2011-07-06 北京四维图新科技股份有限公司 Data acquisition method and system
US7979856B2 (en) 2000-06-21 2011-07-12 Microsoft Corporation Network-based software extensions
US7979803B2 (en) 2006-03-06 2011-07-12 Microsoft Corporation RSS hostable control
CN102236654A (en) * 2010-04-26 2011-11-09 广东开普互联信息科技有限公司 Web useless link filtering method based on content relevancy
CN101526938B (en) * 2008-03-06 2011-12-28 夏普株式会社 File processing device
CN102385570A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Method and system for matching fonts
CN102446191A (en) * 2010-10-13 2012-05-09 北京创新方舟科技有限公司 Method for generating webpage content abstracts and equipment and system adopting same
CN101180624B (en) * 2004-10-28 2012-05-09 雅虎公司 Link-based spam detection
CN102446311A (en) * 2010-10-15 2012-05-09 商业对象软件有限公司 Business intelligence technology for process driving
CN102460437A (en) * 2009-06-26 2012-05-16 乐天株式会社 Information search device, information search method, information search program, and storage medium on which information search program has been stored
CN102521313A (en) * 2011-12-01 2012-06-27 北京大学 Static index pruning method based on web page quality
CN101055581B (en) * 2006-04-13 2012-07-04 Lg电子株式会社 Document management system and method
CN102592039A (en) * 2011-01-18 2012-07-18 四川火狐无线科技有限公司 Interaction method for processing cantering and entertainment service data and device and system for realizing same
CN101751438B (en) * 2008-12-17 2012-08-22 中国科学院自动化研究所 Theme webpage filter system for driving self-adaption semantics
US8280843B2 (en) 2006-03-03 2012-10-02 Microsoft Corporation RSS data-processing object
CN102812475A (en) * 2009-12-24 2012-12-05 梅塔瓦纳股份有限公司 System And Method For Determining Sentiment Expressed In Documents
CN102902757A (en) * 2012-09-25 2013-01-30 姚明东 Automatic generation method of e-commerce dictionary
CN102945246A (en) * 2012-09-28 2013-02-27 北界创想(北京)软件有限公司 Method and device for processing network information data
CN102955791A (en) * 2011-08-23 2013-03-06 句容今太科技园有限公司 Searching and classifying service system for network information
US8429522B2 (en) 2003-08-06 2013-04-23 Microsoft Corporation Correlation, association, or correspondence of electronic forms
CN103149840A (en) * 2013-02-01 2013-06-12 西北工业大学 Semanteme service combination method based on dynamic planning
CN103150632A (en) * 2013-03-13 2013-06-12 河海大学 Structuring method for flood control and drought control bulletin generation system based on water conservancy cloud platform
CN103488750A (en) * 2013-09-24 2014-01-01 长沙裕邦软件开发有限公司 Implementation method and system of network robot
US8661459B2 (en) 2005-06-21 2014-02-25 Microsoft Corporation Content syndication platform
US8751936B2 (en) 2005-06-21 2014-06-10 Microsoft Corporation Finding and consuming web subscriptions in a web browser
CN103853834A (en) * 2014-03-12 2014-06-11 华东师范大学 Text structure analysis-based Web document abstract generation method
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
US8892993B2 (en) 2003-08-01 2014-11-18 Microsoft Corporation Translation file
US8918729B2 (en) 2003-03-24 2014-12-23 Microsoft Corporation Designing electronic forms
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN104657347A (en) * 2015-02-06 2015-05-27 北京中搜网络技术股份有限公司 News optimized reading mobile application-oriented automatic summarization method
CN105005563A (en) * 2014-04-15 2015-10-28 腾讯科技(深圳)有限公司 Abstract generation method and apparatus
US9210234B2 (en) 2005-12-05 2015-12-08 Microsoft Technology Licensing, Llc Enabling electronic documents for limited-capability computing devices
US9229917B2 (en) 2003-03-28 2016-01-05 Microsoft Technology Licensing, Llc Electronic form user interfaces
CN105760500A (en) * 2009-11-10 2016-07-13 启创互联公司 System, method and computer program for creating and manipulating data structures using an interactive graphical interface
CN106383887A (en) * 2016-09-22 2017-02-08 深圳市博安达信息技术股份有限公司 Environment-friendly news data acquisition and recommendation display method and system
US10146843B2 (en) 2009-11-10 2018-12-04 Primal Fusion Inc. System, method and computer program for creating and manipulating data structures using an interactive graphical interface
CN109086361A (en) * 2018-07-20 2018-12-25 北京开普云信息科技有限公司 A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN112115259A (en) * 2020-06-17 2020-12-22 上海金融期货信息技术有限公司 Feature word driven text multi-label hierarchical classification method and system
CN113190644A (en) * 2021-05-24 2021-07-30 浪潮软件科技有限公司 Method and device for hot updating search engine word segmentation dictionary
CN113486279A (en) * 2021-06-29 2021-10-08 平安信托有限责任公司 Automatic news generation method, device, equipment and storage medium

Cited By (85)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979856B2 (en) 2000-06-21 2011-07-12 Microsoft Corporation Network-based software extensions
US7925621B2 (en) 2003-03-24 2011-04-12 Microsoft Corporation Installing a solution
US8918729B2 (en) 2003-03-24 2014-12-23 Microsoft Corporation Designing electronic forms
US9229917B2 (en) 2003-03-28 2016-01-05 Microsoft Technology Licensing, Llc Electronic form user interfaces
US9239821B2 (en) 2003-08-01 2016-01-19 Microsoft Technology Licensing, Llc Translation file
US8892993B2 (en) 2003-08-01 2014-11-18 Microsoft Corporation Translation file
US9268760B2 (en) 2003-08-06 2016-02-23 Microsoft Technology Licensing, Llc Correlation, association, or correspondence of electronic forms
US8429522B2 (en) 2003-08-06 2013-04-23 Microsoft Corporation Correlation, association, or correspondence of electronic forms
CN101180624B (en) * 2004-10-28 2012-05-09 雅虎公司 Link-based spam detection
US8433704B2 (en) 2004-12-30 2013-04-30 Google Inc. Local item extraction
JP2011129154A (en) * 2004-12-30 2011-06-30 Google Inc Local item extraction
CN101128819B (en) * 2004-12-30 2011-06-22 谷歌公司 Local item extraction
CN100336056C (en) * 2005-01-07 2007-09-05 清华大学 Technological term extracting, law-analysing and reusing method based no ripe technogical file
CN101203847B (en) * 2005-03-11 2010-05-19 雅虎公司 System and method for managing listings
CN100399330C (en) * 2005-03-23 2008-07-02 腾讯科技(深圳)有限公司 System for managing world wide web media in world wide web page and implementing method thereof
US9104773B2 (en) 2005-06-21 2015-08-11 Microsoft Technology Licensing, Llc Finding and consuming web subscriptions in a web browser
US8661459B2 (en) 2005-06-21 2014-02-25 Microsoft Corporation Content syndication platform
US8751936B2 (en) 2005-06-21 2014-06-10 Microsoft Corporation Finding and consuming web subscriptions in a web browser
US8832571B2 (en) 2005-06-21 2014-09-09 Microsoft Corporation Finding and consuming web subscriptions in a web browser
US9894174B2 (en) 2005-06-21 2018-02-13 Microsoft Technology Licensing, Llc Finding and consuming web subscriptions in a web browser
US9762668B2 (en) 2005-06-21 2017-09-12 Microsoft Technology Licensing, Llc Content syndication platform
US9210234B2 (en) 2005-12-05 2015-12-08 Microsoft Technology Licensing, Llc Enabling electronic documents for limited-capability computing devices
CN100462972C (en) * 2005-12-08 2009-02-18 国际商业机器公司 Document-based information and uniform resource locator (URL) management method and device
CN1786965B (en) * 2005-12-21 2010-05-26 北大方正集团有限公司 Method for acquiring news web page text information
CN1858737B (en) * 2006-01-25 2010-06-02 华为技术有限公司 Method and system for data searching
US8280843B2 (en) 2006-03-03 2012-10-02 Microsoft Corporation RSS data-processing object
US8768881B2 (en) 2006-03-03 2014-07-01 Microsoft Corporation RSS data-processing object
US7979803B2 (en) 2006-03-06 2011-07-12 Microsoft Corporation RSS hostable control
CN101055581B (en) * 2006-04-13 2012-07-04 Lg电子株式会社 Document management system and method
CN100444591C (en) * 2006-08-18 2008-12-17 北京金山软件有限公司 Method for acquiring front-page keyword and its application system
CN101140578B (en) * 2006-09-06 2010-12-08 鸿富锦精密工业(深圳)有限公司 Method and system for multithread analyzing web page data
CN101192220B (en) * 2006-11-21 2010-09-15 财团法人资讯工业策进会 Label construction method and system adapting to resource searching
CN101231641B (en) * 2007-01-22 2010-05-19 北大方正集团有限公司 Method and system for automatic analysis of hotspot subject propagation process in the internet
CN100433018C (en) * 2007-03-13 2008-11-12 白云 Method for criminating electronci file and relative degree with certain field and application thereof
CN101035128B (en) * 2007-04-18 2010-04-21 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
WO2008131597A1 (en) * 2007-04-29 2008-11-06 Haitao Lin Search engine and method for filtering agency information
US8204946B2 (en) 2007-08-13 2012-06-19 Tencent Technology (Shenzhen) Company Ltd. Method and apparatus for processing instant messaging information
WO2009021429A1 (en) * 2007-08-13 2009-02-19 Tencent Technology (Shenzhen) Company Limited Method and device for dealing with the instant messaging information
CN101196935B (en) * 2008-01-03 2010-06-09 中兴通讯股份有限公司 System and method for creating index database
CN101526938B (en) * 2008-03-06 2011-12-28 夏普株式会社 File processing device
CN101751438B (en) * 2008-12-17 2012-08-22 中国科学院自动化研究所 Theme webpage filter system for driving self-adaption semantics
CN102460437A (en) * 2009-06-26 2012-05-16 乐天株式会社 Information search device, information search method, information search program, and storage medium on which information search program has been stored
CN102460437B (en) * 2009-06-26 2014-10-15 乐天株式会社 Information search device, information search method, information search program, and storage medium on which information search program has been stored
CN105760500B (en) * 2009-11-10 2019-08-09 启创互联公司 System and method for being created using interactive graphics (IG) interface and manipulating data structure
CN105760500A (en) * 2009-11-10 2016-07-13 启创互联公司 System, method and computer program for creating and manipulating data structures using an interactive graphical interface
US10146843B2 (en) 2009-11-10 2018-12-04 Primal Fusion Inc. System, method and computer program for creating and manipulating data structures using an interactive graphical interface
CN102812475A (en) * 2009-12-24 2012-12-05 梅塔瓦纳股份有限公司 System And Method For Determining Sentiment Expressed In Documents
CN102118400B (en) * 2009-12-31 2013-07-17 北京四维图新科技股份有限公司 Data acquisition method and system
CN102118400A (en) * 2009-12-31 2011-07-06 北京四维图新科技股份有限公司 Data acquisition method and system
CN102236654A (en) * 2010-04-26 2011-11-09 广东开普互联信息科技有限公司 Web useless link filtering method based on content relevancy
CN102385570A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Method and system for matching fonts
US9218325B2 (en) 2010-08-31 2015-12-22 International Business Machines Corporation Quick font match
US9002877B2 (en) 2010-08-31 2015-04-07 International Business Machines Corporation Quick font match
CN102446191A (en) * 2010-10-13 2012-05-09 北京创新方舟科技有限公司 Method for generating webpage content abstracts and equipment and system adopting same
CN102446311A (en) * 2010-10-15 2012-05-09 商业对象软件有限公司 Business intelligence technology for process driving
CN102446311B (en) * 2010-10-15 2016-12-21 商业对象软件有限公司 The business intelligence of proceduredriven
CN101984435B (en) * 2010-11-17 2012-10-10 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN102117317B (en) * 2010-12-28 2012-08-22 北京航空航天大学 Blind person Internet system based on voice technology
CN102117317A (en) * 2010-12-28 2011-07-06 北京航空航天大学 Blind person Internet system based on voice technology
CN102592039A (en) * 2011-01-18 2012-07-18 四川火狐无线科技有限公司 Interaction method for processing cantering and entertainment service data and device and system for realizing same
CN102955791A (en) * 2011-08-23 2013-03-06 句容今太科技园有限公司 Searching and classifying service system for network information
CN102521313A (en) * 2011-12-01 2012-06-27 北京大学 Static index pruning method based on web page quality
CN102902757B (en) * 2012-09-25 2015-07-29 姚明东 A kind of Automatic generation method of e-commerce dictionary
CN102902757A (en) * 2012-09-25 2013-01-30 姚明东 Automatic generation method of e-commerce dictionary
CN102945246A (en) * 2012-09-28 2013-02-27 北界创想(北京)软件有限公司 Method and device for processing network information data
CN103149840B (en) * 2013-02-01 2015-03-04 西北工业大学 Semanteme service combination method based on dynamic planning
CN103149840A (en) * 2013-02-01 2013-06-12 西北工业大学 Semanteme service combination method based on dynamic planning
CN103150632A (en) * 2013-03-13 2013-06-12 河海大学 Structuring method for flood control and drought control bulletin generation system based on water conservancy cloud platform
CN103150632B (en) * 2013-03-13 2016-03-16 河海大学 Flood control based on water conservation cloud platform is taked precautions against drought the construction method of bulletin generation system
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN103488750A (en) * 2013-09-24 2014-01-01 长沙裕邦软件开发有限公司 Implementation method and system of network robot
CN103853834A (en) * 2014-03-12 2014-06-11 华东师范大学 Text structure analysis-based Web document abstract generation method
CN103853834B (en) * 2014-03-12 2017-02-08 华东师范大学 Text structure analysis-based Web document abstract generation method
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
CN105005563A (en) * 2014-04-15 2015-10-28 腾讯科技(深圳)有限公司 Abstract generation method and apparatus
CN104657347A (en) * 2015-02-06 2015-05-27 北京中搜网络技术股份有限公司 News optimized reading mobile application-oriented automatic summarization method
CN106383887A (en) * 2016-09-22 2017-02-08 深圳市博安达信息技术股份有限公司 Environment-friendly news data acquisition and recommendation display method and system
CN106383887B (en) * 2016-09-22 2023-04-07 深圳博沃智慧科技有限公司 Method and system for collecting, recommending and displaying environment-friendly news data
CN109086361A (en) * 2018-07-20 2018-12-25 北京开普云信息科技有限公司 A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN109086361B (en) * 2018-07-20 2019-06-21 北京开普云信息科技有限公司 A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN112115259A (en) * 2020-06-17 2020-12-22 上海金融期货信息技术有限公司 Feature word driven text multi-label hierarchical classification method and system
CN113190644A (en) * 2021-05-24 2021-07-30 浪潮软件科技有限公司 Method and device for hot updating search engine word segmentation dictionary
CN113190644B (en) * 2021-05-24 2023-01-13 浪潮软件科技有限公司 Method and device for hot updating word segmentation dictionary of search engine
CN113486279A (en) * 2021-06-29 2021-10-08 平安信托有限责任公司 Automatic news generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN1536483A (en) Method for extracting and processing network information and its system
CN1198225C (en) Keyword extracting system and text retneval system using the same
CN1096038C (en) Method and equipment for file retrieval based on Bayesian network
CN1669029A (en) System and method for automatically discovering a hierarchy of concepts from a corpus of documents
CN1904896A (en) Structured document processing apparatus, search apparatus, structured document system and method
CN1707476A (en) Auxiliary translation searching engine system and method thereof
CN1559044A (en) Content information analyzing method and apparatus
CN1942877A (en) Information extraction system
CN1535433A (en) Category based, extensible and interactive system for document retrieval
CN1667609A (en) Document information management system and document information management method
CN1728143A (en) Phrase-based generation of document description
CN100336056C (en) Technological term extracting, law-analysing and reusing method based no ripe technogical file
CN1691007A (en) Method, system or memory storing a computer program for document processing
CN1328668A (en) System and method for specifying www site
CN1728140A (en) Phrase-based indexing in an information retrieval system
CN1924858A (en) Method and device for fetching new words and input method system
CN1728141A (en) Phrase-based searching in an information retrieval system
CN1871603A (en) System and method for processing a query
CN1281191A (en) Information retrieval method and information retrieval device
CN1728142A (en) Phrase identification in an information retrieval system
CN1319836A (en) Method and device for converting expressing mode
CN1882943A (en) Systems and methods for search processing using superunits
CN1286776A (en) Document processor and recording medium
CN1625740A (en) Index structure of metadata, method for providing indices of metadata, and metadata searching method and apparatus using the indices of metadata
CN1269897A (en) Methods and/or system for selecting data sets

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20041013