CN101853300A - Method and system for identifying and evaluating video downloading service website - Google Patents

Method and system for identifying and evaluating video downloading service website

Info

Publication number
CN101853300A
Authority
CN
China
Prior art keywords
video
website
url
information
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201010186795A
Other languages
Chinese (zh)
Other versions
CN101853300B (en)
Inventor
刘锐
朱明�
易荣峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN2010101867951A
Publication of CN101853300A
Application granted
Publication of CN101853300B
Expired - Fee Related
Anticipated expiration

Abstract

The embodiment of the invention provides a method and a system for identifying and evaluating a video downloading service website. The method comprises the following steps of: acquiring a URL list of a homepage of a target website which needs to be processed and calling a webpage acquisition module to acquire a webpage of the target website according to the acquired URL list under the instruction of an acquisition rule made by a rule database; identifying whether the target website is the website which provides a video downloading service, and simultaneously updating the identified video information to a pre-established video information database; if the website is identified to be the video downloading service website and is visited for the first time, evaluating the website; and extracting related information of videos on the webpage of the target website, saving the related information to the video information database for feeding back and instructing website identification and evaluation, and simultaneously updating the rule database. Therefore, the method and the system for identifying and evaluating the video downloading service website can precisely identify the website which provides the video downloading service, track the latest updated video information of the website, and effectively evaluate the health and the legality of the website to construct a healthier and safer network system.

Description

Method and system for identifying and evaluating a video download service website
Technical field
The present invention relates to the field of network communications, and in particular to a method and system for identifying and evaluating video download service websites.
Background art
With the development of multimedia technology, a growing number of websites offer video download services. Multimedia content carries large volumes of information, has pronounced copyright characteristics, and its degree of healthiness has a broad social impact. Automatically identifying the websites on a network that provide video download services, tracking their content updates, and assessing how healthy those websites are is therefore significant for protecting multimedia copyright and for building a healthy network.
In the field of network information processing, fairly mature information extraction and content topic recognition techniques already exist. The information extraction techniques mainly include: adaptive web page metadata extraction, which combines weighted tree matching with the aggregation of extraction patterns; automatic data extraction from template-generated web pages, which detects a template from example pages and uses it to extract data automatically; and sample-based extraction of Internet structured data, which enables fast automatic extraction of structured data from the Internet.
The content topic recognition techniques are mainly statistics-based methods, knowledge-based methods, or hybrid methods that combine the two. Specifically, statistics-based methods mainly use the frequency, positional features and co-occurrence information of feature terms, without any additional knowledge base such as a machine-readable dictionary. Knowledge-based methods rely mainly on syntactic/semantic analyzers and on knowledge bases such as machine-readable dictionaries, without any corpus statistics. Hybrid methods can combine the advantages of both.
In the prior art, however, information extraction and content topic recognition operate independently of each other, and there is no effective, dedicated method for identifying and evaluating video download service websites.
Summary of the invention
The embodiments of the invention provide a method and system for identifying and evaluating video download service websites that can accurately identify websites providing video download services, track their most recently updated video information, and effectively assess the health and legality of those websites, thereby helping to build a healthier and safer network.
An embodiment of the invention provides a method for identifying and evaluating a video download service website, the method comprising:
obtaining a uniform resource locator (URL) list of the home pages of the target websites to be processed, and invoking a web page acquisition module to collect the web pages of a target website according to the obtained URL list, guided by the collection rules held in a rule database;
analyzing the collected web pages of the target website by association analysis and deep URL probing to identify whether the website provides a video download service, and at the same time updating the identified video information into a pre-established video information database;
if the website is identified as a video download service website and is being visited for the first time, performing correlation analysis on the website using the video information in the video information database, thereby completing the assessment of the website;
extracting the relevant information of the video pages of the target website using pre-established information extraction rules, storing it in the video information database, and at the same time updating the rule database.
The invention also provides a system for identifying and evaluating a video download service website, the system comprising:
a web page acquisition module, configured to obtain the URL list of the home pages of the target websites to be processed, and to collect the web pages of a target website according to the obtained URL list;
an identification module, configured to analyze the collected web pages of the target website by association analysis and deep URL probing, to identify whether the website provides a video download service, and at the same time to update the identified video information into the pre-established video information database;
an evaluation module, configured to perform keyword matching on the website using the video information in the video information database when the identification module identifies the website as a video download service website and the website is being visited for the first time, thereby completing the assessment of the website;
an information extraction module, configured to extract the relevant information of the video pages of the target website using pre-established information extraction rules, and to store it in the video information database.
As can be seen from the technical solution above, the URL list of the home pages of the target websites to be processed is obtained first, and the web page acquisition module is invoked to collect the web pages of the target website according to that list. The collected pages are analyzed by association analysis and deep URL probing to identify whether the website provides a video download service, and the identified video information is updated into the pre-established video information database. If the website is identified as a video download service website and is being visited for the first time, correlation analysis is performed on it using the video information in the video information database, completing its assessment. The relevant information of the video pages of the target website is extracted using pre-established information extraction rules and stored in the video information database. This technical solution makes it possible to accurately identify websites providing video download services, track their most recently updated video information, and effectively assess the health and legality of those websites, thereby helping to build a healthier and safer network.
Description of the drawings
Fig. 1 is a schematic flowchart of the method for identifying and evaluating a video download service website provided by an embodiment of the invention;
Fig. 2 is a schematic structural diagram of the system for identifying and evaluating a video download service website provided by an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention provide a method and system for identifying and evaluating video download service websites that can accurately identify websites providing video download services, track their most recently updated video information, and effectively assess the health and legality of those websites, thereby helping to build a healthier and safer network.
To better describe the embodiments of the invention, specific embodiments are now described with reference to the drawings. Fig. 1 is a schematic flowchart of the method for identifying and evaluating a video download service website provided by an embodiment of the invention, and comprises:
Step 11: obtain the URL list of the home pages of the target websites to be processed, that is, the list of home page addresses.
In this step, the URL list file of the target website home pages to be processed is obtained first. In a specific implementation, the thresholds of each system module can also be initialized, together with the video titles in the video information database, the assessment keyword database, and so on.
Step 12: invoke the web page acquisition module to collect the web pages of the target website according to the collection rules in the rule database.
In this step, once the URL list file of the target website home pages has been obtained, the web page acquisition module can be invoked to collect the web pages of the target website according to the obtained URL list, guided by the collection rules held in the rule database.
In a specific implementation, the collection rules held in the rule database are the URL features of video service pages, and are used to help the web page acquisition module maintain the corresponding URL list.
The web page acquisition module collects the web pages of the target website from the obtained URL list as follows. Those skilled in the art may of course propose other modifications or variations based on the following scheme, and all such modifications or variations fall within the scope of the invention:
First, the home page of the target website is visited. The depth value of the current home page is set to 0, and the home page itself is a parent URL node.
Next, all web page addresses in the home page that point within the site (that is, to the same domain name) are obtained. For convenience, any one of these addresses is denoted URL_1, where the subscript indicates a depth value of 1; each is marked as a child URL node of the home page and placed in a queue.
If the pre-established location rules for video service columns are not empty, the URL set of the video service columns corresponding to those rules is appended to the tail of the queue, with depth value 0, each such URL itself being a parent URL node. Here, the pre-established location rules for video service columns comprise a set of URLs used to locate the video service columns of the current site; each column contains multiple video service sub-pages, and the rules help the web page acquisition module maintain the corresponding URL list.
Let the depth currently being visited be i and the web page address be URL_i. If the pre-established web page collection rules are not empty, the URL list is adjusted according to those rules: all web page addresses in the currently visited page that contain the URL features (that is, addresses of depth i+1, of the form URL_{i+1}) are appended to the tail of the queue with priority; otherwise, all URL_{i+1} are appended to the tail of the queue in the order in which they were obtained.
Web page addresses URL_i of depth value i are then taken from the head of the queue in turn, the corresponding page is downloaded, and all web page addresses URL_{i+1} (depth value i+1) in that page that point within the site are obtained, building a linked list (URL_{i-1}, URL_i, URL_{i+1}) consisting of the page together with its parent and child nodes.
If m of these (m <= i) are video service pages and are not child nodes of the home page, the video weight coefficient of the parent node URL_{i-1} (the web page address of depth value i-1) is marked as m, written (URL_{i-1}, m), indicating that the web page at address URL_{i-1} contains m video service page URLs.
The visit loops until the pre-specified depth threshold is reached, and the URLs whose video weight coefficient m exceeds the pre-specified threshold are stored in the video service column location rules.
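The collection procedure above can be illustrated with a minimal Python sketch. It is a simplified reading of the text, not the patented implementation: it counts, for each visited page, how many of its in-site child URLs look like video service pages, and it omits the special handling of home page child nodes and of pre-existing column rules. The names `find_video_columns`, `fetch_links`, `is_video_url` and the in-memory `LINKS` site are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def find_video_columns(home, fetch_links, is_video_url, max_depth=2, min_weight=1):
    """Breadth-first crawl of one site; returns pages whose video weight
    coefficient m (number of video service child URLs) exceeds min_weight."""
    site = urlparse(home).netloc
    queue = deque([(home, 0)])  # the home page is the parent node at depth 0
    seen = {home}
    weights = {}                # parent URL -> video weight coefficient m
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:  # depth threshold reached: do not expand further
            continue
        in_site = [u for u in fetch_links(url) if urlparse(u).netloc == site]
        video = [u for u in in_site if is_video_url(u)]
        other = [u for u in in_site if not is_video_url(u)]
        weights[url] = len(set(video))
        for child in video + other:  # URLs matching the URL feature go first
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return sorted(u for u, m in weights.items() if m > min_weight)


# Hypothetical in-memory site used instead of real HTTP requests.
LINKS = {
    "http://example.com/": ["http://example.com/movies",
                            "http://example.com/about",
                            "http://other.org/x"],
    "http://example.com/movies": ["http://example.com/video/1",
                                  "http://example.com/video/2",
                                  "http://example.com/about"],
}
columns = find_video_columns("http://example.com/",
                             lambda u: LINKS.get(u, []),
                             lambda u: "/video/" in u)
```

In a real crawler `fetch_links` would download and parse the page; keeping it as a parameter lets the queue and weighting logic be exercised without network access.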
Step 13: identify whether the website provides a video download service.
In this step, the collected web pages of the target website are analyzed by association analysis and deep URL probing to identify whether the website provides a video download service, and at the same time the identified video information is updated into the pre-established video information database.
For example, the process of identifying whether a website provides a video download service is as follows. Those skilled in the art may of course propose other modifications or variations based on the following scheme, and all such modifications or variations fall within the scope of the invention:
First, the association analyzer is used, with the healthy-class keywords set in the assessment keyword database as input, to analyze the degree of association between the web pages of the target website and the video download service topic. If the pre-established threshold condition is satisfied, processing continues.
Next, the deep URL prober is invoked to identify URLs that are video download addresses and to probe them in depth. If the probing finds that a web page of the target website contains key fields related to video downloading, that page is marked as a page providing a video download service.
The downloaded file name obtained by parsing (without the extension) is then stored as the video title in the pre-established video information database, and the most-recent-discovery time of the video is updated.
In addition, when the association analyzer is used, if its input is the healthy-class keywords in the assessment keyword database of the video information database, its function is to statistically analyze the degree of association between the information contained in the target page and the video topic, identifying pages that may provide a video service. If its input is the bad-class keywords in the assessment keyword database of the video information database, its function is to statistically analyze the degree of association between the information contained in the target page and objectionable information, identifying pages that may provide an objectionable video download service. Specifically:
1) Load the assessment keyword database and, according to the occurrence frequency $F_i$ of each keyword, assign each keyword $K_i$ a weight $W_i = F_i / \sum_{k=1}^{N} F_k$;
2) Match the target page against each keyword in turn; if keyword $K_j$ appears in the target page, record $W_j$;
3) Sum the weights of all keywords contained in the page, that is, $\sum W_j$. If the sum lies within the specified threshold range, $V_{min} < \sum W_j < V_{max}$, where $V_{min}$ and $V_{max}$ are minimum and maximum constant thresholds specified in advance from experience, the page is judged to have been analyzed successfully; otherwise the analysis ends.
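Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under assumptions: the keyword frequencies, the sample page text and the threshold values are invented, and keyword matching is done by plain substring search, which the text does not specify.

```python
def keyword_weights(freqs):
    """Step 1: W_i = F_i / sum_k F_k for every keyword K_i."""
    total = sum(freqs.values())
    return {keyword: f / total for keyword, f in freqs.items()}


def association_score(page_text, weights):
    """Steps 2 and 3: sum the weights W_j of the keywords found in the page."""
    return sum(w for keyword, w in weights.items() if keyword in page_text)


def page_matches(page_text, weights, v_min, v_max):
    """The page passes when V_min < sum(W_j) < V_max."""
    return v_min < association_score(page_text, weights) < v_max


# Hypothetical keyword frequencies and page text.
weights = keyword_weights({"director": 3, "title": 5, "year": 2})
score = association_score("title: Some Movie, director: Someone", weights)
ok = page_matches("title: Some Movie, director: Someone", weights, 0.5, 1.01)
```

Because the weights are normalized, the score of any page lies between 0 and 1, which makes the two thresholds comparable across keyword databases of different sizes.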
In addition, the deep URL prober probes URLs in depth to sift out the true download addresses and to detect video-related information such as the video title and video format. It may comprise the following steps:
A) Obtain the objects to analyze, including all URLs in the page and all URLs under the XML document element path (xpath) of the video summary information in the sub-pages that those URLs point to;
B) If a URL contains a key field such as "thunder://", "flashget://", "ed2k://" or "bc://", the URL is judged to be a class-one download address. The corresponding download tool is invoked to parse the URL (or the URL is first converted to another class of download address by base64 encoding/decoding and then parsed) to obtain information such as the video title. If the probing succeeds, the process ends; otherwise step c is performed;
C) If a URL contains a key field such as "down", "tid", "aid", "attachment" or ".torrent", the URL is judged to be a class-two download address and is stored in a queue. The URLs are taken out in turn, a connection request is issued for each, the header fields of the HTTP response message are parsed, and the filename value in Content-Disposition is obtained. If the value contains ".torrent", the torrent seed file corresponding to the URL is read and step d is performed; otherwise all members of the queue are tried, and if none is found, step e is performed;
D) Parse the content of the seed file, locate the download file name using common video extensions such as ".rmvb", ".avi", ".mkv" and ".wmv", and take the part between the ASCII colon ":" and the extension to obtain information such as the video title. The probing has then succeeded and the process ends;
E) If a URL contains "hash" and its domain name points outside the site, the URL is judged to be a class-three download address. The COM (Component Object Model) interface of the IE browser is invoked to open the website and locate the submission form for the seed download; the submit button is located and a click is simulated, the torrent seed file is read, and the process returns to the previous step;
F) If a URL contains a key field such as ".avi", ".mkv", ".rmvb" or "ftp://", the URL is judged to be a class-four download address. The part after the last separator "/" is taken (excluding the separator and the extension) to obtain information such as the video title. The probing has then succeeded and the process ends;
G) If the video title has still not been obtained, the page title TITLE is obtained. If it contains the name of the target site, the site name is removed. If it contains space characters, TITLE is split into segments at the spaces, and the segments are merged from left to right until the length of the merged string exceeds half the length of TITLE; the merged part is used as the video title.
Through the above process, the true download addresses can be sifted out and video-related information such as the video title and video format can be detected.
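The two title fallbacks at the end of the probing procedure (taking the file name from a direct video URL, and trimming the page title) are simple string operations. The sketch below is one possible reading; the function names and sample inputs are hypothetical, and the half-length rule is measured on the title after the site name has been removed, which the text leaves open.

```python
import posixpath
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Class-four address: file name after the last '/', without extension."""
    name = posixpath.basename(unquote(urlparse(url).path))
    return posixpath.splitext(name)[0]


def title_from_page_title(page_title, site_name=""):
    """Last resort: drop the site name, then merge space-separated segments
    from the left until they cover more than half of the remaining title."""
    title = page_title
    if site_name:
        title = title.replace(site_name, "")
    title = title.strip()
    parts = title.split()
    if len(parts) < 2:
        return title
    merged = ""
    for part in parts:
        merged = f"{merged} {part}".strip()
        if len(merged) > len(title) / 2:
            break
    return merged


url_title = title_from_url("ftp://example.com/films/Some.Movie.2009.avi")
page_title = title_from_page_title("Some Movie 2009 download - ExampleSite",
                                   "ExampleSite")
```

The page-title heuristic assumes the video name comes first in the title, which is common on download sites but not guaranteed.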
In addition, in the embodiments of the invention, a class-one download address is a download address associated with P2P download software such as Thunder (Xunlei) or FlashGet, and can be parsed and downloaded by the P2P software. A class-two download address is a seed download address that corresponds to a seed file located on the back-end server of the target website. A class-three download address is also a seed download address; it differs from a class-two address in that it corresponds to a seed file on a third-party website server. A class-four download address is a video file download address that corresponds to a video file located on the back-end server of the target website or of a third-party website.
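As a rough illustration of the four address classes, the following Python sketch lists which classes a URL could belong to, in the order the prober would try them. It only performs the key-field checks described in the text; actually resolving an address (contacting a download tool, reading a seed file) is out of scope. The function name, the constant names and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

P2P_SCHEMES = ("thunder://", "flashget://", "ed2k://", "bc://")
SEED_HINTS = ("down", "tid", "aid", "attachment", ".torrent")
FILE_HINTS = (".avi", ".mkv", ".rmvb", "ftp://")


def candidate_classes(url, site_host):
    """Return the download-address classes (1 to 4) whose key fields occur
    in the URL, in the order in which the prober would try them."""
    low = url.lower()
    classes = []
    if any(s in low for s in P2P_SCHEMES):
        classes.append(1)                      # P2P client address
    if any(h in low for h in SEED_HINTS):
        classes.append(2)                      # seed file on the target site
    host = urlparse(url).netloc.lower()
    if "hash" in low and host and host != site_host.lower():
        classes.append(3)                      # seed file on a third-party site
    if any(h in low for h in FILE_HINTS):
        classes.append(4)                      # direct video file
    return classes


c1 = candidate_classes("thunder://abc123", "example.com")
c2 = candidate_classes("http://example.com/attachment.php?aid=42", "example.com")
c3 = candidate_classes("http://seeds.example.org/info?hash=abcd", "example.com")
c4 = candidate_classes("ftp://example.com/films/movie.mkv", "example.com")
```

Returning every candidate class, not just the first, mirrors the fall-through in the text: a URL that fails as a class-two address may still succeed as a class-four address.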
Step 14: if the website is identified as a video download service website and is being visited for the first time, assess the target website.
In this step, it is first determined whether the assessment trigger has fired. If it has, the website is considered a video download service website and subsequent processing begins. Otherwise, if the trigger has still not fired when the web page acquisition module reaches the specified depth or URL count, the website is considered not to provide a video download service and subsequent processing of the website ends.
If the website is considered a video download service website, it is further determined whether the target website is being visited for the first time; if so, subsequent processing begins and the target website is assessed.
In a specific implementation, correlation analysis can be performed on the website using the video information in the video information database to complete the assessment. The process comprises:
Using the association analyzer, with the bad-class keywords set in the assessment keyword database as input, to identify pages providing an objectionable video download service; using the random correlation matching module to detect the degree of correlation between the downloadable videos offered by the target website and the historical website videos in the video information database, and to return the number of matching video titles; and using the comprehensive decision module to determine the nature of the target website from the results returned by the association analyzer and the random correlation matching module, completing the assessment of the website.
For example, the random correlation matching module in the embodiments of the invention counts the number of matches between the downloadable videos offered by the target website and the healthy-class and bad-class videos in the existing database. To improve matching speed and efficiency, matching considers only those historical website video titles whose most-recent-discovery time differs from that of the new-site video title by no more than a time threshold T (for example, one week or one month). It comprises the following steps:
A) Initialize the healthy video title match count AM = 0 and the bad video title match count BM = 0;
B) Randomly draw N new-site video titles (for example, 10 to 100) and invoke an open-source string correlation analysis algorithm;
C) Judge the correlation of each drawn title against the historical website video titles (healthy class) in turn; if the two are correlated, the title is a successful match and AM is incremented;
D) Judge the correlation of each drawn title against the historical website video titles (bad class) in turn; if the two are correlated, the title is a successful match and BM is incremented;
E) Return the AM and BM values.
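A minimal sketch of steps A to E follows. The text does not name the open-source string correlation algorithm, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in, and the 0.8 cutoff is an assumed value. The historical title lists are assumed to have been filtered by the time threshold T already.

```python
import random
from difflib import SequenceMatcher


def related(a, b, cutoff=0.8):
    """Assumed correlation test: similarity ratio of the lowercased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def random_match(new_titles, healthy_titles, bad_titles, n=10, rng=random):
    """Draw up to n new-site titles at random and count how many correlate
    with a healthy-class (AM) or bad-class (BM) historical title."""
    sample = rng.sample(new_titles, min(n, len(new_titles)))
    am = sum(any(related(t, h) for h in healthy_titles) for t in sample)
    bm = sum(any(related(t, b) for b in bad_titles) for t in sample)
    return am, bm


# Hypothetical titles; n exceeds the list length, so every title is sampled.
am, bm = random_match(["Some Movie 2009", "Another Film"],
                      healthy_titles=["some movie 2009"],
                      bad_titles=["Another Film HD"])
```

Sampling keeps the cost bounded by N times the size of the filtered history, independent of how many titles the new site lists.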
The comprehensive decision module in the embodiments of the invention assesses the website from the results returned by the association analyzer and the random correlation matching module. It comprises the following steps:
A) First, if the trigger has fired, the website is judged to provide a video download service and its grade is defined as 0;
B) When the grade is 0, for the association analyzer whose input is the bad-class assessment keyword database, take the maximum over all pages of the sum of the weights of the keywords matched in each page, that is, $W = \max\{\sum W_j\}$. If W exceeds the specified threshold, the website is judged as possibly providing an objectionable video download service and its grade is raised to 1;
C) When the grade is greater than or equal to 0, if the AM value returned by the random correlation matching module exceeds the specified threshold, the website is judged to provide an objectionable video download service and its grade is raised to 2;
D) When the grade is greater than or equal to 0, if the BM value returned by the random correlation matching module exceeds the specified threshold, the website is judged to provide an objectionable video download service and its grade is raised to 3;
E) The website is then re-marked as a historical website, and the new-site video information database is correspondingly re-marked as the historical website video information database: if the grade is 0 it is marked as healthy class, and if the grade is 2 or 3 it is marked as bad class.
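The grading steps can be written as a short Python function. The sketch follows steps A to E literally as worded above, including the AM check in step C; the text gives no label for grade 1, so the sketch returns no label in that case. The function name, argument names and threshold values are hypothetical.

```python
def assess(triggered, w_bad, am, bm, w_threshold, am_threshold, bm_threshold):
    """Return (grade, label) for a site, following steps A to E as written.

    w_bad is the maximum per-page sum of bad-class keyword weights;
    am and bm are the match counts from the random correlation matching module.
    """
    if not triggered:
        return None, None          # not a video download service website
    grade = 0                      # step A
    if w_bad > w_threshold:
        grade = 1                  # step B
    if am > am_threshold:
        grade = 2                  # step C
    if bm > bm_threshold:
        grade = 3                  # step D
    # Step E: only grades 0, 2 and 3 are given a label in the text.
    label = {0: "healthy", 2: "bad", 3: "bad"}.get(grade)
    return grade, label


clean = assess(True, 0.1, 0, 0, w_threshold=0.5, am_threshold=3, bm_threshold=3)
flagged = assess(True, 0.9, 0, 5, w_threshold=0.5, am_threshold=3, bm_threshold=3)
suspect = assess(True, 0.9, 0, 0, w_threshold=0.5, am_threshold=3, bm_threshold=3)
```

Because the checks run in order and each can only raise the grade, the result is the highest grade whose condition holds.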
Step 15: extract the relevant information of the videos in the target website, and update the video information database and the rule database.
In this step, the pre-established information extraction rules are the XML document element paths (xpaths) at which the video information is located, and are used to guide the extraction of video information.
In a specific implementation, the information extraction module may comprise an xpath generator, an extractor and a checker, and may carry out the following process:
First, if the information extraction rules for the target website are empty, the xpath generator is invoked to generate the xpath at which the video summary information is located; this xpath is added to the pre-established information extraction rules and the rule database is updated.
The extractor is then invoked to extract the video summary information from the pages of the target website according to the pre-established information extraction rules.
The checker is then invoked to verify the information extracted by the extractor. If verification passes, the extracted video summary information and the download address are saved to the video information database; otherwise, extraction of video summary information continues. The checker in the embodiments of the invention verifies the video summary information extracted by the extractor: if the extracted information is too short (less than 50 bytes), or its degree of association with the video topic is not within the specified threshold range, verification fails; otherwise verification passes.
In addition, if no information has passed verification once all the pre-established information extraction rules have been tried, the xpath generator is invoked to obtain the xpath at which the video summary information is located, the corresponding video summary information is extracted, and the checker verifies it again. If verification fails, the extraction process is abandoned and ends; otherwise this xpath is added to the pre-established information extraction rules and the rule database is updated.
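The interplay of extractor, checker and xpath generator can be sketched as follows. The extractor and generator are passed in as plain callables and are stubbed with a dictionary here, so no real xpath evaluation takes place. The byte length is measured in UTF-8, which is an assumption, and all names and sample values are hypothetical.

```python
def passes_check(summary, weights, v_min, v_max, min_bytes=50):
    """Checker: reject summaries shorter than min_bytes or whose keyword
    association score falls outside (v_min, v_max)."""
    if len(summary.encode("utf-8")) < min_bytes:
        return False
    score = sum(w for keyword, w in weights.items() if keyword in summary)
    return v_min < score < v_max


def extract_summary(page, rules, extract, generate_rule, check):
    """Try each stored rule; if none passes, generate a new rule and keep it
    only when its extraction passes the check."""
    for rule in rules:
        summary = extract(page, rule)
        if check(summary):
            return summary, rules
    new_rule = generate_rule(page)
    summary = extract(page, new_rule)
    if check(summary):
        return summary, rules + [new_rule]
    return None, rules             # give up on this page


# Stubs: a "page" maps an xpath string to the text found there.
WEIGHTS = {"director": 0.5, "title": 0.5}
TEXT = "title: Some Movie; director: Someone; year: 2009; runtime: 100 minutes"
page = {"/html/body/div[2]": TEXT}
summary, rules = extract_summary(
    page,
    rules=["/html/body/div[1]"],
    extract=lambda p, r: p.get(r, ""),
    generate_rule=lambda p: "/html/body/div[2]",
    check=lambda s: passes_check(s, WEIGHTS, 0.4, 1.5),
)
```

Returning the possibly extended rule list makes the rule database update explicit: a generated xpath is persisted only after its output has been verified.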
How the xpath generator generates the xpath at which the video summary information is located is described in detail below with a concrete example. Those skilled in the art will understand that the following is only an illustrative example and does not limit the scope of the invention:
The open-source library DOM4J is used to convert the page source file into a DOM. The page is cleaned to remove irrelevant nodes, such as font, that are used only for display. The healthy-class assessment keyword database is loaded; for convenience of description, assume the database contains four words: "translated name, title, year, director". The xpath corresponding to each of these keywords is then obtained, and all the xpaths are analyzed statistically to take the parent node of the largest common part of the paths, as follows:
/HTML[1]/BODY[1]/DIV[3]/DIV[3]/FORM[1]/DIV[1]/TABLE[1]/TR[1]/TD[2]/DIV[3]/DIV[3]/FONT[1]/text()[26]
/HTML[1]/BODY[1]/DIV[3]/DIV[3]/FORM[1]/DIV[1]/TABLE[1]/TR[1]/TD[2]/DIV[3]/DIV[3]/FONT[1]/text()[30]
/HTML[1]/BODY[1]/DIV[3]/DIV[3]/FORM[1]/DIV[1]/TABLE[1]/TR[1]/TD[2]/DIV[3]/DIV[3]/FONT[1]/text()[32]
/HTML[1]/BODY[1]/DIV[3]/DIV[3]/FORM[1]/DIV[1]/TABLE[1]/TR[1]/TD[2]/DIV[3]/DIV[3]/FONT[3]/text()[30]
The largest common part is obtained as follows: the content between each pair of "/" separators is treated as one node, and the frequency of each node at the same position across the paths is counted. If a node's frequency is greater than half the number of keywords, the node is taken; otherwise no node at that position satisfies the condition and the process stops. The common part is thus taken up to FONT[1], and its parent node DIV[3] is then taken, giving the path:
/HTML[1]/BODY[1]/DIV[3]/DIV[3]/FORM[1]/DIV[1]/TABLE[1]/TR[1]/TD[2]/DIV[3]/DIV[3]
All text node content under this path is the required information; that is, the xpath at which the video summary information is located has been generated successfully.
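The majority-vote rule for the common path can be checked against the four example xpaths with a short Python sketch. The function name `summary_xpath` is hypothetical; the paths are the ones listed in the example.

```python
from collections import Counter


def summary_xpath(paths):
    """Take, position by position, the node shared by more than half of the
    paths; stop at the first position without such a node, then return the
    parent of the last node taken."""
    split = [p.strip("/").split("/") for p in paths]
    common = []
    for position in range(min(len(s) for s in split)):
        node, count = Counter(s[position] for s in split).most_common(1)[0]
        if count <= len(paths) / 2:
            break
        common.append(node)
    return "/" + "/".join(common[:-1])


PREFIX = ("/HTML[1]/BODY[1]/DIV[3]/DIV[3]/FORM[1]/DIV[1]"
          "/TABLE[1]/TR[1]/TD[2]/DIV[3]/DIV[3]")
PATHS = [
    PREFIX + "/FONT[1]/text()[26]",
    PREFIX + "/FONT[1]/text()[30]",
    PREFIX + "/FONT[1]/text()[32]",
    PREFIX + "/FONT[3]/text()[30]",
]
result = summary_xpath(PATHS)
```

Here FONT[1] occurs in three of four paths and is taken, while text()[30] occurs in only two, which is not more than half, so the walk stops and the parent DIV[3] path is returned.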
Thus, by implementing the above technical solution, websites that provide video download services can be accurately identified, their most recently updated video information can be tracked, and the health and legality of these websites can be effectively evaluated, so that a healthier and safer network system can be built.
An embodiment of the invention also provides a system for identifying and evaluating a video download service website. Fig. 2 is a schematic structural diagram of the system provided by the embodiment of the invention. The system comprises:
Web page acquisition module 201, configured to obtain the URL list of the homepage of the target website to be processed, and to collect the web pages of the target website according to the obtained URL list;
Identification module 202, configured to analyze the collected web pages of the target website through association analysis and deep URL detection, to identify whether the target website provides a video download service, and at the same time to update the identified video information into a pre-established video information database;
Evaluation module 203, configured to, when the identification module 202 identifies the website as a video download service website and the website is being visited for the first time, perform correlation analysis on the website using the video information in the video information database, thereby completing the evaluation of the website;
Information extraction module 204, configured to extract the relevant information of the video web pages of the target website using pre-established information extraction rules, and to store it in the video information database.
The system may further comprise:
System scheduling module 205, configured to schedule the operation of the system and to coordinate the operation of each module of the system;
Rule database 206, configured to guide the operation of the web page acquisition module 201, the information extraction module 204 and the identification module 202;
Video information database 207, configured to store the video-related information obtained by the information extraction module 204, to realize content tracking, and to guide the evaluation module 203 in completing the evaluation of the website.
In addition, the video information database 207 may further comprise:
New-site video information database 2071, used to describe new sites visited for the first time, specifically comprising video summary information, video title, the time the video was most recently found, and the video service page URL;
Historical-site video information database 2072, divided into two types, healthy and bad, used to describe previously visited sites, specifically comprising video summary information, video title, the time the video was most recently found, and the video service page URL;
Assessment keyword database 2073, divided into two types, healthy and bad, used to assist in identifying video download service websites and in evaluating the health of the target website; the assessment keyword database may be initialized manually, or generated and updated from the historical-site video information database.
It should be noted that in the above system embodiment the modules are divided only according to functional logic; the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional modules are only for ease of distinguishing them from one another and do not limit the protection scope of the invention.
In summary, the specific embodiments of the invention can accurately identify websites that provide video download services, track their most recently updated video information, and effectively evaluate the health and legality of these websites, so that a healthier and safer network system can be built.
The above is only a preferred embodiment of the invention, but the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be defined by the protection scope of the claims.

Claims (10)

1. A method for identifying and evaluating a video download service website, characterized in that the method comprises:
Obtaining the uniform resource locator (URL) list of the homepage of the target website to be processed, and, under the guidance of the collection rules formulated by a rule database, calling a web page acquisition module to collect the web pages of the target website according to the obtained URL list;
Analyzing the collected web pages of the target website through association analysis and deep URL detection, identifying whether the target website provides a video download service, and at the same time updating the identified video information into a pre-established video information database;
If the website is identified as a video download service website and is being visited for the first time, performing correlation analysis on the website using the video information in the video information database, thereby completing the evaluation of the website;
Extracting the relevant information of the video web pages of the target website using pre-established information extraction rules, storing it in the video information database, and at the same time updating the rule database.
2. The method of claim 1, characterized in that calling the web page acquisition module, under the guidance of the collection rules formulated by the rule database, to collect the web pages of the target website according to the obtained URL list specifically comprises:
Visiting the homepage of the target website, setting the depth value of the current homepage to 0, and setting its parent URL node to itself;
Obtaining all in-site URLs (URL_1) found in the homepage, marking them as child URL nodes of the homepage with depth value 1, and storing them in a queue;
If the pre-established locating rule of the video service column is non-empty, adding the URL set of the video service columns corresponding to the locating rule to the tail of the queue, setting their depth value to 0, and setting each as its own parent URL node;
If the pre-established web page collection rule is non-empty, adjusting the URL list according to the web page collection rule by preferentially adding the URL_{i+1} that contain the URL feature to the tail of the queue; otherwise adding all URL_{i+1} to the tail of the queue in the order obtained; wherein the URL feature is the part of the URL string that remains after digits and hash (HASH) codes are removed;
Taking URL_i from the head of the queue in turn, setting its depth value to i, downloading the corresponding page, obtaining all in-site URLs (URL_{i+1}) found in that page, marking their depth value as i+1, and building the linked list <URL_{i-1}, URL_i, URL_{i+1}> consisting of the page together with its parent node and child nodes;
If m (m <= i) pages are video service pages and are not child nodes of the homepage, marking the video weight coefficient of the parent node URL_{i-1} as m, that is, <URL_{i-1}, m>;
Repeating the visit until a pre-specified depth threshold is reached, and storing the list of all URLs whose video weight coefficient m is greater than a pre-specified threshold into the video service column locating rule.
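The URL feature and the prioritized queueing in claim 2 can be illustrated with a minimal Python sketch. The claim does not define what counts as a hash code, so the sketch assumes a run of eight or more hexadecimal characters. The names url_feature and enqueue_children are hypothetical, and appending non-matching URLs after the matching ones is one possible reading of "preferentially".

```python
import re
from collections import deque

# Assumption: a "hash code" is a run of 8 or more hexadecimal characters.
_HASH_RE = re.compile(r"[0-9a-fA-F]{8,}")
_DIGIT_RE = re.compile(r"\d+")


def url_feature(url):
    """Return what remains of the URL string after removing hash codes and digits."""
    return _DIGIT_RE.sub("", _HASH_RE.sub("", url))


def enqueue_children(queue, child_urls, known_features):
    """Append child URLs to the queue tail, feature-matching URLs first.

    known_features stands in for the web page collection rule; when it is
    empty, all URLs are appended in the order they were obtained.
    """
    if known_features:
        preferred = [u for u in child_urls if url_feature(u) in known_features]
        others = [u for u in child_urls if url_feature(u) not in known_features]
        queue.extend(preferred + others)
    else:
        queue.extend(child_urls)
```

With a deque as the crawl queue, pages are taken from the head with popleft(), so URLs whose feature matches a known video service page are visited before the rest of their siblings.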
3. The method of claim 1, characterized in that analyzing the collected web pages of the target website through association analysis and deep URL detection and identifying whether the target website provides a video download service specifically comprises:
Using an association analyzer, with the healthy-class keywords set in the assessment keyword database as input, analyzing the degree of association between the web pages of the target website and the video download service theme, and continuing with subsequent processing if a pre-established threshold condition is satisfied;
Calling a deep URL detector to identify the URL of a video download address and to perform deep URL detection on it; if the detection finds that the web page of the target website contains key fields related to video download, marking the web page of the target website as a page that provides a video download service.
4. The method of claim 1, characterized in that performing correlation analysis on the website using the video information in the video information database, thereby completing the evaluation of the website, specifically comprises:
Using an association analyzer, with the bad-class keywords set in the assessment keyword database as input, identifying pages that provide bad video download services;
Using a random correlation matching module, detecting the degree of correlation between the videos offered for download by the target website and the videos of historical sites in the video information database, and returning the number of matching video titles;
Using a comprehensive determination module, comprehensively determining the nature of the target website according to the results returned by the association analyzer and the random correlation matching module, thereby completing the evaluation of the website.
5. The method of claim 4, characterized in that comprehensively determining the nature of the target website according to the results returned by the association analyzer and the random correlation matching module, thereby completing the evaluation of the website, specifically comprises:
If the website is determined to be a website that provides a video download service, defining its grade as 0;
When the grade is 0, inputting the bad-class keywords set in the assessment keyword database; if the website is identified as a website that provides bad video download services, raising the grade to 1;
When the grade is greater than or equal to 0, if the number of healthy video titles returned by the random correlation matching module is greater than a pre-specified threshold, further determining that the website is a website that provides bad video download services and raising the grade to 2; and if the number of bad video titles returned by the random correlation matching module is greater than another pre-specified threshold, further determining that the website is a website that provides bad video download services and raising the grade to 3.
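The grading in claim 5 can be read as a short cascade of checks. The following Python sketch shows one such reading; the function name assess_grade, its parameter names, and the use of None for a site that is not a video download service website are all assumptions made for illustration and do not come from the claim.

```python
def assess_grade(is_video_site, bad_keyword_hit,
                 healthy_title_matches, bad_title_matches,
                 healthy_threshold, bad_threshold):
    """Return a grade from 0 to 3, or None if the site is not a video download site."""
    if not is_video_site:
        return None  # assumption: sites outside the scope of claim 5 get no grade
    grade = 0
    # Grade 1: the bad-class keywords identify the site.
    if bad_keyword_hit:
        grade = 1
    # Grade 2: matched healthy video titles exceed the first threshold.
    if healthy_title_matches > healthy_threshold:
        grade = 2
    # Grade 3: matched bad video titles exceed the second threshold.
    if bad_title_matches > bad_threshold:
        grade = 3
    return grade
```

Later checks override earlier ones, so the returned value is the highest grade whose condition is met.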
6. The method of claim 1, characterized in that extracting the relevant information of the video web pages of the target website using the pre-established information extraction rules, storing it in the video information database, and at the same time updating the rule database specifically comprises:
If the information extraction rule corresponding to the target website is empty, calling an XML document element path (xpath) generator to generate the xpath where the video summary information is located, adding this xpath to the pre-established information extraction rules, and updating the rule database;
Calling an extractor to extract video summary information from the pages of the target website according to the pre-established information extraction rules;
Calling a checker to verify the information extracted by the extractor; if the verification passes, saving the extracted video summary information and download address to the video information database; otherwise, continuing the extraction of video summary information;
If, when all the pre-established information extraction rules have been tried, still no information has passed verification, calling the xpath generator to obtain the xpath where the video summary information is located, extracting the corresponding video summary information, and verifying it again with the checker; if the verification fails, abandoning the extraction and ending; otherwise, adding this xpath to the pre-established information extraction rules and updating the rule database.
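The control flow of claim 6 (try the stored rules, verify, fall back to a freshly generated xpath) can be sketched as follows. The xpath generator, extractor and checker are passed in as plain callables because the claim does not specify their interfaces; the names extract_summary, generate_xpath, extract and verify are hypothetical.

```python
def extract_summary(page, rules, generate_xpath, extract, verify):
    """Return (summary, updated_rules); summary is None if extraction is abandoned.

    generate_xpath(page) -> xpath string; extract(page, xpath) -> extracted text;
    verify(text) -> bool. All three are stand-ins supplied by the caller.
    """
    rules = list(rules)  # work on a copy; the caller persists the returned rules
    if not rules:
        # No rule for this site yet: generate one and record it.
        rules.append(generate_xpath(page))
    for xpath in rules:
        summary = extract(page, xpath)
        if verify(summary):
            return summary, rules
    # Every stored rule failed verification: regenerate the xpath and retry once.
    xpath = generate_xpath(page)
    summary = extract(page, xpath)
    if verify(summary):
        if xpath not in rules:
            rules.append(xpath)
        return summary, rules
    return None, rules
```

Returning the updated rule list alongside the summary lets the caller decide when to write the new xpath back to the rule database.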
7. The method of claim 2, characterized in that:
The web page collection rule comprised in the web page acquisition module is specifically the URL feature of video service pages, and is used to assist the web page acquisition module in maintaining the corresponding URL list;
The pre-established information extraction rule is the XML document element path (xpath) where the video information is located, and is used to guide the extraction of video information;
The pre-established locating rule of the video service column comprises a series of URL sets used to locate the video service columns of the current site; each column comprises multiple video service sub-pages, and the rule is used to assist the web page acquisition module in maintaining the corresponding URL list.
8. A system for identifying and evaluating a video download service website, characterized in that the system comprises:
A web page acquisition module, configured to obtain the uniform resource locator (URL) list of the homepage of the target website to be processed, and to collect the web pages of the target website according to the obtained URL list;
An identification module, configured to analyze the collected web pages of the target website through association analysis and deep URL detection, to identify whether the target website provides a video download service, and at the same time to update the identified video information into a pre-established video information database;
An evaluation module, configured to, when the identification module identifies the website as a video download service website and the website is being visited for the first time, perform correlation analysis on the website using the video information in the video information database, thereby completing the evaluation of the website;
An information extraction module, configured to extract the relevant information of the video web pages of the target website using pre-established information extraction rules, and to store it in the video information database.
9. The system of claim 8, characterized in that the system further comprises:
A system scheduling module, configured to schedule the operation of the system and to coordinate the operation of each module of the system;
A rule database, comprising web page collection rules, information extraction rules and video service column locating rules, configured to guide the operation of the web page acquisition module, the information extraction module and the identification module;
A video information database, configured to store the video-related information obtained by the information extraction module, to realize content tracking, and to guide the evaluation module in completing the evaluation of the website.
10. The system of claim 9, characterized in that the video information database comprises:
A new-site video information database, used to describe new sites visited for the first time, specifically comprising video summary information, video title, the time the video was most recently found, and the video service page URL;
A historical-site video information database, divided into two types, healthy and bad, used to describe previously visited sites, specifically comprising video summary information, video title, the time the video was most recently found, and the video service page URL;
An assessment keyword database, divided into two types, healthy and bad, used to assist in identifying video download service websites and in evaluating the health of the target website; the assessment keyword database may be initialized manually, or generated and updated from the historical-site video information database.
CN2010101867951A 2010-05-26 2010-05-26 Method and system for identifying and evaluating video downloading service website Expired - Fee Related CN101853300B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101867951A CN101853300B (en) 2010-05-26 2010-05-26 Method and system for identifying and evaluating video downloading service website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101867951A CN101853300B (en) 2010-05-26 2010-05-26 Method and system for identifying and evaluating video downloading service website

Publications (2)

Publication Number Publication Date
CN101853300A true CN101853300A (en) 2010-10-06
CN101853300B CN101853300B (en) 2013-01-30

Family

ID=42804792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101867951A Expired - Fee Related CN101853300B (en) 2010-05-26 2010-05-26 Method and system for identifying and evaluating video downloading service website

Country Status (1)

Country Link
CN (1) CN101853300B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102083100A (en) * 2010-12-31 2011-06-01 百度在线网络技术(北京)有限公司 Method and device for detecting states of multiple resource links based on sites
CN103473299A (en) * 2013-09-06 2013-12-25 北京锐安科技有限公司 Website bad likelihood obtaining method and device
CN104811750A (en) * 2014-01-23 2015-07-29 北京风行在线技术有限公司 Method and device used for playing video in P2P peers and system
CN104834639A (en) * 2014-02-10 2015-08-12 腾讯科技(深圳)有限公司 Data interaction method, terminal, server and data interaction system
CN104866517A (en) * 2014-12-30 2015-08-26 智慧城市信息技术有限公司 Method and device for capturing webpage content
CN105589945A (en) * 2015-12-17 2016-05-18 华为技术有限公司 Knowledge base construction method and controller
CN105635038A (en) * 2014-10-27 2016-06-01 任子行网络技术股份有限公司 Method and system for discriminating audio and video websites
CN105630942A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Method and device for scheduling update sections of electronic book
WO2016095628A1 (en) * 2014-12-18 2016-06-23 网宿科技股份有限公司 Video terminal and video play restricting method and system thereof
CN105828189A (en) * 2015-01-05 2016-08-03 任子行网络技术股份有限公司 Method of detecting illegal audio and video programs from multiple dimensions
CN105955980A (en) * 2013-05-31 2016-09-21 北京奇虎科技有限公司 File download device and method
CN107766481A (en) * 2017-10-13 2018-03-06 国家计算机网络与信息安全管理中心 A kind of method and system for finding internet financial platform
CN108183831A (en) * 2016-12-08 2018-06-19 中国移动通信有限公司研究院 Information processing method and device in a kind of P2P transmission
CN108664646A (en) * 2018-05-16 2018-10-16 电子科技大学 A kind of automatic download system of audio and video based on keyword
CN109474847A (en) * 2018-10-30 2019-03-15 百度在线网络技术(北京)有限公司 Searching method, device, equipment and storage medium based on video barrage content
CN110020332A (en) * 2017-07-25 2019-07-16 北京国双科技有限公司 A kind of event generation method and device for selecting element based on circle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042923A1 (en) * 1992-12-09 2002-04-11 Asmussen Michael L. Video and digital multimedia aggregator content suggestion engine
CN101025737A (en) * 2006-02-22 2007-08-29 王东 Attention degree based same source information search engine aggregation display method and its related system
CN101599089A (en) * 2009-07-17 2009-12-09 中国科学技术大学 The automatic search of update information on content of video service website and extraction system and method

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102083100B (en) * 2010-12-31 2014-11-26 百度在线网络技术(北京)有限公司 Method and device for detecting states of multiple resource links based on sites
CN102083100A (en) * 2010-12-31 2011-06-01 百度在线网络技术(北京)有限公司 Method and device for detecting states of multiple resource links based on sites
CN105955980A (en) * 2013-05-31 2016-09-21 北京奇虎科技有限公司 File download device and method
CN103473299B (en) * 2013-09-06 2017-02-08 北京锐安科技有限公司 Website bad likelihood obtaining method and device
CN103473299A (en) * 2013-09-06 2013-12-25 北京锐安科技有限公司 Website bad likelihood obtaining method and device
CN104811750A (en) * 2014-01-23 2015-07-29 北京风行在线技术有限公司 Method and device used for playing video in P2P peers and system
CN104834639A (en) * 2014-02-10 2015-08-12 腾讯科技(深圳)有限公司 Data interaction method, terminal, server and data interaction system
CN104834639B (en) * 2014-02-10 2019-08-30 腾讯科技(深圳)有限公司 A kind of data interactive method, terminal, server and data interaction system
CN105635038A (en) * 2014-10-27 2016-06-01 任子行网络技术股份有限公司 Method and system for discriminating audio and video websites
CN105635038B (en) * 2014-10-27 2018-08-21 任子行网络技术股份有限公司 A kind of method and system for screening audio and video website
WO2016095628A1 (en) * 2014-12-18 2016-06-23 网宿科技股份有限公司 Video terminal and video play restricting method and system thereof
CN104866517A (en) * 2014-12-30 2015-08-26 智慧城市信息技术有限公司 Method and device for capturing webpage content
CN105828189A (en) * 2015-01-05 2016-08-03 任子行网络技术股份有限公司 Method of detecting illegal audio and video programs from multiple dimensions
CN105828189B (en) * 2015-01-05 2018-10-23 任子行网络技术股份有限公司 A kind of method of various dimensions detection violation audio/video program
WO2017101591A1 (en) * 2015-12-17 2017-06-22 华为技术有限公司 Method for constructing knowledge base, and controller
CN105589945A (en) * 2015-12-17 2016-05-18 华为技术有限公司 Knowledge base construction method and controller
CN105630942A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Method and device for scheduling update sections of electronic book
CN105630942B (en) * 2015-12-23 2019-05-21 北京奇虎科技有限公司 The dispatching method and device of e-book update chapters and sections
CN108183831A (en) * 2016-12-08 2018-06-19 中国移动通信有限公司研究院 Information processing method and device in a kind of P2P transmission
CN110020332A (en) * 2017-07-25 2019-07-16 北京国双科技有限公司 A kind of event generation method and device for selecting element based on circle
CN107766481A (en) * 2017-10-13 2018-03-06 国家计算机网络与信息安全管理中心 A kind of method and system for finding internet financial platform
CN108664646A (en) * 2018-05-16 2018-10-16 电子科技大学 A kind of automatic download system of audio and video based on keyword
CN109474847A (en) * 2018-10-30 2019-03-15 百度在线网络技术(北京)有限公司 Searching method, device, equipment and storage medium based on video barrage content

Also Published As

Publication number Publication date
CN101853300B (en) 2013-01-30

Similar Documents

Publication Publication Date Title
CN101853300B (en) Method and system for identifying and evaluating video downloading service website
CN102073726B (en) Structured data import method and device for search engine system
CN106095979B (en) URL merging processing method and device
CN102073725B (en) Method for searching structured data and search engine system for implementing same
CN100514323C (en) System and method for automatically extracting by-line information
CN112749284B (en) Knowledge graph construction method, device, equipment and storage medium
CN107590169B (en) Operator gateway data preprocessing method and system
CN104766014A (en) Method and system used for detecting malicious website
CN103544176A (en) Method and device for generating page structure template corresponding to multiple pages
CN102253937A (en) Method and related device for acquiring information of interest in webpages
CN103218431A (en) System and method for identifying and automatically acquiring webpage information
CN105577528B (en) A kind of wechat public platform collecting method and device based on virtual machine
CN102402566A (en) Web user behavior analysis method based on Chinese webpage automatic classification technology
US11263062B2 (en) API mashup exploration and recommendation
CN104182412A (en) Webpage crawling method and webpage crawling system
CN103534696A (en) Exploiting query click logs for domain detection in spoken language understanding
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN103294732A (en) Web page crawling method and spider
CN103838862B (en) Video searching method, device and terminal
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN112328936A (en) Website identification method, device and equipment and computer readable storage medium
CN112000929A (en) Cross-platform data analysis method, system, equipment and readable storage medium
CN113918794B (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130130

Termination date: 20150526

EXPY Termination of patent right or utility model