WO2012075884A1 - Method and server for intelligent classification of bookmarks - Google Patents

Method and server for intelligent classification of bookmarks

Info

Publication number
WO2012075884A1
Authority
WO
WIPO (PCT)
Prior art keywords
classification
link address
bookmark
category
preset
Prior art date
Application number
PCT/CN2011/082620
Other languages
English (en)
French (fr)
Inventor
关磊
莫沙
颜伽艺
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2012075884A1
Priority to US13/910,478 (US9106698B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562 Bookmark management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/30 Managing network names, e.g. use of aliases or nicknames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/30 Managing network names, e.g. use of aliases or nicknames
    • H04L61/3015 Name registration, generation or assignment

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a method and server for intelligent classification of bookmarks.
  • In the prior art, bookmark management relies mostly on manual work by the user. For example, the user creates the bookmark categories, decides which category each saved bookmark belongs to, and sorts each bookmark by hand.
  • Embodiments of the present invention provide a method and server for intelligent classification of bookmarks.
  • the technical solution is as follows:
  • An embodiment of the present invention provides a method for intelligent classification of bookmarks, including: acquiring a bookmark link address that a client requests to classify; classifying that bookmark link address; and returning the classification result to the client as the bookmark category.
  • Classifying the bookmark link address includes: matching the bookmark link address against the link addresses in a link library, where the link library consists of preset link addresses, each set in advance for a preset URL category; and, if the link library contains a link address that matches the bookmark link address, using the URL category corresponding to the matched link address as the classification result.
  • The URL category further corresponds to preset keywords.
  • After the bookmark link address is matched against the link library, the method further includes, if no matching link address exists:
  • capturing key information from the webpage corresponding to the bookmark link address, and segmenting the key information into words to generate keywords; and
  • comparing the generated keywords with the preset keywords of each URL category, and using the URL category whose preset keywords have the highest similarity as the classification result.
  • the classifying the bookmark link address that is requested to be classified includes: presetting a web address category, and setting a keyword corresponding to the web address category;
  • the classifying the bookmark link address of the requested classification specifically includes:
  • the generated keyword is compared with the preset keyword corresponding to each URL category, and the URL category corresponding to the preset keyword with the highest similarity is used as the classification result.
  • the classifying the bookmark link address of the requested category includes: presetting a web address category;
  • the classifying the bookmark link address of the requested classification specifically includes:
  • the default category is used as the classification result.
  • the embodiment of the invention further provides a server for intelligent classification of bookmarks, comprising:
  • an obtaining module configured to obtain a bookmark link address that a client requests to classify;
  • a classification module configured to classify the bookmark link address; and
  • a return module configured to return the classification result to the client as the bookmark category.
  • the server further includes:
  • the classification module includes:
  • a first classification unit configured to match the bookmark link address requested for classification against the link addresses in the link library; and
  • a first matching unit configured to: if a link address matching the bookmark link address of the request classification exists in the link library, use a URL category corresponding to the matched link address as a classification result.
  • the pre-setting module is further configured to set a keyword corresponding to the URL category, and the classification module further includes:
  • a first generating unit configured to, after the first classification unit has matched the bookmark link address against the link library and found no matching link address, capture key information from the webpage corresponding to the bookmark link address and segment the key information into words to generate keywords; and
  • a second matching unit configured to compare the generated keywords with the preset keywords of each URL category, and to use the URL category whose preset keywords have the highest similarity as the classification result.
  • the server further includes:
  • the classification module includes:
  • a second generating unit configured to capture key information from the webpage corresponding to the bookmark link address and segment the key information into words to generate keywords; and
  • a third matching unit configured to compare the generated keywords with the preset keywords of each URL category, and to use the URL category whose preset keywords have the highest similarity as the classification result.
  • the server further includes:
  • Pre-setting module for pre-setting the URL category
  • the classification module comprises:
  • the default processing unit is configured to use the default classification as the classification result if there is no URL category matching the bookmark link address of the requested category in the preset URL category.
  • The technical solution provided by the embodiments of the present invention has the following beneficial effect: the bookmark link address requested for classification is acquired and classified, and the classification result is returned to the client as the bookmark category. This achieves intelligent classification of bookmarks, spares users the time spent sorting bookmarks manually, and gives users a better browser experience.
  • FIG. 1 is a schematic flowchart of a method for intelligent classification of bookmarks according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic flowchart of a method for intelligent classification of bookmarks according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic structural diagram of a server for intelligent classification of bookmarks according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic structural diagram of a server for intelligent classification of bookmarks according to Embodiment 4 of the present invention.
  • Detailed description
  • a first embodiment of the present invention provides a method for intelligent classification of bookmarks.
  • The flow of the method is shown in FIG. 1 and includes the following steps:
  • Step 101: Acquire the bookmark link address requested for classification.
  • Step 102: Classify the bookmark link address.
  • Step 103: Return the classification result to the client as the bookmark category.
  • In this embodiment, the bookmark link address requested for classification is acquired and classified, and the classification result is returned to the client as the bookmark category. This achieves intelligent classification of bookmarks and spares the user the time spent sorting bookmarks manually.
  • The second embodiment of the present invention improves on the first embodiment. Its flow is shown in FIG. 2 and includes the following steps:
  • Step 201: The classification server acquires the bookmark link address requested for classification.
  • the browser sends the bookmark link address of the webpage to the server, and uses the powerful computing power of the server to automatically classify the bookmark.
  • the classification server is a server for intelligent classification of bookmarks.
  • Step 202: The classification server presets URL categories and sets keywords and/or link addresses for each URL category; the preset link addresses constitute a link library.
  • Several URL categories are preset, such as technology, education, entertainment, and blogs. Keywords and/or link addresses are set for each URL category, and the preset link addresses constitute the link library.
  • the representation of text mainly adopts the vector space model, because the original form of natural language is not suitable for direct processing by mathematical methods, and thus it is difficult to realize automatic processing of natural language.
  • The idea of the vector space model is to describe a text as a vector $(W_1, W_2, W_3, \ldots, W_m)$, where $W_m$ is the weight of the m-th feature item. A feature item can be a word or a phrase; in general, words work better than phrases as feature items. The selected feature items therefore serve as the dimensions of the vector space.
  • These feature items are used as the dimensions of the vector that represents the text, and word frequency is used to compute the vector component corresponding to each feature item.
  • The word frequency is mainly calculated with the TF-IDF formula: $W(t, d) = tf(t, d) \times \log(N / n_t + 0.01)$
  • TF-IDF is a commonly used weighting technique in information retrieval and text mining. It is widely used in search, document classification, and related fields to assess how important a word is to one document in a document set or corpus.
  • The weight of each keyword corresponding to each preset URL category is calculated with the TF-IDF formula.
  • The weights of all words in the same lexicon are combined to form the N-dimensional vector of the lexicon, where N is the number of keywords in the lexicon. It can be expressed as: (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight).
  • The keyword weights constitute the preset URL category vector, which is used to calculate similarity against the bookmarked web pages saved by the user.
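The weighting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `tfidf_vector`, the sample vocabulary, and the document frequencies are hypothetical, the natural logarithm is assumed because the text does not state a base, and no length normalization is applied.

```python
import math
from collections import Counter


def tfidf_vector(words, vocabulary, doc_freq, num_docs):
    """Weight each vocabulary term as tf(t, d) * log(N / n_t + 0.01)."""
    counts = Counter(words)
    vector = []
    for term in vocabulary:
        tf = counts[term]              # occurrences of the term in this text
        n_t = doc_freq.get(term, 0)    # number of documents containing the term
        # A term seen in no document gets weight 0, which avoids dividing by zero.
        weight = tf * math.log(num_docs / n_t + 0.01) if n_t else 0.0
        vector.append(weight)
    return vector


# Hypothetical lexicon for a "technology" category over a corpus of 4 documents.
vocabulary = ["phone", "chip", "exam"]
doc_freq = {"phone": 1, "chip": 2, "exam": 4}
tech_vector = tfidf_vector(["phone", "chip", "chip"], vocabulary, doc_freq, num_docs=4)
```

Because the later comparison uses the cosine of the angle between vectors, scaling a vector by a constant does not change the result, which is why the sketch leaves the weights unnormalized.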
  • Step 203 The classification server classifies the bookmark link address of the requested classification.
  • the server side distributes the bookmark link address of the request classification to a different classification server for classification processing.
  • the embodiment of the present invention classifies bookmarks in three ways.
  • The first way matches the bookmark link address against the link library. The second way applies when no match is found: the classification result is obtained by a similarity comparison against the preset category keywords using the vector space model.
  • The third way skips link matching and obtains the classification result directly from the similarity comparison between the vector space model and the preset category keywords. Therefore, step 203 can be specifically:
  • Step 2031: Set up a load-balancing server to relieve pressure on the classification servers.
  • The server side places a load-balancing server in front of the classification server cluster to balance the load across the classification servers. Specifically, it receives the bookmark link addresses that clients send for classification, distributes them across the classification server cluster according to the configured balancing policy, and maintains server availability.
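The distribution step can be illustrated with a small sketch. The text only says requests are distributed "according to the configured balancing policy", so round-robin is an assumed policy here, and the class `RoundRobinBalancer` and its method names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out classification servers in turn, skipping ones marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._available = set(self._servers)
        self._ring = cycle(self._servers)

    def mark_down(self, server):
        # Stop routing requests to a server that is unavailable.
        self._available.discard(server)

    def mark_up(self, server):
        # Only servers that belong to the cluster can be re-enabled.
        if server in self._servers:
            self._available.add(server)

    def pick(self):
        if not self._available:
            raise RuntimeError("no classification server available")
        # At least one server in the ring is available, so this loop terminates.
        for server in self._ring:
            if server in self._available:
                return server


balancer = RoundRobinBalancer(["classifier-1", "classifier-2", "classifier-3"])
target = balancer.pick()  # server that should receive the next bookmark link address
```

Tracking availability inside the balancer is one simple way to reflect the "maintains server availability" responsibility; a real deployment would more likely rely on health checks in a dedicated load balancer.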
  • Step 2032: The classification server matches the bookmark link address against the link addresses in the link library. Specifically, after receiving a bookmark link address assigned by the load-balancing server, the classification server first performs domain name matching in the link library.
  • A domain name is the name of a computer or group of computers on a network, made up of a series of dot-separated names.
  • Step 2033 If there is a link address in the link library that matches the bookmark link address of the request classification, the classification server uses the URL category corresponding to the matched link address as the classification result.
  • Domain name matching can resolve most well-known domain names. The URL category corresponding to the matched link address is used as the classification result, and step 204 can then be performed to return the classification result to the client.
  • For example, the domain name of the bookmark link address saved by the user is cnbeta.com.
  • cnbeta.com has been recorded in the link library in advance, with the domain name mapped to the tech news category.
  • After receiving the request from the client, the server matches the domain name in the link library, finds that the URL category corresponding to the domain name is tech news, and returns that category to the client.
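The cnbeta.com example can be sketched as a dictionary lookup on the host name. This is an illustration under assumptions: the name `match_link_library`, the dictionary layout, and the rule of also trying parent domains (so that www.cnbeta.com matches cnbeta.com) are not specified in the text.

```python
from urllib.parse import urlsplit

# Link library: preset domain name -> URL category (cnbeta.com is the example from the text).
LINK_LIBRARY = {"cnbeta.com": "tech news"}


def match_link_library(url, library=LINK_LIBRARY):
    """Return the preset category for the URL's domain, or None if nothing matches."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then drop leading labels: www.cnbeta.com -> cnbeta.com.
    # The bare top-level label (e.g. "com") is never tried.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in library:
            return library[candidate]
    return None


category = match_link_library("http://www.cnbeta.com/articles/1.htm")
```

A `None` result corresponds to the "no matching link address" branch, in which the server falls back to analyzing the page content. Note that `urlsplit` only reports a host name when the address includes a scheme such as `http://`.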
  • Step 2034: If no link address in the link library matches the bookmark link address requested for classification, the classification server captures key information from the webpage corresponding to the bookmark link address and segments the key information into words to generate keywords.
  • Step 2034 can be specifically:
  • Step 20341: Capture the key information of the webpage corresponding to the bookmark link address.
  • If the link library on the classification server does not contain the link address, the type of the website cannot be determined from known information. The classification server then analyzes the page corresponding to the link address and determines the type of the website itself.
  • The classification server accesses the link address, crawls the webpage, and extracts key information such as the title, keywords, and specific page content for analysis.
  • The method for capturing webpage information is prior art and is not limited in the embodiments of the present invention.
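Since the capture method is left open, the following is only one possible sketch of pulling the title and keywords out of a page that has already been downloaded. It uses the standard-library `html.parser` module; the names `KeyInfoParser` and `extract_key_info` are hypothetical, and fetching the page over the network is deliberately left out.

```python
from html.parser import HTMLParser


class KeyInfoParser(HTMLParser):
    """Collects the <title> text and the <meta name="keywords"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_key_info(html):
    """Return the page title and declared keywords from an HTML string."""
    parser = KeyInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "keywords": parser.keywords}


page = '<html><head><title> Tech News </title><meta name="keywords" content="phone, chip"></head></html>'
info = extract_key_info(page)
```

The body text mentioned in the description ("specific page content") could be collected the same way by accumulating data outside script and style elements.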
  • Step 20342: Perform word segmentation on the key information to generate keywords.
  • Chinese word segmentation is performed on the captured key information to generate keywords.
  • Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules.
  • The classification server then analyzes the segmented words to determine the category to which they belong.
  • The method for Chinese word segmentation is prior art and is not limited in the embodiments of the present invention.
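The text does not name a segmentation algorithm, so the sketch below swaps in forward maximum matching, a simple dictionary-based technique, purely for illustration. The function name `forward_max_match`, the sample dictionary, and the `max_len` parameter are assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Split text into words by always taking the longest dictionary word at the cursor."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot get stuck.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionary; a real system would load a much larger word list.
dictionary = {"科技", "新闻", "科技新闻", "网站"}
segmented = forward_max_match("科技的网站", dictionary)
```

Production systems usually rely on a dedicated segmentation library with statistical models; the greedy approach here only shows how a character sequence becomes a word sequence.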
  • Step 20343: Calculate the vector of the generated keywords.
  • The weight of each word among the keywords is calculated with the TF-IDF formula described in step 202, giving the vector of the generated keywords: (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight).
  • Step 2035: The classification server compares the generated keywords with the preset keywords of each URL category, and uses the URL category whose preset keywords have the highest similarity as the classification result.
  • Here a computer takes the place of manual work in classifying objects such as documents. Such methods generally fall into automatic clustering and automatic classification.
  • The main difference between the two is that automatic clustering does not need a classification system defined in advance; it relies only on computed similarity.
  • Automatic clustering does not require the server to collect category feature lexicons and link addresses in advance, but its results are poorer than those of automatic classification.
  • Automatic classification requires a category system to be determined first, with a batch of pre-classified objects provided for each category as the training corpus. During actual classification, one or more categories are determined for each document to be classified according to the learned classification knowledge.
  • In this embodiment, the automatic classification method is used: the key information of the webpage, after crawling and word segmentation, is evaluated with the text vector space model to determine the type of the webpage.
  • the classification system has been pre-determined before the classification calculation, and the corresponding thesaurus is provided as a training corpus for each category.
  • the classification of text is based on the content of the text to automatically determine the category of the text association under the given classification system. From a mathematical point of view, text categorization is a mapping process that maps unspecified text to existing categories.
  • The vector of the generated keywords is compared with the vector of each URL category, and all the similarity values obtained are ranked by size. The webpage corresponding to the bookmark link address is then assigned to the URL category with the highest similarity value.
  • The similarity of two texts being compared is represented by the cosine of the angle between their corresponding vectors, calculated as: $Sim(d_i, d_j) = \cos\theta = \frac{\sum_{k} W_{ik} W_{jk}}{\sqrt{\sum_{k} W_{ik}^2} \sqrt{\sum_{k} W_{jk}^2}}$
  • $W_{ik}$ and $W_{jk}$ represent the weights of the k-th feature items of texts $d_i$ and $d_j$, respectively, and $Sim(d_i, d_j)$ is the similarity of the two texts $d_i$ and $d_j$.
  • the comparison of the similarity is only one way of determining the category, and the category can also be determined by other means, which is not limited in the embodiment of the present invention.
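The cosine comparison of step 2035 can be sketched as follows. The function names and the sample vectors are hypothetical, and the sketch assumes the page vector and every category vector are built over the same ordered list of feature items, which the text does not spell out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no feature items with anything
    return dot / (norm_a * norm_b)


def best_category(page_vector, category_vectors):
    """Rank categories by similarity and return the (name, score) with the highest score."""
    scores = {name: cosine_similarity(page_vector, vec)
              for name, vec in category_vectors.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0]


# Hypothetical category vectors over two feature items.
category_vectors = {"technology": [1.0, 0.0], "education": [0.0, 1.0]}
name, score = best_category([2.0, 0.5], category_vectors)
```

Returning the score together with the name lets a caller decide whether the best match is strong enough or whether the default category should be used instead.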
  • Step 2036: In the third way, link library matching is skipped. The classification server captures key information from the webpage corresponding to the bookmark link address requested for classification and segments the key information into words to generate keywords.
  • Step 2036 can be specifically:
  • Step 20361: Capture the key information of the webpage corresponding to the bookmark link address.
  • Step 20362: Perform word segmentation on the key information to generate keywords.
  • Step 20363: Calculate the vector of the generated keywords.
  • Step 2037: The classification server compares the generated keywords with the preset keywords of each URL category, and uses the URL category whose preset keywords have the highest similarity as the classification result.
  • Step 2037 is the same as step 2035 and is therefore not described again here.
  • Step 2038: If none of the preset URL categories matches the bookmark link address requested for classification, the classification server uses the default category as the classification result.
  • If the bookmark link address yields no result after matching and calculation by the classification server, the bookmark is assigned to the default category, which is returned to the client.
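The order of the steps, link library first, similarity second, default category last, can be sketched as a small cascade. Everything named here is hypothetical: `classify_bookmark`, its callable parameters, and in particular `min_similarity`, since the text only says the default applies when no result is obtained and does not define a threshold.

```python
def classify_bookmark(url, link_lookup, similarity_classifier,
                      default_category="default", min_similarity=0.0):
    """Try the link library, then the similarity comparison, then fall back to the default."""
    # First way: a direct hit in the link library.
    category = link_lookup(url)
    if category is not None:
        return category

    # Second way: keyword similarity; the classifier returns (name, score) or None.
    result = similarity_classifier(url)
    if result is not None:
        name, score = result
        if score > min_similarity:
            return name

    # Step 2038: nothing matched, so the default category is the result.
    return default_category


# Stand-in callables for illustration; real ones would wrap the link library and the crawler.
result = classify_bookmark(
    "http://unknown.example/",
    link_lookup=lambda url: None,
    similarity_classifier=lambda url: ("education", 0.8),
)
```

Passing the two classifiers in as callables keeps the cascade independent of how matching and similarity are actually implemented; the third way corresponds to supplying a `link_lookup` that always returns `None`.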
  • Step 204: The classification server returns the classification result to the client as the bookmark category.
  • Link library matching or text similarity calculation is performed on the bookmark link address, and the classification result is returned to the client.
  • Once the category of the bookmark link address is obtained, the category is returned to the load-balancing server and then returned through it to the client browser.
  • After receiving the category of the bookmark link address, the browser automatically files the bookmark into the corresponding category folder, completing the intelligent sorting and classification of the user's bookmarks.
  • If the bookmark link address yields no result after matching and calculation by the classification server, the bookmark is assigned to the default category, which is returned to the client; the client then places the unclassified bookmark in the default category directory.
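On the client side, filing bookmarks into category folders can be sketched as a simple grouping. The function name `organize_bookmarks`, the folder representation as a dictionary, and the sample URLs are assumptions for illustration; the text does not describe the browser's data structures.

```python
from collections import defaultdict


def organize_bookmarks(classified, default_folder="default"):
    """Group (url, category) pairs into folders; missing categories go to the default folder."""
    folders = defaultdict(list)
    for url, category in classified:
        folders[category or default_folder].append(url)
    return dict(folders)


# Hypothetical server responses: each bookmark paired with its returned category.
folders = organize_bookmarks([
    ("http://www.cnbeta.com/", "tech news"),
    ("http://unknown.example/", None),
    ("http://cnbeta.com/articles/1.htm", "tech news"),
])
```

Bookmarks keep their original order within each folder, so repeated classification runs produce stable folder contents.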
  • In this embodiment, the bookmark link address requested for classification is acquired and classified, and the classification result is returned to the client as the bookmark category, achieving intelligent classification of bookmarks.
  • Key information is captured from the webpage corresponding to the bookmark link address and segmented into words to generate keywords. The similarity between these keywords and the preset keywords of each URL category is calculated, and the category with the highest similarity is taken as the category of the bookmark link address. This spares users the time spent sorting bookmarks manually and gives users a better browser experience.
  • Embodiment 3
  • A third embodiment of the present invention provides a server for intelligent classification of bookmarks. Its structure is shown in FIG. 3 and includes: an obtaining module 1 configured to acquire the bookmark link address requested for classification;
  • a classification module 2 configured to classify the bookmark link address requested for classification; and a return module 3 configured to return the classification result to the client as the bookmark category.
  • In this embodiment, the bookmark link address requested for classification is acquired and classified, and the classification result is returned to the client as the bookmark category. This achieves intelligent classification of bookmarks and spares the user the time spent sorting bookmarks manually.
  • The fourth embodiment of the present invention improves on the third embodiment; its structure is shown in FIG. 4.
  • The server for intelligent classification of bookmarks includes an obtaining module 1, a classification module 2, and a return module 3, and may further include a preset module 4.
  • The obtaining module 1 is configured to acquire the bookmark link address requested for classification.
  • The browser sends the bookmark link address of the webpage to the server and uses the server's computing power to classify the link automatically.
  • The preset module 4 is configured to preset URL categories and to set keywords and/or link addresses for each URL category before the classification module 2 is executed; the preset link addresses constitute a link library.
  • Several URL categories are preset, such as technology, education, entertainment, and blogs. Keywords and/or link addresses are set for each URL category, and the preset link addresses constitute the link library.
  • The weight of each keyword corresponding to each preset URL category is calculated with the TF-IDF formula.
  • The weights of all words in the same lexicon are combined to form the N-dimensional vector of the lexicon, where N is the number of keywords in the lexicon. It can be expressed as: (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight).
  • The keyword weights constitute the preset URL category vector, which is used to calculate similarity against the bookmarked web pages saved by the user.
  • the classification module 2 is configured to classify the bookmark link address of the requested classification.
  • The classification module 2 includes a first classification unit 22 and a first matching unit 23; or the classification module 2 includes a first generating unit 24 and a second matching unit 25; or the classification module 2 includes a second generating unit 26 and a third matching unit 27. The classification module 2 may further include a decompression unit 21.
  • The decompression unit 21 is configured to set up a load-balancing server to relieve pressure on the cloud server.
  • The server side places a load-balancing server in front of the classification server cluster to balance the load across the classification servers. Specifically, it receives the bookmark link addresses that clients send for classification, distributes them across the classification server cluster according to the configured balancing policy, and maintains server availability.
  • The first classification unit 22 is configured to match the bookmark link address requested for classification against the link addresses in the link library.
  • a domain name is the name of a computer or group of computers on a network consisting of a series of dot-separated names.
  • the first matching unit 23 is configured to: if there is a link address in the link library that matches the bookmark link address of the request classification, use the URL category corresponding to the matched link address as a classification result.
  • Domain name matching can resolve most well-known domain names, and the URL category corresponding to the matched link address is used as the classification result.
  • The first generating unit 24 is configured to, after the first classification unit has run and found no link address in the link library that matches the bookmark link address requested for classification, capture key information from the webpage corresponding to the bookmark link address and segment the key information into words to generate keywords.
  • The first generating unit 24 may specifically include:
  • a first crawling subunit configured to capture the key information of the webpage corresponding to the bookmark link address.
  • If the link library on the classification server does not contain the link address, the type of the website cannot be determined from known information. The classification server then analyzes the page corresponding to the link address and determines the type of the website itself.
  • The classification server accesses the link address, crawls the webpage, and extracts key information such as the title, keywords, and specific page content for analysis.
  • The method for capturing webpage information is prior art and is not limited in the embodiments of the present invention.
  • a first generating subunit configured to perform word segmentation on the key information to generate keywords.
  • Chinese word segmentation is performed on the captured key information to generate keywords.
  • Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules.
  • The classification server then analyzes the segmented words to determine the category to which they belong.
  • The method for Chinese word segmentation is prior art and is not limited in the embodiments of the present invention.
  • a first calculation subunit configured to calculate the vector of the generated keywords.
  • The weight of each word among the keywords is calculated with the TF-IDF formula, giving the vector of the generated keywords: (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight).
  • the second matching unit 25 is configured to perform similarity comparison between the generated keyword and the preset keyword corresponding to each URL category, and use the URL category corresponding to the preset keyword with the highest similarity as the classification result.
  • The vector of the generated keywords is compared with the vector of each URL category, and all the similarity values obtained are ranked by size. The webpage corresponding to the bookmark link address is then assigned to the URL category with the highest similarity value.
  • The second generating unit 26 is configured to capture key information from the webpage corresponding to the bookmark link address requested for classification and to segment the key information into words to generate keywords. Specifically, the bookmark link address sent by the client may be classified without performing link library matching.
  • the second generating unit 26 may be specifically:
  • a second crawling subunit configured to capture key information of a webpage corresponding to the bookmark link address
  • a second generating subunit configured to perform word segmentation on the key information to generate a keyword
  • a second calculation subunit configured to calculate a vector of the generated keywords.
  • the second generation unit 26 has the same method concept and principle as the first generation unit 24, and therefore will not be described again here.
  • the third matching unit 27 is configured to perform similarity comparison between the generated keyword and the preset keyword corresponding to each URL category, and use the URL category corresponding to the preset keyword with the highest similarity as the classification result.
  • the third matching unit 27 and the second matching unit 25 have the same concept and principle, and therefore are not described herein.
  • the default processing unit 28 is configured to use the default category as the classification result if none of the preset URL categories matches the bookmark link address for which classification is requested.
  • the return module 3 is configured to return the classification result to the client as the bookmark category.
  • the category of the bookmark link address is obtained through matching or calculation by the classification server; the category is returned to the load-balancing server, which in turn returns it to the browser on the client.
  • after receiving the category of the bookmark link address, the browser automatically files the bookmark in the folder for that category, which completes the intelligent sorting and classification of the user's bookmarks.
  • if neither matching nor calculation by the classification server yields a result for the bookmark link address, the bookmark is assigned to the default category, which is returned to the client, and the client places such unclassified bookmarks in the default category directory.
  • by obtaining the bookmark link address for which classification is requested, classifying it, and returning the classification result to the client as the bookmark category, the embodiment achieves intelligent classification of bookmarks.
  • in addition, for a bookmark link address that is not matched in the preset link library, key information is captured from the web page corresponding to the bookmark link address, keywords are generated from it by word segmentation, and their similarity with the keywords of each preset URL category is calculated. The category with the greatest similarity is taken as the category of the bookmark link address, which saves users the time spent organizing bookmarks by hand and gives them a better browser experience.
  • the servers of the third and fourth embodiments have the same concept and principle as the methods of the first and second embodiments, so the parts they share with the first and second embodiments are not described again in the third and fourth embodiments.
  • the integrated unit according to the embodiments of the present invention may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a stand-alone product.
  • on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied as a software product stored in a storage medium and including a number of instructions that cause a computer device (which may be a personal computer, a website, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention.
  • the foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Description

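Method and Server for Intelligent Categorization of Bookmarks
This application claims priority to Chinese Patent Application No. 201010580033.X, entitled "Method and Server for Intelligent Categorization of Bookmarks", filed with the Chinese Patent Office on December 6, 2010, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and a server for intelligent classification of bookmarks.
Background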
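With the development of Internet technology, visiting websites through a browser has become the main way people obtain news and look up information. While using a browser, people usually bookmark the websites and web pages they visit often or want to follow closely, so that the content can be found quickly and conveniently later.
In the prior art, browsers mostly rely on the user to manage saved bookmarks manually. For example, the user creates the bookmark categories, decides which category each saved bookmark belongs to, and files each bookmark by hand.
After analyzing the prior art, the inventors found that it has at least the following disadvantage:
Browsers in the prior art cannot classify bookmarks automatically; the user has to judge the category and then file the bookmark accordingly. For the user, this both takes time and degrades the experience of using the browser.
Summary of the Invention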
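Embodiments of the present invention provide a method and a server for intelligent classification of bookmarks. The technical solutions are as follows.
An embodiment of the present invention provides a method for intelligent classification of bookmarks, comprising:
obtaining a bookmark link address for which a client requests classification;
classifying the bookmark link address for which classification is requested; and
returning the classification result to the client as the bookmark category.
Preferably, classifying the bookmark link address for which classification is requested specifically comprises:
matching the bookmark link address against the link addresses in a link library, where the link library consists of preset link addresses and each preset link address is a link address set for a corresponding preset URL category; and
if the link library contains a link address that matches the bookmark link address, using the URL category corresponding to the matched link address as the classification result.
Preferably, each URL category also corresponds to preset keywords, and after matching the bookmark link address against the link library, the method comprises:
if the link library contains no link address that matches the bookmark link address, capturing key information from the web page corresponding to the bookmark link address, and performing word segmentation on the key information to generate keywords; and
comparing the generated keywords for similarity with the preset keywords corresponding to each URL category, and using the URL category whose preset keywords have the greatest similarity as the classification result.
Preferably, before classifying the bookmark link address for which classification is requested, the method comprises: presetting URL categories, and setting keywords corresponding to the URL categories;
correspondingly, classifying the bookmark link address specifically comprises:
capturing key information from the web page corresponding to the bookmark link address, and performing word segmentation on the key information to generate keywords; and
comparing the generated keywords for similarity with the preset keywords corresponding to each URL category, and using the URL category whose preset keywords have the greatest similarity as the classification result.
Preferably, before classifying the bookmark link address for which classification is requested, the method comprises: presetting URL categories;
classifying the bookmark link address specifically comprises:
if none of the preset URL categories matches the bookmark link address, using a default category as the classification result.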
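An embodiment of the present invention further provides a server for intelligent classification of bookmarks, comprising:
an obtaining module, configured to obtain a bookmark link address for which a client requests classification;
a classification module, configured to classify the bookmark link address for which classification is requested; and
a return module, configured to return the classification result to the client as the bookmark category.
Preferably, the server further comprises:
a presetting module, configured to preset URL categories and to set link addresses corresponding to the URL categories, the preset link addresses forming a link library;
correspondingly, the classification module comprises:
a first classification unit, configured to match the bookmark link address against the link addresses in the link library; and
a first matching unit, configured to use the URL category corresponding to the matched link address as the classification result if the link library contains a link address that matches the bookmark link address.
Preferably, the presetting module is further configured to set keywords corresponding to the URL categories, and correspondingly the classification module further comprises:
a first generating unit, configured to, after the first classification unit has matched the bookmark link address against the link library and if the link library contains no link address that matches it, capture key information from the web page corresponding to the bookmark link address and perform word segmentation on the key information to generate keywords; and
a second matching unit, configured to compare the generated keywords for similarity with the preset keywords corresponding to each URL category, and to use the URL category whose preset keywords have the greatest similarity as the classification result.
Preferably, the server further comprises:
a presetting module, configured to preset URL categories and to set keywords corresponding to the URL categories;
correspondingly, the classification module comprises:
a second generating unit, configured to capture key information from the web page corresponding to the bookmark link address and to perform word segmentation on the key information to generate keywords; and
a third matching unit, configured to compare the generated keywords for similarity with the preset keywords corresponding to each URL category, and to use the URL category whose preset keywords have the greatest similarity as the classification result.
Preferably, the server further comprises:
a presetting module, configured to preset URL categories;
correspondingly, the classification module comprises:
a default processing unit, configured to use a default category as the classification result if none of the preset URL categories matches the bookmark link address for which classification is requested.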
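The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows: by obtaining the bookmark link address for which classification is requested, classifying it, and returning the classification result to the client as the bookmark category, bookmarks are classified intelligently, which saves users the time spent organizing bookmarks by hand and gives them a better browser experience.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the method for intelligent classification of bookmarks provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic flowchart of the method for intelligent classification of bookmarks provided by Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of the server for intelligent classification of bookmarks provided by Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of the server for intelligent classification of bookmarks provided by Embodiment 4 of the present invention.
Detailed Description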
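To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1
The first embodiment of the present invention provides a method for intelligent classification of bookmarks. Its flow is shown in FIG. 1 and comprises:
Step 101: obtain the bookmark link address for which classification is requested;
Step 102: classify the bookmark link address for which classification is requested;
Step 103: return the classification result to the client as the bookmark category.
In this embodiment, by obtaining the bookmark link address for which classification is requested, classifying it, and returning the classification result to the client as the bookmark category, bookmarks are classified intelligently, which saves users the time spent organizing bookmarks by hand and gives them a better browser experience.
Embodiment 2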
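The second embodiment of the present invention is an improvement on the first embodiment. Its flow is shown in FIG. 2 and comprises:
Step 201: the classification server obtains the bookmark link address for which classification is requested.
Specifically, when the user bookmarks a web page in the browser, the browser sends the bookmark link address of that web page to the server side, which uses its greater computing power to classify the bookmark automatically.
Here, the classification server is the server for intelligent classification of bookmarks.
Step 202: the classification server presets URL categories and sets keywords and/or link addresses corresponding to the URL categories; the preset link addresses form the link library.
Specifically, a number of URL categories are preset, for example technology, education, entertainment and blogs. Keywords and/or link addresses corresponding to the URL categories are then set, and the preset link addresses form the link library.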
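Further, in natural language processing, text is currently represented mainly with the vector space model, because natural language in its raw form is not suited to direct mathematical treatment and is therefore hard to process automatically. The idea of the vector space model is to describe a text as a vector $(W_1, W_2, W_3, \ldots, W_m)$, where $W_m$ is the weight of the m-th feature term. A feature term can be a word or a phrase; in general, words work better than phrases as feature terms. Each selected feature term is therefore taken as one axis of the vector space. The text is represented with these feature terms as the dimensions of the vector, and the vector component for each feature term is expressed through its term frequency. The weight is calculated mainly with the TF-IDF formula:

$$W(t,d)=\frac{tf(t,d)\times\log(N/n_i+0.01)}{\sqrt{\sum_{t\in d}\left[tf(t,d)\times\log(N/n_i+0.01)\right]^{2}}}$$

where $W(t,d)$ is the weight of word t in text d, $tf(t,d)$ is the frequency of word t in text d, N is the total number of training texts, $n_i$ is the number of texts in the training set in which t occurs, $i = 1, 2, \ldots, m$ (m being the number of words), and the denominator is a normalization factor. TF-IDF is a weighting technique commonly used in information retrieval and data mining. It is widely applied in search, document classification and related fields to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. The more frequently a term appears in a given document, the better it distinguishes that document's content, and the larger its weight.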
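As a minimal sketch of the weighting formula above (not the patent's own implementation), the following Python function computes normalized TF-IDF weights for one tokenized text. The function name `tfidf_weights`, the toy corpus and the choice of the natural logarithm are assumptions made purely for illustration.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus):
    """Return {word: W(t, d)} for one tokenized text d.

    corpus is a list of tokenized training texts, so N = len(corpus).
    Words that occur in no training text are skipped (n_i would be 0).
    """
    n_total = len(corpus)
    term_freq = Counter(doc_tokens)           # tf(t, d)
    raw = {}
    for word, freq in term_freq.items():
        n_i = sum(1 for text in corpus if word in text)
        if n_i == 0:
            continue
        raw[word] = freq * math.log(n_total / n_i + 0.01)
    # The denominator of the formula: the Euclidean norm of the raw weights.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else {}

# Hypothetical training texts, already segmented into words.
corpus = [["phone", "chip", "review"], ["school", "exam"], ["movie", "star", "review"]]
weights = tfidf_weights(["chip", "chip", "review"], corpus)
```

Because of the normalization factor, the squared weights of a text sum to 1, and a word that is frequent in the text but rare in the corpus ("chip" here) receives the larger weight.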
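Further, the weight of every keyword corresponding to each preset URL category is calculated with the TF-IDF formula. The weights of all words in the same lexicon are combined into an N-dimensional vector for that lexicon, where N is the number of keywords in the lexicon. It can be written as (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight). The keyword weights form the preset URL category vector, which is used to calculate similarity with the web pages the user bookmarks.
Step 203: the classification server classifies the bookmark link address for which classification is requested.
Specifically, after the server side receives a bookmark link address sent by the client, a load-balancing server assigns the bookmark link addresses to different classification servers for classification.
Further, this embodiment classifies bookmarks in three ways. In the first, the bookmark link address is matched against the link library. In the second, if no match is found, the classification result is obtained by comparing similarity with the keywords of the preset categories using the vector space model. In the third, no link matching is performed, and the classification result is obtained directly by comparing similarity with the keywords of the preset categories using the vector space model. Step 203 may therefore specifically be:
Step 2031: set up a load-balancing server to relieve the pressure on the classification servers.
Specifically, the server side places a load-balancing server in front of the classification server cluster to balance the load on the classification servers. It receives the bookmark link addresses sent by clients, distributes them across the classification server cluster according to the configured balancing policy, and maintains server availability.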
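The text does not say which balancing policy is configured. As a hedged illustration only, the sketch below assumes a simple round-robin policy that skips servers marked unavailable; the class name `RoundRobinBalancer` and the server names are hypothetical.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Assumed policy: hand requests to classification servers in turn."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.available = set(self.servers)    # availability bookkeeping
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.available.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.available.add(server)

    def pick(self):
        """Return the next available classification server."""
        if not self.available:
            raise RuntimeError("no classification server available")
        # At most one full turn of the ring is needed to find a live server.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.available:
                return server

balancer = RoundRobinBalancer(["cls-1", "cls-2", "cls-3"])
balancer.mark_down("cls-2")
assigned = [balancer.pick() for _ in range(4)]
```

With `cls-2` marked down, requests alternate between the two remaining servers.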
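Step 2032: the classification server matches the bookmark link address against the link addresses in the link library.
Specifically, after receiving a bookmark link address assigned by the load-balancing server, the classification server first performs domain-name matching of that link address in the link library. A domain name is the name of a computer or group of computers on the network, made up of a string of names separated by dots.
Step 2033: if the link library contains a link address that matches the bookmark link address, the classification server uses the URL category corresponding to the matched link address as the classification result.
Specifically, domain-name matching finds most known domain names. The URL category corresponding to the matched link address is used as the classification result, and step 204 can then be performed to return the classification result to the client.
For example, the domain name of a bookmark link address saved by the user is cnbeta.com, and the link library already records cnbeta.com and maps this domain name to the technology news category. After receiving the request from the client, the server side matches this domain name in the link library, finds that the corresponding URL category is technology news, and returns this category to the client.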
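The cnbeta.com example can be sketched as a dictionary lookup on the host name. This is an illustrative assumption about how domain-name matching might work: the names `LINK_LIBRARY` and `match_link_library`, and the rule of also trying parent domains, are not taken from the text.

```python
from urllib.parse import urlsplit

# Hypothetical link library: preset domain name -> URL category.
LINK_LIBRARY = {"cnbeta.com": "Technology news"}

def match_link_library(url, library=LINK_LIBRARY):
    """Return the category for the URL's domain, or None if unmatched."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then each parent domain
    # (www.cnbeta.com -> cnbeta.com -> com).
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in library:
            return library[candidate]
    return None

category = match_link_library("http://www.cnbeta.com/articles/1.htm")
```

A `None` result corresponds to the "not matched in the link library" case, where classification falls through to the keyword-based steps.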
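Further, if the bookmark link address saved by the user is not matched in the link library, step 2034 is performed.
Step 2034: if the link library contains no link address that matches the bookmark link address, the classification server captures key information from the web page corresponding to the bookmark link address and performs word segmentation on the key information to generate keywords.
Step 2034 may therefore specifically be:
Step 20341: capture the key information of the web page corresponding to the bookmark link address.
Specifically, if the bookmark link address saved by the user is not matched in the link library, the classification server has not indexed this link address and cannot determine the type of the website from known information. The classification server then analyzes the page corresponding to the link address and determines the type of the website itself.
Further, the classification server visits the link address and captures the key information of the web page, such as the title, the keywords and the actual page content, and returns it to the classification server for analysis. Methods of capturing web page information are prior art and are not limited in this embodiment.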
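Since the capture method is left open, the following is only one possible sketch: it pulls the title, the `keywords` meta tag and the visible text out of an HTML string with Python's standard `html.parser`. Fetching the page over the network is omitted, and the names `KeyInfoParser` and `extract_key_info` are hypothetical.

```python
from html.parser import HTMLParser

class KeyInfoParser(HTMLParser):
    """Collects title, meta keywords and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.body_text = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.body_text.append(data.strip())

def extract_key_info(html):
    parser = KeyInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(),
            "keywords": parser.keywords,
            "text": " ".join(parser.body_text)}

page = ('<html><head><title>Chip review</title>'
        '<meta name="keywords" content="chip,phone">'
        '<script>var x=1;</script></head>'
        '<body><p>New phone chip tested.</p></body></html>')
info = extract_key_info(page)
```

The three fields mirror the title, keywords and page content named in the text; script and style content is ignored because it is not visible text.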
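Step 20342: perform word segmentation on the key information to generate keywords.
Specifically, Chinese word segmentation is performed on the captured key information to generate keywords. Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to a given specification. The classification server then analyzes the segmented words to determine the category they belong to. Methods of Chinese word segmentation are prior art and are not limited in this embodiment.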
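The text leaves the segmentation method open. Purely as an illustration, the sketch below uses forward maximum matching against a small dictionary, a common baseline that is swapped in here and is not claimed to be the method used by the classification server. The vocabulary is a made-up example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy dictionary segmentation: longest match first, left to right."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical dictionary of known words.
vocabulary = {"科技", "新闻", "手机", "芯片"}
tokens = forward_max_match("手机芯片科技新闻", vocabulary)
```

Characters that are not covered by the dictionary come out as single-character tokens, so the function always consumes the whole input.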
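Step 20343: calculate the vector of the generated keywords.
Specifically, the weight of each word in the keywords is calculated with the TF-IDF formula described in step 202, giving a vector for the generated keywords: (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight).
Step 2035: the classification server compares the generated keywords for similarity with the preset keywords corresponding to each URL category, and uses the URL category whose preset keywords have the greatest similarity as the classification result.
Specifically, using a computer instead of a person to classify objects such as documents generally falls into automatic clustering and automatic classification. The main difference is that automatic clustering does not need a classification scheme defined in advance: it calculates similarity and does not require the server side to collect feature lexicons and link addresses for each category beforehand, but its results are worse than those of automatic classification. Automatic classification requires a fixed category scheme and a set of pre-classified objects for each category as a training corpus; at classification time, the learned classification knowledge is used to assign one or more categories to the document to be classified.
This embodiment uses automatic classification: the key information of the captured and segmented web page is processed with the text vector space model to determine the type it belongs to. The category scheme is fixed before the classification calculation, and each category is given a corresponding lexicon as its training corpus. Text classification automatically determines the category associated with a text from its content, under a given classification scheme. Mathematically, text classification is a mapping that takes text with no category label to an existing category.
Further, the similarity between the vector of the generated keywords and the vector of every URL category is calculated, and the resulting similarity values are sorted by size; the web page corresponding to the bookmark link address is judged to belong to the URL category with the greatest similarity value.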
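The similarity of the two texts being compared is expressed as the cosine of the angle between their corresponding vectors, calculated as:

$$Sim(d_i,d_j)=\frac{\sum_{k}W_{ik}\,W_{jk}}{\sqrt{\left(\sum_{k}W_{ik}^{2}\right)\left(\sum_{k}W_{jk}^{2}\right)}}$$

where $W_{ik}$ and $W_{jk}$ are the weights of the k-th feature term of texts $d_i$ and $d_j$ respectively, and $Sim(d_i,d_j)$ is the similarity of the two texts $d_i$ and $d_j$.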
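The cosine comparison and the choice of the most similar category can be sketched as follows. Vectors are stored as sparse dictionaries from feature term to weight, which is an implementation assumption; the category vectors and the page vector are invented sample data.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_category(page_vector, category_vectors):
    """Score every category, sort by similarity, return the top one and the ranking."""
    scores = {name: cosine_similarity(page_vector, vec)
              for name, vec in category_vectors.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0], ranked

# Hypothetical preset category vectors and one page vector.
category_vectors = {
    "Technology": {"chip": 0.8, "phone": 0.6},
    "Education": {"school": 0.7, "exam": 0.7},
}
(top_name, top_score), ranking = best_category({"chip": 1.0, "review": 0.5}, category_vectors)
```

Terms missing from one vector count as weight 0, so only shared feature terms contribute to the numerator.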
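Of course, similarity comparison is only one way of determining the category; the category can also be determined in other ways, and this embodiment places no limit on this.
Step 2036: the classification server captures key information from the web page corresponding to the bookmark link address for which classification is requested, and performs word segmentation on the key information to generate keywords.
Specifically, when classifying a bookmark link address sent by the client, link library matching may also be skipped and the similarity calculation performed directly. Step 2036 may therefore specifically be:
Step 20361: capture the key information of the web page corresponding to the bookmark link address.
Step 20362: perform word segmentation on the key information to generate keywords.
Step 20363: calculate the vector of the generated keywords.
Further, step 2036 has the same method concept and principle as step 2034 and is not described again here.
Step 2037: the classification server compares the generated keywords for similarity with the preset keywords corresponding to each URL category, and uses the URL category whose preset keywords have the greatest similarity as the classification result.
Specifically, step 2037 has the same method concept and principle as step 2035 and is not described again here.
Step 2038: if none of the preset URL categories matches the bookmark link address for which classification is requested, the classification server uses the default category as the classification result.
Specifically, if neither matching nor calculation by the classification server yields a result for the bookmark link address, the bookmark is assigned to the default category, which is returned to the client.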
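Taken together, steps 2032 to 2038 amount to a lookup, a similarity fallback and a default. The sketch below wires these together with caller-supplied functions; `classify_bookmark`, its parameters, the zero `threshold` used to decide that "no result" was obtained, and the word-overlap stand-in for similarity are all assumptions for illustration, not details from the text.

```python
DEFAULT_CATEGORY = "Default"

def classify_bookmark(url, lookup_link, vectorize_page, category_vectors,
                      similarity, threshold=0.0):
    """lookup_link, vectorize_page and similarity are hypothetical callables."""
    category = lookup_link(url)                  # steps 2032-2033: link library
    if category is not None:
        return category
    page_vector = vectorize_page(url)            # step 2034: capture, segment, vectorize
    best_name, best_score = None, threshold
    for name, vec in category_vectors.items():   # step 2035: compare with each category
        score = similarity(page_vector, vec)
        if score > best_score:
            best_name, best_score = name, score
    # Step 2038: fall back to the default category when nothing matched.
    return best_name if best_name is not None else DEFAULT_CATEGORY

def overlap(a, b):
    """Stand-in similarity: number of shared words (not the cosine measure)."""
    return len(set(a) & set(b))

library = {"http://cnbeta.com/": "Technology news"}
pages = {"http://edu.example/": ["school", "exam"],
         "http://misc.example/": ["weather"]}
cats = {"Education": ["school", "exam", "teacher"],
        "Technology news": ["chip", "phone"]}

results = [classify_bookmark(u, library.get, lambda x: pages.get(x, []), cats, overlap)
           for u in ("http://cnbeta.com/", "http://edu.example/", "http://misc.example/")]
```

The three sample addresses exercise the three outcomes: a link library hit, a similarity match, and the default category.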
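Step 204: the classification server returns the classification result to the client as the bookmark category.
Specifically, link library matching and/or text similarity calculation has been performed on the bookmark link address in the process above, and the classification result is returned to the client.
The category of the bookmark link address is obtained through matching or calculation by the classification server; the category is returned to the load-balancing server, which in turn returns it to the browser on the client. After receiving the category of the bookmark link address, the browser automatically files the bookmark in the folder for that category, which completes the intelligent sorting and classification of the user's bookmarks.
Further, if neither matching nor calculation by the classification server yields a result for the bookmark link address, the bookmark is assigned to the default category, which is returned to the client, and the client places such unclassified bookmarks in the default category directory.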
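On the client side, filing a bookmark under the returned category can be pictured as appending to a per-category folder. This is a hypothetical sketch: the function `file_bookmark`, the in-memory `folders` mapping and the folder name "Default" are illustrative assumptions rather than browser APIs.

```python
from collections import defaultdict

def file_bookmark(folders, url, category, default_folder="Default"):
    """Put the bookmark into the folder named after the returned category."""
    folders[category or default_folder].append(url)
    return folders

folders = defaultdict(list)
file_bookmark(folders, "http://www.cnbeta.com/", "Technology news")
file_bookmark(folders, "http://misc.example/", None)   # no category returned
```

A missing category lands in the default folder, matching the handling of unclassified bookmarks described above.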
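In this embodiment, by obtaining the bookmark link address for which classification is requested, classifying it, and returning the classification result to the client as the bookmark category, bookmarks are classified intelligently. In addition, for a bookmark link address that is not matched in the preset link library, key information is captured from the web page corresponding to the bookmark link address, keywords are generated from the key information by word segmentation, and the similarity between these keywords and the keywords of each preset URL category is calculated. The category with the greatest similarity is taken as the category of the bookmark link address, which saves users the time spent organizing bookmarks by hand and gives them a better browser experience.
Embodiment 3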
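The third embodiment of the present invention provides a server for intelligent classification of bookmarks. Its structure is shown in FIG. 3 and comprises:
an obtaining module 1, configured to obtain the bookmark link address for which classification is requested;
a classification module 2, configured to classify the bookmark link address for which classification is requested; and
a return module 3, configured to return the classification result to the client as the bookmark category.
In this embodiment, by obtaining the bookmark link address for which classification is requested, classifying it, and returning the classification result to the client as the bookmark category, bookmarks are classified intelligently, which saves users the time spent organizing bookmarks by hand and gives them a better browser experience.
Embodiment 4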
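The fourth embodiment of the present invention is an improvement on the third embodiment. Its structure is shown in FIG. 4. The server for intelligent classification of bookmarks comprises an obtaining module 1, a classification module 2 and a return module 3, and may further comprise a presetting module 4.
The obtaining module 1 is configured to obtain the bookmark link address for which classification is requested.
Specifically, when the user bookmarks a web page in the browser, the browser sends the bookmark link address of that web page to the server side, which uses its greater computing power to classify the link automatically.
The presetting module 4 is configured to, before the classification module 2 runs, preset URL categories and set keywords and/or link addresses corresponding to the URL categories; the preset link addresses form the link library.
Specifically, a number of URL categories are preset, for example technology, education, entertainment and blogs. Keywords and/or link addresses corresponding to the URL categories are then set, and the preset link addresses form the link library.
Further, the weight of every keyword corresponding to each preset URL category is calculated with the TF-IDF formula. The weights of all words in the same lexicon are combined into an N-dimensional vector for that lexicon, where N is the number of keywords in the lexicon. It can be written as (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight). The keyword weights form the preset URL category vector, which is used to calculate similarity with the web pages the user bookmarks.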
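The classification module 2 is configured to classify the bookmark link address for which classification is requested.
Specifically, the classification module 2 comprises a first classification unit 22 and a first matching unit 23; or the classification module 2 comprises a first generating unit 24 and a second matching unit 25; or the classification module 2 comprises a second generating unit 26 and a third matching unit 27. Further, the classification module 2 may also comprise a pressure-relief unit 21.
The pressure-relief unit 21 is configured to set up a load-balancing server to relieve the pressure on the cloud servers.
Specifically, the server side places a load-balancing server in front of the classification server cluster to balance the load on the classification servers. It receives the bookmark link addresses sent by clients, distributes them across the classification server cluster according to the configured balancing policy, and maintains server availability.
The first classification unit 22 is configured to match the bookmark link address against the link addresses in the link library.
Specifically, after receiving a bookmark link address assigned by the load-balancing server, the classification server first performs domain-name matching of that link address in the link library. A domain name is the name of a computer or group of computers on the network, made up of a string of names separated by dots.
The first matching unit 23 is configured to use the URL category corresponding to the matched link address as the classification result if the link library contains a link address that matches the bookmark link address.
Specifically, domain-name matching finds most known domain names, and the URL category corresponding to the matched link address is used as the classification result.
The first generating unit 24 is configured to, after the first classification unit has run and if the link library contains no link address that matches the bookmark link address, capture key information from the web page corresponding to the bookmark link address and perform word segmentation on the key information to generate keywords.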
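Specifically, the first generating unit 24 may comprise:
a first crawling subunit, configured to capture the key information of the web page corresponding to the bookmark link address.
Specifically, if the bookmark link address saved by the user is not matched in the link library, the classification server has not indexed this link address and cannot determine the type of the website from known information. The classification server then analyzes the page corresponding to the link address and determines the type of the website itself.
Further, the classification server visits the link address and captures the key information of the web page, such as the title, the keywords and the actual page content, and returns it to the classification server for analysis. Methods of capturing web page information are prior art and are not limited in this embodiment.
a first generating subunit, configured to perform word segmentation on the key information to generate keywords.
Specifically, Chinese word segmentation is performed on the captured key information to generate keywords. Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to a given specification. The classification server then analyzes the segmented words to determine the category they belong to. Methods of Chinese word segmentation are prior art and are not limited in this embodiment.
a first calculation subunit, configured to calculate the vector of the generated keywords.
Specifically, the weight of each word in the keywords is calculated with the TF-IDF formula, giving a vector for the generated keywords: (word 1 weight, word 2 weight, word 3 weight, word 4 weight, ..., word N weight).
The second matching unit 25 is configured to compare the generated keywords for similarity with the preset keywords corresponding to each URL category, and to use the URL category whose preset keywords have the greatest similarity as the classification result.
Further, the similarity between the vector of the generated keywords and the vector of every URL category is calculated, and the resulting similarity values are sorted by size; the web page corresponding to the bookmark link address is judged to belong to the URL category with the greatest similarity value.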
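The second generating unit 26 is configured to capture key information from the web page corresponding to the bookmark link address for which classification is requested, and to perform word segmentation on the key information to generate keywords.
Specifically, a bookmark link address sent by the client may also be classified without matching it against the link library.
Further, the second generating unit 26 may comprise:
a second crawling subunit, configured to capture the key information of the web page corresponding to the bookmark link address;
a second generating subunit, configured to perform word segmentation on the key information to generate keywords;
a second calculation subunit, configured to calculate the vector of the generated keywords.
Further, the second generating unit 26 has the same method concept and principle as the first generating unit 24 and is not described again here.
The third matching unit 27 is configured to compare the generated keywords for similarity with the preset keywords corresponding to each URL category, and to use the URL category whose preset keywords have the greatest similarity as the classification result.
Specifically, the third matching unit 27 has the same method concept and principle as the second matching unit 25 and is not described again here.
The default processing unit 28 is configured to use the default category as the classification result if none of the preset URL categories matches the bookmark link address for which classification is requested.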
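The return module 3 is configured to return the classification result to the client as the bookmark category.
Specifically, the category of the bookmark link address is obtained through matching or calculation by the classification server; the category is returned to the load-balancing server, which in turn returns it to the browser on the client. After receiving the category of the bookmark link address, the browser automatically files the bookmark in the folder for that category, which completes the intelligent sorting and classification of the user's bookmarks.
Further, if neither matching nor calculation by the classification server yields a result for the bookmark link address, the bookmark is assigned to the default category, which is returned to the client, and the client places such unclassified bookmarks in the default category directory.
In this embodiment, by obtaining the bookmark link address for which classification is requested, classifying it, and returning the classification result to the client as the bookmark category, bookmarks are classified intelligently. In addition, for a bookmark link address that is not matched in the preset link library, key information is captured from the web page corresponding to the bookmark link address, keywords are generated from the key information by word segmentation, and the similarity between these keywords and the keywords of each preset URL category is calculated. The category with the greatest similarity is taken as the category of the bookmark link address, which saves users the time spent organizing bookmarks by hand and gives them a better browser experience.
The servers of the third and fourth embodiments have the same concept and principle as the methods of the first and second embodiments, so the parts they share with the first and second embodiments are not described again in the third and fourth embodiments.
If the integrated unit described in the embodiments of the present invention is implemented as a software functional unit and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a website, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.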

Claims

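What is claimed is:
1. A method for intelligent classification of bookmarks, characterized in that the method comprises:
obtaining a bookmark link address for which a client requests classification;
classifying the bookmark link address for which classification is requested; and
returning the classification result to the client as the bookmark category.
2. The method for intelligent classification of bookmarks according to claim 1, characterized in that classifying the bookmark link address for which classification is requested specifically comprises:
matching the bookmark link address against the link addresses in a link library, where the link library consists of preset link addresses and each preset link address is a link address set for a corresponding preset URL category; and
if the link library contains a link address that matches the bookmark link address, using the URL category corresponding to the matched link address as the classification result.
3. The method for intelligent classification of bookmarks according to claim 2, characterized in that each URL category also corresponds to preset keywords, and after matching the bookmark link address against the link library, the method comprises:
if the link library contains no link address that matches the bookmark link address, capturing key information from the web page corresponding to the bookmark link address, and performing word segmentation on the key information to generate keywords; and
comparing the generated keywords for similarity with the preset keywords corresponding to each URL category, and using the URL category whose preset keywords have the greatest similarity as the classification result.
4. The method for intelligent classification of bookmarks according to claim 1, characterized in that, before classifying the bookmark link address for which classification is requested, the method comprises:
presetting URL categories, and setting keywords corresponding to the URL categories;
correspondingly, classifying the bookmark link address specifically comprises:
capturing key information from the web page corresponding to the bookmark link address, and performing word segmentation on the key information to generate keywords; and
comparing the generated keywords for similarity with the preset keywords corresponding to each URL category, and using the URL category whose preset keywords have the greatest similarity as the classification result.
5. The method for intelligent classification of bookmarks according to claim 1, characterized in that, before classifying the bookmark link address for which classification is requested, the method comprises:
presetting URL categories;
classifying the bookmark link address specifically comprises:
if none of the preset URL categories matches the bookmark link address, using a default category as the classification result.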
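6. A server for intelligent classification of bookmarks, characterized in that the server comprises:
an obtaining module, configured to obtain a bookmark link address for which a client requests classification;
a classification module, configured to classify the bookmark link address for which classification is requested; and
a return module, configured to return the classification result to the client as the bookmark category.
7. The server for intelligent classification of bookmarks according to claim 6, characterized in that the server further comprises:
a presetting module, configured to preset URL categories and to set link addresses corresponding to the URL categories, the preset link addresses forming a link library;
correspondingly, the classification module comprises:
a first classification unit, configured to match the bookmark link address against the link addresses in the link library; and
a first matching unit, configured to use the URL category corresponding to the matched link address as the classification result if the link library contains a link address that matches the bookmark link address.
8. The server for intelligent classification of bookmarks according to claim 7, characterized in that the presetting module is further configured to set keywords corresponding to the URL categories, and correspondingly the classification module further comprises:
a first generating unit, configured to, after the first classification unit has matched the bookmark link address against the link library and if the link library contains no link address that matches it, capture key information from the web page corresponding to the bookmark link address and perform word segmentation on the key information to generate keywords; and
a second matching unit, configured to compare the generated keywords for similarity with the preset keywords corresponding to each URL category, and to use the URL category whose preset keywords have the greatest similarity as the classification result.
9. The server for intelligent classification of bookmarks according to claim 7, characterized in that the server further comprises:
a presetting module, configured to preset URL categories and to set keywords corresponding to the URL categories;
correspondingly, the classification module comprises:
a second generating unit, configured to capture key information from the web page corresponding to the bookmark link address and to perform word segmentation on the key information to generate keywords; and
a third matching unit, configured to compare the generated keywords for similarity with the preset keywords corresponding to each URL category, and to use the URL category whose preset keywords have the greatest similarity as the classification result.
10. The server for intelligent classification of bookmarks according to claim 7, characterized in that the server further comprises:
a presetting module, configured to preset URL categories;
correspondingly, the classification module comprises:
a default processing unit, configured to use a default category as the classification result if none of the preset URL categories matches the bookmark link address for which classification is requested.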
PCT/CN2011/082620 2010-12-06 2011-11-22 Method and server for intelligent categorization of bookmarks WO2012075884A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/910,478 US9106698B2 (en) 2010-12-06 2013-06-05 Method and server for intelligent categorization of bookmarks

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010580033XA CN102486791A (zh) 2010-12-06 2010-12-06 Method and server for intelligent categorization of bookmarks
CN201010580033.X 2010-12-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/910,478 Continuation US9106698B2 (en) 2010-12-06 2013-06-05 Method and server for intelligent categorization of bookmarks

Publications (1)

Publication Number Publication Date
WO2012075884A1 true WO2012075884A1 (zh) 2012-06-14

Family

ID=46152284

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/082620 WO2012075884A1 (zh) 2010-12-06 2011-11-22 Method and server for intelligent categorization of bookmarks

Country Status (3)

Country Link
US (1) US9106698B2 (zh)
CN (1) CN102486791A (zh)
WO (1) WO2012075884A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131163A1 (en) * 2010-11-24 2012-05-24 International Business Machines Corporation Balancing the loads of servers in a server farm based on an angle between two vectors

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298614B (zh) * 2011-07-29 2015-04-22 百度在线网络技术(北京)有限公司 一种确定网页收藏信息的收藏分类的方法、装置和设备
CN103577492B (zh) * 2012-08-09 2018-07-06 腾讯科技(深圳)有限公司 网页主页生成方法及装置
CN102902796B (zh) * 2012-09-29 2016-07-06 北京奇虎科技有限公司 浏览器网页标签自动分组系统及方法
CN102902788B (zh) * 2012-09-29 2016-07-06 北京奇虎科技有限公司 浏览器网页标签自动分组系统及方法
CN102929963B (zh) * 2012-10-11 2019-03-29 北京百度网讯科技有限公司 一种网址类型的设置方法及系统
CN103853730B (zh) * 2012-11-29 2018-09-21 腾讯科技(深圳)有限公司 控制网络链接快捷方式分类的方法和系统
CN103324669B (zh) * 2013-05-20 2016-12-28 北京奇虎科技有限公司 一种对网页书签进行处理的方法和客户端
US9060203B2 (en) * 2013-10-16 2015-06-16 International Business Machines Corporation Personalized categorization of television programming
CN105224533B (zh) * 2014-05-28 2019-09-03 北京搜狗科技发展有限公司 浏览器收藏夹整理方法和装置
CN105989109B (zh) * 2015-02-12 2019-10-25 Oppo广东移动通信有限公司 一种显示应用详情的方法及装置
CN104809234B (zh) * 2015-05-11 2018-02-23 中国联合网络通信集团有限公司 浏览器书签的处理方法及终端
US10157235B2 (en) 2015-06-30 2018-12-18 Microsoft Technology Licensing, Llc Automatic grouping of browser bookmarks
CN105653571A (zh) * 2015-07-31 2016-06-08 广州市动景计算机科技有限公司 书签存储及书签操作指令的响应方法、浏览器
CN105677815B (zh) * 2015-12-30 2019-07-16 Oppo广东移动通信有限公司 一种网页书签添加方法及终端
CN107193814B (zh) * 2016-03-14 2020-07-31 北京京东尚科信息技术有限公司 数字阅读中实现书籍自动分类整理的方法和装置
CN107436907A (zh) * 2016-05-27 2017-12-05 中国联合网络通信集团有限公司 网络文本分类整合方法及装置
CN106202312B (zh) * 2016-07-01 2019-10-18 天翼智慧家庭科技有限公司 一种用于移动互联网的兴趣点搜索方法和系统
CN106528838A (zh) * 2016-11-23 2017-03-22 北京小米移动软件有限公司 书签保存方法和装置
CN108287848B (zh) * 2017-01-10 2020-09-04 中国移动通信集团贵州有限公司 用于语义解析的方法和系统
CN108959316B (zh) * 2017-05-24 2021-08-20 北京搜狗科技发展有限公司 一种将网页添加至收藏夹的方法和装置
US11210357B2 (en) 2018-09-17 2021-12-28 International Business Machines Corporation Automatically categorizing bookmarks from customized folders and implementation based on web browsing activity
CN109918587A (zh) * 2019-01-23 2019-06-21 平安科技(深圳)有限公司 网页书签管理方法、装置、电子设备及存储介质
CN110021439B (zh) * 2019-03-07 2023-01-24 平安科技(深圳)有限公司 基于机器学习的医疗数据分类方法、装置和计算机设备

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6275862B1 (en) * 1999-01-06 2001-08-14 International Business Machines Corporation Automatic categorization of bookmarks in a web browser
US20040205499A1 (en) * 2001-11-29 2004-10-14 International Business Machines Corporation Apparatus and method of organizing bookmarked web pages into categories

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5801702A (en) * 1995-03-09 1998-09-01 Terrabyte Technology System and method for adding network links in a displayed hierarchy
US6100890A (en) * 1997-11-25 2000-08-08 International Business Machines Corporation Automatic bookmarks
US6832350B1 (en) * 1998-09-30 2004-12-14 International Business Machines Corporation Organizing and categorizing hypertext document bookmarks by mutual affinity based on predetermined affinity criteria
US6574625B1 (en) * 2000-09-12 2003-06-03 International Business Machines Corporation Real-time bookmarks
US20030101216A1 (en) * 2001-11-29 2003-05-29 International Business Machines Corporation Apparatus and method of linking sub-folders in a bookmark folder
WO2005008527A1 (ja) * 2003-07-16 2005-01-27 Fujitsu Limited 動的にカテゴライズされるブックマーク管理装置
US7747937B2 (en) * 2005-08-16 2010-06-29 Rojer Alan S Web bookmark manager
CN101593200B (zh) * 2009-06-19 2012-10-03 淮海工学院 基于关键词频度分析的中文网页分类方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131163A1 (en) * 2010-11-24 2012-05-24 International Business Machines Corporation Balancing the loads of servers in a server farm based on an angle between two vectors
US20120158941A1 (en) * 2010-11-24 2012-06-21 International Business Machines Corporation Balancing the loads of servers in a server farm based on an angle between two vectors
US8645545B2 (en) 2010-11-24 2014-02-04 International Business Machines Corporation Balancing the loads of servers in a server farm based on an angle between two vectors
US8676983B2 (en) 2010-11-24 2014-03-18 International Business Machines Corporation Balancing the loads of servers in a server farm based on an angle between two vectors

Also Published As

Publication number Publication date
US9106698B2 (en) 2015-08-11
US20130297827A1 (en) 2013-11-07
CN102486791A (zh) 2012-06-06

Similar Documents

Publication Publication Date Title
WO2012075884A1 (zh) Method and server for intelligent categorization of bookmarks
KR101721338B1 (ko) 검색 엔진 및 그의 구현 방법
US8341147B2 (en) Blending mobile search results
US9489401B1 (en) Methods and systems for object recognition
El-Beltagy et al. KP-Miner: A keyphrase extraction system for English and Arabic documents
US7962487B2 (en) Ranking oriented query clustering and applications
CN106202124B (zh) 网页分类方法及装置
US10210179B2 (en) Dynamic feature weighting
US8498455B2 (en) Scalable face image retrieval
WO2016058267A1 (zh) 一种基于网站主页特征分析的中文网站分类方法和系统
US9830379B2 (en) Name disambiguation using context terms
US20150161129A1 (en) Image result provisioning based on document classification
JP2010539589A (ja) 電子的情報源からの特定のエンティティに関連する情報の特定
CN112256861B (zh) 一种基于搜索引擎返回结果的谣言检测方法及电子装置
WO2017113592A1 (zh) 模型生成方法、词语赋权方法、装置、设备及计算机存储介质
CN103020208B (zh) 一种与移动终端相适应的搜索方法及装置
CN104615723B (zh) 查询词权重值的确定方法和装置
CN109948154A (zh) 一种基于邮箱名的人物获取及关系推荐系统和方法
CN106886577B (zh) 一种多维度网页浏览行为评估方法
CN113761125A (zh) 动态摘要确定方法和装置、计算设备以及计算机存储介质
Divya et al. Onto-search: An ontology based personalized mobile search engine
JP2010282403A (ja) 文書検索方法
AU2021100441A4 (en) A method of text mining in ranking of web pages using machine learning
Garg et al. On-Device Document Classification using multimodal features
Chan et al. Enhancing classification effectiveness of Chinese news based on term frequency

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11847449

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 161013)

122 Ep: pct application non-entry in european phase

Ref document number: 11847449

Country of ref document: EP

Kind code of ref document: A1