CN100472484C - Feedback loop for spam prevention

Info

Publication number: CN100472484C
Application number: CNB2004800037693A
Authority: CN
Inventors: R. L. Rounthwaite; D. E. Heckerman; J. D. Mehr; N. D. Howell; M. C. Rupersburg; D. A. Slawson; J. T. Goodman
Original assignee: Microsoft Corp
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2003-03-03
Filing date: 2004-02-25
Publication date: 2009-03-25
Anticipated expiration: 2024-02-25
Also published as: AU2004216772A1; CA2513967A1; US7219148B2; EP1599781A2; NO20053733L; US20070208856A1; CO6141494A2; EG23988A; MXPA05008303A; WO2004079514A2; TWI331869B; ZA200506085B; WO2004079514A3; US7558832B2; AU2004216772A2; NZ541628A; JP4828411B2; JP2006521635A; BRPI0407045A; AU2004216772B2

Abstract

The subject invention provides for a feedback loop system and method that facilitate classifying items in connection with spam prevention in server and/or client-based architectures. The invention makes use of a machine-learning approach as applied to spam filters, and in particular, randomly samples incoming email messages so that examples of both legitimate and junk/spam mail are obtained to generate sets of training data. Users who are identified as spam-fighters are asked to vote on whether each of a selection of their incoming email messages is legitimate mail or junk mail. A database stores the properties for each mail and voting transaction, such as user information, message properties and content summary, and polling results for each message, to generate training data for machine learning systems. The machine learning systems facilitate creating improved spam filter(s) that are trained to recognize both legitimate mail and spam mail and to distinguish between them.

Description

Feedback loop for spam prevention

Technical field

The present invention relates to systems and methods for identifying both legitimate information (e.g., good mail) and undesired information (e.g., junk mail), and more particularly to classifying electronic mail correspondence for spam prevention.

Background of the invention

The advent of global communications networks such as the Internet has presented commercial opportunities for reaching vast numbers of potential customers. Electronic messaging, and particularly electronic mail ("email"), is becoming increasingly pervasive as a means of disseminating unwanted advertisements and promotions (also denoted as "spam") to network users.

The Radicati Group, Inc., a consulting and market research firm, estimates that as of August 2002, two billion junk email messages were sent each day, a number expected to triple every two years. Individuals and entities (e.g., businesses, government agencies) are becoming increasingly inconvenienced, and oftentimes overwhelmed, by junk messages. As such, junk email is now, or soon will become, a major threat to trustworthy computing.

A key technique used to thwart junk email is the use of filtering systems and/or methods. One proven filtering technique is based on a machine learning approach: machine learning filters assign to an incoming message a probability that the message is junk. In this approach, features are typically extracted from two classes of example messages (e.g., junk and non-junk messages), and a learning filter is applied to discriminate probabilistically between the two classes. Because many message features relate to content (e.g., words and phrases in the subject and/or body of the message), such filters are commonly referred to as "content-based filters".

Some junk/spam filters are adaptive, which is important because multilingual users and users who speak rare languages need a filter that can adapt to their specific needs. Furthermore, not all users agree on what is and is not junk/spam. Accordingly, by employing a filter that can be trained implicitly (e.g., by observing user behavior), each filter can be tailored dynamically to meet a user's particular message identification needs.

One approach to filter adaptation is to ask users to label messages as junk and non-junk. Unfortunately, such manually intensive training techniques are undesirable to many users because of the complexity associated with the training, let alone the amount of time required to carry it out properly. In addition, such manual training is often flawed by individual users. For example, users often forget that they subscribed to a free mailing list and therefore incorrectly label its mail as junk. As a result, legitimate mail is blocked indefinitely from the user's mailbox. Another adaptive filter training approach is to employ implicit training cues. For example, if the user replies to or forwards a message, the approach assumes that the message is non-junk. However, using only message cues of this sort introduces statistical biases into the training process, resulting in filters of lower accuracy.

Still another approach is to use all users' email for training, where initial labels are assigned by an existing filter and the user sometimes overrides those assignments with explicit cues (e.g., a "user-correction" method), for example by selecting options such as "delete as junk" and "not junk", and/or with implicit cues. Although such an approach is better than the techniques discussed previously, it is still deficient compared with the invention described and claimed below.

Summary of the invention

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

The invention provides a feedback loop system and method that facilitate classifying items in connection with spam prevention. The invention makes use of a machine learning approach as applied to spam filters, and in particular randomly samples incoming email messages so that examples of both legitimate and junk/spam mail are obtained to generate sets of training data. Pre-selected individuals serve as spam fighters and participate in categorizing respective copies (optionally slightly modified) of the samples.

Generally, messages selected for polling are modified in various respects so that they appear as polling messages. One unique aspect of the invention is that a copy of an incoming message selected for polling is made, so that some users (e.g., spam fighters) can receive the same message (e.g., in terms of message content) twice: once in the form of a polling message and again in its original form. Another unique aspect of the invention is that all messages are considered for polling, including those that have been labeled as spam by existing filters. Messages labeled as spam are considered for polling and, if selected, are not treated as spam according to the specifications of the existing filter (e.g., moved to a junk folder, deleted ...).

Unlike conventional spam filters, spam filters according to the invention can be trained through the feedback technique to learn to distinguish good mail from spam, yielding more accurate filters and thereby mitigating biased and inaccurate filtering. The feedback is obtained at least in part by polling any suitable number of users about their incoming email. Users identified as spam fighters are tasked with voting on whether a selection of incoming messages is legitimate mail or junk mail. Both positive and negative classifications of incoming email are desired, in order to mitigate improperly filtering out as spam mail that is good (e.g., not spam) for the user. The respective classifications, along with any other information associated with each mail transaction, are moved to a database to facilitate training the spam filters. The database and related components can compile and store properties of the selected messages (or selected mail transactions), including user properties, user voting information and histories, message properties such as a unique identification number assigned to each selected message, message classifications, and message content summaries, or statistics related to any of the above, to generate sets of training data for machine learning systems. Machine learning systems (e.g., neural networks, support vector machines (SVMs), Bayesian belief networks) facilitate creating improved spam filters that are trained to recognize both legitimate mail and spam mail and to distinguish between them. Once a new spam filter has been trained in accordance with the invention, it can be distributed to mail servers and client email software programs. Furthermore, the new spam filter can be trained with respect to a specific user to improve the performance of a personalized filter. As new training data sets are built, the spam filter can undergo further training via machine learning to optimize its performance and accuracy. User feedback by way of message classification can also be used to generate lists for spam filters and parental controls, to test spam filter performance, and/or to identify spam origination.

Another aspect of the invention provides a method of detecting untrustworthy users through cross-validation techniques and/or through known-result test messages. Cross-validation involves training a filter from which the polling results of some users are excluded. That is, the filter is trained using polling results from a subset of users. On average, this subset of users works well enough, even with some mistakes, to detect those who generally disagree with them. The polling results from the excluded users are compared to those of the trained filter. This comparison essentially determines how the users in the training subset would have voted on the messages belonging to the excluded users. If the agreement between an excluded user's votes and the filter is low, the polling results from that user can be discarded or marked for manual inspection. This technique can be repeated as desired, excluding data from different users each time.
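
By way of illustration only, the cross-validation idea above can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed method: a simple majority vote of the remaining users stands in for the trained filter, several users are assumed to vote on the same message, and the names `flag_untrusted_users` and `min_agreement` are hypothetical.

```python
from collections import defaultdict


def flag_untrusted_users(votes, min_agreement=0.5):
    """Return users whose votes mostly disagree with everyone else's.

    votes: iterable of (user, message_id, is_spam) tuples.
    Assumption: a majority vote of the other users stands in for
    "a filter trained without the held-out user".
    """
    by_message = defaultdict(dict)
    for user, message_id, is_spam in votes:
        by_message[message_id][user] = is_spam

    flagged = set()
    for held_out in {user for user, _, _ in votes}:
        agree = total = 0
        for message_votes in by_message.values():
            if held_out not in message_votes:
                continue
            others = [v for u, v in message_votes.items() if u != held_out]
            spam_votes = sum(others)
            if not others or 2 * spam_votes == len(others):
                continue  # nobody else voted, or a tie: no consensus
            consensus = 2 * spam_votes > len(others)
            total += 1
            agree += message_votes[held_out] == consensus
        if total and agree / total < min_agreement:
            flagged.add(held_out)  # discard or mark for manual inspection
    return flagged
```

Repeating the loop with a different held-out user each time corresponds to the repetition described above; a real system would substitute an actual trained filter for the majority vote.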

Mistakes on individual messages, such as a message on which the filter and the user vote strongly disagree, can also be detected. Such messages can be flagged for automatic removal and/or manual inspection. As an alternative to cross-validation, a filter can be trained on all or substantially all users. User votes and/or messages that disagree with the filter can be discarded. Another alternative to cross-validation involves known-result test messages, in which the user is asked to vote on messages whose result is known. Accurate classification of these messages by the user (e.g., the user's vote matches the filter action) verifies the user's trustworthiness and determines whether to remove the user's classifications from training and whether to remove the user from future polling.

Yet another aspect of the invention provides for creating known spam targets (e.g., honeypots) to identify incoming mail as spam and/or to track specific merchant email address processing. A known spam target, or honeypot, is an email address for which the set of legitimate mail can be determined and all other mail can be considered spam. For instance, the email address can be disclosed on a website in a restrictive manner such that it is unlikely to be found by people. Hence, any mail sent to this address can be considered spam. Alternatively, the email address may have been disclosed only to a merchant from whom legitimate mail is expected. Thus, mail received from the merchant is legitimate, but all other mail received can safely be considered spam. Spam data derived from honeypots and/or other sources (e.g., users) can be integrated into the feedback loop system, but because honeypots substantially increase the volume of spam classifications, such data should be down-weighted to mitigate biased polling results, as described in more detail below.
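
The honeypot rule described above can be illustrated with a short Python sketch. It is offered as an assumption-laden example only: the function names, the dictionary layout, and the weight value `HONEYPOT_WEIGHT` are hypothetical, since the text states only that honeypot data should be down-weighted.

```python
# Hypothetical down-weighting factor; the source gives no specific value.
HONEYPOT_WEIGHT = 0.1


def label_honeypot_mail(sender_domain, disclosed_to=None):
    """Label mail that arrived at a honeypot address.

    disclosed_to: the one merchant domain the address was revealed to,
    or None if the address was never given to any legitimate sender.
    """
    if disclosed_to is not None and sender_domain == disclosed_to:
        return "good"
    return "spam"


def to_training_example(text, sender_domain, disclosed_to=None):
    """Build a down-weighted training example from honeypot mail."""
    label = label_honeypot_mail(sender_domain, disclosed_to)
    return {"text": text, "is_spam": label == "spam", "weight": HONEYPOT_WEIGHT}
```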

Yet another aspect of the invention provides for quarantining messages that the feedback loop system or the filter deems uncertain. Such messages are held for any suitable period of time instead of being discarded or classified. The period can be set in advance, or the message can be held until a determined number of polling results for similar messages (e.g., from the same IP address or with similar content) have been received.
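
For illustration, the two release conditions described above (a preset hold period, or enough polling results for similar messages) might be sketched as follows. The class and parameter names and the default thresholds are hypothetical, and "similar" is reduced here to "same source IP address" for simplicity.

```python
from dataclasses import dataclass


@dataclass
class QuarantinedMessage:
    message_id: str
    source_ip: str
    held_since: float  # seconds, on the same clock as `now` below


def should_release(item, now, votes_by_ip, max_hold_seconds=3600.0, min_votes=5):
    """Decide whether a quarantined message can leave quarantine.

    votes_by_ip: mapping from source IP to the number of polling
    results already received for messages from that address.
    """
    timed_out = now - item.held_since >= max_hold_seconds
    enough_votes = votes_by_ip.get(item.source_ip, 0) >= min_votes
    return timed_out or enough_votes
```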

To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed, and the invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

Brief description of the drawings

Fig. 1A is a block diagram of a feedback loop training system in accordance with an aspect of the present invention.

Fig. 1B is a flow diagram of an exemplary feedback loop training process in accordance with an aspect of the present invention.

Fig. 2 is a flow diagram of an exemplary method that facilitates mail classification by users to create spam filters, in accordance with an aspect of the present invention.

Fig. 3 is a flow diagram of an exemplary method that facilitates cross-validation of users participating in the method of Fig. 2, in accordance with an aspect of the present invention.

Fig. 4 is a flow diagram of an exemplary method that facilitates determining whether users are untrustworthy, in accordance with an aspect of the present invention.

Fig. 5 is a flow diagram of an exemplary method that facilitates catching spam and determining spam originators, in accordance with an aspect of the present invention.

Fig. 6 is a block diagram of a client-based feedback loop architecture in accordance with an aspect of the present invention.

Fig. 7 is a block diagram of a server-based feedback loop system having one or more users that generate training data, in accordance with an aspect of the present invention.

Fig. 8 is a block diagram of a cross-organizational server-based feedback loop system in accordance with an aspect of the present invention, wherein the system includes an internal server with its own database that pulls training data stored on external user databases.

Fig. 9 illustrates an exemplary environment for implementing various aspects of the invention.

Fig. 10 is a schematic block diagram of an exemplary communication environment in accordance with the present invention.

Detailed description of the invention

The present invention is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present invention.

As used in this application, the terms "component" and "system" are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.

The subject invention can incorporate various inference schemes and/or techniques in connection with generating training data for machine-learned spam filtering. As used herein, the term "inference" refers generally to the process of reasoning about or inferring states of the system, environment, and/or user from a set of observations as captured via events and/or data. Inference can be employed to identify a specific context or action, or can generate a probability distribution over states, for example. The inference can be probabilistic, that is, the computation of a probability distribution over states of interest based on a consideration of data and events. Inference can also refer to techniques employed for composing higher-level events from a set of events and/or data. Such inference results in the construction of new events or actions from a set of observed events and/or stored event data, whether or not the events are correlated in close temporal proximity, and whether the events and data come from one or several event and data sources.

It is to be appreciated that although the term message is employed extensively throughout the specification, the term is not limited to electronic mail per se, but can be suitably adapted to include electronic messaging of any form that can be distributed over any suitable communication architecture. For example, conferencing applications that facilitate a conference between two or more people (e.g., interactive chat programs and instant messaging programs) can also utilize the filtering benefits disclosed herein, since unwanted text can be electronically interspersed into normal chat messages as users exchange messages and/or inserted as a lead-off message, a closing message, or all of the above. In this particular application, a filter can be trained to automatically filter particular message content (text and images) in order to capture undesirable content (e.g., commercials, promotions, or advertisements) and tag it as junk.

In the subject invention, the term "recipient" refers to an addressee of an incoming message or item. The term "user" refers to a recipient who has chosen, either passively or actively, to participate in the feedback loop systems and processes as described herein.

Referring now to Fig. 1A, there is illustrated a general block diagram of a feedback training system 10 in accordance with an aspect of the present invention. A message receipt component 12 receives incoming messages (denoted as IM) and delivers them to intended recipients 14. As is customary with many message receipt components, the message receipt component can include at least one filter 16 (e.g., a junk mail filter). The message receipt component 12, in connection with the filter 16, processes the messages (IM) and provides a filtered subset of the messages (IM') to the intended recipients 14.

As part of the feedback aspect of the invention, a polling component 18 receives all of the incoming messages (IM) and identifies the respective intended recipients 14. The polling component selects a subset of the intended recipients 14 (referred to as spam fighters 20) to classify a subset of the incoming messages (denoted as IM") as spam or not spam, for example. The classification-related information (denoted as voting information) is submitted to a message store/vote store 22, where the voting information as well as copies of the respective IM" are stored for later use, such as by a feedback component 24. In particular, the feedback component 24 employs machine learning techniques (e.g., neural networks, SVMs, Bayesian networks, or any machine learning system suitable for use with the invention) that use the voting information to train and/or improve the filter 16 (and/or build new filters) with respect to identifying spam, for example. As new streams of incoming messages are processed by the newly trained filter 16, less spam and more legitimate messages (denoted as IM') are delivered to the intended recipients 14. Thus, the system 10 facilitates the identification of spam and the training of improved spam filters by using feedback generated by the spam fighters 20. This feedback aspect of the invention provides a rich and highly dynamic scheme for refining a spam detection system. Various details regarding more granular aspects of the invention are discussed below.

Referring now to Fig. 1B, there is illustrated a feedback loop training flow diagram 100 in connection with spam fighting and spam prevention in accordance with the invention. In preparation for and/or prior to the training process, users are selected to be spam fighters (e.g., from a master set comprising all email users); the selection can be based on random sampling, level of trust, or any suitable selection scheme/criteria in accordance with the invention. For example, the selected subset of users can include all users, a randomly selected set of users, users who have opted in as spam fighters or who have not opted out, and/or any combination thereof, and/or the selection can be based in part on users' demographic location and related information.

Alternatively, the master set of email users from which the selection is made can be limited to paying users, which can make it more expensive for spammers to subvert the invention. Thus, the subset of users selected to participate in spam fighting can comprise only paying users. A list or customer table containing the names and properties of the selected users (e.g., spam fighters) can then be created.

When an incoming stream of messages 102 is received, the recipient of each message is checked at 104 against a list of all spam fighters. If the recipient is on the list, the message is considered for polling. Next, a determination is made as to whether to select the message for polling. Unlike conventional spam filters, the invention does not delete any messages (e.g., spam) until at least after all incoming mail has been considered for polling. That is, the mail is considered for polling before it is subjected to any labeling (e.g., spam, non-spam); this facilitates obtaining an unbiased sample of messages available for user polling.

A component for message selection (not shown) can be employed to select messages with some random probability in order to mitigate bias in the data. Another approach involves using demographic information as well as other user/recipient attributes and properties. Thus, messages can be selected based at least in part on the user/recipient. Other alternative algorithms exist for selecting messages. However, there may be limits on the number of messages selected per user or per user per time period, or on the probability of selecting a message from any given user. Without such limits, a spammer could create an account, send it millions of spam messages, and classify all of those messages as good: this would allow the spammer to corrupt the training database with incorrectly labeled messages.
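
As a hedged illustration of the recipient check at 104 and the random selection described above, a minimal Python sketch follows. The function name, the `poll_probability` default, and the injectable `rng` parameter are assumptions made for the example; the sketch covers only the list check and the random draw.

```python
import random


def consider_for_polling(recipient, spam_fighters, poll_probability=0.1, rng=random):
    """Return True if a message to `recipient` should become a poll.

    spam_fighters: set of addresses that participate in polling.
    rng: any object with a random() method returning a float in [0, 1);
         injectable so that tests can make the draw deterministic.
    """
    if recipient not in spam_fighters:
        return False  # only spam fighters' mail is considered
    # Select with a fixed random probability to avoid biasing the sample.
    return rng.random() < poll_probability
```

Because the draw is made before any existing filter verdict is consulted, messages already marked as spam have the same chance of being selected as any other message.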

Some forms of spam filtering, notably those referred to as black hole lists, may not be skippable. Black hole lists prevent a server from receiving any mail from a list of Internet Protocol (IP) addresses. Therefore, messages can be selected from the set of mail that does not come from a black hole list.

A unique aspect of the invention is that messages selected for polling that are marked as spam by the filters currently in place are not deleted or moved to a junk mail folder. Instead, they are placed in the usual inbox or mailbox where all other messages are received, for polling consideration. However, if there are two copies of a message and the filter considers the message spam, then one copy is delivered to the spam folder or otherwise treated according to set parameters (e.g., deleted, specially marked, or moved to a junk folder).

When a message is selected, it is forwarded to the user and marked in some special way to indicate that it is a polling message. In particular, the selected message can be modified by a message modification component 106. Examples of message modification include, but are not limited to, placing the polling message in a separate folder, changing the "from" address or the subject line, and/or using a special icon or special color that identifies the message as a polling message to the user. The selected message can also be encapsulated within another message that gives the user instructions on how to vote on and/or classify the encapsulated message. These instructions can include at least two buttons or links, for example: one to vote the message as spam and one to vote it as not spam.
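
By way of example only, the encapsulation variant described above can be sketched with Python's standard `email` package. The vote URL, the polling "from" address, the `[POLL]` subject prefix, and the function name are all hypothetical; the source does not specify how the voting links are formed.

```python
from email.message import EmailMessage

# Hypothetical voting endpoint; the source does not define a URL scheme.
VOTE_URL = "https://poll.example/vote"


def make_polling_message(original, poll_id, user_address):
    """Wrap `original` in a polling message carrying two voting links."""
    poll = EmailMessage()
    poll["To"] = user_address
    poll["From"] = "polling@mail.example"  # assumed special "from" address
    poll["Subject"] = "[POLL] " + (original["Subject"] or "")
    poll.set_content(
        "Please classify the attached message.\n"
        f"Spam:     {VOTE_URL}?id={poll_id}&vote=spam\n"
        f"Not spam: {VOTE_URL}?id={poll_id}&vote=notspam\n"
    )
    # Attach the unmodified original as a message/rfc822 part.
    poll.add_attachment(original)
    return poll
```

Attaching the original unchanged, rather than editing its body, is one way to keep the stored training copy identical to what the sender actually transmitted.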

The voting buttons can be implemented by modifying the contents of the message before a copy of the polling message is sent to the user. When the invention is employed with respect to client email software (as opposed to a mail server), the user interface can be modified to include the voting buttons.

Moreover, the polling message can contain the instructions and voting buttons as well as the selected message attached to it. The polling message can also include a summary of the selected message, such as the subject line, the from address, the date sent and/or received, and the text or at least the first few lines of the text. Another approach involves sending the message with the voting instructions and voting buttons pre-pended to it. In practice, when a user opens and/or downloads a copy of the polling message, buttons (or links) including, but not limited to, "spam" and "not spam" buttons can pop up on the user interface or can be incorporated into the polling message. Thus, it is possible for each polling message to contain a set of instructions and suitable voting buttons. Other modifications may be necessary, possibly including removing HTML background instructions (which could obscure the text of the instructions or buttons).

Another button, such as a "solicited commercial email" button, can also be provided depending on the type of information desired. The message can also include a button/link to opt out of future polling. The instructions are localized to the user's preferred language and can be embedded into the polling message.

Furthermore, messages selected for polling can be scanned for viruses by the message modification component 106 or by some other suitable virus scanning component (not shown). If a virus is found, the virus can be stripped away or the message can be discarded. It should be appreciated that virus stripping can occur at any point in the system 100, including when the message is selected and right before the user downloads the message.

Following modification of the message, a message delivery component 108 delivers the polling message to the user for voting. The user feedback (e.g., the polling message, the user's vote, and any user properties associated with it) is assigned a unique identifier (ID) 110 (e.g., metadata). The ID 110 and/or the information corresponding to it are submitted to a message store/vote store 112 (e.g., a central database), which compiles and stores the user classifications/votes.
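
The following Python sketch illustrates, under stated assumptions, how a unique ID might tie a polling message to its later vote. An in-memory dictionary stands in for the message store/vote store 112, random UUIDs stand in for the ID 110, and the class and method names are hypothetical.

```python
import uuid


class VoteStore:
    """In-memory stand-in for a message store/vote store."""

    def __init__(self):
        self._records = {}

    def add_poll(self, user, message_summary):
        """Register a polling message and return its unique ID."""
        poll_id = str(uuid.uuid4())
        self._records[poll_id] = {
            "user": user,
            "summary": message_summary,
            "vote": None,  # not yet classified
        }
        return poll_id

    def record_vote(self, poll_id, is_spam):
        """Store the user's classification for a polled message."""
        self._records[poll_id]["vote"] = bool(is_spam)

    def unclassified(self):
        """IDs of messages sampled for polling but not yet voted on."""
        return [pid for pid, rec in self._records.items() if rec["vote"] is None]

    def training_rows(self):
        """(summary, is_spam) pairs for every message with a vote."""
        return [
            (rec["summary"], rec["vote"])
            for rec in self._records.values()
            if rec["vote"] is not None
        ]
```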

At the database level, selected messages available for polling can be kept for later polling or use. In addition, the database can perform frequency analyses on a timed basis to make sure that a particular user is not being over-sampled and that an amount of data is being collected from the user within limits as specified by the user. In particular, the feedback system 100 monitors a percentage limit of a user's mail as well as the sampling period to mitigate bias in both sampling and data. This is especially important when users are selected from all available users, including both low-usage and high-usage users. For example, a low-usage user typically receives and sends a significantly lower volume of mail than a high-usage user. Thus, the system 100 monitors the message selection process to be certain that the selected message is approximately one out of every T messages received by the user and no more than one message received every Z hours by the user. Accordingly, the system can poll (e.g., consider for polling) 1 out of every 10 incoming messages to be sampled, but no more than 1 every 2 hours, for example. The frequency, or percentage, limit mitigates sampling a disproportionate number of messages from a low-usage user as compared to a high-usage user, and also mitigates overly annoying a user.
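
The "one out of every T messages, but no more than one every Z hours" rule can be sketched as follows. This is an illustrative sketch only: the class name and parameters are hypothetical, a deterministic counter replaces the approximate sampling described above, and the defaults mirror the example values of 10 messages and 2 hours.

```python
class PollRateLimiter:
    """Per-user limiter: at most 1 poll per T messages and per Z hours."""

    def __init__(self, every_n_messages=10, min_hours_between=2.0):
        self.every_n_messages = every_n_messages      # T in the text
        self.min_hours_between = min_hours_between    # Z in the text
        self._state = {}  # user -> messages seen since last poll, last poll time

    def allow(self, user, now_hours):
        """Count one incoming message for `user`; return True to poll it."""
        state = self._state.setdefault(user, {"seen": 0, "last_poll": None})
        state["seen"] += 1
        if state["seen"] < self.every_n_messages:
            return False  # fewer than T messages since the last poll
        last = state["last_poll"]
        if last is not None and now_hours - last < self.min_hours_between:
            return False  # a poll was sent less than Z hours ago
        state["seen"] = 0
        state["last_poll"] = now_hours
        return True
```

With these defaults, a high-usage user who receives 20 messages in half an hour is polled once, not twice, while a low-usage user is polled only after ten messages have arrived.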

The central database 112 frequently scans for messages that have been sampled by the system 100 for polling but have not yet been classified. The database pulls these messages, localizes them with respect to the respective users' demographic properties, and creates polling messages to request the users to vote on and classify them. However, the spam filter may not be modified or trained immediately upon receipt of every new incoming classification. Rather, offline training allows a trainer to continually look at the data received into the database 112 on a scheduled, ongoing, or daily basis. That is, the trainer starts from a prescribed starting point or at a set amount of time in the past, and looks at all the data from that point forward to train the filter. For example, the prescribed time period can be from midnight to 6:00 AM.

A new spam filter can be trained on an ongoing basis by analyzing the message classifications maintained in the database 112 by way of machine learning techniques 114 (e.g., neural networks, support vector machines (SVMs)). Machine learning techniques require examples of both good mail and spam to learn from, so that they can learn to distinguish between the two. Even techniques based on matching known examples of spam can benefit from having examples of good mail, so that they can make sure they do not accidentally catch good mail.
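
To make the need for both classes concrete, the sketch below trains a tiny word-count naive Bayes classifier on voted examples. Naive Bayes is used here purely as a compact stand-in and is an assumption of this example; the text names neural networks and SVMs. The function names are hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (text, is_spam) pairs taken from user votes."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, is_spam in examples:
        docs[is_spam] += 1
        counts[is_spam].update(text.lower().split())
    return counts, docs


def spam_probability(model, text):
    """Probability that `text` is spam, with add-one smoothing."""
    counts, docs = model
    vocab = set(counts[True]) | set(counts[False])
    spam_total = sum(counts[True].values()) + len(vocab)
    good_total = sum(counts[False].values()) + len(vocab)
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (counts[True][word] + 1) / spam_total
        p_good = (counts[False][word] + 1) / good_total
        log_odds += math.log(p_spam / p_good)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

If the training set contained only spam examples, every known word would push the score toward spam; the good-mail counts are what allow the score to fall below one half.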

Accordingly, it is important to have both positive and negative examples of spam, rather than just complaints. There are some domains that send out large amounts of both spam and legitimate mail, such as free mailing lists. If a system were built based only on complaints, all mail from these domains might be filtered, resulting in a large number of mistakes. Hence, knowing that the domain also sends out large amounts of good mail is important. In addition, users often make mistakes, such as forgetting that they have signed up on a free mailing list. For instance, a large legitimate provider such as the New York Times regularly sends out legitimate mail. A few users forget that they have signed up and complain, classifying these messages as spam. Without data showing that most users recognize this mail as legitimate, mail from this site would be blocked.

New filters 116 can be distributed on an ongoing basis by a distribution component 118 to participating Internet service providers (ISPs), to email or message servers, to individual email clients, to update servers, and/or to the central databases of individual companies. Moreover, the feedback system 100 operates on an ongoing basis, so that the samples of messages considered and used for polling can follow the actual distribution of email received by the system 100. As a result, the training data sets used to train new spam filters are kept current with adaptive spammers. When new filters are built, polling data can be discarded or down-weighted (for example, discounted) based on how long ago it was obtained.
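
The age-based discounting described above can be illustrated with a minimal sketch. The description does not specify a discount function; the exponential half-life, the cut-off age, and the name `age_weight` are assumptions chosen only for illustration.

```python
def age_weight(age_days: float,
               half_life_days: float = 30.0,   # assumed: weight halves every 30 days
               max_age_days: float = 180.0) -> float:  # assumed: discard after 180 days
    """Return the training weight of a poll result that is age_days old."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days > max_age_days:
        return 0.0  # data this old is discarded outright
    # Otherwise the data is kept but counts for less as it ages.
    return 0.5 ** (age_days / half_life_days)
```

Under these assumed parameters a fresh poll result counts fully, a month-old result counts half, and anything older than the cut-off is dropped from the training set.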

The system 100 can be implemented at servers where mail is received, such as gateway servers, email servers, and/or message servers. For instance, when mail comes into an email server, the server looks up the properties of the intended recipients to determine whether the recipients have opted in to the system 100. If their properties so indicate, the recipients' mail is potentially available for polling. Client-only architectures also exist. For example, client email software can make the polling decisions for a single user and deliver the email to a central database, or use the polling information to improve the performance of a personalized filter. In addition to the architectures described herein, other alternative architectures for the system 100 exist, and such architectures are contemplated to fall within the scope of the present invention.

Referring now to Fig. 2, there is illustrated a flow diagram of a basic feedback loop process 200 in accordance with one aspect of the present invention. While, for purposes of simplicity of explanation, the methodology is shown and described as a series of acts, it is to be understood that the present invention is not limited by the order of acts, as some acts may, in accordance with the present invention, occur in different orders and/or concurrently with other acts shown and described herein. For example, those skilled in the art will appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the present invention.

The process 200 begins at 202 with mail coming into and being received by a component such as a server. When mail arrives at the server, the server identifies properties of the intended recipients to determine whether the intended recipients have previously opted in as spam fighters for polling (204). Thus, the process 200 uses a user property field that can indicate whether the recipient has opted in to the feedback system, or consults a list of users who have opted in. If the user is determined at 206 to be a participant in the feedback system and has been selected for polling, the feedback system takes action by determining which messages are selected for polling (208). Otherwise, the process 200 returns to 202 until at least one intended recipient of an incoming message is determined to be such a user (for example, a spam fighter).
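
Acts 204 to 208 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `select_for_polling`, the 10% default sampling rate, and the representation of the opt-in list as a set are all assumptions.

```python
import random

def select_for_polling(messages, opted_in, sample_rate=0.1, rng=None):
    """Return the (recipient, message_id) pairs chosen for polling.

    messages: iterable of (message_id, [recipients]) tuples.
    opted_in: set of users who agreed to take part (assumed representation).
    """
    rng = rng or random.Random()
    selected = []
    for message_id, recipients in messages:
        for recipient in recipients:
            # Only opted-in recipients are eligible. Sampling is uniform and
            # ignores the current filter's verdict, so the polled mail follows
            # the real mix of good mail and spam.
            if recipient in opted_in and rng.random() < sample_rate:
                selected.append((recipient, message_id))
    return selected
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can be convenient when testing.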

In practice, all messages are considered for polling, including those that are designated (or would be designated) as spam by the currently employed filter (for example, a personalized filter or the Brightmail filter). Therefore, no message is deleted, discarded, or sent to a junk mail folder before it is considered for polling.

Each message or mail item received by the server has a set of properties corresponding to the mail transaction. The server compiles these properties and sends them, together with the polling message, to the central database. Examples of the properties include the recipient list (for example, as listed in the "To:", "cc:", and "bcc:" fields), the verdict of the currently employed filter (for example, whether the filter identified the message as spam), the verdict of another optional spam filter (for example, the Brightmail filter), and user information (for example, username, password, real name, frequency of polling messages, usage data, and so on). The polling message and/or its contents, as well as each corresponding user/recipient, are assigned a unique identifier. The identifier can also be sent to the database and subsequently updated as needed.

At 214, the messages selected for polling (for example, original messages 1 through M, where M is an integer greater than or equal to 1) are modified to indicate to the user that messages 1 through M are polling messages P1 through PM, and are then delivered to the user for polling (216). For example, a polling message can include the original message to be voted on as an attachment, together with a set of instructions on how to vote on the message. The set of instructions includes at least two buttons, such as a "good mail" button and a "spam" button. When the user clicks one of the buttons (218) to classify the message as good mail or spam, the user is directed to a uniform resource locator (URL) that corresponds to a unique identifier for the classification the user is submitting. This information is recorded, and the record associated with that original message 1 through M is updated in the central database.
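
One way to tie each voting button to a unique identifier is to encode the identifier and the chosen classification in the URL behind the button. The sketch below is illustrative only: the endpoint `feedback.example.invalid`, the query parameter names, and the in-memory dictionary standing in for the central database are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

VOTE_BASE_URL = "https://feedback.example.invalid/vote"  # hypothetical endpoint

def vote_links(poll_id: str) -> dict:
    """Build one URL per voting button for the polling message poll_id."""
    return {
        label: f"{VOTE_BASE_URL}?{urlencode({'poll': poll_id, 'vote': label})}"
        for label in ("good", "spam")
    }

def record_vote(url: str, database: dict) -> None:
    """Update the record of the original message when a vote URL is visited."""
    query = parse_qs(urlparse(url).query)
    database[query["poll"][0]] = query["vote"][0]
```

A server handling the visited URL would call something like `record_vote` to update the record associated with the original message.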

At 216, or at any other suitable time during the process 200, the original message can optionally be delivered to the user. Thus, the user receives the message twice: once in its original form, and again in its modified polling form.

At some later time, a new spam filter is created and trained at 220, based at least in part on the user feedback. Once the new spam filter has been created and trained, it can be employed immediately on the email server and/or distributed to client servers, client email software, and the like (222). Training and distributing new or updated spam filters is an ongoing activity. Thus, the process 200 continues at 204 when a new stream of incoming messages is received. When new filters are built, older data is discarded or down-weighted based on how long ago it was obtained.

The feedback system 100 and the process 200 rely on the feedback of participating users. Unfortunately, some users cannot be trusted, or are simply lazy and fail to provide consistent and accurate classifications. The central database 112 (Fig. 1A) maintains histories of user classifications. Thus, the feedback system 100 can track the number of contradictions, the number of times a user changes his or her mind, the user's responses to known good mail and known spam, and the number or frequency of the user's replies to polling messages.
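
The per-user bookkeeping described above might look like the following sketch. The class name `UserHistory`, its fields, and the choice to count a changed vote on the same polling message as a "change of mind" are assumptions made for illustration.

```python
from collections import defaultdict

class UserHistory:
    """Tracks, per user, votes cast, changed votes, and polls sent."""

    def __init__(self):
        self.votes = defaultdict(dict)       # user -> {poll_id: label}
        self.changes = defaultdict(int)      # user -> number of changed votes
        self.polls_sent = defaultdict(int)   # user -> polling messages sent

    def poll_sent(self, user):
        self.polls_sent[user] += 1

    def vote(self, user, poll_id, label):
        previous = self.votes[user].get(poll_id)
        if previous is not None and previous != label:
            self.changes[user] += 1          # the user changed his or her mind
        self.votes[user][poll_id] = label

    def response_rate(self, user):
        """Fraction of polling messages the user has answered."""
        sent = self.polls_sent[user]
        return len(self.votes[user]) / sent if sent else 0.0
```

Counters of this kind could then be compared against thresholds to decide when a user's trustworthiness should be checked.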

When any one of these numbers exceeds a prescribed threshold, or simply for every user of the system, the feedback system 100 can invoke one or several validation techniques to assess the trustworthiness of a particular user or users. One approach is a cross-validation method 300, as illustrated in Fig. 3, in accordance with another aspect of the present invention.

The cross-validation technique begins at 302 with the central database receiving incoming data such as polling results and the respective user information. Next, at 304, it must be determined whether cross-validation is desired to test a suitable number of users. If it is desired, a new spam filter is trained at 306 using some portion of the incoming data. That is, the data from the users being tested is excluded from the training. For example, the filter is trained using about 90% of the polled user data (denoted the 90% filter), thereby excluding about 10% of the data (denoted the 10% tested users), which corresponds to the data submitted by the tested users.

At 308, the 90% filter is run against the remaining 10% tested-user data to determine how the 90% users would have voted on the tested users' messages. If the amount of disagreement between the 90% filter and the 10% tested-user data exceeds a prescribed threshold (310), the users' classifications can be manually inspected at 312. Alternatively or in addition, test messages can be sent to the suspicious or untrustworthy users, and/or these particular users can be excluded from future polling, and/or their past data can be discarded. However, if the threshold is not exceeded, the process returns to 306. In practice, the cross-validation technique 300 can be applied to any suitable set of test users, excluding different users as necessary, to determine and maintain the trustworthiness of the voting/classification data.

A second approach to assessing user fidelity and reliability includes training a filter on all data gathered in a given period, and then testing on the training data using that filter. This technique is known as test-on-training. If a message was included in the training, the filter should have learned its classification; that is, the trained filter should classify the message the same way the user did. However, the filter may still make a mistake on it, labeling it as spam when the user labeled it as not spam, or vice versa. For a filter to disagree with its own training data, the message must strongly disagree with other messages. Otherwise, the trained filter would almost certainly have found some way to classify it correctly. Thus, the message can be discarded as having an unreliable label. Either this technique or cross-validation can be used: cross-validation can find more mistakes in classifications, though less reliably; conversely, test-on-training finds fewer mistakes, but more reliably.

Both test-on-training and the cross-validation technique 300 can be applied to individual messages, wherein an individual user's classification or rating of a message is overruled by general agreement (for example, by following the classification of the majority). Alternatively, both techniques can be used to identify potentially unreliable users.

In addition to, or instead of, the cross-validation and/or test-on-training techniques, a "known-results" technique can be used to verify user trustworthiness (following 314 to Fig. 4). Although the techniques of Figs. 3 and 4 have been demonstrated separately, it should be appreciated that both approaches can be used at the same time. That is, information from messages known to be good and messages known to be spam can be combined with the results of cross-validation or test-on-training to determine which users to discard.

Referring now to Fig. 4, there is illustrated a flow diagram of a process 400 to validate the fidelity of user voting in accordance with one aspect of the present invention. The process 400 proceeds from 314 as shown in Fig. 3. At 402, a known-result test message is sent to suspicious users (or to all users). For example, a test message can be injected into the incoming mail and then classified by hand, so that the database receives the "known" result. Otherwise, the process 400 can wait until a known-result message is sent by a third party. The users are allowed to vote on the same test message. At 404, the voting results are compared with the known result. At 406, if a user's votes do not agree, that user's current and/or future and/or past classifications can be manually inspected for a suitable period of time (408), until the user demonstrates consistency and reliability. Alternatively, the user's current, future, or past classifications can be discounted or removed. Finally, such users can be removed from future polling. However, if their voting results do agree with the test message results, the users can be considered trustworthy at 410. The process returns to Fig. 3 at 412 to determine which type of validation technique is desired for the next group of suspicious users.
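
The comparison at 404 and the decision at 406 can be sketched as follows. The 90% agreement requirement and the function names are assumptions; the description only requires that user votes be compared with the known results.

```python
def agreement_with_known(user_votes, known_results):
    """Fraction of known-result test messages on which the user agreed.

    Both arguments map message_id -> 'spam' or 'good'. Returns None if the
    user has not voted on any test message.
    """
    scored = [m for m in known_results if m in user_votes]
    if not scored:
        return None
    agree = sum(1 for m in scored if user_votes[m] == known_results[m])
    return agree / len(scored)

def is_trusted(user_votes, known_results, min_agreement=0.9):
    """min_agreement is an assumed threshold, not taken from the description."""
    rate = agreement_with_known(user_votes, known_results)
    return rate is not None and rate >= min_agreement
```

Users for whom `is_trusted` returns False would be candidates for manual inspection, discounting, or removal from future polling.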

A fourth approach (not shown) to assessing user reliability is active learning. With active learning techniques, messages are not picked at random. Instead, the feedback system can estimate how useful a message will be to the system. For example, if the filter returns a probability of spam, the messages that the current filter classifies with the least certainty, that is, those whose spam probability is closest to 50%, can be preferentially selected for polling. Another way to select messages is to determine how common a message is. The more common the message, the more useful it is for polling. Unique messages are less useful because they are less common. Active learning can be employed by using the confidence levels of existing filters, by using how common the features of a message are, and by using an existing filter's confidence in its own settings or content (for example, meta-confidence). There are other active learning techniques, such as query-by-committee, that are well known to those skilled in the art of machine learning, and any of these techniques can be used.
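
Selecting the messages whose spam probability is closest to 50% can be sketched in a few lines. The function name and the dictionary input are assumptions for illustration.

```python
def pick_for_polling(spam_probabilities, k):
    """Return the k message ids the current filter is least certain about.

    spam_probabilities: {message_id: probability of spam from the filter}.
    """
    ranked = sorted(spam_probabilities,
                    key=lambda m: abs(spam_probabilities[m] - 0.5))
    return ranked[:k]
```

For example, given probabilities of 0.99, 0.52, 0.10, and 0.45, the two messages at 0.52 and 0.45 would be polled first, since a vote on them tells the system the most.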

Referring now to Fig. 5, there is illustrated a flow diagram of a process 500 for incorporating honeypot feedback, in addition to user feedback, into spam filter training in accordance with one aspect of the present invention. Honeypots are email addresses for which it is known who should be sending email to them. For example, a newly created email address can be kept private and disclosed only to selected individuals (502). It can also be disclosed publicly, but in a restricted way that people do not see (for example, by placing it as a mail link in white typeface on a white background). Honeypots are particularly useful against dictionary attacks by spammers. In a dictionary attack, a spammer tries emailing a very large number of addresses, perhaps all addresses in a dictionary, or addresses made from pairs of words in a dictionary, or uses similar techniques to find valid addresses. Any email sent to a honeypot (504), or any email not from the few selected individuals (506), is considered spam (508). An email address can also be signed up with a suspect merchant. Thus, any email received from that merchant is considered good mail (510), but all other mail is considered spam. The spam filter can be trained accordingly (512). Moreover, such other mail indicates that the suspect merchant sold or otherwise disclosed the user's information (for example, at least the email address) to third parties. This can be repeated with other suspect merchants, and a list can be generated to warn users that their information could be distributed to spammers. These are only a few of the techniques for obtaining email sent to honeypots that can safely be considered spam. In practice, there are other alternative ways to obtain such email.
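
The labeling rule for honeypot mail can be sketched as below. The function names are hypothetical, and matching senders by exact lower-cased address is a simplifying assumption.

```python
def label_honeypot_mail(sender, allowed_senders):
    """Label mail sent to a honeypot address.

    allowed_senders: lower-cased addresses that were deliberately given the
    honeypot address (selected individuals or a signed-up merchant).
    """
    return "good" if sender.lower() in allowed_senders else "spam"

def address_was_disclosed(senders_seen, allowed_senders):
    """True if any mail arrived from a party that was never given the address."""
    return any(s.lower() not in allowed_senders for s in senders_seen)
```

When the only allowed sender is a suspect merchant, a True result from `address_was_disclosed` suggests that the merchant passed the address on to third parties.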

Because honeypots are a good source of spam but a very poor source of legitimate mail, data from honeypots can be combined with data from the feedback loop system (Fig. 1) to train new spam filters. Mail from different sources or with different classifications can be weighted differently. For example, if there are 10 honeypots and 10 users who are polled on 10% of their mail, about 10 times as much spam is to be expected from the honeypots as from polling. Therefore, to make up for this difference, the legitimate mail from polling can be weighted 10 or 11 times as heavily as the spam. Alternatively, the honeypot data can be selectively down-weighted. For example, suppose about 50% of user mail is good mail and about 50% is spam, and the same volume of spam goes to the honeypots. The honeypots then appear to receive 100% spam, and all of it is sampled, not just 10%. To train with the correct ratio of spam to good mail in the combined system, the honeypot data is down-weighted by 95% and the user spam is down-weighted by 50%, resulting in an overall ratio of 1:1.
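
The down-weighting arithmetic above can be checked with a short sketch. The function name and the concrete message counts in the usage note are assumptions; the weights of 5% and 50% correspond to the 95% and 50% reductions stated in the text.

```python
def effective_counts(polled_good, polled_spam, honeypot_spam,
                     honeypot_weight=0.05,   # honeypot data down-weighted by 95%
                     user_spam_weight=0.5):  # polled user spam down-weighted by 50%
    """Return (effective good mail, effective spam) after weighting."""
    spam = honeypot_spam * honeypot_weight + polled_spam * user_spam_weight
    good = polled_good * 1.0                 # polled good mail keeps full weight
    return good, spam
```

As an assumed illustration, suppose polling yields 100 good messages and 100 spam messages (10% of the users' mail), while the honeypots collect 1000 spam messages (all of an equal spam volume). The weighted spam is then 1000 x 0.05 + 100 x 0.5, which balances the 100 good messages and gives the 1:1 ratio described above.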

Other sources of spam reports include users who are not participants in the feedback loop system. For example, a "Report Spam" button can be made available to all users for all mail, to report spam that has made it through the filter. This data can be combined with data from the feedback loop system. Again, this source of spam should be down-weighted or weighted differently, since it can be biased or untrustworthy in various respects. Re-weighting should also be done to reflect the fact that only mail that was not filtered can be reported through the "Report Spam" button.

In addition to spam filters, quarantine filters can be created and employed by the feedback loop system. The quarantine filter makes use of both positive and negative mail features. For example, mail from a popular online merchant is almost always good. A spammer may exploit the system by mimicking some aspect of the good merchant's mail in his spam. In another example, the spammer intentionally tricks the feedback system by sending a small amount of good mail via an IP address. The feedback loop learns to classify this mail as good mail, at which point the spammer begins sending spam from the same IP address.

Thus, the quarantine filter notices when, based on historical data, a particular positive feature is being received in much greater quantity than the system is accustomed to. This causes the system to be suspicious of the message and therefore to quarantine it until enough poll results have been obtained, before choosing to deliver the mail or mark it as spam. The quarantine filter can also be employed when mail is received from a new IP address for which it is not known or certain whether the mail is spam or not spam, and for which this will not be known for a while. Quarantining can be performed in a number of ways, including provisionally marking the mail as spam and moving it to a spam folder, not delivering it to the user, or storing it somewhere it will not be seen. Quarantining can be done for messages that are near the spam filter threshold: it can be assumed that additional information from polling will help make a correct decision. Quarantining can also be done when many similar messages are received: a few of the messages can be sent for polling with the feedback loop, and the retrained filter can then be used to classify the messages correctly.
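
The three quarantine triggers mentioned above (a score near the filter threshold, an unfamiliar sending IP address, and a burst of similar messages) can be combined in a simple decision function. All parameter names and default values here are assumptions; the description does not give concrete thresholds.

```python
def should_quarantine(p_spam,
                      threshold=0.5,        # assumed spam filter threshold
                      margin=0.1,           # assumed "near the threshold" band
                      sender_ip_known=True,
                      similar_count=0,
                      burst_limit=100):     # assumed size of a suspicious burst
    """Decide whether to hold a message until more poll results arrive."""
    near_threshold = abs(p_spam - threshold) <= margin
    unfamiliar_ip = not sender_ip_known
    burst = similar_count >= burst_limit
    return near_threshold or unfamiliar_ip or burst
```

A message for which this returns True would be held, for example in a location the user does not see, until polling has produced enough votes to deliver it or mark it as spam.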

In addition to building filters, the feedback loop system described herein can be used to evaluate them. That is, the parameters of the spam filters can be tuned as needed. For example, a filter is trained on data up through midnight of the previous night. After midnight, the data coming into the database is used to determine the error rate of the spam filter as compared with the users' classifications. Further, the feedback loop can be employed to determine the false-positive and catch rates of the spam filter. For example, the user votes can be taken and the mail run through a prospective filter to determine its false-positive and catch rates. This information can then be used to tune and optimize the filter. Different parameter settings or different algorithms can be tried, manually or automatically, by building several filters, each using a different setting or algorithm, to obtain the lowest false-positive rate together with the best catch rate. Thus, the results can be compared to select the best or optimal filter parameters.
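
The false-positive and catch rates can be computed by comparing a prospective filter's verdicts with the user votes, as in the sketch below. The function name and the use of dictionaries keyed by message identifier are assumptions.

```python
def evaluate(filter_verdicts, user_votes):
    """Return (false_positive_rate, catch_rate) of a filter against user votes.

    Both arguments map message_id -> 'spam' or 'good'. Only messages present
    in both are scored.
    """
    good = [m for m, v in user_votes.items() if v == "good" and m in filter_verdicts]
    spam = [m for m, v in user_votes.items() if v == "spam" and m in filter_verdicts]
    false_positives = sum(1 for m in good if filter_verdicts[m] == "spam")
    caught = sum(1 for m in spam if filter_verdicts[m] == "spam")
    fp_rate = false_positives / len(good) if good else 0.0
    catch_rate = caught / len(spam) if spam else 0.0
    return fp_rate, catch_rate
```

Running `evaluate` for several candidate filters on the same votes gives directly comparable figures for choosing among parameter settings or algorithms.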

The feedback loop can be used to build and populate lists of IP addresses, domains, or URLs that are always voted spam, always voted good mail, or voted good mail at least 90% of the time, and so on. These lists can be used for spam filtering in other ways. For instance, a list of IP addresses voted spam at least 90% of the time can be used to build a black-hole list of addresses from which no email is accepted. The feedback loop can also be used to terminate the accounts of spammers. For example, if a particular user of an ISP appears to be sending spam, the ISP can be automatically notified. Similarly, if a particular domain appears to be responsible for a large amount of spam, the domain's email provider can be automatically notified.
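
Building such lists from poll results can be sketched as follows. The 90% share comes from the text; the minimum number of votes before an address is listed, and the function name, are assumptions added to avoid listing an address on the strength of a handful of votes.

```python
from collections import defaultdict

def build_lists(votes, min_share=0.9, min_votes=10):
    """votes: iterable of (ip_address, label) with label 'spam' or 'good'.

    Returns (blocklist, safelist) as sets of IP addresses.
    """
    tally = defaultdict(lambda: [0, 0])      # ip -> [spam votes, total votes]
    for ip, label in votes:
        tally[ip][0] += 1 if label == "spam" else 0
        tally[ip][1] += 1
    blocklist, safelist = set(), set()
    for ip, (spam, total) in tally.items():
        if total < min_votes:
            continue                         # too few votes to judge (assumed rule)
        share = spam / total
        if share >= min_share:
            blocklist.add(ip)
        elif 1 - share >= min_share:
            safelist.add(ip)
    return blocklist, safelist
```

The same tallying approach would apply to domains or URLs by substituting them for the IP address key.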

There are a number of architectures that can be used to implement the feedback loop system. One exemplary architecture is server-based, as described in Fig. 7, with the selection process occurring when the mail reaches the email server. An alternative architecture is client-based, as described in Fig. 6. In a client-based feedback loop, polling information can be used to improve the performance of a personalized filter or, in the exemplary implementation illustrated here, the information can be sent to a shared repository as training data for a shared filter (for example, company-wide or global). It should be appreciated that the architectures described below are merely exemplary and can include additional components and features not described herein.

Turning now to Fig. 6, there is illustrated an exemplary general block diagram of the feedback loop technique in a client-based architecture. A network 600 is provided to facilitate communication of email to and from one or more clients 602, 604, and 606 (also denoted client 1, client 2, ..., client N, where N is an integer greater than or equal to 1). The network can be a global communication network (GCN) such as the Internet, or a WAN (wide area network), a LAN (local area network), or any other network configuration. In this particular implementation, an SMTP (Simple Mail Transfer Protocol) gateway server 608 interfaces with the network 600 to provide SMTP services to a LAN 610. An email server 612, operatively disposed on the LAN 610, interfaces with the gateway 608 to control and process the incoming and outgoing email of the clients 602, 604, and 606. The clients 602, 604, and 606 are also disposed on the LAN 610 to access at least the mail services provided thereon.

Client 1 602 includes a central processing unit (CPU) 614 that controls client processes. The CPU 614 can comprise multiple processors. The CPU 614 executes instructions in connection with providing any of the one or more data gathering/feedback functions described above. The instructions include, but are not limited to, encoded instructions that execute at least the basic feedback loop methodology described above, together with any or all of the approaches that can be used in combination with it. These approaches address client and message selection, polling message modification, data retention, client reliability and classification validation, re-weighting of data from multiple sources including the feedback loop, spam filter optimization and tuning, quarantine filters, creation of spam lists, and automatic notification of spammers to their respective ISPs and email providers. A user interface 616 is provided to facilitate communication with the CPU 614 and the client operating system, so that client 1 can interact to access email and vote on polling messages.

A sampling of client messages retrieved from the server 612 can be selected for polling by a message selector 620. Messages are selected and modified for polling if the intended recipient (client) has previously agreed to participate. A message modifier 622 modifies the message to become a polling message. For example, the message can be modified to include voting instructions and voting buttons and/or links, in accordance with the message modification descriptions provided above. The voting buttons and/or links are implemented by modifying the user interface 616 of the client email software. In addition, the message modifier 622 can remove any viruses in the messages (polling and non-polling messages) before they are opened or downloaded for viewing by the client 602.

In one implementation, the user of the spam-fighting client 602 sees each message only once, with some messages specially marked as polling messages and including voting buttons and the like. In the present implementation, the user of the spam-fighting client 602 may see some messages twice, once as the normal message and once as the polling message. This can be implemented in several ways. For example, the polling message can be returned to the server 612 and stored in a polled message store. Alternatively, the client 602 can store an additional message in the email server 612. Alternatively, the client 602 can show the user each message twice, once as the normal message and once in modified form.

Poll results 626 can be sent to the CPU 614 and then to a database 630, which can be configured to store data from one client or from more than one client, depending on the specific arrangement of the client feedback architecture. The central database 630 stores polling messages, poll results, and the respective client-user information. Related components can be employed to analyze such information, for example to determine polling frequency, client-user trustworthiness (for example, user validation 632), and other client statistics. Validation techniques can be employed particularly when the reliability of a client's voting is in question. Suspicion can arise from analyzing the number of contradictions, the number of changed minds, and the number of messages polled for a particular user or users; alternatively, validation techniques can be employed for every user. Any suitable amount of the data stored in the central database can be employed in machine learning techniques 634 to facilitate the training of new and/or improved spam filters.

Clients 604 and 606 include components similar to those described above, in order to obtain and train a filter that is personalized to the particular client. In addition to what has been described, a polled message scrubber 628 can interface between the CPU 614 and the central database 630, so that aspects of a polled message can be removed for a variety of reasons such as data aggregation, data compression, and the like. The polled message scrubber 628 can flush out extraneous portions of the polled message as well as any undesired user information associated with it.

Referring now to Fig. 7, there is illustrated an exemplary server-based feedback loop system 700 that facilitates multi-user logins and obtains polling data in accordance with the feedback loop technique of the present invention. A network 702 is provided to facilitate communication of email to and from one or more users 704 (also denoted user 1 704_1, user 2 704_2, ..., and user N 704_N, where N is an integer greater than or equal to 1). The network 702 can be a global communication network (GCN) such as the Internet, or a WAN (wide area network), a LAN (local area network), or any other network configuration. In this particular implementation, an SMTP (Simple Mail Transfer Protocol) gateway server 710 interfaces with the network 702 to provide SMTP services to a LAN 712. An email server 714, operatively disposed on the LAN 712, interfaces with the gateway 710 to control and process the incoming and outgoing email of the users 704.

The system 700 provides a multiple-login capability, such that user and message selection 716, message modification 718, and message polling (720, 722, 724) take place for each different user that logs in to the system 700. Thus, a user interface 726 is provided that presents a login screen as part of the boot-up process of the computer operating system or, as desired, to engage an associated user profile before the user 704 can access his or her incoming messages. Thus, when a first user 704_1 (user 1) chooses to access the messages, the first user 704_1 logs in to the system via a login screen 728 by entering access information, typically in the form of a username and password. A CPU 730 processes the access information to allow the user access, via a messaging application (for example, a mail client), to only a first user inbox location 732.

When incoming mail is received on the message server 714, messages are randomly selected for polling, which means that at least one of the messages is tagged for polling. The intended recipients of the tagged messages are examined to determine whether any one of the recipients is also a designated spam-fighting user. Recipient properties indicating such information can be maintained on the message server 714 or on any other suitable component of the system 700. Once it has been determined which of the intended recipients are also spam fighters, a copy of their respective mail, as well as any other information regarding the mail transaction, can be sent to a central database 734 for storage. Messages tagged for polling are modified by the message modifier 718 in any number of the ways described above. Messages selected for polling can also be specific to the user 704. For example, the user 704 can indicate that only certain types of messages are available for polling. Since this can result in a biased sampling of data, such data can be re-weighted with respect to other client data to mitigate the building of disproportionate training data sets.

Virus scanning of the polling messages can also be performed at this time, or at any other time before the user 704 downloads and/or opens the polling message. Once the messages have been modified in the appropriate manner, they are delivered to the respective users' inboxes, denoted inbox 1 732, inbox 2 736, and inbox N 738, where they can be opened for polling. To facilitate the polling process, each polling message includes two or more voting buttons or links which, when selected by the user, generate information relating to the polling message and the poll result. The text of each polling message can be modified to incorporate the voting buttons or links.

Message poll results (denoted message poll 1 720, message poll 2 722, and message poll N 724), which include any information resulting from the classification (for example, the polling message or an ID associated with it, and user properties), are sent to the central database 734 via a network interface 740 on the LAN 712. The central database 734 can store the polling and user information (720, 722, 724) from the respective users, to be applied to machine learning techniques in order to build or optimize new and/or improved spam filters 742. However, for reasons of privacy and/or security, confidential information can be removed or stripped from the information before it is sent to the central database 734. Information generated by the users 704 via polling can also be aggregated into statistical data. Thus, less bandwidth is used to transmit the information.

The newly trained spam filter 742 can then be distributed to other servers (not shown), as well as to client email software (not shown) interfacing with the LAN 712, on an ongoing basis, such as when a new filter becomes available, either by specific request or automatically. For example, the newest spam filter can be automatically pushed out to them and/or made available for download via a website. As new training data sets are generated to build newer spam filters, older data sets (for example, information previously obtained and/or employed to train a filter) can be discarded or discounted depending on the age of the data.

Consider now an alternative scenario in which an organization devoted to fighting spam makes available a filter shared by many different filter-using organizations. In one aspect of the invention, the filter provider is also a very large provider of email services (for example, paid and/or free email accounts). Rather than relying exclusively on email from its own organization, the filter provider chooses to also use some data from some of the filter-using organizations, so as to better capture the range of good mail and spam. The feedback loop system described above can also be employed in such a cross-organizational scenario, in either a server-based or a client-based architecture. The filter provider, which aggregates data from its own users and from the different filter-using organizations, is referred to as the "internal" organization, and the components residing at one of the participating filter-using organizations are referred to as "external". In general, the cross-organizational system includes a mail database server at the filter provider (internal), such as, but not limited to, Hotmail, and one or more message servers (external), such as those that may reside within one or more individual companies. In this case, the internal mail database server also stores substantial email feedback from its own customers. According to this aspect of the invention, training data sets can be generated based on information stored in the internal database (for example, free email/messaging on a Hotmail or MSN server) as well as on information stored in one or more external databases associated with the respective external servers. Information maintained on the external databases can be communicated to the internal server via a network such as the Internet, for example, for use in machine learning techniques. Ultimately, data from the external databases can be used to train new spam filters and/or to improve existing spam filters that are located externally (for example, within the respective company) or are associated with the internal mail server.

The data from one or more external databases should include at least one of: polling messages, polling results (classifications), user information/properties, and voting statistics per user, per group of users, or on average for each company. The voting statistics help determine the reliability of the information generated by each company and mitigate bias in the external data. Accordingly, the data from one or more external databases (companies) can be reweighted, or weighted differently from the data of one or more other external databases. In addition, the reliability and trustworthiness of the external entities can be tested using validation techniques similar to those described above.
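A minimal sketch of such per-company reweighting follows. It assumes, purely for illustration, that each company's reliability is estimated as the fraction of its votes that agreed with an independent validation; the description does not prescribe how the voting statistics are turned into weights, and the function names are hypothetical.

```python
def company_weights(agreement_counts):
    """Map each company to a reliability weight in [0, 1].

    agreement_counts maps company -> (votes agreeing with validation, total votes).
    """
    weights = {}
    for company, (agree, total) in agreement_counts.items():
        weights[company] = agree / total if total else 0.0
    return weights

def weighted_examples(examples, weights):
    """Attach the source company's weight to each (message, label, company) example."""
    return [(msg, label, weights.get(company, 0.0)) for msg, label, company in examples]
```

For example, `company_weights({"acme": (9, 10)})` gives the hypothetical company `acme` a weight of 0.9, so its examples count slightly less than fully validated data.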

For company security, privacy, and confidentiality, for example, the information or data that each company sends over the Internet to the email server can be scrubbed, abbreviated, and/or condensed from its original form. The original form can be maintained on each external database and/or handled according to each company's preferences. The email server, or any other internal mail server, therefore receives only the information needed to generate training data, such as the spam classification, the sender domain, the sender name, and the content of messages classified as spam.

Referring now to Fig. 8, there is illustrated an exemplary cross-organizational feedback system 800 in which an internal database server and an external mail server can communicate and exchange database information over a network in order to generate training data sets used in machine learning techniques to build improved spam filters. System 800 includes at least one external message server 802 (for example, associated with at least one company) and an internal database server 804. Owing to the cross-organizational nature of the system, the external server 802 and the internal email server 804 each maintain their own database. That is, the email server 804 is associated with an internal database 806 that can also be used to train a new spam filter 808. Similarly, the external server 802 is associated with an external database 810 that can be used to train at least one new spam filter 812 as well as the spam filter 808 located internally with respect to the email server 804. Thus, the information stored on the external database 810 can be used to train the spam filter 808 located on the email server.

A GCN 814 is provided to communicate information to and from the internal email server 804 and the one or more external message servers 802. The external server components of the cross-organizational system operate in a manner similar to the server-based feedback loop system (for example, Fig. 7 above). For example, the message server 802, external database 810, and filter 812 can be located on a LAN 815. In addition, a user interface 816 is provided that presents a login screen 818 as part of the boot-up process of the computer operating system or, as required, to engage the associated user profile before the user 704 can access his or her incoming messages.

In this server-based system, one or more users (denoted USER_1 820, USER_2 822, and USER_N 824) can log into the system simultaneously in order to use the available mail services. In practice, when a first user 820 (USER_1) chooses to access messages, the first user 820 logs into the system via login screen 818 by entering access information, typically in the form of a username and password. CPU 826 processes the access information to allow the user access, via a message communication application (for example, a mail client), to only the first user's inbox location 828.

When incoming mail is received on message server 802, messages are randomly or specifically targeted for polling. Before a message can be selected for polling, the intended recipients of each such targeted message are compared with a list of spam-fighter users to determine whether any of the recipients is also a designated spam-fighting user. Recipient properties indicating this information can be maintained on message server 802, on database 810, or on any other suitable component of system 800. Once it has been determined which of the intended recipients are also spam fighters, the message is selected for polling, and a copy of the polling message, together with any other information relating to the mail transaction, is sent to database 810.
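The selection step described above can be sketched as follows. The function name, the `sample_rate` parameter, and the use of a single random draw per message are assumptions made for this sketch, not details taken from the description.

```python
import random

def select_for_polling(recipients, spam_fighters, sample_rate, rng=random):
    """Return the recipients who should receive a polling copy of one message.

    The message is first targeted at random with probability sample_rate;
    its intended recipients are then compared with the spam-fighter list.
    An empty result means the message is not selected for polling.
    """
    if rng.random() >= sample_rate:
        return []  # message not targeted for polling
    return [r for r in recipients if r in spam_fighters]
```

A message whose recipients include no spam fighters yields an empty list and is delivered normally.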

Message modifier 830 can modify the messages selected for polling in any number of the ways described above. In practice, a unique identifier (ID) can be assigned to each polling message, each spam fighter, and/or each polling result and stored in database 810. As mentioned previously, the messages selected for polling can be chosen at random or can be specific to the respective users (820, 822, and 824). For example, USER_1 820 can indicate that only certain types of messages are available for polling (for example, messages sent from outside the company). Data generated from such specific messages is reweighted and/or discounted to mitigate biased data sampling.

Virus scanning of the polling messages can also be performed at this time or at any other time before the user 704 downloads and/or opens the polling message. Once the messages have been modified in the appropriate manner, they are delivered to the respective users' inboxes, denoted INBOX_1 828, INBOX_2 832, and INBOX_N 834, where they can be opened for polling. To facilitate the polling process, each polling message includes two or more voting buttons or links which, when selected by the user, generate information relating to the polling message and the polling result. The text of each polling message can be modified to incorporate the voting buttons or links.
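As a rough illustration of modifying a message for polling, the sketch below assigns a unique ID and prepends two voting links to the message text. The `[POLL]` subject prefix, the link format, and the `vote_base_url` parameter are hypothetical choices for this sketch.

```python
import uuid

def make_poll_message(subject, body, vote_base_url):
    """Build the polling form of a message with a unique ID and two voting links."""
    poll_id = uuid.uuid4().hex  # unique identifier stored alongside the poll result
    links = "\n".join(
        f"{label}: {vote_base_url}?poll={poll_id}&vote={value}"
        for label, value in (("Good mail", "good"), ("Junk mail", "spam"))
    )
    return {
        "poll_id": poll_id,
        "subject": f"[POLL] {subject}",
        "body": f"{links}\n\n{body}",
    }
```

Selecting either link would report the poll ID and the chosen classification back to the server that records polling results.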

The message poll results (denoted MESSAGE POLL_1 720, MESSAGE POLL_2 722, and MESSAGE POLL_N 724), which include any information produced by the classification (for example, the polling message or the ID associated with it, and user properties), are sent to central database 810 via network interface 842 on LAN 815. Central database 810 can store the polling and user information from each user for later use in machine learning techniques that build or optimize new and/or improved spam filters 812, 808.

For privacy reasons, for example, each company may want to strip out key information before sending polling messages and/or user information to its own database 810 and/or, over GCN 814, to the email database 806. One approach is to provide the database (806 and/or 810) only with feedback on spam messages, thereby excluding feedback on legitimate mail. Another approach is to provide only a partial subset of the information on legitimate mail, such as the sender and the sender's IP address. Yet another approach is to require explicit user permission before sending selected messages to the filter, such as those that the user marked as good but the filter marked as bad, or vice versa. Any of these approaches, or a combination of them, helps maintain the privacy of the participating clients' confidential information while continuing to provide data for training the spam filters (808 and/or 812).

User validation schemes such as those described above can also be applied to each company and to each user within a company. For example, users can individually be subjected to a cross-validation technique in which the classifications of a suspect user are excluded from filter training. The filter is trained using the data from the remaining users. The trained filter is then run over the messages from the excluded user to determine how it would have classified them. If the amount of disagreement exceeds a threshold, the suspect user is considered untrusted. Future message classifications from an untrusted user can be manually inspected before they are accepted by the database and/or filter. Otherwise, such users can be removed from future polling.
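The cross-validation procedure above can be sketched as follows. The `train_filter` function is a deliberately simple word-count stand-in for a real machine-learned filter, and the 0.5 disagreement threshold is an arbitrary assumption; neither comes from the description.

```python
from collections import Counter

def train_filter(examples):
    """Toy stand-in for a learned filter: a word is 'spammy' if it appears
    more often in spam-labeled text than in good-labeled text."""
    spam_words, good_words = Counter(), Counter()
    for text, label in examples:
        (spam_words if label == "spam" else good_words).update(text.lower().split())
    spammy = {w for w in spam_words if spam_words[w] > good_words[w]}
    return lambda text: "spam" if any(w in spammy for w in text.lower().split()) else "good"

def is_untrusted(votes, suspect, threshold=0.5):
    """votes is a list of (user, text, label) tuples.

    Hold out the suspect's votes, train on everyone else, and report whether
    the fraction of disagreements with the trained filter exceeds threshold.
    """
    others = [(text, label) for user, text, label in votes if user != suspect]
    held_out = [(text, label) for user, text, label in votes if user == suspect]
    if not held_out:
        return False
    classify = train_filter(others)
    disagreements = sum(1 for text, label in held_out if classify(text) != label)
    return disagreements / len(held_out) > threshold
```

With very few honest users, a single bad voter can distort the held-out filter itself, so a sketch like this is only meaningful when the remaining users supply enough data.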

Referring now to Fig. 9, an exemplary environment 910 for implementing various aspects of the invention includes a computer 912. The computer 912 includes a processing unit 914, a system memory 916, and a system bus 918. The system bus 918 couples system components including, but not limited to, the system memory 916 to the processing unit 914. The processing unit 914 can be any of various available processors. Dual microprocessors and other multiprocessor architectures can also be employed as the processing unit 914.

The system bus 918 can be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any of a variety of available bus architectures including, but not limited to, 11-bit bus, Industry Standard Architecture (ISA), Micro Channel Architecture (MCA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).

The system memory 916 includes volatile memory 920 and nonvolatile memory 922. The basic input/output system (BIOS), containing the basic routines that transfer information between elements within the computer 912, such as during start-up, is stored in nonvolatile memory 922. By way of illustration and not limitation, nonvolatile memory 922 can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 920 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

The computer 912 also includes removable/non-removable, volatile/nonvolatile computer storage media. Fig. 9 illustrates, for example, disk storage 924. Disk storage 924 includes, but is not limited to, devices such as a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. In addition, disk storage 924 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R drive), CD rewritable drive (CD-RW drive), or digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 924 to the system bus 918, a removable or non-removable interface such as interface 926 is typically used.

It is to be appreciated that Fig. 9 describes software that acts as an intermediary between users and the basic computer resources described in the suitable operating environment 910. Such software includes an operating system 928. The operating system 928, which can be stored on disk storage 924, acts to control and allocate the resources of the computer system 912. System applications 930 take advantage of the management of resources by the operating system 928 through program modules 932 and program data 934 stored either in system memory 916 or on disk storage 924. It is to be appreciated that the invention can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 912 through input device(s) 936. Input devices 936 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, or touch pad, a keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 914 through the system bus 918 via interface port(s) 938. Interface ports 938 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 940 use some of the same types of ports as input device(s) 936. Thus, for example, a USB port can be used to provide input to the computer 912 and to output information from the computer 912 to an output device 940. Output adapter 942 is provided to illustrate that there are some output devices 940, such as monitors, speakers, and printers, among other output devices 940, that require special adapters. The output adapters 942 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 940 and the system bus 918. It should be noted that other devices and/or systems of devices, such as remote computer(s) 944, provide both input and output capabilities.

The computer 912 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 944. The remote computer 944 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor-based appliance, a peer device, or another common network node, and typically includes many or all of the elements described relative to the computer 912. For brevity, only a memory storage device 946 is illustrated with the remote computer 944. The remote computer 944 is logically connected to the computer 912 through a network interface 948 and then physically connected via communication connection 950. Network interface 948 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5, and the like. WAN technologies include, but are not limited to, point-to-point links, circuit-switching networks such as Integrated Services Digital Networks (ISDN) and variations thereon, packet-switching networks, and Digital Subscriber Lines (DSL).

Communication connection(s) 950 refers to the hardware/software employed to connect the network interface 948 to the bus 918. While communication connection 950 is shown inside the computer 912 for illustrative clarity, it can also be external to the computer 912. The hardware/software necessary for connection to the network interface 948 includes, for exemplary purposes only, internal and external technologies such as modems (including regular telephone-grade modems, cable modems, and DSL modems), ISDN adapters, and Ethernet cards.

Fig. 10 is a schematic block diagram of a sample computing environment 1000 with which the present invention can interact. The system 1000 includes one or more client(s) 1010. The client(s) 1010 can be hardware and/or software (for example, threads, processes, computing devices). The system 1000 also includes one or more server(s) 1030. The server(s) 1030 can also be hardware and/or software (for example, threads, processes, computing devices). The servers 1030 can house threads to perform transformations by employing the present invention, for example. One possible communication between a client 1010 and a server 1030 can be in the form of a data packet adapted to be transmitted between two or more computer processes. The system 1000 includes a communication framework 1050 that can be employed to facilitate communications between the client(s) 1010 and the server(s) 1030. The client(s) 1010 are operably connected to one or more client data store(s) 1060 that can be employed to store information local to the client(s) 1010. Similarly, the server(s) 1030 are operably connected to one or more server data store(s) 1040 that can be employed to store information local to the servers 1030.

What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art will recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.

Claims

1. A system that facilitates classifying items in connection with spam prevention, characterized in that it comprises:

means for receiving a set of the items;

means for identifying intended recipients of the items and tagging a subset of the items as polling items, the polling items corresponding to a subset of recipients who are known spam-fighting users; and

feedback means employing machine learning techniques, for receiving information relating to the spam-fighting users' classification of the polling items, and for using the classification information, which is based on the input of the spam-fighting users, with machine learning techniques to train a spam filter and populate a spam list;

wherein the items comprise at least one of email and messages.

2. The system of claim 1, characterized in that the means for receiving a set of the items is any one of an email server, a message server, and an email client.

3. The system of claim 1, characterized in that the polling items comprise all of the items received.

4. The system of claim 1, characterized in that the subset of recipients comprises all recipients.

5. The system of claim 1, characterized in that the subset of recipients is selected at random.

6. The system of claim 1, characterized in that the polling items are limited by at least one of the following:

the number of items selected per user;

the number of items selected per user per time period; and

the probability of tagging an item corresponding to a known user.
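By way of illustration and not limitation, the three polling limits listed above could be combined as in the following sketch. The class name, its parameters, and the representation of a time period as an opaque key are assumptions made for this example.

```python
import random

class PollLimiter:
    """Applies a per-user cap, a per-user-per-period cap, and a tagging probability."""

    def __init__(self, max_total, max_per_period, probability, rng=random):
        self.max_total = max_total
        self.max_per_period = max_per_period
        self.probability = probability
        self.rng = rng
        self.total = {}    # user -> items selected so far
        self.period = {}   # (user, period) -> items selected in that period

    def should_poll(self, user, period):
        """Return True, and record the selection, if the user may be polled now."""
        if self.total.get(user, 0) >= self.max_total:
            return False
        if self.period.get((user, period), 0) >= self.max_per_period:
            return False
        if self.rng.random() >= self.probability:
            return False
        self.total[user] = self.total.get(user, 0) + 1
        self.period[(user, period)] = self.period.get((user, period), 0) + 1
        return True
```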

7. The system of claim 1, characterized in that each of the polling items is assigned a unique ID, the unique ID corresponding to either the polling item or the content of the polling item.

8. The system of claim 1, characterized in that it further comprises means for modifying the polling items.

9. The system of claim 8, characterized in that the polling items comprise a summary of the polling item, the summary comprising at least one of the subject, the date, the message text, and the first few lines of the message text.

10. The system of claim 9, characterized in that the polling items comprise voting instructions and any one of at least two voting buttons and links, the at least two voting buttons and links corresponding to at least two respective classifications of the polling item, so that the user can classify the polling item.

11. The system of claim 1, characterized in that it further comprises a central database that stores information and data relating to user properties, item content and properties associated with the polling items, user classification and voting statistics, frequency analysis data of polling per user and per user per time period, spam lists, legitimate mail lists, and black hole lists.

12. The system of claim 1, characterized in that the system is distributed across more than one spam-fighting company, such that the classification information from each company is sent to a central database connected to each company, wherein confidential information can be removed from the classification information.

13. The system of claim 1, characterized in that it further comprises user classification validation means for testing user reliability and trustworthiness.

14. The system of claim 13, characterized in that the user classification validation means can be applied to one or more suspected users.

15. The system of claim 1, characterized in that the feedback means receives information relating to user feedback, honeypot feedback, and, optionally, user recipient feedback on received items.

16. A method that facilitates classifying messages in connection with spam prevention, characterized in that it comprises:

receiving a set of the messages;

identifying intended recipients of the messages;

tagging a subset of the messages as polling messages, the polling messages corresponding to a subset of recipients who are known spam-fighting users;

receiving information relating to the spam-fighting users' classification of the polling messages; and

using the classification information with machine learning techniques to train a spam filter and populate a spam list.

17. The method of claim 16, characterized in that the subset of recipients who are known spam-fighting users is determined by each recipient performing at least one of the following:

opting in to provide feedback on messages to facilitate training a new spam filter;

passively opting in to provide feedback on messages by not opting out;

paying for email and messaging services provided by a participating message server; and

opening an email account with a participating message server.

18. The method of claim 16, characterized in that the polling messages are limited by one or more polling limits.

19. The method of claim 16, characterized in that it further comprises modifying the polling messages.

20. The method of claim 19, characterized in that modifying the polling messages comprises performing at least one of the following:

moving the polling message to a separate folder for polling messages;

modifying the "from" address of the polling message;

modifying the subject line of the polling message;

using a polling icon on the polling message to identify it; and

using a unique color to identify the polling message.

21. The method of claim 16, characterized in that it further comprises scanning the polling messages for viruses before they are downloaded for polling.

22. The method of claim 16, characterized in that it further comprises making a copy of each polling message as originally received, so that each spam-fighting user receives a first copy of the message in its original form and a second copy of the message in its modified polling form.

23. The method of claim 16, characterized in that it further comprises distributing the trained spam filter to one or more servers, the distribution occurring automatically and/or by request, the request arising from at least one of an email message and a posting on a website for download.

24. The method of claim 16, characterized in that training the spam filter and populating the spam list are performed by machine learning techniques using data based on user classification feedback and, optionally, data generated by one or more additional sources, the one or more additional sources comprising honeypots, recipient non-user classification feedback, and active learning techniques.

25. The method of claim 24, characterized in that the data generated by the one or more additional sources is reweighted proportionately with respect to the type of data generated by the source and relative to the classification information, so as to obtain an unbiased sampling of data.

26. The method of claim 16, characterized in that it further comprises:

monitoring incoming messages for their respective one or more positive features;

determining the frequency of the positive features received;

determining, based at least in part on historical data, whether the frequency of the positive features received exceeds a threshold frequency; and

quarantining suspicious messages that correspond to the one or more positive features exceeding the threshold frequency until further classification data is available to determine whether the suspicious messages are spam.
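By way of illustration and not limitation, the frequency check and quarantine decision recited above might look like the following sketch. Defining the threshold as a multiple (`factor`) of a historical rate, the `min_count` floor, and the string encoding of features are all assumptions of this example.

```python
def features_over_threshold(current_counts, historical_rates, factor=10.0, min_count=20):
    """Return features whose current count exceeds factor times their historical rate.

    current_counts maps feature -> count in the current window;
    historical_rates maps feature -> typical count per window from historical data.
    """
    suspicious = set()
    for feature, count in current_counts.items():
        baseline = historical_rates.get(feature, 0.0)
        if count >= min_count and count > factor * baseline:
            suspicious.add(feature)
    return suspicious

def should_quarantine(message_features, suspicious):
    """A message is held back if it carries any feature that is currently spiking."""
    return any(feature in suspicious for feature in message_features)
```

A quarantined message would be released or discarded once further classification data, such as poll results, becomes available.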

27. The method of claim 26, characterized in that the positive features received are information about the sender, comprising at least one of the sender's IP address and domain.

28. The method of claim 26, characterized in that quarantining suspicious messages is performed by at least one of the following actions:

provisionally labeling the suspicious messages as spam and moving them to a spam folder;

delaying delivery of the suspicious messages to the user until further classification data is available; and

storing the suspicious messages in a folder not visible to the user.

29. The method of claim 16, characterized in that it further comprises determining the false positive rate and catch rate of the spam filter to facilitate optimizing the spam filter, wherein determining the false positive rate and catch rate comprises:

training the spam filter using a training data set, the training data set comprising a first set of polling results;

classifying a second set of polling messages using user feedback to yield a second set of polling results;

running the second set of polling messages through the trained spam filter; and

comparing the second set of polling results with the results of the trained spam filter to determine the false positive rate and catch rate of the filter, thereby evaluating and tuning filter parameters for optimal filter performance.
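By way of illustration and not limitation, the comparison step can be sketched as below, taking the false positive rate as the fraction of user-labeled good mail that the filter marks as spam and the catch rate as the fraction of user-labeled spam that the filter marks as spam. The label strings and the function name are assumptions of this example.

```python
def false_positive_and_catch_rate(poll_labels, filter_labels):
    """Compare the second set of polling results with the trained filter's output.

    Both arguments are equal-length sequences of "good" / "spam" labels for the
    same messages; the poll labels are treated as the reference.
    """
    good = spam = false_positives = caught = 0
    for truth, predicted in zip(poll_labels, filter_labels):
        if truth == "good":
            good += 1
            if predicted == "spam":
                false_positives += 1
        else:
            spam += 1
            if predicted == "spam":
                caught += 1
    fp_rate = false_positives / good if good else 0.0
    catch_rate = caught / spam if spam else 0.0
    return fp_rate, catch_rate
```

Computing the same pair of rates for several filters trained with different parameters allows them to be compared on equal terms.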

30. The method of claim 29, characterized in that more than one spam filter is built, each having different parameters and each trained on the same training data set, so that the false positive rate and catch rate of each spam filter are compared with the false positive rate and catch rate of at least one other spam filter to determine optimal parameters for spam filtering.

31. The method of claim 16, characterized in that it further comprises building an improved spam filter using additional sets of incoming messages, a subset of which is polled to yield new classification information relating to training the improved spam filter, wherein previously obtained classification information is reweighted based at least in part on how long ago it was obtained.

32. The method of claim 16, characterized in that it further comprises using the classification information to build a list of legitimate senders.

33. The method of claim 16, characterized in that it further comprises using the classification information to facilitate terminating spammers' accounts.

34. The method of claim 33, characterized in that it further comprises identifying a spammer who is using an ISP and automatically notifying the ISP of the spamming.

35. The method of claim 33, characterized in that it further comprises identifying a domain responsible for sending spam and automatically notifying at least one of the domain's email provider and the domain's ISP of the spamming.

36. The method of claim 16, characterized in that it further comprises distributing at least one of the spam filter and the spam list to any one of a mail server and client email software, wherein the distribution comprises at least one of the following:

posting a notice on a website to announce that the spam filter and the spam list are available for download;

automatically pushing the spam filter and the spam list out to mail servers and client email software; and

manually pushing the spam filter and the spam list out to mail servers and client email software.

37. The method of claim 16, characterized in that the method further comprises a cross-validation step that facilitates verifying the reliability and trustworthiness of user classifications, the step comprising:

excluding the classifications of one or more suspected users from the data used to train the spam filter;

training the spam filter using all other available user classifications; and

running the suspected users' polling messages through the trained spam filter to determine how it would have classified the messages compared with the suspected users' classifications.

38. The method of claim 37, characterized in that it further comprises performing at least one of the following:

discounting existing and future classifications provided by users determined to be untrusted until those users are determined to be trusted;

discarding existing classifications provided by users determined to be untrusted; and

removing the untrusted users from future polling.

39. The method of claim 16, characterized in that the method further comprises a step that facilitates verifying the reliability and trustworthiness of user classifications used to train the spam filter via the feedback loop system, the step comprising:

identifying a subset of the spam-fighting users as suspected users;

providing one or more test messages having known results to the suspected users for polling; and

determining whether the suspected users' classifications of the one or more test messages match the known classifications, so as to determine the reliability of the users' classifications.
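By way of illustration and not limitation, the matching of a suspected user's votes against test messages with known results might be scored as in the following sketch. The dictionary representation and the function name are assumptions of this example; how the resulting score is turned into a trust decision is left open.

```python
def user_accuracy(user_votes, known_labels):
    """Fraction of seeded test messages that the user classified as known.

    user_votes maps message ID -> the user's classification;
    known_labels maps test message ID -> its known classification.
    Returns None if the user has not voted on any test message.
    """
    scored = [msg_id for msg_id in user_votes if msg_id in known_labels]
    if not scored:
        return None
    correct = sum(1 for msg_id in scored if user_votes[msg_id] == known_labels[msg_id])
    return correct / len(scored)
```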

40. The method of claim 39, characterized in that the subset of spam-fighting users identified as suspected users comprises all users.

41. The method of claim 39, characterized in that the test messages are at least one of known spam and known good mail, and are injected by the feedback loop system into the incoming mail stream and delivered to the suspected users.

42. The method of claim 39, characterized in that the messages for polling received by the suspected users are manually classified by a system administrator, so as to train the spam classifier with correct classifications and to identify untrusted users.

43. The method of claim 39, characterized in that it further comprises at least one of the following actions:

removing the untrusted users from future polling.