WO2014036788A1 - A method for collecting and classification email - Google Patents

Publication number
WO2014036788A1
Authority
WO
WIPO (PCT)
Prior art keywords
confidence
email
mail
spam
reported
Application number
PCT/CN2012/085097
Inventor
林延中
潘庆峰
Original Assignee
盈世信息科技(北京)有限公司
Application filed by 盈世信息科技(北京)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to an email collection and classification method.
  • text classification is performed using artificial intelligence classification algorithms. These algorithms need to learn the learning samples first, and then construct the corresponding discriminant models before text classification; therefore, the learning samples need to be acquired first.
  • the current method of acquiring learning samples is to manually label a batch of sampled mail, marking each mail as spam or non-spam.
  • the technical problem to be solved by the embodiments of the present invention is to provide an email collection and classification method that does not require dedicated staff to classify and label a large number of emails, but instead uses a computer to collect user feedback directly. This reduces the manual workload, ensures classification accuracy, and protects user privacy, because no one has to read the mail manually.
  • an embodiment of the present invention provides an email collection and classification method, including: scanning all reported emails in a server and extracting target emails that have been reported at least n times, where n is a default value and the reported emails include both emails reported as normal mail and emails reported as spam; calculating a confidence level of the target mail to obtain a calculation result; and determining, according to the calculation result, whether the target mail is spam or normal mail, and storing it in the database.
  • the step of calculating the confidence of the target mail comprises: adding the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X; adding the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y; and calculating the absolute value |X-Y| of the difference between X and Y to obtain the calculation result.
  • the step of determining, according to the calculation result, whether the target mail is spam or normal mail comprises: comparing the absolute value |X-Y| of the difference between the total normal mail confidence X and the total spam confidence Y with a threshold T to judge whether |X-Y| is less than T.
  • if so, the mail is not determined for the time being; if not, X and Y are compared.
  • when X is greater than Y, the mail is determined to be normal mail; when X is less than Y, the mail is determined to be spam.
  • before the step of calculating the confidence of the target mail, the method further includes: presetting the initial confidence of a reporter who is reporting mail for the first time to 1.
  • the email collection and classification method further includes: updating the confidence of the reporters, increasing the confidence of reporters whose reports are consistent with the final determination result and decreasing the confidence of reporters whose reports are inconsistent with it.
  • the confidence increases more slowly than it decreases.
  • the confidence level is provided with a maximum value and a minimum value, and the confidence level does not increase after rising to a maximum value, and does not decrease after dropping to a minimum value.
  • the beneficial effects of implementing the present invention are as follows: a computer scans all reported mails in the server, extracts the target mails whose number of reports is greater than or equal to the system default value, calculates a confidence result for each target mail from the reporters' confidence levels, determines from that result whether the reported mail is spam or normal mail, and collects it into the corresponding database. Because the computer processes user feedback directly on the basis of confidence, manual work intensity and workload are reduced, classification accuracy is ensured, and user privacy is protected since no one has to read the mail manually.
  • FIG. 1 is a schematic structural diagram of a first embodiment of an email collection and classification method according to the present invention
  • FIG. 2 is a schematic structural diagram of a second embodiment of an email collection and classification method according to the present invention.
  • FIG. 3 is a schematic structural diagram of a third embodiment of an email collection and classification method according to the present invention.
  • FIG. 4 is a schematic structural diagram of a fourth embodiment of an email collection and classification method according to the present invention.
  • FIG. 1 is a schematic structural diagram of a first embodiment of an email collection and classification method according to the present invention.
  • the method includes the following steps: S100: Scan all the reported emails in the server, and extract the target emails whose reported times are greater than or equal to n.
  • n is a default value, and the reported emails include both emails reported as normal mail and emails reported as spam.
  • the default value n can be set according to the specific situation; preferably, n is 3.
  • S102: Determine, according to the calculation result, whether the target mail is spam or normal mail, and store it in a database.
  • mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
  • FIG. 2 is a schematic structural diagram of a second embodiment of an email collection and classification method according to the present invention.
  • the method includes the following steps: S200: Scan all the reported emails in the server, and extract the target emails whose reported times are greater than or equal to n.
  • n is a default value, and the reported emails include both emails reported as normal mail and emails reported as spam.
  • the default value n can be set according to the specific situation; preferably, n is 3.
  • S201: Add the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X.
  • S202: Add the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y.
  • steps S201 and S202 have no required order and can be performed simultaneously.
  • S204: Determine, according to the calculation result, whether the target mail is spam or normal mail, and store it in a database.
  • mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
  • reporters A and B report mail M as normal mail
  • reporters C and D report mail M as spam
  • reporter A's confidence is 5
  • reporter B's confidence is 10
  • reporter C's confidence is 3
  • reporter D's confidence is 8
  • FIG. 3 is a schematic structural diagram of a third embodiment of an email collection and classification method according to the present invention.
  • the method includes: S300: Scan all the reported emails in the server, and extract the target emails whose reported times are greater than or equal to n.
  • n is a default value, and the reported emails include both emails reported as normal mail and emails reported as spam.
  • the default value n can be set according to the specific situation; preferably, n is 3.
  • S301: Add the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X.
  • S302: Add the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y.
  • steps S301 and S302 have no required order and can be performed simultaneously.
  • the threshold T may be preset according to the specific situation; generally the threshold T is higher than the initial confidence, and preferably T is 3.
  • a target mail that is not determined for the time being remains temporarily stored on the server and is left for a subsequent scan to determine.
  • X and Y are compared: when X is greater than Y, the mail is determined to be normal mail, and when X is less than Y, the mail is determined to be spam.
  • mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
  • the threshold T is preset to 3, so |X-Y| < T; mail m is therefore not determined for the time being, remains temporarily stored on the server, and is left for a subsequent scan to determine.
  • a scan finds that mail M has been reported 4 times, which is greater than the default value of 3, so it is extracted as a target mail; reporters A and B report mail M as normal mail, and reporters C and D report mail M as spam.
  • reporter B's confidence is 10
  • reporter C's confidence is 3
  • reporter D's confidence is 8
  • the total normal mail confidence X is 5+10=15
  • the confidence level of the reporter is 3, the confidence of the reporter B is 8, and the confidence of the reporter C is 5, the informant
  • S400: Scan all reported emails in the server and extract the target emails that have been reported at least n times.
  • n is a default value, and the reported emails include both emails reported as normal mail and emails reported as spam.
  • the default value n can be set according to the specific situation; preferably, n is 3.
  • S402: Add the confidence levels of all reporters who report the target mail as normal mail to obtain the total normal mail confidence X.
  • steps S401 and S402 have no required order and can be performed simultaneously.
  • S403: Add the confidence levels of all reporters who report the target mail as spam to obtain the total spam confidence Y.
  • the threshold T may be preset according to a specific situation, and generally the threshold T is higher than the initial confidence, and preferably the threshold T is 3.
  • a target mail that is not determined for the time being remains temporarily stored on the server and is left for a subsequent scan to determine.
  • X and Y are compared: when X is greater than Y, the mail is determined to be normal mail, and when X is less than Y, the mail is determined to be spam.
  • mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
  • the increase and decrease of the confidence level may be preset as needed.
  • preferably, the confidence is increased by 1, and it is decreased by 10% or by 1, whichever is larger.
  • the confidence increases more slowly than it decreases.
  • the confidence level is provided with a maximum value and a minimum value, and the confidence level does not increase after rising to a maximum value, and does not decrease after dropping to a minimum value.
  • the maximum value or the minimum value may be preset as needed, and preferably, the maximum value is 50 and the minimum value is 0.
  • the threshold T is preset to 3, so |X-Y| < T; mail m is therefore not determined for the time being, remains temporarily stored on the server, and is left for a subsequent scan to determine.
  • a scan finds that mail M has been reported 4 times, which is greater than the default value of 3, so it is extracted as a target mail; reporters A and B report mail M as normal mail, and reporters C and D report mail M as spam. If reporter A is reporting for the first time, reporter A's initial confidence is 1, reporter B's confidence is 14, reporter C's confidence is 3, and reporter D's confidence is 8.
  • the confidence of reporters A and B is increased by 1, so reporter A's confidence becomes 2
  • reporter B's confidence becomes 15
  • reporters C and D are inconsistent with the determination result, so their confidence is reduced by 10% or by 1, whichever is larger; reporter C's original confidence is 3, and since 10% of it is less than 1, reporter C's confidence decreases by 1 to 2,
  • reporter D's original confidence is 8, and since 10% of it is less than 1, reporter D's confidence decreases by 1 to 7.
  • when the reporters' confidence is updated, reporters C and D are consistent with the determination result, so their confidence is increased by 1: reporter C's confidence becomes 6, and reporter D's confidence becomes 21;
  • reporters A and B are inconsistent with the determination result, so their confidence is reduced by 10% or by 1, whichever is larger; reporter A's original confidence is 3, and since 10% of it is less than 1, reporter A's confidence decreases by 1 to 2
  • reporter B's original confidence is 15, and since 10% of it is greater than 1, reporter B's confidence decreases by 1.5 to 13.5.
  • the computer scans all reported mails in the server, extracts the target mails whose number of reports is greater than or equal to the system default value, calculates a confidence result for each target mail from the reporters' confidence levels, determines from that result whether the reported mail is spam or normal mail, and collects it into the corresponding database. Because the computer processes user feedback directly on the basis of confidence, manual work intensity and workload are reduced, classification accuracy is ensured, and user privacy is protected since no one has to read the mail manually.
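
The confidence update rule listed above (increase by 1 on agreement; decrease by 10% or by 1, whichever is larger, on disagreement; bounded by a maximum of 50 and a minimum of 0) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `update_confidence` and the constant names are hypothetical, and the numeric values are the "preferred" values quoted in the text.

```python
# Preferred values quoted in the text; the names are illustrative assumptions.
MAX_CONFIDENCE = 50
MIN_CONFIDENCE = 0


def update_confidence(value, agreed):
    """Return a reporter's new confidence after a mail has been determined.

    agreed is True when the reporter's label matches the final determination.
    """
    if agreed:
        new_value = value + 1
    else:
        # Decrease by 10% or by 1, whichever is larger.
        new_value = value - max(0.10 * value, 1)
    # Confidence stops rising at the maximum and stops falling at the minimum.
    return min(MAX_CONFIDENCE, max(MIN_CONFIDENCE, new_value))


# First worked example: mail M is determined to be normal mail.
print(update_confidence(1, True))    # reporter A: 1 -> 2
print(update_confidence(14, True))   # reporter B: 14 -> 15
print(update_confidence(3, False))   # reporter C: 3 -> 2
print(update_confidence(8, False))   # reporter D: 8 -> 7
```

Because the penalty is at least 1 and grows with the current value while the reward is fixed at 1, confidence rises more slowly than it falls, as the text requires.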

Abstract

A method for collecting and classifying email is disclosed. The method includes: scanning all reported emails in a server; extracting target emails that have been reported at least n times, where n is a default value and the reported emails include both emails reported as normal mail and emails reported as spam; calculating a confidence level of the target mail to obtain a calculation result; determining, according to the calculation result, whether the target mail is spam or normal mail; and storing the result in a database. The invention uses a computer to collect user feedback directly, without assigning dedicated staff to classify and annotate large volumes of email. It reduces manual workload, ensures classification accuracy, and protects user privacy, since no one has to read the email manually.

Description

An Email Collection and Classification Method
Technical Field
[0001] The present invention relates to the field of communications technology, and in particular to an email collection and classification method.
Background
[0002] At present, text classification relies on artificial intelligence classification algorithms. These algorithms must first learn from training samples and construct a corresponding discriminant model before they can classify text, so training samples must be acquired first. The current way to acquire them is to manually label a batch of sampled mail, marking each mail as spam or non-spam.
[0003] Because a classification algorithm needs a sufficient amount of training information, at least tens of thousands of samples must be learned to construct a reliable model. Dedicated staff must therefore classify and label tens of thousands of mails, which is a huge workload. People who do such repetitive work over a long period are also prone to mistakes, which raises the sample error rate and degrades what the algorithm ultimately learns. In addition, labeling requires staff to read users' mail, which infringes on user privacy.
Summary of the Invention
[0004] The technical problem to be solved by the embodiments of the present invention is to provide an email collection and classification method that does not require dedicated staff to classify and label large numbers of emails. Instead, it uses a computer to collect user feedback directly, which reduces the manual workload, ensures classification accuracy, and protects user privacy because no one has to read the mail.
[0005] To solve the above technical problem, an embodiment of the present invention provides an email collection and classification method, comprising: scanning all reported mail on a server and extracting target mail that has been reported at least n times, where n is a default value and the reported mail includes both mail reported as normal mail and mail reported as spam; calculating the confidence of the target mail to obtain a calculation result; and determining, according to the calculation result, whether the target mail is spam or normal mail, and storing it in a database.
[0006] As an improvement of the above solution, the step of calculating the confidence of the target mail comprises: adding the confidence values of all reporters who reported the target mail as normal mail to obtain the total normal mail confidence X; adding the confidence values of all reporters who reported the target mail as spam to obtain the total spam confidence Y; and calculating the absolute value |X-Y| of the difference between X and Y to obtain the calculation result.
[0007] As an improvement of the above solution, the step of determining, according to the calculation result, whether the target mail is spam or normal mail comprises: comparing |X-Y| with a threshold T to judge whether |X-Y| is less than T. If so, the mail is not determined for the time being. If not, X and Y are compared: when X is greater than Y, the mail is determined to be normal mail, and when X is less than Y, the mail is determined to be spam.
[0008] As an improvement of the above solution, before the step of calculating the confidence of the target mail, the method further comprises: presetting the initial confidence of a reporter who is reporting mail for the first time to 1.
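The preset initial confidence in [0008] amounts to a default lookup. A minimal sketch follows, assuming reporter confidences are held in an in-memory dictionary; the names `confidence_of` and `table` are hypothetical, since the document does not say how confidences are stored.

```python
INITIAL_CONFIDENCE = 1  # preset value for first-time reporters, per [0008]


def confidence_of(reporter, table):
    """Return the reporter's confidence, registering first-time reporters at 1."""
    # setdefault stores the initial value only when the reporter is unknown.
    return table.setdefault(reporter, INITIAL_CONFIDENCE)


table = {"B": 14}                 # reporter B has reported before
print(confidence_of("A", table))  # reporter A is new, so this prints 1
print(confidence_of("B", table))  # existing value is kept, so this prints 14
```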
[0009] As an improvement of the above solution, the email collection and classification method further comprises: updating the reporters' confidence by increasing the confidence of reporters whose reports agree with the final determination and decreasing the confidence of reporters whose reports disagree with it.
[0010] As an improvement of the above solution, the confidence increases more slowly than it decreases.
[0011] As an improvement of the above solution, the confidence has a maximum value and a minimum value: once it has risen to the maximum it no longer increases, and once it has fallen to the minimum it no longer decreases.
[0012] The beneficial effects of implementing the present invention are as follows. A computer scans all reported mail on the server, extracts the target mail whose number of reports is greater than or equal to the system default value, calculates a confidence result for each target mail from the reporters' confidence values, determines from that result whether the reported mail is spam or normal mail, and collects it into the corresponding database. Because the computer processes user feedback directly on the basis of confidence, manual work intensity and workload are reduced, classification accuracy is ensured, and user privacy is protected since no one has to read the mail.
Brief Description of the Drawings
[0013] FIG. 1 is a schematic flow diagram of a first embodiment of the email collection and classification method of the present invention;
FIG. 2 is a schematic flow diagram of a second embodiment of the email collection and classification method of the present invention;
FIG. 3 is a schematic flow diagram of a third embodiment of the email collection and classification method of the present invention;
FIG. 4 is a schematic flow diagram of a fourth embodiment of the email collection and classification method of the present invention.
Detailed Description
[0014] To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
[0015] FIG. 1 is a schematic flow diagram of a first embodiment of the email collection and classification method of the present invention, which comprises: S100: scan all reported mail on the server and extract the target mail that has been reported at least n times.
[0016] Here n is a default value, and the reported mail includes both mail reported as normal mail and mail reported as spam.
[0017] It should be noted that a computer automatically scans all reported mail on the server, and does so once every set interval. The default value n can be set according to the specific situation; preferably, n is 3.
[0018] S101: calculate the confidence of the target mail to obtain a calculation result.
[0019] S102: determine, according to the calculation result, whether the target mail is spam or normal mail, and store it in a database.
[0020] It should be noted that mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
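Step S100 of the first embodiment can be sketched as a simple counting pass. This is only an illustration under stated assumptions: the report records are modeled as `(mail_id, reporter_id, label)` tuples and the function name `extract_target_mails` is hypothetical, because the document does not describe the server's data model.

```python
from collections import Counter


def extract_target_mails(reports, n=3):
    """Return the ids of mails that have been reported at least n times.

    reports is an iterable of (mail_id, reporter_id, label) tuples, where
    label is "normal" or "spam"; both kinds of report count toward n.
    """
    counts = Counter(mail_id for mail_id, _reporter, _label in reports)
    return sorted(mail_id for mail_id, count in counts.items() if count >= n)


reports = [
    ("M", "A", "normal"), ("M", "B", "normal"),
    ("M", "C", "spam"), ("M", "D", "spam"),
    ("K", "A", "spam"),  # mail K has a single report, so it is skipped
]
print(extract_target_mails(reports))  # ['M']
```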
[0021] FIG. 2 is a schematic flow diagram of a second embodiment of the email collection and classification method of the present invention, which comprises: S200: scan all reported mail on the server and extract the target mail that has been reported at least n times.
[0022] Here n is a default value, and the reported mail includes both mail reported as normal mail and mail reported as spam.
[0023] It should be noted that a computer automatically scans all reported mail on the server, and does so once every set interval. The default value n can be set according to the specific situation; preferably, n is 3.
[0024] S201: add the confidence values of all reporters who reported the target mail as normal mail to obtain the total normal mail confidence X.
[0025] S202: add the confidence values of all reporters who reported the target mail as spam to obtain the total spam confidence Y.
[0026] It should be noted that steps S201 and S202 have no required order and can be performed simultaneously.
[0027] S203: calculate the absolute value |X-Y| of the difference between the total normal mail confidence X and the total spam confidence Y to obtain the calculation result.
[0028] S204: determine, according to the calculation result, whether the target mail is spam or normal mail, and store it in a database.
[0029] It should be noted that mail determined to be spam is stored in the spam database, and mail determined to be normal mail is stored in the normal mail database.
[0030] For example, a scan finds that mail M has been reported 4 times, which is greater than the default value of 3 (preset), so it is extracted as a target mail. Reporters A and B reported mail M as normal mail, and reporters C and D reported it as spam. Reporter A's confidence is 5, reporter B's is 10, reporter C's is 3, and reporter D's is 8. The total normal mail confidence X is then 5+10=15, the total spam confidence Y is 3+8=11, and the absolute value of their difference is |X-Y| = |15-11| = 4.
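The arithmetic of steps S201 to S203 can be checked with a short sketch that reuses the figures from the example in [0030]. The function name `total_confidences` and the `(reporter, label)` record shape are assumptions made for illustration.

```python
def total_confidences(reports, confidence):
    """Compute X, Y and |X-Y| for one target mail.

    reports is a list of (reporter, label) pairs with label "normal" or
    "spam"; confidence maps each reporter to a confidence value.
    """
    x = sum(confidence[r] for r, label in reports if label == "normal")
    y = sum(confidence[r] for r, label in reports if label == "spam")
    return x, y, abs(x - y)


confidence = {"A": 5, "B": 10, "C": 3, "D": 8}
reports_m = [("A", "normal"), ("B", "normal"), ("C", "spam"), ("D", "spam")]
x, y, diff = total_confidences(reports_m, confidence)
print(x, y, diff)  # 15 11 4
```

The two sums are independent of each other, which is why the text notes that S201 and S202 can run in either order or at the same time.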
[0031] FIG. 3 is a schematic flow diagram of a third embodiment of the email collection and classification method of the present invention, which comprises: S300: scan all reported mail on the server and extract the target mail that has been reported at least n times.
[0032] Here n is a default value, and the reported mail includes both mail reported as normal mail and mail reported as spam.
[0033] It should be noted that a computer automatically scans all reported mail on the server, and does so once every set interval. The default value n can be set according to the specific situation; preferably, n is 3.
[0034] S301: add the confidence values of all reporters who reported the target mail as normal mail to obtain the total normal mail confidence X.
[0035] S302: add the confidence values of all reporters who reported the target mail as spam to obtain the total spam confidence Y.
[0036] It should be noted that steps S301 and S302 have no required order and can be performed simultaneously.
[0037] S303: calculate the absolute value |X-Y| of the difference between the total normal mail confidence X and the total spam confidence Y to obtain the calculation result.
[0038] S304, 将所述总正常邮件置信度 X与总垃圾邮件置信度 Υ的差的绝对值 IX- ΥΙ与阈值 Τ进行比较, 判断 ΙΧ-ΥΙ是否小于1\  [0038] S304, comparing the absolute value IX- 差 of the difference between the total normal mail confidence X and the total spam confidence ΥΙ with a threshold ,, determining whether the ΙΧ-ΥΙ is less than 1\
[0039] 需要说明的是, 阈值 Τ可根据具体情况进行预设, 通常阈值 Τ要高于初始置信度, 优选地阈值 Τ为 3。  [0039] It should be noted that the threshold Τ may be preset according to a specific situation, and generally the threshold is higher than the initial confidence, and preferably the threshold Τ is 3.
[0040] 判断为是时, 暂时不对该邮件进行判定。  [0040] When the determination is YES, the mail is not determined for the time being.
[0041] 需要说明的是, 对暂时不进行判定的目标邮件, 将其继续暂存服务器中, 留予后续 扫描判定。  [0041] It should be noted that the target mail that is not temporarily determined is stored in the temporary storage server and left for subsequent scanning determination.
[0042] 判断为否时, 比较 X与 Υ的大小, 当 X大于 Υ时, 判定邮件为正常邮件, 当 X小 于 Υ时, 判定邮件为垃圾邮件。  [0042] When the judgment is no, the size of X and Υ is compared. When X is greater than Υ, it is determined that the mail is a normal mail, and when X is smaller than Υ, the mail is determined to be spam.
[0043] 需要说明的是, 判定结果为垃圾邮件的存储到垃圾邮件数据库中, 判定结果为正常 邮件的存储到正常邮件数据库中。  [0043] It should be noted that the result of the determination is that the spam is stored in the spam database, and the result of the determination is that the normal mail is stored in the normal mail database.
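Steps S301 to S304 and the decision rule above can be sketched in a few lines. The function name `classify`, the `Verdict` enum, and the idea of passing the confidences as two plain lists are assumptions made for illustration; the threshold of 3 is the preferred value given in the text.

```python
from enum import Enum

THRESHOLD_T = 3  # preferred threshold T from the description


class Verdict(Enum):
    NORMAL = "normal"
    SPAM = "spam"
    UNDECIDED = "undecided"  # stays on the server for a later scan


def classify(normal_confidences, spam_confidences, threshold=THRESHOLD_T):
    """Apply steps S301-S304 to one target email.

    normal_confidences: confidences of reporters who reported it as normal mail.
    spam_confidences: confidences of reporters who reported it as spam.
    """
    x = sum(normal_confidences)  # S301: total normal-mail confidence X
    y = sum(spam_confidences)    # S302: total spam confidence Y
    if abs(x - y) < threshold:   # S303/S304: margin too small to decide
        return Verdict.UNDECIDED
    if x > y:
        return Verdict.NORMAL
    if x < y:
        return Verdict.SPAM
    # Only reachable when threshold <= 0 and X == Y. The text does not
    # cover this case, so this sketch defers the decision.
    return Verdict.UNDECIDED


close_call = classify([5, 10], [5, 8])   # X=15, Y=13, |X-Y|=2
clear_normal = classify([5, 10], [3, 8])  # X=15, Y=11, |X-Y|=4
clear_spam = classify([3, 8], [5, 10])    # X=11, Y=15, |X-Y|=4
```

With T = 3, the first call is left undecided, the second is classified as normal mail, and the third as spam.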
[0044] For example, a scan finds that email m has been reported 4 times, which exceeds the preset default value of 3, so it is extracted as a target email. Reporters a and b reported m as normal mail, and reporters c and d reported it as spam. Reporter a has a confidence of 5, reporter b 10, reporter c 5, and reporter d 8. The total normal-mail confidence X is 5+10=15, the total spam confidence Y is 5+8=13, and |X-Y| = |15-13| = 2. With the threshold T preset to 3, |X-Y| < T, so no determination is made on email m for the time being; it remains temporarily stored on the server and is left for a subsequent scan.
[0045] As another example, a scan finds that email M has been reported 4 times, which exceeds the default value of 3, so it is extracted as a target email. Reporters A and B reported M as normal mail, and reporters C and D reported it as spam. If reporter A has a confidence of 5, reporter B 10, reporter C 3, and reporter D 8, then the total normal-mail confidence X is 5+10=15, the total spam confidence Y is 3+8=11, and |X-Y| = |15-11| = 4. With the threshold T preset to 3, |X-Y| > T, so X and Y are compared. Since X=15 and Y=11, X > Y, so email M is determined to be normal mail and is collected into the normal-mail database.
[0046] If instead reporter A has a confidence of 3, reporter B 8, reporter C 5, and reporter D 10, then the total normal-mail confidence X is 3+8=11, the total spam confidence Y is 5+10=15, and |X-Y| = |11-15| = 4. With the threshold T preset to 3, |X-Y| > T, so X and Y are compared. Since X=11 and Y=15, X < Y, so email M is determined to be spam and is collected into the spam database.
[0047] FIG. 4 is a schematic flowchart of a fourth embodiment of the email collection and classification method of the present invention, which includes: S400: Scan all reported emails on the server and extract the target emails whose number of reports is greater than or equal to n.
[0048] n is a default value, and the reported emails include emails reported as normal mail and emails reported as spam.
[0049] It should be noted that a computer automatically scans all reported emails on the server and repeats the scan at regular intervals. The default value n can be set according to the specific situation; preferably, n is 3.
[0050] S401: Preset the initial confidence of a reporter who is reporting an email for the first time to 1.
[0051] S402: Sum the confidence levels of all reporters who reported the target email as normal mail to obtain the total normal-mail confidence X.
[0052] It should be noted that steps S401 and S402 have no required order and can be performed simultaneously.
[0053] S403: Sum the confidence levels of all reporters who reported the target email as spam to obtain the total spam confidence Y.
[0054] S404: Calculate the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y to obtain the calculation result.
[0055] S405: Compare |X-Y| with a threshold T and determine whether |X-Y| is less than T.
[0056] It should be noted that the threshold T can be preset according to the specific situation. The threshold T is usually higher than the initial confidence; preferably, T is 3.
[0057] If so, no determination is made on the email for the time being.
[0058] It should be noted that a target email on which no determination is made remains temporarily stored on the server and is left for a subsequent scan.
[0059] If not, X and Y are compared: when X is greater than Y, the email is determined to be normal mail; when X is less than Y, the email is determined to be spam.
[0060] It should be noted that emails determined to be spam are stored in the spam database, and emails determined to be normal mail are stored in the normal-mail database.
[0061] S406: Update the reporters' confidence levels: increase the confidence of reporters whose reports agree with the final determination, and decrease the confidence of reporters whose reports disagree with it.
[0062] It should be noted that the amounts by which confidence is increased and decreased can be preset as needed. Preferably, the increase is +1, and the decrease is either 10% or 1, whichever is larger.
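Under the preferred values, the update in S406 can be written compactly. The symbols c (a reporter's confidence before the update) and c' (the confidence after it) are notation introduced here for illustration and do not appear in the patent text:

```latex
c' =
\begin{cases}
c + 1, & \text{if the report agrees with the final determination},\\
c - \max(0.1\,c,\ 1), & \text{if the report disagrees with the final determination}.
\end{cases}
```

For c up to 10 the fixed penalty of 1 applies, because 10% of c would be no larger than 1; above 10 the 10% term is the larger reduction.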
[0063] More preferably, the confidence increases more slowly than it decreases.
[0064] It should be noted that because confidence increases more slowly than it decreases, a reporter who holds a high confidence is more credible and more expert, which makes the final determination more accurate.
[0065] More preferably, the confidence has a maximum and a minimum: once it has risen to the maximum it no longer increases, and once it has fallen to the minimum it no longer decreases.
[0066] It should be noted that the maximum and minimum can be preset as needed; preferably, the maximum is 50 and the minimum is 0.
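The confidence bookkeeping of S401 and S406, together with the bounds just described, can be sketched as below. The function names and the use of a plain dictionary mapping reporter to confidence are assumptions for illustration; the constants are the preferred values from the text (initial confidence 1, increase of 1, decrease of 10% or 1 whichever is larger, maximum 50, minimum 0).

```python
INITIAL_CONFIDENCE = 1  # S401: first-time reporters start at 1
MAX_CONFIDENCE = 50     # preferred maximum
MIN_CONFIDENCE = 0      # preferred minimum


def updated_confidence(confidence, agrees):
    """Return a reporter's confidence after one final determination (S406)."""
    if agrees:
        new_value = confidence + 1
    else:
        # Decrease by 10% or by 1, whichever reduction is larger.
        new_value = confidence - max(0.10 * confidence, 1)
    # Clamp so the value never leaves [MIN_CONFIDENCE, MAX_CONFIDENCE].
    return min(MAX_CONFIDENCE, max(MIN_CONFIDENCE, new_value))


def apply_verdict(confidences, agreeing, disagreeing):
    """Update a dict of reporter -> confidence in place and return it.

    Reporters missing from the dict are treated as first-time reporters.
    """
    for reporter in agreeing:
        current = confidences.get(reporter, INITIAL_CONFIDENCE)
        confidences[reporter] = updated_confidence(current, agrees=True)
    for reporter in disagreeing:
        current = confidences.get(reporter, INITIAL_CONFIDENCE)
        confidences[reporter] = updated_confidence(current, agrees=False)
    return confidences


# Reporter A has never reported before; B, C and D have existing scores.
scores = apply_verdict({"B": 14, "C": 3, "D": 8},
                       agreeing=["A", "B"], disagreeing=["C", "D"])
```

After this call, A holds 2, B holds 15, C holds 2, and D holds 7. Returning a new value from `updated_confidence` rather than mutating state keeps the rule easy to test on its own.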
[0067] For example, a scan finds that email m has been reported 4 times, which exceeds the preset default value of 3, so it is extracted as a target email. Reporters a and b reported m as normal mail, and reporters c and d reported it as spam. Reporter a has a confidence of 5, reporter b 10, reporter c 5, and reporter d 8. The total normal-mail confidence X is 5+10=15, the total spam confidence Y is 5+8=13, and |X-Y| = |15-13| = 2. With the threshold T preset to 3, |X-Y| < T, so no determination is made on email m for the time being; it remains temporarily stored on the server and is left for a subsequent scan.
[0068] As another example, a scan finds that email M has been reported 4 times, which exceeds the default value of 3, so it is extracted as a target email. Reporters A and B reported M as normal mail, and reporters C and D reported it as spam. If reporter A is reporting for the first time, A is given the initial confidence of 1; reporter B has a confidence of 14, reporter C 3, and reporter D 8. The total normal-mail confidence X is 1+14=15, the total spam confidence Y is 3+8=11, and |X-Y| = |15-11| = 4. With the threshold T preset to 3, |X-Y| > T, so X and Y are compared. Since X=15 and Y=11, X > Y, so email M is determined to be normal mail and is collected into the normal-mail database. At the same time, the reporters' confidence levels are updated. Reporters A and B agree with the determination, so each gains 1: A's confidence becomes 2 and B's becomes 15. Reporters C and D disagree with the determination, so each loses 10% or 1, whichever is larger. Reporter C's original confidence is 3; 10% of it (0.3) is smaller than 1, so C's confidence drops by 1 to 2. Reporter D's original confidence is 8; 10% of it (0.8) is smaller than 1, so D's confidence drops by 1 to 7.
[0069] If instead reporter A has a confidence of 3, reporter B 15, reporter C 5, and reporter D 20, then the total normal-mail confidence X is 3+15=18, the total spam confidence Y is 5+20=25, and |X-Y| = |18-25| = 7. With the threshold T preset to 3, |X-Y| > T, so X and Y are compared. Since X=18 and Y=25, X < Y, so email M is determined to be spam and is collected into the spam database. At the same time, the reporters' confidence levels are updated. Reporters C and D agree with the determination, so each gains 1: C's confidence becomes 6 and D's becomes 21. Reporters A and B disagree with the determination, so each loses 10% or 1, whichever is larger. Reporter A's original confidence is 3; 10% of it (0.3) is smaller than 1, so A's confidence drops by 1 to 2. Reporter B's original confidence is 15; 10% of it (1.5) is larger than 1, so B's confidence drops by 1.5 to 13.5.
[0070] As can be seen from the above, a computer scans all reported emails on the server, extracts the target emails whose number of reports is greater than or equal to the system default value, performs a confidence calculation on each target email based on the reporters' confidence levels, and then determines from the calculation result whether the reported email is spam or normal mail and collects it into the corresponding database. Because the computer processes the user feedback directly on the basis of confidence, the process reduces manual labor intensity and workload, ensures classification accuracy, and requires no manual reading of the emails, which protects user privacy.
[0071] The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principles of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the invention.

Claims

1. An email collection and classification method, characterized by comprising:
scanning all reported emails on a server and extracting target emails whose number of reports is greater than or equal to n, where n is a default value and the reported emails include emails reported as normal mail and emails reported as spam;
calculating the confidence of the target email to obtain a calculation result; and
determining, according to the calculation result, whether the target email is spam or normal mail, and storing it in a database.
2. The email collection and classification method according to claim 1, characterized in that the step of calculating the confidence of the target email comprises:
summing the confidence levels of all reporters who reported the target email as normal mail to obtain a total normal-mail confidence X;
summing the confidence levels of all reporters who reported the target email as spam to obtain a total spam confidence Y; and
calculating the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y to obtain the calculation result.
3. The email collection and classification method according to claim 2, characterized in that the step of determining, according to the calculation result, whether the target email is spam or normal mail comprises:
comparing the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y with a threshold T, and determining whether |X-Y| is less than T;
if so, making no determination on the email for the time being; and
if not, comparing X and Y, and determining that the email is normal mail when X is greater than Y, or that the email is spam when X is less than Y.
4. The email collection and classification method according to claim 2, characterized by further comprising, before the step of calculating the confidence of the target email:
presetting the initial confidence of a reporter who is reporting an email for the first time to 1.
5. The email collection and classification method according to claim 1, characterized by further comprising:
updating the reporters' confidence levels by increasing the confidence of reporters whose reports agree with the final determination and decreasing the confidence of reporters whose reports disagree with the final determination.
6. The email collection and classification method according to claim 5, characterized in that the confidence increases more slowly than it decreases.
7. The email collection and classification method according to claim 5, characterized in that the confidence has a maximum and a minimum, the confidence no longer increases once it has risen to the maximum, and no longer decreases once it has fallen to the minimum.
PCT/CN2012/085097 2012-09-07 2012-11-23 A method for collecting and classification email WO2014036788A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2012103276245A CN102880952A (en) 2012-09-07 2012-09-07 Method for collecting and classifying E-mails
CN201210327624.5 2012-09-07

Publications (1)

Publication Number Publication Date
WO2014036788A1 true WO2014036788A1 (en) 2014-03-13

Family

ID=47482268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/085097 WO2014036788A1 (en) 2012-09-07 2012-11-23 A method for collecting and classification email

Country Status (2)

Country Link
CN (1) CN102880952A (en)
WO (1) WO2014036788A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984703A (en) * 2014-04-22 2014-08-13 新浪网技术(中国)有限公司 Mail classification method and device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424280B (en) * 2013-08-30 2018-10-23 格博信息技术(苏州)有限公司 Push follow-up method and its system
CN103970832A (en) * 2014-04-01 2014-08-06 百度在线网络技术(北京)有限公司 Method and device for recognizing spam

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999067731A1 (en) * 1998-06-23 1999-12-29 Microsoft Corporation A technique which utilizes a probabilistic classifier to detect 'junk' e-mail
WO2005001733A1 (en) * 2003-06-30 2005-01-06 Dong-June Seen E-mail managing system and method thereof
CN1719812A (en) * 2005-08-08 2006-01-11 北京中星微电子有限公司 Method and system for filtering refuse E-mail

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171450B2 (en) * 2003-01-09 2007-01-30 Microsoft Corporation Framework to enable integration of anti-spam technologies
CN101674264B (en) * 2009-10-20 2011-09-14 哈尔滨工程大学 Spam detection device and method based on user relationship mining and credit evaluation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984703A (en) * 2014-04-22 2014-08-13 新浪网技术(中国)有限公司 Mail classification method and device
CN103984703B (en) * 2014-04-22 2017-04-12 新浪网技术(中国)有限公司 Mail classification method and device

Also Published As

Publication number Publication date
CN102880952A (en) 2013-01-16

Similar Documents

Publication Publication Date Title
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN109981625B (en) Log template extraction method based on online hierarchical clustering
US8352409B1 (en) Systems and methods for improving the effectiveness of decision trees
CN112686775A (en) Power network attack detection method and system based on isolated forest algorithm
JP2018010451A (en) Data classification device, data classification method, and program
CN109359137B (en) User growth portrait construction method based on feature screening and semi-supervised learning
WO2014036788A1 (en) A method for collecting and classification email
CN105843889A (en) Credibility based big data and general data oriented data collection method and system
CN108683658B (en) Industrial control network flow abnormity identification method based on multi-RBM network construction reference model
TW201810093A (en) User background information collection method and device
CN116226103A (en) Method for detecting government data quality based on FPGrow algorithm
CN104021180A (en) Combined software defect report classification method
WO2023241385A1 (en) Model transferring method and apparatus, and electronic device
Zeng et al. PyroHMMvar: a sensitive and accurate method to call short indels and SNPs for Ion Torrent and 454 data
CN110807546A (en) Community grid population change early warning method and system
US20210117858A1 (en) Information processing device, information processing method, and storage medium
CN115099875A (en) Data classification method based on decision tree model and related equipment
CN112000955B (en) Method for determining log characteristic sequence, vulnerability analysis method, system and equipment
US11481369B2 (en) System and method for fingerprinting-based conversation threading
Liu et al. Towards misdirected email detection for preventing information leakage
CN103336865A (en) Dynamic communication network construction method and device
JP5008096B2 (en) Automatic document classification method and automatic document classification system
CN113792114A (en) Credible evaluation method and system for urban field knowledge graph
CN110070464B (en) Energy consumption reminding method, user equipment, storage medium and device for grain processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 12884103; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 12884103; Country of ref document: EP; Kind code of ref document: A1)