CN104715248A - Method for recognizing mail advertisement picture - Google Patents

Info

Publication number
CN104715248A
Authority
CN
China
Prior art keywords
picture
text block
virtual coordinate
coordinate system
recognition methods
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510121822.XA
Other languages
Chinese (zh)
Other versions
CN104715248B (en)
Inventor
许广彬
徐慧灵
纪春来
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Industrial Internet Co ltd
Original Assignee
Wuxi Huayun Data Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Huayun Data Technology Service Co Ltd filed Critical Wuxi Huayun Data Technology Service Co Ltd
Priority to CN201510121822.XA priority Critical patent/CN104715248B/en
Publication of CN104715248A publication Critical patent/CN104715248A/en
Application granted granted Critical
Publication of CN104715248B publication Critical patent/CN104715248B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method for recognizing advertisement pictures in mail. The method comprises the following steps: S1, a picture is extracted from a mail and preprocessed, and the arrangement direction of its text blocks is determined; S2, a virtual coordinate system is set up according to the arrangement direction of the text blocks; S3, binarized data of each text block in the picture are calculated in the virtual coordinate system; S4, statistics are gathered on the size and number of the text blocks in the picture; S5, whether the picture is an advertisement picture is judged according to a set threshold. By projecting the text blocks of the picture in the virtual coordinate system and calculating the binarized data, the method can judge effectively, from statistics on the size and number of the text blocks and the set threshold, whether the picture is an advertisement picture. This markedly improves the extraction of characters from advertisement pictures in spam, gives strong resistance to interference, and reduces the load on the server.

Description

A method for recognizing email advertisement pictures
Technical field
The present invention relates to spam processing technology and the technical field of network security, and in particular to a method for recognizing email advertisement pictures.
Background art
Of the spam sent worldwide each year, picture spam accounts for more than 50% of the total. Techniques for recognizing picture spam therefore urgently need to be upgraded, so that picture spam can be identified more effectively and the spam filtering rate improved.
In the prior art, optical character recognition (OCR) is usually used to extract the text contained in an advertisement picture, and the extracted content is then used to judge whether it is advertising, thereby identifying spam. Optical character recognition generally uses computer software called an OCR engine to process digital images of text that was originally printed, typed, handwritten or otherwise written on paper, microfilm or other media, and to produce machine-readable and editable text from those images. A digital image of a document processed by an OCR engine may comprise images of multiple pages of written material. The images of text to be processed by the OCR engine can be obtained by various imaging methods, including capturing a digital image of the text with an image scanner. However, this technical scheme has several defects: the amount of computation is large, the extraction of characters from advertisement pictures is unsatisfactory, the misjudgment rate is high, and recognition is poor for spam that the sender has processed by adding interfering characters or by displaying the content in vertical columns.
In view of this, it is necessary to improve the prior-art methods for recognizing email advertisement pictures in order to overcome the above defects.
Summary of the invention
The object of the present invention is to disclose a method for recognizing email advertisement pictures that improves the extraction of characters from pictures containing text, so that spam containing advertisement pictures is identified effectively, while reducing the load on the server and improving the server's resistance to interference when filtering spam.
To achieve the above object, the invention provides a method for recognizing email advertisement pictures, comprising the following steps:
S1, extracting a picture from a mail, preprocessing it, and then determining the arrangement direction of the text blocks;
S2, setting up a virtual coordinate system according to the arrangement direction of the text blocks;
S3, calculating the binarized data of each text block in the picture in the virtual coordinate system;
S4, gathering statistics on the size and number of the text blocks in the picture;
S5, judging whether the picture is an advertisement picture according to a set threshold.
As a further improvement of the present invention, the preprocessing in step S1 comprises border processing, color inversion, background removal, binarization and noise reduction.
As a further improvement of the present invention, step S2 is specifically: a matching virtual coordinate system is set up for the picture according to the continuity of the projection of the picture content onto the virtual coordinate system.
As a further improvement of the present invention, step S3 is specifically: each text block in the picture is projected onto the axes of the virtual coordinate system; a coordinate point is marked black if it has a foreground pixel and white otherwise.
As a further improvement of the present invention, step S4 is specifically: the binarized data in the picture are projected independently onto the axes of the virtual coordinate system, the width and height along the virtual coordinate system of the character text blocks and the non-character text blocks are recorded, and their respective numbers are counted and saved to a server database.
As a further improvement of the present invention, the server database comprises a MySQL database or an Oracle database.
As a further improvement of the present invention, the virtual coordinate system comprises a one-axis virtual coordinate system and a two-axis virtual coordinate system.
As a further improvement of the present invention, the two-axis virtual coordinate system comprises a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
As a further improvement of the present invention, the set threshold in step S5 is specifically: the number T of character text blocks ranges from 50 to 300, the total area of the character text blocks as a percentage of the picture area ranges from 50% to 100%, and the number of non-character text blocks ranges from 0 to 2T.
Compared with the prior art, the beneficial effects of the present invention are as follows: by projecting the text blocks of the picture in the virtual coordinate system and calculating the binarized data, the method can judge effectively, from statistics on the size and number of the text blocks and the set threshold, whether the picture is an advertisement picture. This markedly improves the extraction of characters from advertisement pictures in spam, gives strong resistance to interference, and reduces the load on the server.
Brief description of the drawings
Fig. 1 is a schematic diagram of the method for recognizing email advertisement pictures according to the present invention;
Fig. 2 is one type of picture extracted from a mail;
Fig. 3 is the picture generated after Fig. 2 undergoes the preprocessing of step S1;
Fig. 4 is another type of picture extracted from a mail;
Fig. 5 is the picture generated after Fig. 4 undergoes the preprocessing of step S1;
Fig. 6 is a schematic diagram of determining the row and column direction of Fig. 3 by continuity analysis of the projection result in which foreground pixels are marked black;
Fig. 7 is a schematic diagram of determining the row and column direction of Fig. 5 by continuity analysis of the projection result in which foreground pixels are marked black;
Fig. 8 is a schematic diagram of the independent projection of the first line of text blocks in the picture shown in Fig. 7;
Fig. 9 is a schematic diagram of recording the width and height of the text blocks and the number of text blocks from the projection result shown in Fig. 8.
Detailed description of the embodiments
The present invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the invention; equivalent transformations or substitutions of function, method or structure made by those of ordinary skill in the art on the basis of these embodiments all fall within the protection scope of the present invention.
In the present embodiment, a method for recognizing email advertisement pictures comprises the following steps:
Step S1: a picture is extracted from a mail and preprocessed, and the arrangement direction of the text blocks is then determined. The preprocessing comprises border processing, color inversion, background removal, binarization and noise reduction.
Border processing judges whether the picture has a border; if it does, the outer and/or inner border of the picture is removed by cropping.
Color inversion calculates the foreground color and/or background color of the picture.
Background removal calculates the background color of the picture and removes it; for a picture that has undergone color inversion, the foreground color and background color are swapped at the same time. If the picture contains background interference such as scenery or people, these interfering elements are removed according to the overall style or the distribution of pixel color values of the picture extracted from the mail in step S1.
Binarization applies global binarization, using an error compensation algorithm and according to the computer's configuration, to the picture extracted from the mail in step S1. The file of a binarized picture is very small, which makes it easier for the computer to judge later whether the picture is an advertisement picture.
Noise reduction applies a dual-background filtering method to the picture extracted by the computer, reducing the adverse effect of noise in the picture on the later advertisement-picture recognition calculation.
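As a rough illustration of the color inversion and binarization steps, the sketch below works on a grayscale picture held as a list of rows of 0-255 values. It is a minimal sketch under stated assumptions: a fixed global threshold of 128 stands in for the error compensation algorithm named above, the most frequent gray level is taken as the background color, and the function names are hypothetical.

```python
from collections import Counter


def estimate_background(gray):
    """Return the most frequent gray level (assumed here to be the background)."""
    counts = Counter(value for row in gray for value in row)
    return counts.most_common(1)[0][0]


def binarize(gray, threshold=128):
    """Return a 0/1 matrix in which 1 marks a foreground pixel.

    Assumption: a plain global threshold replaces the error compensation
    algorithm mentioned in the text.
    """
    if estimate_background(gray) < threshold:
        # Dark background: invert so that text becomes dark on light.
        gray = [[255 - value for value in row] for row in gray]
    return [[1 if value < threshold else 0 for value in row] for row in gray]
```

The inversion branch means light text on a dark background and dark text on a light background both end up with foreground pixels marked 1, which is what the later projection steps expect.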
Referring to Figs. 2 to 5, Fig. 3 shows the preprocessing result generated from Fig. 2 by color inversion, and Fig. 5 shows the preprocessing result generated from Fig. 4 by border processing.
Step S2: a virtual coordinate system is set up according to the arrangement direction of the text blocks.
To determine the size and number of the text blocks in the picture, the arrangement direction of the text blocks contained in the picture content must be determined first. For example, the text blocks in Fig. 2 and Fig. 4 are arranged horizontally and vertically, respectively.
Referring to Fig. 6, step S2 is specifically: a matching virtual coordinate system is set up for the picture according to the continuity of the projection of the picture content onto the virtual coordinate system. The virtual coordinate system comprises a one-axis virtual coordinate system and a two-axis virtual coordinate system, and the two-axis virtual coordinate system comprises a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
Specifically, if the text in the advertisement picture appears as a single horizontal line or a single vertical column, only a one-axis virtual coordinate system (horizontal) or a one-axis virtual coordinate system (vertical) is set up according to the arrangement direction of the text blocks in the picture.
If the text in the advertisement picture appears as multiple horizontal lines or multiple vertical columns, a two-axis orthogonal virtual coordinate system is set up, with the axis in the horizontal direction defined as the X-axis and the axis in the vertical direction defined as the Y-axis.
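The continuity analysis is not spelled out in detail, so the following is only one possible reading, offered as a hedged sketch. It assumes a binarized picture given as a 0/1 matrix, projects it onto rows and columns, and compares how many separate black runs each projection contains. Treating the direction with more gaps as the line direction is an assumption of this sketch, and the function names are hypothetical.

```python
def count_runs(profile):
    """Count maximal runs of 1s in a 0/1 projection profile."""
    runs = 0
    previous = 0
    for value in profile:
        if value and not previous:
            runs += 1
        previous = value
    return runs


def text_direction(binary):
    """Guess whether text runs in horizontal lines or vertical columns."""
    row_profile = [1 if any(row) else 0 for row in binary]
    col_profile = [1 if any(col) else 0 for col in zip(*binary)]
    # Blank rows between lines break the row profile into several runs;
    # blank columns between vertical columns break the column profile.
    if count_runs(row_profile) >= count_runs(col_profile):
        return "horizontal"
    return "vertical"
```

This heuristic is crude: evenly spaced characters can also leave gaps across the line direction, so a practical version would likely compare gap widths as well as run counts.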
If the text in the picture is arranged obliquely, the virtual coordinate system must be set up by rotating the picture containing the text. This is realized by the following technical scheme.
Step S11: coordinate axes are set up along the natural width and height directions of the picture, with the vertical direction marked as the X-axis and the horizontal direction as the Y-axis. The extreme high point and extreme low point of the picture on the X-axis and the extreme far point and extreme near point on the Y-axis are calculated, where:
the extreme high point is the point with the largest value in the X direction;
the extreme low point is the point with the smallest value in the X direction;
the extreme far point is the point with the largest value in the Y direction;
the extreme near point is the point with the smallest value in the Y direction.
Step S12: an extreme-value deviation tdev = 20 px is set, and the high point set, low point set, far point set and near point set are calculated as follows:
points in the picture whose distance from the extreme high point in the X direction is less than or equal to tdev are recorded as the high point set h;
points in the picture whose distance from the extreme low point in the X direction is greater than or equal to tdev are recorded as the low point set l;
points in the picture whose distance from the extreme far point in the Y direction is less than or equal to tdev are recorded as the far point set f;
points in the picture whose distance from the extreme near point in the Y direction is greater than or equal to tdev are recorded as the near point set n.
Step S13: the widths of the high point set and the low point set are calculated and recorded as hw and lw respectively. The heights of the far point set and the near point set are calculated and recorded as fh and nh respectively.
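The extreme points, the four point sets and their sizes can be sketched as follows. This is an illustrative sketch, not the patented implementation. It assumes foreground pixels are given as (x, y) pairs with X vertical and Y horizontal as defined above, that a set's width is its extent along Y and its height is its extent along X, and, for symmetry, that all four sets collect points within tdev of their extreme point, even though the text as written uses "greater than or equal to" for the low and near sets. Function names are hypothetical.

```python
def extent(values):
    """Span between the largest and smallest value."""
    return max(values) - min(values)


def extreme_sets(points, tdev=20):
    """Return the high, low, far and near point sets (h, l, f, n)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_max, x_min = max(xs), min(xs)  # extreme high / extreme low
    y_max, y_min = max(ys), min(ys)  # extreme far / extreme near
    # Assumption: every set uses "within tdev of its extreme point".
    h = [(x, y) for x, y in points if x_max - x <= tdev]
    l = [(x, y) for x, y in points if x - x_min <= tdev]
    f = [(x, y) for x, y in points if y_max - y <= tdev]
    n = [(x, y) for x, y in points if y - y_min <= tdev]
    return h, l, f, n


def set_sizes(points, tdev=20):
    """Return (hw, lw, fh, nh) for the four point sets."""
    h, l, f, n = extreme_sets(points, tdev)
    hw = extent([y for _, y in h])  # width measured along the horizontal Y-axis
    lw = extent([y for _, y in l])
    fh = extent([x for x, _ in f])  # height measured along the vertical X-axis
    nh = extent([x for x, _ in n])
    return hw, lw, fh, nh
```

For a narrow upright strip of pixels, hw and lw come out small while fh and nh come out large, which is the pattern the one-axis test in the next step looks for.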
Step S14: judge whether the text content of the picture is a one-axis orthogonal graph. One-axis orthogonal decision thresholds v11 = 20 and v12 = 80 are set, and the decision method is as follows:
if hw and lw are both less than or equal to v11, and fh or nh is greater than or equal to v12, the picture is judged to be one-axis orthogonal;
if fh and nh are both less than or equal to v11, and hw or lw is greater than or equal to v12, the picture is judged to be one-axis orthogonal.
If the picture is one-axis orthogonal, it can be used directly and no further processing is needed; otherwise, proceed to the next step.
Step S15: judge whether the text content of the picture is a two-axis orthogonal graph. A two-axis orthogonal decision threshold v2 = 80 is set, and the decision method is as follows:
if hw or lw is greater than or equal to v2, the picture is judged to be two-axis orthogonal;
if fh or nh is greater than or equal to v2, the picture is judged to be two-axis orthogonal.
If the picture is a two-axis orthogonal graph, no further processing is needed; otherwise, jump to the next step.
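The two decision steps can be condensed into a single function, shown here as a minimal sketch. It assumes hw, lw, fh and nh are already expressed in the same unit as the thresholds (the text does not state the unit of v11, v12 and v2) and that "hw, lw" in the one-axis rule means both values. The function name and the returned labels are hypothetical.

```python
def classify_layout(hw, lw, fh, nh, v11=20, v12=80, v2=80):
    """Classify the text layout from the sizes of the four extreme point sets."""
    # One-axis test: narrow in one direction, long in the other.
    if hw <= v11 and lw <= v11 and (fh >= v12 or nh >= v12):
        return "one-axis orthogonal"
    if fh <= v11 and nh <= v11 and (hw >= v12 or lw >= v12):
        return "one-axis orthogonal"
    # Two-axis test: at least one set spans a long edge.
    if hw >= v2 or lw >= v2 or fh >= v2 or nh >= v2:
        return "two-axis orthogonal"
    # Otherwise the content is treated as tilted and needs rotation.
    return "two-axis non-orthogonal"
```

For example, `classify_layout(9, 9, 99, 99)` falls into the one-axis branch, while four mid-sized values such as 30 fall through to the non-orthogonal case.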
Step S16: calculate the tilt angle of the text content of a two-axis non-orthogonal picture: take the extreme high point and the extreme far point and calculate the tilt angle of the text content of the picture.
Step S17: rotate the picture according to the tilt angle so that it becomes a two-axis orthogonal graph.
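The text does not give the formula for the tilt angle, so the sketch below shows one plausible construction rather than the patented one. It assumes the extreme high point and the extreme far point are adjacent corners of a tilted rectangular text region, so the line joining them follows one edge of that region, and it measures the angle of that edge against the horizontal Y-axis. Points are (x, y) pairs with X vertical and Y horizontal, as in step S11; the function name is hypothetical.

```python
import math


def tilt_angle(high_point, far_point):
    """Angle in degrees between the high-to-far edge and the horizontal Y-axis."""
    vertical_drop = high_point[0] - far_point[0]    # difference along X (vertical)
    horizontal_run = far_point[1] - high_point[1]   # difference along Y (horizontal)
    return math.degrees(math.atan2(vertical_drop, horizontal_run))
```

The resulting angle would then be passed to an image library's rotation routine to bring the text back to a two-axis orthogonal layout; the direction of rotation depends on that library's sign convention.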
Step S3: the binarized data of each text block in the picture are calculated in the virtual coordinate system. Specifically, each text block in the picture is projected onto the axes of the virtual coordinate system; a coordinate point is marked black if it has a foreground pixel and white otherwise.
Referring to Fig. 6 and Fig. 7, after the preprocessed picture is projected in the two-axis orthogonal virtual coordinate system, a black region appears perpendicular to the projection direction wherever the picture content contains a character text block, and a white region appears perpendicular to the projection direction wherever the picture content contains a blank line, a space, English letters or digits (that is, a "non-character text block").
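The black and white marking can be illustrated with a short projection routine. This is a minimal sketch assuming a binarized picture stored as a 0/1 matrix (1 for a foreground pixel) and the two-axis orthogonal convention with X horizontal and Y vertical; the function name and the string labels are hypothetical.

```python
def project_onto_axis(binary, axis="x"):
    """Mark each coordinate on the chosen axis as "black" or "white".

    A coordinate is black if any pixel that projects onto it is foreground.
    """
    if axis == "x":
        lines = zip(*binary)  # one entry per column
    elif axis == "y":
        lines = binary        # one entry per row
    else:
        raise ValueError("axis must be 'x' or 'y'")
    return ["black" if any(line) else "white" for line in lines]
```

Calling it once per axis gives the two profiles on which the later size and count statistics are based.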
Step S4 is then performed: statistics are gathered on the size and number of the text blocks in the picture. Specifically, the binarized data in the picture are projected independently onto the axes of the virtual coordinate system, the width and height along the virtual coordinate system of the character text blocks and the non-character text blocks are recorded, and their respective numbers are counted and saved to a server database. The server database comprises a MySQL database or an Oracle database, and is preferably a MySQL database. Referring to Fig. 8, if a text block is a Chinese character, its projected width in the X direction is usually greater than that of an English letter or digit, and its projected height in the Y direction is also greater than that of an English letter or digit. This allows the type of each text block in the picture to be judged and screened efficiently, with the text blocks projected independently line by line or column by column.
In the present embodiment, a wider black region is marked as the region of a character text block (the text block is a Chinese character), a narrower black region is the region of a non-character text block (the text block is an English letter or a digit), and the remaining white regions are also regions of non-character text blocks (regions without any Chinese character).
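Counting the blocks in one projected line amounts to splitting its profile into runs and sorting the runs by width. The sketch below illustrates this under stated assumptions: the profile is a list of 0/1 values (1 for black), and a black run at least `min_char_width` pixels wide counts as a character text block. The default of 20 px is borrowed from the text block width range given for this embodiment; using it as the dividing line is an assumption of this sketch, and the function names are hypothetical.

```python
def segment_runs(profile):
    """Split a 0/1 profile into (value, start, length) runs."""
    runs = []
    start = 0
    for i in range(1, len(profile) + 1):
        if i == len(profile) or profile[i] != profile[start]:
            runs.append((profile[start], start, i - start))
            start = i
    return runs


def count_blocks(profile, min_char_width=20):
    """Return (character_blocks, non_character_blocks) for one projected line."""
    character_blocks = 0
    non_character_blocks = 0
    for value, _, length in segment_runs(profile):
        if value == 1 and length >= min_char_width:
            character_blocks += 1      # wide black run
        else:
            non_character_blocks += 1  # narrow black run or white gap
    return character_blocks, non_character_blocks
```

Summing these counts over every line or column of the picture gives the totals that would be stored and later compared with the thresholds.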
Referring to Fig. 9, it should be noted that the present invention can project line by line along the X-axis, either from top to bottom or from bottom to top; it can also project column by column along the Y-axis, either from top to bottom or from bottom to top, thereby gathering statistics on the size and number of the text blocks in the picture.
Referring to Fig. 9, in step S5 whether the picture is an advertisement picture is judged according to the set threshold. The set threshold in step S5 is specifically: the number T of character text blocks ranges from 50 to 300, the total area of the character text blocks as a percentage of the picture area ranges from 50% to 100%, and the number of non-character text blocks ranges from 0 to 2T.
After statistics have been gathered for all text blocks (both character text blocks and non-character text blocks) in the virtual coordinate system, whether the picture extracted from the mail is an advertisement picture can be judged from the statistics. Specifically, in the present embodiment, the width of a text block ranges from 20 px to 40 px and the height of a text block ranges from 35 px to 60 px.
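The threshold test of step S5 can be written as a single predicate. This is a hedged sketch: the text lists three ranges but does not say how they are combined, so requiring all three to hold at once is an assumption, and the function and parameter names are hypothetical.

```python
def is_advertisement(character_blocks, character_area_percent, non_character_blocks):
    """Apply the three threshold ranges of step S5 (all must hold, by assumption)."""
    t = character_blocks
    return (
        50 <= t <= 300
        and 50 <= character_area_percent <= 100
        and 0 <= non_character_blocks <= 2 * t
    )
```

For example, a picture with 120 character text blocks covering 65% of its area and 40 non-character text blocks would be flagged, while one with only 10 character text blocks would not.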
With the present invention, advertisement pictures contained in mail can be recognized accurately, with a recognition rate of 99.99%, so that mail containing such an advertisement picture is identified as spam. The recognition method can be applied in an anti-spam engine to improve the efficiency of identifying, filtering and intercepting spam.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its protection scope; all equivalent embodiments or modifications that do not depart from the spirit of the invention shall be included within its protection scope.
It will be apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced in the invention. No reference sign in a claim should be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme. The specification is written this way only for clarity; those skilled in the art should treat the specification as a whole, and the technical schemes in the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (9)

1. A method for recognizing email advertisement pictures, characterized in that the recognition method comprises the following steps:
S1, extracting a picture from a mail, preprocessing it, and then determining the arrangement direction of the text blocks;
S2, setting up a virtual coordinate system according to the arrangement direction of the text blocks;
S3, calculating the binarized data of each text block in the picture in the virtual coordinate system;
S4, gathering statistics on the size and number of the text blocks in the picture;
S5, judging whether the picture is an advertisement picture according to a set threshold.
2. The recognition method according to claim 1, characterized in that the preprocessing in step S1 comprises border processing, color inversion, background removal, binarization and noise reduction.
3. The recognition method according to claim 1, characterized in that step S2 is specifically: a matching virtual coordinate system is set up for the picture according to the continuity of the projection of the picture content onto the virtual coordinate system.
4. The recognition method according to claim 1, characterized in that step S3 is specifically: each text block in the picture is projected onto the axes of the virtual coordinate system; a coordinate point is marked black if it has a foreground pixel and white otherwise.
5. The recognition method according to claim 1, characterized in that step S4 is specifically: the binarized data in the picture are projected independently onto the axes of the virtual coordinate system, the width and height along the virtual coordinate system of the character text blocks and the non-character text blocks are recorded, and their respective numbers are counted and saved to a server database.
6. The recognition method according to claim 5, characterized in that the server database comprises a MySQL database or an Oracle database.
7. The recognition method according to any one of claims 2 to 6, characterized in that the virtual coordinate system comprises a one-axis virtual coordinate system and a two-axis virtual coordinate system.
8. The recognition method according to claim 7, characterized in that the two-axis virtual coordinate system comprises a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
9. The recognition method according to claim 1, characterized in that the set threshold in step S5 is specifically: the number T of character text blocks ranges from 50 to 300, the total area of the character text blocks as a percentage of the picture area ranges from 50% to 100%, and the number of non-character text blocks ranges from 0 to 2T.
CN201510121822.XA 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture Active CN104715248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510121822.XA CN104715248B (en) 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture

Publications (2)

Publication Number Publication Date
CN104715248A true CN104715248A (en) 2015-06-17
CN104715248B CN104715248B (en) 2018-10-23

Family

ID=53414559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510121822.XA Active CN104715248B (en) 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture

Country Status (1)

Country Link
CN (1) CN104715248B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040015554A1 (en) * 2002-07-16 2004-01-22 Brian Wilson Active e-mail filter with challenge-response
CN102542290A (en) * 2011-12-22 2012-07-04 国家计算机网络与信息安全管理中心 Junk mail image recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程红蓉等: ""图像垃圾邮件中文本区域的自动提取方法"", 《解放军理工大学学报(自然科学版)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399161A (en) * 2018-03-06 2018-08-14 平安科技(深圳)有限公司 Advertising pictures identification method, electronic device and readable storage medium storing program for executing
WO2019169769A1 (en) * 2018-03-06 2019-09-12 平安科技(深圳)有限公司 Advertisement picture identification method, electronic device, and readable storage medium
CN111753675A (en) * 2020-06-08 2020-10-09 北京天空卫士网络安全技术有限公司 Picture type junk mail identification method and device
CN111753675B (en) * 2020-06-08 2024-03-26 北京天空卫士网络安全技术有限公司 Picture type junk mail identification method and device

Also Published As

Publication number Publication date
CN104715248B (en) 2018-10-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 214000, science and software park, Binhu District, Jiangsu, Wuxi 6

Patentee after: Huayun data holding group Co.,Ltd.

Address before: 214000, science and software park, Binhu District, Jiangsu, Wuxi 6

Patentee before: WUXI CHINAC DATA TECHNICAL SERVICE Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20221109

Address after: Room 316, Government Affairs Service Center, No. 1, Renmin Road, Pingshang Town, Lingang Economic Development Zone, Linyi City, Shandong Province, 276000

Patentee after: Huayun Industrial Internet Co.,Ltd.

Address before: No. 6 Science and Education Software Park, Binhu District, Wuxi City, Jiangsu Province

Patentee before: Huayun data holding group Co.,Ltd.