CN104715248A

CN104715248A - Method for recognizing mail advertisement picture

Info

Publication number: CN104715248A
Application number: CN201510121822.XA
Authority: CN
Inventors: 许广彬; 徐慧灵; 纪春来
Original assignee: Wuxi Huayun Data Technology Service Co Ltd
Current assignee: Huayun Industrial Internet Co ltd
Priority date: 2015-03-19
Filing date: 2015-03-19
Publication date: 2015-06-17
Anticipated expiration: 2035-03-19
Also published as: CN104715248B

Abstract

The invention provides a method for recognizing advertisement pictures in e-mail. The method comprises the following steps: S1, a picture is extracted from an e-mail and preprocessed, and the arrangement direction of its text blocks is then determined; S2, a virtual coordinate system is established according to the arrangement direction of the text blocks; S3, the binarized data of each text block in the picture is calculated in the virtual coordinate system; S4, the size and number of the text blocks in the picture are counted; S5, whether the picture is an advertisement picture is judged according to set thresholds. By obtaining the projection of the text blocks onto the virtual coordinate system and calculating the binarized data, the method can effectively judge, from the size and number of the text blocks and the set thresholds, whether the picture is an advertisement picture. This markedly improves the extraction of characters from advertisement pictures in spam, provides strong resistance to interference, and reduces the server load.

Description

A method for recognizing advertisement pictures in e-mail

Technical field

The present invention relates to spam processing technology and the technical field of network security, and in particular to a method for recognizing advertisement pictures in e-mail.

Background art

Of the spam sent worldwide each year, picture-based spam accounts for more than 50% of the total. Techniques for identifying picture spam therefore urgently need to be upgraded so that picture-based spam can be identified more effectively and the spam filtering rate improved.

In the prior art, optical character recognition (OCR) is usually used to extract the text contained in an advertisement picture, and the extracted content is examined to judge whether it is advertising, thereby identifying spam. OCR generally uses computer software, called an OCR engine, to process digital images of text that was originally printed, typed, handwritten, or otherwise written on paper, microfilm, or other media, and to produce machine-recognizable and editable text from those images. The digital image of a document processed by an OCR engine may contain images of multiple pages of written material. The images of text to be processed by an OCR engine can be obtained by various imaging methods, including capturing a digital image of the text with an image scanner. However, this approach has technical defects: the amount of computation is large, the extraction of characters from advertisement pictures is unsatisfactory, the misjudgment rate is high, and recognition is poor for spam that the sender has processed by, for example, adding interfering characters or displaying the content in vertical columns.

In view of this, it is necessary to improve the prior-art methods for recognizing advertisement pictures in e-mail so as to overcome the above defects.

Summary of the invention

The object of the present invention is to disclose a method for recognizing advertisement pictures in e-mail that improves the extraction of characters from pictures containing text, so that spam containing advertisement pictures can be identified effectively, while reducing the server load and improving the server's resistance to interference when filtering spam.

To achieve the above object, the present invention provides a method for recognizing advertisement pictures in e-mail, comprising the following steps:

S1, extracting a picture from the e-mail, preprocessing it, and then determining the arrangement direction of the text blocks;

S2, establishing a virtual coordinate system according to the arrangement direction of the text blocks;

S3, calculating the binarized data of each text block in the picture in the virtual coordinate system;

S4, counting the size and number of the text blocks in the picture;

S5, judging whether the picture is an advertisement picture according to set thresholds.

As a further improvement of the present invention, the preprocessing in step S1 comprises border processing, inversion processing, background removal, binarization, and noise reduction.

As a further improvement of the present invention, step S2 is specifically: a matching virtual coordinate system is established for the picture according to the continuity of the projection result of the picture content on the virtual coordinate system.

As a further improvement of the present invention, step S3 is specifically: each text block in the picture is projected onto the polar axis of the virtual coordinate system; a coordinate point that has a foreground pixel is marked black, and otherwise it is marked white.

As a further improvement of the present invention, step S4 is specifically: independent projection processing is applied to the binarized data in the picture relative to the polar axis of the virtual coordinate system, the width and height values of the character text blocks and non-character text blocks along the virtual coordinate system are recorded, and the respective numbers are counted and saved to a server database.

As a further improvement of the present invention, the server database comprises a MySQL database or an Oracle database.

As a further improvement of the present invention, the virtual coordinate system comprises a one-axis virtual coordinate system and a two-axis virtual coordinate system.

As a further improvement of the present invention, the two-axis virtual coordinate system comprises a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.

As a further improvement of the present invention, the set thresholds in step S5 are specifically: the number T of character text blocks ranges from 50 to 300, the total area of the character text blocks as a percentage of the picture area ranges from 50 to 100, and the number of non-character text blocks ranges from 0 to 2T.

Compared with the prior art, the beneficial effects of the present invention are as follows: by obtaining the projection of the text blocks in the picture onto the virtual coordinate system and calculating the binarized data, the method can effectively judge, from the size and number of the text blocks in the picture and the set thresholds, whether the picture is an advertisement picture. This markedly improves the extraction of characters from advertisement pictures in spam, provides strong resistance to interference, and reduces the server load.

Brief description of the drawings

Fig. 1 is a schematic diagram of the method for recognizing advertisement pictures in e-mail according to the present invention;

Fig. 2 is one type of picture extracted from an e-mail;

Fig. 3 is the picture generated after Fig. 2 undergoes the preprocessing of step S1;

Fig. 4 is another type of picture extracted from an e-mail;

Fig. 5 is the picture generated after Fig. 4 undergoes the preprocessing of step S1;

Fig. 6 is a schematic diagram of determining the row and column direction of Fig. 3 by continuity analysis of its projection result, in which foreground pixels are marked black;

Fig. 7 is a schematic diagram of determining the row and column direction of Fig. 5 by continuity analysis of its projection result, in which foreground pixels are marked black;

Fig. 8 is a schematic diagram of independent projection processing of the first line of text blocks in the picture shown in Fig. 7;

Fig. 9 is a schematic diagram of recording the width and height values and the number of the text blocks according to the projection result shown in Fig. 8.

Detailed description

The present invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the present invention; functional, methodological, or structural equivalent transformations or substitutions made by those of ordinary skill in the art according to these embodiments all fall within the protection scope of the present invention.

In the present embodiment, a method for recognizing advertisement pictures in e-mail comprises the following steps:

Step S1, extracting a picture from the e-mail, preprocessing it, and then determining the arrangement direction of the text blocks. The preprocessing comprises border processing, inversion processing, background removal, binarization, and noise reduction.

Border processing judges whether the picture has a border and, if so, removes the outer and/or inner border of the picture by cropping. Inversion processing calculates the foreground color and/or background color of the picture. Background removal calculates the background color of the picture and removes it; for a picture that has undergone inversion processing, the foreground and background colors are also swapped. If the picture contains background interference such as scenery or people, such interfering factors are removed according to the overall style or the distribution of pixel color values of the picture extracted from the e-mail in step S1. Binarization applies an error compensation algorithm, according to the configuration of the computer, to binarize the picture extracted in step S1 as a whole. A binarized picture file is very small, which makes it easier for the computer to judge later whether the picture is an advertisement picture. Noise reduction applies a dual-background filtering method to the picture extracted by the computer, reducing the adverse effect of noise in the picture on the later recognition of advertisement pictures.

Referring to Fig. 2 and Fig. 3, and to Fig. 4 and Fig. 5: Fig. 3 shows the preprocessing result generated from Fig. 2 by inversion processing, and Fig. 5 shows the preprocessing result generated from Fig. 4 by border processing.
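The inversion and binarization operations can be illustrated with a minimal sketch. It assumes the picture is already available as a grayscale NumPy array, and it swaps in a plain global mean threshold because the text does not specify the error compensation algorithm. The function names, the 128 cutoff, and the majority-color rule for detecting a dark background are hypothetical choices made for illustration only.

```python
import numpy as np

def invert_if_dark_background(gray):
    """Return a picture with a light background and dark foreground.

    Assumption: the background is the majority color, so a picture whose
    pixels are mostly dark is inverted.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    dark_fraction = np.mean(gray < 128)
    return 255 - gray if dark_fraction > 0.5 else gray

def binarize(gray):
    """Global binarization: True marks a foreground (dark) pixel.

    Assumption: a simple mean threshold stands in for the unspecified
    error compensation algorithm.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    return gray < gray.mean()

# Example: light text on a dark background.
picture = np.zeros((4, 4), dtype=np.uint8)
picture[1, 1] = 255
foreground = binarize(invert_if_dark_background(picture))
```

After these two calls, `foreground` is a boolean mask of the kind that the projection steps S3 and S4 can work on.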

Step S2, establishing a virtual coordinate system according to the arrangement direction of the text blocks.

To determine the size and number of the text blocks in the picture, the arrangement direction of the text blocks contained in the picture content must be determined first. For example, the text blocks in Fig. 2 are arranged horizontally and those in Fig. 4 are arranged vertically.

Referring to Fig. 6, step S2 is specifically: a matching virtual coordinate system is established for the picture according to the continuity of the projection result of the picture content on the virtual coordinate system. The virtual coordinate system comprises a one-axis virtual coordinate system and a two-axis virtual coordinate system, and the two-axis virtual coordinate system comprises a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.

Specifically, if the characters in the advertisement picture appear as a single horizontal row or a single vertical column, only a one-axis virtual coordinate system (horizontal) or a one-axis virtual coordinate system (vertical) is established according to the arrangement direction of the text blocks in the picture.

If the characters in the advertisement picture appear as multiple horizontal rows or multiple vertical columns, a two-axis orthogonal virtual coordinate system is established, with the polar axis in the horizontal direction defined as the X-axis and the polar axis in the vertical direction defined as the Y-axis.

If the characters in the picture are arranged obliquely, the virtual coordinate system must be established by rotating the picture containing the text. This is done through the following steps.

Step S11: establish coordinate axes for the picture along its natural width and height directions, marking the vertical direction as the X-axis and the horizontal direction as the Y-axis. Calculate the extreme high point and extreme low point of the picture on the X-axis, and the extreme far point and extreme near point on the Y-axis, where:

The extreme high point is the point with the largest value in the X-axis direction;

The extreme low point is the point with the smallest value in the X-axis direction;

The extreme far point is the point with the largest value in the Y-axis direction;

The extreme near point is the point with the smallest value in the Y-axis direction.

Step S12: set the extreme-value deviation tdev=20px and calculate the high point set, low point set, far point set, and near point set as follows:

Points in the picture whose distance from the extreme high point in the X-axis direction is less than or equal to tdev are recorded as the high point set h;

Points in the picture whose distance from the extreme low point in the X-axis direction is greater than or equal to tdev are recorded as the low point set l;

Points in the picture whose distance from the extreme far point in the Y-axis direction is less than or equal to tdev are recorded as the far point set f;

Points in the picture whose distance from the extreme near point in the Y-axis direction is greater than or equal to tdev are recorded as the near point set n.

Step S13: calculate the widths of the high point set and the low point set, recorded as hw and lw respectively. Calculate the heights of the far point set and the near point set, recorded as fh and nh respectively.
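Steps S11 to S13 can be sketched as follows. This is an illustrative reading, not a definitive implementation: the comparisons follow the text exactly as written (including "greater than or equal to" for the low and near sets), and it is an assumption that the "width" of a set means its extent along the horizontal Y-axis and the "height" its extent along the vertical X-axis. The function name `extreme_sets` and the input format (two arrays of foreground point coordinates) are hypothetical.

```python
import numpy as np

T_DEV = 20  # extreme-value deviation tdev in pixels, as given in step S12

def extreme_sets(xs, ys, tdev=T_DEV):
    """Compute hw, lw, fh, nh from foreground point coordinates.

    xs: coordinates along the vertical X-axis (convention of step S11)
    ys: coordinates along the horizontal Y-axis
    """
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    x_max, x_min = xs.max(), xs.min()   # extreme high / extreme low point
    y_max, y_min = ys.max(), ys.min()   # extreme far / extreme near point

    # Set membership, with the comparisons exactly as stated in step S12.
    high = (x_max - xs) <= tdev
    low = (xs - x_min) >= tdev
    far = (y_max - ys) <= tdev
    near = (ys - y_min) >= tdev

    def extent(values, mask):
        # Assumed meaning of "width"/"height": max minus min inside the set.
        selected = values[mask]
        return int(selected.max() - selected.min()) if selected.size else 0

    return {
        "hw": extent(ys, high),  # width of the high point set
        "lw": extent(ys, low),   # width of the low point set
        "fh": extent(xs, far),   # height of the far point set
        "nh": extent(xs, near),  # height of the near point set
    }

result = extreme_sets([0, 5, 30, 100], [0, 10, 50, 60])
```

The four values returned here are the inputs that the decision thresholds of the following steps operate on.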

Step S14: judge whether the text content of the picture is one-axis orthogonal. Set the one-axis orthogonal decision thresholds v11=20 and v12=80. The decision method is as follows:

If hw and lw are both less than or equal to v11, and fh or nh is greater than or equal to v12, the picture is judged to be one-axis orthogonal;

If fh and nh are both less than or equal to v11, and hw or lw is greater than or equal to v12, the picture is judged to be one-axis orthogonal.

If the picture is one-axis orthogonal, it can be used directly and needs no further processing; otherwise, proceed to the next step.

Step S15: judge whether the text content of the picture is two-axis orthogonal. Set the two-axis orthogonal decision threshold v2=80. The decision method is as follows:

If hw or lw is greater than or equal to v2, the picture is judged to be two-axis orthogonal;

If fh or nh is greater than or equal to v2, the picture is judged to be two-axis orthogonal.

If the picture is two-axis orthogonal, it needs no further processing; otherwise, proceed to the next step.
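The two decision steps can be written as a small function. This is a sketch under stated assumptions: the thresholds are taken from the text, hw, lw, fh, and nh are assumed to be in the same units as the thresholds, and the function name and the returned labels are hypothetical.

```python
V11, V12 = 20, 80  # one-axis orthogonal decision thresholds (step S14)
V2 = 80            # two-axis orthogonal decision threshold (step S15)

def classify_layout(hw, lw, fh, nh):
    """Classify the text layout from the set widths and heights."""
    # Step S14: narrow in one direction and long in the other.
    if hw <= V11 and lw <= V11 and (fh >= V12 or nh >= V12):
        return "one-axis orthogonal"
    if fh <= V11 and nh <= V11 and (hw >= V12 or lw >= V12):
        return "one-axis orthogonal"
    # Step S15: any set that is long enough indicates an upright layout.
    if hw >= V2 or lw >= V2 or fh >= V2 or nh >= V2:
        return "two-axis orthogonal"
    # Otherwise the picture goes on to the tilt-angle steps.
    return "two-axis non-orthogonal"

print(classify_layout(10, 15, 90, 5))
print(classify_layout(90, 10, 85, 10))
print(classify_layout(30, 30, 30, 30))
```

Only pictures that fall through to the last label need the rotation described in the remaining steps.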

Step S16: calculate the tilt angle of the text content of the two-axis non-orthogonal picture: take the extreme high point and the extreme far point and calculate the tilt angle of the text content of the picture.

Step S17: rotate the picture according to the tilt angle so that it becomes two-axis orthogonal.
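The text does not give the formula for the tilt angle. One plausible reading, shown here purely as an assumption, takes the angle between the horizontal axis and the line joining the extreme high point to the extreme far point. The function name is hypothetical, and points are (x, y) pairs with X vertical and Y horizontal, following the axis convention of step S11.

```python
import math

def tilt_angle_degrees(high_point, far_point):
    """Angle in degrees between the horizontal Y-axis and the line from
    the extreme high point to the extreme far point.

    Assumption: this line runs along one edge of the tilted text area,
    so its inclination equals the tilt of the text.
    """
    x_high, y_high = high_point
    x_far, y_far = far_point
    vertical_drop = x_high - x_far      # difference along the vertical X-axis
    horizontal_run = y_far - y_high     # difference along the horizontal Y-axis
    return math.degrees(math.atan2(vertical_drop, horizontal_run))

angle = tilt_angle_degrees((100, 0), (0, 100))
```

Rotating the picture by the opposite of this angle, with whatever rotation routine the chosen image library provides, would then bring the text into a two-axis orthogonal layout.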

Step S3, calculating the binarized data of each text block in the picture in the virtual coordinate system, specifically: each text block in the picture is projected onto the polar axis of the virtual coordinate system; a coordinate point that has a foreground pixel is marked black, and otherwise it is marked white.

Referring to Fig. 6 and Fig. 7, after the preprocessed picture is projected in the two-axis orthogonal virtual coordinate system, a black region appears perpendicular to the projection direction wherever the picture content contains a character text block, and a white region appears perpendicular to the projection direction wherever the picture content contains blank lines, spaces, English letters, or digits (i.e. "non-character text blocks").
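The black/white marking of step S3 amounts to a projection profile. A minimal sketch, assuming the preprocessed picture is a two-dimensional boolean or 0/1 NumPy array in which nonzero means foreground (the function name is hypothetical):

```python
import numpy as np

def project(binary, axis):
    """Project foreground pixels onto one axis of the picture.

    axis=1 collapses the columns and yields one mark per row;
    axis=0 collapses the rows and yields one mark per column.
    True stands for black (a foreground pixel exists on that line),
    False stands for white.
    """
    return np.asarray(binary, dtype=bool).any(axis=axis)

picture = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
])
row_marks = project(picture, axis=1)
column_marks = project(picture, axis=0)
```

Here `row_marks` alternates white, black, white, black, which is the kind of continuity pattern the method uses to separate lines of text.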

Step S4 is then performed to count the size and number of the text blocks in the picture, specifically: independent projection processing is applied to the binarized data in the picture relative to the polar axis of the virtual coordinate system, the width and height values of the character text blocks and non-character text blocks along the virtual coordinate system are recorded, and the respective numbers are counted and saved to a server database. Specifically, the server database comprises a MySQL database or an Oracle database, and is preferably a MySQL database. Referring to Fig. 8, if a text block is a Chinese character, its projected width in the X-axis direction is usually greater than that of an English letter or digit, and its projected height in the Y-axis direction is greater than that of an English letter or digit. This allows the type of each text block in the picture to be judged and screened efficiently, and independent projection processing is applied to the text blocks in the picture row by row or column by column.

In the present embodiment, a wider black region is marked as the region of a character text block (i.e. the text block is Chinese), a narrower black region is the region of a non-character text block (i.e. the text block is English or a digit), and the remaining white regions are also regions of non-character text blocks (i.e. regions containing no Chinese characters).
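Separating wider from narrower black regions can be sketched as a run-length pass over a projection profile. The 20px cutoff is an assumption borrowed from the lower end of the text block width range given in this embodiment, and both function names are hypothetical.

```python
import numpy as np

def black_runs(profile):
    """Return (start, length) for each run of consecutive black marks."""
    padded = np.concatenate(([0], np.asarray(profile, dtype=np.int8), [0]))
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)    # white-to-black transitions
    ends = np.flatnonzero(edges == -1)     # black-to-white transitions
    return [(int(s), int(e - s)) for s, e in zip(starts, ends)]

def split_by_width(runs, min_char_width=20):
    """Split runs into wider (Chinese character) and narrower blocks.

    Assumption: a run at least min_char_width pixels wide is treated as a
    Chinese character; narrower runs are treated as English or digits.
    """
    wide = [run for run in runs if run[1] >= min_char_width]
    narrow = [run for run in runs if run[1] < min_char_width]
    return wide, narrow

# One 25-pixel block, a 3-pixel gap, then an 8-pixel block.
profile = [0] + [1] * 25 + [0] * 3 + [1] * 8 + [0]
runs = black_runs(profile)
wide, narrow = split_by_width(runs)
```

The lengths of the two lists give the block counts, and the run lengths give the widths that step S4 records.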

Referring to Fig. 9, it should be noted that the present invention can project row by row along the X-axis, either from top to bottom or from bottom to top; it can likewise project column by column along the Y-axis, either from top to bottom or from bottom to top, thereby counting the size and number of the text blocks in the picture.

Referring to Fig. 9, in step S5, whether the picture is an advertisement picture is judged according to the set thresholds. The set thresholds in step S5 are specifically: the number T of character text blocks ranges from 50 to 300, the total area of the character text blocks as a percentage of the picture area ranges from 50 to 100, and the number of non-character text blocks ranges from 0 to 2T.

After all text blocks in the virtual coordinate system (both character text blocks and non-character text blocks) have been counted, whether the picture extracted from the e-mail is an advertisement picture can be judged from the statistics. Specifically, in the present embodiment, the width of a text block ranges from 20px to 40px and the height of a text block ranges from 35px to 60px.
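The threshold test of step S5 can be sketched as below. The text lists three ranges without saying how they combine; this sketch assumes all three must hold at once. The function name and the representation of blocks as (width, height) pairs are hypothetical.

```python
def is_advertisement(char_blocks, non_char_count, picture_area):
    """Apply the three threshold ranges of step S5.

    char_blocks: list of (width, height) pairs for Chinese-character blocks
    non_char_count: number of non-character text blocks
    picture_area: picture width times height, in pixels
    Assumption: the picture is an advertisement only if all ranges hold.
    """
    t = len(char_blocks)
    if not 50 <= t <= 300:
        return False
    char_area = sum(width * height for width, height in char_blocks)
    area_percent = 100.0 * char_area / picture_area
    if not 50 <= area_percent <= 100:
        return False
    return 0 <= non_char_count <= 2 * t

# 100 blocks of 30px by 40px cover 60% of a 200000-pixel picture.
blocks = [(30, 40)] * 100
print(is_advertisement(blocks, 50, 200_000))
```

With these inputs the count, the area percentage, and the non-character count all fall inside their ranges, so the picture is flagged.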

With the present invention, advertisement pictures contained in e-mail can be identified accurately, with a recognition rate of 99.99%, so that an e-mail containing such an advertisement picture is identified as spam. The recognition method can be applied in an anti-spam engine to improve the efficiency of identifying, filtering, and intercepting spam.

The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its protection scope; all equivalent embodiments or changes that do not depart from the spirit of the present invention shall be included within its protection scope.

It will be apparent to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced by the present invention. Any reference numeral in a claim should not be construed as limiting the claim concerned.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for the sake of clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims

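1. A method for recognizing advertisement pictures in e-mail, characterized in that the recognition method comprises the following steps:

S1, extracting a picture from the e-mail, preprocessing it, and then determining the arrangement direction of the text blocks;

S2, establishing a virtual coordinate system according to the arrangement direction of the text blocks;

S3, calculating the binarized data of each text block in the picture in the virtual coordinate system;

S4, counting the size and number of the text blocks in the picture;

S5, judging whether the picture is an advertisement picture according to set thresholds.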

2. The recognition method according to claim 1, characterized in that the preprocessing in step S1 comprises border processing, inversion processing, background removal, binarization, and noise reduction.

3. The recognition method according to claim 1, characterized in that step S2 is specifically: a matching virtual coordinate system is established for the picture according to the continuity of the projection result of the picture content on the virtual coordinate system.

4. The recognition method according to claim 1, characterized in that step S3 is specifically: each text block in the picture is projected onto the polar axis of the virtual coordinate system; a coordinate point that has a foreground pixel is marked black, and otherwise it is marked white.

5. The recognition method according to claim 1, characterized in that step S4 is specifically: independent projection processing is applied to the binarized data in the picture relative to the polar axis of the virtual coordinate system, the width and height values of the character text blocks and non-character text blocks along the virtual coordinate system are recorded, and the respective numbers are counted and saved to a server database.

6. The recognition method according to claim 5, characterized in that the server database comprises a MySQL database or an Oracle database.

7. The recognition method according to any one of claims 2 to 6, characterized in that the virtual coordinate system comprises a one-axis virtual coordinate system and a two-axis virtual coordinate system.

8. The recognition method according to claim 7, characterized in that the two-axis virtual coordinate system comprises a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.

9. The recognition method according to claim 1, characterized in that the set thresholds in step S5 are specifically: the number T of character text blocks ranges from 50 to 300, the total area of the character text blocks as a percentage of the picture area ranges from 50 to 100, and the number of non-character text blocks ranges from 0 to 2T.