CN102142088A

CN102142088A - Arabic recognition method and system based on effective Arabic feature extraction

Info

Publication number: CN102142088A
Application number: CN201010258343XA
Authority: CN
Inventors: 穆罕默德S·卡尔希德; 侯塞因K·艾尔奥玛依; 哈利德M·艾尔法依夫
Original assignee: Individual
Current assignee: Individual
Priority date: 2010-08-17
Filing date: 2010-08-17
Publication date: 2011-08-03
Anticipated expiration: 2030-08-17
Also published as: CN102142088B

Abstract

The invention relates to a method for automatically recognizing Arabic text, comprising the following steps: digitizing a line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; dividing the line of Arabic characters into a plurality of bar images; defining a plurality of cells in one of the bar images, each cell containing a group of adjacent pixels; arranging in sequence the pixel values in each cell of that bar image to form a binary cell number; constructing a text feature vector from the binary cell numbers of the cells of that bar image; and providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.

Description

Arabic recognition methods and system based on effective Arabic feature extraction

Technical field

The present patent application relates generally to the automatic recognition of Arabic text.

Background technology

Text recognition, that is, the automatic reading of text, is a branch of pattern recognition. Its goal is to read printed text with human accuracy and at higher speed. Most text recognition methods assume that the text can be separated into individual characters. Such techniques, although successful on typed or typeset Latin-script documents, cannot be applied reliably to cursive scripts such as Arabic. Earlier research on Arabic script recognition has confirmed that attempts to segment Arabic words into individual characters run into difficulties.

Arabic poses several challenges to text recognition systems. Arabic script is inherently cursive, and writing each character separately in block letters is not acceptable. Moreover, the shape of an Arabic letter is closely tied to its context: it depends on the position of the letter within a word. For example, a single letter can have four different shapes: one when it stands alone, one at the beginning of a word, one in the middle of a word, and one at the end of a word. In addition, not all Arabic letters connect to their neighbors within a word. Because spacing is also used to separate certain letters within a word, the boundary between two words is difficult to identify automatically.

Several classification approaches, such as statistical models, have been applied to Arabic text recognition. However, properly extracting text features remains a major obstacle to accurate recognition of Arabic text.

Summary of the invention

In one aspect, the present invention relates to a method for automatically recognizing Arabic text. The method comprises: obtaining a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value represented by a binary number; dividing the line of Arabic characters into a plurality of bar images; defining a plurality of cells in one of the bar images, each cell containing a group of adjacent pixels; arranging in sequence the pixel values in each cell of the bar image to form a binary cell number; constructing a text feature vector from the binary cell numbers obtained from the cells of the bar image; and feeding the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

In another aspect, the present invention relates to a method for automatically recognizing Arabic text. The method comprises: obtaining a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value represented by a binary number, the two-dimensional array consisting of a plurality of rows in a first direction and a plurality of columns in a second direction; counting the frequencies of consecutive pixels having the same pixel value in a column of pixels; constructing a text feature vector from the frequencies counted in the column of pixels; and feeding the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

In another aspect, the present invention relates to a method for automatically recognizing Arabic text. The method comprises: obtaining a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value; dividing the line of Arabic characters into a plurality of bar images; downscaling at least one of the bar images to produce a downscaled bar image; arranging in sequence the pixel values in each column of pixels of the downscaled bar image to form a sequence of numbers, wherein the sequence of numbers forms a text feature vector; and feeding the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

In another aspect, the present invention relates to a computer program product comprising a computer-usable medium with computer-readable program code embodied in it. The program code causes a computer to: obtain a text image containing a line of Arabic characters; digitize the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value represented by a binary number; divide the line of Arabic characters into a plurality of bar images; define a plurality of cells in one of the bar images, each cell containing a group of adjacent pixels; arrange in sequence the pixel values in each cell of the bar image to form a binary cell number; construct a text feature vector from the binary cell numbers obtained from the cells of the bar image; and feed the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

Implementations of the system may include one or more of the following. The method may further comprise converting the binary cell numbers to decimal cell numbers; arranging in sequence the decimal cell numbers obtained from the cells of one of the bar images to form a string of decimal cell numbers; and constructing the text feature vector from the string of decimal cell numbers obtained from the cells of that bar image. The two-dimensional array of pixels may comprise a plurality of rows in a first direction and a plurality of columns in a second direction. The line of Arabic characters may be arranged sequentially along the first direction. The plurality of bar images may be arranged sequentially along the first direction. At least one bar image may have a height defined by M rows in the first direction and a width defined by N columns in the second direction, where M and N are integers. The two-dimensional array of pixels may comprise N rows of pixels, where N ranges from 2 to about 100, or from 3 to about 10. The pixel values in the two-dimensional array of pixels may be represented by single-bit binary numbers or by multi-bit binary numbers. The hidden Markov model may be implemented with a hidden Markov model toolkit.

The systems and methods of the present invention provide comprehensive, quantitative, and accurate techniques for extracting features from Arabic text. Compared with the prior art, the disclosed Arabic character recognition is more efficient and requires less computation time. The disclosed methods and systems are also simpler and easier to implement than the prior art.

Although the present invention is shown and described with reference to several specific embodiments, those skilled in the relevant art will understand that various changes in form and detail can be made without departing from the spirit and scope of the invention.

Description of drawings

The following drawings, which form part of the specification, illustrate specific embodiments of the present invention and, together with the description, serve to explain the principles of the invention.

Fig. 1 is a flowchart of the Arabic text recognition steps of the present invention.

Fig. 2 shows a text image containing Arabic text.

Fig. 3A shows the text image divided into a plurality of bar images, each containing a plurality of pixels.

Fig. 3B and Fig. 3C show the pixels and pixel values in part of a bar image shown in Fig. 3A.

Fig. 4 shows a text feature extraction method of the present invention.

Fig. 5 is a flowchart of the text feature extraction steps shown in Fig. 4.

Fig. 6 shows another text feature extraction method of the present invention.

Figs. 7A-7D show another text feature extraction method of the present invention.

Fig. 8 is a flowchart of the text feature extraction steps shown in Figs. 7A-7D.

Embodiment

Fig. 1 shows the overall flow of the Arabic text recognition steps of the present invention. Referring to Figs. 1-3C, a text image 200 is obtained from Arabic text (step 110 of Fig. 1). The Arabic text in text image 200 may be arranged in multiple text lines 211-214, each containing a string of cursive Arabic characters. A text line 211-214 is divided into a plurality of bar images 311-313 (step 120 of Fig. 1). A bar image 311, 312 or 313 is in turn divided into pixels 321-323, each assigned a pixel value (step 130 of Fig. 1). The width of a bar image 311, 312 or 313 may range from 2 to 100 pixels, or from 3 to 10 pixels. A bar image 311, 312 or 313 may contain a complete character, part of a character, or several joined characters.
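The division of a text line into bar images can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the line image is a list of pixel rows and that each bar is a vertical slice of fixed width. The function name `split_into_bar_images` and the sample data are hypothetical.

```python
def split_into_bar_images(line_image, bar_width):
    """Split a text-line image (list of pixel rows) into vertical bar images.

    Each bar keeps the full height of the line and spans `bar_width` columns.
    The last bar may be narrower when the line width is not a multiple of it.
    """
    if bar_width < 1:
        raise ValueError("bar_width must be positive")
    total_width = len(line_image[0]) if line_image else 0
    bars = []
    for start in range(0, total_width, bar_width):
        # Take the same column range from every row of the line image.
        bars.append([row[start:start + bar_width] for row in line_image])
    return bars


# Hypothetical 2 x 7 line image cut into bars that are 3 pixels wide.
line = [
    [1, 1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 0],
]
bars = split_into_bar_images(line, 3)
print(len(bars), bars[0])
```

A width of 3 is used here only because it falls in the 3 to 10 pixel range mentioned in the text.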

A pixel value represents the intensity of text image 200 at a particular pixel location. In one embodiment, a high intensity value means a light (low-density) image color at the pixel, which may lie in the white background. A low intensity value means a dark (high-density) image color at the pixel, which may lie within a stroke of an Arabic character. Pixel values can be expressed in different number systems, such as binary, decimal, and hexadecimal.

Referring to Figs. 3A-3C, bar image 311 includes an image 320, which includes a plurality of pixels 321-323, each assigned a binary pixel value of "0" or "1". Pixel value "1" represents the white background, and pixel value "0" represents the dark image color (low intensity) within a stroke of an Arabic character. Note that the systems and methods of the present invention also apply to multi-bit binary pixel values, which can express multi-tone image intensity levels (for example, grayscale).
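As a rough sketch of how intensities might be turned into single-bit pixel values, the snippet below takes "1" for the light background and "0" for the dark stroke. The 8-bit intensity range and the fixed threshold of 128 are assumptions made for illustration; the text does not specify a binarization rule.

```python
def binarize(gray_image, threshold=128):
    """Map intensities to binary pixel values: 1 = light background, 0 = dark stroke."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray_image]


# Hypothetical 8-bit intensities: bright background on the left, stroke on the right.
gray = [
    [255, 250, 30],
    [240, 20, 10],
]
binary = binarize(gray)
print(binary)
```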

According to the present invention, text feature vectors can be extracted from text line 211 or from bar images 311-313 (step 140 of Fig. 1). Various text feature extraction methods are discussed in detail below in conjunction with Figs. 4-8. The exact form of the text feature vector can vary with the extraction method, as described below.

The feature vectors obtained in step 140 are input to a hidden Markov model (HMM) (step 150 of Fig. 1). In the present invention, the HMM can be implemented with the Hidden Markov Model Toolkit (HTK), a portable toolkit for building and manipulating hidden Markov models. HTK does not require a lexicon and relies on the character models and grammar derived from the training samples. The probabilistic interpretation provided by the HMM can tolerate pattern variations in the feature vectors. Much of the functionality of HTK is built into library modules written in C. The tools are designed to be run from a traditional command-line interface, so it is easy to write scripts that control the execution of the HTK tools.

The HMM can be trained with feature vectors obtained from text images containing known Arabic words, that is, images with a known data transcription (step 160 of Fig. 1). For training on samples, HTK is provided with a character model and a ground truth. The character modeling component takes the feature vectors and the corresponding ground truth and estimates the character models. The observations produced by the training samples are used to tune the model parameters, and those produced by the test samples are used to analyze system performance. Each state in the model represents a letter of the alphabet, and each feature vector corresponds to one observation. The HTK training tools can use the prepared training data to adjust the character model parameters so that they predict the known data transcription.

The HMM parameters are estimated from the ground truth of the training image segments. This segmentation can also be applied to the contour to find segmentation points; features are extracted from the resulting image segments, and the feature vectors are then converted into an observation sequence. This segmentation-based technique uses dynamic programming to match word images with character strings. The training phase takes as input scanned lines of text paired with the ground truth, which is the text corresponding to the text image. Each text line is then divided into narrow vertical windows, from which the feature vectors are extracted.

The trained HMM is then used, together with a dictionary and a language model, to recognize the Arabic text from the feature vectors (step 170 of Fig. 1). The recognition phase follows the same steps to extract the feature vectors, which are used with the different knowledge sources estimated in the training phase to find the character string with the highest likelihood. The recognition tool needs a network that describes the transition probabilities from one model to another. The dictionary and the language model can be input to the recognition tool to help the recognizer output the correct state sequence.

In some embodiments, referring to Figs. 3A-5, bar images 311-313 are digitized into an array of pixels 321-323, each with a pixel value (step 510 of Fig. 5). Bar image 311, as shown in Fig. 4, is divided into a plurality of cells 410-460 (step 520 of Fig. 5). Each of cells 410-460 contains a group of adjacent pixels, such as a 3 × 3 array of pixels. For example, cell 420 contains pixels 422, 423 and other pixels.

The pixel values in each cell are then represented as a binary cell number (step 530 of Fig. 5). The pixel values in each cell are first arranged in sequence. For example, the nine pixels 322-323 in cell 420 are read out over the three consecutive rows as 1, 1, 1, 1, 0, 0, 1, 0, 0. The sequence of binary pixel values is then mapped to a 9-bit binary cell number. The pixel value of pixel 322 is mapped to the most significant bit, and the pixel value of pixel 323 to the least significant bit. The pixel values in cell 420 are therefore represented by the 9-bit binary cell number 111100100. In the same way, the pixel values in cells 410-460 are converted to binary cell numbers 480 (Fig. 4), each of which lies between 0 and 511.

The binary cell numbers of the cells of bar image 311 are then converted to decimal cell numbers 490 (Fig. 4; step 540 of Fig. 5). The decimal cell numbers 490 are then arranged to form the feature vector of bar image 311 (step 550 of Fig. 5). Steps 520-550 are repeated for the other bar images. The feature vectors extracted from the different bar images 311-313 are input to the hidden Markov model to recognize the Arabic characters in the text line (step 560 of Fig. 5).
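Steps 520-550 can be sketched as below. This is an illustrative reading of the text under stated assumptions: cells are non-overlapping 3 × 3 blocks, pixels are read row by row with the first pixel as the most significant bit, and incomplete cells at the edges are dropped. The function name `cell_numbers` is hypothetical.

```python
def cell_numbers(bar_image, cell_height=3, cell_width=3):
    """Return the decimal cell numbers of a binary bar image, cell by cell.

    Pixels in a cell are read row by row. The first pixel read becomes the
    most significant bit and the last pixel the least significant bit.
    """
    rows, cols = len(bar_image), len(bar_image[0])
    numbers = []
    for top in range(0, rows - cell_height + 1, cell_height):
        for left in range(0, cols - cell_width + 1, cell_width):
            bits = [
                bar_image[r][c]
                for r in range(top, top + cell_height)
                for c in range(left, left + cell_width)
            ]
            # Join the bits into a binary string, then convert to decimal.
            binary_string = "".join(str(bit) for bit in bits)
            numbers.append(int(binary_string, 2))
    return numbers


# One 3 x 3 cell holding the pixel values quoted for cell 420.
cell_420 = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
feature_vector = cell_numbers(cell_420)
print(feature_vector)  # [484], since 0b111100100 == 484
```

With nine single-bit pixels per cell, every entry of the feature vector lies between 0 and 511, which matches the range given in the text.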

The extraction method described above in conjunction with Figs. 4-5 represents one embodiment of the text feature extraction flow shown in Fig. 1. It should be understood that this method also applies to multi-bit pixel values and to other numeric representations of the data string. For example, pixel values can be represented by 3-bit or 5-bit binary numbers, which capture grayscale (multi-tone) information in the text image. Multi-bit pixel values can improve the accuracy with which text features are described along stroke boundaries.

Furthermore, instead of a binary number, a pixel value can be expressed as a value in an interval between a minimum and a maximum. In some embodiments, pixel values can be linearly scaled (normalized) to a predetermined interval, such as [0, 1] or [-1, 1]. The pixel values can then be quantized. Feature vectors can be obtained by steps similar to steps 530-550.
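A possible form of the linear scaling and quantization is sketched below. The min-max scaling rule, the symmetric interval [-1, 1], the number of quantization levels, and the function names are assumptions chosen for illustration; the text only states that values are scaled to a predetermined interval and then quantized.

```python
def normalize(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant input carries no contrast; map everything to `low`.
        return [low for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]


def quantize(normalized, levels):
    """Map values in [0, 1] to integer levels 0 .. levels - 1."""
    return [min(int(v * levels), levels - 1) for v in normalized]


intensities = [0, 64, 128, 255]
unit_range = normalize(intensities)             # scaled to [0, 1]
symmetric = normalize(intensities, -1.0, 1.0)   # scaled to [-1, 1]
codes = quantize(unit_range, 4)                 # four quantization levels
print(codes)
```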

In some embodiments, referring to Fig. 6, the resolution of a bar image 610 is reduced (that is, scaled down) to form a downscaled bar image 620. For example, bar image 610 may be 60 pixels high, and the downscaled bar image 620 may be 20 pixels high, a reduction to 1/3 of the original size. The downscaled bar image 620 is digitized to form an array 630 of pixels, each with a pixel value. The pixel values in each column of array 630 are serialized into a binary number. The binary numbers of the different columns form a data string 640, which constitutes a feature vector. The feature vectors obtained from the bar images of the text line are fed to the hidden Markov model to recognize the Arabic characters in the text line (step 560 of Fig. 5).
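The downscale-then-serialize idea can be sketched as follows. The text does not say how the image is reduced, so this sketch assumes that only the height is reduced and that every third row is kept (nearest-neighbour subsampling). It also assumes the top pixel of a column becomes the most significant bit. Both function names are hypothetical.

```python
def downscale_rows(bar_image, factor):
    """Shrink the height by keeping every `factor`-th row (nearest-neighbour)."""
    return bar_image[::factor]


def column_numbers(bar_image):
    """Serialize each column top to bottom into one integer (top pixel = MSB)."""
    numbers = []
    for column in zip(*bar_image):
        numbers.append(int("".join(str(bit) for bit in column), 2))
    return numbers


# Hypothetical 6 x 2 binary bar image reduced to 2 x 2 (factor 3).
bar = [
    [1, 0],
    [1, 1],
    [1, 1],
    [0, 1],
    [0, 0],
    [0, 0],
]
small = downscale_rows(bar, 3)
feature_vector = column_numbers(small)
print(small, feature_vector)
```

A 60-pixel-high bar reduced by the same factor would give 20-bit column numbers, which is why reducing the height keeps the values manageable.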

Referring to Figs. 7A, 7B and 8, bar image 700 is digitized into an array of pixels (step 810 of Fig. 8), similarly to step 510 of Fig. 5. The pixels are arranged in a plurality of columns. The pixel values are represented by single-bit binary numbers with the value "1" or "0". The pixel values of each column are serialized into a string of single-bit binary numbers (step 830 of Fig. 8).

Next, shown in Fig. 7 C and 7D, calculate the frequency of contiguous pixels with identical scale-of-two pixel value " 1 " and " 0 ".Calculated rate ends number of transitions up to one.The frequency tabulation is formed frequency meter numerical table 750 and 760 (as the step 850 of Fig. 8).But in order to distinguish two row with the complementation of identical number of transitions pixel value, for example,

0 1

1 0

0 1

1 0

Frequency counting starts by counting pixel value "1" from the top pixel of a column. In the left column, the count for the initial pixel value "1" is "0", and the count for the following pixel value "0" is "3". The two columns with complementary pixel values give the following frequency counts:

0 3

3 2

2 2

2 1

1 0

0 0

It should be understood that, without departing from the spirit of the present invention, counting in each column can instead start with pixel value "0".

Each row (as Fig. 7 C, shown in the 7D) in frequency meter numerical table 750,760 is represented a saltus step on the pixel value, promptly changes to a dark text one's respective area (pixel value is " 0 ") from a white background (pixel value is " 1 "), and vice versa.For packed data, the frequency meter numerical table is by the end of a maximum number of transitions.

The frequency counts in each column of frequency count tables 750, 760 form a feature vector (step 860 of Fig. 8). In the example described here, a column can thus be used as a vector. The feature vectors formed from the different columns of the bar image are input to the hidden Markov model (step 870 of Fig. 8).
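The counting rule can be sketched as a run-length count that always starts with pixel value "1". The two eight-pixel columns below are hypothetical: they were chosen so that the sketch reproduces the counts quoted in the text (0, 3, 2, 2, 1, 0 and 3, 2, 2, 1, 0, 0). Truncating or zero-padding the result to a fixed length equal to the cutoff of 6 is an assumption about how the cutoff is applied.

```python
def run_length_features(column, cutoff=6, first_value=1):
    """Count runs of equal pixels down a column, starting with `first_value`.

    If the column does not start with `first_value`, the first count is 0.
    The result is truncated or zero-padded to exactly `cutoff` entries.
    """
    counts = []
    expected = first_value
    index = 0
    while index < len(column) and len(counts) < cutoff:
        run = 0
        while index < len(column) and column[index] == expected:
            run += 1
            index += 1
        counts.append(run)
        expected = 1 - expected  # the next run must have the other binary value
    counts.extend([0] * (cutoff - len(counts)))
    return counts


# Two complementary columns, listed from the top pixel to the bottom pixel.
left = [0, 0, 0, 1, 1, 0, 0, 1]
right = [1, 1, 1, 0, 0, 1, 1, 0]
print(run_length_features(left))   # [0, 3, 2, 2, 1, 0]
print(run_length_features(right))  # [3, 2, 2, 1, 0, 0]
```

Because the first count is reserved for pixel value "1", the leading zero in the left vector is what separates the two complementary columns.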

The maximum number of transitions is determined by statistical analysis of a large sample of Arabic text. As shown in Table 1, about 99.31% of the columns have 6 or fewer transitions. In other words, choosing 6 as the transition cutoff is suitable for the vast majority of text images.

Table 1. Number of transitions per column in the corpus

Transitions in a column	Number of columns	Percentage
0	3003663	18.44%
1	95418	0.59%
2	7694625	47.24%
3	74196	0.46%
4	4231776	25.98%
5	45013	0.28%
6	1028765	6.32%
<= 6		99.31%
7	7403	0.04%
8	94771	0.57%
9	900	0.01%
10	9543	0.05%
12	1367	0.01%
<= 12		0.01%
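The 99.31% figure can be checked by adding up the Table 1 percentages for 0 to 6 transitions. The short sketch below does only that arithmetic; the helper name `coverage` is hypothetical.

```python
# Percentages from Table 1, keyed by the number of transitions in a column.
share = {0: 18.44, 1: 0.59, 2: 47.24, 3: 0.46, 4: 25.98, 5: 0.28, 6: 6.32}


def coverage(share_by_transitions, cutoff):
    """Percentage of columns with at most `cutoff` transitions."""
    total = sum(p for n, p in share_by_transitions.items() if n <= cutoff)
    return round(total, 2)


covered = coverage(share, 6)
print(covered)  # 99.31
```

Lowering the cutoff to 4 would cover only about 92.71% of the columns under the same figures, which illustrates why 6 is a reasonable choice.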

When building the HMM for this system, the type of feature vector used to train and test the system is configured first. Feature vectors can be continuous or discrete. A system that uses continuous feature vectors feeds a set of coefficients, or sometimes a matrix, into the model. In a discrete system, a single coefficient is fed into the model. Vector quantization converts continuous vectors into discrete vectors, and it can be carried out with the HTK tools HQuant and HCopy. HQuant builds a codebook from the training data, and the codebook is then used with HCopy to generate the discrete vectors. The effect of the codebook on system performance depends on its size and on the amount of data used to build it. HQuant builds the codebook with the linear vector quantization algorithm, which is computationally very expensive. The present invention introduces a new method, called Unique Vector Quantization (UVQ), to reduce computation time and improve system performance. The key to the method is reducing the number of feature vectors used to build the codebook: duplicate feature vectors are deleted before the linear vector quantization algorithm is run, so that only one copy of each feature vector is kept. As shown in Table 2, the number of feature vectors in the corpus is greatly reduced.
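The deduplication step at the heart of UVQ can be sketched in a few lines. This is only an illustration of "keep one copy of each feature vector": the function names and sample vectors are hypothetical, and the subsequent codebook construction (done with HQuant in the text) is not shown.

```python
def unique_vectors(vectors):
    """Keep one copy of each feature vector, preserving first-seen order."""
    # Tuples are hashable, so a dict can serve as an ordered set of vectors.
    return [list(v) for v in dict.fromkeys(tuple(v) for v in vectors)]


def reduction_percent(total, unique):
    """Percentage of vectors removed by deduplication."""
    return 100.0 * (total - unique) / total


# Hypothetical feature vectors with repeats.
vectors = [[0, 3, 2], [3, 2, 2], [0, 3, 2], [0, 3, 2], [3, 2, 2]]
kept = unique_vectors(vectors)
saved = reduction_percent(len(vectors), len(kept))
print(kept, saved)
```

Only the `kept` vectors would then be passed to the codebook builder, which is where the reported saving in computation time would come from.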

Table 2. Number of unique vectors in the corpus

Lines in the corpus	Number of vectors	Number of unique vectors	Reduction
10,000	12,285,426	413,410	96.64%
15,000	16,288,252	591,673	96.37%

When we tried to build a codebook from all the feature vectors of 2,000 different bar images, the largest codebook we could build had a size of 728. Building this codebook took about 9 hours, whereas building a codebook of size 1024 from unique feature vectors took only 1 hour 30 minutes. Table 3 shows the recognition rates obtained in experiments with single-source models. When unique feature vectors are combined with the linear vector quantization algorithm, the codebook size increases, the computation time drops to one sixth, and the recognition rate improves.

Table 3. Recognition rates with and without unique vectors

Codebook type	Codebook size	Creation time	Recognition rate
Without UVQ	728	9 hours	83.59%
With UVQ	1024	1 hour 30 minutes	85.22%

It should be understood that the methods described above are not limited to those used in the specific embodiments. The specific form can vary without departing from the spirit of the present invention. For example, a transition cutoff other than 6 can be chosen. The height and width of a bar image, and the size of the cells in a bar image, can also differ from those in the examples above. The form of the text feature vector can likewise vary with the extraction method. For example, a feature vector can be a string of binary numbers, decimal numbers, or numbers in another number system.

Claims

1. A method for automatically recognizing Arabic text, comprising:

obtaining a text image containing a line of Arabic characters;

digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value, wherein the pixel value is represented by a binary number;

dividing the line of Arabic characters into a plurality of bar images;

defining a plurality of cells in one of the plurality of bar images, wherein each cell contains a group of adjacent pixels;

arranging in sequence the pixel values in each of the plurality of cells in the one of the plurality of bar images to form a binary cell number;

constructing a text feature vector from the binary cell numbers obtained from the plurality of cells in the one of the plurality of bar images; and

providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.

2. the method for claim 1 is characterized in that, further comprises

Convert described binary cell numbering to decimal location numbering;

The decimal location number order that derives from a plurality of unit of an image in a plurality of the images is arranged a string decimal location numbering of formation;

This string decimal location numbering structure text feature vector according to a plurality of unit that derive from an image in a plurality of the images.

3. the method for claim 1 is characterized in that, this pixel two-dimensional array comprises multirow on the first direction and the multiple row on the second direction, and the Arabic character of this row is arranged along this first direction, and described a plurality of images are along this first direction series arrangement.

4. method as claimed in claim 3 is characterized in that, this pixel two-dimensional array comprises the capable pixel of N, and the height of at least one image is by capable qualification of M of first direction in the wherein said multiple bar chart picture, and wide N row by second direction limit, and M and N are integer.

5. method as claimed in claim 4 is characterized in that the span of N is between 2 to 100.

6. method as claimed in claim 5 is characterized in that the span of N is between 3 to 10.

7. the method for claim 1 is characterized in that, the pixel value in this pixel two-dimensional array adopts the single-bit binary number representation.

8. the method for claim 1 is characterized in that, the pixel value in this pixel two-dimensional array adopts many bit binary number to represent.

9. A method for automatically recognizing Arabic text, comprising:

obtaining a text image containing a line of Arabic characters;

digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value;

dividing the line of Arabic characters into a plurality of bar images;

downscaling at least one of the bar images to produce a downscaled bar image;

arranging in sequence the pixel values in each column of pixels of the downscaled bar image to form a sequence of numbers, wherein the sequence of numbers forms a text feature vector;

10. The method of claim 9, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is arranged along the first direction, and the plurality of bar images are arranged sequentially along the first direction.

11. The method of claim 10, wherein the two-dimensional array of pixels comprises N rows of pixels, and at least one of the plurality of bar images has a height defined by M rows in the first direction and a width defined by N columns in the second direction, M and N being integers.

12. A method for automatically recognizing Arabic text, comprising:

obtaining a text image containing a line of Arabic characters;

digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value represented by a binary number, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction;

counting the frequencies of consecutive pixels having the same pixel value in a column of pixels;

constructing a text feature vector from the frequencies counted in the column of pixels;

13. The method of claim 12, wherein the frequencies of consecutive pixels having the same pixel value are counted up to a predefined transition cutoff.

14. The method of claim 13, wherein the transition cutoff is 6.

15. The method of claim 12, wherein the pixel values in the two-dimensional array are represented by single-bit binary numbers.

16. The method of claim 15, wherein the step of counting the frequencies comprises starting the count with pixels having pixel value "1" in the column of pixels.

17. The method of claim 16, wherein, if the pixel value of the first pixel in the column is "0", the frequency count is "0".

18. A system for automatically recognizing Arabic text, comprising a computer-usable medium having computer-readable program code embodied therein, the program code causing a computer to:

obtain a text image containing a line of Arabic characters;

digitize the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value, wherein the pixel value is represented by a binary number;

divide the line of Arabic characters into a plurality of bar images, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is arranged along the first direction, the plurality of bar images are arranged sequentially along the first direction, and the width of a bar image ranges from 2 to 100 columns of pixels;

define a plurality of cells in one of the plurality of bar images, each cell containing a group of adjacent pixels;

arrange in sequence the pixel values in each of the plurality of cells in the one of the plurality of bar images to form a binary cell number;

19. The system of claim 18, wherein the computer-readable program code embodied in the medium causes the computer to:

convert the binary cell numbers to decimal cell numbers;

20. The system of claim 18, wherein the two-dimensional array of pixels comprises N rows of pixels, and at least one of the plurality of bar images has a height defined by M rows in the first direction and a width defined by N columns in the second direction, M and N being integers.