CN102142088B

CN102142088B - Arabic text recognition method and system based on effective Arabic text feature extraction

Info

Publication number: CN102142088B
Application number: CN 201010258343
Authority: CN
Inventors: 穆罕默德S·卡尔希德; 侯塞因K·艾尔奥玛依; 哈利德M·艾尔法依夫
Original assignee: Individual
Current assignee: Individual
Priority date: 2010-08-17
Filing date: 2010-08-17
Publication date: 2013-01-23
Anticipated expiration: 2030-08-17
Also published as: CN102142088A

Abstract

The invention relates to a method for automatically recognizing Arabic text, comprising the following steps: digitizing a line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; dividing the line of Arabic characters into a plurality of strip images; defining a plurality of cells in one of the strip images, each cell containing a group of adjacent pixels; serializing the pixel values in each cell of that strip image to form a binary cell number; constructing a text feature vector from the binary cell numbers of the cells in that strip image; and providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.

Description

Arabic text recognition methods and system based on effective Arabic text feature extraction

Technical field

The present patent application relates generally to the automatic recognition of Arabic text.

Background technology

Text recognition, the automatic reading of text, is a branch of pattern recognition. Its goal is to read printed text with human accuracy and at higher speed. Most text recognition methods assume that the text can be segmented into independent characters. Although such techniques work well on typed or typeset Latin-script documents, they cannot be applied reliably to cursive scripts such as Arabic. Earlier research on Arabic script recognition has shown that segmenting Arabic words into individual characters is difficult.

Arabic poses several challenges to text recognition systems. Arabic script is inherently cursive, and writing each character separately in block letters is not acceptable. Moreover, the shape of an Arabic letter is closely tied to its context: it depends on the position of the letter within a word. For example, a single letter can take four different shapes: an isolated form when it stands alone, an initial form at the beginning of a word, a medial form in the middle of a word, and a final form at the end of a word. [Inline images of the letter forms and example words are not reproduced here.] In addition, not every Arabic letter connects to its neighbors within a word. Because spaces are also used to separate certain letters within a word, the boundary between two words is difficult to identify automatically.

Various classification systems, such as statistical models, have been applied to Arabic text recognition. However, properly extracting text features remains the main obstacle to accurate recognition of Arabic text.

Summary of the invention

In one aspect, the present invention relates to a method for automatically recognizing Arabic text. The method comprises: acquiring a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; dividing the line of Arabic characters into a plurality of strip images; defining a plurality of cells in one of the strip images, each cell comprising a group of adjacent pixels; serializing the pixel values in each cell of the strip image to form a binary cell number; constructing a text feature vector from the binary cell numbers obtained from the cells of the strip image; and feeding the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

In another aspect, the present invention relates to a method for automatically recognizing Arabic text. The method comprises: acquiring a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number, the two-dimensional array consisting of a plurality of rows in a first direction and a plurality of columns in a second direction; counting the frequencies of consecutive pixels having the same pixel value in a column of pixels; constructing a text feature vector from the frequencies counted in the column of pixels; and feeding the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

In another aspect, the present invention relates to a method for automatically recognizing Arabic text. The method comprises: acquiring a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value; dividing the line of Arabic characters into a plurality of strip images; scaling down at least one of the strip images to produce a reduced strip image; serializing the pixel values in each column of pixels of the reduced strip image to form a sequence of numbers, the sequence constituting a text feature vector; and feeding the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

In another aspect, the present invention relates to a computer program product comprising a computer-usable medium in which computer-readable program code is embedded. The program code causes a computer to: acquire a text image containing a line of Arabic characters; digitize the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; divide the line of Arabic characters into a plurality of strip images; define a plurality of cells in one of the strip images, each cell comprising a group of adjacent pixels; serialize the pixel values in each cell of the strip image to form a binary cell number; construct a text feature vector from the binary cell numbers obtained from the cells of the strip image; and feed the text feature vector into a hidden Markov model to recognize the line of Arabic characters.

Implementations of the system may include one or more of the following. The method may further comprise: converting the binary cell numbers to decimal cell numbers; serializing the decimal cell numbers obtained from the cells of one of the strip images to form a string of decimal cell numbers; and constructing the text feature vector from that string of decimal cell numbers. The two-dimensional array of pixels may comprise a plurality of rows in a first direction and a plurality of columns in a second direction. The line of Arabic characters may be arranged along the first direction, and the strip images may be arranged sequentially along the first direction. The height of at least one strip image may be defined by M rows in the first direction and its width by N columns in the second direction, where M and N are integers. The two-dimensional array of pixels may comprise N rows of pixels, where N ranges from 2 to about 100, or from 3 to about 10. The pixel values in the two-dimensional array may be expressed as single-bit binary numbers or as multi-bit binary numbers. The hidden Markov model may be implemented with the Hidden Markov Model Toolkit.

The systems and methods of the present invention provide comprehensive, quantitative, and accurate techniques for extracting features from Arabic text. Compared with the prior art, the disclosed Arabic character recognition is more efficient and requires less computation time. The disclosed methods and systems are also simpler and easier to implement.

Although the present invention is shown and described with reference to several embodiments, those skilled in the relevant art will understand that various changes in form and detail may be made without departing from the spirit and scope of the invention.

Description of drawings

The following drawings, which form part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain its principles.

Fig. 1 is a flowchart of the Arabic text recognition steps of the present invention.

Fig. 2 shows a text image containing Arabic text.

Fig. 3A shows the text image divided into a plurality of strip images, each comprising a plurality of pixels.

Fig. 3B and Fig. 3C show the pixels and pixel values of a portion of the strip image shown in Fig. 3A.

Fig. 4 shows one text feature of the present invention.

Fig. 5 is a flowchart of the extraction steps for the text feature shown in Fig. 4.

Fig. 6 shows another text feature of the present invention.

Figs. 7A-7D show another text feature of the present invention.

Fig. 8 is a flowchart of the extraction steps for the text feature shown in Figs. 7A-7D.

Embodiment

Fig. 1 shows the overall flow of Arabic text recognition according to the present invention. Referring to Figs. 1-3C, a text image 200 is acquired from Arabic text (step 110 of Fig. 1). The Arabic text in text image 200 may be arranged in multiple text lines 211-214, each comprising a string of cursive Arabic characters. Each text line 211-214 is divided into a plurality of strip images 311-313 (step 120 of Fig. 1). Each strip image 311, 312, or 313 is in turn divided into pixels 321-323, and each pixel is assigned a pixel value (step 130 of Fig. 1). The width of strip image 311, 312, or 313 may range from 2 pixels to 100 pixels, or from 3 pixels to 10 pixels. A strip image 311, 312, or 313 may contain a complete character, part of a character, or several connected characters.
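The division of a text line into strip images (step 120) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_strips` is hypothetical, the strips are assumed to be non-overlapping vertical windows of equal width, and a narrower final strip is kept as-is.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values


def split_into_strips(line_image: Image, strip_width: int) -> List[Image]:
    """Cut a text-line image into vertical strips of `strip_width` columns.

    Assumption: strips do not overlap; the last strip may be narrower.
    """
    if strip_width < 1:
        raise ValueError("strip_width must be positive")
    n_cols = len(line_image[0]) if line_image else 0
    strips = []
    for start in range(0, n_cols, strip_width):
        # Take the same column range from every row of the line image.
        strips.append([row[start:start + strip_width] for row in line_image])
    return strips


# Toy 2-row, 7-column line image (values are illustrative only).
line = [
    [1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
]
strips = split_into_strips(line, 3)
```

With a strip width of 3, the 7-column toy image yields two full strips and one single-column remainder; whether such a remainder should be padded or discarded is not specified in the description.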

The pixel value represents the intensity of text image 200 at a particular pixel location. In one embodiment, a high intensity value indicates a light (low-density) image color at the pixel, which may lie in the white background. A low intensity value indicates a dark (high-density) image color at the pixel, which may lie in a stroke of an Arabic character. Pixel values can be expressed in different numeral systems, such as binary, decimal, and hexadecimal.

Referring to Figs. 3A-3C, strip image 311 includes an image portion 320, which includes a plurality of pixels 321-323, each assigned a binary pixel value of "0" or "1". A pixel value of "1" represents the white background, and a pixel value of "0" represents the dark image color (low intensity value) of an Arabic stroke. It should be noted that the systems and methods of the present invention also apply to multi-bit binary pixel values, which can express multi-tone image intensity levels (for example, grayscale).
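One simple way to obtain single-bit pixel values from intensity values is a fixed threshold. The sketch below is an assumption for illustration: the description does not say how the binary values are derived, and the 8-bit intensity range and the threshold of 128 are hypothetical choices. It follows the convention used in this description, with "1" for bright background and "0" for dark stroke pixels.

```python
def binarize(gray, threshold=128):
    """Map intensity values to single-bit pixel values.

    Assumption: intensities are 8-bit (0-255) and a fixed threshold is used.
    Bright pixels (background) become 1, dark pixels (strokes) become 0.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


# Toy 2 x 2 intensity patch (illustrative values).
patch = [
    [255, 10],
    [127, 128],
]
binary_patch = binarize(patch)
```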

According to the present invention, text feature vectors can be extracted from text line 211 or from strip images 311-313 (step 140 of Fig. 1). Various text features are discussed in detail below with reference to Figs. 4-8. The exact form of the text feature vector varies with the extraction method, as described below.

The feature vectors obtained in step 140 are fed into a hidden Markov model (HMM) (step 150 of Fig. 1). In the present invention, the HMM can be implemented with the Hidden Markov Model Toolkit (HTK), a portable toolkit for building and manipulating hidden Markov models. HTK is lexicon-free and relies on character models and a grammar derived from training samples. The probabilistic interpretation provided by the HMM can tolerate variations in the patterns found in the feature vectors. Much of the functionality of HTK is built into library modules written in C. The tools are designed to run from a traditional command-line interface, so it is easy to write scripts that control the execution of HTK tools.

The HMM can be trained using feature vectors obtained from text images containing known Arabic words, that is, images with a known transcription (step 160 of Fig. 1). For training, HTK must be given a character model and a ground truth. The character modeling component takes the feature vectors and the corresponding ground truth and estimates the character models. The observations produced from the training samples are used to tune the model parameters, while those produced from the test samples are used to evaluate system performance. Each state in the model represents a letter of the alphabet, and each feature vector corresponds to one observation. The HTK training tools use the prepared training data to adjust the character model parameters so that they predict the known transcription.

The HMM parameters are estimated from the ground truth of the training image segments. Segmentation may also be applied to the contour to find segmentation points; features are extracted from the resulting image segments, and the feature vectors are then converted into an observation sequence. This segmentation-based technique uses dynamic programming to match word images with character strings. The training phase takes as input carefully verified text lines paired with their ground truth, where the ground truth is the text that corresponds to the text image. Each text line is then divided into narrow vertical windows, from which feature vectors are extracted.

The trained HMM is then used, together with a dictionary and a language model, to recognize the Arabic text from the feature vectors (step 170 of Fig. 1). The recognition phase follows the same steps to extract feature vectors, which are used together with the knowledge sources estimated in the training phase to find the character sequence with the highest likelihood. The recognition tool requires a network that describes the transition probabilities from one model to another. The dictionary and the language model can be supplied to the recognition tool to help the recognizer output the correct state sequence.
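Finding the most likely state sequence for an observation sequence is commonly done with Viterbi decoding. The sketch below is a generic toy illustration for a discrete HMM, not the HTK recognizer and not a procedure stated in this description; the two states "A" and "B", the observation symbols, and all probabilities are hypothetical, and the observation sequence is assumed to be non-empty.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for a non-empty observation sequence."""
    # best[s] = (log probability, path) of the best path that ends in state s
    best = {
        s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for s.
            score, path = max(
                (best[p][0] + log_trans[p][s], best[p][1]) for p in states
            )
            new_best[s] = (score + log_emit[s][obs], path + [s])
        best = new_best
    return max(best.values())[1]


def logs(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical two-state model with two discrete observation symbols.
states = ["A", "B"]
log_start = {"A": math.log(0.6), "B": math.log(0.4)}
log_trans = logs({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
log_emit = logs({"A": {0: 0.9, 1: 0.1}, "B": {0: 0.2, 1: 0.8}})

path = viterbi([0, 0, 1, 1], states, log_start, log_trans, log_emit)
```

Working in log probabilities avoids numerical underflow on long observation sequences, which matters when every column or cell of a text line contributes an observation.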

In some embodiments, referring to Figs. 3A-5, strip images 311-313 are digitized into an array of pixels 321-323, each having a pixel value (step 510 of Fig. 5). Strip image 311, as shown in Fig. 4, is divided into a plurality of cells 410-460 (step 520 of Fig. 5). Each of cells 410-460 comprises a group of adjacent pixels, such as a 3 × 3 array of pixels. For example, cell 420 comprises pixels 422, 423, and other pixels.

The pixel values in each cell are then represented as a binary cell number (step 530 of Fig. 5). The pixel values in each cell are first serialized. For example, the nine pixels 322-323 in cell 420 are serialized across three consecutive rows as 1, 1, 1, 1, 0, 0, 1, 0, 0. The sequence of binary pixel values is then mapped to a 9-bit binary cell number. The pixel value of pixel 322 is mapped to the most significant bit, and the pixel value of pixel 323 is mapped to the least significant bit. The pixel values in cell 420 are therefore represented as the 9-bit binary cell number 111100100. In the same way, the pixel values in cells 410-460 are converted to binary cell numbers 480 in Fig. 4. Each binary cell number lies in the range 0 to 511.
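The mapping from a cell to its binary cell number can be sketched as below. The function name `cell_number` is hypothetical, and the 3 × 3 layout of the example cell is an assumption chosen so that a row-by-row reading reproduces the sequence 1, 1, 1, 1, 0, 0, 1, 0, 0 given in the text.

```python
def cell_number(cell):
    """Serialize a cell row by row and read the bits as one binary number.

    The first pixel becomes the most significant bit and the last pixel the
    least significant bit.
    """
    value = 0
    for row in cell:
        for bit in row:
            value = (value << 1) | bit
    return value


# Assumed layout of cell 420: three rows read left to right, top to bottom.
cell_420 = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
number = cell_number(cell_420)
bits = format(number, "09b")
```

For a 3 × 3 cell of single-bit pixels the result always fits in 9 bits, which is why the cell numbers fall between 0 and 511.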

The binary cell numbers of the cells in strip image 311 are then converted to decimal cell numbers 490 in Fig. 4 (step 540 of Fig. 5). The decimal cell numbers 490 are then serialized to form the feature vector of strip image 311 (step 550 of Fig. 5). Steps 520-550 are repeated for the other strip images. The feature vectors extracted from the different strip images 311-313 are fed into the hidden Markov model to recognize the Arabic characters in the text line (step 560 of Fig. 5).
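Steps 520-550 for a single strip image can be sketched as follows. This is an illustration under stated assumptions: the function name `strip_feature_vector` is hypothetical, the cells are assumed to be non-overlapping 3 × 3 blocks scanned top to bottom and left to right, and any leftover rows or columns that do not fill a whole cell are ignored.

```python
def strip_feature_vector(strip, cell_h=3, cell_w=3):
    """Return the decimal cell numbers of a binary strip image as a list.

    Assumption: non-overlapping cells; partial cells at the edges are skipped.
    """
    features = []
    for top in range(0, len(strip) - cell_h + 1, cell_h):
        for left in range(0, len(strip[0]) - cell_w + 1, cell_w):
            value = 0
            for row in strip[top:top + cell_h]:
                for bit in row[left:left + cell_w]:
                    # First pixel of the cell ends up as the most significant bit.
                    value = (value << 1) | bit
            features.append(value)
    return features


# Toy strip, 6 pixels high and 3 pixels wide (two stacked 3 x 3 cells).
strip = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
]
vector = strip_feature_vector(strip)
```

Python integers carry no base, so the conversion from binary to decimal cell numbers in step 540 is only a matter of how the values are printed or stored.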

The extraction method described above with reference to Figs. 4-5 is one embodiment of the text feature extraction shown in Fig. 1. It should be understood that the text feature described above applies equally to multi-bit pixel values and to other numeric representations of the data string. For example, pixel values can be expressed as 3-bit or 5-bit binary numbers, which capture grayscale (multi-tone) information in the text image. Multi-bit pixel values can improve the precision with which text features are described along stroke contours.

Furthermore, instead of binary numbers, pixel values can be expressed as values in an interval between a minimum and a maximum. In some embodiments, the pixel values can be linearly scaled (normalized) to a predetermined interval, such as [0, 1] or [-1, 1]. The pixel values can then be quantized. Feature vectors can be obtained by steps similar to steps 530-550.
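Linear scaling of pixel values to a predetermined interval can be sketched as below. The function name `rescale` and the assumed 8-bit input range (0 to 255) are illustrative choices, not values taken from the description; the target interval is a parameter.

```python
def rescale(values, lo=0.0, hi=1.0, vmin=0, vmax=255):
    """Linearly map values from [vmin, vmax] to [lo, hi].

    Assumption: vmax > vmin, and inputs lie within [vmin, vmax].
    """
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


unit_interval = rescale([0, 51, 255])            # target interval [0, 1]
symmetric = rescale([0, 255], lo=-1.0, hi=1.0)   # a symmetric target interval
```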

In some embodiments, referring to Fig. 6, the resolution of a strip image 610 is scaled down to form a reduced strip image 620. For example, the height of strip image 610 may be 60 pixels, and the height of reduced strip image 620 may be 20 pixels, that is, one third of the original. The reduced strip image 620 is digitized to form a pixel array 630, in which each pixel has a pixel value. The pixel values in each column of array 630 are serialized into a binary number. The binary numbers of the different columns form a data string 640, which constitutes a feature vector. The feature vectors obtained from the strip images of a text line are fed into the hidden Markov model to recognize the Arabic characters in the text line (step 560 of Fig. 5).
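The reduce-then-serialize feature can be sketched as follows. The description does not state how the strip is scaled down, so this sketch assumes the simplest option, keeping every third row (nearest-neighbour subsampling); the function names `downscale_rows` and `column_strings` are hypothetical.

```python
def downscale_rows(strip, factor=3):
    """Reduce the height of a strip by keeping every `factor`-th row.

    Assumption: plain row subsampling; the source does not name a method.
    """
    return strip[::factor]


def column_strings(strip):
    """Serialize each column, top to bottom, into a string of binary digits."""
    return ["".join(str(pixel) for pixel in column) for column in zip(*strip)]


# Toy strip, 6 rows by 2 columns, reduced to 2 rows.
strip = [
    [1, 0],
    [1, 1],
    [1, 1],
    [0, 1],
    [0, 0],
    [1, 1],
]
reduced = downscale_rows(strip, 3)
data_string = column_strings(reduced)
```

Subsampling can drop thin strokes entirely; an averaging or majority-vote reduction would be a reasonable alternative under the same description.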

Referring to Figs. 7A, 7B, and 8, strip image 700 is digitized into an array of pixels (step 810 of Fig. 8), similarly to step 510 of Fig. 5. The pixels are arranged in columns. Each pixel value is a single-bit binary number, either "1" or "0". The pixel values in each column are serialized into a string of single-bit binary numbers (step 830 of Fig. 8).

Next, as shown in Figs. 7C and 7D, the frequencies of consecutive pixels having the same binary pixel value, "1" or "0", are counted. The frequencies are counted up to a cutoff number of transitions and tabulated to form frequency count tables 750 and 760 (step 850 of Fig. 8). Two columns with complementary pixel values can have the same number of transitions, for example:

Left column	Right column
0	1
1	0
0	1
1	0

To distinguish such columns, the frequency count always starts by counting the "1" pixels at the top of the column. The left column begins with a "0" pixel, so its first count, for pixel value "1", is "0", and its next count, for pixel value "0", is "3". The two complementary columns yield the following frequency counts:

Left column	Right column
0	3
3	2
2	2
2	1
1	0
0	0

It should be understood that, without departing from the spirit of the present invention, the count for each column may instead start with pixel value "0".

In frequency count tables 750 and 760 (Figs. 7C and 7D), each row represents one transition in pixel value, that is, a change from white background (pixel value "1") to a dark text region (pixel value "0") or vice versa. To compress the data, the frequency count tables are truncated at a maximum number of transitions.

The frequency counts in each column of frequency count tables 750 and 760 form a feature vector (step 860 of Fig. 8). Thus, in the present example, each column serves as one vector. The feature vectors formed from the different columns of the strip image are fed into the hidden Markov model (step 870 of Fig. 8).
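The run-length counting described above can be sketched as follows. The function name `transition_counts` is hypothetical, and padding short vectors with zeros up to the cutoff is an assumption consistent with the six-entry counts listed in the text. The two example pixel columns are likewise an assumption, chosen so that they are complementary and reproduce the frequency counts given above.

```python
def transition_counts(column, cutoff=6):
    """Count runs of equal pixel values in a column, starting with value 1.

    If the column starts with 0, the first count (for value 1) is 0.
    The result is truncated or zero-padded to `cutoff` entries.
    """
    counts = []
    expected = 1  # the pixel value whose run is currently being counted
    run = 0
    for pixel in column:
        if pixel == expected:
            run += 1
        else:
            counts.append(run)
            expected = 1 - expected
            run = 1
    counts.append(run)
    counts = counts[:cutoff]
    return counts + [0] * (cutoff - len(counts))


# Assumed complementary pixel columns, read from top to bottom.
left_column = [0, 0, 0, 1, 1, 0, 0, 1]
right_column = [1, 1, 1, 0, 0, 1, 1, 0]

left_vector = transition_counts(left_column)
right_vector = transition_counts(right_column)
```

Because counting always starts with value "1", the two complementary columns produce different vectors even though they contain the same number of transitions.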

The maximum number of transitions is determined by statistical analysis of a large sample of Arabic text. As Table 1 shows, about 99.31% of columns have 6 or fewer transitions. In other words, a transition cutoff of 6 is suitable for the vast majority of text images.

Table 1. Number of transitions per column in the corpus

Transitions in a column	Number of columns	Percentage
0	3003663	18.44%
1	95418	0.59%
2	7694625	47.24%
3	74196	0.46%
4	4231776	25.98%
5	45013	0.28%
6	1028765	6.32%
<= 6		99.31%
7	7403	0.04%
8	94771	0.57%
9	900	0.01%
10	9543	0.05%
12	1367	0.01%
<= 12		0.01%
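The choice of cutoff can be checked against the percentages in Table 1 with a short calculation. The percentages for 0 to 6 transitions are copied from the table; the function name `smallest_cutoff` and the 99% coverage target are assumptions introduced only for this illustration.

```python
# Percentage of columns by number of transitions (Table 1, rows 0 to 6).
PERCENT_BY_TRANSITIONS = {
    0: 18.44, 1: 0.59, 2: 47.24, 3: 0.46, 4: 25.98, 5: 0.28, 6: 6.32,
}


def smallest_cutoff(percent_by_transitions, target_percent):
    """Return the smallest transition count whose cumulative share reaches the target."""
    cumulative = 0.0
    for transitions in sorted(percent_by_transitions):
        cumulative += percent_by_transitions[transitions]
        if cumulative >= target_percent:
            return transitions
    return None  # target not reached with the rows supplied


coverage_at_6 = round(sum(PERCENT_BY_TRANSITIONS.values()), 2)
cutoff = smallest_cutoff(PERCENT_BY_TRANSITIONS, 99.0)  # 99.0 is an assumed target
```

The cumulative share reaches only 92.99% at 5 transitions, so 6 is the first cutoff that covers more than 99% of the columns.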

When building the HMM for the present system, the type of feature vector used to train and test the system is set first. Feature vectors can be continuous or discrete. A system that uses continuous feature vectors feeds a set of coefficients, or sometimes a matrix, into the model. In a discrete system, a single coefficient is fed into the model. Vector quantization converts continuous vectors into discrete vectors; this can be done with the HQuant and HCopy tools that accompany HTK. HQuant builds a codebook from the training data, and the codebook is then used with HCopy to generate the discrete vectors. The effect of the codebook on system performance depends on its size and on the amount of data used to build it. HQuant builds the codebook with the linear vector quantization algorithm, which is computationally expensive. The present invention introduces a new method, called Unique Vector Quantization (UVQ), to reduce computation time and improve system performance. The key to the method is to reduce the number of feature vectors used to build the codebook: duplicate feature vectors are deleted so that the linear vector quantization algorithm receives only one copy of each feature vector. As Table 2 shows, the number of feature vectors in the corpus is greatly reduced.
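The deduplication step at the heart of UVQ can be sketched as below. Only the removal of duplicate feature vectors is shown; building the codebook itself (for example with HQuant) is outside this sketch. The function names `unique_vectors` and `reduction_percent` and the toy vectors are hypothetical.

```python
def unique_vectors(vectors):
    """Keep one copy of each feature vector, preserving first-seen order."""
    seen = set()
    unique = []
    for vector in vectors:
        key = tuple(vector)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def reduction_percent(total, unique):
    """Percentage by which deduplication shrinks the training set."""
    return 100.0 * (1.0 - unique / total)


# Toy feature vectors with repeats (illustrative values only).
vectors = [[0, 3, 2], [3, 2, 2], [0, 3, 2], [0, 3, 2], [3, 2, 2]]
deduplicated = unique_vectors(vectors)
saving = reduction_percent(len(vectors), len(deduplicated))
```

Since many pixel columns and cells in printed text are identical (for example, pure background), deduplication shrinks the input to the quantizer without removing any distinct pattern.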

Table 2. Number of unique vectors in the corpus

Lines in the corpus	Vector count	Unique vector count	Reduction
10,000	12,285,426	413,410	96.64%
15,000	16,288,252	591,673	96.37%

When we tried to build a codebook from all the feature vectors in 2,000 different strip images, the largest codebook we could build had size 728. Building it took about 9 hours, whereas building a codebook of size 1024 from unique feature vectors took only 1 hour 30 minutes. Table 3 shows the recognition rates obtained in experiments using single-source models. When unique feature vectors are used with the linear vector quantization algorithm, the codebook size increases, the computation time falls to one sixth, and the recognition rate improves.

Table 3. Recognition rate with unique vectors

Codebook type	Codebook size	Build time	Recognition rate
Without UVQ	728	9 hours	83.59%
With UVQ	1024	1 hour 30 minutes	85.22%

It should be understood that the methods described above are not limited to the specific embodiments. The specific form may vary without departing from the spirit of the present invention. For example, the transition cutoff may be a number other than 6. The height and width of the strip images and the size of the cells within them may also differ from the examples above. The form of the text feature vector may vary with the extraction method. For example, a feature vector may be a string of binary numbers, decimal numbers, or numbers in another numeral system.

Claims

1. A method for automatically recognizing Arabic text, comprising:

acquiring a text image containing a line of Arabic characters;

digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value, wherein the pixel value is expressed as a binary number;

dividing the line of Arabic characters into a plurality of strip images;

defining a plurality of cells in one of the plurality of strip images, wherein each cell comprises a group of adjacent pixels;

serializing the pixel values in each of the plurality of cells in the one of the plurality of strip images to form a binary cell number;

constructing a text feature vector from the binary cell numbers obtained from the plurality of cells in the one of the plurality of strip images; and

providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.

2. The method of claim 1, further comprising:

converting the binary cell numbers to decimal cell numbers;

serializing the decimal cell numbers obtained from the plurality of cells in the one of the plurality of strip images to form a string of decimal cell numbers; and

constructing the text feature vector from the string of decimal cell numbers obtained from the plurality of cells in the one of the plurality of strip images.

3. The method of claim 1, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is arranged along the first direction, and the plurality of strip images are arranged sequentially along the first direction.

4. The method of claim 3, wherein the two-dimensional array of pixels comprises a plurality of rows of pixels, the height of at least one of the plurality of strip images is defined by M rows in the first direction and its width by N columns in the second direction, and M and N are integers.

5. The method of claim 4, wherein N is in the range of 2 to 100.

6. The method of claim 5, wherein N is in the range of 3 to 10.

7. The method of claim 1, wherein the pixel values in the two-dimensional array of pixels are expressed as single-bit binary numbers.

8. The method of claim 1, wherein the pixel values in the two-dimensional array of pixels are expressed as multi-bit binary numbers.

9. A method for automatically recognizing Arabic text, comprising:

acquiring a text image containing a line of Arabic characters;

digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value; dividing the line of Arabic characters into a plurality of strip images;

scaling down at least one of the strip images to produce a reduced strip image;

serializing the pixel values in each column of pixels in the reduced strip image to form a sequence of numbers, wherein the sequence of numbers constitutes a text feature vector;

10. The method of claim 9, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is arranged along the first direction, and the plurality of strip images are arranged sequentially along the first direction.

11. The method of claim 10, wherein the two-dimensional array of pixels comprises a plurality of rows of pixels, the height of at least one of the plurality of strip images is defined by M rows in the first direction and its width by N columns in the second direction, and M and N are integers.

12. A method for automatically recognizing Arabic text, comprising:

acquiring a text image containing a line of Arabic characters;

digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction;

counting the frequencies of consecutive pixels having the same pixel value in a column of pixels;

constructing a text feature vector from the frequencies counted in the column of pixels;

13. The method of claim 12, wherein the frequencies of consecutive pixels having the same pixel value are counted up to a predefined transition cutoff.

14. The method of claim 13, wherein the transition cutoff is 6.

15. The method of claim 12, wherein the pixel values in the two-dimensional array are expressed as single-bit binary numbers.

16. The method of claim 15, wherein counting the frequencies comprises starting the count with pixels in the column of pixels whose pixel value is "1".

17. The method of claim 16, wherein, if the first pixel in the column has a pixel value of "0", the frequency count is "0".

18. A system for automatically recognizing Arabic text, comprising:

means for acquiring a text image containing a line of Arabic characters;

means for digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value, wherein the pixel value is expressed as a binary number;

means for dividing the line of Arabic characters into a plurality of strip images;

means for defining a plurality of cells in one of the plurality of strip images, wherein each cell comprises a group of adjacent pixels;

means for serializing the pixel values in each of the plurality of cells in the one of the plurality of strip images to form a binary cell number;

means for constructing a text feature vector from the binary cell numbers obtained from the plurality of cells in the one of the plurality of strip images; and

means for providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.

19. The system of claim 18, further comprising:

means for converting the binary cell numbers to decimal cell numbers;

means for serializing the decimal cell numbers obtained from the plurality of cells in the one of the plurality of strip images to form a string of decimal cell numbers; and

means for constructing the text feature vector from the string of decimal cell numbers obtained from the plurality of cells in the one of the plurality of strip images.