CN102142088A - Effective Arabic feature extraction-based Arabic identification method and system - Google Patents

Effective Arabic feature extraction-based Arabic identification method and system

Info

Publication number
CN102142088A
Authority
CN
China
Prior art keywords
pixel
row
image
arabic
pixel value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201010258343XA
Other languages
Chinese (zh)
Other versions
CN102142088B (en)
Inventor
穆罕默德S·卡尔希德
侯塞因K·艾尔奥玛依
哈利德M·艾尔法依夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN 201010258343 priority Critical patent/CN102142088B/en
Publication of CN102142088A publication Critical patent/CN102142088A/en
Application granted granted Critical
Publication of CN102142088B publication Critical patent/CN102142088B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a method for automatically recognizing Arabic text, comprising the following steps: digitizing a line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; dividing the line of Arabic characters into a plurality of strip images; defining a plurality of units in one of the strip images, each unit containing a group of adjacent pixels; arranging the pixel values in each unit consecutively to form a binary unit number; constructing a text feature vector from the binary unit numbers of the units in that strip image; and providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.

Description

Arabic recognition method and system based on effective Arabic feature extraction
Technical field
The present patent application relates generally to the automatic recognition of Arabic text.
Background
Text recognition, i.e., the automatic reading of text, is a branch of pattern recognition. Its goal is to read printed text with human accuracy and at higher speed. Most text recognition methods assume that the text can be separated into individual characters. Such techniques, although successful on typed or typeset Latin-script documents, cannot be reliably applied to cursive scripts such as Arabic. Previous research on Arabic script recognition has confirmed that attempts to segment Arabic words into individual characters run into difficulty.
Arabic poses several challenges to a text recognition system. Arabic script is inherently cursive, and writing each character separately in block-letter style is not acceptable. Moreover, the shape of an Arabic letter is closely tied to its context: it depends on the position of the letter within a word. For example, the letter [Figure BSA00000236813800011] has four different shapes: the isolated form [Figure BSA00000236813800012], as in [Figure BSA00000236813800013]; the word-initial form [figure missing], as in [Figure BSA00000236813800015]; the word-medial form [Figure BSA00000236813800016], as in the word [Figure BSA00000236813800017]; and the word-final form [Figure BSA00000236813800018], as in the word [Figure BSA00000236813800019]. In addition, not every Arabic letter is connected to its neighbors within a word. Because spacing is also used to separate certain letters inside a word, the boundary between two words is difficult to identify automatically.
Various classification systems, such as statistical models, have been applied to Arabic text recognition. However, properly extracting text features remains a major obstacle to the accurate recognition of Arabic text.
Summary of the invention
In one aspect, the invention relates to a method for automatically recognizing Arabic text. The method comprises: acquiring a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; dividing the line of Arabic characters into a plurality of strip images; defining a plurality of units in one of the strip images, each unit containing a group of adjacent pixels; arranging the pixel values in each unit of the strip image consecutively to form a binary unit number; constructing a text feature vector from the binary unit numbers obtained from the units of the strip image; and providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
In another aspect, the invention relates to a method for automatically recognizing Arabic text. The method comprises: acquiring a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number, the two-dimensional array comprising a plurality of rows in a first direction and a plurality of columns in a second direction; counting the frequencies of consecutive pixels having the same pixel value in a column of pixels; constructing a text feature vector from the frequencies counted in the column of pixels; and providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
In another aspect, the invention relates to a method for automatically recognizing Arabic text. The method comprises: acquiring a text image containing a line of Arabic characters; digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value; dividing the line of Arabic characters into a plurality of strip images; downscaling at least one of the strip images to produce a reduced strip image; arranging the pixel values in each column of pixels of the reduced strip image consecutively to form a string of sequence numbers, the string of sequence numbers constituting a text feature vector; and providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
In another aspect, the invention relates to a computer program product comprising a computer-usable medium having computer-readable program code embodied therein. The program code causes a computer to: acquire a text image containing a line of Arabic characters; digitize the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number; divide the line of Arabic characters into a plurality of strip images; define a plurality of units in one of the strip images, each unit containing a group of adjacent pixels; arrange the pixel values in each unit of the strip image consecutively to form a binary unit number; construct a text feature vector from the binary unit numbers obtained from the units of the strip image; and provide the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
Implementations of the system may include one or more of the following. The method can further include converting the binary unit numbers to decimal unit numbers; arranging the decimal unit numbers obtained from the units of one of the strip images consecutively to form a string of decimal unit numbers; and constructing the text feature vector from that string of decimal unit numbers. The two-dimensional array of pixels can include a plurality of rows in a first direction and a plurality of columns in a second direction. The line of Arabic characters can be arranged along the first direction. The strip images can be arranged in sequence along the first direction. At least one strip image can have a height defined by M rows in the first direction and a width defined by N columns in the second direction, where M and N are integers. The two-dimensional array of pixels can include N rows of pixels, where N can range from 2 to about 100, or from 3 to about 10. The pixel values in the two-dimensional array can be expressed as single-bit binary numbers or as multi-bit binary numbers. The hidden Markov model can be implemented with a hidden Markov model toolkit.
The systems and methods of the invention provide comprehensive, quantitative, and accurate techniques for extracting features from Arabic text. Compared with the prior art, the disclosed Arabic character recognition is more efficient and requires less computation time. The disclosed methods and systems are also simpler and easier to implement than the prior art.
Although the invention has been shown and described with reference to multiple specific examples, those skilled in the relevant art will understand that various changes in form and detail may be made without departing from the spirit and scope of the invention.
Description of drawings
The following drawings, which form part of the specification, illustrate specific examples of the invention and, together with the description, serve to explain its principles.
Fig. 1 is a flow chart of the Arabic text recognition steps of the invention.
Fig. 2 shows a text image containing Arabic text.
Fig. 3A shows a text image divided into a plurality of strip images, each strip image comprising a plurality of pixels.
Fig. 3B and Fig. 3C show the pixels and pixel values in part of the strip image shown in Fig. 3A.
Fig. 4 shows a text feature extraction method of the invention.
Fig. 5 is a flow chart of the text feature extraction steps shown in Fig. 4.
Fig. 6 shows another text feature extraction method of the invention.
Figs. 7A-7D show another text feature extraction method of the invention.
Fig. 8 is a flow chart of the text feature extraction steps shown in Figs. 7A-7D.
Embodiments
Fig. 1 shows the overall flow of the Arabic text recognition steps of the invention. With reference to Figs. 1-3C, a text image 200 is acquired from Arabic text (step 110 of Fig. 1). The Arabic text in the text image 200 may be arranged in multiple text lines 211-214, each containing a string of cursive Arabic characters. Each text line 211-214 is divided into a plurality of strip images 311-313 (step 120 of Fig. 1). A strip image 311, 312, or 313 is in turn divided into pixels 321-323, each of which is assigned a pixel value (step 130 of Fig. 1). The width of a strip image 311, 312, or 313 can range from 2 to 100 pixels, or from 3 to 10 pixels. A strip image 311, 312, or 313 can contain a complete character, part of a character, or several connected characters.
A pixel value represents the intensity of the text image 200 at a particular pixel location. In one example, a high intensity value means a light image color (or low density) at the pixel, which may lie in the white background. A low intensity value means a dark image color (or high density) at the pixel, which may lie within a stroke of an Arabic character. Pixel values can be expressed in different number systems, such as binary, decimal, and hexadecimal.
With reference to Figs. 3A-3C, the strip image 311 includes an image portion 320, which includes a plurality of pixels 321-323, each assigned a binary pixel value of "0" or "1". A pixel value of "1" represents the white background, and a pixel value of "0" represents the dark image color (i.e., low intensity) of an Arabic stroke. Note that the systems and methods of the invention also apply to multi-bit binary pixel values, which can express multi-tone image intensities (e.g., grayscale).
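The digitizing and strip-splitting steps can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the function names `binarize` and `split_into_strips`, the threshold of 128, and the choice to drop a narrow trailing remainder are all assumptions made for the example. It follows the convention that "1" is the light background and "0" is a dark stroke.

```python
def binarize(gray_rows, threshold=128):
    """Map grayscale intensities to 1 (light background) or 0 (dark stroke)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]


def split_into_strips(pixel_rows, strip_width):
    """Cut a text-line image into vertical strips that are strip_width columns wide.

    A trailing remainder narrower than strip_width is dropped in this sketch.
    """
    width = len(pixel_rows[0])
    return [
        [row[x:x + strip_width] for row in pixel_rows]
        for x in range(0, width - strip_width + 1, strip_width)
    ]


# Hypothetical 2-row, 7-column grayscale text line (0 = black, 255 = white).
gray_line = [
    [255, 250, 30, 20, 240, 255, 255],
    [255, 40, 35, 250, 245, 10, 255],
]
bits = binarize(gray_line)
strips = split_into_strips(bits, strip_width=3)
print(strips[0])  # [[1, 1, 0], [1, 0, 0]]
```

A real text line would be far taller than two rows; the tiny array only keeps the example readable.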
According to the invention, text feature vectors can be extracted from the text line 211 or from the strip images 311-313 (step 140 of Fig. 1). Various text feature extraction methods are discussed in detail below in conjunction with Figs. 4-8. The exact form of the text feature vector can vary with the extraction method, as described below.
The feature vectors obtained in step 140 are fed into a hidden Markov model (HMM) (step 150 of Fig. 1). In the invention, the HMM can be implemented with the Hidden Markov Model Toolkit (HTK), a portable toolkit for building and manipulating hidden Markov models. HTK is lexicon-free and depends on character models and grammar derived from training samples. The probabilistic interpretation provided by the HMM can tolerate pattern variations in the feature vectors. Much of the functionality of HTK is built into library modules available as C source code. These modules are designed to be invoked through a conventional command-line interface, so the execution of the HTK tools is easy to control with scripts.
The HMM can be trained with feature vectors obtained from text images containing known Arabic words (transcriptions) (step 160 of Fig. 1). For training, HTK is supplied with a character model and a ground truth. The character modeling component takes the feature vectors and the corresponding ground truth and estimates the character models. The observations produced from the training samples are used to tune the model parameters, whereas those produced from the test samples are used to evaluate system performance. Each model state represents a letter of the alphabet, and each feature vector corresponds to one observation. The HTK training tools can adjust the character model parameters using the prepared training data so as to predict the known transcriptions.
The HMM parameters are estimated from the ground truth of the training image segments. This segmentation can also be applied to contours to find segmentation points; features are extracted from these segments, and the feature vectors are then converted into an observation sequence. This segmentation-based technique uses dynamic programming to match word images against character strings. The training phase takes as input scanned text lines paired with their ground truth, i.e., the text equivalent of the text image. Each text line is then divided into narrow vertical windows, from which feature vectors are extracted.
A lexicon and a language model are then used, together with the trained HMM, to recognize the Arabic text in the feature vectors (step 170 of Fig. 1). The recognition phase follows the same steps to extract feature vectors, which are used with the different knowledge sources estimated in the training phase to find the character sequence with the highest likelihood. The recognition tool requires a network to describe the transition probabilities from one model to another. The lexicon and the language model can be supplied to the recognition tool to help the recognizer output the correct state sequence.
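To make the idea of finding the most likely state sequence concrete, here is a minimal sketch of Viterbi decoding for a small discrete HMM. It is a generic textbook decoder, not HTK's recognizer, and it does not model the lexicon or language model. The state names "A" and "B", the observation symbols 0 and 1 (standing in for hypothetical codebook indices), and all probabilities are made-up assumptions.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a discrete observation sequence."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] holds (log-probability, path) of the best path that ends in state s.
    best = {
        s: (log(start_p[s]) + log(emit_p[s][observations[0]]), [s]) for s in states
    }
    for obs in observations[1:]:
        step = {}
        for s in states:
            score, path = max(
                ((best[prev][0] + log(trans_p[prev][s]), best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (score + log(emit_p[s][obs]), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Toy model: two hypothetical character states and two codebook symbols.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {0: 0.9, 1: 0.1}, "B": {0: 0.2, 1: 0.8}}

path = viterbi([0, 0, 1, 1], states, start_p, trans_p, emit_p)
print(path)  # ['A', 'A', 'B', 'B']
```

Working in log-probabilities avoids numeric underflow on long observation sequences, which matters when every pixel column or strip of a text line contributes one observation.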
In some examples, with reference to Figs. 3A-5, the strip images 311-313 are digitized into an array of pixels 321-323, each having a pixel value (step 510 of Fig. 5). The strip image 311, as shown in Fig. 4, is divided into a plurality of units 410-460 (step 520 of Fig. 5). Each unit 410-460 contains a group of adjacent pixels, such as a 3 × 3 array of pixels. For example, the unit 420 contains pixels 422, 423, and other pixels.
The pixel values in each unit are then represented as a binary unit number (step 530 of Fig. 5). The pixel values in each unit are first arranged consecutively. For example, the nine pixels 322-323 in the unit 420 are arranged row after row, three rows in succession: 1, 1, 1, 1, 0, 0, 1, 0, 0. The sequence of binary pixel values is then mapped to a 9-bit binary unit number. The pixel value of pixel 322 is mapped to the most significant bit, and the pixel value of pixel 323 is mapped to the least significant bit. The pixel values in the unit 420 are therefore represented by the 9-bit binary unit number 111100100. Likewise, the pixel values in the units 410-460 are converted into binary unit numbers 480 in Fig. 4, each of which lies between 0 and 511.
The binary unit numbers of the units in the strip image 311 are then converted into decimal unit numbers 490 in Fig. 4 (step 540 of Fig. 5). The decimal unit numbers 490 are then arranged to form the feature vector of the strip image 311 (step 550 of Fig. 5). Steps 520-550 are repeated for the other strip images. The feature vectors extracted from the different strip images 311-313 are fed into the hidden Markov model to recognize the Arabic characters in the text line (step 560 of Fig. 5).
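The unit-number construction of steps 520-550 can be illustrated with a short sketch. The helper names `unit_number` and `strip_feature_vector` are hypothetical, and the assumption that units are stacked vertically in a strip three pixels wide is made only for illustration; Fig. 4 is not reproduced here.

```python
def unit_number(unit):
    """Concatenate a unit's binary pixel values row by row (first pixel = MSB)
    and return the result as a decimal integer."""
    bits = "".join(str(bit) for row in unit for bit in row)
    return int(bits, 2)


def strip_feature_vector(strip, unit_height=3):
    """Split a strip into vertically stacked units of unit_height rows and
    return the list of decimal unit numbers."""
    return [
        unit_number(strip[r:r + unit_height])
        for r in range(0, len(strip) - unit_height + 1, unit_height)
    ]


# Pixel values of unit 420 as given in the description: 1,1,1 / 1,0,0 / 1,0,0.
unit_420 = [[1, 1, 1], [1, 0, 0], [1, 0, 0]]
print(unit_number(unit_420))  # 484, i.e. binary 111100100

# Hypothetical strip made of unit 420 followed by a second, made-up unit.
strip = unit_420 + [[0, 0, 0], [0, 1, 0], [1, 1, 1]]
print(strip_feature_vector(strip))  # [484, 23]
```

With 3 × 3 units every entry falls in the range 0 to 511, which matches the range stated for the binary unit numbers.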
The extraction method described above in conjunction with Figs. 4-5 is one implementation of the text feature extraction shown in Fig. 1. It should be understood that the method is equally applicable to multi-bit pixel values and to other numerical representations of the data string. For example, pixel values can be expressed as 3-bit or 5-bit binary numbers, which capture grayscale (or multi-tone) information in the text image. Multi-bit pixel values can improve the accuracy with which text features are described along stroke boundaries.
Furthermore, instead of a binary number, a pixel value can be expressed as a value in a range between a minimum and a maximum. In some implementations, pixel values can be linearly scaled (or normalized) to a predetermined range such as [0, 1] or [-1, 1]. The pixel values can then be quantized. Feature vectors can be obtained by steps similar to steps 530-550.
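As a small illustration of linear scaling and multi-bit quantization, consider the sketch below. The function names, the 8-bit input range of 0 to 255, and the option of a range that is symmetric around zero are assumptions for the example; the description does not prescribe a particular formula.

```python
def rescale(values, lo=0.0, hi=1.0, max_value=255):
    """Linearly map pixel values from [0, max_value] to the range [lo, hi]."""
    return [lo + (hi - lo) * v / max_value for v in values]


def quantize(value, bits=3, max_value=255):
    """Quantize a pixel value in [0, max_value] to an integer of `bits` bits."""
    levels = (1 << bits) - 1  # 7 for 3 bits, 31 for 5 bits
    return round(value * levels / max_value)


print(rescale([0, 255]))             # [0.0, 1.0]
print(rescale([0, 255], -1.0, 1.0))  # [-1.0, 1.0]
print([quantize(v) for v in (0, 128, 255)])  # [0, 4, 7]
```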
In some examples, with reference to Fig. 6, the resolution of a strip image 610 is reduced (i.e., the image is scaled down) to form a reduced strip image 620. For example, the strip image 610 can be 60 pixels high, and the reduced strip image 620 can be 20 pixels high, a reduction to 1/3 of the original size. The reduced strip image 620 is digitized to form a pixel array 630, in which each pixel carries a pixel value. The pixel values in each column of the array 630 are serialized into a binary number. The binary numbers of the different columns form a data string 640, which constitutes a feature vector. The feature vectors obtained from the strip images of a text line are fed into the hidden Markov model to recognize the Arabic characters in the text line (step 560 of Fig. 5).
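The downscale-then-serialize idea can be sketched as follows. The description does not say how the strip is resampled, so keeping every third row is purely an assumption here, as are the function names and the top-to-bottom bit order.

```python
def downscale_rows(strip, factor):
    """Reduce the strip height by keeping every factor-th row (assumed resampling)."""
    return strip[::factor]


def column_numbers(strip):
    """Serialize each column, top to bottom, into one binary number."""
    width = len(strip[0])
    return [int("".join(str(row[c]) for row in strip), 2) for c in range(width)]


# Hypothetical strip: 6 rows high, 2 columns wide.
strip = [[1, 0], [1, 1], [0, 1], [0, 1], [1, 1], [1, 0]]
reduced = downscale_rows(strip, factor=3)  # rows 0 and 3 remain
print(reduced)                  # [[1, 0], [0, 1]]
print(column_numbers(reduced))  # [2, 1]
```

Reducing the height first keeps the per-column binary numbers short; a 20-pixel column yields a 20-bit number instead of a 60-bit one.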
With reference to Figs. 7A, 7B, and 8, a strip image 700 is digitized into an array of pixels (step 810 of Fig. 8), similarly to step 510 (shown in Fig. 5). The pixels are arranged in a plurality of columns. Each pixel value is a single-bit binary number, either "1" or "0". The pixel values of each column are serialized into a string of single-bit binary numbers (step 830 of Fig. 8).
Next, as shown in Figs. 7C and 7D, the frequencies of consecutive pixels having the same binary pixel value, "1" or "0", are counted. The frequencies are counted up to a cutoff number of transitions. The frequencies are tabulated to form frequency count tables 750 and 760 (step 850 of Fig. 8). To distinguish two columns that have the same number of transitions but complementary pixel values, for example,
0 1
0 1
0 1
1 0
1 0
0 1
0 1
1 0
the frequency count always starts with pixel value "1" at the top of the column. In the left column the first pixel is "0", so the count for pixel value "1" is "0", and the following count for pixel value "0" is "3". The two complementary columns yield the following frequency counts:
0 3
3 2
2 2
2 1
1 0
0 0
It should be understood that, without departing from the spirit of the invention, the count at the top of each column could instead start with pixel value "0".
Each row in the frequency count tables 750, 760 (shown in Figs. 7C and 7D) represents a transition in pixel value, i.e., from the white background (pixel value "1") to a dark text region (pixel value "0") or vice versa. To compress the data, the frequency count table is cut off at a maximum number of transitions.
The frequency counts in each column of the frequency count tables 750, 760 form a feature vector (step 860 of Fig. 8). In the present example, a column can thus be treated as a vector. The feature vectors formed from the different columns of the strip image are fed into the hidden Markov model (step 870 of Fig. 8).
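The run-length counting of the two complementary example columns can be reproduced with a short sketch. The function name, the zero padding, and the choice to make the vector length equal to the cutoff are assumptions that happen to match the six-row example tables; the description itself only states that counting starts with pixel value "1" and stops at a cutoff.

```python
def run_length_features(column, max_transitions=6, start_value=1):
    """Count runs of identical pixels from the top of a column.

    The first count is always for start_value, so a column that begins with the
    other value gets a leading 0. The result is cut off at, and zero-padded to,
    max_transitions entries.
    """
    counts = []
    expected = start_value
    i = 0
    while len(counts) < max_transitions and i < len(column):
        run = 0
        while i < len(column) and column[i] == expected:
            run += 1
            i += 1
        counts.append(run)
        expected = 1 - expected  # the next run has the complementary value
    counts += [0] * (max_transitions - len(counts))
    return counts


left = [0, 0, 0, 1, 1, 0, 0, 1]
right = [1, 1, 1, 0, 0, 1, 1, 0]
print(run_length_features(left))   # [0, 3, 2, 2, 1, 0]
print(run_length_features(right))  # [3, 2, 2, 1, 0, 0]
```

Because the first entry is always the count of "1" pixels, the two complementary columns produce different vectors even though they contain the same number of transitions.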
The maximum number of transitions is determined by statistical analysis of a large sample of Arabic text. As shown in Table 1, about 99.31% of the columns have six or fewer transitions. In other words, a transition cutoff of 6 is adequate for the vast majority of text images.
Table 1. Number of transitions per column in the corpus
Transitions in a column    Number of columns    Percentage
0                          3003663              18.44%
1                          95418                0.59%
2                          7694625              47.24%
3                          74196                0.46%
4                          4231776              25.98%
5                          45013                0.28%
6                          1028765              6.32%
<=6                                             99.31%
7                          7403                 0.04%
8                          94771                0.57%
9                          900                  0.01%
10                         9543                 0.05%
12                         1367                 0.01%
<=12                                            0.01%
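One way to pick such a cutoff from a histogram of transition counts is sketched below. The function name and the 99% coverage target are assumptions chosen for illustration, and the total is taken over the rows listed in Table 1 only, so the computed shares may differ slightly from the rounded percentages in the table.

```python
def smallest_cutoff(histogram, target):
    """Return the smallest transition count t such that the share of columns
    with at most t transitions reaches the target fraction."""
    total = sum(histogram.values())
    covered = 0
    for transitions in sorted(histogram):
        covered += histogram[transitions]
        if covered / total >= target:
            return transitions
    return max(histogram)


# Column counts copied from Table 1 (transitions -> number of columns).
table_1 = {
    0: 3003663, 1: 95418, 2: 7694625, 3: 74196, 4: 4231776, 5: 45013,
    6: 1028765, 7: 7403, 8: 94771, 9: 900, 10: 9543, 12: 1367,
}
print(smallest_cutoff(table_1, target=0.99))  # 6
```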
When creating the HMM for the system, the type of feature vector used to train and test the system is configured first. Feature vectors can be continuous or discrete. A system using continuous feature vectors feeds a set of coefficients, or sometimes a matrix, into the model. In a discrete system, a single coefficient is fed into the model. Vector quantization converts continuous vectors into discrete vectors; in HTK this can be done with the HQuant and HCopy tools. HQuant builds a codebook from the training data, and the codebook is then used with HCopy to generate the discrete vectors. The effect of the codebook on system performance depends on its size and on the amount of data used to build it. HQuant builds the codebook with the linear vector quantization algorithm, which is computationally expensive. The invention introduces a new method, called Unique Vector Quantization (UVQ), that reduces computation time and improves system performance. The key to the method is to reduce the number of feature vectors used to build the codebook: duplicate feature vectors are deleted before the linear vector quantization algorithm is run, so that only one copy of each feature vector is kept. As shown in Table 2, the number of feature vectors in the corpus is greatly reduced.
Table 2. Number of unique vectors in the corpus
Lines in the corpus    Vectors       Unique vectors    Reduction
10,000                 12,285,426    413,410           96.64%
15,000                 16,288,252    591,673           96.37%
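The deduplication step at the heart of UVQ can be sketched in a few lines. This only shows the removal of repeated feature vectors before codebook training; it does not reproduce HQuant or the quantization algorithm, and the function name and toy vectors are assumptions.

```python
def unique_vectors(vectors):
    """Keep one copy of each feature vector, preserving first-seen order."""
    seen = set()
    unique = []
    for vector in vectors:
        key = tuple(vector)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(list(vector))
    return unique


# Hypothetical run-length feature vectors with repeats.
vectors = [[0, 3, 2], [3, 2, 2], [0, 3, 2], [3, 2, 2], [1, 1, 0]]
deduplicated = unique_vectors(vectors)
print(deduplicated)  # [[0, 3, 2], [3, 2, 2], [1, 1, 0]]
print(len(vectors), "->", len(deduplicated))  # 5 -> 3
```

Since neighboring pixel columns of a text line often produce identical vectors, repeats are common, which is consistent with the large reductions reported in Table 2.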
When we tried to build a codebook from all the feature vectors of 2,000 different strip images, the largest codebook we could build had a size of 728. Building that codebook took about 9 hours, whereas building a codebook of size 1024 from the unique feature vectors took only 1 hour 30 minutes. Table 3 shows the recognition rates obtained in experiments with mono models. When unique feature vectors were combined with the linear vector quantization algorithm, the codebook size increased, the computation time dropped to one sixth, and the recognition rate improved.
Table 3. Recognition rates with and without unique vectors
Codebook type    Codebook size    Creation time         Recognition rate
Without UVQ      728              9 hours               83.59%
With UVQ         1024             1 hour 30 minutes     85.22%
It should be understood that the methods described above are not limited to the specific examples given. Their form may vary without departing from the spirit of the invention. For example, the transition cutoff can be a number other than 6. The height and width of the strip images, and the size of the units within a strip image, can also differ from the examples above. The form of the text feature vector can likewise vary with the extraction method. For example, a feature vector can be a string of binary numbers, decimal numbers, or numbers in another number system.

Claims (20)

1. A method for automatically recognizing Arabic text, comprising:
acquiring a text image containing a line of Arabic characters;
digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value, wherein the pixel value is expressed as a binary number;
dividing the line of Arabic characters into a plurality of strip images;
defining a plurality of units in one of the plurality of strip images, each unit containing a group of adjacent pixels;
arranging the pixel values in each of the plurality of units in the one of the plurality of strip images in sequence to form a binary unit number;
constructing a text feature vector from the binary unit numbers obtained from the plurality of units in the one of the plurality of strip images; and
providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
2. The method of claim 1, further comprising:
converting the binary unit numbers to decimal unit numbers;
arranging the decimal unit numbers obtained from the plurality of units in the one of the plurality of strip images in sequence to form a string of decimal unit numbers; and
constructing the text feature vector from the string of decimal unit numbers obtained from the plurality of units in the one of the plurality of strip images.
3. The method of claim 1, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is arranged along the first direction, and the plurality of strip images are arranged in sequence along the first direction.
4. The method of claim 3, wherein the two-dimensional array of pixels comprises N rows of pixels, and at least one of the plurality of strip images has a height defined by M rows in the first direction and a width defined by N columns in the second direction, M and N being integers.
5. The method of claim 4, wherein N ranges from 2 to 100.
6. The method of claim 5, wherein N ranges from 3 to 10.
7. The method of claim 1, wherein the pixel values in the two-dimensional array of pixels are expressed as single-bit binary numbers.
8. The method of claim 1, wherein the pixel values in the two-dimensional array of pixels are expressed as multi-bit binary numbers.
9. A method for automatically recognizing Arabic text, comprising:
acquiring a text image containing a line of Arabic characters;
digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value;
dividing the line of Arabic characters into a plurality of strip images;
downscaling at least one of the strip images to produce a reduced strip image;
arranging the pixel values in each column of pixels of the reduced strip image consecutively to form a string of sequence numbers, wherein the string of sequence numbers constitutes a text feature vector; and
providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
10. The method of claim 9, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is arranged along the first direction, and the plurality of strip images are arranged in sequence along the first direction.
11. The method of claim 10, wherein the two-dimensional array of pixels comprises N rows of pixels, and at least one of the plurality of strip images has a height defined by M rows in the first direction and a width defined by N columns in the second direction, M and N being integers.
12. A method for automatically recognizing Arabic text, comprising:
acquiring a text image containing a line of Arabic characters;
digitizing the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction;
counting the frequencies of consecutive pixels having the same pixel value in a column of pixels;
constructing a text feature vector from the frequencies counted in the column of pixels; and
providing the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
13. The method of claim 12, wherein the frequencies of consecutive pixels having the same pixel value are counted up to a predefined transition cutoff.
14. The method of claim 13, wherein the transition cutoff is 6.
15. The method of claim 12, wherein the pixel values in the two-dimensional array are expressed as single-bit binary numbers.
16. The method of claim 15, wherein counting the frequencies comprises starting the count with pixels having a pixel value of "1" in the column of pixels.
17. The method of claim 16, wherein the frequency count is "0" if the first pixel in the column has a pixel value of "0".
18. A system for automatically recognizing Arabic text, comprising a computer-usable medium having computer-readable program code embodied therein that causes a computer to:
Obtain a text image containing a line of Arabic characters;
Digitize the line of Arabic characters to form a two-dimensional array of pixels, each pixel being assigned a pixel value expressed as a binary number;
Divide the line of Arabic characters into a plurality of strip images, wherein the two-dimensional array of pixels comprises a plurality of rows in a first direction and a plurality of columns in a second direction, the line of Arabic characters is aligned along the first direction, the strip images are arranged sequentially along the first direction, and each strip image has a width of 2 to 100 pixel columns;
Define a plurality of units in one of the strip images, each unit comprising a group of adjacent pixels;
Arrange the pixel values in each of the units of that strip image sequentially to form a binary unit number;
Construct a text feature vector from the binary unit numbers of the plurality of units in that strip image; and
Feed the text feature vector to a hidden Markov model to recognize the line of Arabic characters.
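The strip and unit steps of claim 18 might be sketched in Python as below. The helper names `split_into_strips` and `unit_numbers`, the fixed strip width, units that span the full strip width with a fixed height, and row-by-row serialization are all assumptions. The claim only requires strips 2 to 100 pixel columns wide and units made of adjacent pixels.

```python
def split_into_strips(pixels, strip_width):
    """Cut a binary line image (a list of rows) into vertical strips."""
    if not 2 <= strip_width <= 100:
        raise ValueError("strip width must be 2 to 100 pixel columns")
    n_cols = len(pixels[0])
    # Assumption: the last strip may be narrower if the width does not divide evenly.
    return [[row[start:start + strip_width] for row in pixels]
            for start in range(0, n_cols, strip_width)]


def unit_numbers(strip, unit_height):
    """Serialize the pixels of each unit, row by row, into a binary string."""
    numbers = []
    for top in range(0, len(strip), unit_height):
        unit_rows = strip[top:top + unit_height]
        numbers.append("".join(str(p) for row in unit_rows for p in row))
    return numbers
```

For a 4 x 4 image split into strips two columns wide with units two rows high, each strip yields two 4-bit binary unit numbers.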
19. The system of claim 18, wherein the computer-readable program code embodied in the medium further causes the computer to:
Convert the binary unit numbers to decimal unit numbers;
Arrange the decimal unit numbers of the plurality of units in one of the strip images sequentially to form a string of decimal unit numbers; and
Construct the text feature vector from the string of decimal unit numbers of the plurality of units in that strip image.
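The conversion in claim 19 can be illustrated with a short Python sketch. It assumes the binary unit numbers are held as bit strings and that the "string" of decimal unit numbers is an ordered list; the function name `decimal_feature_vector` is hypothetical.

```python
def decimal_feature_vector(binary_units):
    """Convert binary unit numbers (bit strings) to decimals, preserving their order."""
    return [int(bits, 2) for bits in binary_units]


print(decimal_feature_vector(["1001", "1100"]))  # [9, 12]
```

Here the unit "1001" becomes 9 and "1100" becomes 12, so the strip is described by the vector [9, 12].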
20. The system of claim 18, wherein the two-dimensional array of pixels comprises N rows of pixels, and at least one of the strip images has a height defined by M rows in the first direction and a width defined by N columns in the second direction, M and N being integers.
CN 201010258343 2010-08-17 2010-08-17 Effective Arabic feature extraction-based Arabic identification method and system Expired - Fee Related CN102142088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010258343 CN102142088B (en) 2010-08-17 2010-08-17 Effective Arabic feature extraction-based Arabic identification method and system

Publications (2)

Publication Number Publication Date
CN102142088A true CN102142088A (en) 2011-08-03
CN102142088B CN102142088B (en) 2013-01-23

Family

ID=44409585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010258343 Expired - Fee Related CN102142088B (en) 2010-08-17 2010-08-17 Effective Arabic feature extraction-based Arabic identification method and system

Country Status (1)

Country Link
CN (1) CN102142088B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5335289A (en) * 1991-02-13 1994-08-02 International Business Machines Corporation Recognition of characters in cursive script
US5933525A (en) * 1996-04-10 1999-08-03 Bbn Corporation Language-independent and segmentation-free optical character recognition system and method
CN1606028A (en) * 2004-11-12 2005-04-13 清华大学 Printed font character identification method based on Arabic character set
CN1741035A (en) * 2005-09-23 2006-03-01 清华大学 Blocks letter Arabic character set text dividing method
CN101038627A (en) * 2007-04-30 2007-09-19 哈尔滨工程大学 Method for recognizing print hand Arabic alphabets based on boundary characteristic

Also Published As

Publication number Publication date
CN102142088B (en) 2013-01-23

Similar Documents

Publication Publication Date Title
US8111911B2 (en) System and methods for arabic text recognition based on effective arabic text feature extraction
US10936862B2 (en) System and method of character recognition using fully convolutional neural networks
JP2667954B2 (en) Apparatus and method for automatic handwriting recognition using static and dynamic parameters
JP3133403B2 (en) Neighborhood block prediction bit compression method
CN106295245B (en) Caffe-based stacked denoising autoencoder method for gene information feature extraction
CN107004140B (en) Text recognition method and computer program product
CN110114776B (en) System and method for character recognition using a fully convolutional neural network
EP2943911A1 (en) Process of handwriting recognition and related apparatus
JP2005242579A (en) Document processor, document processing method and document processing program
CN108804397A (en) Method for generating Chinese character font conversions based on a small number of target fonts
CN109285111A (en) Font conversion method, apparatus, device, and computer-readable storage medium
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN114005123A (en) System and method for digitally reconstructing layout of print form text
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN100485711C (en) Computer identification and automatic inputting method for hand writing character font
CN109472020B (en) Feature alignment Chinese word segmentation method
Ghosh et al. R-PHOC: segmentation-free word spotting using CNN
Chherawala et al. Arabic word descriptor for handwritten word indexing and lexicon reduction
CN102142088B (en) Effective Arabic feature extraction-based Arabic identification method and system
CN115455955A (en) Chinese named entity recognition method based on local and global character representation enhancement
JPS60153574A (en) Character reading system
JP5986051B2 (en) Method for automatically recognizing Arabic text
CN112329389A (en) Automatic Chinese character stroke extraction method based on semantic segmentation and tabu search
CN116311275B (en) Text recognition method and system based on seq2seq language model
EP2735999A2 (en) Systems and methods for arabic text recognition based on effective arabic text feature extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130123

Termination date: 20180817