WO1993018483A1 - Method and apparatus for image recognition - Google Patents

Method and apparatus for image recognition

Info

Publication number
WO1993018483A1
Authority
WO
WIPO (PCT)
Prior art keywords
level
hidden markov
dimensional
state
pixel
Prior art date
Application number
PCT/US1993/001843
Other languages
French (fr)
Inventor
Esther Levin
Roberto Pieraccini
Original Assignee
American Telephone And Telegraph Company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone And Telegraph Company filed Critical American Telephone And Telegraph Company
Publication of WO1993018483A1 publication Critical patent/WO1993018483A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G06F18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models

Definitions

  • a state sequence S defines a mapping from the observation time scale, 1 ≤ t ≤ T, to the active state at time t, 1 ≤ s(t) ≤ T_R, which corresponds to the reference time scale t̂ in the DTW approach.
  • the first term in (35) provides a distortion measure, as in (2).
  • A particular case of this model, called a left-to-right HMM, is especially useful for speech modeling and recognition; its admissible state sequences satisfy conditions equivalent to equations (3) and (4) of DTW.
  • the minimization (35) is, in effect, performed only among those state sequences that correspond to mappings satisfying conditions that are equivalent to (3) and (4).
  • the only difference between the minimization problem defined by (2), (3) and (4) and this one is the non-zero penalty term in (35).
  • the optimality principle can be applied to the minimization (35) in a manner similar to DTW as described in section 2.1.2.
  • Each state in s is a stochastic source characterized by its probability density over the space of observations g ∈ G. It is convenient to think of the states of the model as being located on a rectangular lattice, L_{X_R,Y_R}.
  • the first term of (38) generalizes the distortion measure D of DPW; the second term, C, generalizes constraints (11), (12), and (23). In particular, by restricting the PHMM parameter values as in (40), the active state matrix S that minimizes (38) must satisfy conditions equivalent to (11), (12a) and (12c).
  • the PHMM constrained by (40) can be referred to as the left-to-right bottom-up PHMM, since it doesn't allow for "foldovers" in the state images.
  • the other boundary conditions (12b) and (12d) can be imposed on S by restricting the values of s(x,Y), 1 ≤ x ≤ X, and s(X,y), 1 ≤ y ≤ Y.
  • the number of subsets, N_G, should be polynomial in the dimensions of the model, X_R and Y_R.
  • the probabilities {A_{(i,j),(k,l),(m,n)}} should satisfy the two following constraints with respect to such grouping: A_{(i,j),(k,l),(m,n)} ≠ 0 only if there exists p, 1 ≤ p ≤ N_G, such that (i,j), (m,n) ∈ γ_p. (42)
  • Condition (42) means that the left neighbor of the state (m,n) in the state matrix S must be a member of the same group γ_p as (m,n).
  • the second constraint, (43), makes the penalty term C independent of the horizontal warping.
  • each subset γ_p of the PHMM can be considered as a one-dimensional HMM comprising the states (x, y_p), 1 ≤ x ≤ X_R, 1 ≤ p ≤ Y_R, with transition probabilities constrained by (42) and (43).
  • constraints (42) and (43) can be trivially changed by applying a coordinate transformation.
  • the PHMM approach was tested on a writer-independent isolated handwritten digit recognition application.
  • the data we used in our experiments was collected from 12 subjects (6 for training and 6 for test). The subjects were each asked to write 10 samples of each digit. Each sample was written in a fixed-size box, therefore the samples were naturally size-normalized and centered.
  • Figure 11 shows the 100 samples written by one of the subjects.
  • Each sample in the database was represented by a 16×16 binary image.
  • Each character class (digit) was represented by a single PHMM, satisfying (49) and (50).
  • Each PHMM had a strictly left-to-right bottom-up structure, where the state matrix S was restricted to contain every state of the model, i.e., states could not be skipped. All models had the same number of states.
  • Each state was represented by its own binary probability distribution, i.e., the probability of a pixel being 1 (black) or 0 (white).
  • Each iteration of the algorithm consisted of two stages: first, the samples were aligned with the corresponding model by finding the best state matrix S. Then, a new frequency count for each state was used to update P_i(1), according to the obtained alignment.
  • the recognition was performed as explained in section 3: the test sample was assigned to the class k for which the likelihood was maximal.
  • Figure 12 shows three sets of models with different numbers of states.
  • the (6×6) state models have a very coarse representation of the digits, because the number of states is so small.
  • the (10×10) state models appear much sharper than the (16×16) state models, due to their ability to align the training samples.
  • a statistical model (the planar hidden Markov model - PHMM) was developed to provide a probabilistic formulation to the planar warping problem.
  • This model, on one hand, generalizes the single-dimensional HMM to the planar case, and on the other extends the DPW approach.
  • the restricted formulation of the warping problem corresponds to PHMM with constrained transition probabilities.
  • the PHMM approach was tested on an isolated, hand-written digit recognition application, yielding 95% digit recognition accuracy. Further analysis of the results indicates that even in a simple case of isolated characters, the elimination of planar distortions enhances recognition performance significantly. We expect that the advantage of this approach will be even more valuable in harder tasks, such as cursive writing recognition/spotting, for which an effective solution using the currently available techniques has not yet been found.
  • Figure 1 Time-time grid. Abscissa: test time scale, 1 ≤ t ≤ T. Ordinate: reference time scale, 1 ≤ t̂ ≤ T_R. Any monotonically increasing curve connecting point A to point B corresponds to a mapping f ∈ f.
  • Figure 2 Example of warping problem.
  • G_R is a 2×2 reference image, and G is a 3×3 test image. Inside each pixel are shown its (x,y) coordinates. The value of the image g(x,y) is encoded by texture, as shown.
  • Figure 3 Illustration of the definitions of Θ, Φ, and Λ for the example of figure 2.
  • Figure 4 Illustration of the two-dimensional warping algorithm on the example of figure 2.
  • the table shows the values of D i,n for 1 ⁇ i ⁇ 16 and 1 ⁇ n ⁇ 3, calculated according to the DPW algorithm.
  • Figure 5 Illustration of the constrained DPW algorithm for the example of figure 2.
  • the table shows the values of D̃_{k,n} for 1 ≤ k ≤ 2 and 1 ≤ n ≤ 3. In this case the obtained solution is the same as in figure 4.
  • Figure 6 Example of a test image G for which the optimal mapping obtained according to the general DPW formulation differs from the one obtained according to the restricted formulation.
  • Figure 7 Illustration of the planar Markov property. The probability of a state in the light grey pixel given the states of all the dark grey pixels in (a) equals the probability of a state in the light grey pixel given the states of only two dark pixels in (b).
  • Figure 8 Two groupings of the 4×4 PHMM states into subsets.
  • Figure 9 Equivalent representation of constrained PHMM, for the grouping of figure 8a.
  • Figure 10 Illustration of the algorithm for the case of figure 8a.
  • Figure 11 The 100 samples of the digits from one subject.
  • Figure 12 The digit models obtained by training, for different numbers of states.

Abstract

A method for image recognition is provided which involves storing a plurality of two-dimensional hidden Markov models (60) each such model comprising a one-dimensional shape-level hidden Markov model comprising one or more shape-level states, each shape-level state comprising a one-dimensional pixel-level hidden Markov model comprising one or more pixel-level states. An image is scanned to produce one or more sequences of pixels. For a stored two-dimensional hidden Markov model, local Viterbi scores for a plurality of pixel-level hidden Markov models are determined for each sequence of pixels (6). A global Viterbi score of a shape-level hidden Markov model is determined based on a plurality of local Viterbi scores and the sequences of pixels. The scanned image is recognized based on one or more global Viterbi scores.

Description

METHOD AND APPARATUS FOR IMAGE RECOGNITION
Field of the Invention
The present invention relates generally to the field of image recognition, and specifically to pattern based image recognition.
Background of the Invention
Signal recognition systems operate to label, classify, or otherwise recognize an unknown signal. Signal recognition may be performed by comparing characteristics or features of unknown signals to those of known signals.
Features or characteristics of known signals are determined by a process known as training. Through training, one or more samples of known signals are examined and their features or characteristics recorded as reference patterns in a database of a signal recognizer.
To recognize an unknown signal, a signal recognizer extracts features from the signal to characterize it. The features of the unknown signal are referred to as the test pattern. The recognizer then compares each reference pattern in the database to the test pattern of the unknown signal. A scoring technique is used to provide a relative measure of how well each reference pattern matches the test pattern. The unknown signal is recognized as the reference pattern which most closely matches the unknown signal.
There are many types of signal recognizers, e.g., template-based recognizers and hidden Markov model (HMM) recognizers. Template-based recognizers are trained using first-order statistics based on known signal samples (e.g., spectral means of such samples) to build reference patterns. Typically, scoring is accomplished with a time registration technique, such as dynamic time warping (DTW). DTW provides an optimal time alignment between reference and test patterns by locally shrinking or expanding the time axis of one pattern until that pattern optimally matches the other. DTW scoring reflects an overall distance between two optimally aligned reference and test patterns. The reference pattern having the lowest score (i.e., the shortest distance between itself and the test pattern) identifies the test pattern.
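For illustration, the following Python sketch implements a common form of DTW-based template matching; the three-way local transition rule, the scalar features, and all names are assumptions of this example rather than the patent's method (the Appendix below derives a slightly different recursion):

```python
import numpy as np

def dtw_distance(ref, test):
    """DTW score: minimal accumulated distance between two feature
    sequences, with monotonic alignment and aligned endpoints."""
    T_r, T_t = len(ref), len(test)
    D = np.full((T_r, T_t), np.inf)
    D[0, 0] = abs(ref[0] - test[0])
    for i in range(T_r):
        for j in range(T_t):
            if i == 0 and j == 0:
                continue
            D[i, j] = min(
                D[i - 1, j] if i > 0 else np.inf,                # reference advances
                D[i, j - 1] if j > 0 else np.inf,                # test advances
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # both advance
            ) + abs(ref[i] - test[j])
    return D[-1, -1]

# The reference pattern with the lowest score identifies the test pattern.
templates = {"A": [1.0, 2.0, 3.0], "B": [3.0, 2.0, 1.0]}
test = [1.0, 1.1, 2.1, 2.9]
print(min(templates, key=lambda k: dtw_distance(templates[k], test)))  # -> "A"
```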
HMM recognizers are trained using both first and second order statistics (i.e., means and variances) of known signal samples to build reference patterns. Each reference pattern is an N-state statistical model incorporating these means and variances. An HMM is characterized by a state transition matrix, A (which provides a statistical description of how new states may be reached from old states), and an observation probability matrix, B (which provides a description of which spectral features are likely to be observed at a given state). Scoring of a test pattern reflects the probability of the sequence of features in the pattern given a model (i.e., given a reference pattern). Scoring across all models may be provided by conventional dynamic programming techniques, such as Viterbi scoring well known in the art. The HMM which indicates the highest probability of the sequence of features in the test pattern identifies the test pattern.
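As a hedged illustration of Viterbi scoring against an HMM reference pattern (discrete observations; the toy two-state model and all names are invented for this example):

```python
import numpy as np

def viterbi_log_score(A, B, pi, obs):
    """Log-probability of the best state sequence for an observation
    sequence, given transitions A (N x N), observation probabilities
    B (N x K) and initial state distribution pi (N)."""
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    score = logpi + logB[:, obs[0]]          # best score ending in each state
    for o in obs[1:]:
        score = np.max(score[:, None] + logA, axis=0) + logB[:, o]
    return float(np.max(score))

A  = np.array([[0.9, 0.1], [0.1, 0.9]])      # state transition matrix
B  = np.array([[0.8, 0.2], [0.3, 0.7]])      # P(symbol | state)
pi = np.array([0.9, 0.1])
# The model giving the highest score identifies the test pattern.
print(viterbi_log_score(A, B, pi, [0, 0, 1, 1]))
```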
Pattern-based signal recognition techniques, such as DTW and HMMs, have been applied in the past to the one-dimensional problem of speech recognition, where unknown signals to be recognized are speech signals and the one dimension is time. It has been a problem of some interest to provide for multi-dimensional signals, such as two-dimensional image signals, a set of general tools analogous to those available for one-dimensional signal recognition.
Summary of the Invention
The present invention provides a method and apparatus for multi-dimensional signal recognition. The invention accomplishes recognition through multi-dimensional reference pattern scoring techniques.
An illustrative embodiment of the present invention provides a two-dimensional image recognizer for optical character recognition. The recognizer is based on planar hidden Markov models (PHMMs) with constrained transition probabilities. Each PHMM comprises a one-dimensional shape-level hidden Markov model and represents a single image reference pattern. A shape-level HMM comprises one or more pixel-level hidden Markov models, each of which represents a localized portion of a shape-level HMM. The embodiment operates to determine, for a given PHMM and a given sequence of pixels in an unknown character image, a local Viterbi score for each of one or more pixel-level HMMs in a shape-level HMM. Furthermore, the embodiment operates to determine a global Viterbi score for a shape-level HMM based on the plurality of local Viterbi scores. Character images are recognized based on the global Viterbi scores. A global Viterbi score is provided for each PHMM (i.e., each shape-level HMM) reference pattern.
Brief Description of the Drawings
Figure 1 presents illustrative groupings of pixel-level hidden Markov model states.
Figure 2 presents the illustrative groupings of pixel-level hidden Markov model states from Figure 1 associated with the shape-level states of a shape-level hidden Markov model.
Figure 3 presents a shape-level hidden Markov model comprising the shape-level states presented in Figure 2.
Figure 4 presents an illustrative optical character recognition system according to the present invention.
Figure 5 presents components of the two-dimensional pattern matcher presented in Figure 4.
Figure 6 presents an image of a scanned character, T, comprising a plurality of linear pixel sequences.
Detailed Description
Introduction
An illustrative optical character recognition system according to the present invention includes a plurality of two-dimensional (or planar) hidden Markov models to represent images to be recognized. Each planar hidden Markov model is defined by:
i. a set of pixel-level states:
S = {s(x,y)}, x = 1, . . . , X, y = 1, . . . , Y;
ii. a set of transition probabilities:
$$A_{(i,j),(k,l),(m,n)} \equiv P\big(s(x,y) = (m,n) \,\big|\, s(x-1,y) = (i,j),\ s(x,y-1) = (k,l)\big)\,, \qquad (1)$$

where x and y are abscissa and ordinate, respectively, in a conventional two-dimensional coordinate system; and
iii. a set of observation probability densities B(x,y), one for each state s(x,y).
Each two-dimensional hidden Markov model may be represented as a set of shape-level states, $G = \{G_j\},\ j = 1, \ldots, N_G$, where each shape-level state, $G_j$, corresponds to a particular grouping of one or more pixel-level states, S. According to the principles expressed in the Appendix hereto, these groupings of pixel-level states should satisfy the following conditions:
a. The number of groups of shape-level states, $N_G$, is a polynomial function of the number of pixel-level states, X×Y.
b. The union of all groups of shape-level states, $G = \cup_j G_j$, coincides with the set of pixel-level states, S.
With respect to the groups of shape-level states, the transition probabilities should fulfill the two following conditions:
c. $A_{(i,j),(k,l),(m,n)} \neq 0$ only if there exists p, $1 \le p \le N_G$, such that $(i,j), (m,n) \in G_p$; and (2)

d. $A_{(i,j),(k,l),(m,n)} = A_{(i_1,j_1),(k_1,l_1),(m_1,n_1)}$ (3)

if there exists p, $1 \le p \le N_G$, such that $(i,j), (i_1,j_1), (m,n), (m_1,n_1) \in G_p$, and there exists r, $1 \le r \le N_G$, such that $(k,l), (k_1,l_1) \in G_r$.
An example of the application of these conditions (a-d) is presented in Figures 1-3. In Figure 1, seven shape-level states, G_1 to G_7, are shown with reference to a 4×4 matrix of pixel-level states. As shown in Figure 2, each shape-level state, G_j, corresponds to a one-dimensional pixel-level hidden Markov model comprising four pixel-level states. Moreover, each shape-level state, G_j, is but one state in a shape-level hidden Markov model, as shown in Figure 3. The arrows between states in the HMM of Figure 3 indicate legal state transitions within the constraints of conditions c and d, above.
The transition probabilities among the pixel-level and shape-level states are derived from $A_{(i,j),(k,l),(m,n)}$. When conditions c and d hold for a particular grouping, the transition probability $A_{(i,j),(k,l),(m,n)}$ can be represented as:

$$A_{(i,j),(k,l),(m,n)} = a^{p}_{(i,j),(m,n)} \cdot \alpha_{rp}\,, \qquad (4)$$

where

$$a^{p}_{(i,j),(m,n)} = P\big(s(x,y) = (m,n) \,\big|\, s(x-1,y) = (i,j)\big)\,,\quad (i,j),(m,n) \in G_p\,, \qquad (5)$$

and

$$\alpha_{rp} = P\big(s(x,y) \in G_p \,\big|\, s(x,y-1) \in G_r\big)\,. \qquad (6)$$

Hence, (5) defines the transition probabilities between pixel-level states in a one-dimensional pixel-level HMM (such as, e.g., any of those appearing in Figure 2), and (6) defines the transition probabilities between shape-level states in a one-dimensional shape-level HMM. By virtue of (i) the nesting of pixel-level HMMs in a shape-level HMM, and (ii) conditions c and d specified above, a general two-dimensional (or planar) HMM for use in image recognition is provided. Note that the pixel-level state observation probabilities are not affected by the grouping of states.
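To make the factorization (4)-(6) concrete, here is a small hedged sketch; the dictionaries, the toy grouping, and all numeric values are invented for illustration and are not taken from the patent:

```python
# Sketch of decomposition (4): the full transition probability
# A[(i,j),(k,l),(m,n)] is the product of a within-group horizontal
# transition a_p (eq. 5) and a group-to-group vertical transition
# alpha[r -> p] (eq. 6), with (i,j),(m,n) in group G_p and (k,l) in G_r.

group_of = {                      # pixel-level state -> shape-level group index
    (1, 1): 1, (2, 1): 1,         # toy grouping: G_1 holds two states of row 1
    (1, 2): 2, (2, 2): 2,         # G_2 holds two states of row 2
}
a = {                             # eq. (5): (group p, (from, to)) -> probability
    (1, ((1, 1), (2, 1))): 0.8,
    (1, ((1, 1), (1, 1))): 0.2,
    (2, ((1, 2), (2, 2))): 0.7,
    (2, ((1, 2), (1, 2))): 0.3,
}
alpha = {(1, 1): 0.5, (1, 2): 0.5, (2, 2): 1.0}   # eq. (6): group r -> group p

def A(ij, kl, mn):
    """Full transition probability assembled per eq. (4); zero when
    condition c (left neighbor in the same group) is violated."""
    p, r = group_of[mn], group_of[kl]
    if group_of[ij] != p:         # condition c: (i,j), (m,n) share group G_p
        return 0.0
    return a.get((p, (ij, mn)), 0.0) * alpha.get((r, p), 0.0)

print(A((1, 2), (1, 1), (2, 2)))  # -> 0.35: step inside G_2, below-neighbor in G_1
```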
An Illustrative Embodiment
For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising individual functional blocks (including functional blocks labeled as "processors"). The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. (Use of the term "processor" should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP 16 or DSP32C, and software performing the operations discussed below. Very large scale integration (VLSI) hardware embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, may also be provided.
Figure 4 presents an illustrative optical character recognition system according to the present invention. The system comprises a conventional image scanner 10, a two-dimensional pattern matcher 20, control switches R and T, a decision processor 30, a state image memory 35, a probability estimation processor 45, and a planar hidden Markov model memory 40.
The conventional image scanner 10 receives a physical image of a character and scans it to generate as output a matrix signal, g(x,y). This signal represents the intensity of the physical image at each pixel location, x,y, within the image. PHMMs, developed through a training process discussed below, are stored in the PHMM memory 40. Each PHMM in memory 40 represents a character to be recognized in an optical character application.
The matrix signal for the image, g(x,y), is processed by the two-dimensional pattern matcher 20 to generate, for each PHMM, a global Viterbi score, l, resulting from the comparison of the PHMM and the signal g(x,y). A state image, s(x,y), is also generated to represent the index of the PHMM state corresponding to pixel x,y.
The two-dimensional pattern matcher 20 is presented in Figure 5.
Pattern matcher 20 comprises a windowing processor 5, a pixel- level Viterbi processor 6, a local-level score memory 7, and a shape-level Viterbi processor 8.
The windowing processor 5 receives the matrix signal, g(x,y), and extracts therefrom successive sequences of pixels, L1 , L2, . . . , LM. As shown illustratively in the example of Figure 6, these sequences may be linear sequences of pixels.
The pixel-level Viterbi processor 6 determines, for each pixel sequence L_i and each group G_j (comprising a pixel-level HMM), a local state score d_ij. This is done by computing the Viterbi score of the linear sequence of pixels, L_i, with the pixel-level linear HMM, G_j. An N_G × M matrix of the local-level state scores is stored in memory 7.
The shape-level Viterbi processor 8 computes a global score for a given PHMM as the Viterbi score of a linear shape-level hidden Markov model, using the sequence L_i as the observation sequence and d_ij as the local state score for each shape-level state G_j and each observation L_i. Also, the state image, s(x,y), is computed using conventional backtracking methods for hidden Markov models.
The operations performed by the two-dimensional pattern matcher 20 are repeated for each PHMM in the PHMM memory 40. In recognition mode (i.e., when switch R is closed and switch T is open), the decision processor 30 recognizes the scanned image as the character corresponding to the PHMM with the highest score, l_h.
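The two-stage scoring just described can be sketched compactly. Everything below (the model encoding as left-to-right transition matrices with Bernoulli pixel probabilities, starting in the first state, and all names) is an illustrative assumption, not the patent's implementation:

```python
import numpy as np

def viterbi_log(A, logB):
    """Viterbi log-score. A: S x S transitions; logB: S x T per-state
    observation log-probabilities; the path starts in state 0."""
    logA = np.log(A + 1e-12)
    score = np.full(A.shape[0], -np.inf)
    score[0] = logB[0, 0]
    for t in range(1, logB.shape[1]):
        score = np.max(score[:, None] + logA, axis=0) + logB[:, t]
    return float(score.max())

def phmm_global_score(rows, shape_A, pixel_models):
    """Local score d_ij for every row L_i and pixel-level HMM G_j,
    then a shape-level Viterbi pass over the matrix of d_ij."""
    d = np.empty((len(pixel_models), len(rows)))        # N_G x M (memory 7)
    for j, (A_j, p_black) in enumerate(pixel_models):
        for i, row in enumerate(rows):
            obs = np.where(row == 1, p_black[:, None], 1 - p_black[:, None])
            d[j, i] = viterbi_log(A_j, np.log(obs + 1e-12))
    return viterbi_log(shape_A, d)                      # global Viterbi score

A2 = np.array([[0.6, 0.4], [0.0, 1.0]])                 # 2-state left-to-right
models = [(A2, np.array([0.9, 0.1])), (A2, np.array([0.1, 0.9]))]
shape_A = np.array([[0.5, 0.5], [0.0, 1.0]])
rows = [np.array([1, 1, 0, 0]), np.array([0, 0, 1, 1])]
print(phmm_global_score(rows, shape_A, models))
```

A recognizer would evaluate such a global score once per stored PHMM and report the character whose model scores highest.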
In training mode, switch T is closed and switch R is open. The training mode operation of the embodiment involves conventional Viterbi training of a linear hidden Markov model. Known samples of all characters to be recognized are provided sequentially as input to scanner 10. For each such sample of a given character, a state image s(x,y) is determined by the two-dimensional pattern matcher 20 as described above, using only the PHMM corresponding to the known sample. All known samples for the character are processed in this fashion, with each state image s(x,y) stored in state image memory 35. Once all such samples for a character are processed and the resulting state images are stored, the probability estimation processor 45 estimates new transition and observation probabilities for the PHMM (as frequency counts) in conventional fashion, taking into account the conditions c and d described above for the state transition probabilities.
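The re-estimation step can be sketched as follows; this illustrative fragment re-estimates only the per-state observation probabilities as frequency counts from the stored state images (binary pixels are assumed, as in the experiments reported in the Appendix), and the transition probabilities would be counted analogously, subject to conditions c and d:

```python
import numpy as np

def reestimate_observation_probs(state_images, images, n_states, eps=1e-3):
    """Estimate P_i(black) for each state i as the fraction of black
    pixels among all pixels aligned to state i during matching."""
    black = np.zeros(n_states)
    total = np.zeros(n_states)
    for s_img, g_img in zip(state_images, images):      # one pair per sample
        for state, pixel in zip(s_img.ravel(), g_img.ravel()):
            black[state] += pixel                       # pixel is 0 or 1
            total[state] += 1
    return (black + eps) / (total + 2 * eps)            # smoothed P_i(1)

# Toy usage: two 2x2 samples, each aligned to a 4-state model
imgs   = [np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]])]
states = [np.array([[0, 1], [2, 3]]), np.array([[0, 1], [2, 3]])]
print(reestimate_observation_probs(states, imgs, n_states=4))
```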
APPENDIX

1. Introduction
In this appendix we extend the dynamic time warping (DTW) algorithm, widely used in automatic speech recognition (ASR), to a dynamic plane warping (DPW) algorithm, for applications in the field of optical character recognition (OCR) or similar applications.
This appendix is written from the point of view of a "speech-researcher"; i.e., we start with the description of the single-dimensional case and then extend it to two dimensions in order to point out the similarity and the differences between the two algorithms. No previous knowledge about speech recognition is assumed.
In the next section we first discuss the general template matching approach to pattern recognition and show the role of DTW or DPW algorithms in this paradigm. Then we describe the single-dimensional warping, or time alignment, problem, and show how the DTW algorithm solves the problem for template-based systems in polynomial time using a general principle of optimality. The two-dimensional warping problem is defined in section 2.2, and its general solution using the same optimality principle is presented. Although the application of the optimality principle in this case reduces the computational complexity of planar warping, the complexity still remains exponential in the dimensions of the image. We show that by restricting the original warping problem, by limiting the class of possible distortions somewhat, we can reduce the computational complexity dramatically, and find the optimal solution to the restricted problem in polynomial time. This approach differs from the one taken in references [1] and [2], where instead of restricting the problem, a suboptimal solution to the general problem was found. In section 3, the statistical modeling approach to pattern recognition is described. In section 3.1, we discuss statistical modeling of temporal signals using HMMs, and show how this approach is more general than, but still similar to, DTW. In section 3.2, we introduce the planar hidden Markov model (PHMM) that, on one hand, extends the HMM concept to model images and, on the other hand, generalizes the DPW approach. We show that the restricted formulation of the planar warping problem of section 2.2.3 is equivalent to zeroing some transition probabilities in the PHMM. In section 4, experimental results of isolated hand-written digit recognition experiments are presented. The results indicate that even in the simple case of isolated characters, the elimination of planar distortions enhances the performance significantly. We anticipate that the advantage of this approach will be even more prominent in harder tasks, such as cursive writing recognition/spotting, that involve some of the above-mentioned problems. The major ideas of this appendix are summarized in section 5.
2. Template Matching Approach to Pattern Recognition
The task of pattern recognition is that of classifying a set of measured patterns (e.g., acoustic signals, pixel map images, etc.) into a finite set $C = \{C_1, \ldots, C_N\}$ of distinct classes representing spoken words or phonemes in the case of speech recognition, and written words or characters in the OCR task. Template matching is one of the many possible ways to solve this problem. According to this approach, each class is represented by a template (a reference pattern), and a new pattern is classified by selecting the class $C_k$ for which the distance $D_k$ between the new pattern and the class representative template is minimal, i.e.,

$$k = \arg\min_{1 \le n \le N} D_n\,. \qquad (1)$$
The difficulty in the pattern recognition task arises because of the intra-class variability of the patterns. Methods have to be developed to reduce such variability, thereby building up some invariance properties for the classifier. This intra-class variability is sometimes caused by non-linear distortions during the generation process of the patterns. In speech recognition this problem is known as the 'time alignment' problem, and its source is the temporal variability of the spoken utterances. The DTW procedure described below attempts to reduce the magnitude of this problem. The purpose of the procedure is to time-align the test and the reference patterns by stretching and contracting the test pattern to optimally match it to the reference, minimizing a measure of the spectral distance $D_k$ between the time-aligned patterns and thereby compensating for temporal distortions.
The problem of intra-class variability also arises in optical character recognition due to non-linear, non-uniform elastic distortions (i.e., stretching, contracting) of the hand-written characters. In this appendix we show how to address this problem by generalizing the DTW procedure for planar alignment of images.
2.1 Matching Temporal Signals
2.1.1 One-Dimensional Problem Formulation

The DTW algorithm is a procedure that was developed for optimally aligning two temporal signals: the reference or template signal representing the k-th class,

$$G^R_k = \{\,g^R_k(t):\ 1 \le t \le T_R,\ t \in Z^+,\ g^R_k(\cdot) \in \mathbf{G} \subset R^n\,\}\,,$$

and the test signal to be classified,

$$G = \{\,g(t):\ 1 \le t \le T,\ t \in Z^+,\ g(\cdot) \in \mathbf{G} \subset R^n\,\}\,.$$

Here $Z^+$ is the set of positive integers, and $R^n$ is the n-dimensional real space. The goal of DTW is to find a mapping function

$$f:\ \{1, \ldots, T\} \to \{1, \ldots, T_R\}\,,\qquad \hat t = f(t)\,,$$

that maps the test time scale to the reference time scale, such that the distortion

$$D = \sum_{t=1}^{T} d\big(g^R(f(t)),\, g(t)\big) \qquad (2)$$

between the aligned patterns is minimal, where d(·,·) is a defined local distance measure in $\mathbf{G}$. For simplicity of notation we omit the class index k hereafter. The mapping function is constrained by global constraints, such as the boundary conditions,

$$f(1) = 1\,,\qquad f(T) = T_R\,, \qquad (3)$$

where we assume that the beginnings and the ends of the two patterns line up, and local monotonicity constraints, such as,
$$\Delta f = f(t+1) - f(t) \ge 0\,, \qquad (4)$$

that prevent the mapping from "folding backwards" in time. We denote by $\mathbf{f}$ the set of all mapping functions that satisfy (3) and (4). Constraints (3) and (4) are typical, but not unique. Since the treatment of other kinds of global and local constraints is similar, we continue with the problem defined by (3) and (4) only.
2.1.2 The Procedure

The problem of finding the optimal mapping has an exponential complexity, since there are $O(T_R^{\,T})$ possible mappings in $\mathbf{f}$. These mappings are shown as a set of paths in a time-time grid (Fig. 1), where each path is a monotonically increasing curve that starts at point $A = (1, 1)$ and ends at point $B = (T, T_R)$. The DTW algorithm finds the optimal alignment curve among all possible paths in polynomial time, using the dynamic programming optimality principle. [3] The optimality principle is based on the fact that the optimal alignment curve (i.e., the one with the minimal distortion along the path) connecting point A to point B through point C is found among all curves that optimally connect A and C. This basic principle leads to an efficient iterative procedure for finding the optimal curve connecting A and B.
In the n-th step of the procedure, $2 \le n \le T$, we assume that the optimal warping of the (n-1)-th interval of the test signal, $g(t),\ 1 \le t \le n-1$, to the i-th interval of the reference signal, $g^R(\hat t),\ 1 \le \hat t \le i$, is known for all $1 \le i \le T_R$. Each optimal warping is defined by a mapping $f_{i,n-1}$ and a distortion $D_{i,n-1}$, such that

$$D_{i,n-1} = \min_{f \in \mathbf{f}_{i,n-1}} \sum_{t=1}^{n-1} d\big(g^R(f(t)),\, g(t)\big)\,. \qquad (5)$$

Here we denote by $\mathbf{f}_{i,n}$, $1 \le i \le T_R$, $1 \le n \le T$, the set of all mapping functions from an interval $1 \le t \le n$ to an interval $1 \le \hat t \le i$, satisfying the local monotonicity conditions (4) on their domain and the global boundary conditions: for all $f \in \mathbf{f}_{i,n}$,

$$1 = f(1)\,;\qquad i = f(n)\,. \qquad (6)$$

It is clear that $\mathbf{f} = \mathbf{f}_{T_R,T}$. The warping $f_{i,n-1}$ corresponds to a curve in the time-time grid that optimally connects point A to the point $(t = n-1,\ \hat t = i)$.
At this stage we can find the optimal warping of the n-th interval of the test signal to the i-th interval of the reference signal, namely $f_{i,n}$ and $D_{i,n}$, for all $1 \le i \le T_R$:
$$D_{i,n} = \min_{1 \le j \le i} \big[D_{j,n-1}\big] + d\big(g^R(i),\, g(n)\big)\,, \qquad (7a)$$

and

$$f_{i,n}(t) = \begin{cases} f_{j,n-1}(t)\,, & 1 \le t \le n-1 \\ i\,, & t = n\,, \end{cases} \qquad (7b)$$

where j is the argument minimizing (7a). Note that the range of minimization over j, constrained to the interval $1 \le j \le i$, guarantees the satisfaction of the monotonicity constraint (4).

The procedure is initialized for n = 1 by setting

$$D_{1,1} = d\big(g^R(1),\, g(1)\big)\,;\qquad D_{i,1} = \infty\,,\ i > 1\,, \qquad (8)$$

and is terminated when n = T. This initialization assures that the optimal curve ending at any point in the grid (including point B) does start at point A, according to the global constraints of (3). Therefore, the optimal curve connecting point A to point B is found after T iterations, each requiring on the order of $T_R$ operations described by (7), so that the total computational cost is $O(T\,T_R)$.
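As a check, the recursion (7a) and the initialization translate almost line for line into code; scalar-valued signals and a squared local distance are assumptions of this sketch:

```python
import numpy as np

def dtw(ref, test):
    """DTW per (7)-(8): D[i, n] is the distortion of optimally warping
    the first n+1 test samples onto the first i+1 reference samples;
    the minimization over j runs over j <= i, as in (7a)."""
    T_R, T = len(ref), len(test)
    D = np.full((T_R, T), np.inf)
    D[0, 0] = (ref[0] - test[0]) ** 2                   # initialization (8)
    for n in range(1, T):
        for i in range(T_R):
            D[i, n] = D[:i + 1, n - 1].min() + (ref[i] - test[n]) ** 2  # (7a)
    return D[T_R - 1, T - 1]       # distortion of the optimal curve from A to B

ref  = np.array([0.0, 1.0, 0.0])
test = np.array([0.0, 0.2, 1.1, 0.9, 0.1])
print(dtw(ref, test))              # total cost is O(T * T_R), as in the text
```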
2.2 Matching Images
2.2.1 DPW Problem Formulation

In extending the DTW algorithm to the alignment of images, our goal is to match the 2-dimensional reference image,

$$G^R = \{\,g^R(\hat x,\hat y):\ \hat x,\hat y \in Z^+,\ (\hat x,\hat y) \in L_{X_R,Y_R},\ g^R(\cdot,\cdot) \in \mathbf{G} \subset R^n\,\}\,,$$

to an elastically distorted test image,

$$G = \{\,g(x,y):\ x,y \in Z^+,\ (x,y) \in L_{X,Y},\ g(\cdot,\cdot) \in \mathbf{G} \subset R^n\,\}\,.$$

Here an (x,y) pair describes pixel location by horizontal and vertical coordinates, and $L_{N,M}$ denotes a rectangular discrete lattice, i.e., a set of pixels $L_{N,M} = \{(x,y):\ 1 \le x \le N,\ 1 \le y \le M\}$. Figure 2 shows a simple example of $G^R$ and $G$. This example is used to illustrate the definitions and the procedures described below.
The idea of planar warping is to map the test lattice to the reference one through a mapping function F,

$$F(x,y) = \big(F_x(x,y),\, F_y(x,y)\big) = (\hat x, \hat y)\,,\qquad F:\ L_{X,Y} \to L_{X_R,Y_R}\,, \qquad (9)$$

such that the distortion

$$D = \sum_{(x,y) \in L_{X,Y}} d\big(g^R(F(x,y)),\, g(x,y)\big) \qquad (10)$$
is minimal, subject to possible constraints like global boundary conditions:
$$F_x(1,y) = 1\,; \qquad (11a)$$
$$F_x(X,y) = X_R\,; \qquad (11b)$$
$$F_y(x,1) = 1\,; \qquad (11c)$$
$$F_y(x,Y) = Y_R\,, \qquad (11d)$$

and local monotonicity constraints, such as

$$\Delta F_x \equiv F_x(x+1,y) - F_x(x,y) \ge 0\,; \qquad (12a)$$
$$\Delta F_y \equiv F_y(x,y+1) - F_y(x,y) \ge 0\,. \qquad (12b)$$
We denote by F the set of all admissible mappings that satisfy the above conditions. Although we limit the discussion in this appendix to constraints (11) and (12), the treatment of other kinds of constraints is similar.
2.2.2 The General Approach

The complexity of the problem of finding the optimal warping function is exponential, namely $O\big((X_R Y_R)^{XY}\big)$. This complexity can be reduced, as in the one-dimensional case, by generalizing the optimality principle. We will use the following definitions:

1. Define $\Theta$ to be a set of $N_T$ test sub-shapes $\{\theta_n\}$, where each test sub-shape is a set of pixels $\{(x,y)\}$ satisfying the following conditions:

$$\theta_{n-1} \subset \theta_n\,,\ 1 \le n \le N_T\,,\ \text{where } \theta_0 \text{ is the empty set};\qquad \theta_{N_T} = L_{X,Y}\,;$$

and each difference set $\Delta\theta_n \equiv \theta_n \setminus \theta_{n-1}$ has a natural mono-dimensional parametrization. In particular, we choose $\Theta$ to be a set of Y rectangles, $\theta_n = L_{X,n}$ (see Fig. 2), $n = 1, \cdots, Y$. In this case $\Delta\theta_n$ are the pixels of the n-th row.
2. Define $\Phi$ to be a set of admissible warping sequences $\{\Phi_i\}$, $1 \le i \le N_\Phi$, where each $\Phi_i = \big((\hat x_1,\hat y_1), \ldots, (\hat x_X,\hat y_X)\big)$ is a sequence of X reference pixels that meets the following conditions:

$$\hat x_1 = 1\,;\qquad \hat x_X = X_R\,;\qquad \hat x_{x+1} - \hat x_x \ge 0\,,\ 1 \le x \le X-1\,. \qquad (14)$$
This definition of the set Φ depends on the particular choice of the set Θ and the constraints (11a),(11b) and (12a). Φ is constructed to contain all possible warping sequences of each Δθn that satisfy the constraints.
From this definition it is clear that for each i, $1 \le i \le N_\Phi$, and for each n, $2 \le n \le Y-1$, there exists $F \in \mathbf{F}$ such that $F(x,n) = \Phi_i(x)$ for $1 \le x \le X$. Also, for each $F \in \mathbf{F}$ and any n, $1 \le n \le Y$, there exists $\Phi_i \in \Phi$ such that $F(x,n) = \Phi_i(x)$ for $1 \le x \le X$. The cardinality of $\Phi$ is $N_\Phi = O\big((X_R Y_R)^X\big)$.
3. Each sequence $\Phi_i \in \Phi$ determines a subset $\Lambda_i \subset \Phi$ of sequences whose pixels lie no higher than the corresponding pixels of $\Phi_i$:

$$\Lambda_i = \big\{\,\Phi_j \in \Phi:\ \hat y_x^{(j)} \le \hat y_x^{(i)}\,,\ 1 \le x \le X\,\big\}\,.$$

Whenever we consider $\Phi_i$ to be a candidate warping sequence for the n-th row of the test image, the preceding (n-1)-th row can be matched only with a warping sequence in $\Lambda_i$, in order to meet the vertical monotonicity condition (12b).
Figure 3 shows the concepts defined above, applied to the example of figure 2. In figure 3a the set $\Theta$ is shown. The set $\Phi$ includes in this case 16 sequences, shown in figure 3b; the corresponding $\Lambda_i$ for each $\Phi_i \in \Phi$ is also given. (A short enumeration sketch reproducing these counts appears after definition 4 below.)
4. Denote by $\mathbf{F}_{i,n}$ a set of sub-mapping functions from the n-th test rectangle $\theta_n$, $1 \le n \le N_T$, that satisfy the monotonicity conditions (12a) and (12b), the boundary conditions (11a), (11b) and (11c), and match the n-th row of the test, $\Delta\theta_n$, with $\Phi_i$: for any $F \in \mathbf{F}_{i,n}$,

$$F(x,n) = \Phi_i(x)\,,\quad 1 \le x \le X\,.$$
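The sets $\Phi$ and $\Lambda_i$ can be enumerated mechanically for the figure-2 example (X = 3 test columns, X_R = Y_R = 2). The brute-force sketch below, illustrative code rather than part of the patent, reproduces the 16 admissible sequences cited above:

```python
from itertools import product

X, X_R, Y_R = 3, 2, 2              # test width; reference width and height

# Phi: sequences of X reference pixels (x_hat, y_hat) with x_hat_1 = 1,
# x_hat_X = X_R and x_hat non-decreasing (from (11a), (11b) and (12a)).
pixels = [(x, y) for x in range(1, X_R + 1) for y in range(1, Y_R + 1)]
Phi = [seq for seq in product(pixels, repeat=X)
       if seq[0][0] == 1 and seq[-1][0] == X_R
       and all(b[0] >= a[0] for a, b in zip(seq, seq[1:]))]
print(len(Phi))                    # -> 16, as stated for the example

# Lambda_i: sequences whose y-coordinates never exceed those of Phi_i,
# i.e. the warps admissible for the preceding row (condition (12b)).
def Lam(seq_i):
    return [seq_j for seq_j in Phi
            if all(pj[1] <= pi[1] for pi, pj in zip(seq_i, seq_j))]

print(len(Lam(Phi[0])))            # -> 2 for the all-bottom-row sequence
```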
Using these definitions we are ready to describe the DPW algorithm.
In the n-th iteration of the algorithm, $2 \le n \le Y$, we assume that the optimal warpings of the (n-1)-th rectangle of the test image, $g(x,y),\ (x,y) \in \theta_{n-1}$, that match the (n-1)-th test image row, $g(x,y),\ (x,y) \in \Delta\theta_{n-1}$, with the warping sequence $\Phi_i$ are known for $1 \le i \le N_\Phi$. Each optimal warping is defined by a mapping $F_{i,n-1} \in \mathbf{F}_{i,n-1}$ and a distortion $D_{i,n-1}$, such that

$$D_{i,n-1} = \min_{F \in \mathbf{F}_{i,n-1}} \sum_{(x,y) \in \theta_{n-1}} d\big(g^R(F(x,y)),\, g(x,y)\big)\,.$$

Now we can find the optimal warping of the n-th test rectangle, $g(x,y),\ (x,y) \in \theta_n$, that matches the n-th test image row to the j-th warping sequence, $g^R(\hat x,\hat y),\ (\hat x,\hat y) \in \Phi_j$:

$$D_{j,n} = \min_{i:\ \Phi_i \in \Lambda_j} \big[D_{i,n-1}\big] + \sum_{x=1}^{X} d\big(g^R(\Phi_j(x)),\, g(x,n)\big)\,. \qquad (19)$$

The optimal mapping $F_{j,n}$ is

$$F_{j,n}(x,y) = \begin{cases} F_{z,n-1}(x,y)\,, & (x,y) \in \theta_{n-1} \\ \Phi_j(x)\,, & y = n\,, \end{cases}$$
where z is the argument minimizing (19). Constraining the minimization in (19) only to those i such that $\Phi_i \in \Lambda_j$ guarantees that the vertical monotonicity condition (12b) is satisfied. The horizontal monotonicity condition (12a) and the two boundary conditions (11a) and (11b) are satisfied through the definition of $\Phi_j$.
To complete the n-th iteration, the optimal warping of the n-th test rectangle has to be found for every warping sequence $\Phi_j \in \Phi$, thus requiring $N_\Phi \cdot X$ operations.
The algorithm is initialized for n = 1 by setting

$$D_{i,1} = \begin{cases} \sum_{x=1}^{X} d\big(g^R(\Phi_i(x)),\, g(x,1)\big)\,, & \hat y_x = 1 \text{ for all } 1 \le x \le X \\ \infty\,, & \text{otherwise}\,, \end{cases} \qquad (21)$$

which guarantees the satisfaction of condition (11c). The algorithm is stopped after n = Y, when the optimal warpings $F_{i,Y}$ are found for all i for which $\Lambda_i = \Phi$, thereby requiring a total of $O(Y X N_\Phi)$ computations. The global optimal warping function $F_{optimal}$ minimizing (10) and satisfying (11) and (12) is chosen among these warpings as the one that produces the minimal distortion:

$$F_{optimal} = F_{j,Y}\,,\qquad \text{where } j = \arg\min_{i:\ \Lambda_i = \Phi} D_{i,Y}\,. \qquad (22)$$

Constraining the minimization in (22) only to those i for which $\Lambda_i = \Phi$ guarantees satisfaction of the boundary condition (11d).
Figure 4 shows the values of $D_{i,n}$ and $F_{optimal}$ for the example of figure 2, using a quadratic distance measure $d\big(g^R(\hat x,\hat y),\, g(x,n)\big) = \big(g^R(\hat x_x,\hat y_x) - g(x,n)\big)^2$.
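For tiny images, the general DPW iteration (19), with initialization and termination enforcing (11c) and (11d), can be run by explicit enumeration of $\Phi$. The sketch below assumes a squared local distance and 0-based indices; all names are illustrative:

```python
import numpy as np
from itertools import product

def dpw(ref, test):
    """General DPW by brute-force enumeration of Phi (exponential in
    the test width, so feasible only for toy examples)."""
    Y_R, X_R = ref.shape
    Y, X = test.shape
    pixels = [(x, y) for x in range(X_R) for y in range(Y_R)]
    Phi = [s for s in product(pixels, repeat=X)
           if s[0][0] == 0 and s[-1][0] == X_R - 1
           and all(b[0] >= a[0] for a, b in zip(s, s[1:]))]
    def row_cost(seq, n):          # distortion of matching test row n to seq
        return sum((ref[y, x] - test[n, c]) ** 2
                   for c, (x, y) in enumerate(seq))
    # initialization: the first test row must map to the first reference row
    D = [row_cost(s, 0) if all(y == 0 for _, y in s) else np.inf for s in Phi]
    for n in range(1, Y):          # iteration (19), minimizing over Lambda_j
        D = [min(D[i] for i, si in enumerate(Phi)
                 if all(pj[1] >= pi[1] for pi, pj in zip(si, sj)))
             + row_cost(sj, n) for sj in Phi]
    # termination: the last test row must map to the last reference row
    return min(d for d, s in zip(D, Phi) if all(y == Y_R - 1 for _, y in s))

ref  = np.array([[0, 1], [1, 0]])
test = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 0]])
print(dpw(ref, test))
```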
2.2.3 Constraining the Warping Problem

Even though applying the optimality principle reduces the complexity of the planar warping, the computation is still exponential. Therefore the algorithm is impractical for real-size images (since $N_\Phi = O\big((X_R Y_R)^X\big)$). Further reduction of the computational complexity can be achieved in two different ways:

1. Finding a sub-optimal solution to the warping problem. Examples of sub-optimal procedures can be found in [3,4], where the images are divided into small sub-images, usually containing up to three rows of pixels. These sub-images are small enough that finding a (local) optimal warping function is possible. The global solution, however, is not optimal, since the dependence across sub-images is neglected.
2. Redefining and simplifying the original warping problem.
The idea here is to limit the number of admissible warping sequences in $\Phi$, or, equivalently, constrain the class of admissible mappings $\mathbf{F}$ in such a way that an optimal solution to the constrained problem can be found in polynomial time. The additional constraints used are not arbitrary, but instead reflect the geometric properties of the specific set of images being compared. For example, we can constrain the possible mappings to be of the form

$$F(x,y) = \big(F_x(x,y),\, F_y(y)\big)\,, \qquad (23)$$

where the vertical distortion is independent of the horizontal position. In this case $N_\Phi = O\big(X_R^{\,X}\, Y_R\big)$, and the admissible warping sequences $\Phi_i \in \Phi$ are naturally grouped into $Y_R$ subsets. The m-th subset, $\lambda_m$, contains all those sequences $\Phi_i$ for which $\hat y_x = m$, $1 \le x \le X$ (e.g., in figure 2, the set $\Phi$ now contains only four sequences, $\Phi = \{\Phi_1, \Phi_5, \Phi_{12}, \Phi_{16}\}$, with $\lambda_1 = \{\Phi_1, \Phi_5\}$ and $\lambda_2 = \{\Phi_{12}, \Phi_{16}\}$). For all $\Phi_i \in \lambda_m$, $\Lambda_i = \lambda_1 \cup \cdots \cup \lambda_m$; i.e., the satisfaction of the vertical monotonicity condition is independent of the particular horizontal warping. This allows further reduction of computational complexity, as follows. We define
Figure imgf000022_0003
and
Figure imgf000022_0004
where i is the argument that minimizes (24a). The recursion relation (19) can now be rewritten in terms of these quantities as:
Figure imgf000023_0002
The second term of (25), , is the distortion
Figure imgf000023_0003
resulting from optimally aligning the n-th row of the test image to the m-the row of the reference image while satisfying both the horizontal monotonicity and the boundary conditions (11a), (11b) and (12a). This is equivalently a single-dimensional warping, as described in the previous section, and requires O(XXR) calculations. Denote by fm,n the optimal mapping that aligns the n-th row of the test image to the m-th row of the reference image constrained by (11a), (11b) and (12a) and minimizing ΔDm,n. Then
Figure imgf000023_0001
where
Figure imgf000023_0004
The optimal mapping F_optimal ∈ F, minimizing (10) and satisfying (11) and (12), is F_optimal = F̂_{Y_R,Y}, and the complexity of its computation is only O(Y_R Y X_R X). Figure 5 shows the D̂_{m,n} and the F_optimal obtained by applying the restricted approach to the problem of figure 2. Note that the solution obtained here is the same as the one obtained by the general approach, shown in figure 4.
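The restricted recursion (24)-(26) is straightforward to implement. The following sketch (our illustration, not part of the original text: the numpy representation, the function names, the pinned-corner boundary handling and the quadratic pixel distortion are assumptions) computes the restricted warping distortion in O(Y_R Y X_R X) time:

    import numpy as np

    def row_alignment_cost(test_row, ref_row):
        """Delta-D_{m,n} of (25): monotonic one-dimensional alignment of a
        test row onto a reference row with quadratic pixel distortion,
        endpoints pinned to endpoints (a reading of (11a), (11b), (12a))."""
        X, XR = len(test_row), len(ref_row)
        D = np.full((X, XR), np.inf)
        D[0, 0] = (test_row[0] - ref_row[0]) ** 2
        for x in range(1, X):
            for xr in range(XR):
                prev = D[x - 1, : xr + 1].min()   # horizontal monotonicity
                D[x, xr] = prev + (test_row[x] - ref_row[xr]) ** 2
        return D[X - 1, XR - 1]

    def constrained_dpw(test, ref):
        """Restricted planar warping per (23): each test row maps to one
        reference row, with row indices non-decreasing across rows."""
        Y, YR = test.shape[0], ref.shape[0]
        # delta[m, n]: cost of aligning test row n to reference row m
        delta = np.array([[row_alignment_cost(test[n], ref[m])
                           for n in range(Y)] for m in range(YR)])
        D = np.full((YR, Y), np.inf)
        D[0, 0] = delta[0, 0]                     # corner pinned, cf. (12)
        for n in range(1, Y):
            for m in range(YR):
                # vertical monotonicity: previous row mapped no higher
                D[m, n] = D[: m + 1, n - 1].min() + delta[m, n]
        return D[YR - 1, Y - 1]                   # opposite corner pinned

Backpointers for recovering F_optimal itself are omitted for brevity; they would be stored alongside each minimization, exactly as in (26).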
An important remaining question is: what are the limitations of this restricted approach compared to the original one? Assumption (23) implies that a row of the test image can be mapped only to pixels that belong to a single row of the reference; i.e., a horizontal line in the test image will be mapped to some horizontal line in the reference image. This does not severely restrict the generality of the approach, since many kinds of distortion can still be accounted for. For example, a straight line with a small but non-zero slope can be transformed into any straight or non-straight line, excluding a line with zero slope. An example of a test image for which the solution obtained by the restricted approach differs from that obtained by the general approach is shown in figure 6.
The restricted formulation of the problem should reflect the geometry of the application. The restriction (23) discussed here is only one among many possibilities. For other restricted formulations it might be useful to design the sets Θ and Φ in a different manner. For example, Θ can be the set of nested vertical rectangles {R_{n,Y} : 1 ≤ n ≤ X}; Φ in this case includes the warping sequences for test image columns, similarly to (14), and the set of admissible mappings is restricted to contain functions F for which F_x(x,y) = F_x(x). A generic description of the type of constraints needed is presented in appendix A.
3. Statistical Modeling Approach to Pattern Recognition
Another way of approaching the pattern recognition problem is by means of statistical modeling of the pattern source. The k-th class of patterns C_k, k = 1, . . . , N_c, is represented by a model, which is assumed to generate the k-th class patterns according to the probability distribution P(G | C_k). Under this paradigm, the criterion that yields the minimal classification error is maximum a posteriori probability decoding: an unclassified pattern G is assigned to the class C_k according to

k = argmax_{1 ≤ n ≤ N_c} P(C_n | G).   (27)
The term P(C_n | G) can be rewritten as

P(C_n | G) = P(G | C_n) P(C_n) / P(G),   (28)

where P(G) is independent of C_n and therefore can be ignored. The prior class probability P(C_n) is generally attributed to higher-level knowledge (e.g., syntactic knowledge). If such knowledge is not readily available, we usually assume a uniform class probability, P(C_n) = 1/N_c. Then the classification problem is that of maximizing the likelihood

P(G | C_n) ≡ P_n(G).   (29)
The computation of this likelihood is performed using the underlying stochastic model that represents the n-th class.
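With uniform priors, the decision rule (27)-(29) amounts to maximizing the per-class log-likelihood. A minimal sketch, assuming a hypothetical log_likelihood method on each class model (not an interface defined in this text):

    def classify(G, models):
        """MAP decoding with uniform priors, per (27)-(29): assign the
        pattern G to the class whose model scores it highest.  `models`
        maps class labels to objects with a log_likelihood(G) method."""
        return max(models, key=lambda k: models[k].log_likelihood(G))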
In the next subsection we describe a stochastic model called the "hidden Markov model", frequently used to model temporal signals. We show that the statistical classification approach, using this model, generalizes the template matching paradigm based on DTW. Then we proceed to define a new stochastic model that can both extend the HMM approach to planar signals and generalize the template matching approach using DPW.

3.1 Hidden Markov Model
The HMM is a statistical model used to compute P_n(G) for temporal signals G = {g(t) : 1 ≤ t ≤ T, g ∈ G ⊂ R^n}, such as speech [4][5][6]. For simplicity we omit the class index n. The HMM is a composite statistical source, comprising a set of T_R sources called states, s = {1, . . . , T_R}. The i-th state, i ∈ s, is characterized by its probability distribution P_i(g) over G. At each time t only one of the states is active, emitting the observable g(t). We denote the random variable corresponding to the active state at time t by s(t), s(t) ∈ s. The joint probability distribution (for real-valued g) or discrete probability mass (for discrete g) P(s(t),g(t)) for t > 1 is characterized by the following property:
P(s(t), g(t) | s(1:t−1), g(1:t−1)) = P(s(t) | s(t−1)) P(g(t) | s(t)) ≡ P(s(t) | s(t−1)) P_{s(t)}(g(t)),   (30)

where s(1:t−1) stands for the sequence {s(1), . . . , s(t−1)}, and g(1:t−1) = {g(1), . . . , g(t−1)}.
We denote by aij the transition probability P(s(t)=j | s(t-1)=i ), and by πi, the probability of state i being active at t=1, πi =P(s(1)=i ).
The probability of the entire sequence of states S≡s(1:T) and observations G=g(1:T) can be expressed as
P(S, G) = π_{s(1)} P_{s(1)}(g(1)) ∏_{t=2}^{T} a_{s(t−1)s(t)} P_{s(t)}(g(t)).   (31)
The interpretation of equations (30) and (31) is that the observable sequence G is generated in two stages: first, a sequence S of T states is chosen according to the Markovian distribution parametrized by {aij} and {πi}; then each one of the states s(t), 1≤t≤T, in S generates an observable g(t) according to its own memoryless distribution Ps(t), forming the observable sequence G. This model is called a hidden Markov model, because the state sequence S is not given in most applications, and only the observation sequence G is known. We can estimate the most probable state sequence Ŝ, given the observation G, as
Ŝ = argmax_S P(S, G).   (32)

Then the likelihood (29) of the sequence of observations is approximated by

P(G) ≈ P(Ŝ, G) = max_S P(S, G),   (33)

i.e., instead of the sum

P(G) = Σ_S P(S, G),   (34)
only the maximal term is taken into account. This approximation is computationally economical, and has been shown, both experimentally and theoretically [7], to be valid, i.e., to have a vanishingly small approximation error.
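The relation between (33) and (34) can be made concrete in code: the exact likelihood (34) is computed by the forward recursion, and replacing its log-sum with a maximum yields the Viterbi approximation (33). A sketch under the assumptions of discrete observations and log-domain parameters (not the original text's formulation):

    import numpy as np
    from scipy.special import logsumexp

    def log_likelihoods(obs, log_pi, log_a, log_b):
        """Return (exact log P(G) per (34), Viterbi approximation per (33)).
        obs: observation indices; log_pi: (TR,) initial log-probabilities;
        log_a: (TR, TR) transition log-probabilities; log_b: (TR, V)
        per-state observation log-probabilities."""
        fwd = log_pi + log_b[:, obs[0]]     # forward recursion (sum)
        vit = fwd.copy()                    # Viterbi recursion (max)
        for o in obs[1:]:
            fwd = logsumexp(fwd[:, None] + log_a, axis=0) + log_b[:, o]
            vit = (vit[:, None] + log_a).max(axis=0) + log_b[:, o]
        return logsumexp(fwd), vit.max()

The only difference between the two recursions is logsumexp versus max, which is why the approximation error stays small when one state sequence dominates.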
The problem of finding Ŝ and P(Ŝ, G) can be restated as that of minimizing

L(S) = −log P(S, G) = −Σ_{t=1}^{T} log P_{s(t)}(g(t)) − log π_{s(1)} − Σ_{t=2}^{T} log a_{s(t−1)s(t)}   (35)
over all possible state sequences S. The problem of minimizing L is of exponential complexity, since there exist T_R^T
possible state sequences, but it can be solved in polynomial time using a dynamic programming approach (similarly to the description in Section 2.1). It is useful to understand this similarity: a state sequence S defines a mapping from the observation time scale 1 ≤ t ≤ T to the active state at time t, 1 ≤ s(t) ≤ T_R, which corresponds to the reference time scale 1 ≤ t_R ≤ T_R in the DTW approach. The first term in (35) provides a distortion measure, as in (2). For example, for a Gaussian HMM, where each P_i(g) is Gaussian with mean μ_i, this term reduces (up to constants) to the accumulated quadratic distortion Σ_{t=1}^{T} ||g(t) − μ_{s(t)}||². The penalty term in (35), −log π_{s(1)} − Σ_{t=2}^{T} log a_{s(t−1)s(t)}, generalizes the global and the local constraints of equations (3) and (4) of DTW. A particular case of this model, called a left-to-right HMM, is especially useful for speech modeling and recognition. In this case a_{ij} = 0 for j < i, and π_1 = 1. This type of model assigns an infinite penalty to state sequences that do not start with s(1) = 1, and to those for which the monotonicity condition s(t+1) ≥ s(t) does not hold. If in addition the absorbing state s(T) is constrained to be the last state of the model, s(T) = T_R, the minimization (35) is, in effect, performed only among those state sequences that correspond to mappings satisfying conditions equivalent to (3) and (4). The only difference between the minimization problem defined by (2), (3) and (4) and this one is the non-zero penalty term in (35). The optimality principle can be applied to the minimization (35) in a manner similar to DTW, as described in section 2.1.2.
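A dynamic-programming sketch of the minimization of (35) follows. It is our illustrative reading, with the left-to-right structure expressed through −∞ entries in the log transition matrix rather than an explicit constraint check:

    import numpy as np

    def viterbi_decode(obs, log_pi, log_a, log_b):
        """Minimize L of (35) over state sequences by dynamic programming.

        For a left-to-right model, set log_a[i, j] = -inf for j < i and
        log_pi[i] = -inf for i > 0; to force the absorbing state
        s(T) = TR, read the final cost at the last state instead of
        taking the minimum below."""
        T, TR = len(obs), len(log_pi)
        cost = np.full((T, TR), np.inf)
        back = np.zeros((T, TR), dtype=int)
        cost[0] = -(log_pi + log_b[:, obs[0]])
        for t in range(1, T):
            cand = cost[t - 1][:, None] - log_a   # cand[i, j]: i -> j
            back[t] = cand.argmin(axis=0)
            cost[t] = cand.min(axis=0) - log_b[:, obs[t]]
        states = [int(cost[-1].argmin())]
        for t in range(T - 1, 0, -1):
            states.append(int(back[t, states[-1]]))
        return cost[-1].min(), states[::-1]       # L(S-hat), S-hat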
This statistical description not only provides a formal interpretation of the heuristic warping procedure and aids its understanding, but also enables natural integration with higher-level syntactical knowledge.

3.2 The Two-Dimensional Case: Planar HMM
In this section we describe a statistical model for P_n(G), when G is a planar image, G = {g(x,y) : (x,y) ∈ L_{X,Y}, g ∈ G}. We call this model the "planar HMM" (PHMM) and design it not only to extend the conventional HMM to the two-dimensional case, but also to provide a statistical interpretation and generalization of the DPW approach.
The PHMM is a composite source, comprising a set, s, of N = X_R Y_R states, s = {(i,j) : 1 ≤ i ≤ X_R, 1 ≤ j ≤ Y_R}. Each state (i,j) ∈ s is a stochastic source characterized by its probability density P_{i,j}(g) over the space of observations g ∈ G. It is convenient to think of the states of the model as being located on a rectangular lattice L_{X_R,Y_R}, corresponding to the reference lattice of DPW.
corresponding to the reference lattice of DPW. Similarly to the conventional HMM, only one state is active in the generation of the (x,y)-th image pixel g(x,y). We denote by s(x,y)∈ s the active state of the model that generates g(x,y). The joint distribution governing the choice of active states and image values has the following Markovian property (see figure 7):
P(g(x,y), s(x,y) | g(1:X,1:y−1), g(1:x−1,y), s(1:X,1:y−1), s(1:x−1,y)) =
= P(g(x,y) | s(x,y)) P(s(x,y) | s(x−1,y), s(x,y−1)) =
= P_{s(x,y)}(g(x,y)) P(s(x,y) | s(x−1,y), s(x,y−1)),   (36)

where g(1:X,1:y−1) ≡ {g(x,y) : (x,y) ∈ R_{X,y−1}}, g(1:x−1,y) ≡ {g(1,y), . . . , g(x−1,y)}, and s(1:X,1:y−1), s(1:x−1,y) are the active states involved in generating g(1:X,1:y−1), g(1:x−1,y), respectively (see figure 7). Using property (36), the joint likelihood of the image G = g(1:X,1:Y) and the state image S = s(1:X,1:Y) can be written as
P(S, G) = P(S) ∏_{(x,y) ∈ L_{X,Y}} P_{s(x,y)}(g(x,y)),   (37)

where

P(S) = π_{s(1,1)} ∏_{x=2}^{X} a^H_{s(x−1,1),s(x,1)} ∏_{y=2}^{Y} a^V_{s(1,y−1),s(1,y)} ∏_{x=2}^{X} ∏_{y=2}^{Y} A_{s(x−1,y),s(x,y−1),s(x,y)}.
Similarly to HMM, (37) suggests that an image G is generated by the PHMM in two successive stages: in the first stage the state matrix S is generated according to the Markovian probability distribution parametrized by {A}, {a^H}, {a^V}, and {π}. In the second stage, the image value of the (x,y)-th pixel is produced independently of the other pixels, according to the distribution P_{s(x,y)}(g) of the s(x,y)-th state. As in HMM, the state matrix S is not known in most applications; only G is given. The state matrix Ŝ that best explains the observable G can be estimated as in (32) by Ŝ = argmax_S P(S, G), and the observation likelihood P(G) is then approximated as P(G) ≈ P(Ŝ, G).
Therefore, the problem of finding Ŝ and P(G) is that of minimizing

L(S) = −log P(S, G) = −Σ_{(x,y) ∈ L_{X,Y}} log P_{s(x,y)}(g(x,y)) + C(S),  where C(S) = −log P(S),   (38)

over all possible state matrices S. Again, the problem is of exponential complexity, since there are (X_R Y_R)^{XY} different state matrices. This complexity can be reduced, as with DPW, by applying the optimality principle and by restricting the model. The similarity between the problem of finding the most probable state matrix in PHMM and DPW can be shown as follows: the states of the PHMM correspond to the pixels of the reference image, and therefore the active state matrix S corresponds to the mapping F of DPW. The first term in (38) is equivalent to the distortion measure D of DPW. The second term, C, generalizes constraints (11), (12), and (23). In particular, by restricting the PHMM parameter values to be
A_{(i,j),(k,l),(m,n)} = 0 unless i ≤ m and l ≤ n, and π_{(i,j)} = δ(i−1) δ(j−1),   (40)

the active state matrix S that minimizes (38) must satisfy conditions equivalent to (11), (12a) and (12c). The PHMM constrained by (40) can be referred to as the left-to-right bottom-up PHMM, since it does not allow for "foldovers" in the state images.
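Under this reading of (40) (itself a reconstruction: left neighbor no further right than the current state, bottom neighbor no higher), the admissible transitions can be tabulated as a boolean mask; the helper below is purely illustrative:

    import numpy as np

    def no_foldover_mask(XR, YR):
        """Broadcastable boolean mask of admissible transitions into
        state (m, n): the left neighbor (i, j) must satisfy i <= m and
        the bottom neighbor (k, l) must satisfy l <= n.  Axes are
        ordered [i, j, k, l, m, n], 0-based; the unconstrained j and k
        axes have size 1 and broadcast."""
        i, j, k, l, m, n = np.ogrid[0:XR, 0:YR, 0:XR, 0:YR, 0:XR, 0:YR]
        return (i <= m) & (l <= n)

Multiplying a table of candidate transition probabilities by this mask (and renormalizing) zeroes exactly the foldover transitions.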
The other boundary conditions, (12b) and (12d), can be imposed on Ŝ by restricting the values of s(x,Y), 1 ≤ x ≤ X, and s(X,y), 1 ≤ y ≤ Y:

s(x,Y) ∈ {(i, Y_R) : 1 ≤ i ≤ X_R},  s(X,y) ∈ {(X_R, j) : 1 ≤ j ≤ Y_R}.   (41)
3.2.1 Constraining the Parameters of PHMM

In this section we describe ways of constraining the values of the transition probabilities {A_{(i,j),(k,l),(m,n)}} in order to reduce the complexity of the problem of finding Ŝ and P(G) to polynomial, similarly to the additional constraints on DPW discussed in appendix A and section 2.2.3.
For the problem of finding Ŝ and P(G) to be solved in polynomial time, there should exist a grouping of the set s of states of the model into N_G subsets of states γ_p, 1 ≤ p ≤ N_G, with s = γ_1 ∪ . . . ∪ γ_{N_G}. These subsets do not have to be mutually exclusive, and can share states.
Two examples of such groupings are shown in figure 8. The number of subsets, N_G, should be polynomial in the dimensions of the model, X_R and Y_R. The probabilities {A_{(i,j),(k,l),(m,n)}} should satisfy the two following constraints with respect to such a grouping:

A_{(i,j),(k,l),(m,n)} ≠ 0 only if there exists p, 1 ≤ p ≤ N_G, such that (i,j), (m,n) ∈ γ_p.   (42)
Condition (42) means that the left neighbor of the state (m,n) in the state matrix S must be a member of the same group γ_p as (m,n). The second constraint is:

A_{(i,j),(k,l),(m,n)} / A_{(i_1,j_1),(k_1,l_1),(m_1,n_1)} = K(p,r,r_1)   (43)

if there exists p, 1 ≤ p ≤ N_G, such that (i,j), (i_1,j_1), (m,n), (m_1,n_1) ∈ γ_p, where (k,l) ∈ γ_r and (k_1,l_1) ∈ γ_{r_1}. The condition (43) makes the penalty term C independent of the horizontal warping.
In the case when (42) and (43) hold for a particular grouping, the nonzero transition probabilities A_{(i,j),(k,l),(m,n)} can be factorized into

A_{(i,j),(k,l),(m,n)} = α_{rp} a^p_{(i,j),(m,n)},   (44)

where a^p_{(i,j),(m,n)} is the transition probability between the states (i,j) and (m,n) within the subset γ_p,

Σ_{(m,n) ∈ γ_p} a^p_{(i,j),(m,n)} = 1,   (45)

and α_{rp} is the probability of a transition from the subset γ_r to the subset γ_p,

Σ_{p=1}^{N_G} α_{rp} = 1.   (46)
The ratio K(p,r,r_1) of Eq. (43) can then be expressed as K(p,r,r_1) = α_{rp}/α_{r_1 p}. Using this equivalent representation of the transition probabilities (given by equations (44)-(46)), a convenient description of the PHMM can be derived. Each subset γ_p of the PHMM can be considered a one-dimensional HMM, comprising the states (i,j) ∈ γ_p, with the transition probabilities among those states given by equation (45) and the respective observation probabilities. The whole PHMM can now be represented as a collection of such subsets, with a Markovian probability of transition between the subsets defined by the α_{rp} of equation (46). This equivalent representation, illustrated in figure 9, suggests an iterative algorithm for computing the state matrix Ŝ and P(G) in polynomial time, similarly to the DPW case of section 2.2.3. Denote by L_{p,n} the local cost, related to the probability that the n-th row of the image G was generated by the single-dimensional HMM corresponding to the subset γ_p, and by Ŝ_{p,n} the corresponding state sequence:

L_{p,n} = min_{s(1,n), . . . , s(X,n) ∈ γ_p} [ −Σ_{x=1}^{X} log P_{s(x,n)}(g(x,n)) − Σ_{x=2}^{X} log a^p_{s(x−1,n),s(x,n)} ],   (47a)

with Ŝ_{p,n} the state sequence achieving this minimum.   (47b)
This cost can be calculated in polynomial time using the Viterbi algorithm, since this is a single-dimensional case. After all the local costs L_{p,n} have been calculated for 1 ≤ n ≤ Y, 1 ≤ p ≤ N_G, the global cost −log P(Ŝ, G) and the optimal state matrix Ŝ are found using the Viterbi algorithm for the single-dimensional HMM defined by a set of N_G states (the subsets γ_p of the PHMM), the transition probabilities between these states (the α_{rp} of Eq. (46)), and the observation probabilities given by exp[−L_{p,n}]. The algorithm is illustrated in figure 10.
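A sketch of this two-stage procedure follows (illustrative only: the viterbi_cost method, standing for the stage-1 row Viterbi of each subset HMM, and the initial-subset log-probabilities log_alpha0 are our assumptions, not interfaces defined in this text):

    import numpy as np

    def phmm_decode(image, row_hmms, log_alpha, log_alpha0):
        """Two-stage PHMM decoding, per figure 10.  Stage 1: local costs
        L[p, n] of (47), one 1-D Viterbi per (subset, image row) pair.
        Stage 2: Viterbi over rows for the NG-state chain with
        transitions log_alpha[r, p] and observation log-probabilities
        -L[p, n] (i.e. exp(-L), as in the text).  Returns the global
        cost -log P(S-hat, G) and the chosen subset per row."""
        Y, NG = image.shape[0], len(row_hmms)
        L = np.array([[hmm.viterbi_cost(image[n]) for n in range(Y)]
                      for hmm in row_hmms])            # L[p, n]
        cost = np.full((Y, NG), np.inf)
        back = np.zeros((Y, NG), dtype=int)
        cost[0] = -log_alpha0 + L[:, 0]
        for n in range(1, Y):
            cand = cost[n - 1][:, None] - log_alpha    # cand[r, p]
            back[n] = cand.argmin(axis=0)
            cost[n] = cand.min(axis=0) + L[:, n]
        path = [int(cost[-1].argmin())]
        for n in range(Y - 1, 0, -1):
            path.append(int(back[n, path[-1]]))
        return cost[-1].min(), path[::-1]

The per-pixel state assignments of Ŝ are recovered by re-reading, for each row n, the stage-1 state sequence Ŝ_{p,n} of the subset p chosen on that row.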
Although conditions (42) and (43) are hard to check in practice, since every possible grouping of the states would have to be considered, they can be used effectively in a constructive mode, i.e., by choosing one particular grouping and then imposing the constraints (42) and (43) on the probabilities {A_{(i,j),(k,l),(m,n)}} with respect to this grouping. For example, if we choose γ_p = {(x,y) | 1 ≤ x ≤ X_R, y = p}, 1 ≤ p ≤ Y_R, then the constraints (42) and (43) transform to
A_{(i,j),(k,l),(m,n)} ≠ 0 only if j = n,   (49)

and

A_{(i,j),(k,l),(m,n)} = α_{ln} a^n_{(i,n),(m,n)},   (50)

equivalently to the restriction imposed on DPW by (23).
The constraints (42) and (43) can be trivially changed by applying a coordinate transformation.
4. Experimental Results
The PHMM approach was tested on a writer-independent isolated handwritten digit recognition application. The data we used in our experiments was collected from 12 subjects (6 for training and 6 for test). The subjects were each asked to write 10 samples of each digit. Each sample was written in a fixed-size box; therefore the samples were naturally size-normalized and centered. Figure 11 shows the 100 samples written by one of the subjects. Each sample in the database was represented by a 16×16 binary image. Each character class (digit) was represented by a single PHMM satisfying (49) and (50). Each PHMM had a strictly left-to-right bottom-up structure, where the state matrix S was restricted to contain every state of the model, i.e., states could not be skipped. All models had the same number of states. Each state was represented by its own binary probability distribution, i.e., the probability of a pixel being 1 (black) or 0 (white). We estimated these probabilities from the training data with the following generalization of the Viterbi training algorithm [8]. For the initialization we uniformly divided each training image into regions corresponding to the states of its model. The initial value of P_i(g=1) for the i-th state was obtained as a frequency count of the black pixels in the corresponding region over all the samples of the same digit. Each iteration of the algorithm consisted of two stages: first, the samples were aligned with the corresponding model by finding the best state matrix Ŝ. Then a new frequency count for each state was used to update P_i(1), according to the obtained alignment. We noticed that the training procedure usually converged after 2-4 iterations, and in all the experiments the algorithm was stopped at the 10th iteration. The recognition was performed as explained in section 3: the test sample was assigned to the class k for which the approximated likelihood P_k(G) was maximal.
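One possible reading of this training loop in code, with a hypothetical align method returning the pixel-to-state assignment of the best state matrix Ŝ (the num_states and p_black attributes are likewise assumed, and the additive smoothing is our addition to avoid zero probabilities):

    import numpy as np

    def viterbi_train(samples, model, iterations=10):
        """Generalized Viterbi training for a binary-output PHMM:
        alternate (i) aligning each sample with the model via its best
        state matrix and (ii) re-estimating each state's P_i(g=1) as the
        frequency of black pixels among the pixels assigned to it."""
        for _ in range(iterations):
            ones = np.zeros(model.num_states)
            total = np.zeros(model.num_states)
            for img in samples:
                states = model.align(img)    # state index per pixel
                for (x, y), s in np.ndenumerate(states):
                    ones[s] += img[x, y]
                    total[s] += 1
            # plain frequency count per the text; +1/+2 smoothing added
            model.p_black = (ones + 1.0) / (total + 2.0)
        return model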
It is worth noting the following two points. First, the test error shows a minimum of 5% at X_R = Y_R = 10; increasing or decreasing the number of states increases this error. This phenomenon is due to the following:
1. The typical under-/over-parametrization behavior.
2. Increasing the number of states toward the size of the modeled images reduces the flexibility of the alignment procedure, which degenerates to a trivial uniform alignment when X_R = Y_R = 16.
Also, the training error decreases monotonically with an increasing number of states, up to X_R = Y_R = 16. This is again typical behavior for such systems: as the number of states increases, the number of model parameters grows, improving the fit to the training data. But when the number of states equals the dimensions of the sample images, X_R = Y_R = 16, there is a sudden, significant increase in the training error. This behavior is consistent with point (2) above.
Figure 12 shows three sets of models with different numbers of states. The states of the models in this figure are represented by squares, where the grey level of each square encodes the probability P(g=1). The (6×6)-state models give a very coarse representation of the digits, because the number of states is so small. The (10×10)-state models appear much sharper than the (16×16)-state models, due to their ability to align the training samples.
This preliminary experiment shows that eliminating elastic distortions by the alignment procedure discussed above plays an important role in isolated character recognition, improving the recognition accuracy significantly. Note that the simplicity of this task does not exercise the full power of the PHMM representation, since the data was isolated, size-normalized, and centered. We expect that the advantages of this approach will be even more prominent in harder tasks, such as cursive/connected handwriting recognition, recognition with grammatical constraints, and noisy images.
5. Summary and Discussion
In this appendix we demonstrated how the DTW algorithm and hidden Markov modeling, extensively used for speech recognition, can be generalized to OCR. We found two key problems in this generalization:
1. Applying the optimality principle in the planar case is not trivial, since the two dimensions of an image cannot be treated separately. In order to use the optimality principle here, the set of all possible warping sequences satisfying the horizontal constraints must be defined. For the n-th row of the test image, every such sequence has to be considered as a candidate warping. The vertical constraints are taken into account by limiting the set of possible warping sequences of the previous, (n−1)-th, row. In this way the complexity of computation was reduced from O((X_R Y_R)^{XY}) to O(Y X (Y_R X_R)^X).
2. Although applying the optimality principle reduces the computational complexity, it still remains exponential in the dimensions of the image. We showed that by restricting the original warping problem, limiting the class of possible distortions (for example, assuming that the vertical distortion is independent of the horizontal position), we can reduce the computational complexity dramatically and find the optimal solution to the restricted problem in linear time, O(X Y X_R Y_R).
A statistical model (the planar hidden Markov model, PHMM) was developed to provide a probabilistic formulation of the planar warping problem. This model, on one hand, generalizes the single-dimensional HMM to the planar case, and on the other extends the DPW approach. The restricted formulation of the warping problem corresponds to a PHMM with constrained transition probabilities. The PHMM approach was tested on an isolated handwritten digit recognition application, yielding 95% digit recognition accuracy. Further analysis of the results indicates that even in the simple case of isolated characters, the elimination of planar distortions enhances recognition performance significantly. We expect that the advantage of this approach will be even more valuable in harder tasks, such as cursive writing recognition/spotting, for which an effective solution using currently available techniques has not yet been found.
Figure Captions
Figure 1: Time-time grid. Abscissa: test time scale 1 ≤ t ≤ T. Ordinate: reference time scale 1 ≤ t_R ≤ T_R. Any monotonically increasing curve connecting point A to point B corresponds to a mapping f ∈ F.
Figure 2: Example of the warping problem. G_R is a 2×2 reference image, and G is a 3×3 test image. Inside each pixel are shown its (x,y) coordinates. The value of the image g(x,y) is encoded by texture, as shown.
Figure 3: Illustration of the definitions of Θ, Φ, and Λ for the example of figure 2.
Figure 4: Illustration of the two-dimensional warping algorithm on the example of figure 2. The table shows the values of D_{i,n} for 1 ≤ i ≤ 16 and 1 ≤ n ≤ 3, calculated according to the DPW algorithm. The optimal value of D is D = 0, and the corresponding F_optimal is shown.
Figure 5: Illustration of the constrained DPW algorithm for the example of figure 2. The table shows the values of D̂_{k,n} for 1 ≤ k ≤ 2 and 1 ≤ n ≤ 3. In this case the obtained solution is the same as in figure 4.
Figure 6: Example of a test image G for which the optimal mapping obtained according to the general DPW formulation differs from the one obtained according to the restricted formulation.
Figure 7: Illustration of the planar Markov property. The probability of a state in the light grey pixel given the states of all the dark grey pixels in (a) equals the probability of a state in the light grey pixel given the states of only two dark pixels in (b).
Figure 8: Two groupings of the 4×4 PHMM states into subsets.
a. Here the set of states is divided into 4 mutually exclusive subsets, each containing the states of one row only.
b. The same set of states is grouped into 7 subsets.
Figure 9: Equivalent representation of constrained PHMM, for the grouping of figure 8a.
Figure 10: Illustration of the algorithm for the case of figure 8a.
a. First, the local costs are computed using the Viterbi algorithm.
b. The global solution is found using the Viterbi algorithm with the local costs.
Figure 11: The 100 samples of the digits from one subject.
Figure 12: The digit models obtained by training, for different numbers of states. The grey level in these images encodes the value of P(g=1) for each state.
6. Appendix A: Properties of the constraints
Changing the choice of the sub-shape set Θ, and changing the set of admissible mappings Φ accordingly, is equivalent to a coordinate transformation. The example discussed at the end of section 2.2.3 corresponds to such a simple coordinate transformation, exchanging the roles of the vertical and horizontal coordinates. In what follows we discuss a generic description of the constraints on the set Φ, keeping the set Θ fixed at θ_n = L_{X,n}.
For the computational complexity of the DPW process to be polynomial in the sizes of the images, there should exist a grouping of the set of admissible warping sequences (defined by F) into N_G mutually exclusive subsets λ_k, 1 ≤ k ≤ N_G. The number of subsets N_G should be polynomial in the sizes of the images {X, Y, X_R, Y_R}, and this grouping should fulfill the following conditions:
1. For 1 ≤ k ≤ N_G, if Φ_i ∈ λ_k and Φ_j ∈ λ_k, then Λ_i = Λ_j ≡ Λ_k.
2. For 1 ≤ k ≤ N_G, Λ_k can be expressed as a union of some of the subsets λ_j, i.e., for any k, 1 ≤ k ≤ N_G, there exist indices j_1, . . . , j_{q_k} such that Λ_k = λ_{j_1} ∪ . . . ∪ λ_{j_{q_k}}.
It is clear that the example (23) discussed in section 2.2.3 satisfies these conditions. The analysis of the general case described by conditions 1 and 2 above is similar to the analysis of the example (23) given in Eqs. (24)-(26), and is therefore omitted.

REFERENCES
1. R. Chellappa, S. Chatterjee, "Classification of Textures Using Gaussian Markov Random Fields," IEEE Transactions on ASSP, Vol. 33, No. 4, pp. 959-963, August 1985.
2. H. Derin, H. Elliott, "Modeling and Segmentation of Noisy and Textured Images Using Gibbs Random Fields," IEEE Transactions on PAMI, Vol. 9, No. 1, pp. 39-55, January 1987.
3. R. Bellman, Dynamic Programming, Princeton, NJ: Princeton University Press, 1957.
4. C.-H. Lee, L. R. Rabiner, R. Pieraccini, J. G. Wilpon, "Acoustic Modeling for Large Vocabulary Speech Recognition," Computer Speech and Language, 1990, No. 4, pp. 127-165.
5. J. G. Wilpon, L. R. Rabiner, C.-H. Lee, E. R. Goldman, "Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models," IEEE Transactions on ASSP, Vol. 38, No. 11, pp. 1870-1878, November 1990.
6. R. Pieraccini, E. Levin, "Stochastic Representation of Semantic Structure for Speech Understanding," Proceedings of EUROSPEECH 91, Vol. 2, pp. 383-386, Genova, September 1991.
7. N. Merhav, Y. Ephraim, "Maximum Likelihood Hidden Markov Modeling Using a Dominant Sequence of States," accepted for publication in IEEE Transactions on ASSP.
8. F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, pp. 532-556, April 1976.

Claims:
1. A method of optical character recognition, the method comprising the steps of:
a. storing a plurality of two-dimensional hidden Markov models, each such model comprising a one-dimensional shape-level hidden Markov model comprising one or more shape-level states, each shape-level state comprising a one-dimensional pixel-level hidden Markov model comprising one or more pixel-level states;
b. scanning an image to produce one or more sequences of pixels;
c. for a stored two-dimensional hidden Markov model,
i. determining for each sequence of pixels a local Viterbi score for a plurality of pixel-level hidden Markov models; and
ii. determining a global Viterbi score of a shape-level hidden Markov model based on a plurality of local Viterbi scores and the sequences of pixels; and
d. recognizing the scanned image based on one or more global Viterbi scores.
2. The method of claim 1 wherein the step of recognizing the scanned image comprises the step of recognizing the scanned image based on the two-dimensional hidden Markov model having the highest global Viterbi score.
3. The method of claim 1 wherein the probability of a first state in a stored two-dimensional hidden Markov model equals zero when a left neighbor state is not a member of the same pixel-level model as the first state.
4. The method of claim 1 wherein the probability of a first state in a stored two-dimensional hidden Markov model is based on the value of a left neighbor pixel-level state and the value of a bottom neighbor shape-level state.
5. An optical character recognition system, the system comprising:
a. a memory storing a plurality of two-dimensional hidden Markov models, each such model comprising a one-dimensional shape-level hidden Markov model comprising one or more shape-level states, each shape-level state comprising a one-dimensional pixel-level hidden Markov model comprising one or more pixel-level states;
b. means for scanning an image to produce one or more sequences of pixels;
c. means, coupled to the means for scanning and the memory, for determining local Viterbi scores for a sequence of pixels, each such score based on a pixel-level hidden Markov model;
d. means, coupled to the means for determining local Viterbi scores, for determining a global Viterbi score of a shape-level hidden Markov model based on a plurality of local Viterbi scores and the sequences of pixels; and
e. means, coupled to the means for determining a global Viterbi score, for recognizing the scanned image based on one or more global Viterbi scores.
6. The system of claim 5 wherein the means for recognizing the scanned image comprises means for recognizing the scanned image based on the two-dimensional hidden Markov model having the highest global Viterbi score.
PCT/US1993/001843 1992-03-02 1993-03-02 Method and apparatus for image recognition WO1993018483A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US84481092A 1992-03-02 1992-03-02
US07/844,810 1992-03-02

Publications (1)

Publication Number Publication Date
WO1993018483A1 true WO1993018483A1 (en) 1993-09-16

Family

ID=25293690

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/001843 WO1993018483A1 (en) 1992-03-02 1993-03-02 Method and apparatus for image recognition

Country Status (1)

Country Link
WO (1) WO1993018483A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4593367A (en) * 1984-01-16 1986-06-03 Itt Corporation Probabilistic learning element
US4599692A (en) * 1984-01-16 1986-07-08 Itt Corporation Probabilistic learning element employing context drive searching
US4599693A (en) * 1984-01-16 1986-07-08 Itt Corporation Probabilistic learning system
US4620286A (en) * 1984-01-16 1986-10-28 Itt Corporation Probabilistic learning element

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5933525A (en) * 1996-04-10 1999-08-03 Bbn Corporation Language-independent and segmentation-free optical character recognition system and method
WO1999052074A1 (en) * 1998-04-03 1999-10-14 The University Of Queensland Method of unsupervised cell nuclei segmentation
AU748081B2 (en) * 1998-04-03 2002-05-30 Cea Technologies Pty Limited Method of unsupervised cell nuclei segmentation
US6681035B1 (en) 1998-04-03 2004-01-20 Cssip (Cooperative Research Centre For Sensor Signal And Information Processing) Method of unsupervised cell nuclei segmentation
US7327883B2 (en) 2002-03-11 2008-02-05 Imds Software Inc. Character recognition system and method
US8280719B2 (en) 2005-05-05 2012-10-02 Ramp, Inc. Methods and systems relating to information extraction
US7890539B2 (en) 2007-10-10 2011-02-15 Raytheon Bbn Technologies Corp. Semantic matching using predicate-argument structure
US8326087B2 (en) 2008-11-25 2012-12-04 Xerox Corporation Synchronizing image sequences
US11163993B2 (en) 2017-07-07 2021-11-02 Hewlett-Packard Development Company, L.P. Image alignments via optical character recognition
CN109002821A (en) * 2018-07-19 2018-12-14 武汉科技大学 A kind of Internetbank shield digit recognition method based on connected domain and tangent slope
CN109002821B (en) * 2018-07-19 2021-11-02 武汉科技大学 Online silver shield digital identification method based on connected domain and tangent slope

Similar Documents

Publication Publication Date Title
Cai et al. Integration of structural and statistical information for unconstrained handwritten numeral recognition
Yang et al. Hidden markov model for gesture recognition
Nefian et al. Face detection and recognition using hidden Markov models
Chiou et al. Lipreading from color video
EP0539749B1 (en) Handwriting recognition system and method
Kolcz et al. A line-oriented approach to word spotting in handwritten documents
US5392363A (en) On-line connected handwritten word recognition by a probabilistic method
US5075896A (en) Character and phoneme recognition based on probability clustering
Howe Part-structured inkball models for one-shot handwritten word spotting
Biem Minimum classification error training for online handwriting recognition
EP0605099B1 (en) Text recognition using two-dimensional stochastic models
Kosmala et al. On-line handwritten formula recognition using statistical methods
Rabi et al. Recognition of cursive Arabic handwritten text using embedded training based on hidden Markov models
Wong et al. Off-line handwritten Chinese character recognition as a compound Bayes decision problem
WO1993018483A1 (en) Method and apparatus for image recognition
Retsinas et al. An alternative deep feature approach to line level keyword spotting
Perronnin et al. A probabilistic model of face mapping with local transformations and its application to person recognition
Kim et al. Off-line recognition of handwritten Korean and alphanumeric characters using hidden Markov models
Kumar et al. Bayesian background models for keyword spotting in handwritten documents
Kumar et al. A Bayesian approach to script independent multilingual keyword spotting
Xiong et al. A discrete contextual stochastic model for the off-line recognition of handwritten Chinese characters
Nopsuwanchai et al. Maximization of mutual information for offline Thai handwriting recognition
Zhu et al. Online handwritten Chinese/Japanese character recognition
Saon Cursive word recognition using a random field based hidden Markov model
Levin et al. Planar Hidden Markov modeling: from speech to optical character recognition

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): DE FR GB IT

CFP Corrected version of a pamphlet front page
CR1 Correction of entry in section i

Free format text: PAT.BUL.22/93 "AMERICAN TELEPHONE AND TELEGRAPH COMPANY" SHOULD APPEAR UNDER INID (71) APPLICANT AND "ERREUR CODE DEPOSANT BFG" SHOULD BE DELETED;LEVIN,ESTHER AND PIERACCINI ROBERTO SHOULD APPEAR UNDER INID (72) INVENTORS AND "ERREUR CODE DEPOSANT BFG" SHOULD BE DELETED;UNDER INID (81) DESIGNATED STATES DELETE "US"

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA