Lecture Notes in Computer Science
2 Preliminaries

In this paper we treat the two-category, normal-distribution case similarly to [2]. The categories are denoted by θ_1 and θ_2 respectively and we set Θ = {θ_1, θ_2}. Let R^d be the d-dimensional Euclidean space (R = R^1) and let x ∈ R^d be the patterns to be classified. Denote by N(μ_i, Σ_i), i = 1, 2, the normal distributions, where μ_i and Σ_i are the mean vectors and the covariance matrices. They are the distributions of the patterns x from the respective categories. The state-conditional probability density functions are

p(x|\theta_i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \, e^{-\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i)}, \qquad i = 1, 2.    (1)

We suppose that the covariance matrices are not degenerate and, hence, Σ_i as well as Σ_i^{-1} are positive definite. Let P(θ_1) and P(θ_2) be the prior probabilities and let P(θ_1|x) and P(θ_2|x) be the posterior probabilities of the patterns from the respective categories. Then, by the Bayesian relation, we have

\frac{P(\theta_1|x)}{P(\theta_2|x)} = \frac{P(\theta_1)\,p(x|\theta_1)}{P(\theta_2)\,p(x|\theta_2)}.    (2)
This ratio can be used as a Bayesian discriminant function [1].

240 Y. Ito, C. Srinivasan, and H. Izumi

The logistic function σ is defined by σ(t) = (1 + e^{-t})^{-1}. Since this is a strictly monotone increasing function, the logistic transform of the logarithm of (2),

\sigma\!\left(\log \frac{P(\theta_1|x)}{P(\theta_2|x)}\right) = P(\theta_1|x),    (3)

is also a Bayesian discriminant function [1]. In the case where the state-conditional probability distributions are normal, the log ratio of the posterior probabilities is a quadratic form:

g_1(x) = \log \frac{P(\theta_1|x)}{P(\theta_2|x)} = -\frac{1}{2}\left\{(x-\mu_1)^t \Sigma_1^{-1} (x-\mu_1) - (x-\mu_2)^t \Sigma_2^{-1} (x-\mu_2)\right\} + \log \frac{P(\theta_1)}{P(\theta_2)} - \frac{1}{2}\log \frac{|\Sigma_1|}{|\Sigma_2|}.    (4)
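As a concrete check of the relation between (2), (3) and (4), the following numpy sketch computes the log posterior ratio of two normal categories and verifies that its logistic transform equals the posterior computed directly from Bayes' rule. The distribution parameters are illustrative values (echoing the second simulation example later in the paper), not prescribed here.

```python
import numpy as np

def log_posterior_ratio(x, mu1, S1, mu2, S2, p1, p2):
    """Quadratic discriminant g1(x) = log P(th1|x)/P(th2|x), eq. (4)."""
    q1 = (x - mu1) @ np.linalg.inv(S1) @ (x - mu1)
    q2 = (x - mu2) @ np.linalg.inv(S2) @ (x - mu2)
    return (-0.5 * (q1 - q2) + np.log(p1 / p2)
            - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)))

def posterior(x, mu1, S1, mu2, S2, p1, p2):
    """P(th1|x) computed directly from eq. (1), (2) and Bayes' rule."""
    def density(x, mu, S):
        d = len(mu)
        z = x - mu
        return (np.exp(-0.5 * z @ np.linalg.inv(S) @ z)
                / np.sqrt((2 * np.pi) ** d * np.linalg.det(S)))
    a = p1 * density(x, mu1, S1)
    b = p2 * density(x, mu2, S2)
    return a / (a + b)

# Illustrative parameters: two 2-D normal categories with unequal priors.
mu1, S1 = np.array([1.0, 1.0]), np.array([[2.0, 1.0], [1.0, 1.0]])
mu2, S2 = np.array([-1.0, -2.0]), np.array([[1.0, 0.5], [0.5, 1.0]])
p1, p2 = 0.7, 0.3
x = np.array([0.3, -0.2])

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # the logistic function
g1 = log_posterior_ratio(x, mu1, S1, mu2, S2, p1, p2)

# Eq. (3): the logistic transform of the log ratio is the posterior itself.
assert np.isclose(sigma(g1), posterior(x, mu1, S1, mu2, S2, p1, p2))
```

The identity holds exactly: σ(log(a/b)) = a/(a+b) for any positive a, b.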
The activation function of the output unit of our neural network is the logistic function and the network is trained so as to approximate (3). Hence, the inner potential of the output unit approximates the quadratic form (4).

The Mahalanobis distance with respect to the normal distribution N(μ_i, Σ_i) is defined by

d_i(x - y) = (x - y)^t \Sigma_i^{-1} (x - y), \qquad i = 1, 2.

Hence,

g_2(x) = -\frac{1}{2}\left\{(x-\mu_1)^t \Sigma_1^{-1} (x-\mu_1) - (x-\mu_2)^t \Sigma_2^{-1} (x-\mu_2)\right\}    (5)

can be a Mahalanobis discriminant function. This can be obtained by removing the constant

\log \frac{P(\theta_1)}{P(\theta_2)} - \frac{1}{2}\log \frac{|\Sigma_1|}{|\Sigma_2|}    (6)

from (4) as stated in [7].

3 Difficulty of Learning Posterior Probabilities

A one-hidden-layer neural network outputs a linear sum of the form

\sum_{i=1}^{n} a_i g(w_i \cdot x + t_i) + a_0,

where a_0, a_i and t_i are constants and w_i = (w_{i1}, ..., w_{id}) are vectors. Since these constants are to be modified while training, we call them parameters. Among them we call w_{i1}, ..., w_{id} and t_i inner parameters, and a_i and a_0 outer parameters. We sometimes identify an approximation formula with the neural network based on that formula. The network above has n(d+2)+1 constants to be optimized. Among them n(d+1) are inner parameters, and n+1 are outer parameters.

Figure 1 shows the difficulty of learning with a dichotomic random sample. The space of patterns is one-dimensional.

Learning of Bayesian Discriminant Functions by a Layered Neural Network 241
Fig. 1. Dichotomic samples and the posterior probability. The abscissa is the x-axis.

The dots in Figure 1 are random samples from {(x, ξ(x, θ))}, x ∈ R, θ ∈ Θ = {θ_1, θ_2}, where ξ(x, θ_1) = 1 and ξ(x, θ_2) = 0. The probability that ξ(x, θ) = 1 is P(θ_1|x) = E[ξ(x, θ_1)|x]. For simple figurative representation, it is based on one-dimensional normal distributions. The smooth curve is the posterior probability, which the neural network has to approximate. In the higher-dimensional case, the network has to approximate a (hyper)surface. If the teacher signals are {(x, P(θ_1|x))}, P(θ_1|x) = E[ξ(x, θ)|x], then the learning is an ordinary function approximation by a neural network. However, the teacher signals are {(x, ξ(x, θ))} for our neural network. Comparing the two sets of teacher signals, one may understand that approximation of P(θ_1|x), using the random samples from {(x, ξ(x, θ))}, is more difficult.

For approximation of the quadratic form (4), Funahashi [2] has proposed an approximation formula

\sum_{i=1}^{2d} a_i g(w_i \cdot x + t_i) + a_0.    (7)

Since d is the dimension of the space of the pattern x, this approximation formula has 2d(d+2)+1 constants to be optimized. Among them, 2d(d+1) constants are inner constants. Actually, an approximation formula

\sum_{i=0}^{d} a_i g(w_i \cdot x) + a_0    (8)

having a smaller number of activation functions can approximate the quadratic form [3], [5]. In this formula, the number of constants to be optimized is (d+1)^2 + 1 and that of the inner constants is d(d+1). Since one of the activation functions is to approximate a linear form in x, the approximation formula (8) can be replaced by

\sum_{i=1}^{d} a_i g(w_i \cdot x) + w_0 \cdot x + a_0.    (9)

The values of the constants are of course distinct in (8) and (9), though the same notation is used. In (9), the numbers of activation functions and parameters are respectively decreased to d and (d+1)^2. Hence, we used (9) in [6] and [7], expecting that the parameter space would be simplified remarkably. Nevertheless, the network based on (9) was successful only in the case where the probability distributions are simple. In particular, training of the inner parameters was very difficult. This convinced us that, to overcome the difficulties, there is no way other than decreasing the number of inner parameters.
4 Modified One-Hidden-Layer Neural Network

In this section we construct a neural network having more hidden-layer units but fewer parameters to be optimized. There are unit vectors u_k, k = 1, ..., ½d(d+1), in R^d such that (u_k · x), k = 1, ..., d, are linearly independent and (u_k · x)^2, k = 1, ..., ½d(d+1), are also linearly independent. Here we illustrate an example. Suppose that u_k, k = 1, ..., ½d(d+1), are distinct unit vectors in R^d such that, for k = 1, ..., d, the k-th element is 1 and the others are 0, and, for k = d+1, ..., ½d(d+1), two elements are equal to 1/√2 and the others are 0. Then they satisfy the condition [3]. Hence, any quadratic form Q(x) in R^d can be expressed as

Q(x) = \sum_{k=1}^{\frac{1}{2}d(d+1)} a_{2k}(u_k \cdot x)^2 + \sum_{k=1}^{d} a_{1k}(u_k \cdot x) + a_0.

Though this expression can be generalized to polynomials of higher degree [3], this is enough for this paper.

Let u be a unit vector in R^d. For a probability measure μ in R^d, we denote by μ_u its marginal distribution on the straight line {tu : -∞ < t < ∞}. Then we have the theorem below.
Theorem 1. Let μ be a probability measure on R^d and let g ∈ C^N(R) (N ≥ 2) be such that g^{(i)}(0) ≠ 0 (1 ≤ i ≤ 2). Suppose that there exists a function g_0 on R such that, for any δ (|δ| < 1),

|g^{(i)}(\delta t)\, t^i| < g_0(t), \qquad 0 \le i \le 2,

and, for any unit vector u in R^d, g_0 ∈ L^p(R, μ_u) (1 ≤ p ≤ ∞). Then, for any quadratic form Q and any ε > 0, there exist coefficients a_{ik} and constants δ_i (0 ≤ i ≤ 2, 1 ≤ k ≤ ½d(d+1)) for which

\left\| \frac{\partial^\alpha}{\partial x^\alpha} Q - \frac{\partial^\alpha}{\partial x^\alpha} \bar{Q} \right\|_{L^p(R^d, \mu)} < \varepsilon, \qquad |\alpha| \le N,

where

\bar{Q}(x) = \sum_{k=1}^{\frac{1}{2}d(d+1)} a_{2k}\, g(\delta_2 u_k \cdot x) + \sum_{k=1}^{d} a_{1k}\, g(\delta_1 u_k \cdot x) + a_0    (10)

and the u_k are unit vectors satisfying the condition above.

This theorem is a special case of a theorem in [3]. Though we do not have enough space to prove it here, one may imagine that (10) comes from the expression of the quadratic form Q above. The condition on g and μ is to suppress the L^p-norm in a neighborhood of infinity. If the activation function and its derivatives are bounded and the probability measure is rapidly decreasing, this condition is always satisfied.
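The construction of the vectors u_k from Section 4 can be checked numerically. The sketch below builds the ½d(d+1) unit vectors for d = 3, verifies the linear independence of the functions (u_k · x)^2, and recovers coefficients a_{2k} representing an arbitrary quadratic form exactly.

```python
import numpy as np
from itertools import combinations

d = 3
D = d * (d + 1) // 2

# The unit vectors described in the text: e_k for k = 1..d, then
# (e_i + e_j)/sqrt(2) for each pair i < j.
U = [np.eye(d)[k] for k in range(d)]
U += [(np.eye(d)[i] + np.eye(d)[j]) / np.sqrt(2)
      for i, j in combinations(range(d), 2)]
U = np.array(U)                      # D x d
assert len(U) == D

# Check linear independence of the functions (u_k . x)^2 by evaluating
# them at D random points and testing that the resulting matrix is regular.
rng = np.random.default_rng(1)
X = rng.normal(size=(D, d))
M = (X @ U.T) ** 2                   # M[m, k] = (u_k . x_m)^2
assert np.linalg.matrix_rank(M) == D

# Hence any quadratic form Q(x) = x^t A x is matched by coefficients a_2k.
A = rng.normal(size=(d, d)); A = (A + A.T) / 2
q = np.array([x @ A @ x for x in X])
a2 = np.linalg.solve(M, q)
x_new = rng.normal(size=d)
assert np.isclose(((U @ x_new) ** 2) @ a2, x_new @ A @ x_new)
```

Because the D functions (u_k · x)^2 span the D-dimensional space of quadratic forms, interpolation at D points in general position determines the representation everywhere.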
Since the second sum in (10) is to approximate a linear form, the approximation formula can be modified to

\bar{Q}(x) = \sum_{k=1}^{\frac{1}{2}d(d+1)} a_{2k}\, g(\delta u_k \cdot x) + \sum_{k=1}^{d} a_{1k} x_k + a_0.    (11)

When δ is infinitesimal, (11) can coincide with the quadratic form Q. This approximation formula can be realized by a one-hidden-layer neural network having direct connections between the input and output layers. Figure 2 illustrates such a neural network.

Fig. 2. A one-hidden-layer neural network having direct connections between the input layer and output unit. D = ½d(d+1).

Since the vectors u_k are fixed beforehand, the number of the parameters to be optimized in (11) is ½(d^2 + 3d + 4). Moreover, only one of them is an inner parameter and the others are outer parameters. This must make the parameter space very simple. We performed simulations to confirm the usefulness of the network. The network has ½d(d+1) hidden units. The linear sum Σ_{k=1}^{d} a_{1k} x_k is input to the output unit via the direct connections. Hence, the inner potential of the output unit of the neural network illustrated in Figure 2 can approximate (10). If the activation function of the output unit is the logistic function, then, by (3) and (4), it can approximate the posterior probability P(θ_1|x) [2].
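A minimal sketch of the network based on (11), assuming the u_k construction of Section 4 and random placeholder values for the outer parameters:

```python
import numpy as np
from itertools import combinations

d = 2
D = d * (d + 1) // 2                     # number of hidden units

# Fixed unit vectors u_k (chosen beforehand, not trained).
U = [np.eye(d)[k] for k in range(d)]
U += [(np.eye(d)[i] + np.eye(d)[j]) / np.sqrt(2)
      for i, j in combinations(range(d), 2)]
U = np.array(U)

g = lambda t: 1.0 / (1.0 + np.exp(-t))   # logistic function

# Trainable parameters of formula (11): one inner (delta), the rest outer.
rng = np.random.default_rng(0)
a2 = rng.normal(size=D)       # outer: hidden-to-output weights
a1 = rng.normal(size=d)       # outer: direct input-to-output weights
a0 = rng.normal()             # outer: bias
delta = 0.1                   # the single inner (scaling) parameter

def inner_potential(x):
    """Approximation formula (11): the inner potential of the output unit."""
    return a2 @ g(delta * (U @ x)) + a1 @ x + a0

def output(x):
    """Logistic output unit, approximating the posterior P(theta_1 | x)."""
    return g(inner_potential(x))

# Parameter count: D + d + 1 + 1 = (d^2 + 3d + 4)/2.
assert D + d + 2 == (d**2 + 3*d + 4) // 2
```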
5 Training of the Neural Network

Let F(x, w) be the output of the neural network. Training of the network is based on the proposition below by Ruck et al. [9]. As we have repeatedly used this proposition [4-7], we omit details here.

Proposition 2. Set

E(w) = \int_{R^d} \sum_{i=1}^{2} (F(x, w) - \xi(x, \theta_i))^2\, P(\theta_i)\, p(x|\theta_i)\, dx.    (12)

Then,

E(w) = \int_{R^d} (F(x, w) - E[\xi(x, \cdot)|x])^2\, p(x)\, dx + \int_{R^d} V[\xi(x, \cdot)|x]\, p(x)\, dx.    (13)

If ξ(x, θ_1) = 1 and ξ(x, θ_2) = 0, then E[ξ(x, ·)|x] = P(θ_1|x). The integrand of the second integral is the conditional variance of ξ and independent of the weight vector w. Hence, when E(w) is minimized, the output F(x, w) is expected to approximate P(θ_1|x) if the network has the capability of approximating the posterior probability. Accordingly, learning of the network is carried out by minimizing

E_n(w) = \frac{1}{n}\sum_{k=1}^{n} (F(x^{(k)}, w) - \xi(x^{(k)}, \theta^{(k)}))^2,    (14)

where {(x^{(k)}, ξ(x^{(k)}, θ^{(k)}))}_{k=1}^{n}, (x^{(k)}, θ^{(k)}) ∈ R^d × Θ, is the training set. This method of training has been treated by many authors [2,4-10]. In the case of ordinary approximation of functions, what is to be minimized is

E_n(w) = \frac{1}{n}\sum_{k=1}^{n} (F(x^{(k)}, w) - f(x^{(k)}))^2.    (15)

It is obvious that minimization of (14) is more difficult than that of (15).

6 Simulations

Unlike ordinary one-hidden-layer neural networks, the network based on (11) can learn two-dimensional Bayesian discriminant functions. We have confirmed this by simulation. Figs. 3a and 3b illustrate the probability density functions of the normal distributions N(μ_1, Σ_1) and N(μ_2, Σ_2) used in one of the simulations, where μ_2 = (0, 0) and

\Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

These are the state-conditional probability density functions of the categories θ_1 and θ_2 respectively. Fig. 3c illustrates the posterior probability of the category θ_1 in the case of the prior probabilities P(θ_1) = 0.7 and P(θ_2) = 0.3. As stated in Section 2, this can be used as a Bayesian discriminant function. The neural network was trained with 1000 teacher signals.
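The generation of dichotomic teacher signals and the empirical error (14) can be sketched as follows. The priors, μ_2 and Σ_2 follow the first simulation example above; μ_1 and Σ_1 are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Priors and distribution parameters: p1, p2, mu2, S2 as in the first
# simulation example; mu1 and S1 are placeholder assumptions.
p1, p2 = 0.7, 0.3
mu1, S1 = np.array([1.0, 1.0]), np.array([[2.0, 1.0], [1.0, 1.0]])
mu2, S2 = np.array([0.0, 0.0]), np.eye(2)

def sample_teacher_signals(n):
    """Draw {(x^(k), xi^(k))}: xi = 1 for category theta_1, 0 for theta_2."""
    labels = rng.random(n) < p1
    xs = np.where(labels[:, None],
                  rng.multivariate_normal(mu1, S1, n),
                  rng.multivariate_normal(mu2, S2, n))
    return xs, labels.astype(float)

xs, xis = sample_teacher_signals(1000)

def E_n(F, w):
    """Empirical error (14) for a network output F(x, w)."""
    return np.mean((np.array([F(x, w) for x in xs]) - xis) ** 2)
```

For a constant predictor F = 0.5, E_n is exactly 0.25, since every ξ is 0 or 1; a trained network should do better by approaching P(θ_1|x).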
Fig. 3. a and b are the state-conditional probability density functions of Categories θ_1 and θ_2 respectively, and c is the posterior probability.

Fig. 4 illustrates an example of the learning process. Fig. 4a is the output of the network with the initial values of the parameters. It converged to Fig. 4c via Fig. 4b. The unit vectors are u_1 = (1, 0), u_2 = (0, 1) and u_3 = (1/√2, 1/√2). While learning, the parameters a_0, a_{11}, a_{12}, a_{21}, a_{22}, a_{23} were optimized, but the value of the constant δ did not change much. This probably implies difficulty of training of the inner parameter. Even this neural network is not always successful in learning: its success depends on the choice of the initial values of the parameters. In particular, if the initial output of the network is a flat surface, learning rarely proceeds.

Fig. 4. Learning process. The target is the surface illustrated in Fig. 3c.

One may recognize slight differences between the Bayesian discriminant function theoretically obtained (Fig. 3c) and that obtained by simulation (Fig. 4c). These are probably because the teacher signals are random variables and the inner parameter δ is not infinitesimal. The classification capabilities of the two discriminant functions (Fig. 3c and Fig. 4c) are compared in Table 1. By the method stated in [7], we can easily convert the Bayesian discriminant function (4) to the Mahalanobis discriminant function (5), because both are quadratic forms distinct from each other by a constant. Hence, in the table the capabilities of the Mahalanobis discriminant function theoretically obtained and that obtained by simulation are also compared. The test signals are distinct from those used for training. In this table, the numbers in the row of "Alloc. to θ_1 (resp. θ_2)" are the numbers of patterns allocated to category θ_1 (resp. category θ_2), and in the row of "Correct Alloc." is the total number of correctly allocated patterns. The columns TS, BT, BS, MT and MS stand for the test signals (TS), the Bayesian discriminant functions theoretically obtained (BT, Fig. 3c) and obtained by simulation (BS, Fig. 4c), and the Mahalanobis discriminant functions theoretically obtained (MT) and obtained by simulation (MS) respectively. Only in the column TS are all signals correctly allocated. The BT allocated 843 (resp. 157) patterns to Category θ_1 (resp. θ_2), among which 175 (resp. 37) patterns were to be allocated to

Table 1. Classification results in the first example (See Text)

                  TS     BT         BS         MT         MS
Alloc. to θ1      705    843 (175)  861 (187)  614 (87)   741 (109)
Alloc. to θ2      295    157 (37)   139 (31)   386 (178)  259 (150)
Correct Alloc.    1000   788        782        735        741
Table 2. Classification results in the second example (See Text)

                  TS     BT         BS         MT         MS
Alloc. to θ1      708    717 (32)   706 (26)   687 (17)   664 (11)
Alloc. to θ2      292    283 (23)   294 (28)   313 (38)   336 (55)
Correct Alloc.    1000   945        946        945        934
Category θ_2 (resp. θ_1). Since the distributions of the two categories are overlapping, even the BT can allocate correctly only 1000 − (175 + 37) = 788 patterns among 1000. However, the allocations by the BT and BS coincide at 982 patterns among 1000 (not shown in Table 1), and those by MT and MS at 924. This implies that the learning is successful.

Another example is illustrated in Table 2, where μ_1 = (1, 1), μ_2 = (−1, −2) and

\Sigma_1 = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}.

In this case the two probability distributions are rather separated. Hence, the classifications by the respective discriminant functions are more successful. Moreover, the allocations by BT and BS coincide at 989 patterns among 1000, and those by MT and MS at 977.

7 Conclusion and Discussions

We have often experienced that the ordinary layered neural network based on the approximation formula (9) fails in approximating the posterior probability unless the probability distribution is very simple. In the case where the dimension of the patterns x is higher than or equal to 2, it looks impossible for the network to learn the posterior probability. However, the neural network based on (11) performs the task better even in the two-dimensional case. We think that this is because the latter has a smaller number of parameters to be optimized and, more importantly, they are mostly outer parameters.

Our network has more hidden-layer units than the minimum requirement. This contradicts the general belief that a neural network having a smaller number of units works better. However, the number of the parameters to be trained, and whether they are inner or outer, may be more directly related to the complexity of learning. Since our network has fewer parameters to be trained and most of them are outer parameters, it is understandable that the network works better.

When the network is based on (11), the error function (12) usually has no global minimum, because the output F(x, w) converges to the posterior probability P(θ_1|x) as the scaling parameter δ tends to infinitesimal. Hence, we expected that δ would converge toward zero while training, but it usually stays in the neighborhood of the initial value. The network often realizes the approximation using the initially given value of δ with a slight modification. This probably implies difficulty of training of inner constants with random dichotomic teacher signals. This may in turn justify our policy of decreasing the number of inner parameters to be trained, even if it increases the number of hidden-layer units. In short, the inner parameters are converted to outer parameters in our neural network.

The flexibility of the basis functions may characterize the neural network. It makes neural networks useful and powerful tools for multiple purposes. One of the standard methods of approximating functions is to use the basis functions of a complete orthonormal system. In that case only the coefficients of the basis functions are optimized. Approximation by our neural network is an intermediate method between the two because the basis functions are flexible only to some extent.

Since the space of target functions is restricted to quadratic forms, the flexibility of the basis functions can be restricted in this paper. However, the usefulness of the neural network is not limited to this special case. The approximation based on Theorem 1 is on the whole space in the L^p sense, and the original theorem in [3] is on approximation of polynomials of arbitrary degree. Hence, by the polynomial approximation theorem, our neural network may be useful in approximating general continuous functions when the probability measure is rapidly decreasing. We think that the idea of making learning easier by converting inner parameters to outer parameters can be widely applied.

References

1. Duda, R.O., Hart, P.E.: Pattern classification and scene analysis. John Wiley & Sons, New York (1973)
2. Funahashi, K.: Multilayer neural networks and Bayes decision theory. Neural Networks 11, 209–213 (1998)
3. Ito, Y.: Simultaneous approximations of polynomials and derivatives and their applications to neural networks (submitted)
4. Ito, Y., Srinivasan, C.: Multicategory Bayesian decision using a three-layer neural network. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 253–261. Springer, Heidelberg (2003)
5.
Ito, Y., Srinivasan, C.: Bayesian decision theory on three-layer neural networks. Neurocomputing 63, 209–228 (2005)
6. Ito, Y., Srinivasan, C., Izumi, H.: Bayesian learning of neural networks adapted to changes of prior probabilities. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 253–259. Springer, Heidelberg (2005)
7. Ito, Y., Srinivasan, C., Izumi, H.: Discriminant analysis by a neural network with Mahalanobis distance. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4132, pp. 350–360. Springer, Heidelberg (2006)
8. Richard, M.D., Lippmann, R.P.: Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation 3, 461–483 (1991)
9. Ruck, M.D., Rogers, S., Kabrisky, M., Oxley, H., Sutter, B.: The multilayer perceptron as approximator to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks 1, 296–298 (1990)
10. White, H.: Learning in artificial neural networks: A statistical perspective. Neural Computation 1, 425–464 (1989)

RNN with a Recurrent Output Layer for Learning of Naturalness

Ján Dolinský and Hideyuki Takagi

Kyushu University, 4-9-1 Shiobaru, Minami-ku, Fukuoka, 815-8540 Japan
jan@plazma.sk, takagi@design.kyushu-u.ac.jp

Abstract. The behavior of recurrent neural networks with a recurrent output layer (ROL) is described mathematically, and it is shown that using ROL is not only advantageous but in fact crucial to obtaining satisfactory performance for the proposed naturalness learning. Conventional belief holds that employing ROL often substantially decreases the performance of a network or renders the network unstable, and ROL is consequently rarely used. The objective of this paper is to demonstrate that there are cases where it is necessary to use ROL. The concrete example shown models naturalness in handwritten letters.

1 Introduction

In engineering, recurrent neural networks (RNN) have not often been proposed as a promising solution. The difficulties with training an RNN have been overcome, and recent theoretical advances in the field have made training an RNN quicker and easier [4]. Recurrent connections have not been found to increase the approximation capabilities of the network [7]. Nevertheless, we may obtain decreased complexity, network size, etc. while solving the same problem.

In some applications, such as speech recognition, object detection or prediction, our classification / prediction at time t should be more accurate if we can account for what we saw at earlier times. The most common approach to modeling such systems is to use a suitably powerful feed-forward network and feed it with a finite history of the inputs and optionally the outputs through a sliding window. Early attempts to improve this often tedious technique resulted in various network architectures based on the feed-forward topology, with the recurrent connections being set to fixed values to ensure that the backpropagation rule can be used [1][6]. Much work has been done on autonomous Hopfield networks as well as on training algorithms that can be applied to an RNN in a feed-forward fashion (i.e. BPTT).

Recent theoretical advances in RNN research, such as Echo State Networks (ESN), afford the modeling of fully general topologies which were difficult to train directly with the former techniques. An interesting example is a topology where the output units have connections from not only the internal units but also the input units and the output units themselves, yielding a recurrent output layer (ROL). Although connections from the input units are often used, connections from the output layer are rare. In the following

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 248–257, 2008. © Springer-Verlag Berlin Heidelberg 2008
chapters, we explain what behavior ROL implies, introduce the concept of our proposed naturalness learning, and show that using ROL not only increases the performance but is actually an intrinsic part of modeling with the proposed naturalness learning.

One of the earliest RNNs in which the output activation values from the previous step were used to compute the output activation values in the next step was the Jordan network [6][5]. In the Jordan network, the activation values of the output units are fed back into the input layer through a set of extra input units called the state units. This type of network is called an output-recurrent network. Various modifications to output-recurrent networks have been proposed and successfully used for modeling difficult non-linear tasks [8]. An RNN with ROL, in contrast to an output-recurrent network, uses the output activation values of the previous step directly to compute the output in the next step. The output activation values of the previous step can be thought of as extra hidden units. We are aware of no papers discussing applications of RNN with ROL.

2 Dynamics of RNN with a Recurrent Output Layer

Adopting a standard perspective of system theory, we view a deterministic, discrete-time dynamical system as a function G which yields the next system output, given an input and the output history:

y(n+1) = G(..., u(n), u(n+1); ..., y(n-1), y(n))    (1)

where u(n) is the input vector and y(n) is the output vector for the time step n. The echo-state approach enables us to approximate systems represented by G directly, without the necessity of converting a time series into static input patterns by the sliding window technique [2].

Consider a discrete-time ESN [4] consisting of K input units with an activation vector u(n) = (u_1(n), ..., u_K(n))^t, N internal units with an activation vector x(n) = (x_1(n), ..., x_N(n))^t, and L output units with an activation vector y(n) = (y_1(n), ..., y_L(n))^t, where ^t denotes the transpose. The corresponding input, internal and output connection weights are collected in the N × K, N × N and L × (K + N + L) weight matrices W^in, W and W^out respectively. Optionally, an N × L matrix W^back may be used to project the output units back to the internal units. The internal units' activation is computed according to

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n))    (2)

where f denotes the component-wise application of the transfer (activation) function to each internal unit. The output is computed as

y(n+1) = f^out(W^out (u(n+1), x(n+1), y(n)))    (3)

where (u(n+1), x(n+1), y(n)) represents the concatenated vector consisting of the input, internal and output activation vectors. The concatenated vector often consists only of the input and internal activations, or of the internal activation only. Fig. 1 shows the architecture of an ESN. See [4] for further details concerning the training of an ESN.

Fig. 1. Echo-state network: the dotted lines plot connections which can be trained, the gray lines plot connections which are optional.

A closer look at Eq. (3) reveals that a system output y(n+1) is constructed from the given input and output history via two distinct mechanisms: from the activation vectors of the internal units x(n) indirectly (by computing x(n+1) via Eq. (2)), and optionally from the activation vectors of the input units u(n+1) and/or output units y(n) directly. The internal units' activation x(n+1) is computed using the input and output activations u(n+1), y(n) and the activation x(n) of the internal units from the previous step, which recursively reflects the influence of input and output activations from previous steps. We can therefore rewrite Eq. (2) as

x(n+1) = E(..., u(n), u(n+1); ..., y(n-1), y(n))    (4)

where E depends on the history of the input signal u and the history of the desired output signal y itself; thus in each particular task, E shares certain properties with the desired output and/or given input. How strongly the internal activation x(n+1) is influenced by the activations u(n+1), y(n) and x(n) (which recursively consists of previous input/output activations) is controlled by the size of the weights in the matrices W^in, W^back and W respectively. Algebraic properties of the matrix W are particularly important for the short-term memory property of an ESN [3].
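One update step of an ESN with a recurrent output layer, following Eqs. (2) and (3), can be sketched in numpy as below. Assumptions not fixed by the text: tanh is used for both f and f^out, the weights are random and untrained, and the spectral-radius scaling of W is standard ESN practice.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 2, 20, 1                       # input, internal, output units

W_in   = rng.uniform(-0.5, 0.5, (N, K))
W      = rng.uniform(-0.5, 0.5, (N, N))
W     *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
W_back = rng.uniform(-0.5, 0.5, (N, L))
W_out  = rng.uniform(-0.5, 0.5, (L, K + N + L))  # reads input, internal, output

def step(x, y, u_next):
    """One update of an ESN with a recurrent output layer."""
    # Eq. (2): internal state from input, previous state, fed-back output.
    x_next = np.tanh(W_in @ u_next + W @ x + W_back @ y)
    # Eq. (3): output from the concatenated (input, internal, output) vector;
    # the trailing y makes the output layer recurrent (ROL).
    y_next = np.tanh(W_out @ np.concatenate([u_next, x_next, y]))
    return x_next, y_next

x, y = np.zeros(N), np.zeros(L)
for n in range(5):
    u = rng.normal(size=K)
    x, y = step(x, y, u)
```

In a trained ESN only W_out would be learned; the sketch only exercises the update dynamics.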
Besides using the activations of the internal units, it is sometimes advantageous to also use the activations of the input and output units directly. Although the activation vector x(n) reflects the history of the desired output and/or given input, the activation vectors u(n+1), y(n) in Eq. (3) are used merely as another form of input. This usage corresponds to connecting the input units to the output units, and the output units to the output units themselves, directly. Direct connection of input units to output units is often used, whereas direct connection of output units to output units is rare. It is the connecting of output units to each other that makes the output layer recurrent.

ROL implies a substantial influence of the previously generated output y(n) on the successive output y(n+1). The activation y(n) is only an approximation of the system output at step n and, thus, it is always generated with a certain error. This error is included in the computation of the successive output activation y(n+1) and can easily accumulate with each update step. It is for this reason that computation using ROL has been rare.

3 RNN with a Recurrent Output Layer for Naturalness Learning

In this section, we demonstrate the principles of naturalness learning by showing how to express and model the unique quality of handwritten letters. We also explain why ROL works well with naturalness learning.

The style of writing of any given person is very much individual and can be distinguished easily from mechanically or electronically printed text. Moreover, we can distinguish between the writing of different people. Everybody learns their alphabet in school, and while writing in one's own individual way, a person is trying to approximate the shapes of the letters as they learned them in school. We can therefore understand handwriting as turning the basic shape of a letter as learned in school into the writer's particular, unique, handwritten form. A human can then be seen as a 'filter' which adds his or her own characteristics to the shapes of those basic letters.

To explain the term naturalness, first we must define our terminology. Let us refer to the letters used in textbooks (either printed or cursive) as the font letters (fontL). We view handwriting as the process of turning a fontL into its handwritten form.
The unique quality of the handwritten letter (handL) can then be understood as the difference between the handL and the fontL, and expressed as a 2-D displacement vector field of evenly spaced points along the strokes of the font and its respective handwritten version (see Fig. 2)^1. We refer to this difference as naturalness. In other words, adding the naturalness to the fontL will result in a handL of a unique form. We can thus assume a relation between the fontL and the naturalness. This assumption enables us to model naturalness by a system which employs certain characteristics of the fontL as its input. The task of naturalness learning is to find and learn the relation between font letters and naturalness. Speaking in terms of naturalness learning, the target system (handwritten letters) resembles the basic system (font letters), with its behavior (shape of handwritten letters) deviating from the basic system to a certain extent.

Learning and modeling naturalness using an RNN with ROL produced interesting results. As mentioned in Section 2, computing with ROL is prone to the accumulation of error from previous steps. This phenomenon has been found harmful in many tasks. With naturalness learning, however, ROL performs well.

^1 A displacement vector is not the only means of expressing naturalness; it can be expressed using an arbitrary mechanism.

Fig. 2. Hiragana letter /ka/. Naturalness expressed by a 2-D displacement vector field. Font letter strokes shown in black, handwritten strokes shown in gray.

In the handwriting task, we found that introducing the activation y(n) into play via the update Eq. (3) always helped the network to generate y(n+1) with greater accuracy. An intuitive explanation is as follows. The way a person writes the first half of a handwritten stroke influences, to a certain extent, how the second part is going to look, i.e. distortion of a certain part of a stroke usually implies some other distortion in a successive part of the stroke. It is therefore reasonable to assume that a short sequence of points on a handwritten stroke influences where the next point will appear, with the very last point of the sequence having the greatest influence. The same holds for the naturalness extracted as a difference between handL and fontL.

On the other hand, it is not only the very last point which influences the position of the next point of a stroke, but a short sequence of the previous points. Thus, a backprojection matrix W^back was used so as to ensure that the recent short sequence of generated output ..., y(n-1), y(n) is also reflected in y(n+1) via the activation x(n+1) (see Eq. (2) and Eq. (3)). The same principle holds for the input activations u(n+1) extracted from fontL strokes. Update Eq. (2) ensures that a short sequence ..., u(n), u(n+1) is reflected in the activation x(n+1), which is in turn used to compute y(n+1) via update Eq. (3). The activation u(n+1) is also used directly in Eq. (3) to ensure that the very last point of the input sequence has a significant influence on the computation of y(n+1).

The fact that the naturalness is being modeled, instead of the target system, allows us to control the amount of naturalness being added to the font letters. A weight of value 1 will render generated letters as close to a person's handwriting as possible. A value of, say, 0.6 will reduce the naturalness to 60%, providing us with neat handwritten letters. Values close to 0 will render generated letters very close to the font letters. It is also possible to combine several individuals' naturalness (e.g. 40% of person A's naturalness with 60% of person B's naturalness).
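The weighting of naturalness described above can be sketched directly; the stroke points below are toy values, not data from the paper.

```python
import numpy as np

def add_naturalness(font_pts, displacement, weight=1.0):
    """Scale the displacement field and add it back to the font stroke.

    weight = 1.0 reproduces the handwriting, 0.6 keeps 60% of the
    naturalness, and 0.0 returns the font letter unchanged.
    """
    return font_pts + weight * displacement

# Toy stroke: the naturalness is the handL - fontL displacement field.
font_pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
hand_pts = np.array([[0.1, 0.2], [1.0, 0.3], [1.9, 0.1]])
naturalness = hand_pts - font_pts

assert np.allclose(add_naturalness(font_pts, naturalness, 1.0), hand_pts)
assert np.allclose(add_naturalness(font_pts, naturalness, 0.0), font_pts)

# Blending two writers: 40% of A's naturalness with 60% of B's.
nat_A, nat_B = naturalness, -naturalness
blended = add_naturalness(font_pts, 0.4 * nat_A + 0.6 * nat_B)
```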
The naturalness learning approach is not limited to the handwriting task. We believe one can generate natural-looking movements for parts of the human body with an inverse kinematics algorithm employed as the basic system. Generating emotional human speech, with synthesized speech used as the basic system, might be another interesting application of the naturalness learning approach.

4 Experiments

In this section, we demonstrate how the naturalness learning approach copes with the handwriting task. The letters used in the experiments are symbols of the Japanese syllabary, hiragana. The fontL were extracted from the Bitstream Vera Sans font onto a 250×250 pixel canvas. Every stroke of a given letter was turned into a set of Bézier curves, i.e. a path. Every path was then evenly spaced into a set of points. Let each P^{fl}_{ij} be an n_{ij} × 2 matrix, where i denotes the index of the letter, j the index of the stroke within the letter, and n_{ij} is the number of points. P^{fl}_{ij}(k) = (x_k, y_k) therefore represents the k-th point of the j-th stroke of the i-th letter (the k-th row of the matrix P^{fl}_{ij}).

The handL were first scanned and then appropriately scaled so as to ensure they fit the canvas. Every handL stroke was then aligned to its fontL stroke counterpart. This alignment ensures that the displacement vector field expresses only a shape transformation between a pair of fontL and handL strokes. In order to split the strokes of handL into points, the same procedure was applied as with the fontL, saving the points for each stroke into P^{hl}_{ij}. The spacing interval along each stroke (path) of a handL was, however, also adjusted so that the number of points matched that of the corresponding fontL stroke.

4.1
Data Specification

The input signals were extracted from the points of the fontL's strokes as follows. Let D^{fl}_{ij}(k) be the difference vector P^{fl}_{ij}(k + 1) − P^{fl}_{ij}(k). Then the inertia for the k-th point of the j-th stroke of the i-th letter is given by

    inertia_{ij}(k) = D^{fl}_{ij}(k)    (5)

with each inertia_{ij}(k) being the k-th row of the (n_{ij} − 1) × 2 matrix inertia_{ij}.
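The two preprocessing steps above, re-spacing a stroke path into evenly spaced points (Section 4) and taking successive difference vectors (Eq. (5)), can be sketched as follows. This is an assumed implementation using linear interpolation along the polyline, not necessarily the authors' exact procedure:

```python
import numpy as np

def evenly_spaced(points, n):
    """Resample a 2-D polyline into n points equally spaced by arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)   # segment lengths
    s = np.concatenate(([0.0], np.cumsum(seg)))             # cumulative arc length
    t = np.linspace(0.0, s[-1], n)                          # target arc lengths
    return np.column_stack([np.interp(t, s, points[:, 0]),
                            np.interp(t, s, points[:, 1])])

def inertia(P):
    """Difference vectors D(k) = P(k+1) - P(k): an (n-1) x 2 matrix (Eq. (5))."""
    return np.diff(np.asarray(P, dtype=float), axis=0)

stroke = [[0, 0], [2, 0], [2, 2]]          # an L-shaped path of total length 4
P = evenly_spaced(stroke, 5)               # 5 points, one arc-length unit apart
D = inertia(P)                             # 4 difference vectors
```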
The inertia can be thought of as the motion of an imaginary pen which ‘wrote’ the font letter. The curvature for the k-th point of the j-th stroke of the i-th letter is given by

    curv_{ij}(k) = ( D^{fl}_{ij}(k) [ 0 1 ; −1 0 ] D^{fl}_{ij}(k + 1)^t ) / sqrt( D^{fl}_{ij}(k) D^{fl}_{ij}(k)^t · D^{fl}_{ij}(k + 1) D^{fl}_{ij}(k + 1)^t )    (6)

with each curv_{ij}(k) being the k-th row of the (n_{ij} − 1) × 1 matrix curv_{ij}.
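Reading curv_{ij}(k) as the sine of the angle φ between successive difference vectors (cf. Fig. 3), it can be computed with the 2-D cross product normalized by the vector lengths; treating Eq. (6) as exactly this normalized cross-product form is our assumption:

```python
import numpy as np

def curvature(D):
    """sin(phi) between successive difference vectors D(k) and D(k+1).

    The 2-D cross product of the two vectors, divided by the product of
    their lengths, is the sine of the signed angle between them.
    """
    D = np.asarray(D, dtype=float)
    a, b = D[:-1], D[1:]
    cross = a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0]
    return cross / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = curvature(D)   # sin(+90 deg) = 1, then sin(-45 deg) = -1/sqrt(2)
```

The sign of the result encodes the turning direction of the pen.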
Each curv_{ij} and inertia_{ij} pair was merged into a single (n_{ij} − 1) × 3 matrix U_{ij}, with each column being normalized into the interval (−1, 1). In order to partially erase the transient dynamics from the previous stroke, a zero sequence of size n_{gap} × 3 was inserted before every U_{ij}, resulting in the final input matrix U.

Naturalness, which serves as the (2-dimensional) output signal, is represented by the 2-D displacement vector field. The 2-D displacement vector field for the j-th stroke of the i-th letter is given by

    Y_{ij} = P^{hl}_{ij} − P^{fl}_{ij}    (7)

with Y_{ij}, P^{hl}_{ij} and P^{fl}_{ij} each being n_{ij} × 2 matrices. The last row of every Y_{ij} was dropped to ensure each pair of Y_{ij} and U_{ij} has the same length of n_{ij} − 1. Each column of Y_{ij} was scaled down to the interval (−1, 1). A zero sequence of size n_{gap} × 2 was inserted before every Y_{ij}, resulting in the final output matrix Y.

4.2 Network Parameters

A 300-unit network was used with activation function f = RBF. Internal weights in the matrix W were randomly assigned the values 0, 0.2073 and −0.2073 with probabilities 0.95, 0.025 and 0.025. For a 300 × 300 matrix W, this implies a spectral radius of ≈ 0.85, providing for a relatively long short-term memory [4]. 3 input and 2 output units were attached. Input weights were randomly drawn from a uniform distribution over (−1, 1) with probability 0.9 or set to 0. With such an input matrix, the network is strongly driven by the input activations because many elements of the matrix W^in are non-zero. The network had output feedback connections, which were randomly set to one of the three values 0, 0.1 and −0.1 with probabilities 0.9, 0.05 and 0.05. Such a setup of feedback connections makes the network only marginally excited by previous output activations. The activation function for the output units was the identity f^out(x) = x.
Fig. 3. Geometrical meaning of the input data. inertia_{ij}(k) represents the difference vector between the two successive points P^{fl}_{ij}(k + 1) and P^{fl}_{ij}(k); curv_{ij}(k) is the sine of the angle φ.
Fig. 4. Testing with the training data: letters produced by the RNN with/without ROL, with teacher-forcing switched off from n = 301

Fig. 5. Testing with the testing data: letters produced by the RNN with/without ROL

4.3
Training and Testing

The training data was made from the letters shown in Fig. 4, resulting in a 3027 × 3 input matrix U_train and a 3027 × 2 output matrix Y_train, which were prepared according to the data specification with n_gap = 16 being used. Eq. (2) was used for updating, with u(n), y(n) being the transposed n-th rows of the matrices U_train and Y_train respectively. The first 300 steps were discarded, and the network internal states x(n) with input unit states u(n) were collected from n = 301 through n = 3355. The output weights W^out were computed using the internal states and input unit states only. The training errors of the first and second output units were mse_{train,1} ≈ 9.5 × 10^{-4} and mse_{train,2} ≈ 2.8 × 10^{-3} respectively. Making the output layer recurrent (ROL) and computing W^out using not only the internal/input states x(n), u(n) but also the output activation values y(n − 1) reduced the training errors mse_{train,1} and mse_{train,2} down to ≈ 6 × 10^{-4} and ≈ 2.0 × 10^{-3} respectively. A visual comparison is shown in Fig. 4.

The testing data was made from the letters shown in Fig. 5, resulting in a 6045 × 3 input matrix U_test and a 6045 × 2 output matrix Y_test, with n_gap = 16 being used. For the non-ROL topology the test errors were found to be mse_{test,1} ≈ 1.2 × 10^{-2}, mse_{test,2} ≈ 3.5 × 10^{-2}. Using ROL reduced the test errors to mse_{test,1} ≈ 0.9 × 10^{-3}, mse_{test,2} ≈ 3.0 × 10^{-2}. The errors mse_{test,i} provide only a rough indication of network performance. A visual comparison between these two trials is shown in Fig. 5. We can observe that the network also produces appropriate naturalness for letters on which it had not been trained.

5 Discussion

5.1 Network
Here we try to provide an insight into why the setup from section 4 works best. The network has information concerning the shape of the strokes in fontL and handL via its history of input and output activations. The matrices W^in and W from the setup in section 4 make the network strongly driven by the (finite) history of the input. Moreover, using the most recent input activation u(n + 1) in Eq. (3) directly makes the impact even stronger. It is most likely the case that the last several points of the path along which a fontL stroke has been drawn have a substantial impact on the naturalness, and thus also on where the handL stroke is going to continue. An intuitive explanation is that a human, while writing in one's own individual way, is trying to ‘approximate’ a font letter shape as memorized in school. This finding is in line with the basic idea of naturalness learning: to model a target system (handL) by means of a basic system (fontL) and its difference (naturalness) from the target system. The feedback weights W^back
from the setup in section 4 only slightly excite the network with the output history. Surprisingly, using the most recent output activation y(n) in Eq. (3) directly (= ROL) always improved the performance. A plausible explanation is that the already-written part of a handL stroke influences how the part yet to be written will look, with the naturalness of the previous step having the highest relevance to the naturalness generated in the next step (i.e. distortion of a certain part of a stroke usually implies some other distortion to a successive part of the stroke). Feedback connections with larger weights (up to +1) and no ROL were also tested. With this setup the network training error was about the same as in section 4, but driving the network with the testing data rendered worse performance or made the network unstable. To check the robustness of the advantage of ROL, we tested the setup in section 4 with a range of different starting internal weights W. In all cases ROL enabled better performance.
5.2 Data Structure

The naturalness in this paper is represented as a 2-D displacement vector field (Fig. 2). We are, however, not strictly bound to this representation. The naturalness can also be represented as a set of parameters in a system which represents the difference between the target (handL) and the basic (fontL) system. The input data characterizing the basic system (fontL) was made position-independent so as to ensure that the same stroke will be represented by the same data regardless of its starting position. This significantly reduces the complexity of the handwriting task because, while separate strokes are substantially different in shape, some of their small parts are often similar. The short-term memory of a RNN makes the distinction between identical/similar short stroke sequences possible, because the RNN also accounts for points before the identical/similar stroke part.

6 Conclusion

In the handwriting task we showed that, by modeling the target system by means of a basic system and its difference from the target system, a substantial relevance is revealed between the difference produced in step n and that produced in step n + 1. In such a case the usage of a ROL turned out to be advantageous. We would like to confirm these findings by applying naturalness learning to other tasks as well. Modeling the unique individualistic quality of human motion is the next step in confirming the feasibility of both naturalness learning and the usability of a ROL for tasks formulated in terms of naturalness learning.

References

1. Elman, J.L.: Finding structure in time. Cognitive Science: A Multidisciplinary Journal 14(2), 179–211 (1990)
2. Jaeger, H.: The "echo state" approach to analysing and training recurrent neural networks. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 148 (December 2001)
3. Jaeger, H.: Short term memory in echo state networks. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 152 (March 2002)
4.
Jaeger, H.: Supervised training of recurrent neural networks, especially with ESN approach. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 159 (October 2002)
5. Jordan, M.I.: Serial Order: a parallel distributed processing approach. Tech. Rep. 8604, Univ. of California at San Diego, Inst. for Cognitive Science (May 1986)
6. Jordan, M.I.: Attractor dynamics and parallelism in a connectionist sequential machine. In: Eighth Annual Conf. of the Cognitive Science Society, Amherst, MA, USA, pp. 531–546 (August 1986)
7. Krose, B., van der Smagt, P.: Recurrent networks. In: Ch. 5, An Introduction to Neural Networks, Eighth Edition, Univ. of Amsterdam (November 1996)
8. Wang, Y.-C., Chien, C.-J., Teng, C.-C.: Direct adaptive iterative learning control of nonlinear systems using an output-recurrent fuzzy neural network. IEEE Trans. on SMC-B 34(3), 1348–1359 (2004)
Using Generalization Error Bounds to Train the Set Covering Machine

Zakria Hussain and John Shawe-Taylor

Centre for Computational Statistics and Machine Learning
Department of Computer Science
University College, London
{z.hussain,j.shawe-taylor}@cs.ucl.ac.uk

Abstract. In this paper we eliminate the need for parameter estimation associated with the set covering machine (SCM) by directly minimizing generalization error bounds. Firstly, we consider a sub-optimal greedy heuristic algorithm termed the bound set covering machine (BSCM). Next, we propose the branch and bound set covering machine (BBSCM) and prove that it finds a classifier producing the smallest generalization error bound. We further justify empirically the BBSCM algorithm with a heuristic relaxation, called BBSCM(τ), which guarantees a solution whose bound is within a factor τ of the optimal. Experiments comparing against the support vector machine (SVM) and SCM algorithms demonstrate that the approaches proposed can lead to some or all of the following: 1) faster running times, 2) sparser classifiers and 3) competitive generalization error, all while avoiding the need for parameter estimation.

1 Motivation

Two algorithms that use very different mechanisms in order to build their classifiers are the support vector machine (SVM) [2,6,3] and the set covering machine (SCM) [4]. Taking the binary classification task as an example, the first, more familiar, algorithm finds a hyperplane in feature space that maximizes the margin between the two classes of examples. The second looks for a classifier constructed from a small subset of data-derived decision functions consistent with the set of positive examples. Fundamentally these two algorithms differ in their approaches to solving a classification problem.
However, as was argued in [4], it may be more effective (for some learning tasks) to build a classifier from a small subset of data-derived decision functions as opposed to a maximum-margin separating hyperplane.

In this paper, we demonstrate a method of model selection for the SCM without the need for parameter estimation. Instead, we apply the generalization error bound directly to the algorithm in order to determine its output classifier. First we apply the bound directly to the SCM using a greedy approach and call

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 258–268, 2008.
© Springer-Verlag Berlin Heidelberg 2008
the heuristic the bound set covering machine (BSCM). Next we look to globally optimize the bound by proposing the branch and bound set covering machine (BBSCM), and prove that the algorithm will indeed find the hypothesis with the smallest generalization error bound. This algorithm turns out to be too slow experimentally, and so we introduce a relaxation called BBSCM(τ) and prove that its solutions are a factor τ from the optimal.

The structure of this paper is as follows: In the following section we describe the set covering machine (SCM) and state two generalization error bounds for the SCM. Section 3 describes the bound set covering machine (BSCM). In Section 4 we discuss in detail the branch and bound set covering machine (BBSCM). Section 5 describes a heuristic derived from the BBSCM algorithm, called BBSCM(τ). Section 6 proves the optimality of the algorithms proposed. The experimental Section 7 discusses results for several UCI repository data sets, with Section 8 concluding the paper.

2 The Set Covering Machine

Definition 1. Let S = {(x_1, y_1), . . . , (x_m, y_m)} be a sample where x_i ∈ X = P ∪ N is a training example and y_i ∈ {−1, +1} its classification. We use the convention x = (x_1, . . . , x_n) for each training example and refer to X as the training set. Let P be the set of positive (+1) and N the set of negative (−1) training examples in the conjunction case, and let P be the set of negative (−1) and N the set of positive (+1) training examples in the disjunction case.

The set covering machine tries to build a conjunction or disjunction of data-dependent features (data-derived decision functions) in order to classify future test examples. The set of data-derived decision functions we will use is the following set of data-dependent balls.

Definition 2. For a training example x_i ∈ X with label y_i ∈ {−1, 1} and (real-valued) radius ρ, let f_{i,ρ} be the following data-dependent ball centered around x_i:

    f_{i,ρ}(x) = { y_i    if d(x, x_i) ≤ ρ
                { ȳ_i    otherwise

where ȳ_i is the complement of y_i and d(x, x_i) is the distance between x and x_i. We choose the centers x_i ∈ X and restrict the definition to border points x_j ∈ X, so that the radius is given by

    ρ = { d(x_i, x_j) + α    if x_i ∈ P
        { d(x_i, x_j) − α    if x_i ∈ N

where α is a small positive real number.

For any ball h ∈ H, let ν(h) be the set of N examples correctly classified by h and let π(h) be the set of P examples misclassified by h. Given these definitions we have the following description for the usefulness of a data-dependent ball.

Definition 3. The usefulness (or utility) U of h is expressed as:

    U(h) = |ν(h)| − p|π(h)|    (1)

where p is a small positive real number.

Finally, when we discuss a P example we will mean an example from the set P. Similarly, an N example will denote an example from the set N. Using these definitions we can now describe how to train the SCM with data-dependent balls.

2.1
Training

The SCM algorithm uses a greedy approach to try to completely classify the set of N examples whilst misclassifying a small number of P examples. Let N be the set of N examples still to be covered and let P be the set of P examples that have been misclassified. Initially, therefore, hypothesis B ← ∅, N ← N and P ← ∅. At the first iteration the SCM algorithm looks for the ball h_i that maximizes the utility value U(h_i). After the ball h_i with the highest usefulness is added, the hypothesis becomes B ← B ∪ {h_i}, N ← N \ ν(h_i) and P ← P ∪ π(h_i). This is repeated until no more N examples are left to cover (N = ∅) or until the early stopping criterion |B| ≤ s is satisfied. Clearly the algorithm is greedy, as it only adds a ball h to hypothesis B if it maximizes the utility U(h). Once the SCM has output its hypothesis B, it can be used to make predictions.

2.2 Testing
Given a hypothesis B containing a conjunction or disjunction of data-dependent balls h_i, predictions are made by

    B(x) = { y    if h_i(x) = y for i = 1, . . . , |B|
           { ȳ    otherwise                              (2)

where ȳ is the complement of y and, from Definition 1, y = 1 for a conjunction and y = −1 for a disjunction.

2.3 Generalization Error Bounds

We now state a loss bound for the SCM that will be used for the theoretical and experimental sections of the paper. We also use a second bound, but only reference it in order to save space.

Theorem 1. [4] Suppose a SCM finds a solution given by a set B of features with R = R(B) = |B| features, R_+ = R_+(B) of which are centered around P examples, with k_p = k_p(B) the number of P training errors and k_n = k_n(B) the number of N training errors, on a sample of size m > 2R + k_p + k_n. Then with probability 1 − δ over random draws of training sets, the generalization error of the resulting classifier can be bounded by

    ε(m, R, R_+, k_p, k_n) ≤ 1 − exp( −1/(m − 2R − k_p − k_n) · [ ln C(m, 2R) + ln C(2R, R_+) + ln C(m − 2R, k_p + k_n) + ln(2m²R/δ) ] )    (3)

where C(n, k) denotes the binomial coefficient "n choose k".
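The bound (3) can be evaluated numerically with log-gamma functions to avoid overflowing binomial coefficients. Note that the binomial-coefficient reading of (3), including the constant inside the last logarithm, is our reconstruction of a garbled original and should be treated as an assumption:

```python
from math import exp, lgamma, log

def log_binom(n, k):
    """ln C(n, k) computed via log-gamma to avoid overflow."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def scm_bound(m, R, R_plus, kp, kn, delta=0.05):
    """Generalization error bound (3), under the reconstruction given above."""
    denom = m - 2 * R - kp - kn
    assert denom > 0, "requires m > 2R + kp + kn"
    s = (log_binom(m, 2 * R) + log_binom(2 * R, R_plus)
         + log_binom(m - 2 * R, kp + kn) + log(2 * m ** 2 * R / delta))
    return 1 - exp(-s / denom)

eps = scm_bound(m=300, R=3, R_plus=1, kp=2, kn=1)   # a non-trivial value in (0, 1)
```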
The second bound we will use is Equation 10 of [5], which is slightly tighter than the above bound. These bounds are tight and non-trivial (i.e., always less than 1). Exploiting this, we will apply the generalization error bound directly to obtain classifiers for the SCM and, with it, remove the need for parameter estimation in the SCM. Because of its greedy application of the bound, we call this first heuristic the bound set covering machine (BSCM).

3 The Bound Set Covering Machine

In this variant of the SCM we allow the algorithm to be driven by one of the generalization error bounds given by equation (3) or Equation 10 of [5]. For simplicity and to save space we only describe the remaining work with Theorem 1, although the bound of [5] can also be applied with the same reasoning.

Given any hypothesis B and ball h, we can calculate the risk bound of adding ball h to B as ε(B ∪ {h}), using ε(m, R, R_+, k_p, k_n). This is true for adding any ball h. Therefore, similarly to the SCM, the bound set covering machine (BSCM) algorithm can be described as follows. Initially, hypothesis B ← ∅, N ← N, P ← ∅ and best bound ε∗ ← 1. At the first iteration, the BSCM algorithm looks for the ball h_i that minimizes the generalization error bound ε(B ∪ {h_i}). The ball h_i that minimizes ε(B ∪ {h_i}) is added to the hypothesis: B ← B ∪ {h_i}, N ← N \ ν(h_i), P ← P ∪ π(h_i) and best bound ε∗ ← ε(B ∪ {h_i}). This is repeated until no new loss bound ε(B ∪ {h}), for any new ball h, can be found such that ε(B ∪ {h}) < ε∗. Finally, the resulting hypothesis B can be used to classify new test examples using (2). Note that the BSCM has eliminated the need for the penalty parameter p and the soft-stopping parameter s that were present in the SCM.

It is clear that the BSCM heuristic is greedy in its solution in much the same way as the SCM, with the difference that it looks to minimize the generalization error bound as opposed to maximizing the utility function for each ball. However, the nature of a greedy approach implies that we are not always guaranteed a globally optimal classifier. Therefore, in the following section, we present a branch and bound approach to solving the set covering machine that ensures a much stronger result, that of global optimality.

4 The Branch and Bound Set Covering Machine

In this section we use a branch and bound approach to enumerate all possible hypotheses that can be generated from the set of data-dependent balls. This is done by evaluating the bound every time a new ball can be added to the current hypothesis. If a hypothesis cannot be extended with a ball such that the new bound is smaller than the best bound currently found, then there is no need to include this ball in the current hypothesis. Note that the choice of balls can also be thought of in terms of a branch and bound search tree, where any ball at a particular depth d is found from its parent ball at depth d − 1 and can produce child balls at level d + 1. The motivation for such a strategy for solving the set covering machine is that, if the function to be minimized is the generalization error bound, then we are assured of finding a hypothesis with the smallest generalization error bound.

4.1 Algorithm

The algorithm relies on the functions addball and createtable.
We have detailed each function below and also included the pseudocode. Before we describe the functions in detail we now give some notation. Let S be the sample containing training set X and label set Y, H the set of data-dependent balls, and T the machine type, which can be either a conjunction or a disjunction of balls. As earlier, let π(h) be the set of P examples misclassified by h and ν(h) the set of N examples correctly classified by h. For any hypothesis B and any ball h_i ∉ B, let B_i = B ∪ {h_i}. The generalization error bound of B is given by ε(B), where ε(B) = ε(m, R, R_+, k_p, k_n); also, for the same hypothesis B, the best potential (bp) generalization error bound η(B) is given by

    η(B) = ε(m, R + 1, R_+, k_p, 0).

So the bp generalization error bound η(B) is the bound ε(B_i) for B_i = B ∪ {h_i} if a single ball h_i can be added to hypothesis B such that all of the remaining N examples are covered and none of the remaining P examples are misclassified. Contrast this with the generalization error bound ε(B), which is simply the bound given for hypothesis B.

Algorithm BBSCM: The input of the algorithm is the sample S containing training set X = P ∪ N, the machine type T that is either a 'conjunction' or a 'disjunction', and H, the set of data-dependent balls computed from training set X. The algorithm contains local variables B, N, P, ε∗ and global variable ℓ. Initially B ← ∅ is the empty hypothesis that we start with, N contains all the N training examples that are still to be covered, P is empty because no P examples are initially misclassified, and ε∗ is the best generalization error bound found so far for any hypothesis B, initially set to 1. Global variable ℓ is set to the
number of balls possible for the current B. Using these inputs and variables, algorithm BBSCM recursively calls the function addball(B, ε∗, N, P) (see below). Finally, the output of algorithm BBSCM is a conjunction or disjunction of balls (classifier/hypothesis) B∗.

Algorithm 1. BBSCM(S, T, H)
Input: S, T, H
  B ← ∅, N ← N, P ← ∅, ε∗ ← 1, ℓ ← |H|
Call: addball(B, ε∗, N, P)
Output: a T = 'conjunction' or T = 'disjunction' of data-dependent balls B∗ ⊆ H

Function addball(B, ε∗, N, P)
Data: consider all h ∈ H \ B;
order according to ε({h} ∪ B) → (h_1, ε_1, η_1), . . . , (h_ℓ, ε_ℓ, η_ℓ);
 1: for i = 1, . . . , ℓ do
 2:   Btemp ← B, Ntemp ← N, Ptemp ← P
 3:   if η_i < ε∗ then
 4:     B ← B ∪ {h_i}, N ← N \ ν(h_i), P ← P ∪ π(h_i)
 5:     if ε_i < ε∗ then
 6:       B∗ ← B, ε∗ ← ε_i
 7:       call createtable(ε∗, m)
 8:     end if
 9:     found ← false, R ← |B|, A ← −|N|
10:     while ¬found do
11:       R ← R + 1, A ← A + |Ntemp| − |N|, kn ← table(R, |P|)
12:       if kn = −1 or A ≥ −kn then
13:         found ← true
14:       end if
15:     end while
16:     if kn ≠ −1 then
17:       call addball(B, ε∗, N, P)
18:     end if
19:   end if
20: end for
Function addball: This function adds each possible ball h_i to the current hypothesis B and checks whether its generalization error bound ε_i is smaller than the best generalization error bound ε∗ found so far. If so, then the value of B∗ is replaced with B and the best risk bound ε∗ is replaced with ε_i (line 6). Also at this stage the function createtable is called (line 7) to get table (see the description of function createtable below).

On line 11, if table(R, |P|) returns kn = −1, then this indicates that there is no bound for R and |P| that is smaller than ε∗. If table(R, |P|) is a positive integer kn, then there is a possibility of finding a ball to add to the current hypothesis that will give a smaller risk bound than ε∗, provided there exists a set of R additional balls that leave no more than kn N examples uncovered and no additional P examples misclassified. If kn ≥ 0, then line 12 checks whether a larger number of N examples can be covered using R balls (see Lemma 2). If so, then the procedure calls itself recursively (line 17) until all balls in H have been enumerated.

Function createtable: The local variable table is an m × m matrix, where m = |X| is the number of training examples. Initially all values in the table are set to −1. Function createtable calculates, for R balls and kp misclassifications (on the P examples), the number of N examples that can be covered without creating a bound lower than the best bound ε∗ found so far (line 5). This function returns table.

Function createtable(ε∗, m)
Initialize: table ← −1, R_+ ← 0, kpfound ← true, kp ← −1
 1: while kpfound do
 2:   kpfound ← false, kp ← kp + 1, Rfound ← true, R ← 0
 3:   while Rfound do
 4:     Rfound ← false, R ← R + 1, kn ← 0
 5:     while ε(m, R, R_+, kp, kn) < ε∗ do
 6:       kn ← kn + 1
 7:     end while
 8:     if kn > 0 then
 9:       kpfound ← true, Rfound ← true
10:     end if
11:     table(R, kp) ← kn − 1
12:   end while
13: end while

5 BBSCM(τ)

BBSCM(τ) allows a trade-off between the accuracy of the classifier and speed. In function createtable(ε∗, m) the while condition computes ε(m, R, R_+, kp, kn) < ε∗. In the BBSCM(τ) heuristic this while condition becomes ε(m, R, R_+, kp, kn) < τ · ε∗, allowing the BBSCM algorithm to ignore solutions whose generalization error bounds are not within a factor τ of the optimal found so far. Clearly, setting τ = 1 gives the BBSCM algorithm; however, as mentioned above, this algorithm is too slow for data sets with m > 100 training examples. Therefore we would like to set τ < 1. Setting τ close to 1 may cause the heuristic to be too slow, but create hypotheses that have small generalization error bounds similar to those for τ = 1. Setting τ close to 0 will speed up the solution, but may not create a large enough search space for BBSCM(τ) to find hypotheses with relatively small generalization error bounds. Hence, setting τ
is unlike setting a regularization parameter, since it is trading accuracy against time: a bigger τ is always better in terms of generalization error bounds, but costs more in terms of computational time.

6 Theory

Before looking at the first theorem we present the following lemma, which allows the theorem to be proved. Note that because of space constraints we only use Theorem 1 for this section, although the results also hold for other SCM bounds. Also, we only give a sketch proof of the main theorem, because the full proofs of all the results in this section can be found in the longer version of the paper.

Lemma 1. The generalization error bound given by equation (3) (Theorem 1) is monotonically increasing in the second parameter.

Theorem 2. Let ε∗ be the smallest generalization error bound currently found for hypothesis B∗ by the BBSCM algorithm. For any hypothesis B_i that gives η(B_i) > ε∗ there exists no extension B̂ ⊇ B_i with generalization error bound ε(B̂) ≤ ε∗.

This theorem states that, for any hypothesis B_i, if the bp generalization error bound η(B_i) is worse than the best bound ε∗, then there is no need to try to cover any more N examples from this ball, as there will never be a smaller bound than the best ε∗ found so far.

Lemma 2. Let U be a set covered by A_1, . . . , A_k. For any V ⊆ U there exists j such that |A_j ∩ V| / |V| ≥ 1/k.

Lemma 3. Suppose createtable(ε∗, m) has been executed and kn = table(R, kp) ≥ 0. It follows that ε(m, R, R_+, kp, kn + 1) > ε∗ for R_+ ≥ 0.

Using Lemmas 2 and 3 we can now prove that the BBSCM(τ) algorithm will only disregard a ball for inclusion if it cannot lead to a hypothesis with a generalization error bound smaller than the one currently found by BBSCM(τ), hence giving solutions that are a factor τ from the optimal.

Theorem 3 (main theorem). If algorithm BBSCM(τ) outputs hypothesis B∗ with generalization error bound ε∗, then there exists no hypothesis B̂ such that the generalization error bound ε(B̂) < τ · ε∗.

Proof (sketch). We prove this result by contradiction. We consider a hypothesis whose generalization error bound is smaller than the one found by BBSCM(τ). Next we claim and prove, using Lemmas 2 and 3, that if this hypothesis does indeed have a smaller loss bound than the BBSCM(τ) solution then it should have been chosen by BBSCM(τ). However, this hypothesis was not chosen by the BBSCM(τ) algorithm. Hence, a contradiction.

Theorem 4. If algorithm BBSCM outputs a hypothesis B∗, then its generalization error bound ε∗ will be globally optimal (i.e. the smallest bound possible).

Proof. Apply Theorem 3 with τ = 1.

From these theoretical results we have shown that the BBSCM is guaranteed to find the hypothesis B∗ with the smallest generalization error bound ε∗.

7 Experiments

Experiments were conducted on seven standard UCI repository data sets [1] and are described in [4]. All examples with contradictory labels or whose attributes contained unknown values were removed (this reduced the Votes data set considerably).

We compared the results against the SVM and SCM. The SVM was equipped with a Gaussian kernel, and the SCM used the L_2 metric to construct its set of data-dependent balls. All training sets were evaluated using 10-fold cross validation, and both the SVM and SCM were trained with an extra 10-fold cross validation split in order to evaluate the regularization parameters for each algorithm.
Note that the running times include this parameter tuning phase. In contrast, the BSCM and BBSCM(τ) experiments were conducted using only one 10-fold cross validation split, as there were no parameters to tune.

For the SVM, we report in Table 1 the number of support vectors (SVs), the number of seconds (time) spent training and tuning parameters, and the error (err) of the classifiers averaged over all 10 sets for various values of the Gaussian width parameter γ and the regularization parameter C. For the SCM we report the machine type (T), where "c" denotes a conjunction and "d" denotes a disjunction of balls that gave the smallest CV error. To compare against the number of support vectors in the SVM we also report the average number of balls (b) for each classifier over the 10 folds. For the BSCM and BBSCM(τ) we also give the generalization error bound (thm) used for each heuristic, where "(1)" denotes Theorem 1 and "(2)" denotes the theorem of [5] given by their Equation 10. Finally, we chose "c" and "d" for those machines that gave the smallest generalization error bound for each CV split. Furthermore, for the BBSCM(τ) we only used τ = 0.1, as the search space grew too large for anything larger than this value.

From Table 1 we observe that the BSCM and BBSCM(τ) heuristics are competitive with the SVM and SCM in terms of generalization error for the BreastW, Votes, Pima, Haberman and Glass data sets. Also, both heuristics are considerably sparser than the SVM. Finally, the BSCM exhibits faster running times when compared with the SVM and SCM.

8 Conclusion

We have proposed the branch and bound set covering machine (BBSCM) algorithm and two heuristics, BSCM and BBSCM(τ), along with theoretical results
Table 1. SVM and SCM model-selection results using 10-fold cross validation

Data Set | SVM: SVs / time / err  | SCM: T / b / time / err  | BSCM: thm / T / b / time / err | BBSCM(τ): thm / T / b / err
BreastW  | 79.2 / 506.1 / .035    | c / 2 / 658.4 / .0395    | (2) / c / 1.5 / 46.7 / .0335   | (2) / c / 1 / .0351
Votes    | 18.5 / 8.8 / .1033     | c / 1 / 5.4 / .1154      | (2) / d / 1.5 / 0.1 / .14      | (2) / c / 1 / .1538
Pima     | 398.1 / 11071.9 / .246 | c / 5.9 / 1091.9 / .2721 | (2) / c / 2.8 / 99.4 / .2579   | (2) / c / 1 / .2891
Haberman | 136.4 / 5129.2 / .2413 | d / 13 / 272.5 / .2789   | (1) / c / 1 / 25.5 / .2585     | (2) / c / 1 / .2585
Bupa     | 199.4 / 3083.1 / .2694 | d / 32.6 / 128 / .3478   | (2) / d / 4.1 / 43.2 / .3769   | (2) / d / 1 / .3913
Glass    | 92.4 / 183.8 / .2383   | c / 3.8 / 57.9 / .2331   | (2) / c / 1 / 1.2 / .2379      | (2) / c / 1 / .2393
Credit   | 325.9 / 7415.1 / .2545 | d / 3.7 / 831.2 / .317   | (1) / c / 1 / 53.3 / .3247     | (2) / c / 1 / .3247

and experimental evidence to show that the algorithm is competitive with the SCM and SVM. The main novelty of this paper is that we have eliminated the need for parameters in the SCM by using tight generalization error bounds to train the algorithm. We have proved results guaranteeing that the BBSCM algorithm finds a classifier with the smallest generalization error bound, and we have shown that heuristics that directly minimize the loss bound are competitive with traditional cross-validation model selection techniques. We believe this helps bring together the ideas of learning theory and practical machine learning: using tight bounds directly to carry out model selection. This, we hope, is a first step towards learning machines that need little human intervention for tuning regularization parameters, with (hopefully) no degradation in generalization ability. Future work involves further pruning the search space of the BBSCM algorithm so that larger data sets become tractable, as well as the development and use of tighter loss bounds in the BBSCM, BSCM and BBSCM(τ) algorithms.

Acknowledgements

This work was supported in part by the IST programme of the European Community, under the PASCAL network of excellence.

References

1. Blake, C., Merz, C.: UCI Repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA (1998), http://www.ics.uci.edu/∼mlearn/MLRepository.html
2. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152. ACM Press, New York (1992)
3.
Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
4. Marchand, M., Shawe-Taylor, J.: Learning with the set covering machine. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 345–352 (2001)
5. Marchand, M., Sokolova, M.: Learning with decision lists of data-dependent features. Journal of Machine Learning Research 6, 427–451 (2005)
6. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)