Lecture Notes in Computer Science
2 Preliminaries

In this paper we treat the two-category, normal-distribution case similarly to [2]. The categories are denoted by θ_1 and θ_2 respectively and we set Θ = {θ_1, θ_2}. Let R^d be the d-dimensional Euclidean space (R = R^1) and let x ∈ R^d be the patterns to be classified. Denote by N(μ_i, Σ_i), i = 1, 2, the normal distributions, where μ_i and Σ_i are the mean vectors and the covariance matrices. They are the distributions of the patterns x from the respective categories. The state-conditional probability density functions are

p(x|\theta_i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \, e^{-\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1} (x-\mu_i)}, \qquad i = 1, 2.    (1)

We suppose that the covariance matrices are not degenerate and, hence, Σ_i as well as Σ_i^{-1} are positive definite. Let P(θ_1) and P(θ_2) be the prior probabilities and let P(θ_1|x) and P(θ_2|x) be the posterior probabilities of the patterns from the respective categories. Then, by the Bayesian relation, we have

\frac{P(\theta_1|x)}{P(\theta_2|x)} = \frac{P(\theta_1)\,p(x|\theta_1)}{P(\theta_2)\,p(x|\theta_2)}.    (2)
This ratio can be used as a Bayesian discriminant function [1].

240 Y. Ito, C. Srinivasan, and H. Izumi

The logistic function σ is defined by σ(t) = (1 + e^{-t})^{-1}. Since this is a strictly monotone increasing function, the logistic transform of the logarithm of (2),

\sigma\!\left(\log \frac{P(\theta_1|x)}{P(\theta_2|x)}\right) = P(\theta_1|x),    (3)

is also a Bayesian discriminant function [1]. In the case where the state-conditional probability distributions are normal, the log ratio of the posterior probabilities is a quadratic form:

g_1(x) = \log \frac{P(\theta_1|x)}{P(\theta_2|x)} = -\frac{1}{2}\left\{(x-\mu_1)^t \Sigma_1^{-1} (x-\mu_1) - (x-\mu_2)^t \Sigma_2^{-1} (x-\mu_2)\right\} + \log \frac{P(\theta_1)}{P(\theta_2)} - \frac{1}{2}\log \frac{|\Sigma_1|}{|\Sigma_2|}.    (4)
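As a concrete check of the relation between (2), (3) and (4), the following numpy sketch computes the log posterior ratio of two normal categories and verifies that its logistic transform equals the posterior computed directly from Bayes' rule. The distribution parameters are illustrative values (echoing the second simulation example later in the paper), not prescribed here.

```python
import numpy as np

def log_posterior_ratio(x, mu1, S1, mu2, S2, p1, p2):
    """Quadratic discriminant g1(x) = log P(th1|x)/P(th2|x), eq. (4)."""
    q1 = (x - mu1) @ np.linalg.inv(S1) @ (x - mu1)
    q2 = (x - mu2) @ np.linalg.inv(S2) @ (x - mu2)
    return (-0.5 * (q1 - q2) + np.log(p1 / p2)
            - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)))

def posterior(x, mu1, S1, mu2, S2, p1, p2):
    """P(th1|x) computed directly from eq. (1), (2) and Bayes' rule."""
    def density(x, mu, S):
        d = len(mu)
        z = x - mu
        return (np.exp(-0.5 * z @ np.linalg.inv(S) @ z)
                / np.sqrt((2 * np.pi) ** d * np.linalg.det(S)))
    a = p1 * density(x, mu1, S1)
    b = p2 * density(x, mu2, S2)
    return a / (a + b)

# Illustrative parameters: two 2-D normal categories with unequal priors.
mu1, S1 = np.array([1.0, 1.0]), np.array([[2.0, 1.0], [1.0, 1.0]])
mu2, S2 = np.array([-1.0, -2.0]), np.array([[1.0, 0.5], [0.5, 1.0]])
p1, p2 = 0.7, 0.3
x = np.array([0.3, -0.2])

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # the logistic function
g1 = log_posterior_ratio(x, mu1, S1, mu2, S2, p1, p2)

# Eq. (3): the logistic transform of the log ratio is the posterior itself.
assert np.isclose(sigma(g1), posterior(x, mu1, S1, mu2, S2, p1, p2))
```

The identity holds exactly: σ(log(a/b)) = a/(a+b) for any positive a, b.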
The activation function of the output unit of our neural network is the logistic function and the network is trained so as to approximate (3). Hence, the inner potential of the output unit approximates the quadratic form (4).

The Mahalanobis distance with respect to the normal distribution N(μ_i, Σ_i) is defined by

d_i(x - y) = (x - y)^t \Sigma_i^{-1} (x - y), \qquad i = 1, 2.

Hence,

g_2(x) = -\frac{1}{2}\left\{(x-\mu_1)^t \Sigma_1^{-1} (x-\mu_1) - (x-\mu_2)^t \Sigma_2^{-1} (x-\mu_2)\right\}    (5)

can be a Mahalanobis discriminant function. This can be obtained by removing the constant

\log \frac{P(\theta_1)}{P(\theta_2)} - \frac{1}{2}\log \frac{|\Sigma_1|}{|\Sigma_2|}    (6)

from (4) as stated in [7].

3 Difficulty of Learning Posterior Probabilities

A one-hidden-layer neural network outputs a linear sum of the form

\sum_{i=1}^{n} a_i g(w_i \cdot x + t_i) + a_0,

where a_0, a_i and t_i are constants and w_i = (w_{i1}, ..., w_{id}) are vectors. Since these constants are to be modified while training, we call them parameters. Among them we call w_{i1}, ..., w_{id} and t_i inner parameters, and a_i and a_0 outer parameters. We sometimes identify an approximation formula with the neural network based on that formula. The network above has n(d+2)+1 constants to be optimized. Among them n(d+1) are inner parameters, and n+1 are outer parameters.

Figure 1 shows the difficulty of learning with a dichotomic random sample. The space of patterns is one-dimensional.

Learning of Bayesian Discriminant Functions by a Layered Neural Network 241
Fig. 1. Dichotomic samples and the posterior probability. The abscissa is the x-axis.

The dots in Figure 1 are random samples from {(x, ξ(x, θ))}, x ∈ R, θ ∈ Θ = {θ_1, θ_2}, where ξ(x, θ_1) = 1 and ξ(x, θ_2) = 0. The probability that ξ(x, θ) = 1 is P(θ_1|x) = E[ξ(x, θ_1)|x]. For simple figurative representation, it is based on one-dimensional normal distributions. The smooth curve is the posterior probability, which the neural network has to approximate. In the higher-dimensional case, the network has to approximate a (hyper)surface. If the teacher signals are {(x, P(θ_1|x))}, P(θ_1|x) = E[ξ(x, θ)|x], then the learning is an ordinary function approximation by a neural network. However, the teacher signals are {(x, ξ(x, θ))} for our neural network. Comparing the two sets of teacher signals, one may understand that approximation of P(θ_1|x), using the random samples from {(x, ξ(x, θ))}, is more difficult.

For approximation of the quadratic form (4), Funahashi [2] has proposed an approximation formula

\sum_{i=1}^{2d} a_i g(w_i \cdot x + t_i) + a_0.    (7)

Since d is the dimension of the space of the pattern x, this approximation formula has 2d(d+2)+1 constants to be optimized. Among them, 2d(d+1) constants are inner constants. Actually, an approximation formula

\sum_{i=0}^{d} a_i g(w_i \cdot x) + a_0    (8)

having a smaller number of activation functions can approximate the quadratic form [3], [5]. In this formula, the number of constants to be optimized is (d+1)^2 + 1 and that of the inner constants is d(d+1). Since one of the activation functions is to approximate a linear form in x, the approximation formula (8) can be replaced by

\sum_{i=1}^{d} a_i g(w_i \cdot x) + w_0 \cdot x + a_0.    (9)

The values of the constants are of course distinct in (8) and (9), though the same notation is used. In (9), the numbers of activation functions and parameters are respectively decreased to d and (d+1)^2. Hence, we used (9) in [6] and [7], expecting that the parameter space would be simplified remarkably. Nevertheless, the network based on (9) was successful only in the case where the probability distributions are simple. In particular, training of the inner parameters was very difficult. This convinced us that, to overcome the difficulties, there is no way other than decreasing the number of inner parameters.
4 Modified One-Hidden-Layer Neural Network

In this section we construct a neural network having more hidden-layer units but fewer parameters to be optimized. There are unit vectors u_k, k = 1, ..., ½d(d+1), in R^d such that (u_k · x), k = 1, ..., d, are linearly independent and (u_k · x)^2, k = 1, ..., ½d(d+1), are also linearly independent. Here we illustrate an example. Suppose that u_k, k = 1, ..., ½d(d+1), are distinct unit vectors in R^d such that, for k = 1, ..., d, the k-th element is 1 and the others are 0, and, for k = d+1, ..., ½d(d+1), two elements are equal to 1/√2 and the others are 0. Then they satisfy the condition [3]. Hence, any quadratic form Q(x) in R^d can be expressed as

Q(x) = \sum_{k=1}^{\frac{1}{2}d(d+1)} a_{2k}(u_k \cdot x)^2 + \sum_{k=1}^{d} a_{1k}(u_k \cdot x) + a_0.

Though this expression can be generalized to polynomials of higher degree [3], this is enough for this paper.

Let u be a unit vector in R^d. For a probability measure μ in R^d, we denote by μ_u its marginal distribution on the straight line {tu : -∞ < t < ∞}. Then we have the theorem below.
Theorem 1. Let μ be a probability measure on R^d and let g ∈ C^N(R) (N ≥ 2) be such that g^{(i)}(0) ≠ 0 (1 ≤ i ≤ 2). Suppose that there exists a function g_0 on R such that, for any δ (|δ| < 1),

|g^{(i)}(\delta t)\, t^i| < g_0(t), \qquad 0 \le i \le 2,

and, for any unit vector u in R^d, g_0 ∈ L^p(R, μ_u) (1 ≤ p ≤ ∞). Then, for any quadratic form Q and any ε > 0, there exist coefficients a_{ik} and constants δ_i (0 ≤ i ≤ 2, 1 ≤ k ≤ ½d(d+1)) for which

\left\| \frac{\partial^\alpha}{\partial x^\alpha} Q - \frac{\partial^\alpha}{\partial x^\alpha} \bar{Q} \right\|_{L^p(R^d, \mu)} < \varepsilon, \qquad |\alpha| \le N,

where

\bar{Q}(x) = \sum_{k=1}^{\frac{1}{2}d(d+1)} a_{2k}\, g(\delta_2 u_k \cdot x) + \sum_{k=1}^{d} a_{1k}\, g(\delta_1 u_k \cdot x) + a_0    (10)

and the u_k are unit vectors satisfying the condition above.

This theorem is a special case of a theorem in [3]. Though we do not have enough space to prove it here, one may imagine that (10) comes from the expression of the quadratic form Q above. The condition on g and μ is to suppress the L^p-norm in a neighborhood of infinity. If the activation function and its derivatives are bounded and the probability measure is rapidly decreasing, this condition is always satisfied.
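The construction of the vectors u_k from Section 4 can be checked numerically. The sketch below builds the ½d(d+1) unit vectors for d = 3, verifies the linear independence of the functions (u_k · x)^2, and recovers coefficients a_{2k} representing an arbitrary quadratic form exactly.

```python
import numpy as np
from itertools import combinations

d = 3
D = d * (d + 1) // 2

# The unit vectors described in the text: e_k for k = 1..d, then
# (e_i + e_j)/sqrt(2) for each pair i < j.
U = [np.eye(d)[k] for k in range(d)]
U += [(np.eye(d)[i] + np.eye(d)[j]) / np.sqrt(2)
      for i, j in combinations(range(d), 2)]
U = np.array(U)                      # D x d
assert len(U) == D

# Check linear independence of the functions (u_k . x)^2 by evaluating
# them at D random points and testing that the resulting matrix is regular.
rng = np.random.default_rng(1)
X = rng.normal(size=(D, d))
M = (X @ U.T) ** 2                   # M[m, k] = (u_k . x_m)^2
assert np.linalg.matrix_rank(M) == D

# Hence any quadratic form Q(x) = x^t A x is matched by coefficients a_2k.
A = rng.normal(size=(d, d)); A = (A + A.T) / 2
q = np.array([x @ A @ x for x in X])
a2 = np.linalg.solve(M, q)
x_new = rng.normal(size=d)
assert np.isclose(((U @ x_new) ** 2) @ a2, x_new @ A @ x_new)
```

Because the D functions (u_k · x)^2 span the D-dimensional space of quadratic forms, interpolation at D points in general position determines the representation everywhere.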
Since the second sum in (10) is to approximate a linear form, the approximation formula can be modified to

\bar{Q}(x) = \sum_{k=1}^{\frac{1}{2}d(d+1)} a_{2k}\, g(\delta u_k \cdot x) + \sum_{k=1}^{d} a_{1k} x_k + a_0.    (11)

When δ is infinitesimal, (11) can coincide with the quadratic form Q. This approximation formula can be realized by a one-hidden-layer neural network having direct connections between the input and output layers. Figure 2 illustrates such a neural network.

Fig. 2. A one-hidden-layer neural network having direct connections between the input layer and output unit. D = ½d(d+1).

Since the vectors u_k are fixed beforehand, the number of the parameters to be optimized in (11) is ½(d^2 + 3d + 4). Moreover, only one of them is an inner parameter and the others are outer parameters. This must make the parameter space very simple. We performed simulations to confirm the usefulness of the network. The network has ½d(d+1) hidden units. The linear sum Σ_{k=1}^{d} a_{1k} x_k is input to the output unit via the direct connections. Hence, the inner potential of the output unit of the neural network illustrated in Figure 2 can approximate (10). If the activation function of the output unit is the logistic function, then, by (3) and (4), it can approximate the posterior probability P(θ_1|x) [2].
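A minimal sketch of the network based on (11), assuming the u_k construction of Section 4 and random placeholder values for the outer parameters:

```python
import numpy as np
from itertools import combinations

d = 2
D = d * (d + 1) // 2                     # number of hidden units

# Fixed unit vectors u_k (chosen beforehand, not trained).
U = [np.eye(d)[k] for k in range(d)]
U += [(np.eye(d)[i] + np.eye(d)[j]) / np.sqrt(2)
      for i, j in combinations(range(d), 2)]
U = np.array(U)

g = lambda t: 1.0 / (1.0 + np.exp(-t))   # logistic function

# Trainable parameters of formula (11): one inner (delta), the rest outer.
rng = np.random.default_rng(0)
a2 = rng.normal(size=D)       # outer: hidden-to-output weights
a1 = rng.normal(size=d)       # outer: direct input-to-output weights
a0 = rng.normal()             # outer: bias
delta = 0.1                   # the single inner (scaling) parameter

def inner_potential(x):
    """Approximation formula (11): the inner potential of the output unit."""
    return a2 @ g(delta * (U @ x)) + a1 @ x + a0

def output(x):
    """Logistic output unit, approximating the posterior P(theta_1 | x)."""
    return g(inner_potential(x))

# Parameter count: D + d + 1 + 1 = (d^2 + 3d + 4)/2.
assert D + d + 2 == (d**2 + 3*d + 4) // 2
```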
5 Training of the Neural Network

Let F(x, w) be the output of the neural network. Training of the network is based on the proposition below by Ruck et al. [9]. As we have repeatedly used this proposition [4-7], we omit details here.

Proposition 2. Set

E(w) = \int_{R^d} \sum_{i=1}^{2} (F(x, w) - \xi(x, \theta_i))^2\, P(\theta_i)\, p(x|\theta_i)\, dx.    (12)

Then,

E(w) = \int_{R^d} (F(x, w) - E[\xi(x, \cdot)|x])^2\, p(x)\, dx + \int_{R^d} V[\xi(x, \cdot)|x]\, p(x)\, dx.    (13)

If ξ(x, θ_1) = 1 and ξ(x, θ_2) = 0, then E[ξ(x, ·)|x] = P(θ_1|x). The integrand of the second integral is the conditional variance of ξ and independent of the weight vector w. Hence, when E(w) is minimized, the output F(x, w) is expected to approximate P(θ_1|x) if the network has the capability of approximating the posterior probability. Accordingly, learning of the network is carried out by minimizing

E_n(w) = \frac{1}{n}\sum_{k=1}^{n} (F(x^{(k)}, w) - \xi(x^{(k)}, \theta^{(k)}))^2,    (14)

where {(x^{(k)}, ξ(x^{(k)}, θ^{(k)}))}_{k=1}^{n}, (x^{(k)}, θ^{(k)}) ∈ R^d × Θ, is the training set. This method of training has been treated by many authors [2,4-10]. In the case of ordinary approximation of functions, what is to be minimized is

E_n(w) = \frac{1}{n}\sum_{k=1}^{n} (F(x^{(k)}, w) - f(x^{(k)}))^2.    (15)

It is obvious that minimization of (14) is more difficult than that of (15).

6 Simulations

Unlike ordinary one-hidden-layer neural networks, the network based on (11) can learn two-dimensional Bayesian discriminant functions. We have confirmed this by simulation. Figs. 3a and 3b illustrate the probability density functions of the normal distributions N(μ_1, Σ_1) and N(μ_2, Σ_2) used in one of the simulations, where μ_2 = (0, 0) and

\Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

These are the state-conditional probability density functions of the categories θ_1 and θ_2 respectively. Fig. 3c illustrates the posterior probability of the category θ_1 in the case of the prior probabilities P(θ_1) = 0.7 and P(θ_2) = 0.3. As stated in Section 2, this can be used as a Bayesian discriminant function. The neural network was trained with 1000 teacher signals.
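The generation of dichotomic teacher signals and the empirical error (14) can be sketched as follows. The priors, μ_2 and Σ_2 follow the first simulation example above; μ_1 and Σ_1 are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Priors and distribution parameters: p1, p2, mu2, S2 as in the first
# simulation example; mu1 and S1 are placeholder assumptions.
p1, p2 = 0.7, 0.3
mu1, S1 = np.array([1.0, 1.0]), np.array([[2.0, 1.0], [1.0, 1.0]])
mu2, S2 = np.array([0.0, 0.0]), np.eye(2)

def sample_teacher_signals(n):
    """Draw {(x^(k), xi^(k))}: xi = 1 for category theta_1, 0 for theta_2."""
    labels = rng.random(n) < p1
    xs = np.where(labels[:, None],
                  rng.multivariate_normal(mu1, S1, n),
                  rng.multivariate_normal(mu2, S2, n))
    return xs, labels.astype(float)

xs, xis = sample_teacher_signals(1000)

def E_n(F, w):
    """Empirical error (14) for a network output F(x, w)."""
    return np.mean((np.array([F(x, w) for x in xs]) - xis) ** 2)
```

For a constant predictor F = 0.5, E_n is exactly 0.25, since every ξ is 0 or 1; a trained network should do better by approaching P(θ_1|x).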
Fig. 3. a and b are the state-conditional probability density functions of Categories θ_1 and θ_2 respectively, and c is the posterior probability.

Fig. 4 illustrates an example of the learning process. Fig. 4a is the output of the network with the initial values of the parameters. It converged to Fig. 4c via Fig. 4b. The unit vectors are u_1 = (1, 0), u_2 = (0, 1) and u_3 = (1/√2, 1/√2). While learning, the parameters a_0, a_{11}, a_{12}, a_{21}, a_{22}, a_{23} were optimized, but the value of the constant δ did not change much. This probably implies difficulty of training of the inner parameter. Even this neural network is not always successful in learning: its success depends on the choice of the initial values of the parameters. In particular, if the initial output of the network is a flat surface, learning rarely proceeds.

Fig. 4. Learning process. The target is the surface illustrated in Fig. 3c.

One may recognize slight differences between the Bayesian discriminant function theoretically obtained (Fig. 3c) and that obtained by simulation (Fig. 4c). These are probably because the teacher signals are random variables and the inner parameter δ is not infinitesimal. The classification capabilities of the two discriminant functions (Fig. 3c and Fig. 4c) are compared in Table 1. By the method stated in [7], we can easily convert the Bayesian discriminant function (4) to the Mahalanobis discriminant function (5), because both are quadratic forms distinct from each other by a constant. Hence, in the table the capabilities of the Mahalanobis discriminant function theoretically obtained and that obtained by simulation are also compared. The test signals are distinct from those used for training. In this table, the numbers in the row of "Alloc. to θ_1 (resp. θ_2)" are the numbers of patterns allocated to category θ_1 (resp. category θ_2), and in the row of "Correct Alloc." is the total number of correctly allocated patterns. The columns TS, BT, BS, MT and MS stand for the test signals (TS), the Bayesian discriminant functions theoretically obtained (BT, Fig. 3c) and obtained by simulation (BS, Fig. 4c), and the Mahalanobis discriminant functions theoretically obtained (MT) and obtained by simulation (MS) respectively. Only in the column TS are all signals correctly allocated. The BT allocated 843 (resp. 157) patterns to Category θ_1 (resp. θ_2), among which 175 (resp. 37) patterns were to be allocated to

Table 1. Classification results in the first example (See Text)

                  TS     BT         BS         MT         MS
Alloc. to θ1      705    843 (175)  861 (187)  614 (87)   741 (109)
Alloc. to θ2      295    157 (37)   139 (31)   386 (178)  259 (150)
Correct Alloc.    1000   788        782        735        741
Table 2. Classification results in the second example (See Text)

                  TS     BT         BS         MT         MS
Alloc. to θ1      708    717 (32)   706 (26)   687 (17)   664 (11)
Alloc. to θ2      292    283 (23)   294 (28)   313 (38)   336 (55)
Correct Alloc.    1000   945        946        945        934
Category θ_2 (resp. θ_1). Since the distributions of the two categories are overlapping, even the BT can allocate correctly only 1000 − (175 + 37) = 788 patterns among 1000. However, the allocations by the BT and BS coincide at 982 patterns among 1000 (not shown in Table 1), and those by MT and MS at 924. This implies that the learning is successful.

Another example is illustrated in Table 2, where μ_1 = (1, 1), μ_2 = (−1, −2) and

\Sigma_1 = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}.

In this case the two probability distributions are rather separated. Hence, the classifications by the respective discriminant functions are more successful. Moreover, the allocations by BT and BS coincide at 989 patterns among 1000, and those by MT and MS at 977.

7 Conclusion and Discussions

We have often experienced that the ordinary layered neural network based on the approximation formula (9) fails in approximating the posterior probability unless the probability distribution is very simple. In the case where the dimension of the patterns x is higher than or equal to 2, it looks impossible for the network to learn the posterior probability. However, the neural network based on (11) performs the task better even in the two-dimensional case. We think that this is because the latter has a smaller number of parameters to be optimized and, more importantly, they are mostly outer parameters.

Our network has more hidden-layer units than the minimum requirement. This contradicts the general belief that a neural network having a smaller number of units works better. However, the number of the parameters to be trained, and whether they are inner or outer, may be more directly related to the complexity of learning. Since our network has fewer parameters to be trained and most of them are outer parameters, it is understandable that the network works better.

When the network is based on (11), the error function (12) usually has no global minimum, because the output F(x, w) converges to the posterior probability P(θ_1|x) as the scaling parameter δ tends to infinitesimal. Hence, we expected that δ would converge toward zero while training, but it usually stays in the neighborhood of the initial value. The network often realizes the approximation using the initially given value of δ with a slight modification. This probably implies difficulty of training of inner constants with random dichotomic teacher signals. This may in turn justify our policy of decreasing the number of inner parameters to be trained, even if it increases the number of hidden-layer units. In short, the inner parameters are converted to outer parameters in our neural network.

The flexibility of the basis functions may characterize the neural network. It makes neural networks useful and powerful tools for multiple purposes. One of the standard methods of approximating functions is to use the basis functions of a complete orthonormal system. In that case only the coefficients of the basis functions are optimized. Approximation by our neural network is an intermediate method between the two because the basis functions are flexible only to some extent.

Since the space of target functions is restricted to quadratic forms, the flexibility of the basis functions can be restricted in this paper. However, the usefulness of the neural network is not limited to this special case. The approximation based on Theorem 1 is on the whole space in the L^p sense, and the original theorem in [3] is on approximation of polynomials of arbitrary degree. Hence, by the polynomial approximation theorem, our neural network may be useful in approximating general continuous functions when the probability measure is rapidly decreasing. We think that the idea of making learning easier by converting inner parameters to outer parameters can be widely applied.

References

1. Duda, R.O., Hart, P.E.: Pattern classification and scene analysis. John Wiley & Sons, New York (1973)
2. Funahashi, K.: Multilayer neural networks and Bayes decision theory. Neural Networks 11, 209–213 (1998)
3. Ito, Y.: Simultaneous approximations of polynomials and derivatives and their applications to neural networks (submitted)
4. Ito, Y., Srinivasan, C.: Multicategory Bayesian decision using a three-layer neural network. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 253–261. Springer, Heidelberg (2003)
5.
Ito, Y., Srinivasan, C.: Bayesian decision theory on three-layer neural networks. Neurocomputing 63, 209–228 (2005)
6. Ito, Y., Srinivasan, C., Izumi, H.: Bayesian learning of neural networks adapted to changes of prior probabilities. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 253–259. Springer, Heidelberg (2005)
7. Ito, Y., Srinivasan, C., Izumi, H.: Discriminant analysis by a neural network with Mahalanobis distance. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4132, pp. 350–360. Springer, Heidelberg (2006)
8. Richard, M.D., Lippmann, R.P.: Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation 3, 461–483 (1991)
9. Ruck, M.D., Rogers, S., Kabrisky, M., Oxley, H., Sutter, B.: The multilayer perceptron as approximator to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks 1, 296–298 (1990)
10. White, H.: Learning in artificial neural networks: A statistical perspective. Neural Computation 1, 425–464 (1989)

RNN with a Recurrent Output Layer for Learning of Naturalness

Ján Dolinský and Hideyuki Takagi

Kyushu University, 4-9-1 Shiobaru, Minami-ku, Fukuoka, 815-8540 Japan
jan@plazma.sk, takagi@design.kyushu-u.ac.jp

Abstract. The behavior of recurrent neural networks with a recurrent output layer (ROL) is described mathematically, and it is shown that using ROL is not only advantageous but in fact crucial to obtaining satisfactory performance for the proposed naturalness learning. Conventional belief holds that employing ROL often substantially decreases the performance of a network or renders the network unstable, and ROL is consequently rarely used. The objective of this paper is to demonstrate that there are cases where it is necessary to use ROL. The concrete example shown models naturalness in handwritten letters.

1 Introduction

In engineering, recurrent neural networks (RNN) have not often been proposed as a promising solution. The difficulties with training an RNN have been overcome, and recent theoretical advances in the field have made training an RNN quicker and easier [4]. Recurrent connections have not been found to increase the approximation capabilities of the network [7]. Nevertheless, we may obtain decreased complexity, network size, etc. while solving the same problem.

In some applications, such as speech recognition, object detection or prediction, our classification / prediction at time t should be more accurate if we can account for what we saw at earlier times. The most common approach to modeling such systems is to use a suitably powerful feed-forward network and feed it with a finite history of the inputs and optionally the outputs through a sliding window. Early attempts to improve this often tedious technique resulted in various network architectures based on the feed-forward topology, with the recurrent connections being set to fixed values to ensure that the backpropagation rule can be used [1][6]. Much work has been done on autonomous Hopfield networks as well as on training algorithms that can be applied to an RNN in a feed-forward fashion (i.e. BPTT).

Recent theoretical advances in RNN research, such as Echo State Networks (ESN), afford the modeling of fully general topologies which were difficult to train directly with the former techniques. An interesting example is a topology where the output units have connections from not only the internal units but also the input units and the output units themselves, yielding a recurrent output layer (ROL). Although connections from the input units are often used, connections from the output layer are rare. In the following

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 248–257, 2008. © Springer-Verlag Berlin Heidelberg 2008
chapters, we explain what behavior ROL implies, introduce the concept of our proposed naturalness learning, and show that using ROL not only increases the performance but is actually an intrinsic part of modeling with the proposed naturalness learning.

One of the earliest RNNs in which the output activation values from the previous step were used to compute the output activation values in the next step was the Jordan network [6][5]. In the Jordan network, the activation values of the output units are fed back into the input layer through a set of extra input units called the state units. This type of network is called an output-recurrent network. Various modifications to output-recurrent networks have been proposed and successfully used for modeling difficult non-linear tasks [8]. An RNN with ROL, in contrast to an output-recurrent network, uses the output activation values of the previous step directly to compute the output in the next step. The output activation values of the previous step can be thought of as extra hidden units. We are aware of no papers discussing applications of RNN with ROL.

2 Dynamics of RNN with a Recurrent Output Layer

Adopting a standard perspective of system theory, we view a deterministic, discrete-time dynamical system as a function G which yields the next system output, given an input and the output history:

y(n+1) = G(..., u(n), u(n+1); ..., y(n-1), y(n))    (1)

where u(n) is the input vector and y(n) is the output vector for the time step n. The echo-state approach enables us to approximate systems represented by G directly, without the necessity of converting a time series into static input patterns by the sliding window technique [2].

Consider a discrete-time ESN [4] consisting of K input units with an activation vector u(n) = (u_1(n), ..., u_K(n))^t, N internal units with an activation vector x(n) = (x_1(n), ..., x_N(n))^t, and L output units with an activation vector y(n) = (y_1(n), ..., y_L(n))^t, where ^t denotes the transpose. The corresponding input, internal and output connection weights are collected in the N × K, N × N and L × (K + N + L) weight matrices W^in, W and W^out respectively. Optionally, an N × L matrix W^back may be used to project the output units back to the internal units. The internal units' activation is computed according to

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n))    (2)

where f denotes the component-wise application of the transfer (activation) function to each internal unit. The output is computed as

y(n+1) = f^out(W^out (u(n+1), x(n+1), y(n)))    (3)

where (u(n+1), x(n+1), y(n)) represents the concatenated vector consisting of the input, internal and output activation vectors. The concatenated vector often consists only of the input and internal activations, or of the internal activation only. Fig. 1 shows the architecture of an ESN. See [4] for further details concerning the training of an ESN.

Fig. 1. Echo-state network: the dotted lines plot connections which can be trained, the gray lines plot connections which are optional.

A closer look at Eq. (3) reveals that a system output y(n+1) is constructed from the given input and output history via two distinct mechanisms: from the activation vectors of the internal units x(n) indirectly (by computing x(n+1) via Eq. (2)), and optionally from the activation vectors of the input units u(n+1) and/or output units y(n) directly. The internal units' activation x(n+1) is computed using the input and output activations u(n+1), y(n) and the activation x(n) of the internal units from the previous step, which recursively reflects the influence of input and output activations from previous steps. We can therefore rewrite Eq. (2) as

x(n+1) = E(..., u(n), u(n+1); ..., y(n-1), y(n))    (4)

where E depends on the history of the input signal u and the history of the desired output signal y itself; thus in each particular task, E shares certain properties with the desired output and/or given input. How strongly the internal activation x(n+1) is influenced by the activations u(n+1), y(n) and x(n) (which recursively consists of previous input/output activations) is controlled by the size of the weights in the matrices W^in, W^back and W respectively. Algebraic properties of the matrix W are particularly important for the short-term memory property of an ESN [3].
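One update step of an ESN with a recurrent output layer, following Eqs. (2) and (3), can be sketched in numpy as below. Assumptions not fixed by the text: tanh is used for both f and f^out, the weights are random and untrained, and the spectral-radius scaling of W is standard ESN practice.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 2, 20, 1                       # input, internal, output units

W_in   = rng.uniform(-0.5, 0.5, (N, K))
W      = rng.uniform(-0.5, 0.5, (N, N))
W     *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
W_back = rng.uniform(-0.5, 0.5, (N, L))
W_out  = rng.uniform(-0.5, 0.5, (L, K + N + L))  # reads input, internal, output

def step(x, y, u_next):
    """One update of an ESN with a recurrent output layer."""
    # Eq. (2): internal state from input, previous state, fed-back output.
    x_next = np.tanh(W_in @ u_next + W @ x + W_back @ y)
    # Eq. (3): output from the concatenated (input, internal, output) vector;
    # the trailing y makes the output layer recurrent (ROL).
    y_next = np.tanh(W_out @ np.concatenate([u_next, x_next, y]))
    return x_next, y_next

x, y = np.zeros(N), np.zeros(L)
for n in range(5):
    u = rng.normal(size=K)
    x, y = step(x, y, u)
```

In a trained ESN only W_out would be learned; the sketch only exercises the update dynamics.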
Besides using the activations of the internal units, it is sometimes advantageous to also use the activations of the input and output units directly. Although the activation vector x(n) reflects the history of the desired output and/or given input, the activation vectors u(n+1), y(n) in Eq. (3) are used merely as another form of input. This usage corresponds to connecting the input units to the output units, and the output units to the output units themselves, directly. Direct connection of input units to output units is often used, whereas direct connection of output units to output units is rare. It is the connecting of output units to each other that makes the output layer recurrent.

ROL implies a substantial influence of the previously generated output y(n) on the successive output y(n+1). The activation y(n) is only an approximation of the system output at step n and, thus, it is always generated with a certain error. This error is included in the computation of the successive output activation y(n+1) and can easily accumulate with each update step. It is for this reason that computation using ROL has been rare.

3 RNN with a Recurrent Output Layer for Naturalness Learning

In this section, we demonstrate the principles of naturalness learning by showing how to express and model the unique quality of handwritten letters. We also explain why ROL works well with naturalness learning.

The style of writing of any given person is very much individual and can be distinguished easily from mechanically or electronically printed text. Moreover, we can distinguish between the writing of different people. Everybody learns their alphabet in school, and while writing in one's own individual way, a person is trying to approximate the shapes of the letters as they learned them in school. We can therefore understand handwriting as turning the basic shape of a letter as learned in school into the writer's particular, unique, handwritten form. A human can then be seen as a 'filter' which adds his or her own characteristics to the shapes of those basic letters.

To explain the term naturalness, first we must define our terminology. Let us refer to the letters used in textbooks (either printed or cursive) as the font letters (fontL). We view handwriting as the process of turning a fontL into its handwritten form.
The unique quality of the handwritten letter (handL) can then be understood as the difference between the handL and the fontL, and expressed as a 2-D displacement vector field of evenly spaced points along the strokes of the font and its respective handwritten version (see Fig. 2)^1. We refer to this difference as naturalness. In other words, adding the naturalness to the fontL will result in a handL of a unique form. We can thus assume a relation between the fontL and the naturalness. This assumption enables us to model naturalness by a system which employs certain characteristics of the fontL as its input. The task of naturalness learning is to find and learn the relation between font letters and naturalness. Speaking in terms of naturalness learning, the target system (handwritten letters) resembles the basic system (font letters), with its behavior (shape of handwritten letters) deviating from the basic system to a certain extent.

Learning and modeling naturalness using an RNN with ROL produced interesting results. As mentioned in Section 2, computing with ROL is prone to the accumulation of error from previous steps. This phenomenon has been found harmful in many tasks. With naturalness learning, however, ROL performs well.

^1 A displacement vector is not the only means of expressing naturalness; it can be expressed using an arbitrary mechanism.

Fig. 2. Hiragana letter /ka/. Naturalness expressed by a 2-D displacement vector field. Font letter strokes shown in black, handwritten strokes shown in gray.

In the handwriting task, we found that introducing the activation y(n) into play via the update Eq. (3) always helped the network to generate y(n+1) with greater accuracy. An intuitive explanation is as follows. The way a person writes the first half of a handwritten stroke influences, to a certain extent, how the second part is going to look, i.e. distortion of a certain part of a stroke usually implies some other distortion in a successive part of the stroke. It is therefore reasonable to assume that a short sequence of points on a handwritten stroke influences where the next point will appear, with the very last point of the sequence having the greatest influence. The same holds for the naturalness extracted as a difference between handL and fontL.

On the other hand, it is not only the very last point which influences the position of the next point of a stroke, but a short sequence of the previous points. Thus, a backprojection matrix W^back was used so as to ensure that the recent short sequence of generated output ..., y(n-1), y(n) is also reflected in y(n+1) via the activation x(n+1) (see Eq. (2) and Eq. (3)). The same principle holds for the input activations u(n+1) extracted from fontL strokes. Update Eq. (2) ensures that a short sequence ..., u(n), u(n+1) is reflected in the activation x(n+1), which is in turn used to compute y(n+1) via update Eq. (3). The activation u(n+1) is also used directly in Eq. (3) to ensure that the very last point of the input sequence has a significant influence on the computation of y(n+1).

The fact that the naturalness is being modeled, instead of the target system, allows us to control the amount of naturalness being added to the font letters. A weight of value 1 will render generated letters as close to a person's handwriting as possible. A value of, say, 0.6 will reduce the naturalness to 60%, providing us with neat handwritten letters. Values close to 0 will render generated letters very close to the font letters. It is also possible to combine several individuals' naturalness (e.g. 40% of person A's naturalness with 60% of person B's naturalness).
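The weighting of naturalness described above can be sketched directly; the stroke points below are toy values, not data from the paper.

```python
import numpy as np

def add_naturalness(font_pts, displacement, weight=1.0):
    """Scale the displacement field and add it back to the font stroke.

    weight = 1.0 reproduces the handwriting, 0.6 keeps 60% of the
    naturalness, and 0.0 returns the font letter unchanged.
    """
    return font_pts + weight * displacement

# Toy stroke: the naturalness is the handL - fontL displacement field.
font_pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
hand_pts = np.array([[0.1, 0.2], [1.0, 0.3], [1.9, 0.1]])
naturalness = hand_pts - font_pts

assert np.allclose(add_naturalness(font_pts, naturalness, 1.0), hand_pts)
assert np.allclose(add_naturalness(font_pts, naturalness, 0.0), font_pts)

# Blending two writers: 40% of A's naturalness with 60% of B's.
nat_A, nat_B = naturalness, -naturalness
blended = add_naturalness(font_pts, 0.4 * nat_A + 0.6 * nat_B)
```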
The naturalness learning approach is not limited to the handwriting task. We believe one can generate natural-looking movements for parts of the human body with an inverse kinematics algorithm employed as the basic system. Generating emotional human speech, with synthesized speech used as the basic system, might be another interesting application of the naturalness learning approach.

4 Experiments

In this section, we demonstrate how the naturalness learning approach copes with the handwriting task. The letters used in the experiments are symbols of the Japanese syllabary, hiragana. The fontL were extracted from the Bitstream Vera Sans font onto a 250×250 pixel canvas. Every stroke of a given letter was turned into a set of Bézier curves, i.e. a path. Every path was then evenly spaced into a set of points. Let each P^{fl}_{ij} be an n_{ij} × 2 matrix, where i denotes the index of the letter, j the index of the stroke within the letter, and n_{ij} is the number of points. P^{fl}_{ij}(k) = (x_k, y_k) therefore represents the k-th point of the j-th stroke of the i-th letter (the k-th row of the matrix P^{fl}_{ij}).

The handL were first scanned and then appropriately scaled so as to ensure they fit the canvas. Every handL stroke was then aligned to its fontL stroke counterpart. This alignment ensures that the displacement vector field expresses only a shape transformation between a pair of fontL and handL strokes. In order to split the strokes of handL into points, the same procedure was applied as with the fontL, saving the points for each stroke into P^{hl}_{ij}. The spacing interval along each stroke (path) of a handL was, however, also adjusted so that the number of points matched that of the corresponding fontL stroke.

4.1
Data Specification

The input signals were extracted from the points of the fontL's strokes as follows. Let D^{fl}_{ij}(k) be the difference vector P^{fl}_{ij}(k + 1) − P^{fl}_{ij}(k). Then the inertia for the k-th point of the j-th stroke of the i-th letter is given by

    inertia_{ij}(k) = D^{fl}_{ij}(k)    (5)

with each inertia_{ij}(k) being the k-th row of the (n_{ij} − 1) × 2 matrix inertia_{ij}.
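The two preprocessing steps above, re-spacing a stroke path into evenly spaced points (Section 4) and taking successive difference vectors (Eq. (5)), can be sketched as follows. This is an assumed implementation using linear interpolation along the polyline, not necessarily the authors' exact procedure:

```python
import numpy as np

def evenly_spaced(points, n):
    """Resample a 2-D polyline into n points equally spaced by arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)   # segment lengths
    s = np.concatenate(([0.0], np.cumsum(seg)))             # cumulative arc length
    t = np.linspace(0.0, s[-1], n)                          # target arc lengths
    return np.column_stack([np.interp(t, s, points[:, 0]),
                            np.interp(t, s, points[:, 1])])

def inertia(P):
    """Difference vectors D(k) = P(k+1) - P(k): an (n-1) x 2 matrix (Eq. (5))."""
    return np.diff(np.asarray(P, dtype=float), axis=0)

stroke = [[0, 0], [2, 0], [2, 2]]          # an L-shaped path of total length 4
P = evenly_spaced(stroke, 5)               # 5 points, one arc-length unit apart
D = inertia(P)                             # 4 difference vectors
```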
The inertia can be thought of as the motion of an imaginary pen which ‘wrote’ the font letter. The curvature for the k-th point of the j-th stroke of the i-th letter is given by

    curv_{ij}(k) = ( D^{fl}_{ij}(k) [ 0 1 ; −1 0 ] D^{fl}_{ij}(k + 1)^t ) / sqrt( D^{fl}_{ij}(k) D^{fl}_{ij}(k)^t · D^{fl}_{ij}(k + 1) D^{fl}_{ij}(k + 1)^t )    (6)

with each curv_{ij}(k) being the k-th row of the (n_{ij} − 1) × 1 matrix curv_{ij}.
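Reading curv_{ij}(k) as the sine of the angle φ between successive difference vectors (cf. Fig. 3), it can be computed with the 2-D cross product normalized by the vector lengths; treating Eq. (6) as exactly this normalized cross-product form is our assumption:

```python
import numpy as np

def curvature(D):
    """sin(phi) between successive difference vectors D(k) and D(k+1).

    The 2-D cross product of the two vectors, divided by the product of
    their lengths, is the sine of the signed angle between them.
    """
    D = np.asarray(D, dtype=float)
    a, b = D[:-1], D[1:]
    cross = a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0]
    return cross / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = curvature(D)   # sin(+90 deg) = 1, then sin(-45 deg) = -1/sqrt(2)
```

The sign of the result encodes the turning direction of the pen.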
Each curv_{ij} and inertia_{ij} pair was merged into a single (n_{ij} − 1) × 3 matrix U_{ij}, with each column being normalized into the interval (−1, 1). In order to partially erase the transient dynamics from the previous stroke, a zero sequence of size n_{gap} × 3 was inserted before every U_{ij}, resulting in the final input matrix U.

Naturalness, which serves as the (2-dimensional) output signal, is represented by the 2-D displacement vector field. The 2-D displacement vector field for the j-th stroke of the i-th letter is given by

    Y_{ij} = P^{hl}_{ij} − P^{fl}_{ij}    (7)

with Y_{ij}, P^{hl}_{ij} and P^{fl}_{ij} each being n_{ij} × 2 matrices. The last row of every Y_{ij} was dropped to ensure each pair of Y_{ij} and U_{ij} has the same length of n_{ij} − 1. Each column of Y_{ij} was scaled down to the interval (−1, 1). A zero sequence of size n_{gap} × 2 was inserted before every Y_{ij}, resulting in the final output matrix Y.

4.2 Network Parameters

A 300-unit network was used with activation function f = RBF. Internal weights in the matrix W were randomly assigned the values 0, 0.2073 and −0.2073 with probabilities 0.95, 0.025 and 0.025. For a 300 × 300 matrix W, this implies a spectral radius of ≈ 0.85, providing for a relatively long short-term memory [4]. 3 input and 2 output units were attached. Input weights were randomly drawn from a uniform distribution over (−1, 1) with probability 0.9 or set to 0. With such an input matrix, the network is strongly driven by the input activations because many elements of the matrix W^in are non-zero. The network had output feedback connections, which were randomly set to one of the three values 0, 0.1 and −0.1 with probabilities 0.9, 0.05 and 0.05. Such a setup of feedback connections makes the network only marginally excited by previous output activations. The activation function for the output units was the identity f^out(x) = x.
Fig. 3. Geometrical meaning of the input data. inertia_{ij}(k) represents the difference vector between the two successive points P^{fl}_{ij}(k + 1) and P^{fl}_{ij}(k); curv_{ij}(k) is the sine of the angle φ.
Fig. 4. Testing with the training data: letters produced by the RNN with/without ROL, with teacher-forcing switched off from n = 301

Fig. 5. Testing with the testing data: letters produced by the RNN with/without ROL

4.3
Training and Testing

The training data was made from the letters shown in Fig. 4, resulting in a 3027 × 3 input matrix U_train and a 3027 × 2 output matrix Y_train, which were prepared according to the data specification with n_gap = 16 being used. Eq. (2) was used for updating, with u(n), y(n) being the transposed n-th rows of the matrices U_train and Y_train respectively. The first 300 steps were discarded, and the network internal states x(n) with input unit states u(n) were collected from n = 301 through n = 3355. The output weights W^out were computed using the internal states and input unit states only. The training errors of the first and second output units were mse_{train,1} ≈ 9.5 × 10^{-4} and mse_{train,2} ≈ 2.8 × 10^{-3} respectively. Making the output layer recurrent (ROL) and computing W^out using not only the internal/input states x(n), u(n) but also the output activation values y(n − 1) reduced the training errors mse_{train,1} and mse_{train,2} down to ≈ 6 × 10^{-4} and ≈ 2.0 × 10^{-3} respectively. A visual comparison is shown in Fig. 4.

The testing data was made from the letters shown in Fig. 5, resulting in a 6045 × 3 input matrix U_test and a 6045 × 2 output matrix Y_test, with n_gap = 16 being used. For the non-ROL topology the test errors were found to be mse_{test,1} ≈ 1.2 × 10^{-2}, mse_{test,2} ≈ 3.5 × 10^{-2}. Using ROL reduced the test errors to mse_{test,1} ≈ 0.9 × 10^{-3}, mse_{test,2} ≈ 3.0 × 10^{-2}. The errors mse_{test,i} provide only a rough indication of network performance. A visual comparison between these two trials is shown in Fig. 5. We can observe that the network also produces appropriate naturalness for letters on which it had not been trained.

5 Discussion

5.1 Network
Here we try to provide an insight into why the setup from section 4 works best. The network has information concerning the shape of the strokes in fontL and handL via its history of input and output activations. The matrices W^in and W from the setup in section 4 make the network strongly driven by the (finite) history of the input. Moreover, using the most recent input activation u(n + 1) in Eq. (3) directly makes the impact even stronger. It is most likely the case that the last several points of the path along which a fontL stroke has been drawn have a substantial impact on the naturalness, and thus also on where the handL stroke is going to continue. An intuitive explanation is that a human, while writing in one's own individual way, is trying to ‘approximate’ a font letter shape as memorized in school. This finding is in line with the basic idea of naturalness learning: to model a target system (handL) by means of a basic system (fontL) and its difference (naturalness) from the target system. The feedback weights W^back
from the setup in section 4 only slightly excite the network with the output history. Surprisingly, using the most recent output activation y(n) in Eq. (3) directly (= ROL) always improved the performance. A plausible explanation is that the already-written part of a handL stroke influences how the part yet to be written will look, with the naturalness of the previous step having the highest relevance to the naturalness generated in the next step (i.e. distortion of a certain part of a stroke usually implies some other distortion to a successive part of the stroke). Feedback connections with larger weights (up to +1) and no ROL were also tested. With this setup the network training error was about the same as in section 4, but driving the network with the testing data rendered worse performance or made the network unstable. To check the robustness of the advantage of ROL, we tested the setup in section 4 with a range of different starting internal weights W. In all cases ROL enabled better performance.
5.2 Data Structure

The naturalness in this paper is represented as a 2-D displacement vector field (Fig. 2). We are, however, not strictly bound to this representation. The naturalness can also be represented as a set of parameters in a system which represents the difference between the target (handL) and the basic (fontL) system. The input data characterizing the basic system (fontL) was made position-independent so as to ensure that the same stroke will be represented by the same data regardless of its starting position. This significantly reduces the complexity of the handwriting task because, while separate strokes are substantially different in shape, some of their small parts are often similar. The short-term memory of a RNN makes the distinction between identical/similar short stroke sequences possible, because the RNN also accounts for points before the identical/similar stroke part.

6 Conclusion

In the handwriting task we showed that, by modeling the target system by means of a basic system and its difference from the target system, a substantial relevance is revealed between the difference produced in step n and that produced in step n + 1. In such a case the usage of a ROL turned out to be advantageous. We would like to confirm these findings by applying naturalness learning to other tasks as well. Modeling the unique individualistic quality of human motion is the next step in confirming the feasibility of both naturalness learning and the usability of a ROL for tasks formulated in terms of naturalness learning.

References

1. Elman, J.L.: Finding structure in time. Cognitive Science: A Multidisciplinary Journal 14(2), 179–211 (1990)
2. Jaeger, H.: The "echo state" approach to analysing and training recurrent neural networks. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 148 (December 2001)
3. Jaeger, H.: Short term memory in echo state networks. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 152 (March 2002)
4.
Jaeger, H.: Supervised training of recurrent neural networks, especially with ESN approach. Sankt Augustin: GMD-Forschungszentrum Informationstechnik, GMD-Report 159 (October 2002)
5. Jordan, M.I.: Serial Order: a parallel distributed processing approach. Tech. Rep. 8604, Univ. of California at San Diego, Inst. for Cognitive Science (May 1986)
6. Jordan, M.I.: Attractor dynamics and parallelism in a connectionist sequential machine. In: Eighth Annual Conf. of the Cognitive Science Society, Amherst, MA, USA, pp. 531–546 (August 1986)
7. Krose, B., van der Smagt, P.: Recurrent networks. In: Ch. 5, An Introduction to Neural Networks, Eighth Edition, Univ. of Amsterdam (November 1996)
8. Wang, Y.-C., Chien, C.-J., Teng, C.-C.: Direct adaptive iterative learning control of nonlinear systems using an output-recurrent fuzzy neural network. IEEE Trans. on SMC-B 34(3), 1348–1359 (2004)
Using Generalization Error Bounds to Train the Set Covering Machine

Zakria Hussain and John Shawe-Taylor

Centre for Computational Statistics and Machine Learning
Department of Computer Science
University College, London
{z.hussain,j.shawe-taylor}@cs.ucl.ac.uk

Abstract. In this paper we eliminate the need for parameter estimation associated with the set covering machine (SCM) by directly minimizing generalization error bounds. Firstly, we consider a sub-optimal greedy heuristic algorithm termed the bound set covering machine (BSCM). Next, we propose the branch and bound set covering machine (BBSCM) and prove that it finds a classifier producing the smallest generalization error bound. We further justify empirically the BBSCM algorithm with a heuristic relaxation, called BBSCM(τ), which guarantees a solution whose bound is within a factor τ of the optimal. Experiments comparing against the support vector machine (SVM) and SCM algorithms demonstrate that the approaches proposed can lead to some or all of the following: 1) faster running times, 2) sparser classifiers and 3) competitive generalization error, all while avoiding the need for parameter estimation.

1 Motivation

Two algorithms that use very different mechanisms in order to build their classifiers are the support vector machine (SVM) [2,6,3] and the set covering machine (SCM) [4]. Taking the binary classification task as an example, the first, more familiar, algorithm finds a hyperplane in feature space that maximizes the margin between the two classes of examples. The second looks for a classifier constructed from a small subset of data-derived decision functions consistent with the set of positive examples. Fundamentally these two algorithms differ in their approaches to solving a classification problem.
However, as was argued in [4], it may be more effective (for some learning tasks) to build a classifier from a small subset of data-derived decision functions as opposed to a maximum-margin separating hyperplane.

In this paper, we demonstrate a method of model selection for the SCM without the need for parameter estimation. Instead, we apply the generalization error bound directly to the algorithm in order to determine its output classifier. First we apply the bound directly to the SCM using a greedy approach and call

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 258–268, 2008.
© Springer-Verlag Berlin Heidelberg 2008
the heuristic the bound set covering machine (BSCM). Next we look to globally optimize the bound by proposing the branch and bound set covering machine (BBSCM), and prove that the algorithm will indeed find the hypothesis with the smallest generalization error bound. This algorithm turns out to be too slow experimentally, and so we introduce a relaxation called BBSCM(τ) and prove that its solutions are a factor τ from the optimal.

The structure of this paper is as follows: In the following section we describe the set covering machine (SCM) and state two generalization error bounds for the SCM. Section 3 describes the bound set covering machine (BSCM). In Section 4 we discuss in detail the branch and bound set covering machine (BBSCM). Section 5 describes a heuristic derived from the BBSCM algorithm, called BBSCM(τ). Section 6 proves the optimality of the algorithms proposed. The experimental Section 7 discusses results for several UCI repository data sets, with Section 8 concluding the paper.

2 The Set Covering Machine

Definition 1. Let S = {(x_1, y_1), . . . , (x_m, y_m)} be a sample where x_i ∈ X = P ∪ N is a training example and y_i ∈ {−1, +1} its classification. We use the convention x = (x_1, . . . , x_n) for each training example and refer to X as the training set. Let P be the set of positive (+1) and N the set of negative (−1) training examples in the conjunction case, and let P be the set of negative (−1) and N the set of positive (+1) training examples in the disjunction case.

The set covering machine tries to build a conjunction or disjunction of data-dependent features (data-derived decision functions) in order to classify future test examples. The set of data-derived decision functions we will use is the following set of data-dependent balls.

Definition 2. For a training example x_i ∈ X with label y_i ∈ {−1, 1} and (real-valued) radius ρ, let f_{i,ρ} be the following data-dependent ball centered around x_i:

    f_{i,ρ}(x) = { y_i    if d(x, x_i) ≤ ρ
                { ȳ_i    otherwise

where ȳ_i is the complement of y_i and d(x, x_i) is the distance between x and x_i. We choose the centers x_i ∈ X and restrict the definition to border points x_j ∈ X, so that the radius is given by

    ρ = { d(x_i, x_j) + α    if x_i ∈ P
        { d(x_i, x_j) − α    if x_i ∈ N

where α is a small positive real number.

For any ball h ∈ H, let ν(h) be the set of N examples correctly classified by h and let π(h) be the set of P examples misclassified by h. Given these definitions we have the following description for the usefulness of a data-dependent ball.

Definition 3. The usefulness (or utility) U of h is expressed as:

    U(h) = |ν(h)| − p|π(h)|    (1)

where p is a small positive real number.

Finally, when we discuss a P example we will mean an example from the set P. Similarly, an N example will denote an example from the set N. Using these definitions we can now describe how to train the SCM with data-dependent balls.

2.1
Training

The SCM algorithm uses a greedy approach to try to completely classify the set of N examples whilst misclassifying a small number of P examples. Let N be the set of N examples still to be covered and let P be the set of P examples that have been misclassified. Initially, therefore, hypothesis B ← ∅, N ← N and P ← ∅. At the first iteration the SCM algorithm looks for the ball h_i that maximizes the utility value U(h_i). After the ball h_i with the highest usefulness is added, the hypothesis becomes B ← B ∪ {h_i}, N ← N \ ν(h_i) and P ← P ∪ π(h_i). This is repeated until no more N examples are left to cover (N = ∅) or until the early stopping criterion |B| ≤ s is satisfied. Clearly the algorithm is greedy, as it only adds a ball h to hypothesis B if it maximizes the utility U(h). Once the SCM has output its hypothesis B, it can be used to make predictions.

2.2 Testing
Given a hypothesis B containing a conjunction or disjunction of data-dependent balls h_i, predictions are made by

    B(x) = { y    if h_i(x) = y for i = 1, . . . , |B|
           { ȳ    otherwise                              (2)

where ȳ is the complement of y and, from Definition 1, y = 1 for a conjunction and y = −1 for a disjunction.

2.3 Generalization Error Bounds

We now state a loss bound for the SCM that will be used for the theoretical and experimental sections of the paper. We also use a second bound, but only reference it in order to save space.

Theorem 1. [4] Suppose a SCM finds a solution given by a set B of features with R = R(B) = |B| features, R_+ = R_+(B) of which are centered around P examples, with k_p = k_p(B) the number of P training errors and k_n = k_n(B) the number of N training errors, on a sample of size m > 2R + k_p + k_n. Then with probability 1 − δ over random draws of training sets, the generalization error of the resulting classifier can be bounded by

    ε(m, R, R_+, k_p, k_n) ≤ 1 − exp( −1/(m − 2R − k_p − k_n) · [ ln C(m, 2R) + ln C(2R, R_+) + ln C(m − 2R, k_p + k_n) + ln(2m²R/δ) ] )    (3)

where C(n, k) denotes the binomial coefficient "n choose k".
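The bound (3) can be evaluated numerically with log-gamma functions to avoid overflowing binomial coefficients. Note that the binomial-coefficient reading of (3), including the constant inside the last logarithm, is our reconstruction of a garbled original and should be treated as an assumption:

```python
from math import exp, lgamma, log

def log_binom(n, k):
    """ln C(n, k) computed via log-gamma to avoid overflow."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def scm_bound(m, R, R_plus, kp, kn, delta=0.05):
    """Generalization error bound (3), under the reconstruction given above."""
    denom = m - 2 * R - kp - kn
    assert denom > 0, "requires m > 2R + kp + kn"
    s = (log_binom(m, 2 * R) + log_binom(2 * R, R_plus)
         + log_binom(m - 2 * R, kp + kn) + log(2 * m ** 2 * R / delta))
    return 1 - exp(-s / denom)

eps = scm_bound(m=300, R=3, R_plus=1, kp=2, kn=1)   # a non-trivial value in (0, 1)
```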
The second bound we will use is Equation 10 of [5], which is slightly tighter than the above bound. These bounds are tight and non-trivial (i.e., always less than 1). Exploiting this, we will apply the generalization error bound directly to obtain classifiers for the SCM and, with it, remove the need for parameter estimation in the SCM. Because of its greedy application of the bound, we call this first heuristic the bound set covering machine (BSCM).

3 The Bound Set Covering Machine

In this variant of the SCM we allow the algorithm to be driven by one of the generalization error bounds given by equation (3) or Equation 10 of [5]. For simplicity and to save space we only describe the remaining work with Theorem 1, although the bound of [5] can also be applied with the same reasoning.

Given any hypothesis B and ball h, we can calculate the risk bound of adding ball h to B as ε(B ∪ {h}), using ε(m, R, R_+, k_p, k_n). This is true for adding any ball h. Therefore, similarly to the SCM, the bound set covering machine (BSCM) algorithm can be described as follows. Initially, hypothesis B ← ∅, N ← N, P ← ∅ and best bound ε∗ ← 1. At the first iteration, the BSCM algorithm looks for the ball h_i that minimizes the generalization error bound ε(B ∪ {h_i}). The ball h_i that minimizes ε(B ∪ {h_i}) is added to the hypothesis: B ← B ∪ {h_i}, N ← N \ ν(h_i), P ← P ∪ π(h_i) and best bound ε∗ ← ε(B ∪ {h_i}). This is repeated until no new loss bound ε(B ∪ {h}), for any new ball h, can be found such that ε(B ∪ {h}) < ε∗. Finally, the resulting hypothesis B can be used to classify new test examples using (2). Note that the BSCM has eliminated the need for the penalty parameter p and the soft-stopping parameter s that were present in the SCM.

It is clear that the BSCM heuristic is greedy in its solution in much the same way as the SCM, with the difference that it looks to minimize the generalization error bound as opposed to maximizing the utility function for each ball. However, the nature of a greedy approach implies that we are not always guaranteed a globally optimal classifier. Therefore, in the following section, we present a branch and bound approach to solving the set covering machine that ensures a much stronger result, that of global optimality.

4 The Branch and Bound Set Covering Machine

In this section we use a branch and bound approach to enumerate all possible hypotheses that can be generated from the set of data-dependent balls. This is done by evaluating the bound every time a new ball can be added to the current hypothesis. If a hypothesis cannot be extended with a ball such that the new bound is smaller than the best bound currently found, then there is no need to include this ball in the current hypothesis. Note that the choice of balls can also be thought of in terms of a branch and bound search tree, where any ball at a particular depth d is found from its parent ball at depth d − 1 and can produce child balls at level d + 1. The motivation for such a strategy for solving the set covering machine is that, if the function to be minimized is the generalization error bound, then we are assured of finding a hypothesis with the smallest generalization error bound.

4.1 Algorithm

The algorithm relies on the functions addball and createtable.
We have detailed each function below and also included the pseudocode. Before we describe the functions in detail we now give some notation. Let S be the sample containing training set X and label set Y, H the set of data-dependent balls, and T the machine type, which can be either a conjunction or a disjunction of balls. As earlier, let π(h) be the set of P examples misclassified by h and ν(h) the set of N examples correctly classified by h. For any hypothesis B and any ball h_i ∉ B, let B_i = B ∪ {h_i}. The generalization error bound of B is given by ε(B), where ε(B) = ε(m, R, R_+, k_p, k_n); also, for the same hypothesis B, the best potential (bp) generalization error bound η(B) is given by

    η(B) = ε(m, R + 1, R_+, k_p, 0).

So the bp generalization error bound η(B) is the bound ε(B_i) for B_i = B ∪ {h_i} if a single ball h_i can be added to hypothesis B such that all of the remaining N examples are covered and none of the remaining P examples are misclassified. Contrast this with the generalization error bound ε(B), which is simply the bound given for hypothesis B.

Algorithm BBSCM: The input of the algorithm is the sample S containing training set X = P ∪ N, the machine type T that is either a 'conjunction' or a 'disjunction', and H, the set of data-dependent balls computed from training set X. The algorithm contains local variables B, N, P, ε∗ and global variable ℓ. Initially B ← ∅ is the empty hypothesis that we start with, N contains all the N training examples that are still to be covered, P is empty because no P examples are initially misclassified, and ε∗ is the best generalization error bound found so far for any hypothesis B, initially set to 1. Global variable ℓ is set to the
number of balls possible for the current B. Using these inputs and variables, algorithm BBSCM recursively calls the function addball(B, ε∗, N, P) (see below). Finally, the output of algorithm BBSCM is a conjunction or disjunction of balls (classifier/hypothesis) B∗.

Algorithm 1. BBSCM(S, T, H)
Input: S, T, H
  B ← ∅, N ← N, P ← ∅, ε∗ ← 1, ℓ ← |H|
Call: addball(B, ε∗, N, P)
Output: a T = 'conjunction' or T = 'disjunction' of data-dependent balls B∗ ⊆ H

Function addball(B, ε∗, N, P)
Data: consider all h ∈ H \ B;
order according to ε({h} ∪ B) → (h_1, ε_1, η_1), . . . , (h_ℓ, ε_ℓ, η_ℓ);
 1: for i = 1, . . . , ℓ do
 2:   Btemp ← B, Ntemp ← N, Ptemp ← P
 3:   if η_i < ε∗ then
 4:     B ← B ∪ {h_i}, N ← N \ ν(h_i), P ← P ∪ π(h_i)
 5:     if ε_i < ε∗ then
 6:       B∗ ← B, ε∗ ← ε_i
 7:       call createtable(ε∗, m)
 8:     end if
 9:     found ← false, R ← |B|, A ← −|N|
10:     while ¬found do
11:       R ← R + 1, A ← A + |Ntemp| − |N|, kn ← table(R, |P|)
12:       if kn = −1 or A ≥ −kn then
13:         found ← true
14:       end if
15:     end while
16:     if kn ≠ −1 then
17:       call addball(B, ε∗, N, P)
18:     end if
19:   end if
20: end for
Function addball: This function adds each possible ball h_i to the current hypothesis B and checks whether its generalization error bound ε_i is smaller than the best generalization error bound ε∗ found so far. If so, then the value of B∗ is replaced with B and the best risk bound ε∗ is replaced with ε_i (line 6). Also at this stage the function createtable is called (line 7) to get table (see the description of function createtable below).

On line 11, if table(R, |P|) returns kn = −1, then this indicates that there is no bound for R and |P| that is smaller than ε∗. If table(R, |P|) is a positive integer kn, then there is a possibility of finding a ball to add to the current hypothesis that will give a smaller risk bound than ε∗, provided there exists a set of R additional balls that leave no more than kn N examples uncovered and no additional P examples misclassified. If kn ≥ 0, then line 12 checks whether a larger number of N examples can be covered using R balls (see Lemma 2). If so, then the procedure calls itself recursively (line 17) until all balls in H have been enumerated.

Function createtable: The local variable table is an m × m matrix, where m = |X| is the number of training examples. Initially all values in the table are set to −1. Function createtable calculates, for R balls and kp misclassifications (on the P examples), the number of N examples that can be covered without creating a bound lower than the best bound ε∗ found so far (line 5). This function returns table.

Function createtable(ε∗, m)
Initialize: table ← −1, R_+ ← 0, kpfound ← true, kp ← −1
 1: while kpfound do
 2:   kpfound ← false, kp ← kp + 1, Rfound ← true, R ← 0
 3:   while Rfound do
 4:     Rfound ← false, R ← R + 1, kn ← 0
 5:     while ε(m, R, R_+, kp, kn) < ε∗ do
 6:       kn ← kn + 1
 7:     end while
 8:     if kn > 0 then
 9:       kpfound ← true, Rfound ← true
10:     end if
11:     table(R, kp) ← kn − 1
12:   end while
13: end while

5 BBSCM(τ)

BBSCM(τ) allows a trade-off between the accuracy of the classifier and speed. In function createtable(ε∗, m) the while condition computes ε(m, R, R_+, kp, kn) < ε∗. In the BBSCM(τ) heuristic this while condition becomes ε(m, R, R_+, kp, kn) < τ · ε∗, allowing the BBSCM algorithm to ignore solutions whose generalization error bounds are not within a factor τ of the optimal found so far. Clearly, setting τ = 1 gives the BBSCM algorithm; however, as mentioned above, this algorithm is too slow for data sets with m > 100 training examples. Therefore we would like to set τ < 1. Setting τ close to 1 may cause the heuristic to be too slow, but create hypotheses that have small generalization error bounds similar to those for τ = 1. Setting τ close to 0 will speed up the solution, but may not create a large enough search space for BBSCM(τ) to find hypotheses with relatively small generalization error bounds. Hence, setting τ
is unlike setting a regularization parameter, since it is trading accuracy against time: a bigger τ is always better in terms of generalization error bounds, but costs more in terms of computational time.

6 Theory

Before looking at the first theorem we present the following lemma, which allows the theorem to be proved. Note that because of space constraints we only use Theorem 1 for this section, although the results also hold for other SCM bounds. Also, we only give a sketch proof of the main theorem, because the full proofs of all the results in this section can be found in the longer version of the paper.

Lemma 1. The generalization error bound given by equation (3) (Theorem 1) is monotonically increasing in the second parameter.

Theorem 2. Let ε∗ be the smallest generalization error bound currently found for hypothesis B∗ by the BBSCM algorithm. For any hypothesis B_i that gives η(B_i) > ε∗ there exists no extension B̂ ⊇ B_i with generalization error bound ε(B̂) ≤ ε∗.

This theorem states that, for any hypothesis B_i, if the bp generalization error bound η(B_i) is worse than the best bound ε∗, then there is no need to try to cover any more N examples from this ball, as there will never be a smaller bound than the best ε∗ found so far.

Lemma 2. Let U be a set covered by A_1, . . . , A_k. For any V ⊆ U there exists j such that |A_j ∩ V| / |V| ≥ 1/k.

Lemma 3. Suppose createtable(ε∗, m) has been executed and kn = table(R, kp) ≥ 0. It follows that ε(m, R, R_+, kp, kn + 1) > ε∗ for R_+ ≥ 0.

Using Lemmas 2 and 3 we can now prove that the BBSCM(τ) algorithm will only disregard a ball for inclusion if it cannot lead to a hypothesis with a generalization error bound smaller than the one currently found by BBSCM(τ), hence giving solutions that are a factor τ from the optimal.

Theorem 3 (main theorem). If algorithm BBSCM(τ) outputs hypothesis B∗ with generalization error bound ε∗, then there exists no hypothesis B̂ such that the generalization error bound ε(B̂) < τ · ε∗.

Proof (sketch). We prove this result by contradiction. We consider a hypothesis whose generalization error bound is smaller than the one found by BBSCM(τ). Next we claim and prove, using Lemmas 2 and 3, that if this hypothesis does indeed have a smaller loss bound than the BBSCM(τ) solution then it should have been chosen by BBSCM(τ). However, this hypothesis was not chosen by the BBSCM(τ) algorithm. Hence, a contradiction.

Theorem 4. If algorithm BBSCM outputs a hypothesis B∗, then its generalization error bound ε∗ will be globally optimal (i.e. the smallest bound possible).

Proof. Apply Theorem 3 with τ = 1.

From these theoretical results we have shown that the BBSCM is guaranteed to find the hypothesis B∗ with the smallest generalization error bound ε∗.

7 Experiments

Experiments were conducted on seven standard UCI repository data sets [1] and are described in [4]. All examples with contradictory labels or whose attributes contained unknown values were removed (this reduced the Votes data set considerably).

We compared the results against the SVM and SCM. The SVM was equipped with a Gaussian kernel, and the SCM used the L_2 metric to construct its set of data-dependent balls. All training sets were evaluated using 10-fold cross validation, and both the SVM and SCM were trained with an extra 10-fold cross validation split in order to evaluate the regularization parameters for each algorithm.
Note that the running times include this parameter tuning phase. In contrast, the BSCM and BBSCM(τ) experiments were conducted using only one 10-fold cross validation split, as there were no parameters to tune.

For the SVM, we report in Table 1 the number of support vectors (SVs), the number of seconds (time) spent training and tuning parameters, and the error (err) of the classifiers averaged over all 10 sets for various values of the Gaussian width parameter γ and the regularization parameter C. For the SCM we report the machine type (T), where "c" denotes a conjunction and "d" denotes a disjunction of balls that gave the smallest CV error. To compare against the number of support vectors in the SVM we also report the average number of balls (b) for each classifier over the 10 folds. For the BSCM and BBSCM(τ) we also give the generalization error bound (thm) used for each heuristic, where "(1)" denotes Theorem 1 and "(2)" denotes the theorem of [5] given by their Equation 10. Finally, we chose "c" and "d" for those machines that gave the smallest generalization error bound for each CV split. Furthermore, for the BBSCM(τ) we only used τ = 0.1, as the search space grew too large for anything larger than this value.

From Table 1 we observe that the BSCM and BBSCM(τ) heuristics are competitive with the SVM and SCM in terms of generalization error for the BreastW, Votes, Pima, Haberman and Glass data sets. Also, both heuristics are considerably sparser than the SVM. Finally, the BSCM exhibits faster running times when compared with the SVM and SCM.

8 Conclusion

We have proposed the branch and bound set covering machine (BBSCM) algorithm and two heuristics, BSCM and BBSCM(τ), along with theoretical results
Table 1. SVM and SCM model-selection results using 10-fold cross validation

Data Set | SVM: SVs / time / err  | SCM: T / b / time / err  | BSCM: thm / T / b / time / err | BBSCM(τ): thm / T / b / err
BreastW  | 79.2 / 506.1 / .035    | c / 2 / 658.4 / .0395    | (2) / c / 1.5 / 46.7 / .0335   | (2) / c / 1 / .0351
Votes    | 18.5 / 8.8 / .1033     | c / 1 / 5.4 / .1154      | (2) / d / 1.5 / 0.1 / .14      | (2) / c / 1 / .1538
Pima     | 398.1 / 11071.9 / .246 | c / 5.9 / 1091.9 / .2721 | (2) / c / 2.8 / 99.4 / .2579   | (2) / c / 1 / .2891
Haberman | 136.4 / 5129.2 / .2413 | d / 13 / 272.5 / .2789   | (1) / c / 1 / 25.5 / .2585     | (2) / c / 1 / .2585
Bupa     | 199.4 / 3083.1 / .2694 | d / 32.6 / 128 / .3478   | (2) / d / 4.1 / 43.2 / .3769   | (2) / d / 1 / .3913
Glass    | 92.4 / 183.8 / .2383   | c / 3.8 / 57.9 / .2331   | (2) / c / 1 / 1.2 / .2379      | (2) / c / 1 / .2393
Credit   | 325.9 / 7415.1 / .2545 | d / 3.7 / 831.2 / .317   | (1) / c / 1 / 53.3 / .3247     | (2) / c / 1 / .3247

and experimental evidence to show that the algorithm is competitive with the SCM and SVM. The main novelty of this paper is that we have eliminated the need for parameters in the SCM by using tight generalization error bounds to train the algorithm. We have proved results guaranteeing that the BBSCM algorithm finds a classifier with the smallest generalization error bound, and we have shown that heuristics that directly minimize the loss bound are competitive with traditional cross-validation model selection techniques. We believe this helps bring together the ideas of learning theory and practical machine learning: using tight bounds directly to carry out model selection. This, we hope, is a first step towards learning machines that need little human intervention for tuning regularization parameters, with (hopefully) no degradation in generalization ability. Future work involves further pruning the search space of the BBSCM algorithm so that larger data sets become tractable, as well as the development and use of tighter loss bounds in the BBSCM, BSCM and BBSCM(τ) algorithms.

Acknowledgements

This work was supported in part by the IST programme of the European Community, under the PASCAL network of excellence.

References

1. Blake, C., Merz, C.: UCI Repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA (1998), http://www.ics.uci.edu/∼mlearn/MLRepository.html
2. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152. ACM Press, New York (1992)
3.
Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
4. Marchand, M., Shawe-Taylor, J.: Learning with the set covering machine. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 345–352 (2001)
5. Marchand, M., Sokolova, M.: Learning with decision lists of data-dependent features. Journal of Machine Learning Research 6, 427–451 (2005)
6. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)