Lecture Notes in Computer Science
Experimental Bayesian Generalization Error of Non-regular Models

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 466–476, 2008.
© Springer-Verlag Berlin Heidelberg 2008

Abstract. The change of the input distribution between the training and test phases is referred to as the covariate shift. Many studies have addressed adapting to this change, since ordinary machine learning methods do not work properly under the shift. Asymptotic theory has also been developed for Bayesian inference. Although many effective results have been reported for statistically regular models, non-regular models have not been studied thoroughly. This paper focuses on the behavior of non-regular models under the covariate shift. In a former study [1], we formally revealed the factors that change the generalization error and established its upper bound. Here we report experimental results that support the theoretical findings. Moreover, we observe that the basis function of the learning model plays an important role in some cases.

1 Introduction

The task of regression is to estimate the input-output relation q(y|x) from sample data, where x and y are the input and output, respectively. It is generally assumed that a single input distribution q(x) generates both the training and test data. However, this assumption is often violated in practical situations, e.g., brain-computer interfacing [2], bioinformatics [3], etc. The change of the input distribution from the training distribution q_0(x) to the test distribution q_1(x) is referred to as the covariate shift [4]. It is known that, under the covariate shift, standard machine learning techniques do not work properly, and many efficient methods have been proposed to tackle this issue [4,5,6,7].

In Bayesian estimation, Shimodaira [4] revealed how the generalization error is improved by the importance weight in regular cases. We formally clarified the behavior of the error in non-regular cases [1]. The result shows that the generalization error is determined by lower-order terms, which can be ignored in the situation without the covariate shift. At the same time, it turned out that the calculation of these terms is not straightforward even in a simple regular example. To cope with this problem, we also established an upper bound using the generalization error without the covariate shift. However, how to derive the theoretical generalization error in non-regular models is still an open problem.

In this paper, we observe experimental results calculated with a Monte Carlo method and examine the theoretical upper bound on some non-regular models. Comparing the generalization error under the covariate shift to that without the shift, we investigate the effect of the basis function in the learning model and consider the tightness of the bound.

In the next section, we define non-regular models and summarize the Bayesian analyses for such models without and with the covariate shift. We show the experimental results in Section 3 and give a discussion at the end.

2 Bayesian Generalization Errors with and without Covariate Shift

In this section, we summarize the asymptotic theory of non-regular Bayesian inference. First, we define the non-regular case. Then, mathematical properties of non-regular models without the covariate shift are introduced [8]. Finally, we state the results of the former study [1], which clarified the generalization error under the shift.

2.1 Non-regular Models

Let us define the parametric learning model by p(y|x, w), where x, y and w are the input, output and parameter, respectively. When the true distribution r(y|x)
is realized by the learning model, the true parameter w^* exists, i.e., p(y|x, w^*) = r(y|x). The model is regular if w^* is a single point in the parameter space. Otherwise, it is non-regular: the true parameter is not one point but a set of parameters

W_t = \{ w^* : p(y|x, w^*) = r(y|x) \}.

For example, three-layer perceptrons are non-regular. Let the true distribution be the zero function with Gaussian noise,

r(y|x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right),

and let the learning model be a simple three-layer perceptron,

p(y|x, w) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(y - a \tanh(bx))^2}{2} \right),

where the parameter is w = \{a, b\}. It is easy to find that the true parameters form the set
W_t = \{a = 0\} \cup \{b = 0\}, since 0 \times \tanh(bx) = a \times \tanh 0 = 0. This non-regularity (also called non-identifiability) means that conventional statistical methods cannot be applied to such models (we mention the details in Section 2.2). In spite of this difficulty for the analysis, non-regular models, such as perceptrons, Gaussian mixtures, hidden Markov models, etc., are widely employed in many information engineering fields.

K. Yamazaki and S. Watanabe

2.2 Properties of the Generalization Error without Covariate Shift

As mentioned in the previous section, the conventional statistical manner does not work for non-regular models. To cope with this issue, a method was developed based on algebraic geometry; here we introduce a summary. Hereafter, we denote the cases without and with the covariate shift by the subscripts 0 and 1, respectively. Some of the functions with suffix 0 will be replaced by those with suffix 1 in the next section.

Let \{X^n, Y^n\} = \{X_1, Y_1, \ldots, X_n, Y_n\} be a set of training samples that are independently and identically generated by the true distribution r(y|x) q_0(x).
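As a numerical illustration (ours, not part of the original paper), the non-identifiability of the tanh perceptron above can be checked directly: every parameter in W_t = {a = 0} ∪ {b = 0} assigns exactly the same likelihood to any training set, since a·tanh(bx) vanishes identically on that set. The data sizes and parameter values below are arbitrary choices for the sketch.

```python
import math
import random

def log_lik(data, a, b):
    """Log-likelihood of the perceptron model p(y|x, w) with w = (a, b)."""
    s = 0.0
    for x, y in data:
        r = y - a * math.tanh(b * x)
        s += -0.5 * r * r - 0.5 * math.log(2 * math.pi)
    return s

random.seed(0)
# Training data from r(y|x) q_0(x): x ~ N(0,1), y ~ N(0,1) (zero function plus noise)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]

# Any parameter with a = 0 or b = 0 realizes the true distribution
l1 = log_lik(data, 0.0, 3.7)   # on {a = 0}
l2 = log_lik(data, -2.5, 0.0)  # on {b = 0}
l3 = log_lik(data, 0.0, 0.0)
print(abs(l1 - l2) < 1e-9 and abs(l1 - l3) < 1e-9)  # True: likelihood is constant on W_t
```

This constancy of the likelihood along a whole set of parameters is exactly what prevents the posterior from concentrating at a single point.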
Let p(y|x, w) be a learning machine and \varphi(w) an a priori distribution of the parameter w. Then the a posteriori distribution (posterior) is defined by

p(w|X^n, Y^n) = \frac{1}{Z(X^n, Y^n)} \prod_{i=1}^{n} p(Y_i|X_i, w) \varphi(w),

where

Z(X^n, Y^n) = \int \prod_{i=1}^{n} p(Y_i|X_i, w) \varphi(w) \, dw.   (1)

The Bayesian predictive distribution is given by

p(y|x, X^n, Y^n) = \int p(y|x, w) p(w|X^n, Y^n) \, dw.
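As a sanity-check sketch (ours, not from the paper), the posterior and the predictive distribution can be computed by brute-force numerical integration for a one-parameter model. The model p(y|x, w) = N(y; wx, 1), the standard normal prior, the grid, and the data sizes are illustrative assumptions.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

random.seed(1)
n = 20
# Training data: x ~ q_0 = N(0,1), y ~ r(y|x) = N(0,1) (true function is zero)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# Parameter grid for w; prior phi(w) = N(0,1)
ws = [w / 100.0 for w in range(-400, 401)]
dw = 0.01
# Unnormalized posterior prod_i p(Y_i|X_i, w) * phi(w), handled in log space
log_post = []
for w in ws:
    lp = math.log(norm_pdf(w))
    for x, y in data:
        lp += math.log(norm_pdf(y - w * x))
    log_post.append(lp)
m = max(log_post)
Z = sum(math.exp(lp - m) for lp in log_post) * dw  # normalizing constant (up to e^m)
post = [math.exp(lp - m) / Z for lp in log_post]   # normalized posterior on the grid

# Predictive density at a test point (x0, y0): int p(y0|x0, w) p(w|data) dw
x0, y0 = 1.0, 0.5
pred = sum(norm_pdf(y0 - w * x0) * p for w, p in zip(ws, post)) * dw
print(round(sum(p * dw for p in post), 3))  # 1.0: the posterior integrates to one
```

The same two-step structure, normalize by Z and then average p(y|x, w) over the posterior, is what the MCMC approximation in Section 3 estimates by sampling.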
When the number of samples is sufficiently large (n \to \infty), the posterior concentrates at the true parameter(s). The posterior of a regular model is asymptotically a Gaussian distribution whose mean converges to the parameter w^*. On the other hand, the posterior of a non-regular model is not Gaussian because of W_t (cf. the right panel of Fig. 10 in Section 3).

We evaluate the generalization error by the average Kullback-Leibler divergence from the true distribution to the predictive distribution:

G_0(n) = E^0_{X^n, Y^n} \left[ \int r(y|x) q_0(x) \log \frac{r(y|x)}{p(y|x, X^n, Y^n)} \, dx \, dy \right].

In the standard statistical manner, the generalization error is formally calculated by integrating the predictive distribution. This integration is viable thanks to the Gaussian posterior; therefore, the method is applicable only to regular models. The following is one of the solutions for non-regular cases.

The stochastic complexity [9] is defined by

F(X^n, Y^n) = - \log Z(X^n, Y^n),   (2)

which can be used for selecting an appropriate model or hyperparameters. To analyze its behavior, the following functions play important roles:

U_0(n) = E^0_{X^n, Y^n} \left[ F^0(X^n, Y^n) \right],   (3)

where E^0_{X^n, Y^n}[\cdot] stands for the expectation over r(y|x) q_0(x), and

F^0(X^n, Y^n) = F(X^n, Y^n) + \sum_{i=1}^{n} \log r(Y_i|X_i).

The generalization error and the stochastic complexity are linked by the following equation [10]:

G_0(n) = U_0(n + 1) - U_0(n).   (4)

When the learning machine p(y|x, w) can attain the true distribution r(y|x), the asymptotic expansion of U_0(n) is given as follows [8]:

U_0(n) = \alpha \log n - (\beta - 1) \log \log n + O(1).   (5)
The coefficients \alpha and \beta are determined by integral transforms of U_0(n). More precisely, the rational number -\alpha and the natural number \beta are the largest pole and its order of

J(z) = \int H_0(w)^z \varphi(w) \, dw,

where

H_0(w) = \int r(y|x) q_0(x) \log \frac{r(y|x)}{p(y|x, w)} \, dx \, dy.   (6)
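As an illustration of how \alpha and \beta arise (a sketch we add under the stated assumptions, not a derivation taken from the paper), consider the non-regular line model of Section 3 with mean function f_2(x, a, b) = abx, true function g(x) = 0, q_0(x) the standard normal density, and a prior that factorizes into standard normals \varphi(a)\varphi(b):

```latex
H_0(a, b) = \frac{(ab)^2}{2} \, \mathbb{E}_{q_0}[x^2] = \frac{(ab)^2}{2},
\qquad
J(z) = 2^{-z} \int |a|^{2z} \varphi(a) \, da \int |b|^{2z} \varphi(b) \, db .
```

Each one-dimensional factor equals 2^z \Gamma(z + 1/2) / \sqrt{\pi} and thus has its largest pole at z = -1/2 of order 1, so J(z) has its largest pole at z = -1/2 with order 2. Hence \alpha = 1/2 and \beta = 2, and Eq. (5) gives U_0(n) = (1/2) \log n - \log \log n + O(1) for this model.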
J(z) is obtained by applying the inverse Laplace and the Mellin transformations to \exp[-U_0(n)]. Combining Eqs. (5) and (4) immediately gives

G_0(n) = \frac{\alpha}{n} - \frac{\beta - 1}{n \log n} + o\!\left( \frac{1}{n \log n} \right),

when G_0(n) has an asymptotic form. The coefficients \alpha and \beta indicate the speed of convergence of the generalization error when the number of training samples is sufficiently large.

When the learning machine cannot attain the true distribution (i.e., the model is misspecified), the stochastic complexity has an upper bound of the following asymptotic form [11]:

U_0(n) \le nC + \alpha \log n - (\beta - 1) \log \log n + O(1),   (7)

where C is a non-negative constant. When the generalization error has an asymptotic form, combining Eqs. (7) and (4) gives

G_0(n) \le C + \frac{\alpha}{n} - \frac{\beta - 1}{n \log n} + o\!\left( \frac{1}{n \log n} \right),   (8)

where C is the bias.

2.3 Properties of the Generalization Error with Covariate Shift

Now, we introduce the results in [1]. Since the test data are distributed according to r(y|x) q_1(x), the generalization error with the shift is defined by

G_1(n) = E^0_{X^n, Y^n} \left[ \int r(y|x) q_1(x) \log \frac{r(y|x)}{p(y|x, X^n, Y^n)} \, dx \, dy \right].   (9)

We need a function similar to (3), given by

U_1(n) = E^1_{X_n, Y_n} E_{X^{n-1}, Y^{n-1}} \left[ F^0(X^n, Y^n) \right],   (10)

where the n-th sample (X_n, Y_n) is taken from r(y|x) q_1(x). Then, the variant of (4) is obtained as

G_1(n) = U_1(n + 1) - U_0(n).   (11)

When we assume that G_1(n) has an asymptotic expansion and converges to a constant, and that U_i(n) has the asymptotic expansion

U_i(n) = \underbrace{a_i n + b_i \log n + \cdots}_{T^H_i(n)} + \underbrace{c_i + \frac{d_i}{n} + o\!\left( \frac{1}{n} \right)}_{T^L_i(n)},

it holds that G_0(n) and G_1(n) are expressed by

G_0(n) = a_0 + \frac{b_0}{n} + o\!\left( \frac{1}{n} \right),

G_1(n) = a_0 + (c_1 - c_0) + \frac{b_0 + (d_1 - d_0)}{n} + o\!\left( \frac{1}{n} \right),   (12)
and that T^H_1(n) = T^H_0(n). Note that b_0 = \alpha. The factors c_1 - c_0 and d_1 - d_0 determine the difference of the errors. We have also obtained that the generalization error G_1(n) has the upper bound

G_1(n) \le M G_0(n),   (13)

if the following condition is satisfied:

M \equiv \max_{x \sim q_0(x)} \frac{q_1(x)}{q_0(x)} < \infty.   (14)

3 Experimental Generalization Errors in Some Toy Models

Even though we know the factors causing the difference between G_1(n) and G_0(n) according to Eq. (12), it is not straightforward to calculate the lower-order terms in T^L_i(n). More precisely, only in a restricted class of solvable models can we find the constant and decreasing factors c_i, d_i in the increasing function U_i(n). This implies that revealing the analytic expression of G_1(n) is still an open issue for non-regular models. Here, we calculate G_1(n) experimentally and observe its behavior. A non-regular model requires sampling from a non-Gaussian posterior in the Bayesian inference. We use the Markov chain Monte Carlo (MCMC) method to execute this task [12].

In the following examples, we use common notations: the true distribution is defined by

r(y|x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(y - g(x))^2}{2} \right),
[Figs. 1-9: The training and test input distributions, with (\mu_1, \sigma_1) = (0, 1), (0, 0.5), (0, 2), (2, 1), (2, 0.5), (2, 2), (10, 1), (10, 0.5), and (10, 2), respectively.]

The learning model is given by

p(y|x, w) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(y - f(x, w))^2}{2} \right),

the prior \varphi(w) is a standard normal distribution, and the input distributions are of the form

q_i(x) = \frac{1}{\sqrt{2\pi} \sigma_i} \exp\left( -\frac{(x - \mu_i)^2}{2 \sigma_i^2} \right) \quad (i = 0, 1).

The training input distribution has (\mu_0, \sigma_0) = (0, 1), and there are nine test distributions, each having a mean and variance chosen as a combination of \mu_1 = \{0, 2, 10\} and \sigma_1 = \{1, 0.5, 2\} (cf. Figs. 1-9). Note that the case in Fig. 1 corresponds to q_0(x).

As for the experimental setting, the number of training samples is n, the number of test samples is n_{test}, the number of parameter samples drawn from the posterior with the MCMC method is n_p, and the number of sample sets used to take the expectation E_{X^n, Y^n}[\cdot] is n_D. In the mathematical expressions,

p(y|x, X^n, Y^n) \approx \frac{ \frac{1}{n_p} \sum_{j=1}^{n_p} p(y|x, w_j) \prod_{i=1}^{n} p(Y_i|X_i, w_j) \varphi(w_j) }{ \frac{1}{n_p} \sum_{k=1}^{n_p} \prod_{i=1}^{n} p(Y_i|X_i, w_k) \varphi(w_k) },

G_1(n) \approx \frac{1}{n_D} \sum_{i=1}^{n_D} \frac{1}{n_{test}} \sum_{j=1}^{n_{test}} \log \frac{ r(y_j|x_j) }{ p(y_j|x_j, D_i) },
where D_i = \{X_{i1}, Y_{i1}, \ldots, X_{in}, Y_{in}\} stands for the i-th set of training data; D_i and the test points (x_j, y_j) in G_1(n) are taken from q_0(x) r(y|x) and q_1(x) r(y|x), respectively. The experimental parameters were as follows: n = 500, n_{test} = 1000, n_p = 10000, n_D = 100.

Fig. 10. The sampling from the posteriors. The left-upper panel shows the histogram of a for the first model. The left-middle one is the histogram of a^3 for the third model. The left-lower one is the point diagram of (a, b) for the second model, and the right one is its histogram.
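The procedure above can be sketched in a few lines of Python. This is a scaled-down illustration (ours, not the authors' code): a random-walk Metropolis sampler draws parameters from the posterior of the line model f_1(x, a) = ax, and the generalization error is estimated by Monte Carlo averages. Since the w_j here come from the posterior itself, the predictive density reduces to a plain average of p(y|x, w_j); the sizes n, n_p, n_test (and n_D = 1) are much smaller than in the paper.

```python
import math
import random

random.seed(2)

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def log_posterior(a, data):
    """Unnormalized log posterior for f_1(x, a) = a*x with prior N(0, 1)."""
    lp = -0.5 * a * a  # log prior, up to a constant
    for x, y in data:
        lp += -0.5 * (y - a * x) ** 2
    return lp

# Training data: x ~ q_0 = N(0,1), y ~ r(y|x) = N(0,1) (true g(x) = 0)
n = 50
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# Random-walk Metropolis sampling from the posterior
n_p, samples, a = 2000, [], 0.0
lp = log_posterior(a, data)
for _ in range(n_p):
    prop = a + random.gauss(0, 0.2)
    lp_prop = log_posterior(prop, data)
    if math.log(random.random()) < lp_prop - lp:
        a, lp = prop, lp_prop
    samples.append(a)

# Estimate G_1(n) for one training set (n_D = 1) and q_1 = N(mu1, sigma1^2)
mu1, sigma1 = 0.0, 0.5
n_test = 400
G1 = 0.0
for _ in range(n_test):
    xt = random.gauss(mu1, sigma1)
    yt = random.gauss(0, 1)
    # Predictive density: average of p(y|x, w_j) over the posterior samples
    pred = sum(norm_pdf(yt - aj * xt) for aj in samples) / n_p
    G1 += math.log(norm_pdf(yt) / pred)
G1 /= n_test
print(round(G1, 4))  # small for this regular model (theory: ~ R/(2n))
```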
Example 1 (Lines with Various Parameterizations).

g(x) = 0, \quad f_1(x, a) = ax, \quad f_2(x, a, b) = abx, \quad f_3(x, a) = a^3 x,

where the true function is the zero function and the learning functions are lines with gradient a, ab and a^3, respectively. In this example, all learning functions belong to the same function class, though the second model is non-regular (W_t = \{a = 0\} \cup \{b = 0\}) and has a non-Gaussian posterior. The gradient parameters are taken from the posteriors depicted in Fig. 10.

Table 1 summarizes the results. The first row indicates the pairs (\mu_1, \sigma_1), and the remaining rows show the experimental average generalization errors. G_1[f_i] stands for the error of the model with f_i. M G_0[f_i] is the upper bound in each case according to Eq. (13). Note that there are some blanks in this row, because the condition of Eq. (14) is not always satisfied. To compare G_1[f_3] with G_1[f_1], the last row shows the values 3 \times G_1[f_3] for each setting.

Since the first model is regular, it has theoretical results:

G_0(n) = \frac{1}{2n} + o\!\left( \frac{1}{n \log n} \right), \quad G_1(n) = \frac{R}{2n} + o\!\left( \frac{1}{n \log n} \right), \quad R = \frac{\mu_1^2 + \sigma_1^2}{\mu_0^2 + \sigma_0^2}.

The row 'th G_1[f_1]' in Table 1 shows this theoretical result.
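To connect the theory with the nine test settings, the regular-case constants can be computed directly. The sketch below (ours, not from the paper) evaluates R = (\mu_1^2 + \sigma_1^2)/(\mu_0^2 + \sigma_0^2) for each (\mu_1, \sigma_1) and checks when the bound constant M of Eq. (14) is finite: for Gaussian inputs with \sigma_0 = 1, the density ratio q_1/q_0 stays bounded only when \sigma_1 < \sigma_0 (or when the two distributions coincide), which is why some entries of the M G_0[f_i] row are blank.

```python
import math

mu0, sigma0 = 0.0, 1.0
settings = [(0, 1), (0, 0.5), (0, 2), (2, 1), (2, 0.5), (2, 2),
            (10, 1), (10, 0.5), (10, 2)]

def gauss(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

for mu1, sigma1 in settings:
    R = (mu1 ** 2 + sigma1 ** 2) / (mu0 ** 2 + sigma0 ** 2)
    if sigma1 < sigma0:
        # For mu0 = 0, q_1/q_0 is maximized at x* = mu1*sigma0^2/(sigma0^2 - sigma1^2)
        xs = mu1 * sigma0 ** 2 / (sigma0 ** 2 - sigma1 ** 2)
        M = gauss(xs, mu1, sigma1) / gauss(xs, mu0, sigma0)
    elif sigma1 == sigma0 and mu1 == mu0:
        M = 1.0  # identical input distributions
    else:
        M = math.inf  # ratio unbounded: condition (14) fails, no bound (13)
    print((mu1, sigma1), round(R, 2), M if M == math.inf else round(M, 3))
```

For instance, the setting (\mu_1, \sigma_1) = (0, 1) gives R = 1 and M = 1, recovering the no-shift case, while every setting with \sigma_1 \geq \sigma_0 and a shifted mean yields M = \infty.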