Optimization of Parametric Companding Function

S. Maeda and S. Ishii
Fig. 1. Architecture of CVQ

First, an $n$-dimensional original datum $x = [x_1, \cdots, x_n]^T \in \mathbb{R}^n$ is transformed into an $m$-dimensional feature vector $y = [y_1, \cdots, y_m]^T \in [0,1]^m$ as $y = \psi(x)$ by the compressor $\psi$, where $T$ denotes a transpose. Then, the quantizer $\Gamma$ quantizes the feature vector $y$ into $\hat{y} = [\hat{y}_1, \cdots, \hat{y}_m]^T = \Gamma(y)$. The quantizer $\Gamma$ consists of $m$ uniform scalar quantizers, $\Gamma_1, \cdots, \Gamma_m$, each of which individually quantizes the corresponding element of $y$ as $\hat{y}_i = \Gamma_i(y_i)$ with quantization level $a_i$, as shown in Fig. 2. The set of quantization levels is represented by the quantization vector $a = [a_1, \cdots, a_m]^T$. Finally, the expander $\phi$ transforms the quantized feature vector $\hat{y}$ into the reproduction vector $\hat{x} = [\hat{x}_1, \cdots, \hat{x}_n]^T \in \{\mu_k \in \mathbb{R}^n \mid k = 1, \cdots, M\}$. The number of reproduction vectors $M$ is related to the quantization vector $a$ by $M = \prod_{i=1}^{m} a_i$. The reproduction vector $\hat{x}$ is hence obtained by the expander $\phi$ as $\hat{x} = \phi(\hat{y})$. The compressor and expander are parameterized as $\psi(x; \theta_\alpha)$ and $\phi(\hat{y}; \theta_\beta)$, respectively, and are assumed to be differentiable with respect to their parameters $\theta_\alpha$ and $\theta_\beta$ and their inputs $x$ and $\hat{y}$. A pair of a compressor and an expander is called a compander. As a whole, the reproduction vector $\hat{x}$ is calculated as $\hat{x} := \rho(x) = \phi(\Gamma(\psi(x)))$. We also write $\rho(x; a, \theta_\alpha, \theta_\beta)$ for $\rho(x)$ when we emphasize the dependence of the coder $\rho(x)$ on the parameters $\theta_\alpha$ and $\theta_\beta$ and the quantization vector $a$.
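To make the data flow concrete, the following is a minimal Python sketch of the coder $\rho(x) = \phi(\Gamma(\psi(x)))$. The midpoint rule inside the quantizer is one plausible reading of Fig. 2, and the sigmoid/logit compander is an illustrative assumption, not the paper's parameterization.

```python
import numpy as np

def uniform_quantize(y, a):
    """Midpoint uniform scalar quantizer on [0, 1] with a[i] levels per
    dimension (one plausible reading of Fig. 2: cell [(l-1)/a_i, l/a_i],
    reproduced at its midpoint)."""
    y = np.clip(y, 0.0, 1.0 - 1e-12)       # keep the top edge in the last cell
    return (np.floor(y * a) + 0.5) / a

def cvq_encode_decode(x, psi, phi, a):
    """Whole coder rho(x) = phi(Gamma(psi(x)))."""
    y = psi(x)                             # compressor: R^n -> [0, 1]^m
    y_hat = uniform_quantize(y, a)         # m independent scalar quantizers
    return phi(y_hat)                      # expander: [0, 1]^m -> R^n

# Toy usage with a hypothetical sigmoid/logit compander (not the paper's):
n, m = 2, 2
a = np.array([4, 4])                       # M = 4 * 4 = 16 reproduction vectors
psi = lambda x: 1.0 / (1.0 + np.exp(-x))   # squashes R^n into (0, 1)^m
phi = lambda y: np.log(y / (1.0 - y))      # its inverse as the expander
x = np.array([0.3, -1.2])
print(cvq_encode_decode(x, psi, phi, a))
```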
2.2 Optimization

Since we consider fixed-rate coding, the optimization problem is given by
\[
\min_{\theta_\alpha, \theta_\beta, a} \mathrm{E}[d(x, \hat{x})]
= \min_{\theta_\alpha, \theta_\beta, a} \int p(x)\, d\bigl(x, \rho(x; \theta_\alpha, \theta_\beta, a)\bigr)\, dx
\quad \text{s.t. } M \le \bar{M},\ \forall i \in \{1, \cdots, m\},\ a_i \in \mathbb{N},
\tag{1}
\]
Fig. 2. Quantizer $\hat{y}_i = \Gamma_i(y_i)$ (staircase); the dotted line shows the identity $\hat{y}_i = y_i$

where $\bar{M}$ is the maximum permissible number of reproduction vectors, $\mathrm{E}[\cdot]$ denotes an expectation with respect to the original data distribution (source) $p(x)$, and $d(x, \hat{x})$ denotes a distortion between an original datum and a reproduction datum. Source $p(x)$ is usually unknown in practice; provided that we have $N$ independent and identically distributed samples $\{x^{(1)}, \cdots, x^{(N)}\}$ from $p(x)$, the expected distortion in equation (1) is therefore approximated by the sample mean:
\[
\mathrm{E}\bigl[d(x, \rho(x; \theta_\alpha, \theta_\beta, a))\bigr]
\approx D_N(\theta_\alpha, \theta_\beta, a)
\equiv \frac{1}{N} \sum_{k=1}^{N} d\bigl(x^{(k)}, \rho(x^{(k)}; \theta_\alpha, \theta_\beta, a)\bigr).
\tag{2}
\]
Since the expected distortion can be estimated accurately from enough samples, we minimize the average distortion (2). This optimization consists of two parts: one is optimization of the companding parameters $\theta_\alpha$ and $\theta_\beta$, and the other is optimization of the quantization vector $a$, i.e., bit allocation.
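For concreteness, the sample-mean distortion $D_N$ of equation (2) can be computed as below, reusing the `cvq_encode_decode` sketch above; the MSE distortion anticipates the special case used later in the paper.

```python
import numpy as np

def d_mse(x, x_hat):
    """MSE distortion d(x, x_hat) = (1/n) * sum_i (x_i - x_hat_i)^2."""
    return np.mean((x - x_hat) ** 2)

def average_distortion(X, psi, phi, a):
    """Sample-mean estimate D_N of the expected distortion, equation (2).
    X has shape (N, n): N i.i.d. samples from the source p(x)."""
    return np.mean([d_mse(x, cvq_encode_decode(x, psi, phi, a)) for x in X])
```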
First, we explain the optimization of the companding parameters $\theta_\alpha$ and $\theta_\beta$. Here, we present an iterative optimization as follows.

1. $\theta_\beta$ is optimized as
\[
\min_{\theta_\beta} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\hat{y}^{(i)}; \theta_\beta)\bigr),
\tag{3}
\]
where $\hat{y}^{(i)} = \Gamma(\psi(x^{(i)}))$, while $\theta_\alpha$ is fixed.

2. $\theta_\alpha$ is optimized as
\[
\min_{\theta_\alpha} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\hat{y}^{(i)})\bigr),
\tag{4}
\]
where $\hat{y}^{(i)} = \Gamma(\psi(x^{(i)}; \theta_\alpha))$, while $\theta_\beta$ is fixed.
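A minimal sketch of this alternating scheme, assuming the compander functions take flat NumPy parameter vectors explicitly and using `scipy.optimize.minimize` as a stand-in for the gradient-based optimizers discussed below; step 2 already uses the smooth approximation (5) introduced in the text that follows.

```python
import numpy as np
from scipy.optimize import minimize

def alternate_optimize(X, psi, phi, a, theta_a, theta_b, n_sweeps=10):
    """Alternate between (3) and (4) for MSE distortion. psi(X, theta_a) and
    phi(Y, theta_b) are assumed vectorized, differentiable companding maps."""
    for _ in range(n_sweeps):
        # Step 1, eq. (3): fix theta_a, refit the expander on quantized features.
        Y_hat = uniform_quantize(psi(X, theta_a), a)
        res = minimize(lambda tb: np.mean((X - phi(Y_hat, tb)) ** 2), theta_b)
        theta_b = res.x
        # Step 2, eq. (4): fix theta_b, refit the compressor. Gamma is replaced
        # by the identity approximation (5) so the objective is differentiable.
        res = minimize(lambda ta: np.mean((X - phi(psi(X, ta), theta_b)) ** 2),
                       theta_a)
        theta_a = res.x
    return theta_a, theta_b
```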
The two optimization steps above correspond to the well-recognized conditions that an optimal VQ should satisfy, the centroid condition and the nearest-neighbor condition: parameters $\theta_\beta$ and $\theta_\alpha$ determine the reproduction vectors $\mu^{(k)} = \phi(\nu^{(k)}; \theta_\beta)$ and the partition cells $S^{(k)} = \{x \mid \nu^{(k)} = \Gamma(\psi(x; \theta_\alpha))\}$, respectively, and the centroid and nearest-neighbor conditions are the conditions that the optimal reproduction vectors and the optimal partition should satisfy, respectively.

Optimization for $\theta_\beta$ in equation (3) is performed by conventional gradient-based optimization methods such as steepest descent or Newton's method. The cost function in equation (4), on the other hand, is not differentiable due to the discontinuous function $\Gamma$; we therefore replace the discontinuous function $\Gamma$ with a continuous one, e.g.,
\[
\Gamma_i(y_i) \approx y_i.
\tag{5}
\]
The dotted line in Fig. 2 depicts this approximation, which becomes exact when the quantization level $a_i$ becomes sufficiently high. With this approximation, the optimization problem given by equation (4) becomes
\[
\min_{\theta_\alpha} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\psi(x^{(i)}; \theta_\alpha))\bigr),
\tag{6}
\]
which can be solved by a gradient-based method. In particular, if the expander $\phi$ is invertible, optimality is always attained when $\hat{\theta}_\alpha$ forces the compressor $\psi$ to be the inverse of the expander $\phi$, $\psi = \phi^{-1}$, because the distortion takes its minimum value $d(x, \phi(\psi(x; \theta_\alpha))) = d(x, x) = 0$ only under that condition. In this case, we need no parameter search for $\theta_\alpha$ and hence can avoid the local minima problem. Although this simple optimization scheme works well as expected, we have found that it sometimes fails to minimize the cost function (4). To improve the optimization for $\theta_\alpha$, we employ a line search along the estimated steepest gradient.
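The paper does not detail the gradient estimator; the following is a minimal sketch, assuming a central-difference gradient and a simple backtracking rule.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def line_search_step(f, theta, shrink=0.5, t0=1.0, max_halvings=20):
    """One steepest-descent step with backtracking along -grad f."""
    g = numerical_grad(f, theta)
    t, f0 = t0, f(theta)
    for _ in range(max_halvings):
        if f(theta - t * g) < f0:
            return theta - t * g
        t *= shrink
    return theta  # no improving step found
```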
Next, we present our bit allocation scheme. Here, we consider the special case in which the distortion measure is the mean squared error (MSE), $d(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$, and construct an efficient search method for $a$ based on an estimated distortion:
\[
\mathrm{E}[d(x, \hat{x})] \approx \frac{1}{n} \sum_{j=1}^{m} \frac{1}{12 a_j^2}\, \mathrm{E}[G_j(y)],
\tag{7}
\]
where
\[
G_j(y) = \sum_{i=1}^{n} \left( \frac{\partial \phi_i(y)}{\partial y_j} \right)^2 .
\]
The derivation of equation (7) is based on a Taylor expansion but is omitted due to the space limitation. This estimate becomes accurate if the companded source pdf $p(y)$ is well approximated by a uniform distribution and the expander $\phi(y)$ is well approximated by a linear function within each small $m$-dimensional hyper-rectangle $C^{(k)} \equiv L_1^{(k)} \times \cdots \times L_m^{(k)}$ that includes $\nu^{(k)}$, where the interval in each dimension is given by $L_i^{(k)} = \bigl[ \frac{l_i^{(k)} - 1}{a_i}, \frac{l_i^{(k)}}{a_i} \bigr]$ with $l_i^{(k)} \in \{1, \cdots, a_i\}$ for each $i = 1, \cdots, m$. These conditions are both satisfied in the asymptotic high-rate case. Note that the expectation with respect to the feature vector $y$ in equation (7) can in practice be replaced with a sample average. Since equation (7) is identical to Bennett's original integral [3] when the dimensionalities of the original datum $x$ and the feature vector $y$ are both one (i.e., the scalar companding case), equation (7) is a multidimensional version of Bennett's integral. The estimated distortion is useful for improving the quantization vector $a$ in our CVQ, as shown below. Using the estimated distortion (7), the optimization problem for a non-negative real-valued $a$ is given by
\[
\min_{a} \sum_{i=1}^{m} \frac{\mathrm{E}[G_i(y)]}{a_i^2}
\quad \text{s.t. } \sum_{i=1}^{m} \log a_i \le \log \bar{M},\ \forall i \in \{1, \cdots, m\},\ a_i > 0,\ a_i \in \mathbb{R}.
\tag{8}
\]
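Equation (8) has a closed-form minimizer by Lagrange multipliers ($a_i^2 \propto \mathrm{E}[G_i(y)]$). Below is a sketch that estimates $\mathrm{E}[G_j(y)]$ by the sample average suggested above and evaluates the closed form; the finite-difference Jacobian is an assumption for illustration.

```python
import numpy as np

def estimate_G(phi, Y, eps=1e-5):
    """Sample-average estimate of E[G_j(y)], j = 1..m, where
    G_j(y) = sum_i (d phi_i / d y_j)^2, via central differences."""
    N, m = Y.shape
    G = np.zeros(m)
    for y in Y:
        for j in range(m):
            e = np.zeros(m)
            e[j] = eps
            dphi = (phi(y + e) - phi(y - e)) / (2 * eps)  # d phi / d y_j
            G[j] += np.sum(dphi ** 2)
    return G / N

def allocate_real(G, M_bar):
    """Real-valued minimizer of (8): a_i = M_bar**(1/m) * sqrt(G_i / C),
    with C the geometric mean of G, so that prod(a) = M_bar."""
    m = G.size
    C = np.exp(np.mean(np.log(G)))          # geometric mean of E[G_j(y)]
    return M_bar ** (1.0 / m) * np.sqrt(G / C)

# First stage of the two-stage scheme described next: round to integer levels.
# a0 = np.maximum(1, np.round(allocate_real(G, M_bar))).astype(int)
```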
If we allow a non-negative real-valued solution, we can obtain it by conventional optimization methods, as done by Segall [4]. However, the non-negative real-valued solution does not guarantee each level $a_i$ to be a natural number. We therefore employ a two-stage optimization scheme. The first stage obtains a non-negative real-valued solution and rounds it to the nearest integer solution; this stage allows a global search for the quantization vector but provides only a rough solution to the original cost function given in equation (1). The second stage then optimizes over integer solutions around the first-stage solution, finely but locally. The second stage utilizes the rate-distortion cost function
\[
\min_{a} J(a)
= \min_{a} \left( \sum_{i=1}^{m} \log_2 a_i + \gamma\, \mathrm{E}[d(x, \hat{x})] \right)
\approx \min_{a} \sum_{i=1}^{m} \left( \log_2 a_i + \frac{\gamma}{12 n a_i^2}\, \mathrm{E}[G_i(y)] \right),
\quad \text{s.t. } \forall i \in \{1, \cdots, m\},\ a_i \in \mathbb{N},
\tag{9}
\]
where the parameter $\gamma$ balances the rate and the distortion. $\gamma$ is set so that the real-valued solution of the cost function (9) corresponds to the solution of (8), as $\gamma = \frac{\bar{M}^{2/m}}{2C \log 2}$, where $C = \bigl( \prod_{j=1}^{m} \mathrm{E}[G_j(y)] \bigr)^{1/m}$. We choose the quantization vector that minimizes the rate-distortion cost function (9) among the $2m$ quantization vectors, each of which sets one component $a_i$ ($i = 1, \cdots, m$) one level larger or smaller than in the first-stage solution $a$. After re-optimizing the companding parameters to suit the new quantization vector $a$, we check whether the new quantization vector and the re-optimized companding parameters $\{a, \theta_\alpha, \theta_\beta\}$ actually decrease the average distortion (2). The optimization procedure in the case of the MSE distortion measure is summarized in the panel below.
Summary of optimization procedure

1. Initialize $\hat{D}_N$ (to a sufficiently large value) so that the parameter update at step 4 is done at least once, and set $\{\hat{\theta}_\alpha, \hat{\theta}_\beta, \hat{a}\}$ as an initial guess.
2. Select a trial quantization vector $a$ by the rough global search. Initialize the companding parameters $\{\theta_\alpha, \theta_\beta\}$, so that the trial parameter set is $\{\theta_\alpha, \theta_\beta, a\}$.
3. Minimize $D_N(\theta_\alpha, \theta_\beta, a)$ with respect to the parameters $\{\theta_\alpha, \theta_\beta\}$.
4. If $D_N(\theta_\alpha, \theta_\beta, a) < \hat{D}_N$, then replace $\hat{D}_N$ and $\{\hat{\theta}_\alpha, \hat{\theta}_\beta, \hat{a}\}$ by $D_N(\theta_\alpha, \theta_\beta, a)$ and $\{\theta_\alpha, \theta_\beta, a\}$, respectively, and go to step 2. Otherwise, go to step 5.
5. Improve the trial quantization vector $a$ by the fine local search.
6. Minimize $D_N(\theta_\alpha, \theta_\beta, a)$ with respect to the parameters $\{\theta_\alpha, \theta_\beta\}$.
7. If $D_N(\theta_\alpha, \theta_\beta, a) < \hat{D}_N$, then replace $\hat{D}_N$ and $\{\hat{\theta}_\alpha, \hat{\theta}_\beta, \hat{a}\}$ by $D_N(\theta_\alpha, \theta_\beta, a)$ and $\{\theta_\alpha, \theta_\beta, a\}$, respectively, and go to step 5. Otherwise, terminate the algorithm.
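As an illustration of the fine local search used in steps 5 to 7, here is a sketch under the MSE/high-rate approximation (9); `G` is the vector of estimated $\mathrm{E}[G_i(y)]$ values and `gamma` the balance parameter from the text.

```python
import numpy as np

def rd_cost(a, G, gamma, n):
    """Approximate rate-distortion cost of equation (9)."""
    return np.sum(np.log2(a) + gamma * G / (12.0 * n * a ** 2))

def fine_local_search(a, G, gamma, n):
    """Among the 2m one-step neighbors of a (each component one level up or
    down), return the integer vector with the smallest cost (9)."""
    best, best_cost = a.copy(), rd_cost(a, G, gamma, n)
    for i in range(a.size):
        for step in (-1, +1):
            trial = a.copy()
            trial[i] += step
            if trial[i] < 1:      # levels must stay natural numbers
                continue
            c = rd_cost(trial, G, gamma, n)
            if c < best_cost:
                best, best_cost = trial, c
    return best
```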
3 Applications to Transform Coding

3.1 Architecture

Here, we apply the proposed learning method in particular to transform coding, which utilizes linear matrices and componentwise nonlinear functions for the companding functions: $\psi(x; \theta_\alpha) = f(Wx; \omega_\alpha)$ and $\phi(\hat{y}; \theta_\beta) = g(A\hat{y}; \omega_\beta)$, where $\theta_\alpha = \{W, \omega_\alpha\}$ and $\theta_\beta = \{A, \omega_\beta\}$. $W$ and $A$ are $m \times n$ and $n \times m$ matrices, respectively, and $f$ and $g$ are componentwise nonlinear functions with parameters $\omega_\alpha$ and $\omega_\beta$, respectively. We use a scaled sigmoid function for $f$, which expands a certain interval of a sigmoid function so that the range becomes $[0, 1]$, and $g$ is chosen as the inverse function of $f$:
\[
f_i(x_i; \omega_\alpha) = \frac{d_i(a_i^\alpha, b_i^\alpha)}{1 + \exp\bigl(-a_i^\alpha (x_i - b_i^\alpha)\bigr)} + c_i(a_i^\alpha, b_i^\alpha),
\]
\[
g_i(x_i; \omega_\beta) = -\frac{1}{a_i^\beta} \log \left( \frac{d_i(a_i^\beta, b_i^\beta)}{x_i - c_i(a_i^\beta, b_i^\beta)} - 1 \right) + b_i^\beta,
\]
where $c_i = -\frac{e^{a_i(1 - b_i)} + 1}{e^{a_i} - 1}$ and $d_i = -c_i \bigl(1 + e^{a_i b_i}\bigr)$ (so that $f_i(0) = 0$ and $f_i(1) = 1$), and $\omega_\alpha = \{a_i^\alpha, b_i^\alpha \mid i = 1, \cdots, m\}$ and $\omega_\beta = \{a_i^\beta, b_i^\beta \mid i = 1, \cdots, m\}$ are the parameters of $f$ and $g$, respectively. Note that when $a_i^\alpha = a_i^\beta$ and $b_i^\alpha = b_i^\beta$, each of the functions $f_i$ and $g_i$ is the inverse function of the other.

The above transform coding is optimized according to the method described in section 2.2.
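A small sketch of this compander pair; it follows the definitions above, with $c_i$ and $d_i$ chosen so that $f_i(0) = 0$ and $f_i(1) = 1$, and the inverse relation can be checked numerically.

```python
import numpy as np

def _cd(a, b):
    """Constants c, d that rescale the sigmoid so f(0) = 0 and f(1) = 1."""
    c = -(np.exp(a * (1 - b)) + 1) / (np.exp(a) - 1)
    d = -c * (1 + np.exp(a * b))
    return c, d

def f(x, a, b):
    """Scaled sigmoid f_i: componentwise, maps [0, 1] onto [0, 1]."""
    c, d = _cd(a, b)
    return d / (1 + np.exp(-a * (x - b))) + c

def g(x, a, b):
    """Inverse of f when the (a, b) parameters of f and g coincide."""
    c, d = _cd(a, b)
    return -np.log(d / (x - c) - 1) / a + b

x = np.linspace(0.05, 0.95, 5)
print(np.allclose(g(f(x, 6.0, 0.5), 6.0, 0.5), x))   # True
```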
However, the matrix $A$ can be obtained analytically, because $A$ is a solution of
\[
\min_{A} \sum_{j=1}^{n} \bigl( X_j - \hat{Z}^T A_j^T \bigr)^T \bigl( X_j - \hat{Z}^T A_j^T \bigr),
\]
where $A_j$ is the $j$-th row vector of the matrix $A$, $X_j$ is the vector consisting of the $j$-th components of the $N$ original data samples, and $\hat{Z}$ is the matrix whose $i$-th column is the transformed vector corresponding to the $i$-th original datum $x^{(i)}$. The least-squares solution of this cost function is given as a set of row vectors $[A_1, \cdots, A_n]$, each of which is obtained individually as $A_j^* = \hat{Z}^{-T} X_j$, where $\hat{Z}^{-T}$ is the pseudo-inverse of the matrix $\hat{Z}^T$.
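In NumPy, this row-wise least-squares update reduces to a single `lstsq` call; the variable shapes are assumptions for illustration (`Z_hat` is $m \times N$, `X` is $N \times n$ with sample $i$ in row $i$).

```python
import numpy as np

def solve_A(Z_hat, X):
    """Least-squares update of A: row j solves min_Aj ||X_j - Z_hat.T @ Aj||^2,
    i.e., A_j* = pinv(Z_hat.T) @ X_j. All n rows are solved in one call."""
    # lstsq on the (N, m) system with n right-hand sides returns an (m, n)
    # matrix whose transpose stacks the optimal rows A_j* into A (n x m).
    coeffs, *_ = np.linalg.lstsq(Z_hat.T, X, rcond=None)
    return coeffs.T
```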
3.2 Simulation Results

To examine the performance of our optimization for CVQ, we compared the transform coding trained by our method with KLT-based transform coding, which uses a scalar quantizer trained by the Lloyd-I algorithm [5] for each feature component. Optimization of our CVQ and of the trained KLT was performed on each of 30 data sets, each consisting of 10,000 samples. The performance of each coding scheme was evaluated as the average distortion over 500,000 samples independently drawn from the true source distribution. Hereafter, the average distortions over the 10,000 training samples and over the 500,000 independent samples are called the training error and the test error, respectively. In our CVQ, the matrix $A$ was initially set to a value randomly chosen from a uniform distribution and normalized so that the variance of each transformed component was 1. The initial matrix $W$ was determined as the inverse of $A$. The other companding parameters were set as $a_i^\alpha = a_i^\beta = 6$ and $b_i^\alpha = b_i^\beta = 0.5$, which are well suited to a signal whose distribution is bell-shaped with variance 1.

First, we examined the performance of our CVQ using two kinds of two-dimensional sources: a linear mixture of uniform sources and a Gaussian source. They have different characteristics as coding targets, because the former cannot be decomposed into independent signals by an orthogonal transformation, whereas the latter can. In the case of high bit-rate coding, KLT-based transform coding is optimal when there exists an orthogonal transformation that decomposes the source data into independent signals [6]. Therefore, the coding performance for a linear mixture of uniform sources, where the optimality of the KLT is not assured, is of great interest. Moreover, it is worth examining whether our CVQ can find a transformation superior to the KLT for low bit-rate encoding of a Gaussian source, because non-orthogonal matrices are permitted as the transformation matrix in our CVQ.

Figures 3(a) and (b) compare the arrangements of reproduction vectors obtained by our CVQ and by the trained KLT for two different linear mixtures of uniform sources; both have their density on gray-shaded parallelograms, slanted (a) 45° and (b) 60°. Each figure shows the result with the minimum training error among 30 runs of each method. Black circles indicate the arranged reproduction vectors, and the two axes indicate the two feature vectors. The upper panels show results of 2 bit-rate coding, and the lower ones 4 bit-rate coding. Each panel title gives the signal-to-noise ratio (SNR) on the test data set.

Fig. 3. Arrangements of reproduction vectors for linear mixtures of uniform sources, slanted (a) 45° and (b) 60° (upper panels: 2 bit-rate; lower panels: 4 bit-rate; panel titles give test SNRs for the trained KLT and CVQ)

As seen in this figure, our CVQ obtained feature vector axes that were almost independent of each other and efficiently arranged the reproduction vectors, whereas the trained KLT code failed to do so. The test performance of the trained KLT was degraded especially in high-rate coding, because reproduction vectors were placed in areas without original data. We also see that our CVQ modifies the feature vector axes as the bit-rate changes, even for the same source, as typically illustrated in Fig. 3(a). Thus, our CVQ adaptively seeks an efficient allocation of reproduction vectors, which is difficult to perform analytically. Results at various bit-rates indicated that the test error of our CVQ was likely the best especially when the bit-rate was high.

On the other hand, the KLT code is guaranteed to be the best transform coder for encoding a Gaussian source at high bit-rate. However, there remains a possibility that a non-orthogonal transform coding might outperform the KLT at low bit-rate by changing the transformation matrix according to the bit-rate.

Fig. 4. Encoding result of a Gaussian source at low bit-rate (1.6 bit-rate; best CVQ: SNR 6.327 dB, optimal KLT code: SNR 6.237 dB)

In fact, we found that the best transformation matrix in our CVQ was superior to the analytical KLT code at low bit-rate, as shown in Fig. 4. In this figure, we compare our trained CVQ with the optimal KLT code, which can be numerically calculated at low bit-rate; the SNR of our CVQ was estimated using 5,000,000 samples (in the figure, three times the estimated standard deviation is shown in brackets).

Finally, we examined the performance when the source dimension became large. A linear mixture of uniform distributions was used for the original source, because it prevents the high-dimensional vector quantization from being replaced by a set of scalar quantizations via an orthogonal transformation, whereas a Gaussian source does not prevent this.
We tested 1 and 2 bits per dimension, with dimensionality 2, 4, 8, or 16.

Fig. 5. Encoding results for a linear mixture of uniform sources with various dimensionalities, at (a) 1 bit and (b) 2 bits per dimension (abscissa: source dimension 2, 4, 8, 16; ordinate: SNR relative to the SNR of the best CVQ [dB])

In Figure 5, the abscissa denotes the dimension of the source, and the ordinate denotes the test error relative to the best CVQ (dB) among 30 runs. The test error of CVQ over the 30 runs is shown by a box-whisker plot, and the median test errors of CVQ, the analytical KLT, and the trained KLT are denoted by a solid line, a dashed line, and a dot-and-dash line, respectively. The number of poor CVQ results that fall outside the plotted range, if any, is shown in brackets on the corresponding x label. As seen from Fig. 5, the performance variance became large as the data dimensionality increased. We found that the initial setting of the parameters strongly affected the performance of our CVQ when the data dimensionality was large. However, the best trained CVQ (the best run among the 30 runs) was in all cases superior to both the trained and the analytical KLT, and this superiority was consistent regardless of the data dimensionality.

4 Discussion

The idea of companding has existed for a long time, and many theoretical analyses have been carried out. An optimal companding function for scalar quantization was derived by Bennett [3] and has been used to examine the ability of an optimal scalar quantizer. On the other hand, an optimal companding function for VQ has not been derived analytically except in very limited cases; moreover, it is known that the optimal CVQ does not correspond to the optimal VQ except in very limited cases [7][8][9]. These negative results show the difficulty of analytically deriving the optimal companding function. Yet, the optimization of CVQ is attractive, because CVQs constitute a wide class of practically useful codes and avoid an exponential increase in the coder's complexity, referred to as the 'curse of dimensionality.'

Through the numerical simulations, we noticed that theoretical analysis based on the high-rate assumption often deviates from the real situation: an analytically derived companding function based on high-rate analysis did not show very good performance even when the bit-rate was fairly high (e.g., 4 bit-rate). We surmise that such substantial disagreement stems from the high-rate assumption. Since high-rate coding implies that the objective function of lossy source coding consists solely of distortion, neglecting the coding length, the high-rate analysis may lead to a large disagreement when the coding length cannot be completely neglected. These findings suggest the importance of optimizing a practically useful code by learning from sample data. Fortunately, plenty of data are usually available in practical lossy source coding, and the computational cost of the optimization is not very large in comparison to various machine learning problems, because the code optimization is necessarily performed just once. Although we could show the potential ability of the CVQ, it should be noted that the initial parameters must be chosen carefully when training the parameterized CVQ, especially in the case of high dimensionality. To find good initial parameters, the idea presented by Hinton for training deep networks [10] may be useful.
Acknowledgement. This work was supported by KAKENHI 19700219.

References

1. Huang, J.J.Y., Schultheiss, P.M.: Block quantization of correlated Gaussian random variables. IEEE Trans. Comm. CS-11, 289–296 (1963)
2. Goyal, V.K.: Theoretical foundations of transform coding. IEEE Signal Processing Mag. 18(5), 9–21 (2001)
3. Bennett, W.R.: Spectra of quantized signals. Bell Syst. Tech. J. 27, 446–472 (1948)
4. Segall, A.: Bit allocation and encoding for vector sources. IEEE Trans. Inform. Theory 22(2), 162–169 (1976)
5. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28(2), 129–137 (1982)
6. Goyal, V.K., Zhuang, J., Vetterli, M.: Transform coding with backward adaptive updates. IEEE Trans. Inform. Theory 46(4), 1623–1633 (2000)
7. Gersho, A.: Asymptotically optimal block quantization. IEEE Trans. Inform. Theory 25, 373–380 (1979)
8. Bucklew, J.A.: Companding and random quantization in several dimensions. IEEE Trans. Inform. Theory 27(2), 207–211 (1981)
9. Bucklew, J.A.: A note on optimal multidimensional companders. IEEE Trans. Inform. Theory 29(2), 279 (1983)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)