Optimization of Parametric Companding Function

S. Maeda and S. Ishii
Fig. 1. Architecture of CVQ

First, an $n$-dimensional original datum $x = [x_1, \cdots, x_n]^T \in \mathbb{R}^n$ is transformed into an $m$-dimensional feature vector $y = [y_1, \cdots, y_m]^T \in [0,1]^m$ as $y = \psi(x)$ by the compressor $\psi$, where $T$ denotes a transpose. Then, the quantizer $\Gamma$ quantizes the feature vector $y$ into $\hat{y} = [\hat{y}_1, \cdots, \hat{y}_m]^T = \Gamma(y)$. The quantizer $\Gamma$ consists of $m$ uniform scalar quantizers, $\Gamma_1, \cdots, \Gamma_m$, each of which individually quantizes the corresponding element of $y$ as $\hat{y}_i = \Gamma_i(y_i)$ with quantization level $a_i$, as shown in Fig. 2. The set of quantization levels is represented by the quantization vector $a = [a_1, \cdots, a_m]^T$. Finally, the expander $\phi$ transforms the quantized feature vector $\hat{y}$ into the reproduction vector $\hat{x} = [\hat{x}_1, \cdots, \hat{x}_n]^T \in \{\mu_k \in \mathbb{R}^n \mid k = 1, \cdots, M\}$. The number of reproduction vectors $M$ is related to the quantization vector $a$ by $M = \prod_{i=1}^{m} a_i$. The reproduction vector $\hat{x}$ is hence obtained by the expander $\phi$ as $\hat{x} = \phi(\hat{y})$. The compressor and expander are parameterized as $\psi(x; \theta_\alpha)$ and $\phi(\hat{y}; \theta_\beta)$, respectively, and are assumed to be differentiable with respect to their parameters $\theta_\alpha$ and $\theta_\beta$ and their inputs $x$ and $\hat{y}$. A pair of a compressor and an expander is called a compander. As a whole, the reproduction vector $\hat{x}$ is calculated as $\hat{x} := \rho(x) = \phi(\Gamma(\psi(x)))$. We also write $\rho(x; a, \theta_\alpha, \theta_\beta)$ for $\rho(x)$ when we emphasize the dependence of the coder $\rho(x)$ on the parameters $\theta_\alpha$ and $\theta_\beta$ and the quantization vector $a$.
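To make the data flow concrete, the following is a minimal Python sketch of the coder $\rho(x) = \phi(\Gamma(\psi(x)))$. The midpoint rule inside the quantizer is one plausible reading of Fig. 2, and the sigmoid/logit compander is an illustrative assumption, not the paper's parameterization.

```python
import numpy as np

def uniform_quantize(y, a):
    """Midpoint uniform scalar quantizer on [0, 1] with a[i] levels per
    dimension (one plausible reading of Fig. 2: cell [(l-1)/a_i, l/a_i],
    reproduced at its midpoint)."""
    y = np.clip(y, 0.0, 1.0 - 1e-12)       # keep the top edge in the last cell
    return (np.floor(y * a) + 0.5) / a

def cvq_encode_decode(x, psi, phi, a):
    """Whole coder rho(x) = phi(Gamma(psi(x)))."""
    y = psi(x)                             # compressor: R^n -> [0, 1]^m
    y_hat = uniform_quantize(y, a)         # m independent scalar quantizers
    return phi(y_hat)                      # expander: [0, 1]^m -> R^n

# Toy usage with a hypothetical sigmoid/logit compander (not the paper's):
n, m = 2, 2
a = np.array([4, 4])                       # M = 4 * 4 = 16 reproduction vectors
psi = lambda x: 1.0 / (1.0 + np.exp(-x))   # squashes R^n into (0, 1)^m
phi = lambda y: np.log(y / (1.0 - y))      # its inverse as the expander
x = np.array([0.3, -1.2])
print(cvq_encode_decode(x, psi, phi, a))
```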
2.2 Optimization

Since we consider fixed-rate coding, the optimization problem is given by
\[
\min_{\theta_\alpha, \theta_\beta, a} \mathrm{E}[d(x, \hat{x})]
= \min_{\theta_\alpha, \theta_\beta, a} \int p(x)\, d\bigl(x, \rho(x; \theta_\alpha, \theta_\beta, a)\bigr)\, dx
\quad \text{s.t. } M \le \bar{M},\ \forall i \in \{1, \cdots, m\},\ a_i \in \mathbb{N},
\tag{1}
\]
Fig. 2. Quantizer $\hat{y}_i = \Gamma_i(y_i)$ (staircase); the dotted line shows the identity $\hat{y}_i = y_i$

where $\bar{M}$ is the maximum permissible number of reproduction vectors, $\mathrm{E}[\cdot]$ denotes an expectation with respect to the original data distribution (source) $p(x)$, and $d(x, \hat{x})$ denotes a distortion between an original datum and a reproduction datum. Source $p(x)$ is usually unknown in practice; provided that we have $N$ independent and identically distributed samples $\{x^{(1)}, \cdots, x^{(N)}\}$ from $p(x)$, the expected distortion in equation (1) is therefore approximated by the sample mean:
\[
\mathrm{E}\bigl[d(x, \rho(x; \theta_\alpha, \theta_\beta, a))\bigr]
\approx D_N(\theta_\alpha, \theta_\beta, a)
\equiv \frac{1}{N} \sum_{k=1}^{N} d\bigl(x^{(k)}, \rho(x^{(k)}; \theta_\alpha, \theta_\beta, a)\bigr).
\tag{2}
\]
Since the expected distortion can be estimated accurately from enough samples, we minimize the average distortion (2). This optimization consists of two parts: one is optimization of the companding parameters $\theta_\alpha$ and $\theta_\beta$, and the other is optimization of the quantization vector $a$, i.e., bit allocation.
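For concreteness, the sample-mean distortion $D_N$ of equation (2) can be computed as below, reusing the `cvq_encode_decode` sketch above; the MSE distortion anticipates the special case used later in the paper.

```python
import numpy as np

def d_mse(x, x_hat):
    """MSE distortion d(x, x_hat) = (1/n) * sum_i (x_i - x_hat_i)^2."""
    return np.mean((x - x_hat) ** 2)

def average_distortion(X, psi, phi, a):
    """Sample-mean estimate D_N of the expected distortion, equation (2).
    X has shape (N, n): N i.i.d. samples from the source p(x)."""
    return np.mean([d_mse(x, cvq_encode_decode(x, psi, phi, a)) for x in X])
```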
First, we explain the optimization of the companding parameters $\theta_\alpha$ and $\theta_\beta$. Here, we present an iterative optimization as follows.

1. $\theta_\beta$ is optimized as
\[
\min_{\theta_\beta} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\hat{y}^{(i)}; \theta_\beta)\bigr),
\tag{3}
\]
where $\hat{y}^{(i)} = \Gamma(\psi(x^{(i)}))$, while $\theta_\alpha$ is fixed.

2. $\theta_\alpha$ is optimized as
\[
\min_{\theta_\alpha} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\hat{y}^{(i)})\bigr),
\tag{4}
\]
where $\hat{y}^{(i)} = \Gamma(\psi(x^{(i)}; \theta_\alpha))$, while $\theta_\beta$ is fixed.
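A minimal sketch of this alternating scheme, assuming the compander functions take flat NumPy parameter vectors explicitly and using `scipy.optimize.minimize` as a stand-in for the gradient-based optimizers discussed below; step 2 already uses the smooth approximation (5) introduced in the text that follows.

```python
import numpy as np
from scipy.optimize import minimize

def alternate_optimize(X, psi, phi, a, theta_a, theta_b, n_sweeps=10):
    """Alternate between (3) and (4) for MSE distortion. psi(X, theta_a) and
    phi(Y, theta_b) are assumed vectorized, differentiable companding maps."""
    for _ in range(n_sweeps):
        # Step 1, eq. (3): fix theta_a, refit the expander on quantized features.
        Y_hat = uniform_quantize(psi(X, theta_a), a)
        res = minimize(lambda tb: np.mean((X - phi(Y_hat, tb)) ** 2), theta_b)
        theta_b = res.x
        # Step 2, eq. (4): fix theta_b, refit the compressor. Gamma is replaced
        # by the identity approximation (5) so the objective is differentiable.
        res = minimize(lambda ta: np.mean((X - phi(psi(X, ta), theta_b)) ** 2),
                       theta_a)
        theta_a = res.x
    return theta_a, theta_b
```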
The two optimization steps above correspond to the well-recognized conditions that an optimal VQ should satisfy, the centroid condition and the nearest-neighbor condition: parameters $\theta_\beta$ and $\theta_\alpha$ determine the reproduction vectors $\mu^{(k)} = \phi(\nu^{(k)}; \theta_\beta)$ and the partition cells $S^{(k)} = \{x \mid \nu^{(k)} = \Gamma(\psi(x; \theta_\alpha))\}$, respectively, and the centroid and nearest-neighbor conditions are the conditions that the optimal reproduction vectors and the optimal partition should satisfy, respectively.

Optimization for $\theta_\beta$ in equation (3) is performed by conventional gradient-based optimization methods such as steepest descent or Newton's method. The cost function in equation (4), on the other hand, is not differentiable due to the discontinuous function $\Gamma$; we therefore replace the discontinuous function $\Gamma$ with a continuous one, e.g.,
\[
\Gamma_i(y_i) \approx y_i.
\tag{5}
\]
The dotted line in Fig. 2 depicts this approximation, which becomes exact when the quantization level $a_i$ becomes sufficiently high. With this approximation, the optimization problem given by equation (4) becomes
\[
\min_{\theta_\alpha} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\psi(x^{(i)}; \theta_\alpha))\bigr),
\tag{6}
\]
which can be solved by a gradient-based method. In particular, if the expander $\phi$ is invertible, optimality is always attained when $\hat{\theta}_\alpha$ forces the compressor $\psi$ to be the inverse of the expander $\phi$, $\psi = \phi^{-1}$, because the distortion takes its minimum value $d(x, \phi(\psi(x; \theta_\alpha))) = d(x, x) = 0$ only under that condition. In this case, we need no parameter search for $\theta_\alpha$ and hence can avoid the local minima problem. Although this simple optimization scheme works well as expected, we have found that it sometimes fails to minimize the cost function (4). To improve the optimization for $\theta_\alpha$, we employ a line search along the estimated steepest gradient.
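The paper does not detail the gradient estimator; the following is a minimal sketch, assuming a central-difference gradient and a simple backtracking rule.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def line_search_step(f, theta, shrink=0.5, t0=1.0, max_halvings=20):
    """One steepest-descent step with backtracking along -grad f."""
    g = numerical_grad(f, theta)
    t, f0 = t0, f(theta)
    for _ in range(max_halvings):
        if f(theta - t * g) < f0:
            return theta - t * g
        t *= shrink
    return theta  # no improving step found
```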
Next, we present our bit allocation scheme. Here, we consider the special case in which the distortion measure is the mean squared error (MSE), $d(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$, and construct an efficient search method for $a$ based on an estimated distortion:
\[
\mathrm{E}[d(x, \hat{x})] \approx \frac{1}{n} \sum_{j=1}^{m} \frac{1}{12 a_j^2}\, \mathrm{E}[G_j(y)],
\tag{7}
\]
where
\[
G_j(y) = \sum_{i=1}^{n} \left( \frac{\partial \phi_i(y)}{\partial y_j} \right)^2 .
\]
The derivation of equation (7) is based on a Taylor expansion but is omitted due to the space limitation. This estimate becomes accurate if the companded source pdf $p(y)$ is well approximated by a uniform distribution and the expander $\phi(y)$ is well approximated by a linear function within each small $m$-dimensional hyper-rectangle $C^{(k)} \equiv L_1^{(k)} \times \cdots \times L_m^{(k)}$ that includes $\nu^{(k)}$, where the interval in each dimension is given by $L_i^{(k)} = \bigl[ \frac{l_i^{(k)} - 1}{a_i}, \frac{l_i^{(k)}}{a_i} \bigr]$ with $l_i^{(k)} \in \{1, \cdots, a_i\}$ for each $i = 1, \cdots, m$. These conditions are both satisfied in the asymptotic high-rate case. Note that the expectation with respect to the feature vector $y$ in equation (7) can in practice be replaced with a sample average. Since equation (7) is identical to Bennett's original integral [3] when the dimensionalities of the original datum $x$ and the feature vector $y$ are both one (i.e., the scalar companding case), equation (7) is a multidimensional version of Bennett's integral. The estimated distortion is useful for improving the quantization vector $a$ in our CVQ, as shown below. Using the estimated distortion (7), the optimization problem for a non-negative real-valued $a$ is given by
\[
\min_{a} \sum_{i=1}^{m} \frac{\mathrm{E}[G_i(y)]}{a_i^2}
\quad \text{s.t. } \sum_{i=1}^{m} \log a_i \le \log \bar{M},\ \forall i \in \{1, \cdots, m\},\ a_i > 0,\ a_i \in \mathbb{R}.
\tag{8}
\]
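Equation (8) has a closed-form minimizer by Lagrange multipliers ($a_i^2 \propto \mathrm{E}[G_i(y)]$). Below is a sketch that estimates $\mathrm{E}[G_j(y)]$ by the sample average suggested above and evaluates the closed form; the finite-difference Jacobian is an assumption for illustration.

```python
import numpy as np

def estimate_G(phi, Y, eps=1e-5):
    """Sample-average estimate of E[G_j(y)], j = 1..m, where
    G_j(y) = sum_i (d phi_i / d y_j)^2, via central differences."""
    N, m = Y.shape
    G = np.zeros(m)
    for y in Y:
        for j in range(m):
            e = np.zeros(m)
            e[j] = eps
            dphi = (phi(y + e) - phi(y - e)) / (2 * eps)  # d phi / d y_j
            G[j] += np.sum(dphi ** 2)
    return G / N

def allocate_real(G, M_bar):
    """Real-valued minimizer of (8): a_i = M_bar**(1/m) * sqrt(G_i / C),
    with C the geometric mean of G, so that prod(a) = M_bar."""
    m = G.size
    C = np.exp(np.mean(np.log(G)))          # geometric mean of E[G_j(y)]
    return M_bar ** (1.0 / m) * np.sqrt(G / C)

# First stage of the two-stage scheme described next: round to integer levels.
# a0 = np.maximum(1, np.round(allocate_real(G, M_bar))).astype(int)
```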
If we allow a non-negative real-valued solution, we can obtain it by conventional optimization methods, as done by Segall [4]. However, the non-negative real-valued solution does not guarantee each level $a_i$ to be a natural number. We therefore employ a two-stage optimization scheme. The first stage obtains a non-negative real-valued solution and rounds it to the nearest integer solution; this stage allows a global search for the quantization vector but provides only a rough solution to the original cost function given in equation (1). The second stage then optimizes over integer solutions around the first-stage solution, finely but locally. The second stage utilizes the rate-distortion cost function
\[
\min_{a} J(a)
= \min_{a} \left( \sum_{i=1}^{m} \log_2 a_i + \gamma\, \mathrm{E}[d(x, \hat{x})] \right)
\approx \min_{a} \sum_{i=1}^{m} \left( \log_2 a_i + \frac{\gamma}{12 n a_i^2}\, \mathrm{E}[G_i(y)] \right),
\quad \text{s.t. } \forall i \in \{1, \cdots, m\},\ a_i \in \mathbb{N},
\tag{9}
\]
where the parameter $\gamma$ balances the rate and the distortion. $\gamma$ is set so that the real-valued solution of the cost function (9) corresponds to the solution of (8), as $\gamma = \frac{\bar{M}^{2/m}}{2C \log 2}$, where $C = \bigl( \prod_{j=1}^{m} \mathrm{E}[G_j(y)] \bigr)^{1/m}$. We choose the quantization vector that minimizes the rate-distortion cost function (9) among the $2m$ quantization vectors, each of which sets one component $a_i$ ($i = 1, \cdots, m$) one level larger or smaller than in the first-stage solution $a$. After re-optimizing the companding parameters to suit the new quantization vector $a$, we check whether the new quantization vector and the re-optimized companding parameters $\{a, \theta_\alpha, \theta_\beta\}$ actually decrease the average distortion (2). The optimization procedure in the case of the MSE distortion measure is summarized in the panel below.
Summary of optimization procedure

1. Initialize $\hat{D}_N$ (to a sufficiently large value) so that the parameter update at step 4 is done at least once, and set $\{\hat{\theta}_\alpha, \hat{\theta}_\beta, \hat{a}\}$ as an initial guess.
2. Select a trial quantization vector $a$ by the rough global search. Initialize the companding parameters $\{\theta_\alpha, \theta_\beta\}$, so that the trial parameter set is $\{\theta_\alpha, \theta_\beta, a\}$.
3. Minimize $D_N(\theta_\alpha, \theta_\beta, a)$ with respect to the parameters $\{\theta_\alpha, \theta_\beta\}$.
4. If $D_N(\theta_\alpha, \theta_\beta, a) < \hat{D}_N$, then replace $\hat{D}_N$ and $\{\hat{\theta}_\alpha, \hat{\theta}_\beta, \hat{a}\}$ by $D_N(\theta_\alpha, \theta_\beta, a)$ and $\{\theta_\alpha, \theta_\beta, a\}$, respectively, and go to step 2. Otherwise, go to step 5.
5. Improve the trial quantization vector $a$ by the fine local search.
6. Minimize $D_N(\theta_\alpha, \theta_\beta, a)$ with respect to the parameters $\{\theta_\alpha, \theta_\beta\}$.
7. If $D_N(\theta_\alpha, \theta_\beta, a) < \hat{D}_N$, then replace $\hat{D}_N$ and $\{\hat{\theta}_\alpha, \hat{\theta}_\beta, \hat{a}\}$ by $D_N(\theta_\alpha, \theta_\beta, a)$ and $\{\theta_\alpha, \theta_\beta, a\}$, respectively, and go to step 5. Otherwise, terminate the algorithm.
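As an illustration of the fine local search used in steps 5 to 7, here is a sketch under the MSE/high-rate approximation (9); `G` is the vector of estimated $\mathrm{E}[G_i(y)]$ values and `gamma` the balance parameter from the text.

```python
import numpy as np

def rd_cost(a, G, gamma, n):
    """Approximate rate-distortion cost of equation (9)."""
    return np.sum(np.log2(a) + gamma * G / (12.0 * n * a ** 2))

def fine_local_search(a, G, gamma, n):
    """Among the 2m one-step neighbors of a (each component one level up or
    down), return the integer vector with the smallest cost (9)."""
    best, best_cost = a.copy(), rd_cost(a, G, gamma, n)
    for i in range(a.size):
        for step in (-1, +1):
            trial = a.copy()
            trial[i] += step
            if trial[i] < 1:      # levels must stay natural numbers
                continue
            c = rd_cost(trial, G, gamma, n)
            if c < best_cost:
                best, best_cost = trial, c
    return best
```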
3 Applications to Transform Coding

3.1 Architecture

Here, we apply the proposed learning method in particular to transform coding, which utilizes linear matrices and componentwise nonlinear functions for the companding functions: $\psi(x; \theta_\alpha) = f(Wx; \omega_\alpha)$ and $\phi(\hat{y}; \theta_\beta) = g(A\hat{y}; \omega_\beta)$, where $\theta_\alpha = \{W, \omega_\alpha\}$ and $\theta_\beta = \{A, \omega_\beta\}$. $W$ and $A$ are $m \times n$ and $n \times m$ matrices, respectively, and $f$ and $g$ are componentwise nonlinear functions with parameters $\omega_\alpha$ and $\omega_\beta$, respectively. We use a scaled sigmoid function for $f$, which expands a certain interval of a sigmoid function so that the range becomes $[0, 1]$, and $g$ is chosen as the inverse function of $f$:
\[
f_i(x_i; \omega_\alpha) = \frac{d_i(a_i^\alpha, b_i^\alpha)}{1 + \exp\bigl(-a_i^\alpha (x_i - b_i^\alpha)\bigr)} + c_i(a_i^\alpha, b_i^\alpha),
\]
\[
g_i(x_i; \omega_\beta) = -\frac{1}{a_i^\beta} \log \left( \frac{d_i(a_i^\beta, b_i^\beta)}{x_i - c_i(a_i^\beta, b_i^\beta)} - 1 \right) + b_i^\beta,
\]
where $c_i = -\frac{e^{a_i(1 - b_i)} + 1}{e^{a_i} - 1}$ and $d_i = -c_i \bigl(1 + e^{a_i b_i}\bigr)$ (so that $f_i(0) = 0$ and $f_i(1) = 1$), and $\omega_\alpha = \{a_i^\alpha, b_i^\alpha \mid i = 1, \cdots, m\}$ and $\omega_\beta = \{a_i^\beta, b_i^\beta \mid i = 1, \cdots, m\}$ are the parameters of $f$ and $g$, respectively. Note that when $a_i^\alpha = a_i^\beta$ and $b_i^\alpha = b_i^\beta$, each of the functions $f_i$ and $g_i$ is the inverse function of the other.

The above transform coding is optimized according to the method described in section 2.2.
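A small sketch of this compander pair; it follows the definitions above, with $c_i$ and $d_i$ chosen so that $f_i(0) = 0$ and $f_i(1) = 1$, and the inverse relation can be checked numerically.

```python
import numpy as np

def _cd(a, b):
    """Constants c, d that rescale the sigmoid so f(0) = 0 and f(1) = 1."""
    c = -(np.exp(a * (1 - b)) + 1) / (np.exp(a) - 1)
    d = -c * (1 + np.exp(a * b))
    return c, d

def f(x, a, b):
    """Scaled sigmoid f_i: componentwise, maps [0, 1] onto [0, 1]."""
    c, d = _cd(a, b)
    return d / (1 + np.exp(-a * (x - b))) + c

def g(x, a, b):
    """Inverse of f when the (a, b) parameters of f and g coincide."""
    c, d = _cd(a, b)
    return -np.log(d / (x - c) - 1) / a + b

x = np.linspace(0.05, 0.95, 5)
print(np.allclose(g(f(x, 6.0, 0.5), 6.0, 0.5), x))   # True
```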
However, the matrix $A$ can be obtained analytically, because $A$ is a solution of
\[
\min_{A} \sum_{j=1}^{n} \bigl( X_j - \hat{Z}^T A_j^T \bigr)^T \bigl( X_j - \hat{Z}^T A_j^T \bigr),
\]
where $A_j$ is the $j$-th row vector of the matrix $A$, $X_j$ is the vector consisting of the $j$-th components of the $N$ original data samples, and $\hat{Z}$ is the matrix whose $i$-th column is the transformed vector corresponding to the $i$-th original datum $x^{(i)}$. The least-squares solution of this cost function is given as a set of row vectors $[A_1, \cdots, A_n]$, each of which is obtained individually as $A_j^* = \hat{Z}^{-T} X_j$, where $\hat{Z}^{-T}$ is the pseudo-inverse of the matrix $\hat{Z}^T$.
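In NumPy, this row-wise least-squares update reduces to a single `lstsq` call; the variable shapes are assumptions for illustration (`Z_hat` is $m \times N$, `X` is $N \times n$ with sample $i$ in row $i$).

```python
import numpy as np

def solve_A(Z_hat, X):
    """Least-squares update of A: row j solves min_Aj ||X_j - Z_hat.T @ Aj||^2,
    i.e., A_j* = pinv(Z_hat.T) @ X_j. All n rows are solved in one call."""
    # lstsq on the (N, m) system with n right-hand sides returns an (m, n)
    # matrix whose transpose stacks the optimal rows A_j* into A (n x m).
    coeffs, *_ = np.linalg.lstsq(Z_hat.T, X, rcond=None)
    return coeffs.T
```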
3.2 Simulation Results

To examine the performance of our optimization for CVQ, we compared the transform coding trained by our method with KLT-based transform coding, which uses a scalar quantizer trained by the Lloyd-I algorithm [5] for each feature component. Optimization of our CVQ and of the trained KLT was performed on each of 30 data sets, each consisting of 10,000 samples. The performance of each coding scheme was evaluated as the average distortion over 500,000 samples independently drawn from the true source distribution. Hereafter, the average distortions over the 10,000 training samples and over the 500,000 independent samples are called the training error and the test error, respectively. In our CVQ, the matrix $A$ was initially set to a value randomly chosen from a uniform distribution and normalized so that the variance of each transformed component was 1. The initial matrix $W$ was determined as the inverse of $A$. The other companding parameters were set as $a_i^\alpha = a_i^\beta = 6$ and $b_i^\alpha = b_i^\beta = 0.5$, which are well suited to a signal whose distribution is bell-shaped with variance 1.

First, we examined the performance of our CVQ using two kinds of two-dimensional sources: a linear mixture of uniform sources and a Gaussian source. They have different characteristics as coding targets, because the former cannot be decomposed into independent signals by an orthogonal transformation, whereas the latter can. In the case of high bit-rate coding, KLT-based transform coding is optimal when there exists an orthogonal transformation that decomposes the source data into independent signals [6]. Therefore, the coding performance for a linear mixture of uniform sources, where the optimality of the KLT is not assured, is of great interest. Moreover, it is worth examining whether our CVQ can find a transformation superior to the KLT for low bit-rate encoding of a Gaussian source, because non-orthogonal matrices are permitted as the transformation matrix in our CVQ.

Figures 3(a) and (b) compare the arrangements of reproduction vectors obtained by our CVQ and by the trained KLT for two different linear mixtures of uniform sources; both have their density on gray-shaded parallelograms, slanted (a) 45° and (b) 60°. Each figure shows the result with the minimum training error among 30 runs of each method. Black circles indicate the arranged reproduction vectors, and the two axes indicate the two feature vectors. The upper panels show results of 2 bit-rate coding, and the lower ones 4 bit-rate coding. Each panel title gives the signal-to-noise ratio (SNR) on the test data set.

Fig. 3. Arrangements of reproduction vectors for linear mixtures of uniform sources, slanted (a) 45° and (b) 60° (upper panels: 2 bit-rate; lower panels: 4 bit-rate; panel titles give test SNRs for the trained KLT and CVQ)

As seen in this figure, our CVQ obtained feature vector axes that were almost independent of each other and efficiently arranged the reproduction vectors, whereas the trained KLT code failed to do so. The test performance of the trained KLT was degraded especially in high-rate coding, because reproduction vectors were placed in areas without original data. We also see that our CVQ modifies the feature vector axes as the bit-rate changes, even for the same source, as typically illustrated in Fig. 3(a). Thus, our CVQ adaptively seeks an efficient allocation of reproduction vectors, which is difficult to perform analytically. Results at various bit-rates indicated that the test error of our CVQ was likely the best especially when the bit-rate was high.

On the other hand, the KLT code is guaranteed to be the best transform coder for encoding a Gaussian source at high bit-rate. However, there remains a possibility that a non-orthogonal transform coding might outperform the KLT at low bit-rate by changing the transformation matrix according to the bit-rate.

Fig. 4. Encoding result of a Gaussian source at low bit-rate (1.6 bit-rate; best CVQ: SNR 6.327 dB, optimal KLT code: SNR 6.237 dB)

In fact, we found that the best transformation matrix in our CVQ was superior to the analytical KLT code at low bit-rate, as shown in Fig. 4. In this figure, we compare our trained CVQ with the optimal KLT code, which can be numerically calculated at low bit-rate; the SNR of our CVQ was estimated using 5,000,000 samples (in the figure, three times the estimated standard deviation is shown in brackets).

Finally, we examined the performance when the source dimension became large. A linear mixture of uniform distributions was used for the original source, because it prevents the high-dimensional vector quantization from being replaced by a set of scalar quantizations via an orthogonal transformation, whereas a Gaussian source does not prevent this.
We tested 1 and 2 bits per dimension, with dimensionality 2, 4, 8, or 16.

Fig. 5. Encoding results for a linear mixture of uniform sources with various dimensionalities, at (a) 1 bit and (b) 2 bits per dimension (abscissa: source dimension 2, 4, 8, 16; ordinate: SNR relative to the SNR of the best CVQ [dB])

In Figure 5, the abscissa denotes the dimension of the source, and the ordinate denotes the test error relative to the best CVQ (dB) among 30 runs. The test error of CVQ over the 30 runs is shown by a box-whisker plot, and the median test errors of CVQ, the analytical KLT, and the trained KLT are denoted by a solid line, a dashed line, and a dot-and-dash line, respectively. The number of poor CVQ results that fall outside the plotted range, if any, is shown in brackets on the corresponding x label. As seen from Fig. 5, the performance variance became large as the data dimensionality increased. We found that the initial setting of the parameters strongly affected the performance of our CVQ when the data dimensionality was large. However, the best trained CVQ (the best run among the 30 runs) was in all cases superior to both the trained and the analytical KLT, and this superiority was consistent regardless of the data dimensionality.

4 Discussion

The idea of companding has existed for a long time, and many theoretical analyses have been carried out. An optimal companding function for scalar quantization was derived by Bennett [3] and has been used to examine the ability of an optimal scalar quantizer. On the other hand, an optimal companding function for VQ has not been derived analytically except in very limited cases; moreover, it is known that the optimal CVQ does not correspond to the optimal VQ except in very limited cases [7][8][9]. These negative results show the difficulty of analytically deriving the optimal companding function. Yet, the optimization of CVQ is attractive, because CVQs constitute a wide class of practically useful codes and avoid an exponential increase in the coder's complexity, referred to as the 'curse of dimensionality.'

Through the numerical simulations, we noticed that theoretical analysis based on the high-rate assumption often deviates from the real situation: an analytically derived companding function based on high-rate analysis did not show very good performance even when the bit-rate was fairly high (e.g., 4 bit-rate). We surmise that such substantial disagreement stems from the high-rate assumption. Since high-rate coding implies that the objective function of lossy source coding consists solely of distortion, neglecting the coding length, the high-rate analysis may lead to a large disagreement when the coding length cannot be completely neglected. These findings suggest the importance of optimizing a practically useful code by learning from sample data. Fortunately, plenty of data are usually available in practical lossy source coding, and the computational cost of the optimization is not very large in comparison to various machine learning problems, because the code optimization is necessarily performed just once. Although we could show the potential ability of the CVQ, it should be noted that the initial parameters must be chosen carefully when training the parameterized CVQ, especially in the case of high dimensionality. To find good initial parameters, the idea presented by Hinton for training deep networks [10] may be useful.
Acknowledgement. This work was supported by KAKENHI 19700219.

References

1. Huang, J.J.Y., Schultheiss, P.M.: Block quantization of correlated Gaussian random variables. IEEE Trans. Comm. CS-11, 289–296 (1963)
2. Goyal, V.K.: Theoretical foundations of transform coding. IEEE Signal Processing Mag. 18(5), 9–21 (2001)
3. Bennett, W.R.: Spectra of quantized signals. Bell Syst. Tech. J. 27, 446–472 (1948)
4. Segall, A.: Bit allocation and encoding for vector sources. IEEE Trans. Inform. Theory 22(2), 162–169 (1976)
5. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28(2), 129–137 (1982)
6. Goyal, V.K., Zhuang, J., Vetterli, M.: Transform coding with backward adaptive updates. IEEE Trans. Inform. Theory 46(4), 1623–1633 (2000)
7. Gersho, A.: Asymptotically optimal block quantization. IEEE Trans. Inform. Theory 25, 373–380 (1979)
8. Bucklew, J.A.: Companding and random quantization in several dimensions. IEEE Trans. Inform. Theory 27(2), 207–211 (1981)
9. Bucklew, J.A.: A note on optimal multidimensional companders. IEEE Trans. Inform. Theory 29(2), 279 (1983)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)