Lecture Notes in Computer Science

Fig. 2. The block diagram of the error diffusion algorithm Fig. 3

bet	63/88
Sana	16.12.2017
Hajmi	12.42 Mb.
	#22381

1 ... 59 60 61 62 63 64 65 66 ... 88

Fig. 2. The block diagram of the error diffusion algorithm

Fig. 3. The Floyd-Steinburg kernel

Next, in order to estimate the performance of our method for a standard image, we

use the mean square error as

(

)

∑

−

L

y

x

y

x

y

x

z

NQ

(10)

where

ξ

x, y

and

ˆ

x y

are the pixel values of the original gray-level and reconstructed

images. On the other hand, if we estimate the performance of gray-level images gen-

erated by the true prior

P({ξ}), we evaluate the mean square error which is averaged

over the true prior as

( )

(

)

}

{

}

{

∑

−

L

y

x

y

x

y

x

z

NQ

P

(11)

This value becomes zero, if each pixel value of all reconstructed images is completely

same with that of the corresponding original images.

As we have said in above, because the edge structures embedded in the halftone

image is considered to influence on the performance of inverse-halftoning, we esti-

mate the edge length appearing both in the halftone and reconstructed images as

668 Y.

Saika

( )

(

)(

)

⎥

⎦

⎤

⎢

⎣

⎡

−

∑

n.n.

}

{

edge

}

{

y

x

y

x

y

x

y

x

z

z

P

L

δτ

(12)

which is averaged all over the snapshots of the Q-Ising model. We also estimate the

gradient of the gray-level on the edges appearing both in the halftone and recon-

structed images as

( )

(

)

⎥

⎦

⎤

⎢

⎣

⎡

−

∑

n.n.

}

{

}

{

y

x

y

x

y

x

y

x

y

x

z

z

P

δz

δτ

(13)

which is averaged over the snapshots of the Q-Ising model.

Next in order to clarify how the edge structures of the original image is observed in

the reconstructed image, we estimate the edge length on the edges in the original,

halftone and reconstructed images:

( )

(

)(

)

⎥

⎦

⎤

⎢

⎣

⎡

−

∑

n.n.

}

{

GHO

}

{

y

x

y

x

y

x

y

x

y

x

y

x

edge

z

z

P

L

δτ

δξ

(14)

which is averaged over the original images. Also we estimate the edge length on the

edges in the original, halftone and reconstructed images:

( )

(

)(

)

⎥

⎦

⎤

⎢

⎣

⎡

−

∑

n.n.

}

{

GHO

}

{

y

x

y

x

y

x

y

x

y

x

y

x

y

x

z

z

P

δz

δτ

δξ

(15)

which is averaged over the original images.

3 Performance

In order to estimate the performance of the MPM estimate, we carry out the Monte

Carlo simulation both the set of gray-level images generated by the Boltzmann factor

of the Q-Ising model and the gray-level standard image.

First, we estimate the performance of the MPM estimate for the set of the snap-

shots of the Q=4 Ising model as shown in Fig. 1 (a). Then the halftone image shown

in Fig. 1(d) is obtained by the error diffusion method using the Floyd-Steinburg ker-

nel. When we estimate the performance of the MPM estimate, we use the mean

square error which is averaged over the 10 samples of the snapshots of the Q=4 Ising

model with h

=1 and T

=1. Now we investigate static properties of the MPM estimate

based on the mean square error and the edge structures observed both in the original,

halftone and reconstructed images.

First we confirm the static properties of the MPM estimate when the posterior

probability has same form with the likelihood at J=0 or with the model prior in the

limit of J→∞. At J=0, as the likelihood is assumed to enhance the halftone image, the

MPM estimate reconstructs the gray-level image which is almost same with the half-

tone image. Therefore the reconstructed image at J=0 has the edge length which is

Inverse-Halftoning for Error Diffusion

669

Fig. 4. The mean square error as a function of J obtained by the MPM estimate for the halftone

image which is obtained from the set of the snapshots of the Q=4 Ising model by the error

diffusion method using the Floyd-Steinburg kernel when h

=1, T

=1, h=1, T=0.1

Fig. 5. The edge structures observed in the reconstructed image obtained by the MPM estimate

using the Q-Ising model for the set of the snapshots of the 4-Ising model when h

=1, T

=1, h=1,

T=1

longer than the original image, as the halftone image is expressed by the dot pattern

which is visually similar to the original image. For instance, the edge length averaged

over the set of the Q=4 Ising model and the halftone images obtained by the error

diffusion using the Floyd-Steinburg kernel are 6021.3 and 12049.8. On the other

hand, in the limit of J→∞, as the model prior is assumed to enhance smooth struc-

tures, the MPM estimate reconstructs the flat pattern.

670 Y.

Saika

Fig. 6. The mean square error as a function of J obtained by the MPM estimate for error diffu-

sion for Q=255 standard image “girl” when h=1, T=1

Fig. 7. The edge length and the gradient of the gray-level in the gray-level image restored by

the MPM estimate for Q=256 standard image “girl” when h=1, T=1

Then we investigate the performance of the MPM estimate when the posterior

probability is composed both of the model prior and the likelihood. In Fig. 4 we show

the mean square error as a function of J for the snapshots of the Q=4 Ising model. The

figure indicates that the mean square error takes its minimum at J=0.875. This means

that the MPM estimate reconstructs the gray-level image by suppressing the gradient

of the gray-level of the edges embedded in the halftone image by the Boltzmann fac-

tor of the Q-Ising model. Also the figure indicates that the mean square error rapidly

decreases as we increase J from 0.025 to 0.3. The origin of the rapid drop of the mean

square error can be explained by the edge structures of the reconstructed image, such

Inverse-Halftoning for Error Diffusion

671

as the edge length and the gradient of the gray-level as

edge

GHO

edge

L

edge

δz

and

GHO

edge

δz

which are shown in (12)-(15) in the following.

Now we evaluate the edge structures of the reconstructed images, such as the edge

length and the gradient of the gray-level. Figure 5 indicates how L

, L

GHO

, |δz

| and

|δz

GHO

| depend on the parameter J due to the MPM estimate for the set of the Q=4 Ising

model. Figure 5 indicates that L

is steady in 0<J<0.6, though it gradually decreases

with the increase in J from 0.6 and becomes zero in J>1.175. This means that the edge

structures of the halftone image survives in J<0.6 and that they are removed by the

model prior in J>0.6. On the other hand, |δz

| decreases with the increase in J from

0.025 to 0.3 and then becomes almost same with the edge length L

at J=0.3. Then, as

shown in Fig. 5, although the L

is steady in 0<J<0.6, L

becomes short with the

increase in J in from 0.6 and then takes L

=4096.3 at J=J

opt

=0.875. Then, as we fur-

ther increase J, L

edge

becomes zero = 0 at J=1.075 in the end. These results indicate that

|δz

| is rapidly suppressed by the model prior which enhances smooth structures in

J<0.6 and that the MPM estimate the edges embedded in the halftone image is re-

moved by the model prior in J>0.6. Further we show that the optimal performance is

achieved by suppressing the gradient of the gray-level and by removing a part of the

edges embedded in the halftone image if we set the parameters appropriately.

Also we numerically estimate the dynamical properties of the MPM estimate using

the Monte Carlo simulation for the Q=4 Ising model on the square lattice. The simula-

tion clarifies that the mean square error smoothly converges to the same value within

the statistical uncertainty and however that the convergent value depends on the initial

condition if we set J>J

opt

Next, we estimate the performance of the MPM estimate for the 256-level standard

image “girl” with 100×100 pixels which is shown in Fig. 1 (d). Then the halftone im-

age in Fig.1 (e) is obtained from the original image by the error diffusion method using

the Floyd-Steinburg kernel. As shown in Figs. 6 and 7, we estimate the performance in

terms of the mean square error, the edge length and the gradient of the gray-level.

Figure 6 indicates that the MPM estimate shows optimal performance at J=1.45. Then,

as shown in Fig. 7, the Monte Carlo simulation clarifies that the edge length is longer

than that of the halftone image and further that the gray-level difference is larger than

that of the edge length in the reconstructed image. These indicate that the fine struc-

tures introduced into the halftone image remains in the reconstructed image, though the

gradient of the gray-level is sharply suppressed with the increase of J in 0<J<1.0. As

shown in above, these results indicate the similar behavior to the Q=4 model.

4 Summary and Discussion

In the previous chapters, we formulate the problem of inverse-halftoning for error

diffusion using the Floyd-Steinburg kernel based on the MPM estimate. Then using

the Monte Carlo simulation for the snapshots of the Q-Ising model and the standard

image, we estimate the performance of our method in terms of the mean square error

and the edge structures of the reconstructed image. We clarify that the optimal per-

formance of the MPM estimate using the above model prior and the likelihood is

available for the problem of inverse-halftoning, if we tune the parameters. This also

means that the assumed model prior is available to suppress the fine structures intro-

duced into the halftone image and reconstruct the original gray-level image.

672 Y.

Saika

Acknowledgment

The author was financially supported by Grant-in-Aid Scientific Research on Priority

Areas “Deepening and Expansion of Statistical Mechanical Informatics (DEX-SMI)”

of the Ministry of Education, Culture, Sports, Science and Technology (MEXT)

No.18079001.

References

1. Besag, J.: Journal of the Royal Statistical Society B 48, 259-302 (1986)

2. Winkler, G.: Image analysis, Random fields and Dynamic Monte Carlo Methods. A

Mathematical Introduction. Springer, Berlin (1995)

3. Cressie, N.A.: Statistics for Spatial Data. Wiley, New York (1993)

4. Possio, A. (ed.): Spatial Statistics and Imaging. Institute of Mathematical Statistics, Hay-

ward, California. Lecture Notes-monograph Series, vol. 20 (1991)

5. Ogata, Y., Tanemura, M.: Ann. Inst. Stat. Math. B 33, 131 (1981)

6. Nishimori, H.: Statistical Physics of Spin Glasses and Information Processing: An Intro-

duction. Oxford University Press, Oxford (2001)

7. Tanaka, K.: Journal of Physics A: Mathematics and General 35, R38-R150 (2002)

8. Pryce, J.M., Bruce, A.D.: Journal of Physics A 28, 511-532 (1995)

9. Sourlas, N.: Nature 339, 693-695 (1989)

10. Sourlas, N.: Europhys. Letters 25, 159–164 (1994)

11. Nishimori, H., Wong, K.Y.M.: Statistical mechanics of image restoration and error-

correcting codes. Physical Review E 60, 132–144 (1999)

12. Tanaka, T.: Statistical mechanics of CDMA multiuser demodulation. Europhysics Let-

ters 54, 540–546 (2001)

13. Ulichney, R.: Digital Halftoning. The MIT Press, Cambridge, Massachusetts, London,

England (1987)

14. Bayer, B.E.: An optimum method for two-level rendition of continuous-tone pictures. In:

ICC CONF. RECORD, vol. 26, pp. 11–15 (1973)

15. Ulichney, R.: Dithering with blue noise. Proc. IEEE 76(1), 56–79 (1988)

16. Mitsa, T., Parker, K.J.: Digital halftoning technique using a blue noise mask. Journal of the

Optical Society of America A/9(11), 1920–1929 (1992)

17. Yao, M., Parker, K.J.: The blue noise mask and its comparison with error diffusion. SID 94

Digest 37(3), 805–807 (1994)

18. Yao, M., Parker, K.J.: Modified approach to the construction of a blue noise mask. Journal

of Electric Imaging 3(1), 92–97 (1994)

19. Miceli, C.M., Parker, K.J.: Inverse halftoning. Journal of Electric Imaging 1, 143–151 (1992)

20. Wong, P.W.: Inverse Halftoning and Kernel Estimation for Error Diffusion. IEEE Trans.

Image Processing 4, 486–498 (1995)

21. Saika, Y., Inoue, J.: Probabilistic inference to the problem of inverse-halftoning based on

statistical mechanics of spin systems, Probabilistic inference to the problem of inverse-

halftoning based on statistical mechanics of spin systems. In: Proc. SICE-ICCAS 2006, pp.

4563–4568 (2006)

22. Saika, Y., Inoue, J.: Probabilistic inference to the problem of invere-halftoning based on

the Q-Ising model. In: Proc. IEEE FOCI 2007, pp. 429–433 (2007)

23. Inoue, J., Saika, Y., Okada, M.: accepted for the 7th International Conference on Intelli-

gent the System Design and Applications (ISDA 2007) (2007)

Chaotic Motif Sampler for Motif Discovery

Using Statistical Values of Spike Time-Series

Takafumi Matsuura and Tohru Ikeguchi

Graduate School of Science and Engineering, Saitama University,

225 Shimo-Ohkubo, Sakura-ku, Saitama-city 338-8570, Japan

takafumi@nls.ics.saitama-u.ac.jp

Abstract. One of the most important issues in bioinformatics is to dis-

cover a common and conserved pattern, which is called a motif, from bi-

ological sequences. We have already proposed a motif extraction method

called Chaotic Motif Sampler (CMS) by using chaotic dynamics. Explor-

ing a searching space with avoiding undesirable local minima, the CMS

discovers the motifs very eﬀectively. During a searching process, chaotic

neurons generate very complicated spike time-series. In the present pa-

per, we analyzed the complexity of the spike time-series observed from

each chaotic neuron by using a statistical measure, such as a coeﬃcient of

variation and a local variation of interspike intervals, which are frequently

used in the ﬁeld of neuroscience. As a result, if a motif is embedded in a

sequence, corresponding spike time-series show characteristic behavior.

If we use these characteristics, multiple motifs can be identiﬁed.

Introduction

The Human Genome Project was completed in April 2003. The project gener-

ated a large amount of genomic sequence data sets. The human genomes consist

of approximately three billion base pairs and approximately 25,000 genes. Then,

one of the primary issues in the ﬁeld of bioinformatics is how to identify biolog-

ically important parts. It is generally considered that the biologically important

parts repeatedly appear in a biological sequence, for example, DNA, RNA, and

protein sequences. Therefore, one of the most popular analyses involves extract-

ing a common and conserved pattern, which is often called a motif.

The problem of extracting motifs (motif extracting problem, MEP) from bi-

ological sequences can be mathematically described as follows: We have a data

set S =

{s

1

, s

, ..., s

}, where s

is the ith biological sequence (Fig. 1). The ith

sequence consists of m

(i = 1, 2, ..., N ) elements. In the case of DNA or RNA,

the elements correspond to the four bases, while in case of protein sequences

they correspond to 20 amino acids. The most basic case usually includes exactly

one motif whose length is L in each sequence. However, real life problems have

several variations; the number of embedded motifs in each sequence can be more

than one or random (including zero).

To extract motifs from a biological sequence, one of the simple and natural

methods is a brute force method, or an enumeration method, which enumerates

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 673–682, 2008.

c Springer-Verlag Berlin Heidelberg 2008

674

T. Matsuura and T. Ikeguchi

i

L

ATGG

CTGAGTA

· · · · · ·

ACTGAGTCAGTT

CTATCCAGATT

· · · · · ·

ATAGCGCGC

CTGAGTA

CTA

...

−1

AGATCAGGA

CTGAGTA

GCCTATTTG

· · · · · ·

AGTT

s

N

ACTACAACAATACA

CTGAGTA

TGC

· · · · · ·

AGATCTC

Fig. 1. An example data set of DNA sequences. A, C, G, and T denote Adenine,

Cytosine, Guanine, and Thymine, respectively. The bold face alignments indicate a

motif. In this example, the motif “

CTGAGTA

” is embedded in each sequence.

all possible patterns. However, when using the brute force method for DNA or

RNA sequences, there are 4

possible patterns that have to be considered. For

a protein sequence, this number becomes 20

because a protein sequence has 20

amino acids. If L becomes large, the number of the motif patterns exponentially

diverges. It means that it is almost impossible to explore all the possible motif

patterns in a reasonable time frame. Actually, it has also been proved that MEP

belongs to a class of

N P-hard [2]. These facts indicate that it is inevitable to

develop an eﬀective approximation algorithm for extracting motifs. To realize

the issue, many approximation algorithms have been proposed [9].

To search near-optimal solutions of combinatorial optimization problems,

many heuristic algorithms have been proposed [3, 4]. It has also been shown

that an algorithm, which introduces a chaotic neurodynamics for exploring so-

lution spaces, is very eﬀective [6, 7, 8, 10, 11]. The central idea of involving the

chaotic neurodynamics is that if we use the chaotic dynamics, a volume of search-

ing space reduces drastically. Diﬀerent from using a kind of random algorithm,

the volume of a state space explored by a chaotic dynamics has Lebegue measure

zero, because the state space generally consists of a chaotic attractor. To realize

an eﬀective chaotic search, an algorithm for controlling a local search by the

chaotic dynamics has been proposed. In this algorithm, an execution of a local

search, such as 2-opt for solving traveling salesman problems (TSPs) [5, 7], and

position movement for MEPs [10, 11] is driven by the chaotic neurodynamics.

Then, it has been shown that the chaotic neurodynamics searches good near op-

timal solutions for TSPs [5, 7], MEPs [10, 11] and quadratic assignment problem

(QAP) [8]. In addition, it has also been clariﬁed that its high searching ability

depends on a statistical property of refractoriness in a chaotic neuron [1] for the

case of the MEPs [12, 13].

To solve MEPs, we have already proposed a new motif extraction method

“Chaotic Motif Sampler (CMS),” by using a chaotic neurodynamics [11]. To

realize the chaotic neurodynamics, we used a chaotic neural network composed

by a chaotic neuron [1]. In the CMS, the chaotic neurons are assigned to all motif

candidate position. Thus, the motif is decided by whether the corresponding

chaotic neuron is ﬁring or resting. As a result, it is shown that the CMS can ﬁnd

motif in a very high rate, if only a single motif is embedded in a sequence.

Chaotic Motif Sampler for Motif Discovery

675

Exploring a searching space with avoiding undesirable local minima, the CMS

detects the motifs by ﬁring neurons. Then, each neuron generates a complicated

spike time-series. Then, the spike time-series of the chaotic neuron corresponding

to a correct motif position may exhibit characteristic response. In this paper,

in order to reveal such characteristic property, we analyzed complexity of the

spike time-series from each chaotic neuron by using a statistical measure, such

as coeﬃcient of variation (C

) and local variation of interspike intervals (L

which are frequently used in the ﬁeld of neuroscience. As a result, if motifs are

embedded in a sequence, corresponding spike time-series shows characteristic

behavior. A chaotic neuron to which correct motifs are assigned has a higher C

value than the other neurons, while the value of the L

is lower than the other

neurons. As a result, even if multiple motifs are embedded in a sequence, CMS

can extract the motifs.

Chaotic Motif Sampler

The CMS uses chaotic neurons [1] generating chaotic neurodynamics to explore

better solutions embed in a state space [11]. In other word, the motif candidates

are decided by ﬁring of the chaotic neuron.

To realize the CMS, we ﬁrst used a chaotic neural network [1] composed of

i=1

− L + 1) neurons (Fig. 2). In this neural network, the ﬁring of a chaotic

neuron encodes a movement of the head position of a motif candidate to the

corresponding position (Fig. 2).

j = 1

2

3

· · · · · · ·

− L + 1

i = 1

...

C T A G T

i

· · ·

i i

T C A A

...

A G C G C

i i

· · ·

i i

G C A T

i i i i i i

· · ·

i i

T A A G G

N: The number of sequence

L: Motif length

: The number of elements in each sequence

: Motif candidate

: Chaotic neuron

Fig. 2. How to code motif positions to neuron ﬁrings. In this example, the motif length

is ﬁve.

The ﬁring of the (i, j)th neuron is then deﬁned by x

i,j

(t) = f (y

i,j

(t)) >

where f (y) = 1/(1 + exp(

−y/ )), and y

i,j

(t) is an internal state of the (i, j)th

neuron at time t. The internal state of the chaotic neuron [1] is decomposed

into two parts—ξ

i,j

(t) and ζ

i,j

(t)—each of which represents diﬀerent eﬀects or

determining the ﬁring of a neuron in the algorithm; a gain eﬀect, a refractory

eﬀect, and a mutual inhibition eﬀect, respectively.

676

T. Matsuura and T. Ikeguchi

The ﬁrst part, ξ

i,j

(t), which expresses the gain eﬀect, is deﬁned by

i,j

(t + 1) = β E

i,j

(t)

− ˆ

E ,

(1)

i,j

(t) =

k=1 ω

∈Ω

(ω) log

(ω)

p(ω)

(2)

where β is a scaling parameter of the eﬀect; E

i,j

(t) represents the objective

function in CMS, a relative entropy score when a candidate motif position is

moved to the jth position in the sequence s

; ˆ

E, the entropy score of the current

state; f

(ω), the number of appearances of an element (one of the four bases in

the case of DNA or RNA sequences and 20 amino acids in the case of a protein

sequence) ω

∈ Ω at the kth position of subsequences; p(ω), the background

probability of appearance of the element ω; and Ω, a set of bases (Ω

BASE

=

{A,

C, G ,T

}) or a set of amino acids (Ω

ACID

=

{N, S, D, Q, E, T, R, H, G, K, Y, W,

C, M, P, F, A, V, L, I

}). The quantity on the right-hand side of Eq.(1) becomes

positive if the new motif candidate position is better than the current state.

The second part, ζ

i,j

(t), qualitatively realizes the refractoriness of the neuron.

The refractoriness is one of the important properties of real biological neurons;

once a neuron ﬁres, it can hardly ﬁre for a certain period of time.

The second part is then expressed as follows:

i,j

(t + 1) =

−α

d=0

i,j

− d) + θ

(3)

−αx

i,j

(t) + k

i,j

(t) + θ(1

− k

)

(4)

where α is a scaling parameter to decide strength of the refractory eﬀect after

neuron ﬁring (0 < α); k

, a decay parameter that assumes values between 0 and

1; and θ, a threshold value. Thus, in Eq.(3), ζ

i,j

(t + 1) expresses the refractory

eﬀect with a factor k

because the more frequently the neuron has ﬁred in its past,

the more negative the ﬁrst term of the right side hand of Eq.(3) becomes, which

in turn depresses the value of ζ

i,j

(t+1) and leads the neuron to a relatively resting

state. Although the refractoriness realized in the chaotic neuron [1] has similar

eﬀect such as tabu eﬀect [3, 4], we have already shown that the refractoriness can

control to search a solution in a state space better than the tabu eﬀect [12, 13].

By using these two internal states, we construct an algorithm for extract-

ing motifs, as described in the following. Consider a set of N sequences of

lengths m

(i = 1, 2, ..., N ); further, let the length of a subsequence (motif) be L

(Fig. 2). We proceed as follows:

1. The position of an initial subsequence t

i,j

(i = 1, 2, ..., N ; j = 1, 2, ..., m

−

L + 1) is randomly set at the ith sequence.

2. The ith sequence s

is selected cyclically.

3. For a selected sequence s

at the second step, to change the position of the

motif candidate to a new position, y

i,j

(t + 1) is calculated from the ﬁrst

neuron (j = 1) to the last neuron (j = m

− L + 1) in the sequence s

. Then,

Chaotic Motif Sampler for Motif Discovery

677

the neuron whose internal state takes maximum (y

i,max

) is selected. If the

(i, max)th neuron ﬁres (x

i,max

(t + 1) > 1/2), a new motif position is moved

to the (i, max)th position, and the value of ˆ

E is updated.

4. Repeat the steps 2 and 3 for the suﬃcient number of iterations.

Statistical Measures C

and L

[14]

The coeﬃcient of variation (C

) of interspike intervals is a measure of random-

ness of interval variation. The C

is deﬁned as

− 1

i=1

− T )

(5)

where T

is the ith interspike interval (ISI), n is the number of ISIs, and T =

n

n

i=1

is the mean ISI. For spike intervals that are independently exponentially

distributed, C

is 1 in the limit of a large number of intervals. For a regular spike

time-series in which T

is constant, C

= 0.

Next, the local variation (L

) of interspike intervals is a measure of the spiking

characteristics of an individual neuron. The L

is deﬁned as

− 1

−1

i=1

3(T

− T

−1

)

+ T

i+1

)

(6)

For spike intervals that are independently exponentially distributed, L

= 1.

For a regular spike time-series in which T

is constant, L

= 0.

Results

4.1

Investigation of Statistical Measures of C

and L

In order to investigate the values of C

and L

[14] for chaotic neurons in

CMS which solves MEP, we used real protein sequences [9, 15]. At ﬁrst, we

used the real protein data set consisted of N = 30 sequences. The ith sequence

is composed of m

(i = 1, 2, ..., N ) amino acids, and one motif (L = 18) is

embedded in each sequences. Because correct locations of the motif for this real

protein sequences [15] has already been clariﬁed by another method [9, 15], we

can compare the analysis results between the neurons of correct motif position

and the others. For more details of the sequence data and how to identify the

correct motif positions, see Ref.[9] and the references therein. Table 1 shows

correct locations of the motif in each sequence. To extract motifs from this data

set, the parameters of the CMS are set to k

= 0.8, α = 0.25, θ=1.0, β=30.0,

and =0.01.

678

T. Matsuura and T. Ikeguchi

Table 1. Correct location of the motif in each real protein sequence [15]

Sequence number

2

3

9 10 11 12

13 14 15

Correct location 225 198 22 326 449 22 5 73 99 25 169 15 12 196 196

Sequence number 16 17 18 19 20 21 22 23 24 25 26 27

28 29 30

Correct location 252 444 11 23

3 5 26 67 495 205 160 3

3 27 25

0

100

200

300

400

500

600

700

800

900

1000

100

200

eur

100

150

200

250

0.8

Neuron index

(a)

(b)

(c)

(d)

Fig. 3. (a) Raster plot, the values of (b) the ﬁring rate, (c) C

and (d) L

for all

neurons in the 3rd sequence of Table 1

It is considered that the frequently ﬁring neurons correspond to location of a

correct motif, because the gain eﬀect of the chaotic neurons is decided by a degree

of similarity to another motif candidates. Namely, the neurons corresponding to

motifs frequently ﬁre. In this sense, it might be enough to observe ﬁring rates of

the neurons. However, only the ﬁring rate does not always give a correct result.

Figure 3 shows such an example; the results for the 3rd sequence (s

) of the data

in Table 1. For the 3rd sequence, although the location of correct motif is 22,

the most frequently ﬁring neuron is not the 22nd, but middle (160

∼ 170) and

end (270

∼ 280) neurons of the sequence (Fig. 3 (b)).

If we evaluate a raster plot of all the neurons in the 3rd sequence, the fre-

quently ﬁring neurons seem to be ﬁring randomly, and a ﬁring pattern of the 22nd

neuron shows characteristic behavior (Fig. 3(a)). To detect such characteristic

ﬁring behavior, we calculated the statistical values of output spike time-series.

As a result, it is clearly seen that the value of the C

of the 22nd neuron is

higher than the other neurons and the value of the L

is lower than the other

neurons (Figs. 3 (c) and (d)). On the other hand, for the frequently ﬁring neu-

rons, the values of C

are low and the values of the L

are high. For the other

Chaotic Motif Sampler for Motif Discovery

679

100

200

300

400

500

eque

Neuron index

(a) Firing rate of neurons

100

200

300

400

500

0.7

0.8

0.9

1.1

1.2

1.3

eque

Neuron index

(b) Values of C

100

200

300

400

500

0.7

0.8

0.9

1.1

1.2

1.3

eque

Neuron index

Fig. 4. The values of (a) the ﬁring rate, (b) C

, and (c) L

of all neurons for the real

protein data set. The correct location of the motifs are shown in Table 1.

sequences, the neurons corresponding to a correct motif show the same tendency

(Fig. 4 and Table 1). Furthermore, if the neuron at a correct motif position ﬁres

frequently, C

becomes not low but high, and L

becomes not high but low.

The results indicate that spike time-series generated by the chaotic neuron of a

correct motif position has diﬀerent characteristics from the other neurons.

4.2

Multiple Motif Case

The original CMS [11, 12] cannot always ﬁnd multiple motifs from a sequence,

because the most similar subsequence is extracted as a motif from the sequence.

However, the multiple motifs might be extracted from each sequence by using

Table 2. Correct locations of the motif in each artiﬁcial protein sequence

Sequence number

9 10 11 12

Correct locations

130

15

225 198 22 326 467 22 5 99 169 35 12 196

Sequence number 13 14 15 16 17 18 19 20 21 22 23 24 25

Correct locations

11 23

50

196 250 150

5 26 67 495 205 178

3 25

680

T. Matsuura and T. Ikeguchi

100

200

300

400

500

600

700

800

900

1000

100

200

eur

100

150

200

250

0.8

1.2

0.8

1.2

Neuron index

(a)

(b)

(c)

(d)

Fig. 5. (a) Raster plot, the values of (b) the ﬁring rate, (c) C

and (d) L

for all

neurons in the 15th sequence of Table 2

100

200

300

400

500

eque

Neuron index

(a) Firing rate of neurons

100

200

300

400

500

0.8

1.2

1.4

1.6

eque

Neuron index

(b) Values of C

100

200

300

400

500

0.7

0.8

0.9

1.1

1.2

1.3

1.4

eque

Neuron index

Fig. 6. The values of (a) the ﬁring rate, (b) C

, and (c) L

of all neurons for the

artiﬁcial protein data set. The correct location of the motifs are shown in Table 2.

characteristics of spike time-series. To verify this hypothesis, we prepared an

artiﬁcial data set. This artiﬁcial data has 25 sequences (N = 25), and the ith

sequence is composed of m

(i = 1, 2, ..., N ) amino acids. In each sequence, one

Chaotic Motif Sampler for Motif Discovery

681

or two motifs are embedded. Table 2 shows the correct locations of the motif

(L = 18) in each sequence. To extract motifs from this data set, the parameters

are set to k

= 0.8, α = 0.25, θ = 0.9, β = 25.0, and

= 0.01.

Figure 5 shows the results for the 15th sequence of Table 2 in which two motifs

are embedded. For the 15th sequence, although one neuron of the motif (23rd)

takes a high ﬁring rate, the other (156th) is low (Figs. 5 (a) and (b)). However,

the both values of C

are higher than the other neurons, and both values of L

are lower than the other neurons (Figs. 5 (c) and (d)). For the other sequences,

the neurons of correct motifs show the same tendency (Fig. 6 and Table 2).

Conclusions

In this paper, to improve solving performance of motif extraction problem by

a chaotic neurodynamics, we introduced the spike time-series produced from

chaotic neurons in the CMS, and analyzed its complexity by using a statistical

measure, such as coeﬃcient of variation (C

) and local variation (L

) of inter-

spike intervals, which are frequently used in the ﬁeld of neuroscience. As a result,

of corresponding neurons to correct motif positions becomes higher than the

other neurons and L

becomes lower than the other neurons, independent of

ﬁring rates of the neurons. The result of using these characteristics for ﬁnding

motifs, we can solve MEP or extract the motifs in case that multiple motifs

are embed in each sequence. Although statistical values of C

and L

of spikes

produced from neurons corresponding to the correct location are depend on each

sequence, their value exhibit statistically signiﬁcant in the sequence.

Then, if we evaluate the motif positions more quantitatively by using statis-

tical values of C

and L

, we could develop a new algorithm of the CMS, elim-

inating the sequences with no motifs. The research of T.I. is partially supported

by a Grant-in-Aid for Scienctiﬁc Research (B) (No.16300072) and a research

grant from the Mazda Foundation.

References

[1] Aihara, K., et al.: Chaotic Neural Networks. Physics Letters A 144, 33–340 (1990)

[2] Akutsu, T., et al.: On Approximation Algorithms for Local Multiple Alignment.

In: Proceedings of the 4th Annual International Conference Molecular Biology,

pp. 1–7 (2000)

[3] Glover, F.: Tabu Search I. ORSA Journal on Computing 1(3), 190–206 (1989)

[4] Glover, F.: “Tabu Search II”. ORSA Journal on Computing 2(1), 4–32 (1990)

[5] Hasegawa, M., et al.: Combination of Chaotic Neurodynamics with the 2-opt

Algorithm to Solve Traveling Salesman Problems. Physical Review Letters 79(12),

2344–2347 (1997)

[6] Hasegawa, M., et al.: Exponential and Chaotic Neurodynamical Tabu Searches for

Quadratic Assignment Problems. Control and Cybernetics 29(3), 773–788 (2000)

[7] Hasegawa, M., et al.: Solving Large Scale Traveling Salesman Problems by Chaotic

Neurodynamics. Neural Networks 15(2), 271–283 (2002)

682

T. Matsuura and T. Ikeguchi

[8] Hasegawa, M., et al.: A Novel Chaotic Search for Quadratic Assignment Problems.

European Journal of Operational Research 39(3), 543–556 (2002)

[9] Lawrence, C.E., et al.: Detecting Subtle Sequence Signals: A Gibbs Sampling

Strategy for Multiple Alignment. Science 262, 208–214 (1993)

[10] Matsuura, T., et al.: A Tabu Search for Extracting Motifs from DNA Sequences.

In: Proceedings of Nonlinear Circuits and Signal Processing 2004, pp. 347–350

(2004)

[11] Matsuura, T., et al.: A Tabu Search and Chaotic Search for Extracting Motifs

from DNA Sequences. In: Proceedings of Metaheuristics International Conference

2005, pp. 677–682 (2005)

[12] Matsuura, T., Ikeguchi, T.: Refractory Eﬀects of Chaotic Neurodynamics for Find-

ing Motifs from DNA Sequences. In: Corchado, E.S., Yin, H., Botti, V., Fyfe, C.

(eds.) IDEAL 2006. LNCS, vol. 4224, pp. 1103–1110. Springer, Heidelberg (2006)

[13] Matsuura, T., et al.: Analysis on Memory Eﬀect of Chaotic Dynamics for Com-

binatorial Optimization Problem. In: Proceedings of Metaheuristics International

Conference (2007)

[14] Shinomoto, S., et al.: Diﬀerences in Spiking Patterns Among Cortical Neurons.

Neural Computation 15, 2823–2842 (2003)

[15] Information available at, http://www.genome.jp/

A Thermodynamical Search Algorithm for

Feature Subset Selection

F´

elix F. Gonz´

alez and Llu´ıs A. Belanche

Languages and Information Systems Department

Polytechnic University of Catalonia, Barcelona, Spain

{fgonzalez,belanche}@lsi.upc.edu

Abstract. This work tackles the problem of selecting a subset of features

in an inductive learning setting, by introducing a novel Thermodynamic

Feature Selection algorithm (TFS). Given a suitable objective function,

the algorithm makes uses of a specially designed form of simulated an-

nealing to ﬁnd a subset of attributes that maximizes the objective func-

tion. The new algorithm is evaluated against one of the most widespread

and reliable algorithms, the Sequential Forward Floating Search (SFFS).

Our experimental results in classiﬁcation tasks show that TFS achieves

signiﬁcant improvements over SFFS in the objective function with a no-

table reduction in subset size.

Introduction

The main purpose of Feature Selection is to ﬁnd a reduced set of attributes from

a data set described by a feature set. This is often carried out with a subset

search process in the power set of possible solutions. The search is guided by the

optimization of a user-deﬁned objective function. The generic purpose pursued

is the improvement in the generalization capacity of the inductive learner by

reduction of the noise generated by irrelevant or redundant features.

The idea of using powerful general-purpose search strategies to ﬁnd a subset

of attributes in combination with an inductive learner is not new. There has been

considerable work using genetic algorithms (see for example [1], or [2] for a recent

successful application). On the contrary, simulated annealing (SA) has received

comparatively much less attention, probably because of the prohibitive cost,

which can be specially acute when it is combined with an inducer to evaluate

the quality of every explored subset. In particular, [3] developed a rule-based

system based on the application of a generic SA algorithm by standard bitwise

random mutation. We are of the opinion that such algorithms must be tailored

to feature subset selection if they are to be competitive with state-of-the-art

stepwise search algorithms.

In this work we contribute to tackling this problem by introducing a novel

Thermodynamic Feature Selection algorithm (TFS). The algorithm makes uses

of a specially designed form of simulated annealing (SA) to ﬁnd a subset of at-

tributes that maximizes the objective function. TFS has a number of distinctive

characteristic over other search algorithms for feature subset selection. First, the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 683–692, 2008.

c Springer-Verlag Berlin Heidelberg 2008

684

F.F. Gonz´

alez and L.A. Belanche

probabilistic capability of SA to accept momentarily worse solutions is enhanced

by the concept of an -improvement, explained below. Second, the algorithm is

endowed with a feature search window in the backward steps that limits the

neighbourhood size while retaining eﬃcacy. Third, the algorithm accepts any

objective function for evaluating the quality of generated subsets. In this paper

three diﬀerent inducers have been tried to assess this quality. TFS is evaluated

against one of the most reliable stepwise search algorithms, the Sequential For-

ward Floating Search method (SFFS). Our experimental results in classiﬁcation

tasks show that TFS notably improves over SFFS in the objective function in

all cases, with a very notable reduction in subset size, compared to the full set

size. We also ﬁnd that the compared computational cost is aﬀordable for small

number of features and much better as the number of features increases.

Feature Selection

Let X =

, . . . , x

}, n > 0 denote the full feature set. Without loss of gene-

rality, we assume that the objective function J :

P(X) → R

∪ {0} is to be

maximized, where

P denotes the power set. We denote by X

⊆ X a subset of

selected features, with

| = k. Hence, by deﬁnition, X

∅, and X

= X. The

feature selection problem can be expressed as: given a set of candidate features,

select a subset

deﬁned by one of two approaches:

1. Set 0 < m < n. Find X

⊂ X, such that J(X

) is maximum.

2. Set a value J

, the minimum acceptable J . Find the X

⊆ X with smaller

m such that J (X

)

≥ J

. Alternatively, given α > 0, ﬁnd the X

⊆ X with

smaller m, such that J (X

)

≥ J(X) or |J(X

)

− J(X)| < αJ(X).

In the literature, several suboptimal algorithms have been proposed for feature

selection. A wide family is formed by those algorithms which, departing from

an initial solution, iteratively add or delete features by locally optimizing the

objective function. Among them we ﬁnd the sequential forward generation (SFG)

and sequential backward generation (SBG), the plus-l-minus-r (also called plus

l - take away r) [12], the Sequential Forward Floating Search (SFFS) and its

backward counterpart SFBS [4]. The high trustworthiness and performance of

SFFS has been ascertained experimentally (see e.g. [5], [6]).

Simulated Annealing

Simulated Annealing (SA) is a stochastic technique inspired on statistical mechan-

ics for ﬁnding near globally optimum solutions to large (combinatorial) optimiza-

tion problems. SA is a weak method in that it needs almost no information about

the structure of the search space. The algorithm works by assuming that some

part of the current solution belongs to a potentially better one, and thus this part

Such an optimal subset of features always exists but is not necessarily unique.

A Thermodynamical Search Algorithm for Feature Subset Selection

685

should be retained by exploring neighbors of the current solution. Assuming the

objective function is to be minimized, SA can jump from hill to hill and escape or

simply avoid sub-optimal solutions. When a system S (considered as a set of pos-

sible states) is in thermal equilibrium (at a given temperature T ), the probability

T

(s) that it is in a certain state s depends on T and on the energy E(s) of s. This

probability follows a Boltzmann distribution:

(s) =

exp(

−

E(s)

)

, with Z =

∈S

exp

−

E(s)

where k is the Boltzmann constant and Z acts as a normalization factor. Metropo-

lis and his co-workers [7] developed a stochastic relaxation method that works by

simulating the behavior of a system at a given temperature T . Being s the current

state and s a neighboring state, the probability of making a transition from s to

s is the ratio P

(s

→ s ) between the probability of being in s and the probability

of being in s :

→ s ) =

(s )

(s)

= exp

−

ΔE

where we have deﬁned ΔE = E(s )

−E(s). Therefore, the acceptance or rejection

of s as the new state depends on the diﬀerence of the energies of both states

at temperature T . If P

(s )

≥ P

(s) then the “move” is always accepted. It

(s ) < P

(s) then it is accepted with probability P

(s, s ) < 1 (this situation

corresponds to a transition to a higher-energy state). Note that this probability

depends upon and decreases with the current temperature. In the end, there will

be a temperature low enough (the freezing point ), wherein these transitions will

be very unlikely and the system will be considered frozen.

In order to maximize the probability of ﬁnding states of minimal energy at

every temperature, thermal equilibrium must be reached. The SA algorithm pro-

posed in [8] consists on using the Metropolis idea at each temperature for a ﬁnite

amount of time. In this algorithm the temperature is ﬁrst set at a initially high

value, spending enough time at it so to approximate thermal equilibrium. Then

a small decrement of the temperature is performed and the process is iterated

until the system is considered frozen. If the cooling schedule is well designed, the

ﬁnal reached state may be considered a near-optimal solution.

TFS: A Thermodynamic Feature Selection Algorithm

In this section we introduce TFS (Thermodynamic Feature Selection). Consi-

dering SA as a combinatorial optimization process [10], TFS ﬁnds a subset of

attributes that maximizes the value of a given objective function. A special-

purpose feature selection mechanism is embedded into the SA tehcnique that

takes advantage of the probabilistic acceptance capability of worse scenarios

over a ﬁnite time. This characteristic is enhanced in TFS by the notion of an

-improvement: a feature -improves a current solution if it has a higher value of

the objective function or a value not worse than %. This mechanism is intended

686

F.F. Gonz´

alez and L.A. Belanche

to account for noise in the evaluation of the objective function due to ﬁnite sam-

ple sizes. The algorithm is also endowed with a feature search window (of size

l) in the backward step, as follows. In forward steps always the best feature is

added (by looking at all possible additions). In backward steps this search is lim-

ited to l tries at random (without repetition). The value of l is incremented by

one at every thermal equilibrium point. This mechanism is an additional source

of non-determinism and a bias towards adding a feature only when it is the best

option available. In contrast, to remove a feature, it suﬃces that its removal

-improves the current solution. A direct consequence is of course a considerable

speed-up of the algorithm. Note that the design of TFS is such that it grows

more and more deterministic, informed and costly as it converges towards the

ﬁnal conﬁguration.

Fig. 1. TFS algorithm for feature selection. For the sake of clarity, only those variables

that are modiﬁed are passed as parameters to Forward and Backward.

The pseudo-code of TFS is depicted in Figs. 1 to 3. The algorithm consists

of two major loops. The outer loop waits for the inner loop to ﬁnish and then

updates the temperature according to the chosen cooling schedule. When the

outer loop reaches T

min

, the algorithm halts. The algorithm keeps track of the

best solution found (which is not necessarily the current one). The inner loop

is the core of the algorithm and is composed of two interleaved procedures:

Forward and Backward, that iterate until a thermal equilibrium point is found,

represented by reaching the same solution before and after. These procedures

work independently of each other, but share information about the results of their

respective searches in the form of the current solution. Within them, feature

selection takes place and the mechanism to escape from local minima starts

A Thermodynamical Search Algorithm for Feature Subset Selection

687

PROCEDURE Forward (var Z, J

)

Repeat

x := argmax

{ J(Z ∪ {x

}) }, x

∈ X

\ Z

-improves (Z, x, true) then accept := true

else

ΔJ := J(Z

∪ {x}) − J(Z)

accept := rand(0, 1) < e

ΔJ

t

endif

If accept then Z := Z

∪ {x} endif

If J (Z) > J

then J

:= J(Z) endif

Until not accept

END Forward

Fig. 2. TFS Forward procedure (note Z, J

are modiﬁed)

PROCEDURE Backward (var Z, J

)

A :=

∅; A

∅

Repeat

For i := 1 to min(l,

|Z|) do

FUNCTION -improves (Z, x, b)

Select x

∈ Z \ A

randomly

RETURNS boolean

-improves (Z, x, f alse)

If b then Z := Z

∪ {x}

then A := A

∪ {x} endif

else Z := Z

\ {x} endif

B

:= A

∪ {x}

Δx := J(Z )

− J(Z)

EndFor

If Δx > 0 then return true

0

:= argmax

{J(Z \ {x})}, x ∈ A

else return

−Δx

J (Z)

endif

If x

∈ A then accept := true

END -improves

else

ΔJ := J(Z

\ {x

}) − J(Z)

accept := rand(0, 1) < e

ΔJ

endif

If accept then Z := Z

\ {x

0

} endif

If J (Z) > J

then J

:= J(Z) endif

Until not accept

END Forward

Fig. 3. Left: TFS Backward procedure (note Z, J

are modiﬁed and x

can be eﬃ-

ciently computed while in the For loop). Right: The function for -improvement.

working, as follows: these procedures iteratively add or remove features one at a

time in such a way that an -improvement is accepted unconditionally, whereas

a non -improvement is accepted probabilistically.

An Experimental Study

In this section we report on empirical work. There are nine problems, taken

mostly from the UCI repository and chosen to be a mixture of diﬀerent real-life

feature selection processes in classiﬁcation tasks. In particular, full feature size

688

F.F. Gonz´

alez and L.A. Belanche

ranges from just a few (17) to dozens (65), sample size from few dozens (86)

to the thousands (3,175) and feature type is either continuous, categorical or

binary. The problems have also been selected with a criterion in mind not very

commonly found in similar experimental work, namely, that these problems are

amenable to feature selection. By this it is meant that performance beneﬁts

clearly from a good selection process (and less clearly or even worsens with a

bad one). Their main characteristics are summarized in Table 1.

Table 1. Problem characteristics. Instances is the number of instances, Features is that

of features, Origin is real/artiﬁcial and Type is categorical/continuous/binary.

Name Instances Features Origin

Type

Breast Cancer (BC)

569

real

continuous

Ionosphere (IO)

351

34

real

continuous

Sonar (SO)

208

60

real

continuous

Mammogram (MA)

65

real

continuous

Kdnf (KD)

500

40

artiﬁcial

binary

Knum (KN)

500

30

artiﬁcial

binary

Hepatitis (HP)

129

17

real

categorical

Splice (SP)

3,175

60

real

categorical

Spectrum (SC)

187

22

real

categorical

5.1

Experimental Setup

Each data set was processed with both TFS and SFFS in wrapper mode [11], us-

ing the accuracy of several classiﬁers as the objective function : 1-Nearest Neigh-

bor (1NN) for categorical data sets; 1-Nearest Neighbor, Linear and Quadratic

discriminant analysis (LDA and QDA) for continuous data sets [12]. We chose

these for being fast, parameter-free and not prone to overﬁt (speciﬁcally we

discarded neural networks). Other classiﬁers (e.g. Naive Bayes or Logistic Re-

gression) could be considered, being this matter user-dependent. Decision trees

are not recommended since they perform their own feature selection in addition

to that done by the algorithm and can hinder an accurate assessment of the

results. The important point here is noting that the compared algorithms do not

speciﬁcally depend on this choice.

Despite taking more time, leave-one-out cross validation is used to obtain

reliable generalization estimates, since it is known to have low bias [9]. The

FS parameters are as follows:

= 0.01, T

= 0.1 and T

min

= 0.0001. These

settings were chosen after some preliminary trials and are kept constant for all

the problems. The cooling function was chosen to be geometric α(t) = ct, taking

c = 0.9, following recommendations in the literature [10]. Note that SFFS needs

the speciﬁcation of the desired size of the ﬁnal solution [4], acting also as a stop

criterion. This parameter is very diﬃcult to estimate in practice. To overcome

this problem, we let SFFS to run over all possible sizes 1 . . . n, where n is the size

of each data set. This is a way of getting the best performance of this algorithm.

In all cases, a limit of 100 hours computing time was imposed to the executions.

A Thermodynamical Search Algorithm for Feature Subset Selection

689

Table 2. Performance results. J

opt

is the value of the objective function for the ﬁnal

solution and k its size. For categorical problems only 1NN is used. Note NR are the

corresponding results with no reduction of features. The results for SFFS on the SP

data set (marked with a star) were the best after 100 hours of computing time, when

the algorithm was cut (this is the only occasion this happened).

TFS

SFFS

1NN

LDA

QDA

1NN LDA QDA

1NN

LDA

QDA

opt

k J

opt

k J

opt

k J

opt

k J

opt

BC 0.96 14 0.98 11 0.99 9

0.92

0.95

0.94 11 0.98 13 0.97 7

IO 0.95 12 0.90 13 0.95 8

0.87

0.85

0.87

0.93 13 0.89 6 0.94 14

SO 0.91 25 0.84 9 0.88 10

0.80

0.70

0.60

0.88 7 0.79 23 0.85 11

MA 0.91 6 0.99 10 0.94 6

0.70

0.71

0.62

0.84 16 0.93 6 0.93 5

SP 0.91 8 n/a n/a n/a n/a 0.78

n/a

0.90

∗

n/a n/a n/a n/a

SC 0.74 7 n/a n/a n/a n/a 0.64

n/a

0.73 9 n/a n/a n/a n/a

KD 0.70 11 n/a n/a n/a n/a 0.60

n/a

0.68 11 n/a n/a n/a n/a

KN 0.86 14 n/a n/a n/a n/a 0.69

n/a

0.85 12 n/a n/a n/a n/a

HP 0.91 4 n/a n/a n/a n/a 0.82

n/a

0.91 4 n/a n/a n/a n/a

A number of questions are raised prior to the realization of the experiments:

1. Does the Feature Selection process help to ﬁnd solutions of similar or better

accuracy using lower numbers of features? Is there any systematic diﬀerence

in performance for the various classiﬁers?

2. Does TFS ﬁnd better solutions in terms of the objective function J ? Does

it ﬁnd better solutions in terms of the size k?

5.2

Discussion of the Results

Performance results for TFS and SFFS are displayed in Table 2, including the

results obtained with no feature selection of any kind, as a reference. Upon

realization of the results we can give answers to the previously raised questions:

1. The feature selection process indeed helps to ﬁnd solutions of similar or

better accuracy using (much) lower numbers of features. This is true for both

algorithms and all of the three classiﬁers used. Regarding a systematic diﬀer-

ence in performance, for the classiﬁers, the results are non-conclusive, as can

reasonably be expected, being this matter problem-dependent in general [11].

2. In terms of the best value of the objective function, TFS outperforms SFFS

in all the tasks, no matter the classiﬁer, except in HP for 1NN, where there is

a tie. The diﬀerence is sometimes substantial (more than 10%). Recall SFFS

was executed for all possible ﬁnal size values and the ﬁgure reported is the best

overall. In this sense SFFS did never came across the subsets obtained by TFS

in the search process (otherwise they would have been recorded). It is hence

conjectured that TFS can have a better access to hidden good subsets than

SFFS does. In terms of the ﬁnal subset size, the results are quite interesting.

690

F.F. Gonz´

alez and L.A. Belanche

Both TFS and SFFS ﬁnd solutions of smaller size using QDA, then using LDA

and ﬁnally using 1NN, which can be checked by calculating the column totals.

FS gives priority to the optimization of the objective function, without any

speciﬁc concern for reducing the ﬁnal size, However, it ﬁnds very competitive

solutions both in terms of accuracy and size. This fact should not be overlooked,

since it is by no means clear that solutions of bigger size should have better value

of the objective function and viceversa. In order to avoid the danger of loosing

relevant information, many times it is better to accept solutions of somewhat

bigger size if these entail a signiﬁcantly higher value of the objective function.

Finally, it could be argued that the conclusion that TFS consistently yields

better ﬁnal objective function values is but optimistic. Possibly the results for

other sizes are consistently equal or even worse. To check whether this is the

case, we show the entire solution path for one of the runs, representative of the

experimental behaviour found (ﬁgs. 4). It is seen that for (almost) all sizes, TFS

oﬀers better accuracy, very specially at lowest values.

(9, 0.99)

(7, 0.97)

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1.00

2

4

6

8

10

12

14

16

18

20

22

24

26

28

30

K-TFS

Accuracy

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1.00

2

4

6

8

10

12

14

16

18

20

22

24

26

28

30

K-SFFS

TFS-QDC

SFFS-QDC

Fig. 4. Comparative full performance for all k with QDA on the BC data set. The best

solutions (those shown in Table 2) are marked by a bigger circle or square.

5.3

Computational Cost

The computational cost of the performed experiments is displayed in Table 3. It

is seen that SFFS takes less time for smaller sized problems (e.g. HP, BC, IO

and KN), whereas TFS is more eﬃcient for medium to bigger sized ones (e.g.

SO, MA, KD and SP), where time is more and more an issue. In this vain, the

two-phase interleaved mechanism for forward and backward exploration carries

out a good neighborhood exploration, thereby contributing to a fast relaxation

of the algorithm. Further, the setup of the algorithm is made easier than in

a conventional SA since the time spent at each temperature is automatically