Lecture Notes in Computer Science
Variable Selection for Multivariate Time Series Prediction with Neural Networks

Min Han and Ru Wei

School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
minhan@dlut.edu.cn
Abstract. This paper investigates variable selection with neural networks for multivariate time series prediction. A sensitivity analysis of the neural network error function with respect to the inputs is developed to quantify the saliency of each input variable. The input nodes with low sensitivity are then pruned along with their connections, which corresponds to deleting the redundant variables. The proposed algorithm is tested on both computer-generated time series and practical observations. Experimental results show that the proposed algorithm outperforms other variable selection methods, achieving a more significant reduction in the training data size and higher prediction accuracy.

Keywords: Variable selection, neural network pruning, sensitivity, multivariate prediction.

1 Introduction

Nonlinear and chaotic time series prediction is a practical technique for studying the characteristics of complicated dynamics from measurements. Multivariate inputs are usually required, since the output may depend not only on its own previous values but also on the past values of other variables. However, we cannot be sure that all of the variables are equally important: some of them may be redundant or even irrelevant. If these unnecessary input variables are included in the prediction model, the parameter estimation process becomes more difficult, and the overall results may be poorer than if only the required inputs were used [1]. Variable selection is the problem of discarding the redundant variables, which reduces the number of input variables and the complexity of the prediction model. A number of variable selection methods based on statistical or heuristic tools have been proposed, such as Principal Component Analysis (PCA) and Discriminant Analysis. These techniques attempt to reduce the dimensionality of the data by creating new variables that are linear combinations of the original ones.
The major difficulty with such methods is the separation of the variable selection process from the prediction process. Variable selection using a neural network is therefore attractive, since the variable selector can be adapted globally together with the predictor. Variable selection with a neural network can be seen as a special case of architecture pruning [2], where pruning an input node is equivalent to removing the corresponding variable from the original data set. One approach to pruning is to estimate the sensitivity of the output to the exclusion of each unit. There are several ways to perform sensitivity analysis with a neural network. Most of them are weight-based [3], resting on the idea that weights connected to important variables attain large absolute values while weights connected to unimportant variables attain values near zero. However, smaller weights usually result in smaller inputs to neurons and thus larger sigmoid derivatives, which increases the output sensitivity to the input. Mozer and Smolensky [4] introduced a method that estimates which units are least important and can be deleted during training. Gevrey et al. [5] compute the partial derivatives of the neural network output with respect to the input neurons and compare the performance of several different methods for evaluating the relative contribution of the input variables. This paper concentrates on a neural-network-based variable selection algorithm as the tool to determine which variables are to be discarded. A simple sensitivity criterion of the neural network error function with respect to each input is developed to quantify the saliency of each input variable. The input nodes are then arranged in decreasing sensitivity order, so that the neural network can be pruned efficiently by discarding the last items with low sensitivity.
The variable selection algorithm is then applied to both computer-generated data and practical observations, and is compared with the PCA variable reduction method. The rest of this paper is organized as follows. Section 2 reviews the basic concepts of multivariate time series prediction and a statistical variable selection method. Section 3 explains the sensitivity analysis with neural networks in detail. Section 4 presents two simulation results. The work is concluded in Section 5.
2 Modeling Multivariate Chaotic Time Series

The basic idea of chaotic time series analysis is that a complex system can be described by a strange attractor in its phase space. Therefore, the reconstruction of an equivalent state space is usually the first step in chaotic time series prediction.

2.1 Multivariate Phase Space Reconstruction

Phase space reconstruction from observations can be accomplished by choosing a suitable embedding dimension and time delay. Consider an M-dimensional time series {X_i, i = 1, 2, …, M}, where X_i = [x_i(1), x_i(2), …, x_i(N)]^T and N is the length of each scalar time series. As in the case of a univariate time series (where M = 1), the reconstructed phase space can be written as [6]

X(t) = [x_1(t), x_1(t − τ_1), …, x_1(t − (d_1 − 1)τ_1), …, x_M(t), x_M(t − τ_M), …, x_M(t − (d_M − 1)τ_M)]  (1)

where t = L, L + 1, …, N, with L = max_{1≤i≤M}(d_i − 1)·τ_i + 1, and τ_i and d_i (i = 1, 2, …, M) are the time delay and embedding dimension of each time series, respectively. The delay time τ_i can be calculated with the mutual information method, and the embedding dimension d_i with the false nearest neighbor method.
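Assuming the delays τ_i and embedding dimensions d_i are already known (e.g. from the mutual information and false nearest neighbor methods), the reconstruction of Eq. (1) can be sketched as follows; this is a minimal illustration, and the function name and toy series are our own, not the authors':

```python
import numpy as np

def reconstruct_phase_space(series, delays, dims):
    """Build the multivariate delay-embedding matrix of Eq. (1).

    series : list of M 1-D arrays, each of length N
    delays : list of M time delays tau_i (in samples)
    dims   : list of M embedding dimensions d_i
    Returns an array whose rows are the reconstructed state vectors X(t).
    """
    N = len(series[0])
    # earliest usable index (0-based): L = max_i (d_i - 1) * tau_i
    L = max((d - 1) * tau for d, tau in zip(dims, delays))
    rows = []
    for t in range(L, N):
        vec = []
        for x, tau, d in zip(series, delays, dims):
            # [x_i(t), x_i(t - tau_i), ..., x_i(t - (d_i - 1) tau_i)]
            vec.extend(x[t - k * tau] for k in range(d))
        rows.append(vec)
    return np.asarray(rows)

# toy usage: two series with tau = (1, 2) and d = (2, 3)
x1 = np.sin(0.1 * np.arange(100))
x2 = np.cos(0.1 * np.arange(100))
X = reconstruct_phase_space([x1, x2], delays=[1, 2], dims=[2, 3])
print(X.shape)  # (96, 5): D = d1 + d2 = 5 columns
```

Each row of the returned matrix is one state vector X(t), with D = Σ d_i components.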
According to Takens' embedding theorem, if D = Σ_{i=1}^{M} d_i is large enough, there exists a mapping F such that X(t + 1) = F(X(t)). The evolution X(t) → X(t + 1) then reflects the evolution of the original dynamical system. The problem is to find an appropriate expression for the nonlinear mapping F. Many chaotic time series prediction models have been developed to date; neural networks have been widely used because of their universal approximation capabilities.
2.2 Neural Network Model

A multilayer perceptron (MLP) with the back-propagation (BP) algorithm is used as a nonlinear predictor for multivariate chaotic time series. The MLP is trained by supervised learning to minimize the mean square error between the computed output of the neural network and the desired output. The network usually consists of three layers: an input layer, one or more hidden layers, and an output layer. Consider a three-layer MLP with one hidden layer. The D-dimensional delayed time series X(t) is used as the input of the network to generate the network output X(t + 1). The neural network can then be expressed as

o_j = f(Σ_{i=0}^{N_I} w_ij^(I) x_i)  (2)

y_k = Σ_{j=1}^{N_H} w_jk^(O) o_j  (3)

where [x_1, x_2, …, x_{N_I}] = X(t) denotes the input signal, N_I is the number of inputs to the neural network, w_ij^(I) is the weight connecting the ith input neuron to the jth hidden neuron, o_j is the output of the jth hidden neuron, N_H is the number of neurons in the hidden layer, [y_1, y_2, …, y_{N_O}] = X(t + 1) is the output, N_O is the number of output neurons, and w_jk^(O) is the weight connecting the jth hidden neuron to the kth output neuron. The activation function f(·) is the sigmoid function

f(x) = 1 / (1 + exp(−x))  (4)

The error function of the network is usually defined as the sum of squared errors

E = Σ_{t=1}^{N} Σ_{k=1}^{N_O} [y_k(t) − p_k(t)]²  (5)

where p_k(t) is the desired output of unit k and N is the length of the training sample.
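Equations (2)-(5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the layer sizes and random weights are arbitrary, and the bias term (the i = 0 input in Eq. (2)) is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    # Eq. (4): f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N_I, N_H, N_O = 5, 8, 2              # input / hidden / output sizes (arbitrary)
W_in = rng.normal(size=(N_I, N_H))   # w_ij^(I), input-to-hidden weights
W_out = rng.normal(size=(N_H, N_O))  # w_jk^(O), hidden-to-output weights

def forward(x):
    o = sigmoid(x @ W_in)            # Eq. (2): hidden activations o_j
    y = o @ W_out                    # Eq. (3): linear output y_k
    return y

def sse(X, P):
    # Eq. (5): sum over samples t and outputs k of [y_k(t) - p_k(t)]^2
    Y = np.vstack([forward(x) for x in X])
    return float(np.sum((Y - P) ** 2))

X = rng.normal(size=(10, N_I))       # 10 delayed state vectors
P = rng.normal(size=(10, N_O))       # desired outputs
print(sse(X, P) >= 0.0)  # True
```

Training would then adjust W_in and W_out by back-propagating the gradient of this error, as the BP algorithm prescribes.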
2.3 Statistical Variable Selection Method

For a multivariate time series, the dimension of the reconstructed phase space is usually very high, and increasing the number of input variables makes the prediction model highly complex. Therefore, in many practical applications variable selection is needed to reduce the dimensionality of the input data. The aim of variable selection in this paper is to select a subset of R inputs that retains most of the important features of the original input set; the remaining inputs are discarded as irrelevant.

Principal Component Analysis (PCA) is a traditional technique for variable selection [7]. PCA reduces the dimensionality by first decomposing the normalized input vector X with the singular value decomposition (SVD)

X = U Σ V^T  (6)

where Σ = diag[s_1 s_2 … s_p 0 … 0], s_1 ≥ s_2 ≥ … ≥ s_p are the first p singular values of X arranged in decreasing order, and U and V are orthogonal matrices. The first k singular values are then preserved as the principal components, and the final input is obtained as

X̂ = U_k X  (7)

where U_k consists of the first k rows of U.

PCA is an efficient method for reducing the input dimension. However, because the variable selection and prediction processes are carried out separately, we cannot be sure that the discarded factors have no influence on the prediction output. A neural network selector is a good way to combine the selection and prediction processes.

3 Sensitivity Analysis with Neural Networks

Variable selection with neural networks can be achieved by pruning the input nodes of a neural network model according to a saliency measure, with the aim of removing the less relevant variables. The significance of a variable can be defined as the error when the unit is removed minus the error when it is left in place:

S_i = E_WithoutUnit_i − E_WithUnit_i = E(x_i = 0) − E(x_i)  (8)

where E is the error function defined in Eq. (5). After the neural network has been trained, a brute-force pruning method is to set each input in turn to zero and evaluate the change in the error: if the error increases too much, the input is restored; otherwise it is removed. In principle this can be done by training the network on all possible subsets of the input set, but such an exhaustive search is computationally infeasible and can be very slow for large networks. This paper follows the idea of Mozer and Smolensky [4] and approximates the sensitivity by introducing a gating term α_i for each unit such that
o_j = f(Σ_i α_i w_ij o_i)  (9)

where o_j is the activity of unit j and w_ij is the weight from unit i to unit j. The gating terms are shown in Fig. 1, where α_i^I (i = 1, 2, …, N_I) is the gating term of the ith input neuron and α_j^H (j = 1, 2, …, N_H) is the gating term of the jth hidden neuron.
Fig. 1. The gating term for each unit

The gating term α is merely a notational convenience rather than a parameter that must be implemented in the network: if α = 0, the unit has no influence on the network; if α = 1, the unit behaves normally. The importance of a unit is then approximated by the derivative

S_i = −(∂E/∂α_i)|_{α_i = 1}  (10)

Using the standard error back-propagation algorithm, this derivative can be expressed in terms of the network weights as follows:
S_j^H = −∂E/∂α_j^H = −Σ_{t=1}^{N} Σ_{k=1}^{N_O} (∂E/∂y_k)(∂y_k/∂α_j^H) = 2 Σ_{t=1}^{N} Σ_{k=1}^{N_O} [p_k(t) − y_k(t)] w_jk^(O) o_j  (11)

S_i^I = −∂E/∂α_i^I = 2 Σ_{t=1}^{N} Σ_{k=1}^{N_O} [p_k(t) − y_k(t)] Σ_{j=1}^{N_H} w_jk^(O) o_j (1 − o_j) w_ij^(I) x_i(t)  (12)

where S_i^I is the sensitivity of the ith input neuron and S_j^H is the sensitivity of the jth hidden neuron. The algorithm can thus prune the input nodes as well as the hidden nodes according to their sensitivities during training. However, because of the approximation in Eq. (10), the sensitivity computed directly from Eq. (11) and Eq. (12) fluctuates strongly, and inputs may sometimes be deleted incorrectly. To reduce the dimensionality of the input vectors reliably, the sensitivity therefore needs to be evaluated over the entire training set. This paper considers several ways to define the overall sensitivity:

(1) The mean square average sensitivity:

S_{i,avg} = (1/T) Σ_{t=1}^{N} S_i²(t)  (13)

where T is the number of data points in the training set.
(2) The absolute value average sensitivity:

S_{i,abs} = (1/T) Σ_{t=1}^{N} |S_i(t)|  (14)

(3) The maximum absolute sensitivity:

S_{i,max} = max_{1≤t≤N} |S_i(t)|  (15)

Any of the sensitivity measures in Eqs. (13)-(15) provides a useful criterion for deciding which inputs to delete. For succinctness, this paper uses the mean square average sensitivity as an example. An input with a low sensitivity has little or no influence on the prediction accuracy and can therefore be removed.

To obtain a more efficient criterion for pruning inputs, the sensitivity is normalized. Define the absolute sum of the sensitivities over all input nodes

S = Σ_{i=1}^{N_I} |S_i|  (16)

Then the normalized sensitivity of each unit is

Ŝ_i = S_i / S  (17)

where the normalized value Ŝ_i lies in [0, 1]. The input variables are then arranged in decreasing sensitivity order:

Ŝ_1 ≥ Ŝ_2 ≥ … ≥ Ŝ_{N_I}  (18)

The larger values of Ŝ_i (i = 1, 2, …, N_I) indicate the important variables. Define the sum of the first k terms of the sensitivity, η_k, as

η_k = Σ_{j=1}^{k} Ŝ_j  (19)

where k = 1, 2, …, N_I. Choosing a threshold value 0 < η_0 < 1, if η_k ≥ η_0, the first k inputs are preserved as the principal components and the remaining inputs with low sensitivity are removed. The number of variables retained increases as the threshold η_0 increases.
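The normalization and threshold rule of Eqs. (16)-(19) can be sketched as follows. This is an illustrative sketch with made-up sensitivity values; the function name is our own.

```python
import numpy as np

def select_inputs(S, eta0=0.9):
    """Rank inputs by sensitivity and keep the smallest prefix whose
    cumulative normalized sensitivity reaches the threshold eta0
    (Eqs. (16)-(19)); returns the indices of the inputs to keep."""
    S = np.abs(np.asarray(S, dtype=float))
    S_hat = S / S.sum()                      # Eqs. (16)-(17): normalize to [0, 1]
    order = np.argsort(S_hat)[::-1]          # Eq. (18): decreasing sensitivity
    eta = np.cumsum(S_hat[order])            # Eq. (19): cumulative eta_k
    k = int(np.searchsorted(eta, eta0) + 1)  # first k with eta_k >= eta0
    return np.sort(order[:k])

# toy sensitivities for 6 inputs: inputs 1 and 3 dominate
S = [0.05, 4.0, 0.1, 3.0, 0.02, 0.4]
print(select_inputs(S, eta0=0.9))  # → [1 3]
```

Raising eta0 toward 1 retains more variables, matching the observation above that the number of remaining variables grows with the threshold.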