Combining VAR forecast densities using fast Fourier transform



Introduction
Combining predictions of macroeconomic aggregates and financial variables from various models is a popular technique in applied macroeconomic research. One advantage, among others, that speaks for combining different models is that it provides insurance against selecting inappropriate models, as Bache et al. (2009) emphasize.
Traditional econometrics may suggest selecting the one model that appears best in terms of, say, information criteria or relevant p-values. Information from other models is then discarded a priori. In contrast, combining competing models introduces uncertainty about the model structure: the outcomes of each single model are taken into account with an appropriate weight, and no prior information is ignored.
The first attempts to combine forecasts of different candidate models led to combinations of point forecasts; see Bates and Granger (1969). A logical generalization found in the literature is the combination of interval forecasts. Kascha and Ravazzolo (2009) provide a useful overview of the most popular methods in this area. The authors, however, rely on the normality of the estimated forecasts, for which there is a good reason: the time that would otherwise be necessary to simulate all candidate models.
Prior model uncertainty is a feature that can be fully exploited from a Bayesian perspective, an approach usually labeled Bayesian model averaging (BMA), as in Andersson and Karlsson (2007) or Hoeting et al. (1999): each model is a priori assigned a weight in the form of a probability distribution, and data are then used to transform the prior beliefs into posterior estimates.
In this paper I propose an estimate of a combined GDP forecast for the Czech economy based on a broad set of vector autoregressive models (VARs). Following Stock and Watson (2004), among others, I choose ex post prediction as the main criterion for comparing competing candidate models. Most central European economies suffer from short available time series, which suggests that normal approximations of any kind can lead to underestimating the variances of relevant model outcomes.
Monte Carlo methods are therefore fully utilized instead, which makes the analysis computationally demanding. The aim of this paper, on the other hand, is to show that simulated predictive densities can be combined in light of convolution theory, concretely via the discrete Fourier transform, for which a fast algorithm exists (the fast Fourier transform, in short FFT). This alone would not make the analysis appealing, since the FFT itself is not trivial to compute, especially if large input vectors need to be combined repeatedly. The main finding of this paper, however, is that the FFT works well even if the predictive densities of the candidate models are poorly simulated with only, say, 100 iterations. While such sparse simulation would not lead to meaningful results under usual circumstances, it is shown below that the FFT can reduce simulation time significantly while keeping the quality of the results very high.
The FFT algorithm is often used in technical applications, e.g. in digital signal processing, and filtering techniques are commonly used in macroeconomics to extract business cycle information from various time series. To the best of my knowledge, however, the FFT has not yet been used for combining predictive densities.
Even though this paper is based on a Bayesian way of thinking, it does not follow a pure Bayesian model averaging approach, since at certain stages a frequentist approach is taken as well.
The paper is further organized as follows: section 2 introduces the set of competing models that enter the analysis; section 3 describes the specifics of the Czech economy that affect the modeling approach taken here and the data used; section 4 describes a technique for selecting a representative model subset, since simulating all models considered would not be achievable in reasonable time; section 5 shows how the candidate models can be evaluated with respect to their predictive power and describes the combination of candidate models via the fast Fourier transform; section 6 summarizes the results of the Czech GDP prediction and section 7 concludes.

Suite of candidate models
As stated above, the focus of this paper is on combining predictions from a broadly defined class of VAR models. From a technical point of view I consider in turn traditional VARs (see e.g. Hamilton, 1994, or Enders, 2004) and Bayesian VARs, BVARs (see e.g. Lütkepohl, 2007, or Canova, 2007), of the form

y_t = A_1 y_{t-1} + … + A_p y_{t-p} + w_t,

where y_t is a vector of k endogenous variables and w_t is a k-dimensional white noise process for which the following facts hold: E(w_t) = 0, E(w_t w_t') = Σ_w and E(w_t w_s') = 0 for s ≠ t.
In the case of BVARs, the scheme of prior distributions in the Minnesota style proposed by Doan, Litterman and Sims (1984) or Litterman (1986) fits models where the variables are in levels: the regression parameters have a priori zero mean, and only the first-order autoregressive parameter in each equation has a prior mean equal to one. Time series are thus treated as independent random walks. In this paper, by contrast, variables are modeled as stationary processes, and therefore, depending on the type of imposed stationarity, I suggest a prior zero mean for all coefficients in difference stationary processes and AR(1) coefficients equal to 0.8 for trend stationary processes.
Following Lütkepohl (2007), the covariance matrix of the BVAR parameters is diagonal,1 with diagonal elements

v_ij,l = (λ/l)^2                  for i = j,
v_ij,l = (λθσ_i/(lσ_j))^2         for i ≠ j,

where v_ij,l is the prior variance of the i-th coefficient in the j-th equation, l stands for the lag, λ is the tightness parameter, θ is the decay parameter from the interval (0;1) and σ_i is the i-th diagonal element of the covariance matrix of residuals; the ratio σ_i/σ_j normalizes the different scales of variables i and j.
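A minimal sketch of how such a prior variance could be filled in, assuming the common two-case Minnesota form (own lags vs. cross-variable lags); the function name and the exact placement of λ, θ and σ are illustrative assumptions, not necessarily the paper's exact specification:

```python
def minnesota_prior_variance(i, j, l, lam, theta, sigma):
    """Prior variance of the i-th coefficient in the j-th equation at lag l.

    lam   -- overall tightness parameter (lambda)
    theta -- decay / cross-variable parameter from (0, 1)
    sigma -- residual standard deviations, used to normalize the
             different scales of the variables
    """
    if i == j:
        # own lags: prior tightens as the lag grows
        return (lam / l) ** 2
    # coefficients on other variables are shrunk further by theta
    return (lam * theta * sigma[i] / (l * sigma[j])) ** 2
```

In practice such a function would be called for every coefficient to build the diagonal of the prior covariance matrix.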
Since only the forecasting performance of the models matters, the models were estimated in reduced form without the need to identify structural parameters.
1 The diagonal form of the covariance matrix of regression parameters corresponds to the VAR model represented in compact form.

Czech specifics and used data
As mentioned in the introduction, the suggested method of combining predictive densities is demonstrated on the prediction of Czech GDP. The Czech Republic is a small open economy that adopted inflation targeting in 1998, when it abandoned its fixed exchange rate regime. Increasing foreign demand has proved to be a key factor in the sound development of the Czech economy over the last two decades. The financial turmoil experienced at the end of 2008 hit the economy fully in 2009 through a decrease in foreign demand. The vulnerability of the Czech economy nowadays also emanates from the hardly predictable exchange rate. Relevant domestic time series exist for the past 15 years. With this in mind I construct a set of models that have proved useful in modeling closed economies and a rather larger set of models that allow for the transmission of shocks from abroad; the link between the domestic economy and EU members is captured via export/import prices and the exchange rate (against the euro). The models considered contain 2 to 4 equations; a simple AR process for GDP is also assumed. The maximum lag length was set in turn to 1 to 4 quarters, as the domestic time series hardly allow for longer lags. The variables considered include, e.g., domestic/foreign GDP, inter-bank market interest rates, inflation measures (Czech CPI, European HICP), the dynamics of newly issued credit and the unemployment rate.
Since GDP is a complex indicator of the economy's output that comprises heterogeneous components, isolated blocks of GDP were studied in separate sub-models for private and government consumption. Since the analysis is focused on GDP prediction, all GDP components and their forecast densities can be transformed into GDP with the help of an error correction model (ECM). Stationarity of the modeled variables was achieved via differencing or with the help of the band-pass filter suggested by Baxter and King (1995). The parameters of the BK filter were set so as to extract business cycle information corresponding to the spectrum of frequencies between 6 and 32 quarters, as Canova (2007) proposes.
The number of all considered model alternatives is quite large: 1408 in total. My ambition here was to combine whole predictive densities, not just point forecasts, and simulating all 1408 models would be a tedious task. Hence, while exploiting the power of MCMC methods, I tried to create a representative subset of candidate models. The selected models then served as the objects of simulation in the subsequent analysis. The procedure of model selection is described in the next section.

Selecting a representative model subset
A simulation sufficient to estimate the moment characteristics of the predicted endogenous variable (GDP) with high precision is time demanding. For the residual bootstrap, which I applied to VARs in reduced form, Efron (1979) suggests several thousand simulation iterations. For BVARs, prior coefficient normality results in multivariate normal posterior estimates due to the principle of conjugacy, and the intuitive Gibbs sampler can be used to generate draws from the relevant distributions (see Lancaster, 2005). This technique, however, has the Markov chain property, and the generated draws are correlated. Taking, say, every tenth draw can minimize the degree of correlation.
Unfortunately, the simulation time then increases tenfold. All these facts indicate that in order to keep the simulation time manageable it is necessary to reduce the model space.
Canova (2007) and Lancaster (2005) offer a heuristic from the class of MCMC methods: the Metropolis algorithm. This technique overlaps to a great extent with the simulated annealing algorithm. All we need is to specify a criterion that will be minimized and a simple rule stating how the space of candidate models will be explored. The candidate models are encoded in a binary vector M, where 1 = 'partial model will be included in the final combined model' and 0 = 'partial model will be discarded'.
Slight changes in the vector M (random substitution of 1 for 0 and vice versa) serve as the means of exploring the space of competing models.
The criterion function has three components: (i) a measure of the mean absolute deviation of the selected models from reality, (ii) a measure of the symmetry of the selected models' predictions around reality and (iii) the number of sub-models approved for the combined prediction. Minimizing component (iii) is crucial, since the key goal of this analysis is to cut down the number of models while not violating (i) and (ii) too much. The idea behind the construction of each component is briefly described below.
Measure (i), the mean absolute deviation (MAD) of ex post predictions from reality, is evaluated only over the models considered in a given iteration (those with M_j = 1), i.e. as the average of |ŷ_j^h − y^h| over the selected models, where j indexes the partial models, h stands for the h-step-ahead forecast, and ŷ_j^h and y^h denote the model prediction and reality, respectively. Over the whole testing period this indicator can be aggregated over all h-step-ahead forecasts (h running from 1 to 4) with a discount factor δ, which dampens the effect of deviations between models and reality at longer horizons. Besides MAD it is possible, and perhaps desirable, to distinguish models that constantly overshoot reality from models that fluctuate around reality over the prediction horizon. This idea is captured in measure (ii), where a model that overestimates reality in two quarters and underestimates it in the other two is considered more valuable than a model that under- or overshoots four times. Aggregation over the selected models yields the second component, where x_j is the number of overshoots of the j-th model.
The third component counts the number of sub-models approved for the combined prediction, E'M, where E is a vector of ones of length 1408.
An aggregate criterion function, Φ, is then an additive combination of (1), (2) and (3) with a vector of weights satisfying w_i ∈ (0;1), Σ_i w_i = 1. This function may look intimidating, but on top of that it is necessary to normalize each component, C, according to (C − C_min)/(C_max − C_min) in order to obtain meaningful results.
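The discounted MAD aggregation and the min-max normalization above can be sketched numerically; the array shapes, the function names and δ = 0.9 are illustrative assumptions:

```python
import numpy as np

def discounted_mad(pred, real, M, delta=0.9):
    """Component (i): MAD of the selected models' h-step-ahead
    predictions (columns h = 1..4) from reality, aggregated over
    horizons with weight delta**h.

    pred -- (n_models, 4) array of 1- to 4-step-ahead forecasts
    real -- (4,) realized values
    M    -- binary inclusion vector of length n_models
    """
    sel = M.astype(bool)
    dev = np.abs(pred[sel] - real).mean(axis=0)   # MAD per horizon
    h = np.arange(1, pred.shape[1] + 1)
    return float((delta ** h * dev).sum())

def normalize(c, c_min, c_max):
    """Min-max normalization applied to each criterion component."""
    return (c - c_min) / (c_max - c_min)
```

Component (iii) is then simply M.sum(), i.e. E'M.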
The Metropolis algorithm, as suggested by Canova (2007) and Lancaster (2005), proceeds iteratively: in each iteration i a new candidate vector M is proposed by randomly flipping elements of the current vector and the criterion Φ_i is evaluated. If Φ_i < Φ_{i−1}, the suggested M is accepted; otherwise M is accepted with a probability of min{1; Φ_{i−1}/Φ_i}. If i = i_max the algorithm stops, else it proceeds with the next iteration.
With this routine I managed to cut down the number of selected models from 1408 to 159. Initial values were set as follows: i_max = 10000, δ ~ Uniform(0.8;1), w_1 = 0.05, w_2 = 0.8. The posterior mode for δ came out at 0.93. An overview of the selected models can be found in the Appendix at the end of this paper.
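The routine above can be sketched as follows; the toy criterion phi (a made-up trade-off between average MAD and the number of selected models) stands in for the full three-component Φ, so all numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

n_models = 40
mad = rng.uniform(0.1, 1.0, n_models)      # made-up per-model MAD scores

def phi(M):
    """Toy criterion: average MAD of the selected models plus a
    penalty on the number of selected models (component (iii))."""
    if M.sum() == 0:
        return np.inf
    return mad[M.astype(bool)].mean() + 0.01 * M.sum()

M = np.ones(n_models, dtype=int)           # start with all models included
cur = phi(M)
best_M, best = M.copy(), cur
for _ in range(5000):
    cand = M.copy()
    cand[rng.integers(n_models)] ^= 1      # random bit flip explores the space
    c = phi(cand)
    # accept improvements always, worse moves with prob min(1, old/new)
    if c < cur or rng.random() < min(1.0, cur / c):
        M, cur = cand, c
        if cur < best:
            best_M, best = M.copy(), cur
```

The acceptance rule mirrors the one in the text: a candidate with a lower criterion value is accepted outright, a worse one only with probability Φ_{i−1}/Φ_i, which lets the search escape local minima.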

Combining predictive densities
The set of 159 representative candidate models was selected in the preceding analysis with respect to their relative success in ex post point predictions. These models can now be simulated over moving time windows to provide a series of both ex post and ex ante predictions, including relevant confidence bands. Residual bootstrapping7 was fully utilized for the traditional reduced-form VAR models, while Gibbs sampling helped in the case of BVARs. Gerard and Nimark (2008) propose an expert linear opinion pool which combines the outputs of the various models into a final distribution of the relevant variable (GDP),

p(y) = Σ_m p(M_m) p(y | M_m),

where p(M_m) is the prior belief in the m-th model.
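A linear opinion pool over predictive histograms on a common grid can be sketched as below; the function name and the example weights are illustrative:

```python
import numpy as np

def opinion_pool(densities, weights):
    """Combine candidate models' predictive histograms (rows of
    `densities`, all on the same grid) into one final density using
    prior model probabilities `weights` (renormalized to sum to one)."""
    densities = np.asarray(densities, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ densities
```

Because the pool is linear, the combined histogram still sums to one whenever each input histogram does.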

Combining predictions using Fourier transform
The foregoing combination via the expert linear opinion pool works well provided that all partial models yield sufficiently accurate outcomes. Insufficient simulation usually causes the moment characteristics under consideration to be biased. One could simply conclude that there is a trade-off between the quality of model outcomes on the one hand and the necessary simulation time on the other. I would like to argue, though, that the discrete Fourier transform (DFT) provides a convenient framework in which poorly simulated models can be combined without a severe loss of quality in the results. As a consequence, a significant portion of simulation time can be saved.
The underlying idea is as follows. Results from the different sub-models, in the form of histograms, are combined in a series of convolutions. We can exploit the fact that the convolution of two random variables is easily computed once both variables are transformed into the frequency domain, where convolution becomes a simple task of element-by-element multiplication.
7 It is this time-consuming bootstrap and Gibbs sampling that required the number of candidate models to be cut down to 159.
The whole procedure for combining two variables can be summarized in these steps: (1) we apply the DFT algorithm to each variable in order to convert the underlying histograms into their frequency domain counterparts, (2) the transformed variables are multiplied element by element, (3) we apply the inverse DFT to convert the result back into the original domain, and (4) the average distribution of both variables is obtained by a horizontal adjustment. Say we have simulated histograms of variables X and Y at hand and need to know the distribution of (X+Y)/2, i.e. an average distribution. The first three steps of the algorithm evaluate only the distribution of (X+Y); each position of the resulting histogram therefore needs to be shifted horizontally in order to obtain the true distribution of (X+Y)/2. Asmar (2005) gives the analytical formula of the DFT of a sequence x_0, …, x_{N−1}:

X_k = Σ_{n=0}^{N−1} x_n e^{−i2πkn/N},  k = 0, …, N−1.

Figures 1 and 2 illustrate the same kind of experiment; the only difference is in the number of simulation iterations used to produce the graphs. Random variables X and Y are drawn independently from a normal distribution with mean 10 and variance 0.5. The question at hand is what the distribution of the random variable X+Y is. In this trivial example it is easy to deduce analytically that the distribution of X+Y will again be normal, with mean 20 and variance equal to 1, which follows from the independence of X and Y. It is, however, interesting to see how this distribution is approximated by 'plain' simulation and by the FFT improvement in the spirit of this section for 2000 and 100 simulation iterations, respectively. In figure 1 both partial histograms for X and Y approximate relatively well the bubble charts that represent the true Gaussian curves. As a result, the variable (X+Y) is also approximated with a very high degree of precision. Note that convolution via FFT is in this case useless, since the same result can be achieved without it.
The situation in figure 2 is very different. As the partial histograms for X and Y approximate the true Gaussian curves relatively poorly, the simulated distribution of (X+Y) is also of poor quality; it even looks bimodal. On the other hand, convolution via FFT takes the same inputs for X and Y, yet the result, represented by the solid line, closely traces the true Gaussian curve of (X+Y). While the number of simulation iterations in this case is insufficient for constructing a simulated distribution of (X+Y) directly, it is quite sufficient for convolving X and Y with only a minor loss in quality.
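The experiment of figure 2 can be reproduced in a few lines; the grid bounds, the bin count and the seed below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Poorly simulated inputs: only 100 draws of each variable
x = rng.normal(10.0, np.sqrt(0.5), 100)
y = rng.normal(10.0, np.sqrt(0.5), 100)

# Histograms of X and Y on a common grid (64 bins, a power of two)
edges = np.linspace(7.0, 13.0, 65)
px, _ = np.histogram(x, bins=edges)
py, _ = np.histogram(y, bins=edges)
px = px / px.sum()
py = py / py.sum()

# Steps (1)-(3): DFT, element-by-element product, inverse DFT.
# Zero-padding to 128 avoids wrap-around (the convolution has 127 points).
n = 128
pz = np.fft.irfft(np.fft.rfft(px, n) * np.fft.rfft(py, n), n)[:127]

# Support of X + Y: bin midpoints add up
mids = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]
support = 2 * mids[0] + width * np.arange(127)

mean_xy = float((support * pz).sum())  # close to the true mean of 20
# Step (4): the distribution of (X+Y)/2 lives on support / 2
```

Even with only 100 draws per variable, the convolved histogram pz is smooth and centered near 20, while the directly simulated histogram of the 100 sums x + y would still be ragged.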
The only remaining question is when it makes sense to proceed with the FFT after a simulation and when it does not generate any added value. According to figures 1 and 2, simulating 100 times and applying the FFT afterwards makes perfect sense, because it is improbable (in real applications, not in this simple one) that the FFT would take more time than carrying out the remaining 1900 iterations, given that we know in advance that 2000 iterations would make the FFT redundant (the situation of figure 1).
As a real example, I consider a pair of VAR models with the maximum lag length set to 3. Regardless of the variables involved, both of these models are in turn processed through a bootstrapping sequence in order to provide simulated predictive densities. This exercise is repeated for a varying number of simulation iterations, and the computational time necessary to accomplish each simulation cycle is stored. In the end, the simulation-based approach and the one with the FFT extension are compared in terms of computational time and degree of precision (measured by the mean absolute deviation of both simulation variants from the true distribution). The results of this analysis are presented in figure 3.
Graph (A) shows how the overall precision of both simulation approaches depends on the number of iterations conducted during bootstrapping (running from 50, in steps of one, up to 550 iterations). It can be seen that whenever the FFT step is performed after the regular simulation batch, the deviation of the results from reality is roughly halved (which doubles the accuracy), especially for lower numbers of executed iterations.
Of course, the additional FFT step is costly in terms of time: graph (B) shows how long it takes to carry out a given number of simulation iterations, and the comparison holds regardless of whether the time units on the vertical axis are seconds, hours or days. It took on average 0.004 seconds to carry out one estimation cycle of the residual bootstrap with the VAR models under consideration.

FFT drawbacks
For the FFT algorithm to work efficiently it is necessary to provide input vectors whose length is an integer power of two (2, 4, 8, 16, 32, 64, …). This can be arranged by padding the input vectors with zeros when needed. Another disadvantage is that the FFT can convolve two vectors according to w_1 X + w_2 Y only for integer weights. If, for example, w = (1/3; 2/3), then the FFT must be instructed to compute the result for 1·X + 2·Y, and a horizontal adjustment is then needed to handle the fact that the weights no longer sum to one. Also, it can easily happen that the convolution of two vectors extends beyond the length of the input vectors; in that case the convolved distribution spills over the whole domain without any warning (so-called wrap-around). One must arrange for a sufficient length of the input vectors in order to avoid this.
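Both workarounds can be sketched together; `next_pow2` and `fft_convolve` are assumed helper names:

```python
import numpy as np

def next_pow2(n):
    """Smallest integer power of two >= n."""
    return 1 << (n - 1).bit_length()

def fft_convolve(px, py):
    """Convolve two histograms via FFT, zero-padding up to a power of
    two that also covers the full convolution length, so that nothing
    wraps around the end of the vector."""
    m = len(px) + len(py) - 1            # length of the full convolution
    n = next_pow2(m)
    fz = np.fft.rfft(px, n) * np.fft.rfft(py, n)
    return np.fft.irfft(fz, n)[:m]
```

For w = (1/3; 2/3) one would convolve the histogram of X with that of 2·Y (the support of Y stretched by the factor 2) and then shrink the resulting support by the factor 3, which is the horizontal adjustment mentioned above.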

Results
According to the procedure described above, I constructed a series of ex post predictions for Czech GDP based on the combination of 159 representative models, running from 2007 and ending in the second quarter of 2009, from where the ex ante combined prediction was computed. Figure 4 shows the medians of the relevant combined distributions for 1- to 6-step-ahead predictions from each forecast horizon considered. In terms of computational burden, it took almost 12 hours to process the suggested combined GDP model through brute-force simulation in the Bayesian manner. When the FFT was applied in the relevant phases of estimation, the computational time was cut down tremendously, to only 2 hours and 45 minutes, without any visible loss.
The combined ex ante prediction of Czech GDP suggests that the Czech economy touched its trough in the third quarter of 2009. The relatively high GDP growth in 2010Q1 implied by the combined model results mainly from the sharp GDP decline in the same quarter of the preceding year. I would not interpret this pattern of predicted GDP as a 'W'-shaped recovery.
VAR models usually serve as a convenient framework for tracing out the responses of endogenous variables to artificially generated shocks (impulse-response analysis). GDP in the combined models has non-constant variability across models and, what is more challenging, many models considered here contain isolated GDP components instead of GDP itself. Combined impulse-response functions would need to compare variables with somehow normalized variability and take into account the way nonstationarity was handled, or perhaps the cross-correlation between GDP and the modeled variables. This sounds like a tedious task, but on the other hand some puzzles, e.g. the price puzzle, could perhaps be resolved this way.

Appendix
The Metropolis algorithm helped to cut down the number of models considered from 1408 to 159. Table 1 provides an overview of the model types included in the representative model set.