Forecasting Adventures 5 - Explaining ARIMAX Models

profitgenerator (68) in #science • 7 years ago

I am not going to do any modeling today, tomorrow or in the next 2 days, so I just want to say a few words about ARIMAX models. I have been using them for quite some time now, both in my old job and in the cryptocurrency world, to analyze different markets.

ARIMAX is short for Autoregressive Integrated Moving Average with Exogenous regressors. It is pretty much the standard now in econometrics, time-series analysis and other statistical fields, and it is mostly the default tool.

Pseudoscience

I always laugh when I see "technical analysts" playing with moving average crossovers and things like that, which are nonsense of course. They have no clue about statistical analysis, so it's really the uneducated person's toolbox. You know, it's basically a form of mysticism, like faith healing and things like that. I don't believe in it.

So if you really want to get into market analysis, start either with news/fundamentals, which is the easiest part (basically just analyze the events and use that as a guide), or if you have the technical thirst, learn more about statistics. Take a course, read a book, it's not hard to educate yourself on it. There are literally free courses on YouTube on econometrics and time-series analysis.

Then after you get into it, you just have to stay updated. They say “economics is the dismal science”, because it's corrupted by politics. Well, econometrics then is the most unreliable science, because it's literally in people's interest not to share innovations here: any potential innovation can make people money, so most of the knowledge is secret and proprietary.

Well I have already shared some of my knowledge here, most of it is publicly available anyway, but not many people know it or understand it, so I can help there. I have a pretty small audience so it's not like everyone will get rich off this information, but those that follow me might learn a few new tricks, I usually make my articles easy to read for anyone with basic knowledge.

Science

There are 2 sides to time series analysis:

Volatility Analysis & Prediction
Mean Analysis & Prediction

The first one is of course the GARCH models and usually variance forecasting. The second one is the moving average based mean forecasting since we predict the future mean of the price, and of course a confidence interval for it for the minimum / maximum values. It's always uncertain, even the parameters themselves have a confidence interval. And of course the more you go into the future the bigger the error margin gets.
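To illustrate how the error margin grows with the horizon, here is a minimal sketch. It assumes the simplest possible case, a random walk with independent, normally distributed steps (an assumption made for the illustration, not something every market satisfies), where the half-width of the interval grows with the square root of the horizon. The function name and the sample numbers are made up.

```python
import math

def random_walk_interval(last_price, step_sd, horizon, z=1.96):
    """Approximate 95% forecast interval for a random walk, `horizon` steps ahead.

    Assumes independent, normally distributed steps with standard
    deviation `step_sd` (an illustrative assumption).
    """
    half_width = z * step_sd * math.sqrt(horizon)
    return last_price - half_width, last_price + half_width

# Hypothetical numbers: last price 100, step standard deviation 2.
for h in (1, 4, 16):
    low, high = random_walk_interval(100.0, 2.0, h)
    print(h, round(low, 2), round(high, 2))
```

Going from 1 step to 4 steps ahead doubles the width of the interval, and going to 16 steps doubles it again.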

I usually like to do the second one, but both can be used simultaneously, although it's a little bit like the Heisenberg principle: you need both in order to predict profitability and risk, but you can't pin down both. It's not that you literally can't know both, but each of them has its own uncertainty attached, and if you calculate both, the uncertainty doubles. So you don't gain an advantage over knowing only 1, so why not stick to only 1.

And since the GARCH models are pretty experimental, I mean literally a new paper comes out every week with an improvement, I tend to find the ARIMA based models more reliable. They have been reviewed multiple times and are better tested. So I focus on mean analysis, and I just estimate the standard deviations and the risk in other ways.

ARIMA Models

Well, it's the basic model and there are variations of it. We have a (p,d,q) parameter set, one value for each part: p is the order of the autoregressive part, d is the level of differencing and q is the order of the moving average part.

The default way of finding the parameters and evaluating them is called the Box-Jenkins method, which tells you how to identify and estimate the parameters. There are of course other tools too, like correlograms, the ADF test and others, plus it tells you to check that the residuals are random, with no autocorrelation and normally distributed, and things like that.

Well here is a secret, I don't really use it that much. In fact I don't even think it's fully correct. It's not that it's wrong, but I find it hard to believe that it applies to all data, since every dataset is different, so you can't really generalize a strategy for it. I just prefer the brute force method: test all parameter combinations and then look for the ones that give the best results.

The autoregressive part is pretty much always 1 lag according to Box-Jenkins, but in my experience I always got better results with 0, since there is hardly any autoregression in the price. In fact if we assume that the market is efficient, which I don't think it fully is but it tends towards that like a central gravity point, then the autocorrelation should be 0 anyway.

I mean let's face it high-frequency traders will abuse the crap out of any inefficiency, so a raw market must be close to efficiency always. So even theoretically the AR part is 0. So p=0, usually.

The "I" part is the differencing level, which is pretty much always needed for market data, since raw price data is never stationary. Non-stationary means there is a trend, or the variance is not bounded. By differencing the data, we transform it and make it stationary. So d>0, mostly d=1. A second-order differencing usually destroys the quality of the data, so even if the data is not stationary after 1, I'd not go for 2.
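As a small illustration of what differencing does, here is a sketch in plain Python. The function name and the prices are made up for the example.

```python
def difference(series, d=1):
    """Apply first differencing d times: each pass replaces y[t] with y[t] - y[t-1]."""
    values = list(series)
    for _ in range(d):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

prices = [100, 102, 105, 109, 114]   # made-up prices with an upward trend
print(difference(prices, d=1))       # the changes between consecutive prices
print(difference(prices, d=2))       # the changes of the changes
```

Each pass loses one observation, and the second pass leaves very little of the original information in the series.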

The MA part is the moving average. Moving averages are not used for "crossovers" and nonsense like that, but for smoothing the data. There is also no arbitrary period for the MA like on a chart: an ARIMA(0,1,1), that is an MA(1) on the differenced data, is equivalent to simple exponential smoothing, so it behaves like an "exponential moving average". The parameters are estimated together with the rest, we don't leave anything arbitrary there just because it looks nice on the chart. This is a science-based field, not mysticism.
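For readers who want to see the exponential smoothing connection, here is a minimal sketch of simple exponential smoothing as a one-step-ahead forecaster. Under the sign convention y[t] - y[t-1] = e[t] + theta * e[t-1], an ARIMA(0,1,1) forecast follows the same recursion with alpha = 1 + theta. Seeding the recursion with the first observation is an assumption of this sketch, and the data is made up.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts: forecast[t+1] = alpha * y[t] + (1 - alpha) * forecast[t].

    forecasts[t] is the prediction for series[t], built only from earlier values.
    """
    forecast = series[0]          # assumption: seed with the first observation
    forecasts = [forecast]
    for y in series[:-1]:
        forecast = alpha * y + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts

prices = [10, 12, 11, 13]         # made-up data
print(exponential_smoothing(prices, alpha=0.5))
```

The smoothing weight alpha is a parameter to be estimated from the data, not a period picked because it looks nice on a chart.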

Now the parameters too can be either brute forced or estimated with some other tools. Here it depends on how good a computer you have, since for a high quality estimation you might need a supercomputer, there are just so many variables to loop through. So the cost of an accurate forecast is usually high.
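The brute-force idea can be sketched in its smallest form like this. To keep it self-contained it searches over a single exponential smoothing parameter instead of a full (p,d,q) set with its coefficients, and it scores each candidate by the sum of squared one-step-ahead errors. Both choices are assumptions made for the illustration, a real search would loop over many more combinations with a proper ARIMA package. The names and the data are made up.

```python
def one_step_sse(series, alpha):
    """Sum of squared one-step-ahead errors of simple exponential smoothing."""
    forecast = series[0]              # assumption: seed with the first observation
    sse = 0.0
    for y in series[1:]:
        sse += (y - forecast) ** 2    # score the forecast before y is used
        forecast = alpha * y + (1 - alpha) * forecast
    return sse

def brute_force_alpha(series, grid):
    """Try every candidate in the grid and keep the one with the lowest error."""
    return min(grid, key=lambda a: one_step_sse(series, a))

grid = [i / 20 for i in range(1, 21)]                # 0.05, 0.10, ..., 1.00
prices = [100, 101, 103, 102, 105, 107, 106, 109]    # made-up data
best = brute_force_alpha(prices, grid)
print(best, round(one_step_sse(prices, best), 3))
```

With one parameter and 20 candidates this is instant. Every extra parameter multiplies the number of combinations, which is where the computing cost comes from.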

Seasonality

There is also sometimes a seasonal component. I have hardly ever found one in a market, but it's there in temperature data or other real-world series that are not efficient. The market pretty much is efficient, so it hardly occurs here. Seasonality can be detected either by looking at the ACF/PACF values or with other standardized tests. I don't think it is relevant for markets.
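To show what looking at the ACF values means, here is a minimal sketch of the sample autocorrelation in plain Python, applied to a made-up series that repeats every 4 steps. The function name and the data are only for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    numer = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return numer / denom

seasonal = [0, 1, 0, -1] * 10     # made-up series with a period of 4
for lag in (1, 2, 4):
    print(lag, round(autocorrelation(seasonal, lag), 3))
```

A seasonal series shows a strong positive value at the lag that matches its period. If the market is close to efficient, the values for price changes should stay near 0 at every lag.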

ARIMA vs ARIMAX

So the question is which one is better. Keep in mind that there are different versions of ARIMA, and the answer depends on how the parameters are estimated, which information criterion we use and things like that.

I found this paper which kind of looks very professional but I think it’s a ridiculous application of it:

http://mme2012.opf.slu.cz/proceedings/pdf/024_Durka.pdf

Long story short, there is a lot of technical mumbo-jumbo there and in the end they claim that the ARIMA model is slightly better than the ARIMAX. Well, I am not going to challenge the technicalities, since I don’t have a PhD and I’m not an expert on the subject, but from my experience and observation I have to say that it’s a false conclusion.

First of all, a plain ARIMA model doesn’t handle regressors properly. Secondly, the ARIMAX model is specifically built for exogenous regressors. Whether the regressors are truly exogenous or not is a different issue, but the model is better.

Then the ARIMA regressors don’t always add up, and ARIMAX can be computed for more parameters.

Then the exogeneity problem arises: we need external information to add into the model, but every data series is sort of a “chicken and the egg” issue. Take the transaction volume in BTC/USD. Does the increased transaction volume make the price increase, or does the transaction volume increase in response to the price increase? Which one is it? Or do both increase somewhat in response to each other?

Hard to tell, since correlation doesn’t equal causation, but it doesn’t matter. Variables are not entirely endogenous unless they are 100% correlated with the price, and the less correlated they are, the less effect they have on the price. What is really needed is a lagged correlation: the previous value of the exogenous regressor should be correlated with the current price, which means it’s a leading indicator.
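Here is a minimal sketch of checking for such a lagged correlation, using a plain Pearson correlation between the regressor at time t - lag and the target at time t. The function names and the numbers are made up, and the price changes are deliberately built to follow the volume changes by one step.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(regressor, target, lag):
    """Correlation between regressor[t - lag] and target[t]."""
    if lag == 0:
        return pearson(regressor, target)
    return pearson(regressor[:-lag], target[lag:])

# Made-up changes: the price repeats the volume change one step later.
volume_change = [1, -2, 3, -1, 2, -3, 1, 2]
price_change = [0, 1, -2, 3, -1, 2, -3, 1]

for lag in (0, 1, 2):
    print(lag, round(lagged_correlation(volume_change, price_change, lag), 3))
```

A regressor is only useful as a leading indicator if the correlation shows up at a lag of 1 or more, since that is the only information available at forecast time.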

All of these are much better calculated in the ARIMAX tool, and from all my analysis I have found that ARIMAX is usually 5-10% more accurate than a naked ARIMA. So I don’t know where those guys get their knowledge from, but I prefer the ARIMAX model.

I am telling you guys if you get your information from “scientific papers”, in this field, most of the time they will not give you accurate information. You have to work from experience. But that doesn’t mean that “technical analysis” isn’t bogus as well.

Time Series Analysis Made Simple

There is really just 1 rule to time series analysis:

Only use past values!!

That’s it, if you keep that rule, there is really nothing that can go wrong, and aside from that, you can literally do anything with your data, just don’t use future values.

This also applies to the regressors, thank God that I made this wonderful tool to actually test or shift back all values in the regressors so that we don’t leak into the future:

https://steemit.com/programming/@profitgenerator/forecasting-adventures-4-processing-blockchain-info-s-data

So you can’t leak future values into a historical dataset. That automatically invalidates your model, since in the present you can’t peek into the future either. This applies to the regressors too: their dates must be aligned with your main data, that is important.
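The shifting idea can be sketched in a few lines. This is only an illustration of the rule, not the tool linked above, and the function names and numbers are made up. Each target value gets paired with the regressor value from `lag` steps earlier, so a row never contains information from its own future.

```python
def lag_regressor(values, lag=1):
    """Shift a regressor so that position t holds the value observed at t - lag."""
    if lag <= 0:
        raise ValueError("lag must be positive, otherwise future values leak in")
    return [None] * lag + list(values[:-lag])

def align(target, regressor, lag=1):
    """Pair each target value with the lagged regressor, dropping rows with no history."""
    shifted = lag_regressor(regressor, lag)
    return [(y, x) for y, x in zip(target, shifted) if x is not None]

price = [10, 11, 12, 13]     # made-up values, one per day
volume = [5, 6, 7, 8]        # same dates as the prices
print(align(price, volume, lag=1))
```

The first row is dropped because there is no earlier regressor value to pair it with.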

But other than that you can do anything. Apply any tool you want (try not to add extra data though, unless in regressor form), and process, transform or smooth the data as you want. It doesn’t matter as long as you keep that 1 rule.

So I just basically prefer brute-forcing. I find most of the theories unreliable so I just estimate the parameters on my own, with a confidence interval attached. Obviously too big or too small values are very unlikely so a probable range can be estimated, and then just add the uncertainty part into our risk model:

https://steemit.com/money/@profitgenerator/very-easy-risk-management

It’s just like that, it doesn’t have to be overcomplicated. Plus there are plenty of software packages that you can use, so you don’t even need to know any of the things I wrote in this article except the rule above. So just grab a software package and start analyzing data now: