Hello. 🙂 I am back with a post about forecasting, a subject I have always carefully tried to avoid … for personal reasons. 😀

Forecasting is the science (or, for some, the art) of analysing trends and estimating the probabilities of future outcomes from historical data.

If a **trend** is thought to be **linear**, we analyse it with a **regression**. If the trend is not linear, other tools are preferred, such as **Kendall’s rank correlation** and/or **smoothing** techniques. Descriptive statistics are an important exploratory tool to understand what kind of distributions we are dealing with.
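To make the difference concrete, here is a minimal sketch in plain Python. The series is made up (it is not the Apple data), and the helper names `ols_slope_intercept` and `kendall_tau_a` are my own, hypothetical ones. The first fits a least-squares line; the second computes Kendall's tau-a, which only asks whether the series keeps moving in the same direction, not whether it does so along a straight line.

```python
from itertools import combinations

def ols_slope_intercept(xs, ys):
    """Least-squares line y = intercept + slope * x for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical series: it always rises, but along a curve, not a straight line.
days = [1, 2, 3, 4, 5, 6]
prices = [100.0, 101.0, 104.0, 109.0, 116.0, 125.0]

slope, intercept = ols_slope_intercept(days, prices)
tau = kendall_tau_a(days, prices)
print(f"fitted line: price = {intercept:.2f} + {slope:.2f} * day")
print(f"Kendall tau-a: {tau:.2f}")
```

On this toy series every pair of days moves in the same direction, so the rank correlation flags a perfectly monotonic trend even though a straight line is only an approximation of the curve.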

Data on **Apple Inc. share prices** was collected from Yahoo Finance: more precisely, the daily **date, open, high, low, close, adjusted close and volume of shares traded on the market**, from 01.01.2017 to 09.06.2017, the last trading day available before this post.

Once data is collected, cleaned (if necessary), normalised (if necessary), scaled (if necessary), et cetera, we are ready to perform **descriptive statistical analysis**, which will help us investigate further the issues of most critical interest.

Variable #2 (Apple Inc. Share Price – Open)

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Count | 110 | Mean Deviation | 9,41869 |
| Mean | 138,36169 | Second Moment | 142,21464 |
| Mean LCL | 136,0978 | Third Moment | -771,31834 |
| Mean UCL | 140,62558 | Fourth Moment | 46.172,3278 |
| Variance | 143,51936 | | |
| Standard Deviation | 11,97996 | Sum | 15.219,786 |
| Mean Standard Error | 1,14224 | Sum Standard Error | 125,64685 |
| Coefficient of Variation | 0,08658 | Total Sum Squares | 2.121.478,93681 |
| Adjusted Sum Squares | 15.643,61058 | | |
| Minimum | 114,826 | | |
| Maximum | 156,009 | Geometric Mean | 137,83068 |
| Range | 41,183 | Harmonic Mean | 137,28291 |
| Mode | #N/A | | |
| Median | 140,375 | | |
| Median Error | 0,1365 | Skewness | -0,4548 |
| Percentile 25% (Q1) | 132,06925 | Skewness Standard Error | 0,22834 |
| Percentile 75% (Q3) | 144,869 | Kurtosis | 2,28293 |
| IQR | 12,79975 | Kurtosis Standard Error | 0,44452 |
| MAD (Median Absolute Deviation) | 0,702 | Skewness (Fisher’s) | -0,46111 |
| Coefficient of Dispersion (COD) | 0,06588 | Kurtosis (Fisher’s) | -0,69417 |
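Several rows of this table are derived from just a few others. As a rough sanity check, the sketch below recomputes some of them from the count, mean, standard deviation, quartiles and extremes reported above. The t critical value of 1.982 for 109 degrees of freedom is my assumption; the post does not say which value the package used.

```python
import math

# Values copied from the summary table (decimal commas written as points).
n = 110
mean = 138.36169
std_dev = 11.97996
q1, q3 = 132.06925, 144.869
minimum, maximum = 114.826, 156.009

# Assumption: two-sided 95% Student-t critical value for n - 1 = 109 d.f.
T_CRIT_95 = 1.982

std_error = std_dev / math.sqrt(n)          # "Mean Standard Error"
coeff_var = std_dev / mean                  # "Coefficient of Variation"
mean_lcl = mean - T_CRIT_95 * std_error     # "Mean LCL"
mean_ucl = mean + T_CRIT_95 * std_error     # "Mean UCL"
iqr = q3 - q1                               # "IQR"
value_range = maximum - minimum             # "Range"

print(f"standard error {std_error:.5f}, CV {coeff_var:.5f}")
print(f"95% CI for the mean: {mean_lcl:.4f} to {mean_ucl:.4f}")
print(f"IQR {iqr:.5f}, range {value_range:.3f}")
```

If the recomputed numbers agree with the table to the printed precision, it is a good sign that the package uses the usual sample definitions (standard deviation with n - 1 in the denominator, t-based interval for the mean).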

Here the software has automatically grouped the opening prices into ranges (“$110 to $115”, “$115 to $120”, …). When the statistical package chooses the ranges for you, you may easily get quasi-normal distributions. If you are familiar with programming, you can define the number of ranges yourself. However, one should be careful when doing so, because scientists and statisticians may be biased by what they want to find. Setting the number of ranges by hand can therefore be dangerous, unless there are methodologically sound reasons, backed by a careful study, for grouping the data that way.
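A small sketch of why the choice of ranges matters. The prices are hypothetical (not the Apple series) and `bin_counts` is a helper name I made up. The same eight values look quite different when grouped in $5 ranges than in $10 ranges.

```python
import math
from collections import Counter

def bin_counts(values, start, width):
    """Count how many values fall in each half-open range [lo, lo + width)."""
    counts = Counter()
    for value in values:
        index = math.floor((value - start) / width)
        lo = start + index * width
        counts[(lo, lo + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical opening prices, for illustration only.
opens = [112.0, 116.5, 118.0, 119.5, 121.0, 124.0, 126.0, 131.0]

five_dollar_bins = bin_counts(opens, start=110, width=5)
ten_dollar_bins = bin_counts(opens, start=110, width=10)

for label, bins in (("$5 ranges", five_dollar_bins), ("$10 ranges", ten_dollar_bins)):
    print(label)
    for (lo, hi), count in bins.items():
        print(f"  ${lo} to ${hi}: {'#' * count}")
```

With $5 ranges the toy data shows a peak between $115 and $120; with $10 ranges the peak disappears into a steadily falling shape. Neither picture is wrong, which is exactly why the choice should be justified before looking at the result.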

**Descriptive statistics** and **plots** are produced for every variable of interest.

Being able to interpret the statistics is a huge plus at every stage. Unfortunately I cannot dig into the statistics now, as I still need to get to the gist of this post, which is forecasting and predictive models.

One should however appreciate that whilst many think of **data visualisation** as an achievement, **data science professionals** use visualisation as an exploratory tool for their analysis.

For example, in the following graph we immediately see a fat-tail event at point 21 of our list of observations: it seems that 2017-02-01 was a very good day for Apple Inc., as a huge volume of its shares was traded, 111,985,000 (in one day!).

If you consider that the opening price of one share was $125.96 on that day, I'll let you do the math to figure out how much money moved through Apple Inc. shares in one day :O) (this is the value traded on the market, not what Apple Inc. itself earns). [Note though that our math will be biased because **prices are always fluctuating**. To get smaller errors we would need to collect data on smaller time-frames (e.g. every 5 minutes)].

Being able to read a graph can sometimes be a life-saver, or at least a time-saver (which is indeed a life-saver).
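For the impatient, the back-of-the-envelope calculation looks like this. Both inputs come from the paragraphs above; pricing every share at the open is a deliberate simplification, as noted in the bracket.

```python
volume = 111_985_000   # shares traded on 2017-02-01
open_price = 125.96    # opening price in USD on that day

# Crude estimate: every share priced at the open, ignoring intraday moves.
traded_value = volume * open_price
print(f"Approximate value traded: ${traded_value:,.0f}")
```

That is roughly $14.1 billion changing hands in a single day.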

When building a predictive model on a list of observations, a statistical package will return the **residuals**. **The residuals are the differences between the actual values and the predicted values.** When I built the linear regression model fitting the Apple Inc. data collected, the system generated the following plots for residuals.

In the figure in the lower left corner (titled Residuals), at the far end of the x-axis, observation 110 dips sharply. This is not human error (in handling data), nor is it a statistical error of the package (although both can happen in some cases).

In fact, here the linear model output a prediction that is significantly higher than the real value. The residual is the real value minus the estimated (predicted) value, so it is strongly negative. As one can see in the following table, the biggest residual (in absolute value) is in observation 110.

| Real Value | Predicted Value | Residual |
|---|---|---|
| 148,979 | 155,16951 | -6,19051 |

Residuals

| Observation | 116.150.002 | Predicted Y | Residual | Standardized [Excel] | Studentized | Deleted t | Leverage | Cook’s D | DFIT | PRESS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 116,019 | 116,80159 | -0,78259 | -0,67766 | -0,69074 | -0,68904 | 0,04642 | 0,01161 | -0,15202 | -0,82068 |
| 2 | 116,61 | 116,86821 | -0,25821 | -0,22359 | -0,22788 | -0,22686 | 0,0462 | 0,00126 | -0,04993 | -0,27071 |
| 3 | 117,91 | 117,67906 | 0,23094 | 0,19998 | 0,20353 | 0,20262 | 0,04358 | 0,00094 | 0,04325 | 0,24146 |
| 4 | 118,989 | 118,78399 | 0,20501 | 0,17753 | 0,18036 | 0,17954 | 0,04016 | 0,00068 | 0,03673 | 0,21359 |
| 5 | 119,11 | 119,55772 | -0,44772 | -0,38769 | -0,39341 | -0,39186 | 0,03787 | 0,00305 | -0,07775 | -0,46535 |
| 6 | 119,75 | 119,52917 | 0,22083 | 0,19122 | 0,19405 | 0,19318 | 0,03796 | 0,00074 | 0,03837 | 0,22954 |
| 7 | 119,25 | 119,68049 | -0,43049 | -0,37277 | -0,3782 | -0,37668 | 0,03752 | 0,00279 | -0,07437 | -0,44727 |
| 8 | 119,04 | 119,87845 | -0,83845 | -0,72603 | -0,73639 | -0,73481 | 0,03695 | 0,0104 | -0,14393 | -0,87062 |
| 9 | 120 | 119,15134 | 0,84866 | 0,73487 | 0,74618 | 0,74462 | 0,03906 | 0,01132 | 0,15013 | 0,88315 |
| 10 | 119,989 | 120,71785 | -0,72885 | -0,63113 | -0,63936 | -0,63758 | 0,03461 | 0,00733 | -0,12072 | -0,75498 |
| 11 | 119,779 | 120,15158 | -0,37258 | -0,32263 | -0,3271 | -0,32573 | 0,03618 | 0,00201 | -0,06311 | -0,38657 |
| 12 | 120 | 121,14326 | -1,14326 | -0,98997 | -1,00229 | -1,00231 | 0,03346 | 0,01739 | -0,18649 | -1,18284 |
| 13 | 120,08 | 120,71785 | -0,63785 | -0,55233 | -0,55953 | -0,55773 | 0,03461 | 0,00561 | -0,1056 | -0,66072 |
| 14 | 119,97 | 120,29339 | -0,32339 | -0,28003 | -0,28385 | -0,28263 | 0,03578 | 0,00149 | -0,05444 | -0,33539 |
| 15 | 121,879 | 121,11471 | 0,76429 | 0,66182 | 0,67008 | 0,66834 | 0,03354 | 0,00779 | 0,1245 | 0,79081 |
| … | … | … | … | … | … | … | … | … | … | … |
| 110 | 148,979 | 155,16951 | -6,19051 | -5,3605 | -5,41045 | -6,3183 | 0,02747 | 0,41341 | -1,06187 | -6,36537 |
| Minimum | 116,019 | 116,80159 | -6,19051 | -5,3605 | -5,41045 | -6,3183 | 0,00917 | 1,22457E-6 | -1,06187 | -6,36537 |
| Maximum | 156,1 | 155,94896 | 4,28537 | 3,71079 | 3,72271 | 3,97137 | 0,04642 | 0,41341 | 0,49981 | 4,35325 |
| Mean | 139,35945 | 139,35945 | 0 | 0 | -0,00082 | -0,00681 | 0,01835 | 0,00993 | -0,00676 | -0,00192 |
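If the diagnostic columns look mysterious, here is a sketch that rebuilds the row for observation 110 from three ingredients: its residual, its leverage, and the residual sum of squares reported in the regression output further down. The formulas are the standard textbook ones; the post does not document what the package does internally, so treat the match with the table as a plausibility check rather than a specification.

```python
import math

# Observation 110 and model-level figures, copied from the tables in this post.
actual = 148.979
predicted = 155.16951
leverage = 0.02747     # "Leverage" column for observation 110
sse = 144.03477        # residual sum of squares (ANOVA table)
n_obs = 109            # N reported by the package
n_params = 2           # intercept plus one slope

residual = actual - predicted
s = math.sqrt(sse / (n_obs - n_params))                  # residual standard error

# Assumption: the "[Excel]" column divides by sqrt(SSE / (n - 1)).
standardized_excel = residual / math.sqrt(sse / (n_obs - 1))
studentized = residual / (s * math.sqrt(1 - leverage))   # internally studentized
deleted_t = studentized * math.sqrt(
    (n_obs - n_params - 1) / (n_obs - n_params - studentized**2)
)
press_residual = residual / (1 - leverage)               # leave-one-out residual
cooks_d = studentized**2 / n_params * leverage / (1 - leverage)
dfit = deleted_t * math.sqrt(leverage / (1 - leverage))

for name, value in [
    ("residual", residual), ("standardized", standardized_excel),
    ("studentized", studentized), ("deleted t", deleted_t),
    ("Cook's D", cooks_d), ("DFIT", dfit), ("PRESS", press_residual),
]:
    print(f"{name:>13}: {value: .5f}")
```

The takeaway is that none of these columns is new information: they all rescale the same residual, and a standardized value beyond about 5 in magnitude is what marks observation 110 as an outlier.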

**A Linear Model**

The null hypothesis is that there is no relation between high and low prices and the volume of shares sold in a day (it’s a naive assumption, but it is just for the sake of testing the model).

Let us run the model to see how it fits the data collected.

As you can see in the following table, R, R-squared and adjusted R-squared show that the linear regression model explains about 99% of the variance in the response, which, as you might understand, is quite good!!! (especially because the model is built on only about 110 observations). Imagine building it on years of data correlated at a microscopic level 😀 thrilling!

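As the ANOVA box shows, the p-value is displayed as 0, meaning it is smaller than the precision shown rather than exactly zero. With a p-value this small, the null hypothesis is rejected at the 5% level: the variables in the model are strongly linearly related. Keep in mind that the fitted equation in the table has a single price predictor and a price response, so on its own it shows a relation between two price series; that prices and traded volume are related is something we expect from economic theory too.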

Linear Regression

Regression Statistics

| Statistic | Value |
|---|---|
| R | 0,99477 |
| R-Squared | 0,98957 |
| Adjusted R-Squared | 0,98947 |
| S | 1,16022 |
| MSE | 144,03477 |
| RMSE | 12,00145 |
| PRESS | 149,7971 |
| PRESS RMSE | 1,1723 |
| Predicted R-Squared | 0,98915 |
| N | 109 |

116.150.002 = 7,47465 + 0,9517 * 114.826.157

ANOVA

| | d.f. | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Regression | 1 | 13.662,7183 | 13.662,7183 | 10.149,70835 | 0 |
| Residual | 107 | 144,03477 | 1,34612 | | |
| Total | 108 | 13.806,75307 | | | |

| | Coefficient | Standard Error | LCL | UCL | t Stat | p-value | H0 (5%) |
|---|---|---|---|---|---|---|---|
| Intercept | 7,47465 | 1,31379 | 4,87021 | 10,07909 | 5,68936 | 1,12229E-7 | rejected |
| 114.826.157 | 0,9517 | 0,00945 | 0,93298 | 0,97043 | 100,74576 | 0 | rejected |

T (5%): 1,98238

LCL – Lower limit of the 95% confidence interval

UCL – Upper limit of the 95% confidence interval
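Most of the regression statistics can be rebuilt from the two sums of squares in the ANOVA box, which is a handy way to check that you are reading the output correctly. The sketch below does that with the figures printed above; the formulas are the usual ones for a regression with an intercept and a single predictor, which is what the fitted equation suggests (an assumption on my side, since the package's internals are not shown).

```python
import math

# Figures copied from the regression output (decimal commas written as points).
ss_regression = 13662.7183
ss_residual = 144.03477
df_regression, df_residual = 1, 107
slope, slope_se = 0.9517, 0.00945
t_crit = 1.98238                     # the "T (5%)" value

ss_total = ss_regression + ss_residual
n_obs = df_regression + df_residual + 1

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual
ms_residual = ss_residual / df_residual
s = math.sqrt(ms_residual)           # residual standard error, "S"
f_stat = (ss_regression / df_regression) / ms_residual
t_from_f = math.sqrt(f_stat)         # with one predictor, t squared equals F
slope_ci = (slope - t_crit * slope_se, slope + t_crit * slope_se)

print(f"R-squared {r_squared:.5f}, adjusted {adj_r_squared:.5f}")
print(f"S {s:.5f}, F {f_stat:.2f}, |t| {t_from_f:.3f}")
print(f"95% CI for the slope: {slope_ci[0]:.5f} to {slope_ci[1]:.5f}")
```

One thing this makes visible: the value the package labels MSE (144,03477) is the same number as the residual sum of squares in the ANOVA box, while the mean square itself is 1,34612, whose square root is the S of 1,16022.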

Concluding remarks:

As we saw, the linear regression model explains about 99% of the variance in the data, meaning that, statistically, it fits well.

**Whilst statistically fit, this linear model is financially inefficient. As we have seen, a residual of about -$6 per share in one day could be very bad, depending on whether you are long or short on that asset, and it could be terrible if you own a lot of shares.**

A **predictive model for financial price fluctuation** should consider (applying) the following:

- Big Data Analysis
- Real-Time Data Analysis (or analysis at the shortest possible timeframe)
- Asset Correlation Analysis

Such a model should take into account that financial assets in a portfolio are correlated, so a loss on one asset does not necessarily mean a very negative outlook for overall portfolio earnings.
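To give a feel for why correlation matters, here is a minimal sketch using the standard two-asset portfolio variance formula. All inputs are hypothetical (equal weights, 20% volatility for both assets), and `portfolio_volatility` is a name I made up for the illustration.

```python
import math

def portfolio_volatility(w1, sigma1, sigma2, rho):
    """Standard deviation of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1 - w1
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2 * w1 * w2 * rho * sigma1 * sigma2
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))

# Hypothetical inputs: two assets, each with 20% volatility, held 50/50.
for rho in (1.0, 0.0, -0.5):
    vol = portfolio_volatility(0.5, 0.20, 0.20, rho)
    print(f"correlation {rho:+.1f} -> portfolio volatility {vol:.4f}")
```

With perfectly correlated assets the portfolio is as volatile as each asset on its own; as the correlation falls, the same two assets combine into a calmer portfolio, which is the effect a single-asset model cannot see.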

Thank you for reading. More will come. 🙂

Best,

CA