Linear Regression Model – Predicting Apple Inc. share price

Hello. 🙂 I am back with a post about forecasting, a subject I have always carefully tried to avoid … for personal reasons. 😀

Forecasting is the science (or, for some, the art) of analysing trends and estimating the probabilities of future outcomes from historical statistics.

If a trend is thought to be linear, we analyse it with a regression. If the trend is not linear, other tools are preferred, such as Kendall’s rank correlation and/or smoothing techniques. Descriptive statistics are an important exploratory tool for understanding what kind of distributions we are dealing with.
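
To make the rank-correlation idea concrete, here is a minimal sketch of Kendall's tau (the simple tau-a variant, which ignores ties). The function name and the sample prices are made up for illustration; this is not the software used for this post.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x2 - x1) * (y2 - y1)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical prices that rise over time, but not along a straight line.
days = [1, 2, 3, 4, 5, 6]
prices = [100.0, 100.5, 102.0, 106.0, 115.0, 140.0]

tau = kendall_tau(days, prices)
print(tau)  # 1.0: every later day has a higher price
```

Because tau only looks at the ordering of the values, it picks up a steady upward trend even when a straight line would fit poorly.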

Data on Apple Inc. share prices was collected from Yahoo Finance: date, open, high, low, close, adjusted close and volume of shares traded, daily from 01.01.2017 to the last trading day available before this post, 09.06.2017.

Once the data is collected, cleaned, normalised and scaled (each only if necessary), we are ready to perform a descriptive statistical analysis, which will help us investigate the issues of more critical interest.

Variable #2 (Apple Inc. Share Price – Open)

Statistic                          Value
Count                              110
Mean                               138,36169
Mean LCL                           136,0978
Mean UCL                           140,62558
Variance                           143,51936
Standard Deviation                 11,97996
Mean Standard Error                1,14224
Coefficient of Variation           0,08658
Minimum                            114,826
Maximum                            156,009
Range                              41,183
Mode                               #N/A
Median                             140,375
Median Error                       0,1365
Percentile 25% (Q1)                132,06925
Percentile 75% (Q3)                144,869
IQR                                12,79975
MAD (Median Absolute Deviation)    0,702
Coefficient of Dispersion (COD)    0,06588
Mean Deviation                     9,41869
Second Moment                      142,21464
Third Moment                       -771,31834
Fourth Moment                      46.172,3278
Sum                                15.219,786
Sum Standard Error                 125,64685
Total Sum Squares                  2.121.478,93681
Adjusted Sum Squares               15.643,61058
Geometric Mean                     137,83068
Harmonic Mean                      137,28291
Skewness                           -0,4548
Skewness Standard Error            0,22834
Kurtosis                           2,28293
Kurtosis Standard Error            0,44452
Skewness (Fisher’s)                -0,46111
Kurtosis (Fisher’s)                -0,69417
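
Several of the figures in this table follow directly from the others. The sketch below recomputes a few of them from the reported count, mean, standard deviation, extremes and quartiles. The critical value of Student's t is my own assumption, read from a t-table; it does not appear in the table.

```python
import math

# Summary values copied from the table above (decimal commas written as points).
n = 110
mean = 138.36169
std_dev = 11.97996
minimum, maximum = 114.826, 156.009
q1, q3 = 132.06925, 144.869

# Assumption: two-sided 95% critical value of Student's t with n - 1 = 109
# degrees of freedom, taken from a t-table (about 1.982).
T_CRIT = 1.982

std_error = std_dev / math.sqrt(n)     # standard error of the mean
coeff_variation = std_dev / mean       # spread relative to the mean
value_range = maximum - minimum
iqr = q3 - q1                          # interquartile range
lcl = mean - T_CRIT * std_error        # lower 95% confidence limit of the mean
ucl = mean + T_CRIT * std_error        # upper 95% confidence limit of the mean

print(f"Mean Standard Error      {std_error:.5f}")
print(f"Coefficient of Variation {coeff_variation:.5f}")
print(f"Range                    {value_range:.3f}")
print(f"IQR                      {iqr:.5f}")
print(f"Mean LCL / UCL           {lcl:.4f} / {ucl:.4f}")
```

The results agree with the table up to rounding, which is a quick sanity check that the summary is internally consistent.
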
Figure 1: Histogram Opening Price Values

Here the software has automatically binned the opening prices into ranges (“$110 to $115”, “$115 to $120”, …). When the statistical package chooses the ranges for you, you may easily get quasi-normal distributions. If you are familiar with programming, you can define the number of ranges yourself. However, one should be careful when doing so, because scientists and statisticians may be biased by what they want to find. Setting the number of ranges by hand can therefore be dangerous, unless a careful methodological study of the sampling has been run and there are methodologically sound reasons for choosing the ranges that way.
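
To see how much the choice of ranges matters, here is a small sketch that bins the same numbers twice with different widths. The prices are hypothetical, chosen only to make the effect visible, and `bin_counts` is an illustrative helper, not part of any statistical package.

```python
from collections import Counter


def bin_counts(values, start, width):
    """Count how many values fall in each half-open range [low, low + width)."""
    counts = Counter()
    for value in values:
        index = int((value - start) // width)
        low = start + index * width
        counts[(low, low + width)] += 1
    return dict(sorted(counts.items()))


# Hypothetical opening prices, for illustration only (not the post's data).
prices = [111.0, 114.5, 116.2, 118.9, 119.1, 121.3, 123.8, 124.9, 126.0, 129.5]

wide = bin_counts(prices, start=110, width=10)
narrow = bin_counts(prices, start=110, width=5)

for label, histogram in (("$10 bins", wide), ("$5 bins", narrow)):
    print(label)
    for (low, high), count in histogram.items():
        print(f"  ${low:g} to ${high:g}: {'#' * count}")
```

With $10 ranges the histogram looks flat (5 and 5), while with $5 ranges the same data looks mound-shaped (2, 3, 3, 2). Nothing changed except the ranges.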

Descriptive statistics and plots are produced for every variable of interest.

Figure 2a: Descriptive Statistics and Plots Sample

Figure 2b: Descriptive Statistics and Plots Sample

Being able to interpret the statistics is a huge plus at every stage. Unfortunately I cannot dig into the statistics now, as I still need to get to the gist of this post, which is forecasting and predictive models.

One should however appreciate that whilst many think of data visualisation as an achievement, data science professionals use visualisation as an exploratory tool for their analysis.

For example, in the following graph we immediately see a fat-tail event at point 21 of our list of observations: 2017-02-01 was a very busy day for Apple Inc. stock, with a huge volume of 111,985,000 shares traded (in one day!).

Considering that the opening price of one share was $125.96 on that day, I will let you do the math to figure out how much money moved through Apple Inc. shares in one day :O) [Note though that our math will be biased because prices are always fluctuating. To get smaller errors we would need to collect data at shorter intervals (e.g. every 5 minutes).]
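
For those who would rather not do the math by hand, a rough sketch follows. It assumes, unrealistically, that every share changed hands at the opening price, which is exactly the bias mentioned in the note.

```python
# Figures quoted in the post for 2017-02-01.
volume = 111_985_000   # shares traded that day
open_price = 125.96    # opening price in USD

# Simplifying assumption: every share changed hands at the opening price.
traded_value = volume * open_price

print(f"Approximate value traded: ${traded_value:,.0f}")
```

That comes to roughly $14.1 billion in a single day.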

Figure 3: Volume of Apple Inc. Shares traded per day

Being able to read a graph can sometimes be a life-saver, or at least a time-saver (which is indeed a life-saver).

When building a predictive model on a list of observations, a statistical package will return the residuals. The residuals are the differences between the actual values and the predicted values. When I built the linear regression model fitting the Apple Inc. data collected, the system generated the following plots for the residuals.

Figure 4: Linear Regression Model - Residuals Plot

In the lower-left panel (titled Residuals), at the far end of the x-axis, observation 110 dips sharply. This is neither a human error (in handling the data) nor a statistical error of the package (although both may happen in some cases).

In fact, here the linear model output a prediction that is significantly higher than the real value. The residual is the real value minus the estimated (predicted) value, so it is strongly negative. As the following table shows, the largest residual in absolute terms is at observation 110.

Real Value    Predicted Value    Residual
148,979       155,16951          -6,19051
Residuals

Observation  116.150.002  Predicted Y  Residual  Standardized [Excel]  Studentized  Deleted t  Leverage  Cook’s D    DFIT      PRESS
1            116,019      116,80159    -0,78259  -0,67766              -0,69074     -0,68904   0,04642   0,01161     -0,15202  -0,82068
2            116,61       116,86821    -0,25821  -0,22359              -0,22788     -0,22686   0,0462    0,00126     -0,04993  -0,27071
3            117,91       117,67906    0,23094   0,19998               0,20353      0,20262    0,04358   0,00094     0,04325   0,24146
4            118,989      118,78399    0,20501   0,17753               0,18036      0,17954    0,04016   0,00068     0,03673   0,21359
5            119,11       119,55772    -0,44772  -0,38769              -0,39341     -0,39186   0,03787   0,00305     -0,07775  -0,46535
6            119,75       119,52917    0,22083   0,19122               0,19405      0,19318    0,03796   0,00074     0,03837   0,22954
7            119,25       119,68049    -0,43049  -0,37277              -0,3782      -0,37668   0,03752   0,00279     -0,07437  -0,44727
8            119,04       119,87845    -0,83845  -0,72603              -0,73639     -0,73481   0,03695   0,0104      -0,14393  -0,87062
9            120          119,15134    0,84866   0,73487               0,74618      0,74462    0,03906   0,01132     0,15013   0,88315
10           119,989      120,71785    -0,72885  -0,63113              -0,63936     -0,63758   0,03461   0,00733     -0,12072  -0,75498
11           119,779      120,15158    -0,37258  -0,32263              -0,3271      -0,32573   0,03618   0,00201     -0,06311  -0,38657
12           120          121,14326    -1,14326  -0,98997              -1,00229     -1,00231   0,03346   0,01739     -0,18649  -1,18284
13           120,08       120,71785    -0,63785  -0,55233              -0,55953     -0,55773   0,03461   0,00561     -0,1056   -0,66072
14           119,97       120,29339    -0,32339  -0,28003              -0,28385     -0,28263   0,03578   0,00149     -0,05444  -0,33539
15           121,879      121,11471    0,76429   0,66182               0,67008      0,66834    0,03354   0,00779     0,1245    0,79081
…
110          148,979      155,16951    -6,19051  -5,3605               -5,41045     -6,3183    0,02747   0,41341     -1,06187  -6,36537
Minimum      116,019      116,80159    -6,19051  -5,3605               -5,41045     -6,3183    0,00917   1,22457E-6  -1,06187  -6,36537
Maximum      156,1        155,94896    4,28537   3,71079               3,72271      3,97137    0,04642   0,41341     0,49981   4,35325
Mean         139,35945    139,35945    0         0                     -0,00082     -0,00681   0,01835   0,00993     -0,00676  -0,00192
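
The diagnostic columns for observation 110 can be rebuilt from just the residual, its leverage and the residual sum of squares. The sketch below uses the textbook formulas for a regression with an intercept and one slope. How the package defines the "Standardized [Excel]" column is my assumption (residual divided by the sample standard deviation of the residuals); it matches the reported value.

```python
import math

# Observation 110, copied from the residuals table (decimal commas as points).
actual = 148.979
predicted = 155.16951
leverage = 0.02747

# Values from the regression output of this post.
ss_residual = 144.03477   # residual sum of squares (ANOVA table)
n_obs = 109               # N
n_params = 2              # intercept + one slope
s = math.sqrt(ss_residual / (n_obs - n_params))  # standard error S

residual = actual - predicted
# Assumption: the "[Excel]" column divides by the sample standard deviation
# of the residuals, i.e. sqrt(SS_residual / (N - 1)).
standardized = residual / math.sqrt(ss_residual / (n_obs - 1))
studentized = residual / (s * math.sqrt(1 - leverage))
press_residual = residual / (1 - leverage)
cooks_d = studentized**2 / n_params * leverage / (1 - leverage)

print(f"Residual      {residual:.5f}")
print(f"Standardized  {standardized:.4f}")
print(f"Studentized   {studentized:.5f}")
print(f"PRESS         {press_residual:.5f}")
print(f"Cook's D      {cooks_d:.5f}")
```

A studentized residual beyond about 5 in absolute value, as here, marks the point as a clear outlier, even though its leverage is unremarkable.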

A Linear Model

The null hypothesis is that there is no relation between high and low prices and the volume of shares sold in a day (it is a naive assumption, but it is just for the sake of testing the model).

Let us run the model to see how it fits the data collected.
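
For readers who want to see what "running the model" amounts to, here is a minimal ordinary least squares sketch for one predictor. The data points and function names are hypothetical and only illustrate the technique; the figures in this post come from a statistical package, not from this code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total


# Hypothetical predictor/response pairs, for illustration only.
xs = [114.0, 116.0, 118.0, 120.0, 122.0]
ys = [116.0, 117.5, 120.5, 121.5, 124.5]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(f"y = {intercept:.2f} + {slope:.2f} * x, R-squared = {r2:.2f}")
```

On this toy data the fitted line is y = -3.90 + 1.05 * x, with an R-squared of 0.98.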

As you can see in the following table, the R, R-squared and adjusted R-squared values show that the linear regression model explains about 99% of the variance in the data, which as you might understand is quite good!!! (especially because the model is built on as few as 110 observations). Imagine building it on years of data correlated at a microscopic scale 😀 thrilling!

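As the ANOVA box shows, the p-value is reported as 0, meaning it is smaller than the precision displayed. Since that is far below the 5% significance level, the null hypothesis is rejected. Note that the fitted equation has a single predictor, so strictly speaking what is rejected is the absence of a linear relation between the two series in the model. That prices and the volume of shares sold are linked is something we would expect from economic theory too.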

Linear Regression

Regression Statistics
R                      0,99477
R-Squared              0,98957
Adjusted R-Squared     0,98947
S                      1,16022
MSE                    144,03477
RMSE                   12,00145
PRESS                  149,7971
PRESS RMSE             1,1723
Predicted R-Squared    0,98915
N                      109

116.150.002 = 7,47465 + 0,9517 * 114.826.157

ANOVA
              d.f.   SS              MS             F               p-value
Regression    1      13.662,7183     13.662,7183    10.149,70835    0
Residual      107    144,03477       1,34612
Total         108    13.806,75307

               Coefficient    Standard Error    LCL        UCL         t Stat       p-value       H0 (5%)
Intercept      7,47465        1,31379           4,87021    10,07909    5,68936      1,12229E-7    rejected
114.826.157    0,9517         0,00945           0,93298    0,97043     100,74576    0             rejected
T (5%)         1,98238

LCL – Lower limit of the 95% confidence interval
UCL – Upper limit of the 95% confidence interval
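
The headline statistics can be recomputed from the ANOVA sums of squares alone, using the standard textbook definitions. The sketch below is only a consistency check on the reported output; the variable names are mine.

```python
import math

# Sums of squares and counts copied from the output above.
ss_regression = 13662.7183
ss_residual = 144.03477
ss_total = 13806.75307
press = 149.7971
n_obs = 109
n_predictors = 1

df_residual = n_obs - n_predictors - 1   # 107
ms_residual = ss_residual / df_residual  # mean square of the residuals

r_squared = 1 - ss_residual / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual
predicted_r_squared = 1 - press / ss_total
f_statistic = (ss_regression / n_predictors) / ms_residual
s = math.sqrt(ms_residual)               # standard error of the regression

print(f"R-Squared            {r_squared:.5f}")
print(f"Adjusted R-Squared   {adjusted_r_squared:.5f}")
print(f"Predicted R-Squared  {predicted_r_squared:.5f}")
print(f"F                    {f_statistic:.2f}")
print(f"S                    {s:.5f}")
```

The results match the reported values up to rounding. One thing worth noticing: the figure labelled MSE (144,03477) is the same number as the residual sum of squares in the ANOVA table, whereas the residual mean square is the 1,34612 shown in the MS column.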

Concluding remarks:

As we saw, the linear regression model is able to explain about 99% of the variance in the data, meaning that, statistically, it is a good fit.

Whilst statistically fit, this linear model is financially inefficient. As we have seen, a residual of -$6 per share in one day could be very bad, depending on whether you are long or short on that asset, and it could be terrible if you own a lot of shares.
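
To put a number on it, here is a rough sketch of what the prediction error at observation 110 would mean for a position sized on the model's forecast. The position sizes are hypothetical, and the helper ignores costs, timing and everything else a real trade involves.

```python
def prediction_error_pnl(shares, actual, predicted):
    """Gain (+) or loss (-) relative to a plan based on the predicted price.

    shares > 0 means a long position, shares < 0 a short position.
    """
    return shares * (actual - predicted)


# Observation 110 from the residuals table (decimal commas as points).
actual, predicted = 148.979, 155.16951

# Hypothetical position sizes.
long_pnl = prediction_error_pnl(10_000, actual, predicted)
short_pnl = prediction_error_pnl(-10_000, actual, predicted)

print(f"Long 10,000 shares:  {long_pnl:,.2f} USD versus the forecast")
print(f"Short 10,000 shares: {short_pnl:,.2f} USD versus the forecast")
```

With 10,000 shares, the same error is a shortfall of about $61,900 for the long position and a windfall of the same size for the short one.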

A predictive model for financial price fluctuation should consider (applying) the following:

  • Big Data Analysis
  • Real-Time Data Analysis (or analysis at the shortest possible timeframe)
  • Asset Correlation Analysis

Such a model should take into account that financial assets in a portfolio are correlated, so losses on one asset do not necessarily give a very negative outlook for overall portfolio earnings.
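
As a minimal sketch of why correlation matters, the code below computes the Pearson correlation of two hypothetical return series and the volatility of a two-asset portfolio using the standard formula. The returns, weights and volatilities are all made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def portfolio_volatility(weight_a, vol_a, vol_b, rho):
    """Volatility of a two-asset portfolio; asset B gets weight 1 - weight_a."""
    weight_b = 1 - weight_a
    variance = (
        (weight_a * vol_a) ** 2
        + (weight_b * vol_b) ** 2
        + 2 * weight_a * weight_b * rho * vol_a * vol_b
    )
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding


# Hypothetical daily returns of two assets, for illustration only.
returns_a = [0.010, -0.020, 0.015, -0.005, 0.008]
returns_b = [-0.006, 0.012, -0.009, 0.004, -0.003]

rho = pearson(returns_a, returns_b)
vol_if_identical = portfolio_volatility(0.5, 0.2, 0.2, 1.0)
vol_observed = portfolio_volatility(0.5, 0.2, 0.2, rho)

print(f"correlation: {rho:.3f}")
print(f"volatility if perfectly correlated: {vol_if_identical:.4f}")
print(f"volatility with observed correlation: {vol_observed:.4f}")
```

When the two assets tend to move in opposite directions, the portfolio is far less volatile than either asset alone, which is the point about losses on one asset being offset by the other.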

Thank you for reading. More will come. 🙂

Best,

CA
