*Figure: An illustration of random walks*

### Problem Statement and Assumptions

We are given the initial price \(P_0\) and we want to make inferences about the future stock price \(P_T\). The random variables \(P_i\) must also be non-negative. The time scale here is arbitrary and can be made as large or small as necessary.

Our key assumption here is that the changes in price are independent and identically distributed (iid). We characterize the price change as the ratio \[C_i = \frac{P_i}{P_{i-1}}\] Note that we didn't use a straightforward difference (\(P_i-P_{i-1}\)). The reason is that the difference most certainly isn't iid: starting from a price of $1 the difference has support on \([-1,\infty)\), whereas starting from a price of $2 it has support on \([-2,\infty)\). You'll notice that our characterization corresponds to a *percentage* difference (plus one).
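To make the ratio concrete, here is a minimal sketch that computes the \(C_i\) from a list of prices. The function name `price_ratios` and the four prices are made up for illustration; they are not data from this post.

```python
def price_ratios(prices):
    """Return the ratios C_i = P_i / P_{i-1} for a list of positive prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive to form ratios")
    return [curr / prev for prev, curr in zip(prices, prices[1:])]

# Hypothetical prices, chosen only for illustration.
prices = [100.0, 102.0, 99.96, 104.958]
ratios = price_ratios(prices)

# Each ratio minus one is the fractional (percentage) change.
pct_changes = [c - 1.0 for c in ratios]

for c, pct in zip(ratios, pct_changes):
    print(f"C = {c:.4f}  change = {pct:+.2%}")
```

A ratio of 1.02 corresponds to a 2% rise and a ratio of 0.98 to a 2% fall, regardless of the price level.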

### The Normal Distribution

The normal distribution (also known as the bell curve, the Gaussian, etc.) is ubiquitous in modeling random variables, so it would be reasonable to conjecture that \(P_T\) is normally distributed. The density of the normal distribution is \[ f_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

However, for much the same reason that we didn't use the difference in price to characterize change, the normal distribution doesn't have the correct support. If we used it as our model, it would assign a positive probability to the future price being less than 0.
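As a rough illustration of the support problem, the sketch below evaluates the normal CDF at zero. The parameters (mean 10, standard deviation 5) are hypothetical and only meant to show that the probability is strictly positive for any choice of \(\mu\) and \(\sigma\).

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical model of the future price: mean 10, standard deviation 5.
mu, sigma = 10.0, 5.0
p_negative = normal_cdf(0.0, mu, sigma)

print(f"P(P_T < 0) under the normal model: {p_negative:.4f}")
```

With these numbers zero sits two standard deviations below the mean, so the model puts a bit over 2% of its mass on negative prices.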

### Logarithms to the Rescue

Okay, let's actually do the math without resorting to guessing. The price \(P_{1}\) can be expressed as \(C_1 \times P_0\), \(P_{2}\) as \(C_2 \times P_1\), and so on. Continuing this process inductively yields \[ P_T = C_T C_{T-1} \dots C_1 P_0 \] Thus \(P_T\) is proportional to the product of \(T\) iid random variables. The trick is to turn this product into a sum so that we can apply the central limit theorem. We do this by taking the logarithm of both sides: \[ \begin{align*} \log P_T &= \log(C_T C_{T-1} \dots C_1 P_0) \\ &= \log C_T + \log C_{T-1} + \dots + \log C_1 + \log P_0 \\ &\thicksim N(\mu,\sigma^2) \quad \text{(approximately, for large } T\text{)} \end{align*} \]

The last line needs justifying. Since the \(C_i\)s are iid, their logarithms must also be iid. Provided the \(\log C_i\) have finite variance, the central limit theorem tells us that their sum, and therefore \(\log P_T\), is approximately normal for large \(T\). Here \(\mu\) and \(\sigma^2\) are the mean and variance of \(\log P_T\), so they depend on \(T\) and \(P_0\). The exponential of a normally distributed random variable is log-normally distributed, so \(P_T\) is approximately log-normal. The density of the log-normal distribution is \[ g_{\mu,\sigma^2}(x) = \frac{1}{x\sqrt{2\pi \sigma^2}}e^{-\frac{(\log x-\mu)^2}{2\sigma^2}} \]

As a sanity check, the support of the log-normal is \((0,\infty)\), as expected.
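The argument can be checked by simulation. The sketch below is not the experiment from this post: it assumes the ratios are uniform on \([0.95, 1.06]\) (an arbitrary, deliberately non-normal choice), multiplies 250 of them together, and looks at the resulting \(\log P_T\) across 2000 simulated paths. If the central limit theorem is doing its job, the skewness of \(\log P_T\) should be close to zero, and every simulated price stays positive.

```python
import math
import random
import statistics

def simulate_final_log_prices(p0, steps, paths, rng):
    """Simulate log P_T where P_T = P_0 * C_1 * ... * C_T with iid ratios."""
    log_finals = []
    for _ in range(paths):
        log_p = math.log(p0)
        for _ in range(steps):
            # Assumed ratio distribution: uniform on [0.95, 1.06], not normal.
            log_p += math.log(rng.uniform(0.95, 1.06))
        log_finals.append(log_p)
    return log_finals

def skewness(xs):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
log_finals = simulate_final_log_prices(p0=100.0, steps=250, paths=2000, rng=rng)
final_prices = [math.exp(v) for v in log_finals]

print("skewness of log P_T:", round(skewness(log_finals), 3))
print("smallest simulated P_T:", min(final_prices))
```

Working in log space also avoids multiplying hundreds of numbers directly, which is why the loop adds logarithms and exponentiates only at the end.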

### But wait there's more!

In the beginning we noted that the choice of time scale is arbitrary. By considering smaller time scales, we can view each of our \(C_i\)s as the product of finer-grained ratios. Thus, by the same argument as above, each of the \(C_i\)s must also be (approximately) log-normally distributed.

### Experimental Results

I took ~3200 closing stock prices of Microsoft Corporation (MSFT), courtesy of Yahoo! Finance, from January 3, 2000 to today. I imported the data set into R and calculated the logarithms of the \(C_i\)s. I then plotted a normalized histogram of the results and overlaid the theoretical normal distribution on top of it. The plot is shown below:

### Discussion

As you can see, the theoretical distribution doesn't fit our data exactly. The overall shape is correct, but our derived distribution puts too little mass in the center and too little in the tails.

We now must go back to our assumptions for further scrutiny. Our main assumption was that the changes are independent and identically distributed. In fact, it has been shown in many research papers (e.g. Schwert 1989) that the changes are *not* identically distributed, but rather vary over time. However, the central limit theorem is fairly robust in practice. Especially with a sufficiently large number of samples, each "new" distribution will eventually sum to normality (and the sum of independent normal random variables is normal).
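One way to put numbers on "too little mass in the center and in the tails" is to compare the fraction of samples within half a standard deviation, and beyond three standard deviations, against what a fitted normal predicts. The sketch below does this on synthetic data, not on the MSFT series: it assumes Laplace-distributed log ratios purely as a stand-in with a sharper peak and heavier tails than a normal.

```python
import math
import random
import statistics

def normal_mass_within(k):
    """P(|Z| < k) for a standard normal Z."""
    return math.erf(k / math.sqrt(2.0))

def empirical_mass_within(xs, k):
    """Fraction of samples within k sample standard deviations of the mean."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(abs(x - m) < k * s for x in xs) / len(xs)

# Synthetic stand-in for log C_i (an assumption): the difference of two
# exponentials is Laplace-distributed, scaled here to look like daily changes.
rng = random.Random(1)
log_ratios = [0.01 * (rng.expovariate(1.0) - rng.expovariate(1.0))
              for _ in range(5000)]

center_emp = empirical_mass_within(log_ratios, 0.5)
center_fit = normal_mass_within(0.5)
tail_emp = 1.0 - empirical_mass_within(log_ratios, 3.0)
tail_fit = 1.0 - normal_mass_within(3.0)

print(f"center: data {center_emp:.3f} vs fitted normal {center_fit:.3f}")
print(f"tails:  data {tail_emp:.4f} vs fitted normal {tail_fit:.4f}")
```

Running the same two comparisons on real log ratios would show whether the mismatch in the histogram is a genuine feature of the data or an artifact of the binning.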

I suspect that the deviation from normality is primarily caused by dependence between samples. The heavy tails can be explained by the fact that a large drop/rise in price today may be correlated with another drop/rise in the near future. This is particularly true during times of extreme depression or economic growth. A similar argument can be made about the excess of mass in the center of the distribution. It is conceivable that a period of low volatility will be followed by another period of low volatility.
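A simple diagnostic for this kind of dependence is the lag-1 autocorrelation of the absolute log ratios: if large moves tend to follow large moves, it will be clearly positive even when the signed changes look uncorrelated. The sketch below demonstrates the diagnostic on a synthetic series, under the assumption of a hypothetical two-regime model (calm and turbulent volatility, switching with probability 0.02 per step). It is not a claim about how real prices behave.

```python
import random
import statistics

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

def simulate_regime_log_ratios(n, rng, sigmas=(0.01, 0.03), switch_prob=0.02):
    """Zero-mean normal log ratios whose volatility switches between two regimes."""
    state = 0
    out = []
    for _ in range(n):
        if rng.random() < switch_prob:
            state = 1 - state  # flip between calm (0) and turbulent (1)
        out.append(rng.gauss(0.0, sigmas[state]))
    return out

rng = random.Random(2)
log_ratios = simulate_regime_log_ratios(5000, rng)

raw_ac = lag1_autocorr(log_ratios)
abs_ac = lag1_autocorr([abs(x) for x in log_ratios])

print(f"lag-1 autocorrelation of log ratios:   {raw_ac:+.3f}")
print(f"lag-1 autocorrelation of |log ratios|: {abs_ac:+.3f}")
```

Applied to real log ratios, a near-zero first number together with a clearly positive second one would support the volatility-clustering explanation above.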

### Conclusion

While our model might not be perfect in practice, it is a good first step toward developing a better one. I think what you should take from this is that it is important to experimentally verify your models rather than blindly taking your assumptions as ground truths. I'll conclude this post with a few closing remarks:

- Many people actually do use the normal distribution to model changes in prices despite the obvious objections stated above. One can justify this by noting that \(\log C_i\) in practice is usually close to 0. Thus the first-order approximation \(e^x \approx 1+x\) is fairly accurate, which gives \(C_i \approx 1 + \log C_i\).
- The histogram and fit shown above can be reproduced for almost any stock or index (e.g. S&P 500, DJIA, NASDAQ).
- R is a great piece of software but has god awful tutorials and documentation. I am not in a position to recommend it yet because of this.
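As a footnote to the first remark, the quality of the approximation \(e^x \approx 1+x\) is easy to check numerically. The values of \(x\) below are arbitrary illustrative choices for the size of a log ratio.

```python
import math

def first_order_error(x):
    """Absolute error of approximating e^x by 1 + x."""
    return abs(math.exp(x) - (1.0 + x))

# Arbitrary sizes of log C_i, from tiny to fairly large moves.
for x in (0.001, 0.01, 0.05, 0.2):
    print(f"x = {x:<5}  e^x = {math.exp(x):.6f}  "
          f"1 + x = {1 + x:.6f}  error = {first_order_error(x):.6f}")
```

The error grows roughly like \(x^2/2\), so the approximation is comfortable for changes of a percent or so and noticeably worse for moves of 20%.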