12 data analysis techniques for a trader
By Neil Dennis
12:17, 6 October 2017
Statistical analysis has been used in financial markets for many decades, helping some of the top names in investment to take the guesswork out of their decisions or, at least, to firm up their "gut feelings".
The history of statistics goes back centuries, even millennia. Think of biblical censuses and the Domesday Book, both exercises in gathering data for the study of population demographics.
The use of charts and historical data is commonplace in private investment but the use of statistical analysis is more typically associated with quantitative investment and active fund management techniques.
Warren Buffett
The so-called "Oracle of Omaha" Warren Buffett is perhaps best known for his "get greedy when everyone else is scared, and get scared when everyone else is being greedy" line.
But he was also a pioneer of analysis, starting out with tip sheets at the racetrack.
He graduated to stock trading, and part of his approach is to work out what price is right for him, compared with what profits that company could be expected to be earning in 10 years.
There are methods of extrapolating this from historical data using one, or a combination, of the techniques listed below.
Buffett's former daughter-in-law, Mary Buffett, wrote in her book on his trading style: "Warren has found, if the company is one of sufficient earning power and earns high rates of return on shareholders' equity, created by some kind of consumer monopoly, chances are good that accurate long-term projections of earnings can be made."
Limitations to data analysis techniques
Statistical analysis has its limitations. There's little room to account for "black swan" events – those sometimes catastrophic occurrences that no amount of number crunching can predict.
And during such periods – as in the volatile market conditions that followed events such as the dotcom bubble, 9/11 and the 2008 financial crisis – statistical analysis becomes something of a blunt tool, its predictive power neutered by unpredictability.
Statistical analysis tools
Below, then, are several techniques in the arsenal of statistical analysis.
We've removed most of the hard sums to leave you with just the ideas that make these tools useful.
Should these ideas prompt the desire for a deeper understanding of these types of investment methods, we've included a further reading list at the end of the article.
Measures of central value
There are three measures of central tendency in statistical analysis: the mean, median and mode. All three are summary measures that attempt to best describe a whole set of data in a single value that represents the core of that data set's distribution.
1. Mode
This is the most commonly occurring value in a data set.
Consider the following data set of the ages of 10 children:
4, 5, 5, 6, 6, 6, 7, 8, 8 and 9
The mode here is 6, as this is the most commonly occurring value. The mode, however, won't necessarily reflect the central value of a data set. Also, it is possible for there to be two or more modes in a data set or, indeed, no mode at all.
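As a minimal sketch, the mode of the ages above can be found with Python's standard library. The second list is an invented example, added only to show a data set with two modes.

```python
from statistics import multimode

# Ages of the 10 children from the example above.
ages = [4, 5, 5, 6, 6, 6, 7, 8, 8, 9]

# multimode returns every value tied for the highest count,
# so it also copes with data sets that have two or more modes.
modes = multimode(ages)
print(modes)  # [6]

# Invented data set with two modes.
two_modes = multimode([1, 1, 2, 2, 3])
print(two_modes)  # [1, 2]
```

When every value appears only once, multimode simply returns all of them, which is one way of seeing that such a set has no meaningful mode.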
2. Arithmetic mean
The mean is the average value of a data set.
Consider the following data set:
2, 4, 5, 8 and 9
The arithmetic mean is arrived at by adding all the numbers together and then dividing the total by the number of data points in the set.
So, by adding 2+4+5+8+9 = 28, which we then divide by 5 (the number of data points, or numbers, in that set) we arrive at 5.6.
Mean values are useful in many circumstances in business.
Internet shopping sites always ask for your age range when you set up an account. This is not only useful to them, but also to other retailers and manufacturers of goods for targeting advertising to certain age groups.
In investment, particularly for institutions, it's becoming increasingly important to know the average buying prices at certain times of day to know whether your institution is arriving at best execution on its asset purchases.
3. Median
The median is the middle number in a data set.
Consider the same data set as above:
2, 4, 5, 8 and 9
The median is simply the number in the middle, which is 5. This is easily arrived at if the data set has an odd number of data points, as above. But what if the data set were:
1, 2, 4, 5, 8 and 9
In the case of a data set with an even number of data points, we take the average of the middle two numbers.
So, (4+5)/2 gives us a median of 4.5.
Median values are useful in statistical analysis because they are less prone to be skewed by anomalies or other unusual appearances in a data set. Consider the following set:
2, 4, 5, 8 and 798
In reality, such an extraordinary thing isn't likely to happen in such a small set, but the median of 5 is much more representative of the majority of that data set than the arithmetic mean of 163.4.
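The mean and median figures quoted in this section can be checked with a short sketch using Python's standard library. The variable names are illustrative only.

```python
from statistics import mean, median

odd_set = [2, 4, 5, 8, 9]          # odd number of data points
even_set = [1, 2, 4, 5, 8, 9]      # even number of data points
outlier_set = [2, 4, 5, 8, 798]    # one extreme value

print(mean(odd_set))         # 5.6
print(median(odd_set))       # 5
print(median(even_set))      # 4.5, the average of 4 and 5
print(mean(outlier_set))     # 163.4, dragged up by the outlier
print(median(outlier_set))   # 5, unaffected by the outlier
```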
Imagine the example of salaries in a company. Let's say there are three broad ranges of salary: 80% of salaries go to semi-skilled and unskilled workers, 15% to skilled workers and supervisors, and just 5% to senior managers and executives.
That top 5% skews the average salary upward.
A semi-skilled worker earning £30,000 a year isn't likely to be impressed to learn that the mean salary where he works is £45,000 a year. He knows he earns more than an unskilled worker, but the mean salary makes his route up the corporate ladder seem a terribly long one.
His salary is likely to be more closely related to the median given the percentage of workers in that group of the data set.
Probability theory
4. Mathematical expectation
Also called the expected value (EV), this is the number in probability theory that one arrives at on average when a task with random outcomes is performed many times, such as rolling a single dice.
The data set here is 1, 2, 3, 4, 5 and 6, and the probability of any of those numbers turning up on a single throw is 1 in 6, or 1/6 or, expressed as a decimal, roughly 0.1667.
The mathematical expectation or EV is arrived at by multiplying each of the possible outcomes by the probability of it occurring and adding those products together. Hence, with a dice roll:
(1 x 1/6) + (2 x 1/6) + (3 x 1/6) + (4 x 1/6) + (5 x 1/6) + (6 x 1/6) = 3.5
Simply, the expected value is the arithmetic mean of all possible outcomes, so:
(1+2+3+4+5+6)/6 = 3.5
The law of large numbers dictates that the more often the dice is thrown, the closer the arithmetic mean of those throws gets to the EV. This is called convergence.
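The dice example can be sketched in a few lines of Python. The simulation part is only an illustration: the seed and the numbers of throws are arbitrary choices, and the averages it prints will differ if they are changed.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each face is equally likely

# Expected value: each outcome multiplied by its probability, then summed.
expected_value = sum(face * p for face in faces)
print(float(expected_value))  # 3.5

# Law of large numbers: the average of many throws drifts towards 3.5.
rng = random.Random(42)  # fixed seed so the run is repeatable
for n_throws in (10, 1_000, 100_000):
    throws = [rng.choice(faces) for _ in range(n_throws)]
    print(n_throws, sum(throws) / n_throws)
```

With only 10 throws the average can sit some way from 3.5. By 100,000 throws it should be very close.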
In business and investment terms, expected value is used by risk managers in scenario analysis when calculating whether an investment is worth the appropriate level of risk the firm is willing to take on.
The quality and depth of statistical analysis now made possible by computing means EV can be calculated on data sets that were previously regarded as unworkably massive.
These can be of enormous value in helping investment professionals to arrive at forecasts for investment returns, particularly when used in conjunction with measures of variance and standard deviation (see below).
Distribution models
5. Normal distribution
Normal distribution is also called Gaussian distribution. The standard normal distribution is the special case in which the mean is zero and the standard deviation is one.
Normal distribution can be charted along a single horizontal axis that represents the total spectrum of values within a given data set.
Half of that data set will have values that are higher than the mean and half will have values lower than the mean. Most data points will lie close to the mean and the rest will tail off in each direction.
The shape described by plotting this data will be a bell curve.
Normal distribution patterns in historical returns don’t tell an investor that much, other than that the asset is apparently miraculously well-behaved and that its returns mostly reflect the historical average.
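To put numbers on "most data points will lie close to the mean", here is a small sketch using Python's statistics.NormalDist. The 7% mean return and 15% standard deviation are invented for illustration and are not taken from any real asset.

```python
from statistics import NormalDist

# Hypothetical asset: mean annual return 7%, standard deviation 15%.
MEAN, SD = 0.07, 0.15
returns = NormalDist(mu=MEAN, sigma=SD)

# Half of all outcomes lie below the mean, half above.
below_mean = returns.cdf(MEAN)
print(below_mean)  # 0.5

def share_within(k):
    """Share of outcomes within k standard deviations of the mean."""
    return returns.cdf(MEAN + k * SD) - returns.cdf(MEAN - k * SD)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

For any normal distribution these shares come out at roughly 68%, 95% and 99.7%, whatever mean and standard deviation are chosen.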
6. Skewness
Skewness measures the asymmetry of a distribution.
In a normal distribution, as described above, the skewness will be zero.
Negative skewness stretches the tail of the bell curve out to the left, and positive skewness stretches it out to the right.
When examining an asset's annual returns over a period of time, the professional investor will look for investments that show positive skewness, that is, a tendency to produce occasional returns well above the historical average.
This has, in some circumstances, proved disastrous for investors, however. When market bubbles form, an asset can show positive skewness, prompting investors to buy at the top of the market. Then, when the skew turns negative, they may be tempted to sell at a loss.
Statistical analysis is only as intuitive as the person using it.
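The skewness described in this section can be sketched as follows. This uses one common definition (the third central moment divided by the cube of the standard deviation). Other definitions apply small-sample corrections, and the three return series are invented for illustration.

```python
from statistics import fmean

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])
    m3 = fmean([(x - m) ** 3 for x in values])
    return m3 / m2 ** 1.5

# Hypothetical annual returns in percent.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_tail = [1.0, 1.0, 2.0, 2.0, 10.0]      # one unusually good year
left_tail = [-10.0, -2.0, -2.0, -1.0, -1.0]  # one unusually bad year

print(skewness(symmetric))   # zero for a symmetric set
print(skewness(right_tail))  # positive
print(skewness(left_tail))   # negative
```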
7. Kurtosis
Kurtosis is another measure of deviation from normal distribution, but looks at the extremes. This introduces the well-known investment term "tail risk".
A distribution that is said to have a fat tail shows high kurtosis. Tail risk arises when the probability that an investment could move more than three standard deviations (see below) from the mean is greater than a normal distribution would imply.
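As a rough sketch, kurtosis is often reported as "excess kurtosis": the fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores zero. That convention is an assumption here, as are the two invented data sets. The last lines show how rare a move of more than three standard deviations is under a normal distribution.

```python
from statistics import NormalDist, fmean

def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (zero for a normal distribution)."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])
    m4 = fmean([(x - m) ** 4 for x in values])
    return m4 / m2 ** 2 - 3.0

# Hypothetical return series.
thin_tails = [-2.0, -1.0, 0.0, 1.0, 2.0]
fat_tails = [-10.0] + [0.0] * 8 + [10.0]  # mostly quiet, two extreme moves

print(round(excess_kurtosis(thin_tails), 2))  # negative: thinner tails than normal
print(round(excess_kurtosis(fat_tails), 2))   # positive: fatter tails than normal

# Probability of a move beyond three standard deviations if returns were normal.
normal_tail = 2 * (1 - NormalDist().cdf(3))
print(round(normal_tail, 4))  # about 0.0027
```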
Divergence from the mean
8. Variance
Variance is used as a data analysis tool to examine how each individual value in a set of numbers differs from the arithmetic mean of that data set.
If you take the data set 2, 4, 5, 8 and 9, the arithmetic mean (adding all and dividing by number of data points, i.e. 5) is 5.6. If you simply take the deviation from the mean by subtracting it from each number, i.e.: 2 - 5.6, 4 - 5.6 etc, you get -3.6, -1.6, -0.6, 2.4 and 3.4.
The sum of these deviations, for this or any other data set, will always be zero. To arrive at the variance, take the difference between each number in the data set and the arithmetic mean and square it. Hence:
(-3.6) x (-3.6), (-1.6) x (-1.6) and so on, to arrive at the set of squared deviations: 12.96, 2.56, 0.36, 5.76 and 11.56. Then take the arithmetic mean of this new set, which is the total of 33.2 divided by 5. The variance is therefore 6.64.
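The steps above can be reproduced in a short Python sketch. It follows the article in dividing by the number of data points (the "population" variance). Many tools default to the "sample" variance, which divides by one fewer and is shown last for comparison.

```python
from statistics import fmean, pvariance, variance

data = [2, 4, 5, 8, 9]

mean_value = fmean(data)                       # 5.6
deviations = [x - mean_value for x in data]    # -3.6, -1.6, -0.6, 2.4, 3.4
squared = [d ** 2 for d in deviations]         # 12.96, 2.56, 0.36, 5.76, 11.56

# Deviations from the mean cancel out, apart from floating-point rounding.
print(abs(sum(deviations)) < 1e-9)  # True

# Population variance: the mean of the squared deviations.
manual_variance = fmean(squared)
print(round(manual_variance, 2))  # 6.64
print(pvariance(data))            # 6.64, the library equivalent

# Sample variance divides by n - 1 instead of n.
print(variance(data))             # 8.3
```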
Variance is also used in risk management to help determine the level of risk an investor might take when purchasing a certain asset, but usually in the form of its square root, the standard deviation, which we'll examine next.
9. Standard deviation
The standard deviation is simply the square root of variance, but is one of the most important measures in statistical analysis.
When applied to annual returns on an investment, standard deviation can help determine the historical volatility of that investment.
Once you have worked out the variance, it is simple. The variance of the set 2, 4, 5, 8 and 9 as above is 6.64. The standard deviation of this set is the square root of 6.64, which is 2.577.
Standard deviation is a fundamental risk measure in investment that most professional fund and portfolio managers use to help calculate likely returns from an investment.
Knowing the returns on an investment over several previous years, the mean or average return can be calculated, and from that the standard deviation tells the investment manager the likely volatility on the average return.
If the return each year has been within one standard deviation of the mean, the investment has been relatively stable. If the return in some years falls outside that range, it has been more volatile.
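A minimal sketch of that check follows. The annual returns are invented for illustration, and "within the standard deviation" is read here as within one standard deviation either side of the mean return, which is an interpretation rather than a fixed rule.

```python
from statistics import fmean, pstdev

# The article's data set: the square root of the variance of 6.64.
print(round(pstdev([2, 4, 5, 8, 9]), 3))  # 2.577

# Hypothetical annual returns in percent.
annual_returns = [6.0, 9.5, 7.2, -4.0, 8.1, 21.0, 7.7]

mean_return = fmean(annual_returns)
sd = pstdev(annual_returns)
lower, upper = mean_return - sd, mean_return + sd

# Years whose return fell outside one standard deviation of the mean.
outside = [r for r in annual_returns if not lower <= r <= upper]
print(round(mean_return, 2), round(sd, 2))
print(outside)  # [-4.0, 21.0]
```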
Measures of similitude
9. Covariance
Traders use statistical analysis to plot the returns on risky investments in a portfolio. When two or more risk assets move in tandem, they are said to have high, or positive, covariance.
Positive covariance isn't particularly welcome in an asset portfolio. One can expect a higher degree of returns from risk assets, but also a higher degree of losses when things go wrong – and you don't want two or more risky assets going wrong at the same time.
Low, or negative, covariance provides an asset portfolio with greater diversification, because when one risk asset is not performing well, other risk assets should be offsetting that poor performance.
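Covariance itself is the average of the products of each asset's deviations from its own mean. A small sketch under that definition, with three invented return series, shows the positive and negative cases described above.

```python
from statistics import fmean

def covariance(xs, ys):
    """Population covariance: mean product of paired deviations from each mean."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Hypothetical annual returns in percent for three assets.
asset_a = [5.0, 8.0, -2.0, 10.0, 4.0]
asset_b = [6.0, 9.0, -1.0, 12.0, 5.0]   # rises and falls with asset A
asset_c = [3.0, 1.0, 7.0, -1.0, 4.0]    # tends to move against asset A

cov_ab = covariance(asset_a, asset_b)
cov_ac = covariance(asset_a, asset_c)
print(round(cov_ab, 2))  # positive: the two assets move in tandem
print(round(cov_ac, 2))  # negative: one tends to offset the other
```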
10. Correlation coefficient
Simple correlations can be seen when comparing two charts side by side. The eye can spot simple matches between peaks and troughs.
For a more accurate gauge of correlation, however, the correlation coefficient can be worked out by dividing the covariance of the two variables in question by the product of their standard deviations.
The answer should fall between 1 and -1. A positive value means there is a positive correlation between the two variables. The closer to 1, the more highly correlated the two are. The opposite effect will be seen in a negative coefficient.
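As a sketch, the standard (Pearson) correlation coefficient is the covariance of the two series divided by the product of their standard deviations. The fund and index returns below are invented for illustration.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mx, my = fmean(xs), fmean(ys)
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical annual returns in percent.
index = [5.0, 8.0, -2.0, 10.0, 4.0]
fund = [4.0, 7.5, -3.0, 9.0, 2.5]
mirror = [-r for r in index]  # always moves exactly opposite to the index

fund_vs_index = correlation(fund, index)
print(round(fund_vs_index, 3))               # close to 1
print(round(correlation(index, index), 3))   # 1.0
print(round(correlation(mirror, index), 3))  # -1.0
```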
This type of statistical analysis is used by fund managers to determine how well their fund is performing compared to its benchmark index.
11. Regression
The best-known regression model in finance is the capital asset pricing model (CAPM) which helps investors arrive at asset pricing and cost of capital.
Simply put, regression measures the degree to which the price of an asset, or other variable, is influenced by another set of variables.
For example, it is possible, using a regression formula, to work out the probable effect on an Australian gold miner's shares from rising gold prices, rising domestic interest rates and a fall in the US dollar.
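The example above involves several explanatory variables at once. As a simplified sketch, the code below fits a straight line with just one of them, using ordinary least squares. The gold and share price changes are invented, and a real model would use far more data and include the other factors.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical monthly percentage changes.
gold_change = [-2.0, -1.0, 0.0, 1.0, 2.0]
miner_change = [-3.9, -2.1, 0.2, 1.8, 4.0]

slope, intercept = fit_line(gold_change, miner_change)
print(round(slope, 2))  # 1.97

# Estimated move in the miner's shares for a 1.5% rise in gold.
predicted = intercept + slope * 1.5
print(round(predicted, 2))
```

In this invented data, the slope suggests the miner's shares move by roughly twice the percentage change in the gold price.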
12. R-squared
R-squared is the statistical analysis of the relationship between a fund, particular asset or security and its benchmark index.
For example, an equity fund will have a firm relationship with the index it tracks - if the fund is sector based, then it should have a close resemblance to that sector's sub-index on a main stock index.
R-squared values are expressed as percentages, so an R-squared of 100% would mean that all of the movements in that security, asset or fund were explained by movements in its benchmark index.
An R-squared value of less than 70% is usually said to indicate there is little relationship between the security and the index.
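When a fund is compared with a single benchmark, R-squared equals the square of the correlation coefficient between the two return series. The sketch below relies on that relationship. The index and fund returns are invented, and the 70% figure in the comments is the rule of thumb quoted above.

```python
from statistics import fmean

def r_squared(fund, index):
    """Squared correlation between fund and index returns (single-benchmark case)."""
    mf, mi = fmean(fund), fmean(index)
    sfi = sum((f - mf) * (i - mi) for f, i in zip(fund, index))
    sff = sum((f - mf) ** 2 for f in fund)
    sii = sum((i - mi) ** 2 for i in index)
    return sfi ** 2 / (sff * sii)

# Hypothetical annual returns in percent.
index = [5.0, 8.0, -2.0, 10.0, 4.0]
tracker = [4.8, 7.9, -2.3, 9.6, 4.1]    # follows the index closely
unrelated = [6.0, 2.0, 5.0, 4.0, 9.0]   # little connection to the index

tracker_r2 = r_squared(tracker, index)
unrelated_r2 = r_squared(unrelated, index)
print(f"{tracker_r2:.1%}")    # close to 100%
print(f"{unrelated_r2:.1%}")  # well below 70%
```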
Conclusions and further reading
Remember that without some knowledge also of the market conditions in which certain assets and securities thrive, statistical analysis alone is of little reliable use.
If you're basing your investment decisions only on hunches, you're just as much in the dark.
But together with analysis of financial factors such as balance sheets and profit and loss accounts, or historical returns, statistical analysis can help reassure investors on those hunches.
Use as much information as is readily available before making your investment decisions.
Please use this article, and any further reading on statistical analysis – of which we’ve provided some examples below – in conjunction with our courses on trading and other related features.
- Introduction to Statistical Methods for Financial Models, by Thomas A Severini (11 Jul 2017)
- Stock Market Probability: Using Statistics to Predict and Optimize Investment Outcomes, by Joseph E. Murphy (1 Apr 1994)
- Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals, by David Aronson (8 Dec 2006)
- Statistical Models and Methods for Financial Markets, by Tze Leung Lai and Haipeng Xing (2008)