In this post, we will learn about ANOVA (analysis of variance) and visualize the data along the way. We will mostly work in Mathematica. As always, I will try to present the ideas with plenty of visual elements.
For any data analysis one needs to have data to work with. If you do not have data to work with, do not fret. I will explain how to generate the data and then how to perform various analyses on it.
Some links to the code: the Wolfram website documents the ANOVA function at http://reference.wolfram.com/language/ANOVA/ref/ANOVA.html
I used the following code to generate the random data. The distribution is a binormal (bivariate normal) distribution with a varying X-mean* and a zero Y-mean*.
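nu = 100; (* number of points generated per group *)
data1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], nu]; (* fixed group: X-mean 2, Y-mean 0 *)
data2 = RandomVariate[BinormalDistribution[{n, 0}, {1, 1}, 0], nu]; (* moving group: X-mean n changes on each iteration *)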
Understanding the code: There are two data groups, named data1 and data2. The distribution of data1 is fixed, while the X-mean of data2 is changed on each iteration (the n in the code). nu is the number of points generated per group. That takes care of the data generation.
*X-mean is the mean x-position of all the points in the 2D plane and Y-mean is the mean y-position of all the points in the 2D plane.
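To get a feel for what one iteration produces, here is a minimal sketch that fixes the moving X-mean at a single value, pulls out the X-coordinates, and plots both groups with their sample means as large points. The value xMean2 = 4, the seed and the colours are assumptions made for illustration, not necessarily the settings used for the figures in this post.

```mathematica
SeedRandom[1];   (* assumed seed, only so the sketch is repeatable *)
nu = 100;        (* points per group, as above *)
xMean2 = 4;      (* assumed X-mean of the moving group for this single iteration *)

data1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], nu];
data2 = RandomVariate[BinormalDistribution[{xMean2, 0}, {1, 1}, 0], nu];

(* The F-test in this post uses only the X-coordinates (first column) *)
xs1 = data1[[All, 1]];
xs2 = data2[[All, 1]];

(* Scatter plot of both groups; the large points mark the sample means *)
ListPlot[{data1, data2},
 PlotStyle -> {Red, Darker[Green]},
 AspectRatio -> Automatic,
 Epilog -> {PointSize[Large],
   Red, Point[Mean[data1]],
   Darker[Green], Point[Mean[data2]]}]
```

Mean applied to a list of {x, y} pairs returns the pair of column means, which is why it can be passed straight to Point.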
Objective: To look at the F-value of the F-distribution and the p-value of the F-test on the X-values (X-coordinates) with the following conditions.
- Means Moving: What happens to the parameters when one group moves away from the other (data-means moving away)?
- Changing variances: What happens to the parameters when their standard deviations are changed?
- Changing Sample Sizes: What happens to the parameters when the number of data points increases?
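For reference, the F-value of a one-way ANOVA can be written as the ratio of the between-group mean square to the within-group mean square. The notation is my own choice for this sketch: k groups, n_i points in group i, N points in total, \bar{x}_i the mean of group i, and \bar{x} the overall mean.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 / (k - 1)}
         {\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 / (N - k)}
```

The p-value is the probability that an F-distribution with (k - 1, N - k) degrees of freedom exceeds the observed F. A larger gap between the group means raises the numerator, a smaller spread within the groups lowers the denominator, and, with the sample means and variances held fixed, a larger sample size raises F.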
We will now look at each of these conditions in turn. Given below is a key to the graphs.
1. Means Moving:
Here, we see what happens to the F-value and the p-value as the means change. As the means of the two groups separate, the F-value increases and the p-value drops: it becomes less likely that the groups come from the same distribution, so the hypothesis of equal means can be rejected.
Case A – given mean: In this image, the large points mark the input means used for random number generation.
Case B – actual mean: In this image, the large points mark the actual means of the randomly generated points.
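The moving-means experiment can be sketched as follows. The helper anovaFP is a hypothetical name of my own that computes the one-way F-value and p-value by hand from the X-coordinates. The range of n (2 to 5 in steps of 1/2) and the seed are assumptions. Each row also lists the sample means, so the given mean (Case A) can be compared with the actual mean (Case B).

```mathematica
(* Hypothetical helper: one-way ANOVA F-value and p-value for a list of groups *)
anovaFP[groups_List] := Module[{k, nTot, grand, msBetween, msWithin, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   msBetween = Total[(Length[#] (Mean[#] - grand)^2) & /@ groups]/(k - 1);
   msWithin = Total[((Length[#] - 1) Variance[#]) & /@ groups]/(nTot - k);
   f = msBetween/msWithin;
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

SeedRandom[1];  (* assumed seed *)
nu = 100;

(* Each row: input X-mean n, sample mean of group 1, sample mean of group 2, F, p *)
meansTable = Table[
   Module[{x1, x2},
    x1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], nu][[All, 1]];
    x2 = RandomVariate[BinormalDistribution[{n, 0}, {1, 1}, 0], nu][[All, 1]];
    Join[{N[n], Mean[x1], Mean[x2]}, anovaFP[{x1, x2}]]],
   {n, 2, 5, 1/2}];

TableForm[meansTable,
 TableHeadings -> {None, {"n", "mean 1", "mean 2", "F", "p"}}]
```

Because the samples are random, the F column need not rise strictly from row to row, but the overall trend should follow the separation of the means.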
2. Changing variances
In the image below, we can see two distributions, one in red and one in green. Their means are separated and held constant. Their standard deviations (in the X-direction) decrease from 5 to 0.2 in steps of 0.2. We can also see that the F-value increases as the standard deviation decreases, indicating that the distributions become more distinct.
You can also see that the p-value decreases as the distributions become more distinct. What might this mean? In an ANOVA, the hypothesis being tested is the equality of the means. A low p-value means that it is less likely that the groups are from the same distribution, so the equality of the means can be rejected.
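A minimal sketch of the changing-variance experiment is given below. The X-means 2 and 4 and the seed are assumptions, since the text only says that the means are held constant. The standard deviations run from 5 down to 0.2 in steps of 0.2 as described above. anovaFP is a hypothetical hand-written helper, not a built-in function.

```mathematica
(* Hypothetical helper: one-way ANOVA F-value and p-value for a list of groups *)
anovaFP[groups_List] := Module[{k, nTot, grand, msBetween, msWithin, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   msBetween = Total[(Length[#] (Mean[#] - grand)^2) & /@ groups]/(k - 1);
   msWithin = Total[((Length[#] - 1) Variance[#]) & /@ groups]/(nTot - k);
   f = msBetween/msWithin;
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

SeedRandom[1];  (* assumed seed *)
nu = 100;

(* X-direction standard deviations 5.0, 4.8, ..., 0.2 *)
sds = Range[25, 1, -1]/5.;

(* Each row: standard deviation s, F, p; X-means 2 and 4 are assumed *)
sdTable = Table[
   Module[{x1, x2},
    x1 = RandomVariate[BinormalDistribution[{2, 0}, {s, 1}, 0], nu][[All, 1]];
    x2 = RandomVariate[BinormalDistribution[{4, 0}, {s, 1}, 0], nu][[All, 1]];
    Prepend[anovaFP[{x1, x2}], s]],
   {s, sds}];

TableForm[sdTable, TableHeadings -> {None, {"sd", "F", "p"}}]
```

The list of standard deviations is built from integers and then divided by 5, which avoids a floating-point step accidentally skipping the last value.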
3. Changing Sample Sizes
Now we will see what happens when the number of points in each distribution increases, that is, when each distribution becomes denser. We will look at the F-value and the p-value for the distributions. The important observation is that, as the F-value increases, we can reject the null hypothesis more easily.
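The sample-size experiment can be sketched in the same way. The group sizes, the X-means 2 and 2.5, and the seed are all assumptions chosen for illustration. anovaFP is again a hypothetical hand-written helper.

```mathematica
(* Hypothetical helper: one-way ANOVA F-value and p-value for a list of groups *)
anovaFP[groups_List] := Module[{k, nTot, grand, msBetween, msWithin, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   msBetween = Total[(Length[#] (Mean[#] - grand)^2) & /@ groups]/(k - 1);
   msWithin = Total[((Length[#] - 1) Variance[#]) & /@ groups]/(nTot - k);
   f = msBetween/msWithin;
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

SeedRandom[1];  (* assumed seed *)
sizes = {10, 20, 50, 100, 200, 500};  (* assumed numbers of points per group *)

(* Each row: points per group, F, p; the means and spreads stay the same throughout *)
sizeTable = Table[
   Module[{x1, x2},
    x1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], m][[All, 1]];
    x2 = RandomVariate[BinormalDistribution[{2.5, 0}, {1, 1}, 0], m][[All, 1]];
    Prepend[anovaFP[{x1, x2}], m]],
   {m, sizes}];

TableForm[sizeTable, TableHeadings -> {None, {"points", "F", "p"}}]
```

With the same underlying separation, small samples may fail to reject equal means while large samples tend to reject them, although any single random run can deviate from that trend.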
Understanding the application of ANOVA, with a case study
Graphic generated by Mr. Chandra Sekhar for fermibot.wordpress.com
Now that we have some idea of how ANOVA works, we will see an example of how to use it in a practical situation. Assume we are a marketing firm observing the sales of a product sold as cereal-O, manufactured by the fictitious company Fermibot Mills. The events unfolded as follows.
- Last year, the top management at Fermibot Mills was dissatisfied with the low sales volume. The forecasting department was way off in its prediction, as this was a new product and there was no prior information to extrapolate from.
- Towards the end of the last fiscal year, the management decided to launch an advertising campaign, hoping to boost sales this year.
- The data across the key regions has been collected, and it is our job as analysts to see whether the ad campaign has really been useful. The data is tabulated below.
Sales in Millions | ||
Region | 2015(Pre Ad) | 2016(Post Ad) |
Connecticut | 67 | 66 |
Detroit | 58 | 60 |
Kansas City | 60 | 69 |
Tulsa | 61 | 57 |
Dallas | 59 | 60 |
Austin | 65 | 73 |
Santa Cruz | 61 | 53 |
Las Vegas | 60 | 56 |
Seattle | 62 | 64 |
Manhattan | 62 | 66 |
Orlando | 62 | 63 |
Miami | 60 | 53 |
Maui | 63 | 54 |
Houston | 55 | 60 |
Tampa | 59 | 62 |
St. Louis | 62 | 74 |
Memphis | 69 | 56 |
Everglades | 62 | 62 |
Mean Sales | 61.50 | 61.56 |
One might say that the ad campaign has done nothing whatsoever, but the ANOVA result says otherwise. Let’s have a look.
Groups | Count | Sum | Average | Variance | ||
2015(Pre Ad) | 18 | 1084 | 60.22222 | 16.18301 | ||
2016(Post Ad) | 18 | 1090 | 60.55556 | 48.84967 | ||
ANOVA | ||||||
Source of Variation | SS | df | MS | F | P-value | F crit |
Between Groups | 262.3333 | 1 | 262.3333 | 8.067739 | 0.00755717 | 4.130018 |
Within Groups | 1105.556 | 34 | 32.51634 | |||
Total | 1367.889 | 35 |
We need to look at the following items in the table. The F value (8.067739) is greater than the critical value of 4.130018, and the p-value (0.00755717) is below 0.05. This means that the equality of the means is rejected. What happened?
The means of the two columns (61.50 and 61.56) suggest that sales did not change after the ad campaign. But the means leave one thing out: they say nothing about the variability within the data. If we go back to the table, we see that the between-groups variation is very high compared with the within-groups variation. Also note that the variance of the 2016 sales has roughly tripled, to 48.85 from 16.18 the previous year.
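To cross-check the ANOVA table, the rows can be recomputed directly from the sales figures listed earlier. The sketch below does this by hand. All variable names are my own, and the final call assumes that the ANOVA` package linked at the top of the post is available and accepts {group, value} pairs. The sales figures as listed sum to 1107 and 1108, whereas the summary rows report 1084 and 1090. The recomputed values will therefore differ from the table, and it is worth comparing the two before drawing conclusions.

```mathematica
pre  = {67, 58, 60, 61, 59, 65, 61, 60, 62, 62, 62, 60, 63, 55, 59, 62, 69, 62};  (* 2015 *)
post = {66, 60, 69, 57, 60, 73, 53, 56, 64, 66, 63, 53, 54, 60, 62, 74, 56, 62};  (* 2016 *)
groups = {pre, post};

k = Length[groups];               (* number of groups *)
nTot = Total[Length /@ groups];   (* total number of observations *)
grand = Mean[Flatten[groups]];    (* overall mean *)

(* Summary rows: count, sum, average, variance for each group *)
summary = {Length[#], Total[#], N[Mean[#]], N[Variance[#]]} & /@ groups;

(* ANOVA rows: sums of squares, degrees of freedom, F, p-value, critical F at the 5% level *)
ssBetween = Total[(Length[#] (Mean[#] - grand)^2) & /@ groups];
ssWithin = Total[((Length[#] - 1) Variance[#]) & /@ groups];
dfBetween = k - 1;
dfWithin = nTot - k;
f = (ssBetween/dfBetween)/(ssWithin/dfWithin);
p = SurvivalFunction[FRatioDistribution[dfBetween, dfWithin], f];
fCrit = InverseCDF[FRatioDistribution[dfBetween, dfWithin], 0.95];

N[{ssBetween, ssWithin, f, p, fCrit}]

(* Optional cross-check with the ANOVA` package; the {group, value} input form is an assumption *)
Needs["ANOVA`"];
ANOVA[Join[{1, #} & /@ pre, {2, #} & /@ post]]
```

The equality of means is rejected at the 5% level only when f exceeds fCrit, or equivalently when p is below 0.05.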
We can hereby summarize the following aspects about the ad-campaign.
- The ad campaign certainly did something, even though the mean sales do not seem to have been affected at all.
- The campaign affected the sales negatively in some regions. Example: Memphis, with a decrease of 13 million in sales.
- The campaign affected the sales positively in some regions. Example: St. Louis, with an increase of 12 million in sales.
- This means that the campaign has not been unsuccessful at all. It simply differed in its appeal to customers across various demographics (here the demographic segments being the various cities).
- The next step would be to analyse the effect of the ad campaign in detail and perhaps devise a plan tailored to a particular geographic region.
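The region-by-region picture behind these points can be sketched by subtracting the two columns of the sales table. The variable names are my own.

```mathematica
regions = {"Connecticut", "Detroit", "Kansas City", "Tulsa", "Dallas", "Austin",
   "Santa Cruz", "Las Vegas", "Seattle", "Manhattan", "Orlando", "Miami",
   "Maui", "Houston", "Tampa", "St. Louis", "Memphis", "Everglades"};
pre  = {67, 58, 60, 61, 59, 65, 61, 60, 62, 62, 62, 60, 63, 55, 59, 62, 69, 62};  (* 2015 *)
post = {66, 60, 69, 57, 60, 73, 53, 56, 64, 66, 63, 53, 54, 60, 62, 74, 56, 62};  (* 2016 *)

(* Change in sales (millions) per region, sorted from largest drop to largest gain *)
change = post - pre;
byChange = SortBy[Transpose[{regions, change}], Last];

First[byChange]  (* {"Memphis", -13} *)
Last[byChange]   (* {"St. Louis", 12} *)

(* Number of regions where sales rose and where they fell *)
{Count[change, _?Positive], Count[change, _?Negative]}  (* {10, 7} *)
```

Sorting by the change makes it easy to pick out the regions at either extreme for a follow-up, region-specific plan.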
I hope this example has been helpful. This is only one of the uses of ANOVA. Extensions of this technique can be used in a variety of other situations.
Source Code:
The Mathematica source code used for the analysis can be seen here.
End of the post