One-Way Analysis of Variance (ANOVA) – A visual guide

In this post, we will learn about one-way ANOVA and visualize the data as we go. Most of the work is done in Mathematica. As always, I will try to present the ideas with plenty of visual elements.


For any data analysis one needs to have data to work with. If you do not have data to work with, do not fret. I will explain how to generate the data and then how to perform various analyses on it.

Data Generation
Data Analysis

Some links to the code: The Wolfram documentation page for the ANOVA function is at http://reference.wolfram.com/language/ANOVA/ref/ANOVA.html


Data Generation

I used the following code to generate the random data. Each group is drawn from a BinormalDistribution with a varying X-mean* but a zero Y-mean*.

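nu = 100; (* number of points per group *)
n = 4;    (* X-mean of data2; an example value, changed on each iteration *)
data1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], nu];
data2 = RandomVariate[BinormalDistribution[{n, 0}, {1, 1}, 0], nu];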

Understanding the code: There are two data groups, named data1 and data2. The distribution of data1 is fixed, while the X-mean of data2 is changed on each iteration through n. nu is the number of points generated per group. That takes care of the data generation.

*X-mean is the mean x-position of all the points in the 2D space and Y-mean is the mean y-position of all the points in the 2D space.
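To check the footnote against actual points, a quick sketch like the one below compares the sample means with the requested ones and plots both groups. The seed, the value n = 4, and the plot styling are my own assumptions for illustration.

```mathematica
SeedRandom[1];  (* assumed seed, only so the run is repeatable *)
nu = 100;
n = 4;          (* assumed example value for the X-mean of data2 *)
data1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], nu];
data2 = RandomVariate[BinormalDistribution[{n, 0}, {1, 1}, 0], nu];

(* Mean of a list of {x, y} points returns {X-mean, Y-mean} *)
sampleMeans = {Mean[data1], Mean[data2]}

(* scatter plot of both groups in the 2D space *)
ListPlot[{data1, data2}, PlotStyle -> {Red, Green},
 AspectRatio -> Automatic, AxesLabel -> {"x", "y"}]
```

The sample means will be close to, but not exactly equal to, {2, 0} and {n, 0}, which is the difference between the given mean and the actual mean discussed further down.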


Data Analysis

Objective: To look at the F-value (the F statistic) and the p-value of the F-test on the X-values (X-coordinates) under the following conditions.

  1. Means Moving: What happens to the F-value and the p-value when one group moves away from the other (data-means moving away)?
  2. Changing variances: What happens to the F-value and the p-value when the standard deviations are changed?
  3. Changing Sample Sizes: What happens to the F-value and the p-value when the number of data points increases?
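
For reference, the F-value in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The notation below (k groups, n_i points in group i, N points in total) is my own and is not taken from the original code.

```latex
F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\sum_{i=1}^{k} n_i \,(\bar{x}_i - \bar{x})^2 \,/\, (k-1)}
         {\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N-k)}
```

The p-value is the upper-tail probability of this F-value under the F-distribution with k - 1 and N - k degrees of freedom.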

We will look at each one of them now. Given below is a key to the graphs.

anova_key

1. Means Moving:

Here, we will see what happens to the F-value and the p-value when the means change. As the means of the two groups separate, the F-value increases and the p-value falls: the data become less compatible with the hypothesis that the groups share the same mean, so the hypothesis of equality of the means can be rejected.
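
A minimal sketch of this sweep is given below. It computes F and p by hand rather than through the ANOVA package; the helper name oneWayF, the seed, and the range of n (2 to 6 in steps of 0.5) are assumptions of mine, not the settings used for the figures.

```mathematica
(* oneWayF: hypothetical helper returning {F, p} for a list of groups *)
oneWayF[groups_List] := Module[{k, nTot, grand, ssB, ssW, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   (* between-group and within-group sums of squares *)
   ssB = Total[Length[#] (Mean[#] - grand)^2 & /@ groups];
   ssW = Total[Total[(# - Mean[#])^2] & /@ groups];
   f = (ssB/(k - 1))/(ssW/(nTot - k));
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

SeedRandom[1];  (* assumed seed *)
nu = 100;
(* X-coordinates of the fixed group *)
x1 = RandomVariate[BinormalDistribution[{2, 0}, {1, 1}, 0], nu][[All, 1]];

(* move the X-mean of the second group away from the first *)
meansSweep = Table[
   With[{x2 = RandomVariate[
        BinormalDistribution[{n, 0}, {1, 1}, 0], nu][[All, 1]]},
    Prepend[oneWayF[{x1, x2}], n]],
   {n, 2., 6., 0.5}];

TableForm[meansSweep,
 TableHeadings -> {None, {"X-mean of data2", "F", "p"}}]
```

Each row holds the requested X-mean of data2, the F-value, and the p-value, so the trend described above can be read straight down the columns.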

Case A – given mean: In this image, the large points mark the input means used for random number generation.

anova_theo_mean

Case B – actual mean: In this image, the large points mark the actual means of the randomly generated points.

 

anova_mean

2. Changing variances

In the image below, we can see two distributions, one in red and one in green. Their means are separated and held constant. Their standard deviations (in the X-direction) decrease from 5 to 0.2 in steps of 0.2. We can also see that the F-value increases as the standard deviation decreases, indicating that the distributions become more distinct.

You can also see that the p-value decreases as the distributions become more distinct. What might this mean? The null hypothesis in ANOVA is that the means are equal. A low p-value says that data this far apart would be unlikely if the means really were equal, so the equality of the means can be rejected.
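
The same experiment can be sketched in a few lines. The standard deviations follow the 5 to 0.2 schedule described above; the helper name oneWayF, the seed, and the two fixed X-means (2 and 4) are my assumptions, since the exact values behind the figure are not stated.

```mathematica
(* oneWayF: hypothetical helper returning {F, p} for a list of groups *)
oneWayF[groups_List] := Module[{k, nTot, grand, ssB, ssW, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   ssB = Total[Length[#] (Mean[#] - grand)^2 & /@ groups];
   ssW = Total[Total[(# - Mean[#])^2] & /@ groups];
   f = (ssB/(k - 1))/(ssW/(nTot - k));
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

SeedRandom[1];  (* assumed seed *)
nu = 100;

(* X standard deviation s runs over 5., 4.8, ..., 0.2; means stay fixed *)
sdSweep = Table[
   With[{x1 = RandomVariate[
        BinormalDistribution[{2, 0}, {s, 1}, 0], nu][[All, 1]],
     x2 = RandomVariate[
        BinormalDistribution[{4, 0}, {s, 1}, 0], nu][[All, 1]]},
    Prepend[oneWayF[{x1, x2}], s]],
   {s, Range[25, 1, -1]/5.}];

(* F against the standard deviation, on a log scale *)
ListLogPlot[sdSweep[[All, {1, 2}]],
 AxesLabel -> {"standard deviation", "F"}]
```

Only the within-group spread changes here, so any growth in F comes from the shrinking denominator of the F ratio.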

anova_stndrd

3. Changing Sample Sizes

Now we will see what happens when the number of points within each distribution increases, so that each distribution becomes denser. We will look at the F-value and the p-value for the distributions. The important observation is that the F-value increases with the sample size, so the null hypothesis can be rejected more easily.
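
A sketch of the sample-size experiment follows. The helper name oneWayF, the seed, the two X-means (2 and 2.5), and the list of sample sizes are assumptions chosen for illustration only.

```mathematica
(* oneWayF: hypothetical helper returning {F, p} for a list of groups *)
oneWayF[groups_List] := Module[{k, nTot, grand, ssB, ssW, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   ssB = Total[Length[#] (Mean[#] - grand)^2 & /@ groups];
   ssW = Total[Total[(# - Mean[#])^2] & /@ groups];
   f = (ssB/(k - 1))/(ssW/(nTot - k));
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

SeedRandom[1];  (* assumed seed *)

(* same two distributions each time; only the points per group change *)
sizeSweep = Table[
   With[{x1 = RandomVariate[
        BinormalDistribution[{2, 0}, {1, 1}, 0], nu][[All, 1]],
     x2 = RandomVariate[
        BinormalDistribution[{2.5, 0}, {1, 1}, 0], nu][[All, 1]]},
    Prepend[oneWayF[{x1, x2}], nu]],
   {nu, {10, 20, 50, 100, 200, 500, 1000}}];

TableForm[sizeSweep,
 TableHeadings -> {None, {"points per group", "F", "p"}}]
```

Because the draws are random, individual rows can buck the trend, especially at small sample sizes; the overall rise in F is what matters.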

 

anova_density


Understanding the application of ANOVA, with a case study


cereal-box-5-1


Graphic generated by Mr. Chandra Sekhar for fermibot.wordpress.com


Now that we have some idea of how ANOVA works, we will see an example of how to use it in a practical situation. Assume we are a marketing firm observing the sales of a product sold as cereal-O, manufactured by the fictitious company Fermibot Mills. The events unfolded as follows.

  • Last year the top management at Fermibot Mills was dissatisfied with the low sales volume. The forecasting department was way off in its predictions, as this was a new product and there was no prior information to extrapolate from.
  • Towards the end of the last fiscal year, the management decided to launch an advertising campaign, hoping to boost sales this year.
  • The data across the key regions has been collected, and it is our job as analysts to see whether the ad campaign has really been useful. The data is tabulated below.
Sales in Millions
Region        2015(Pre Ad)   2016(Post Ad)
Connecticut   67             66
Detroit       58             60
Kansas City   60             69
Tulsa         61             57
Dallas        59             60
Austin        65             73
Santa Cruz    61             53
Las Vegas     60             56
Seattle       62             64
Manhattan     62             66
Orlando       62             63
Miami         60             53
Maui          63             54
Houston       55             60
Tampa         59             62
St. Louis     62             74
Memphis       69             56
Everglades    62             62
Mean Sales    61.50          61.56

One might say that the ad campaign has done nothing whatsoever, but the ANOVA result says otherwise. Let’s have a look.

Groups          Count   Sum    Average    Variance
2015(Pre Ad)    18      1084   60.22222   16.18301
2016(Post Ad)   18      1090   60.55556   48.84967

ANOVA
Source of Variation   SS         df   MS         F          P-value      F crit
Between Groups        262.3333   1    262.3333   8.067739   0.00755717   4.130018
Within Groups         1105.556   34   32.51634
Total                 1367.889   35

We need to look at the following items in the table. The F value (8.067739) is greater than the critical value of 4.130018. This means that the equality of the means is rejected. What happened?

The means of the two columns (seen as 61.50 and 61.56) suggest that there has been no change in sales after the ad campaign. But the means fail to say one thing: they say nothing about the variability within the data. If we go back to the ANOVA table, we see that the Between Groups mean square (262.3333) is high compared with the Within Groups mean square (32.51634). Also note that the variance in sales for 2016 has roughly tripled to 48.85, from 16.18 the previous year.
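
For readers who want to recompute the table, here is a minimal sketch applied to the 18 regional figures exactly as tabulated above. The helper name oneWayF is an assumption of mine. Note that the tabulated columns sum to 1107 and 1108, whereas the summary quotes sums of 1084 and 1090, so the recomputed F and p-value should not be expected to reproduce the quoted ones.

```mathematica
(* oneWayF: hypothetical helper returning {F, p} for a list of groups *)
oneWayF[groups_List] := Module[{k, nTot, grand, ssB, ssW, f},
   k = Length[groups];
   nTot = Total[Length /@ groups];
   grand = Mean[Flatten[groups]];
   ssB = Total[Length[#] (Mean[#] - grand)^2 & /@ groups];
   ssW = Total[Total[(# - Mean[#])^2] & /@ groups];
   f = (ssB/(k - 1))/(ssW/(nTot - k));
   {f, SurvivalFunction[FRatioDistribution[k - 1, nTot - k], f]}];

(* sales in millions, in the row order of the table *)
pre = {67, 58, 60, 61, 59, 65, 61, 60, 62, 62, 62, 60, 63, 55, 59, 62, 69, 62};
post = {66, 60, 69, 57, 60, 73, 53, 56, 64, 66, 63, 53, 54, 60, 62, 74, 56, 62};

{Total[pre], Total[post]}            (* column sums *)
N[{Mean[pre], Mean[post]}]           (* column means *)
N[{Variance[pre], Variance[post]}]   (* sample variances *)

(* F and p for the two columns; N avoids exact rational arithmetic *)
{fValue, pValue} = oneWayF[N[{pre, post}]]

(* 5% critical value for 1 and 34 degrees of freedom *)
fCrit = InverseCDF[FRatioDistribution[1, 34], 0.95]
```

Comparing fValue with fCrit is the same decision rule used in the text: the equality of the means is rejected only when fValue exceeds fCrit.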

We can hereby summarize the following aspects about the ad-campaign.

  • The ad campaign certainly did something, even though the mean sales do not seem to have been affected at all.
  • The campaign affected the sales negatively in some regions. Example: Memphis, with a decrease of 13 million in sales.
  • The campaign affected the sales positively in some regions. Example: St. Louis, with an increase of 12 million in sales.
  • This means that the campaign has not been a failure. It simply differed in its appeal to customers across various demographics (here the demographic segments being the various cities).
  • The next step would be to analyse the details of the effect of the ad campaign and perhaps devise a plan specific to each geographic region.

I hope this example has been helpful. This is only one of the uses of ANOVA. Extensions of this technique can be used in a variety of other situations.


Source Code:

The Mathematica source code used for the analysis can be seen here.


End of the post
