Central limit theorem - Approximation

Probability and Statistical Inference - 09

Posted by Zekun on November 12, 2019

Import the packages first.

library(magrittr)
library(sn)

Situation Description

The central limit theorem is an important computational short-cut for generating, and making inference from, the sampling distribution of the mean. Recall that this short-cut relies on a number of conditions, specifically:

  1. Independent observations
  2. Identically distributed observations
  3. Mean and variance exist
  4. Sample size large enough for convergence

In this simulation study, I'm going to compare the sampling distribution of the mean generated by simulation to the sampling distribution implied by the central limit theorem. I will compare the two distributions graphically in QQ-plots.

This will be a 4 × 4 factorial experiment. The first factor will be the sample size, with N = 5, 10, 20, and 40. The second factor will be the degree of skewness in the underlying distribution. The underlying distribution will be the Skew-Normal distribution. The Skew-Normal distribution has three parameters: location \(\xi\), scale \(\omega\), and slant \(\alpha\). When the slant parameter is 0, the distribution reverts to the normal distribution. As the slant parameter increases, the distribution becomes increasingly skewed. In this simulation, the slant will be set to 0, 2, 10, 100. Set location and scale to 0 and 1, respectively, for all simulation settings.
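To give a rough sense of how skewed each setting is, the sketch below evaluates the standard closed-form skewness of the Skew-Normal distribution at the four slants. The helper name `sn_skewness` is hypothetical (it is not a function from the `sn` package), and only base R is used.

```r
# Skewness of the Skew-Normal distribution as a function of the slant alpha.
# sn_skewness is a hypothetical helper name, not part of the sn package.
sn_skewness <- function(alpha) {
  delta <- alpha / sqrt(1 + alpha^2)
  m <- delta * sqrt(2 / pi)              # mean when location = 0 and scale = 1
  ((4 - pi) / 2) * m^3 / (1 - m^2)^(3 / 2)
}

slants <- c(0, 2, 10, 100)
skew <- sn_skewness(slants)
print(round(setNames(skew, paste0("slant=", slants)), 3))
```

The skewness is 0 at slant 0 and grows with the slant, but it stays bounded just below 1 (about 0.995), so slants 10 and 100 are already close to the most skewed case this family can produce.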

Plot preparation

First, set the parameters that stay fixed throughout the study. The slant of the Skew-Normal distribution varies between settings, so only the number of simulation replicates R, the location \(\xi\), and the scale \(\omega\) are set here.

R <- 5000
location <- 0
scale <- 1

location: \(\xi\), scale: \(\omega\), slant: \(\alpha\)

Before creating the function, let’s write down the formulas for delta and for the mean and standard deviation of the Skew-Normal distribution, which the central limit theorem (CLT) approximation needs.

  1. Delta: \(\delta =\frac{\alpha}{\sqrt{1+\alpha^2}}\)

  2. Mean: \(\xi+\omega \delta \sqrt{\frac{2}{\pi}}\)

  3. Standard deviation: \(\sqrt{\omega^2(1-\frac{2\delta^2}{\pi}) }\)
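As a quick sanity check on these formulas, the sketch below compares them with empirical moments. To stay in base R it does not call `rsn`; instead it assumes the well-known stochastic representation of a standard Skew-Normal draw, \(\delta|U_0| + \sqrt{1-\delta^2}\,U_1\) with independent standard normal \(U_0\) and \(U_1\). The seed, the slant of 10, and the sample size are arbitrary illustrative choices.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
alpha <- 10                        # slant used for this check (assumption)
delta <- alpha / sqrt(1 + alpha^2)

# Stochastic representation of a Skew-Normal draw with xi = 0, omega = 1:
# delta * |U0| + sqrt(1 - delta^2) * U1, with U0, U1 independent N(0, 1)
n  <- 1e5
u0 <- rnorm(n)
u1 <- rnorm(n)
draws <- delta * abs(u0) + sqrt(1 - delta^2) * u1

# Formulas from the list above, with xi = 0 and omega = 1
formula_mean <- delta * sqrt(2 / pi)
formula_sd   <- sqrt(1 - 2 * delta^2 / pi)

print(c(empirical = mean(draws), formula = formula_mean))
print(c(empirical = sd(draws),   formula = formula_sd))
```

With this many draws the empirical and formula values should agree to roughly two decimal places.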

Then define a function that uses these formulas to build both sampling distributions, the CLT approximation and the simulated one, and draws their QQ-plot.

qqplot_creator <- function(slant, N) {
  delta <- slant / (sqrt(1 + slant ^ 2))

  # Population mean and standard deviation of the Skew-Normal distribution
  pop_mean <- location + scale * delta * (sqrt(2 / pi))
  pop_sd <- sqrt(scale ^ 2 * (1 - ((2 * delta ^ 2) / pi)))

  Z <- rnorm(R) # R standard normal draws, the base for the CLT approximation

  # CLT approximation: sample means are approximately N(pop_mean, pop_sd^2 / N)
  sample_dist_clt <- Z * (pop_sd / sqrt(N)) + pop_mean

  # Simulation approximation: R samples of size N, one sample per row
  random.skew <- array(rsn(R * N, xi = location, omega = scale, alpha = slant),
                      dim = c(R, N))

  sample_dist_sim <- apply(random.skew, 1, mean)

  qqplot(sample_dist_clt, sample_dist_sim, axes = FALSE, frame.plot=TRUE, ann = FALSE)
  abline(0,1)

  }
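The function above represents the CLT distribution with R random normal draws, so both axes of each QQ-plot carry Monte Carlo noise. One possible variation, not what the function above does, is to use exact normal quantiles on a regular probability grid. The sketch below assumes a single illustrative setting (slant 10, N = 5, location 0, scale 1) and uses only base R.

```r
# Assumed settings for illustration only
slant <- 10
N <- 5
R <- 5000

delta    <- slant / sqrt(1 + slant^2)
pop_mean <- delta * sqrt(2 / pi)           # location = 0, scale = 1
pop_sd   <- sqrt(1 - 2 * delta^2 / pi)

# Exact quantiles of N(pop_mean, pop_sd^2 / N) at R evenly spaced
# probabilities, in place of the random draws from rnorm(R)
clt_quantiles <- qnorm(ppoints(R), mean = pop_mean, sd = pop_sd / sqrt(N))

print(range(clt_quantiles))
```

Passing `clt_quantiles` as the first argument of `qqplot` in place of `sample_dist_clt` would leave only the simulated sample means as a source of random variation.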

QQplot generation

Now we can set the slants and sample sizes to test. As described above, N = 5, 10, 20, and 40, and the slant is 0, 2, 10, and 100. Then create a grid of x values at which to evaluate the density of each underlying distribution.

slant <- c(0,2,10,100)
N <- c(5,10,20,40)
x <- seq(-2,2,0.01)

Set up a 4 × 5 plotting grid to hold everything: each row starts with the density of the underlying distribution, followed by the four QQ-plots produced by the qqplot_creator function.

par(mfrow=c(4,5),mai=c(0.1,0.1,0.1,0.1), oma = c(0, 4, 4, 0))

for(i in slant){
  plot(dsn(x,
           xi = location,
           omega = scale,
           alpha = i),
       axes = FALSE,
       frame.plot=TRUE,
       type = "l",
       xlab = NA, ylab = NA)
  for(j in N){
    qqplot_creator(i, j)
  }
}
mtext(text="Distribution              N=5                N=10                   N=20                    N=40",
      side = 3,
      outer = TRUE)
mtext(text="slant = 100   slant = 10       slant = 2      slant = 0",
      side = 2,
      outer = TRUE)

Conclusion

As N grows, the points in the QQ-plots lie closer to the y = x line, which means the CLT approximation to the sampling distribution of the mean improves with sample size. As the slant grows, in other words as the Skew-Normal distribution becomes more skewed, the CLT approximation is worse at a given N, so a larger sample is needed for it to work well.
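One way to put rough numbers on this pattern: for independent, identically distributed observations, the skewness of the sample mean equals the population skewness divided by \(\sqrt{N}\). The sketch below tabulates that quantity for the 16 settings, using the closed-form Skew-Normal skewness; the names `pop_skewness` and `skew_of_mean` are hypothetical, and the table is a theoretical summary, not output from the simulation above.

```r
slants <- c(0, 2, 10, 100)
Ns <- c(5, 10, 20, 40)

# Closed-form skewness of the Skew-Normal population for each slant
delta <- slants / sqrt(1 + slants^2)
m <- delta * sqrt(2 / pi)
pop_skewness <- ((4 - pi) / 2) * m^3 / (1 - m^2)^(3 / 2)

# Skewness of the mean of N iid observations: population skewness / sqrt(N)
skew_of_mean <- outer(pop_skewness, Ns, function(g, n) g / sqrt(n))
dimnames(skew_of_mean) <- list(paste0("slant=", slants), paste0("N=", Ns))

print(round(skew_of_mean, 3))
```

Values near 0 indicate a sampling distribution that is close to symmetric, as the normal approximation assumes. The table shrinks along each row as N grows and rises down each column as the slant grows.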