Central Limit Theorem: Worked Examples and Simulation
A focused guide to the Central Limit Theorem with multiple worked sampling examples at n = 5, 30, and 100 from skewed and uniform populations, demonstrating convergence of the sample mean to normality. Includes the connection to bootstrap methods.
What You'll Learn
- ✓ State the Central Limit Theorem precisely with parameters and conditions
- ✓ Compute sampling distributions of x̄ for various n and underlying distributions
- ✓ Recognize when n is sufficient for the CLT to apply (skewed populations need more)
- ✓ Connect the CLT to standard error formulas used in t-tests and z-tests
- ✓ Apply the CLT to bootstrap inference for non-normal data
1. Direct Answer: What CLT Says
The Central Limit Theorem says that as the sample size n grows, the sampling distribution of the sample mean x̄ approaches a normal distribution with mean μ (the population mean) and standard deviation σ/√n (the standard error), regardless of the shape of the underlying population. The CLT is the bridge that lets us use normal-based inferential procedures (z-tests, t-tests, confidence intervals) even when the underlying data are not normal. The conventional rule of thumb is n ≥ 30 for most distributions, but severely skewed populations may need n closer to 100; the shape of the underlying population determines how fast convergence happens.
Key Points
- • Sample mean x̄ → Normal(μ, σ²/n) as n grows
- • CLT works regardless of underlying distribution shape
- • Standard error = σ/√n
- • Rule of thumb: n ≥ 30 for moderate skew; larger for extreme skew
- • CLT enables z-tests, t-tests, and confidence intervals on non-normal data
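The 1/√n scaling in the key points above can be verified in a few lines; σ = 2 here is an illustrative value, not tied to a particular population.

```python
# Standard error SE = sigma / sqrt(n): quadrupling n halves the SE.
# sigma = 2.0 is an illustrative population standard deviation.
sigma = 2.0
for n in (25, 100, 400):
    se = sigma / n ** 0.5
    print(f"n = {n:3d}  SE = {se:.3f}")
# Each 4x increase in n halves the SE: 0.400, 0.200, 0.100.
```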
2. Worked Example 1: Right-Skewed Population (Exponential)
Consider an exponential population (heavily right-skewed) with rate parameter λ = 0.5, so μ = 1/λ = 2 and σ = 2. We draw samples of various sizes and compute the sample mean.
- At n = 5: the sampling distribution of x̄ has mean 2 and SE = 2/√5 ≈ 0.894. The shape is still noticeably right-skewed at this small sample size; the CLT has not yet fully kicked in.
- At n = 30: mean = 2, SE = 2/√30 ≈ 0.365. The distribution is now much closer to normal, with slight residual right skew. Most CLT-based procedures will work reasonably well here.
- At n = 100: mean = 2, SE = 2/√100 = 0.20. The distribution is essentially normal; the exponential origin is completely "forgotten" at the sample-mean level.
Probability calculation. What is P(x̄ > 2.5) for n = 50? SE = 2/√50 ≈ 0.283, so Z = (2.5 − 2)/0.283 ≈ 1.768 and P(Z > 1.768) = 1 − 0.9615 = 0.039. There is about a 3.9% chance of observing a sample mean above 2.5, even though individual values from the exponential distribution easily exceed 2.5.
Key Points
- • Exponential is heavily right-skewed (mean = standard deviation = 1/λ)
- • At n = 5, the sample-mean distribution still shows skew
- • At n = 30, the distribution is approximately normal for most purposes
- • At n = 100, the distribution is indistinguishable from normal
- • Standard error decreases as 1/√n: quadrupling n halves the SE
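The probability calculation above can be reproduced with only the standard library; the `normal_cdf` helper below is our own wrapper around `math.erf`, not a library function.

```python
# Normal-approximation calculation from the worked example:
# P(xbar > 2.5) for n = 50 draws from a population with mu = 2, sigma = 2.
import math

def normal_cdf(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 2.0, 2.0, 50
se = sigma / math.sqrt(n)        # 2 / sqrt(50) ≈ 0.283
z = (2.5 - mu) / se              # ≈ 1.768
p = 1 - normal_cdf(z)            # ≈ 0.039
print(f"SE = {se:.3f}, Z = {z:.3f}, P(xbar > 2.5) ≈ {p:.3f}")
```

The same helper handles left-tail questions such as the uniform example: use `normal_cdf(z)` directly for P(Z < z).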
3. Worked Example 2: Uniform Population
A uniform distribution between 0 and 1 has μ = 0.5 and σ² = 1/12, so σ ≈ 0.289. The uniform is symmetric (no skew), so CLT convergence is fast.
- At n = 5: SE = 0.289/√5 ≈ 0.129. The sampling distribution is already approximately normal; the uniform parent has no skew to overcome.
- At n = 30: SE = 0.289/√30 ≈ 0.053.
- At n = 100: SE = 0.289/√100 ≈ 0.029.
For symmetric populations, n = 5 to 10 is often sufficient for the CLT approximation to work well in practice. The "n ≥ 30" textbook rule of thumb reflects worst-case skewed populations, not all populations.
Probability calculation. What is P(x̄ < 0.45) for n = 100? SE ≈ 0.029, so Z = (0.45 − 0.5)/0.029 ≈ −1.724 and P(Z < −1.724) ≈ 0.042. There is about a 4.2% chance of observing a sample mean below 0.45, even though about 45% of individual uniform draws fall below 0.45.
Key Points
- • Uniform distribution: no skew → fast CLT convergence
- • For symmetric populations, n = 5–10 is often sufficient
- • For skewed populations, n = 30+ is needed; severely skewed, n = 100+
- • The "n ≥ 30" rule of thumb reflects the worst case, not the typical case
- • The sample-mean distribution narrows rapidly: SE = σ/√n
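The uniform example is easy to verify by simulation, as the guide's title suggests. A quick Monte Carlo check (the seed and replication count are arbitrary choices):

```python
# Monte Carlo check of the uniform worked example:
# estimate P(xbar < 0.45) for samples of n = 100 Uniform(0, 1) draws.
import random
import statistics

random.seed(1)
n, reps = 100, 20_000
below = sum(
    statistics.fmean(random.random() for _ in range(n)) < 0.45
    for _ in range(reps)
)
# Should land close to the normal-approximation answer of about 0.042.
print(f"simulated P(xbar < 0.45) ≈ {below / reps:.3f}")
```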
4. Connection to Standard Error Formulas
The CLT is what justifies the standard error formulas used in inferential statistics.
- Sample mean: SE(x̄) = σ/√n (the CLT directly).
- Sample proportion p̂ from binary data: SE(p̂) = √(p(1−p)/n). This comes from the CLT applied to the binomial: the sample proportion is a sample mean of 0/1 indicator variables. The approximation works when np > 10 and n(1−p) > 10 (the "success-failure condition").
- Difference of two means: SE(x̄_1 − x̄_2) = √(σ²_1/n_1 + σ²_2/n_2). This is the formula behind two-sample t-tests; it follows from the independence of the two samples and the CLT applied to each.
- Regression coefficients: the CLT applied to linear combinations of residuals gives normal-approximate distributions for slope estimates, justifying t-tests on slopes.
Without the CLT, none of these formulas would work for non-normal data. The reason a sample of n = 50 from a heavily skewed population still produces a valid t-test on the mean is precisely that the CLT has made the sample-mean distribution approximately normal.
Key Points
- • SE(x̄) = σ/√n: a direct CLT result
- • SE(p̂) = √(p(1−p)/n): CLT applied to binary data; needs np > 10 and n(1−p) > 10
- • SE(x̄_1 − x̄_2) = √(σ²_1/n_1 + σ²_2/n_2): the two-sample t-test foundation
- • The CLT justifies normal-approximate distributions for regression coefficients
- • Without the CLT, inferential procedures would fail on non-normal data
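The three formulas can be collected into a small sketch; the numeric inputs below are hypothetical, chosen only for illustration:

```python
# The standard-error formulas from this section, evaluated for
# illustrative (hypothetical) numbers.
import math

# Sample mean: SE = sigma / sqrt(n)
se_mean = 2.0 / math.sqrt(50)                  # ≈ 0.283

# Sample proportion: SE = sqrt(p(1-p)/n), with the success-failure check
p, n = 0.30, 100
assert n * p > 10 and n * (1 - p) > 10         # success-failure condition
se_prop = math.sqrt(p * (1 - p) / n)           # ≈ 0.046

# Difference of two means: SE = sqrt(s1^2/n1 + s2^2/n2)
s1, n1, s2, n2 = 15.0, 36, 12.0, 49
se_diff = math.sqrt(s1**2 / n1 + s2**2 / n2)   # ≈ 3.03

print(f"SE(mean) = {se_mean:.3f}, SE(prop) = {se_prop:.3f}, "
      f"SE(diff) = {se_diff:.3f}")
```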
5. Bootstrap Methods and CLT
The bootstrap is a resampling technique for estimating sampling distributions when analytical formulas are unavailable or assumptions are violated. The procedure: resample n observations WITH replacement from the original sample, compute the statistic of interest, repeat thousands of times, and use the empirical distribution of bootstrap statistics as the sampling distribution.
Why the bootstrap works: the CLT applies to the bootstrap distribution of x̄ (and many other statistics) under mild conditions. The bootstrap distribution converges to the sampling distribution of the statistic, which by the CLT is approximately normal for large n. This is why bootstrap confidence intervals match analytical confidence intervals when the analytical ones are valid.
The bootstrap is especially useful when:
- The statistic has no closed-form sampling distribution (e.g., median, trimmed mean, ratio of medians).
- The sample size is small but the statistic is asymptotically normal.
- The data has unusual structure (clustered, censored) that standard formulas do not handle.
Limitation: the bootstrap does NOT rescue you from severely skewed small samples where the CLT has not yet kicked in. If n is too small for analytical CLT-based inference, it is too small for bootstrap inference too.
Key Points
- • Bootstrap: resample n observations WITH replacement, compute the statistic, repeat thousands of times
- • The bootstrap distribution → the sampling distribution under mild conditions
- • The CLT justifies bootstrap normality for many statistics
- • Useful when analytical formulas are unavailable
- • Does NOT rescue severely skewed small samples where the CLT has not kicked in
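The resample-with-replacement loop described above can be sketched for a statistic with no closed-form sampling distribution (the median). The data here are simulated, and the seed and replication count are arbitrary choices:

```python
# Percentile-bootstrap sketch: 95% CI for the median of a small
# right-skewed sample. The data are made up (simulated exponential draws).
import random
import statistics

random.seed(7)
data = [random.expovariate(0.5) for _ in range(40)]   # hypothetical sample

boot_medians = []
for _ in range(5000):
    resample = random.choices(data, k=len(data))      # WITH replacement
    boot_medians.append(statistics.median(resample))

boot_medians.sort()
lo = boot_medians[int(0.025 * len(boot_medians))]
hi = boot_medians[int(0.975 * len(boot_medians))]
print(f"sample median = {statistics.median(data):.2f}, "
      f"95% bootstrap CI ≈ ({lo:.2f}, {hi:.2f})")
```

The percentile interval used here is the simplest bootstrap CI; it inherits the CLT's limitation that heavy skew at small n distorts the interval.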
6. How StatsIQ Helps With CLT Problems
CLT problems span every introductory and intermediate statistics course, the AP Statistics exam, and most applied research design discussions. The key skills are recognizing when CLT applies, computing the sampling-distribution parameters (mean μ, SE = σ/√n), and using them for probability or interval calculations. Snap a photo of any CLT problem and StatsIQ identifies the population parameters, computes SE, evaluates the n requirement, and produces the requested probability with the area under the sampling-distribution curve visualized. For multi-part problems involving sample-size planning or bootstrap inference, StatsIQ chains the steps. This content is for educational purposes only and does not constitute statistical advice.
Key Points
- • Identifies population parameters (μ, σ) from the problem
- • Computes SE = σ/√n correctly
- • Evaluates whether n is sufficient given the population shape
- • Produces the final probability with the sampling-distribution area visualized
- • Useful for AP Statistics, intro stats, and applied research design
Key Takeaways
- ★ CLT: x̄ → Normal(μ, σ²/n) as n grows, regardless of the population distribution
- ★ Standard error = σ/√n
- ★ Rule of thumb: n ≥ 30; severely skewed populations need n closer to 100
- ★ Symmetric populations (uniform): CLT convergence at n ≈ 5–10
- ★ Sample proportion: SE = √(p(1−p)/n); needs np > 10 and n(1−p) > 10
- ★ Two-sample SE: SE(x̄_1 − x̄_2) = √(σ²_1/n_1 + σ²_2/n_2)
- ★ Bootstrap inference relies on the CLT for many statistics
- ★ The CLT is why z-tests and t-tests work on non-normal data
- ★ Quadrupling n halves the SE (since SE ∝ 1/√n)
- ★ Without the CLT, inferential statistics could not handle non-normal populations
Practice Questions
1. A right-skewed population has μ = 50, σ = 15. For a sample of n = 36, what is the sampling distribution of x̄?
2. For the population in question 1, P(x̄ > 53)?
3. A binary survey: p = 0.30. For n = 100, what is the sampling distribution of p̂?
4. Why does the CLT need n ≥ 30 for skewed populations but only n ≈ 5–10 for symmetric ones?
FAQs
Common questions about this topic
Does the CLT apply to every distribution?
The CLT applies to any population with finite variance. The Cauchy distribution, along with a few other heavy-tailed distributions, has undefined or infinite variance, so the CLT does NOT apply: sample means do not converge to normal. For most practical purposes, distributions encountered in applied work have finite variance, so the CLT applies; the differences are only in HOW FAST convergence happens (skewness controls the rate).
How large does n need to be?
It depends on the population shape. Symmetric populations: n = 5–10 often suffices. Moderately skewed: n = 30 (the textbook rule). Severely skewed (exponential, lognormal): n = 100+ may be needed for the sampling distribution to be indistinguishable from normal. When in doubt, simulate: take many samples of various sizes from your data, compute their means, and visualize the result.
What is the difference between the standard deviation and the standard error?
The population standard deviation σ measures the variability of individual observations. The standard error SE measures the variability of a statistic (typically a sample mean); it equals σ/√n. SE shrinks as the sample size grows; σ does not. Confidence intervals and t-tests use SE because they describe uncertainty about the parameter estimate, not the variability of individual observations.
How does the bootstrap relate to the CLT?
The bootstrap resamples from the original sample many times and computes the statistic for each resample. By the CLT applied to the bootstrap process, the resulting distribution converges to the sampling distribution of the statistic. Bootstrap confidence intervals are then constructed from the percentiles of the bootstrap distribution. The bootstrap is especially valuable when analytical sampling-distribution formulas are unavailable.
Can StatsIQ solve CLT problems?
Yes. Snap a photo of any sampling-distribution or CLT problem and StatsIQ identifies the population parameters, computes SE, evaluates whether n is sufficient given the population shape, and produces the requested probability or confidence interval with the sampling-distribution curve visualized. This content is for educational purposes only and does not constitute statistical advice.