Lecture Notes: Chi-Square (χ²) Tests

Welcome to this comprehensive guide on Chi‑Square (χ²) tests, essential tools in statistics for analyzing categorical data. These notes cover the fundamental concepts, the core formula, manual calculation steps, different types of Chi-Square tests (goodness-of-fit, independence, homogeneity), practical examples with R code snippets, and an interactive visualization to help you understand the Chi-Square distribution itself. We'll explore how to compare observed frequencies with expected frequencies to perform hypothesis testing.

The Chi-Square (χ²) Test Statistic Formula

χ² = Σ [ (O - E)² / E ]
  Where:
  • O represents the Observed frequencies (the actual counts in your data).
  • E represents the Expected frequencies (the counts you would anticipate if your null hypothesis were true).
  • Σ means to sum the values calculated for each category or cell in your table.

This formula quantifies the discrepancy between your observed data and what you expected under the null hypothesis. A larger χ² value indicates a greater difference, suggesting the observed data may not fit the expected pattern.
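As a quick sanity check, the formula translates directly into one line of R. This is a minimal sketch using hypothetical observed counts for four categories and a hypothetical null of equal proportions:

```r
# Hypothetical observed counts for four categories (100 observations total)
O <- c(30, 20, 25, 25)

# Expected counts under a hypothetical null of equal proportions: 100 / 4 = 25 each
E <- rep(sum(O) / length(O), length(O))

# Chi-square statistic: sum (O - E)^2 / E over all categories
chi_sq <- sum((O - E)^2 / E)
print(chi_sq)  # (25 + 25 + 0 + 0) / 25 = 2
```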

Manual Calculation Example: Goodness-of-Fit (Fair Dice Roll)

Let's walk through calculating the χ² statistic by hand. This helps solidify understanding before relying on software.

Scenario

We roll a standard six-sided die 120 times to test if it's fair. A fair die means each face (1 to 6) should appear with equal probability (1/6).

Null Hypothesis (H₀): The die is fair (P(1)=P(2)=...=P(6)=1/6).

Alternative Hypothesis (H₁): The die is not fair (the probabilities are different).

Observed Data (O): Suppose we observed the following counts after 120 rolls:

  • Face 1: 18 times
  • Face 2: 25 times
  • Face 3: 15 times
  • Face 4: 22 times
  • Face 5: 28 times
  • Face 6: 12 times
  • (Total Rolls = 18+25+15+22+28+12 = 120)

Step-by-Step Calculation

  1. Calculate Expected Frequencies (E): If the die is fair (H₀ is true), we expect each face to appear:
    Total Rolls * P(Face) = 120 * (1/6) = 20 times. So, E = 20 for all faces.
  2. Set up the Calculation Table:

     Face  | Observed (O) | Expected (E) | O - E | (O - E)² | (O - E)² / E
     ------|--------------|--------------|-------|----------|--------------
       1   |      18      |      20      |  -2   |     4    | 4 / 20 = 0.20
       2   |      25      |      20      |   5   |    25    | 25 / 20 = 1.25
       3   |      15      |      20      |  -5   |    25    | 25 / 20 = 1.25
       4   |      22      |      20      |   2   |     4    | 4 / 20 = 0.20
       5   |      28      |      20      |   8   |    64    | 64 / 20 = 3.20
       6   |      12      |      20      |  -8   |    64    | 64 / 20 = 3.20
     Total |              |              |       |          | χ² = 9.30

  3. Calculate the χ² Statistic: Sum the values in the last column:
    χ² = 0.20 + 1.25 + 1.25 + 0.20 + 3.20 + 3.20 = 9.30
  4. Determine Degrees of Freedom (df): For a Goodness-of-Fit test, df = k - 1, where k is the number of categories (faces).
    df = 6 - 1 = 5

Next Step (Interpretation)

We would compare our calculated χ² value (9.30) with a critical value from the Chi-Square distribution table using df=5 and a chosen significance level (e.g., α = 0.05). The critical value for χ²(df=5, α=0.05) is approximately 11.07. Since our calculated value (9.30) is less than the critical value (11.07), we fail to reject H₀. Alternatively, the p-value associated with χ²=9.30 and df=5 is approximately 0.097, which is greater than 0.05. Conclusion: Based on this data, there is not enough statistical evidence to conclude that the die is unfair.
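These steps can be verified in R: `chisq.test` reproduces the statistic and p-value, while `qchisq` returns the critical value used above. A minimal sketch:

```r
dice_counts <- c(18, 25, 15, 22, 28, 12)  # observed rolls for faces 1-6

# Goodness-of-fit test against equal probabilities (the default)
dice_result <- chisq.test(dice_counts)
print(dice_result)  # X-squared = 9.3, df = 5, p-value ~ 0.098

# Critical value for alpha = 0.05 and df = 5
qchisq(0.95, df = 5)  # ~ 11.07

# p-value for the observed statistic (upper-tail area)
pchisq(9.3, df = 5, lower.tail = FALSE)  # ~ 0.098
```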

Interactive Chi-Square Distribution Explorer

The tool below allows you to explore the Chi‑Square distribution interactively. Understanding the shape of this distribution and how it changes with degrees of freedom (df) is key to interpreting test results and p-values. Adjust the df and critical value to see their impact.

The Chi-Square (χ²) distribution is fundamental for goodness-of-fit tests and tests of independence. Its shape is defined by the degrees of freedom (df). The distribution is always non-negative and right-skewed, especially for low df. As df increases, the curve spreads out and becomes more symmetrical, resembling a normal distribution.

Use the slider to adjust the degrees of freedom (df) and see how the curve changes. Input a Critical Chi-Square Value (often found from tables or software based on your significance level α) to visualize the corresponding p-value (the area shaded to the right). Hover over the graph to see the probability density (PDF) for specific χ² values. The distribution's mean (df) and mode (df-2, for df>2) are also marked.
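Without the interactive tool, the same quantities can be explored in R with the built-in distribution functions `dchisq` (density), `pchisq` (cumulative probability), and `qchisq` (quantiles). The settings below mirror the dice example (df = 5, critical value ≈ 11.07):

```r
deg_f <- 5
crit <- qchisq(0.95, deg_f)  # critical value for alpha = 0.05, ~ 11.07

# Density (PDF) at a few chi-square values, as shown when hovering the graph
dchisq(c(1, 3, 5, 10), deg_f)

# Area to the right of the critical value (the shaded region) is alpha = 0.05
pchisq(crit, deg_f, lower.tail = FALSE)

# Mean and mode of the distribution, as marked on the graph
mean_chisq <- deg_f      # mean = df
mode_chisq <- deg_f - 2  # mode = df - 2 (for df > 2)
```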

[Interactive explorer snapshot: df = 5, critical value = 11.07; mean = 5, mode = 3.]

Usage of Chi-Square Tests

Chi-Square tests are versatile statistical methods for analyzing categorical data. Their main applications include:

  • Goodness-of-Fit Test: Tests if the observed frequency distribution of a single categorical variable matches a specific, expected distribution. For example, are M&M colors distributed according to the company's claimed percentages, or is a die fair?
  • Test of Independence: Tests whether two categorical variables collected from a single sample are associated or independent. For example, is there a relationship between smoking status and lung cancer incidence in a population sample?
  • Test of Homogeneity: Compares the distribution of a categorical variable across two or more different populations or groups to see if the distributions are the same. For example, do different teaching methods lead to the same distribution of student grades (A, B, C, etc.)?

Types of Chi-Square Tests

The two primary types of Pearson’s chi-square (χ²) tests commonly encountered are:

  • χ² Goodness-of-Fit Test: Used when you have one categorical variable from a single population. It assesses whether the observed frequencies in each category significantly differ from the frequencies you would expect based on a specific hypothesis (e.g., equal proportions, proportions matching a known distribution). The degrees of freedom (df) are calculated as `k - 1`, where `k` is the number of categories.
  • χ² Test of Independence: Used when you have two categorical variables from a single population sample, usually presented in a contingency table. It evaluates whether there is a statistically significant association (dependency) between the two variables. The degrees of freedom (df) are calculated as `(rows - 1) * (columns - 1)`.

Note: The chi-square test of homogeneity, while conceptually distinct (comparing distributions across multiple populations), uses the same calculation method and degrees of freedom formula as the test of independence.

Example 1: Zodiac Signs (Goodness-of-Fit)

Scenario

A researcher wants to know if the birthdays of 257 executives are evenly distributed throughout the year (i.e., across the 12 zodiac signs). If evenly distributed, we'd expect an equal number of executives for each sign.

Null Hypothesis (H₀): The distribution of executive birthdays across zodiac signs is uniform (equal proportions).

Alternative Hypothesis (H₁): The distribution is not uniform.

Data:

Sign        | Observed (O) | Expected (E) | (O-E)²/E
------------|--------------|--------------|---------
Aries       |      23      |    21.42     |  0.117
Taurus      |      20      |    21.42     |  0.094
Gemini      |      18      |    21.42     |  0.545
Cancer      |      23      |    21.42     |  0.117
Leo         |      21      |    21.42     |  0.008
Virgo       |      19      |    21.42     |  0.273
Libra       |      18      |    21.42     |  0.545
Scorpio     |      21      |    21.42     |  0.008
Sagittarius |      19      |    21.42     |  0.273
Capricorn   |      22      |    21.42     |  0.016
Aquarius    |      24      |    21.42     |  0.312
Pisces      |      29      |    21.42     |  2.685
Total       |              |              | χ² ≈ 4.99

Expected Count per sign: Total executives / Number of signs = 257 / 12 ≈ 21.42

Degrees of Freedom (df): Number of categories - 1 = 12 - 1 = 11

R Code (Example): Assuming `observed_counts` is a vector `c(23, 20, ..., 29)`:

observed_counts <- c(23, 20, 18, 23, 21, 19, 18, 21, 19, 22, 24, 29)
# Expected probabilities are equal (1/12 for each)
chisq_result_zodiac <- chisq.test(observed_counts)
print(chisq_result_zodiac)

Result Interpretation: The calculated χ² ≈ 4.99. With df=11, the p-value reported by R is large (p ≈ 0.93). Since p > 0.05, we fail to reject H₀; there's no significant statistical evidence from this sample to suggest that executive birthdays are unevenly distributed across zodiac signs.

Test Function Reminder: chisq.test(observed_vector_count, p = expected_probability_vector) (p defaults to equal probabilities if omitted).

Example 2: Absenteeism (Goodness-of-Fit)

Scenario

Faculty hypothesized absenteeism rates among 100 students. They compared their expectations to actual survey data.

H₀: The observed student absenteeism matches the expected distribution (50% in 0-2, 30% in 3-5, 12% in 6-8, 8% in 9+).

H₁: The observed distribution differs from the expected one.

Data:

Number of Absences | Expected Students (E) | Observed Students (O) | (O-E)²/E
-------------------|-----------------------|-----------------------|---------
0–2                |          50           |          35           |  4.500
3–5                |          30           |          40           |  3.333
6–8                |          12           |          20           |  5.333
9+                 |           8           |           5           |  1.125
Total              |                       |                       | χ² = 14.291

Degrees of Freedom (df): k - 1 = 4 - 1 = 3

R Code (Example):

observed_absences <- c(35, 40, 20, 5)
expected_proportions_abs <- c(0.50, 0.30, 0.12, 0.08) # Use proportions
chisq_result_abs <- chisq.test(observed_absences, p = expected_proportions_abs)
print(chisq_result_abs)

Result Interpretation: The calculated χ² ≈ 14.29. With df=3, the p-value reported by R is small (p ≈ 0.0025). Since p < 0.05, we reject H₀. There is significant statistical evidence that the observed student absenteeism distribution differs from what the faculty expected.

Example 3: Televisions (Goodness-of-Fit)

Scenario

We test if the number of TVs per family in a sample of 600 families matches the claimed national distribution percentages.

H₀: The observed TV distribution matches the national percentages (10% 0, 16% 1, 55% 2, 11% 3, 8% 4+).

H₁: The observed distribution does not match.

Data:

Number of TVs | National % | Expected (E) (out of 600) | Observed (O) | (O-E)²/E
--------------|------------|---------------------------|--------------|---------
0             |    10%     |            60             |      66      |  0.600
1             |    16%     |            96             |     119      |  5.510
2             |    55%     |           330             |     340      |  0.303
3             |    11%     |            66             |      60      |  0.545
4+            |     8%     |            48             |      15      | 22.688
Total         |            |                           |              | χ² ≈ 29.646

Degrees of Freedom (df): k - 1 = 5 - 1 = 4

R Code (Example):

observed_tv <- c(66, 119, 340, 60, 15)
expected_proportions_tv <- c(0.10, 0.16, 0.55, 0.11, 0.08)
chisq_result_tv <- chisq.test(observed_tv, p = expected_proportions_tv)
print(chisq_result_tv)

Result Interpretation: The calculated χ² ≈ 29.65. With df=4, the p-value reported by R is very small (p ≈ 5.7e-6). Since p < 0.05, we reject H₀. There is strong statistical evidence that the distribution of televisions in this sample significantly differs from the claimed national distribution, with the largest discrepancy in the '4+' category.
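To see which categories drive a significant result, the Pearson residuals (O − E)/√E stored in the fitted test object can be inspected. A sketch reusing the vectors from the TV example above:

```r
observed_tv <- c(66, 119, 340, 60, 15)
expected_proportions_tv <- c(0.10, 0.16, 0.55, 0.11, 0.08)

chisq_result_tv <- chisq.test(observed_tv, p = expected_proportions_tv)

# Pearson residuals: (O - E) / sqrt(E). Large absolute values flag the
# categories contributing most to the chi-square statistic; here the
# '4+' category (last entry) stands out with (15 - 48)/sqrt(48) ~ -4.76.
round(chisq_result_tv$residuals, 2)
```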

Class Activity 1: Customer Count (Goodness-of-Fit)

Scenario

A shop owner claims customer traffic is the same every weekday. A researcher recorded counts over a week to test this claim.

H₀: Customer counts are uniformly distributed across the 5 weekdays (Mon-Fri).

H₁: Customer counts are not uniformly distributed.

Data: Mon: 50, Tue: 60, Wed: 40, Thu: 47, Fri: 53. Total Customers = 250.

Expected Count per day (if H₀ is true): Total Customers / Number of days = 250 / 5 = 50

R command to test the hypothesis:

customer_counts <- c(50, 60, 40, 47, 53)
# Test against equal proportions (default for chisq.test when p is omitted)
chisq_result_cust <- chisq.test(customer_counts)
print(chisq_result_cust)
# Explicitly with proportions: chisq.test(customer_counts, p = rep(1/5, 5))

Example Output: X-squared = 4.36, df = 4, p-value = 0.3595

Conclusion: With a p-value of 0.36 (which is > 0.05), we fail to reject H₀. Based on this sample, there isn't enough statistical evidence to conclude that the customer flow significantly differs across weekdays from what would be expected if it were uniform.

Example: M&M Color Breakdown (Goodness-of-Fit)

Scenario

Test if a sample of 712 M&Ms collected in 2016-2017 matches the color distribution percentages published by Mars in 2008.

H₀: The 2016-2017 sample distribution matches the 2008 published percentages (Blue: 24%, Orange: 20%, Green: 16%, Yellow: 14%, Red: 13%, Brown: 13%).

H₁: The distributions differ.

Data:

  • Published (Expected %): Blue: 24%, Orange: 20%, Green: 16%, Yellow: 14%, Red: 13%, Brown: 13%
  • Observed Sample Counts (n=712): Blue: 133, Orange: 133, Green: 139, Yellow: 103, Red: 108, Brown: 96 (Total 712)

R code:

# Observed counts
observed_mm <- c(133, 133, 139, 103, 108, 96)

# Expected proportions
expected_prop_mm <- c(0.24, 0.20, 0.16, 0.14, 0.13, 0.13)

# Perform the test
chisq_result_mm <- chisq.test(observed_mm, p = expected_prop_mm)
print(chisq_result_mm)

Example Output: X-squared = 17.353, df = 5, p-value ≈ 0.0039

Conclusion: Since the p-value (≈ 0.0039) is less than 0.05, we reject H₀. The color distribution in the 2016-2017 sample is significantly different from the proportions published by Mars in 2008.

Class Activity 2: Eye & Hair Color (Test of Independence)

Scenario

Analyze data on eye color and hair color from a sample of individuals to determine if these two traits are associated (dependent) or independent.

H₀: Eye color and hair color are independent.

H₁: Eye color and hair color are associated (dependent).

Data (Contingency Table - Example from `datasets::HairEyeColor` in R, summed over sex):

# Example using built-in R dataset
data(HairEyeColor)
hair_eye_table <- apply(HairEyeColor, c("Hair", "Eye"), sum) # Sum over Male/Female
print("Observed Contingency Table (Hair vs Eye Color):")
print(hair_eye_table)

R code for the chi-square test:

# Perform the test directly on the table
test_result_he <- chisq.test(hair_eye_table)
print(test_result_he)

print("Expected Counts (under independence assumption):")
print(round(test_result_he$expected, 2)) # Show expected counts rounded

Degrees of Freedom (df): (Number of Hair Colors - 1) * (Number of Eye Colors - 1) = (4 - 1) * (4 - 1) = 3 * 3 = 9

Example Output (using HairEyeColor data): X-squared = 138.29, df = 9, p-value < 2.2e-16

Conclusion: The p-value is extremely small (effectively zero). We strongly reject H₀. There is a highly significant statistical association between hair color and eye color in this dataset.

Example: Skittles Favorite Flavor (Test of Homogeneity)

Scenario

A survey asked people their favorite Skittle. One group was shown colors, another tasted blindfolded (flavors). We want to know if the distribution of favorite choices is the same for both polling methods (groups).

H₀: The distribution of favorite Skittle choices is the same for the Color Poll and Flavor Poll groups.

H₁: The distributions of favorite choices are different between the two groups.

Data (Contingency Table - Rows: Poll Type, Columns: Skittle Option):

Poll Type   | Opt 1 | Opt 2 | Opt 3 | Opt 4 | Opt 5 | Row Totals
------------|-------|-------|-------|-------|-------|-----------
Color Poll  |  18   |   9   |  15   |  13   |  11   |    66
Flavor Poll |  13   |  16   |  19   |  34   |   9   |    91
Col Totals  |  31   |  25   |  34   |  47   |  20   |   157

R code:

# Create the matrix
skittles_data <- matrix(c(18, 9, 15, 13, 11,  # Row 1: Color Poll
                          13, 16, 19, 34, 9), # Row 2: Flavor Poll
                        nrow = 2, ncol = 5, byrow = TRUE)
colnames(skittles_data) <- c("Opt1", "Opt2", "Opt3", "Opt4", "Opt5")
rownames(skittles_data) <- c("ColorPoll", "FlavorPoll")

print("Observed Data:")
print(skittles_data)

# Perform the test (Test of Independence/Homogeneity)
chisq_result_skittles <- chisq.test(skittles_data)
print(chisq_result_skittles)

Degrees of Freedom (df): (Rows - 1) * (Columns - 1) = (2 - 1) * (5 - 1) = 1 * 4 = 4

Example Output: X-squared = 9.0691, df = 4, p-value = 0.0594

Conclusion: The p-value (0.0594) is slightly above the common significance level of 0.05. Therefore, we fail to reject H₀. There isn't statistically significant evidence (at α=0.05) to conclude that the polling method (color vs. flavor) affects the distribution of favorite Skittle choices, although the result is borderline and might warrant further investigation with a larger sample.

Example: Rock-Paper-Scissors (1) - Goodness-of-Fit (Equal Proportions)

Scenario

Test if players choose Rock, Paper, or Scissors with equal frequency in a series of games.

H₀: The choices Rock, Paper, Scissors are made with equal probability (1/3 each).

H₁: The probabilities are not equal.

Data: Total plays = 88 + 74 + 66 = 228

  • Rock: 88
  • Paper: 74
  • Scissors: 66

Expected (if H₀ is true): 228 / 3 = 76 for each choice.

R code:

rps_counts1 <- c(88, 74, 66)
# Test against equal probabilities (default)
chisq_result_rps1 <- chisq.test(rps_counts1)
print(chisq_result_rps1)
# Explicitly: chisq.test(rps_counts1, p = c(1/3, 1/3, 1/3))

Degrees of Freedom (df): k - 1 = 3 - 1 = 2

Example Output: X-squared = 3.2632, df = 2, p-value = 0.1956

Conclusion: The p-value (0.196) is greater than 0.05. We fail to reject H₀. There's no significant evidence from this data that players deviated from choosing Rock, Paper, or Scissors with equal frequency.

Example: Rock-Paper-Scissors (2) - Goodness-of-Fit (Unequal Proportions)

Scenario

Test if the observed frequencies of choices match a specific, *unequal* hypothesized distribution (e.g., a player claims they throw Rock 50% of the time, Paper 30%, and Scissors 20%).

H₀: The observed choices follow the distribution P(Rock)=0.5, P(Paper)=0.3, P(Scissors)=0.2.

H₁: The observed choices do not follow this specific distribution.

Data: Total plays = 66 + 39 + 14 = 119

  • Rock: 66
  • Paper: 39
  • Scissors: 14

Expected (if H₀ is true): E(Rock)=119*0.5=59.5, E(Paper)=119*0.3=35.7, E(Scissors)=119*0.2=23.8

R code:

rps_observed2 <- c(66, 39, 14)
rps_expected_prop2 <- c(0.5, 0.3, 0.2)
chisq_result_rps2 <- chisq.test(rps_observed2, p = rps_expected_prop2)
print(chisq_result_rps2)

Degrees of Freedom (df): k - 1 = 3 - 1 = 2

Example Output: X-squared = 5.0504, df = 2, p-value = 0.08004

Conclusion: The p-value (0.08) is greater than 0.05. We fail to reject H₀. While the observed counts aren't a perfect match to the expected ones (especially for Scissors), the difference is not statistically significant at the α=0.05 level. There isn't enough evidence to disprove the player's claimed strategy based on this data.

Finding Chi-Square with a Contingency Table (Manual Steps & R)

Scenario

Illustrate the calculation steps for a Test of Independence using a simple 2x2 contingency table and verify with R.

H₀: The row variable and column variable are independent.

H₁: The variables are dependent (associated).

Observed Data (Example 2x2 Table):

# Create the contingency table in R
observed_table_cont <- matrix(c(60, 40,  # Row 1
                                20, 10), # Row 2
                              nrow = 2, byrow = TRUE)
colnames(observed_table_cont) <- c("Category A", "Category B")
rownames(observed_table_cont) <- c("Group 1", "Group 2")

print("Observed Table:")
print(observed_table_cont)

Manual Calculation Steps:

  1. Calculate Row and Column Totals: Row1=100, Row2=30, ColA=80, ColB=50, GrandTotal=130
  2. Calculate Expected Frequencies (E) for each cell: E = (Row Total * Column Total) / Grand Total
    • E(R1,C1)=(100*80)/130≈61.54
    • E(R1,C2)=(100*50)/130≈38.46
    • E(R2,C1)=(30*80)/130≈18.46
    • E(R2,C2)=(30*50)/130≈11.54
  3. Calculate the Chi-Square component for each cell: (O - E)² / E
    • Cell(1,1):(60-61.54)²/61.54≈0.0385
    • Cell(1,2):(40-38.46)²/38.46≈0.0615
    • Cell(2,1):(20-18.46)²/18.46≈0.1282
    • Cell(2,2):(10-11.54)²/11.54≈0.2051
  4. Sum the components to get the Chi-Square statistic (χ²): χ² ≈ 0.0385+0.0615+0.1282+0.2051 ≈ 0.4333
  5. Determine Degrees of Freedom (df): df = (rows-1)*(cols-1) = (2-1)*(2-1) = 1

R Code Verification:

# Perform the chi-square test using R
# correct = FALSE removes Yates' continuity correction for 2x2 tables,
# matching the manual calculation
test_result_cont <- chisq.test(observed_table_cont, correct = FALSE)
print(test_result_cont)

print("Expected values from R:")
print(round(test_result_cont$expected, 2)) # Rounded expected values

Example R Output:

	Pearson's Chi-squared test

 data:  observed_table_cont
 X-squared = 0.43333, df = 1, p-value = 0.5104

 Expected values from R:
          Category A Category B
 Group 1      61.54      38.46
 Group 2      18.46      11.54
                 

Conclusion: The manually calculated χ² statistic (0.433) matches the R output (when `correct=FALSE`). With df=1, the p-value is 0.51. Since p > 0.05, we fail to reject H₀. There is no significant statistical evidence of an association between the row variable (Group) and the column variable (Category) in this data.

Chi-Square Tests: Formula Summary

Here's a quick reference table for the key formulas used in Chi-Square testing:

Concept                   | Formula                                      | Notes / Applies To
--------------------------|----------------------------------------------|-------------------
Chi-Square Statistic (χ²) | χ² = Σ [ (O - E)² / E ]                      | Applies to all Chi-Square tests (Goodness-of-Fit, Independence, Homogeneity). Summation (Σ) is over all categories or cells.
Expected Frequency (E)    | E = n * p                                    | Goodness-of-Fit Test. n = total sample size, p = expected proportion for the category under H₀.
Expected Frequency (E)    | E = (Row Total * Column Total) / Grand Total | Test of Independence / Test of Homogeneity. Calculated for each cell in the contingency table.
Degrees of Freedom (df)   | df = k - 1                                   | Goodness-of-Fit Test. k = number of categories.
Degrees of Freedom (df)   | df = (Rows - 1) * (Columns - 1)              | Test of Independence / Test of Homogeneity. Based on the dimensions of the contingency table.
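The contingency-table formula for E can be computed for all cells at once in R with `outer`. A sketch reusing the 2×2 table from the worked example above:

```r
observed <- matrix(c(60, 40,
                     20, 10), nrow = 2, byrow = TRUE)

# E = (Row Total * Column Total) / Grand Total, evaluated for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)  # matches chisq.test(observed, correct = FALSE)$expected
```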