Chi Square Test in R

Last updated on Dec 13 2021
Murugan Swamy

The chi-square test of independence is a statistical method for determining whether two categorical variables are significantly associated. Both variables should come from the same population and both should be categorical, for example Yes/No, Male/Female or Red/Green.

For example, we can build a data set of observations on people's ice-cream purchases and test whether a person's gender is associated with the flavor of ice-cream they prefer. If an association is found, we can plan the stock of each flavor according to the gender mix of the customers visiting.
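The ice-cream scenario can be sketched in a few lines of R. The counts below are hypothetical and invented purely for illustration, as are the variable names; only chisq.test() itself comes from R.

```r
# Hypothetical counts, invented for illustration only:
# rows are gender, columns are the preferred flavor.
icecream <- matrix(
  c(30, 20, 10,
    15, 25, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Flavor = c("Chocolate", "Vanilla", "Strawberry")
  )
)

# Test whether flavor preference is independent of gender.
result <- chisq.test(icecream)
print(result)
```

A small p-value in the printed result would suggest that flavor preference differs between the two groups in this made-up sample.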

Syntax

The function used to perform a chi-square test in R is chisq.test().

The basic syntax for performing a chi-square test in R is −

chisq.test(data)

Following is the description of the parameter used −

  • data is a table or matrix of counts, with one cell for each combination of the categories of the two variables.
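As a rough illustration of the call, chisq.test() also accepts the two factors directly instead of a pre-built table, and it returns an object whose components can be inspected. The survey data below is hypothetical and the variable names are chosen only for this sketch.

```r
# Hypothetical survey answers, invented for illustration only.
smoker   <- factor(rep(c("Yes", "No"), times = c(40, 60)))
exercise <- factor(c(rep(c("Low", "High"), times = c(25, 15)),
                     rep(c("Low", "High"), times = c(20, 40))))

# Form 1: pass a contingency table of counts.
by_table <- chisq.test(table(smoker, exercise))

# Form 2: pass the two factors; chisq.test() cross-tabulates them itself.
by_factors <- chisq.test(smoker, exercise)

# Both forms should report the same test statistic.
print(all.equal(unname(by_table$statistic), unname(by_factors$statistic)))

# Useful components of the returned object.
print(by_table$statistic)  # the X-squared value
print(by_table$parameter)  # degrees of freedom
print(by_table$p.value)    # p-value
print(by_table$expected)   # expected counts under independence
```

For a 2 x 2 table like this one, chisq.test() applies Yates' continuity correction by default; passing correct = FALSE turns it off.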

Example

We will use the Cars93 data set from the “MASS” package, which describes 93 car models on sale in the USA in 1993.

 

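# Load the MASS package, which provides the Cars93 data set.
library("MASS")
# str() prints the structure itself, so no print() call is needed.
str(Cars93)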

When we execute the above code, it produces the following result −

'data.frame':   93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

The output shows that the data set has many Factor variables, which can be treated as categorical variables. For our test we will use the variables “AirBags” and “Type”. We want to find out whether there is a significant association between the type of car and the type of air bags it has. If an association is observed, we know that the air bag configuration tends to differ from one type of car to another.

 

# Load the library.
library("MASS")
# Cross-tabulate air bag type against car type to get a table of counts.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)
# Perform the Chi-Square test.
print(chisq.test(car.data))

When we execute the above code, it produces the following result −

                     Compact Large Midsize Small Sporty Van
  Driver & Passenger       2     4       7     0      3   0
  Driver only              9     7      11     5      8   3
  None                     5     0       4    16      3   6

        Pearson's Chi-squared test

data:  car.data
X-squared = 33.001, df = 10, p-value = 0.0002723

Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
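To see where the numbers in this output come from, the statistic can be recomputed by hand. The sketch below assumes the standard Pearson formula, the sum over all cells of (observed - expected)^2 / expected, with each expected count equal to row total x column total / grand total. The variable names are illustrative only.

```r
library(MASS)  # provides the Cars93 data set

observed <- table(Cars93$AirBags, Cars93$Type)

# Expected count for each cell if the two variables were independent:
# row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic.
manual_stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
manual_df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
manual_p <- pchisq(manual_stat, df = manual_df, lower.tail = FALSE)

print(c(statistic = manual_stat, df = manual_df, p.value = manual_p))
```

These values should agree with the X-squared, df and p-value lines reported by chisq.test().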

Conclusion

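The p-value is less than 0.05, which indicates a statistically significant association between the type of air bags and the type of car. Note the warning, though: some cells of the table have small expected counts, so the chi-squared approximation, and with it the exact p-value, should be read with caution.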
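The warning printed with the test result usually means that some expected cell counts are small (a common rule of thumb is below 5). Below is a minimal sketch of how one might check this and, as one possible alternative, obtain a Monte Carlo p-value through the simulate.p.value argument of chisq.test(). The seed and the number of replicates B are arbitrary choices made for this illustration.

```r
library(MASS)  # provides the Cars93 data set

car.data <- table(Cars93$AirBags, Cars93$Type)

# suppressWarnings() only hides the approximation warning here,
# because the expected counts are inspected explicitly below.
test <- suppressWarnings(chisq.test(car.data))

# Expected counts under independence, rounded for readability.
print(round(test$expected, 2))

# Number of cells whose expected count is below 5.
small_cells <- sum(test$expected < 5)
print(small_cells)

# Monte Carlo p-value: does not rely on the chi-squared approximation.
set.seed(1)  # arbitrary seed, for reproducibility only
sim <- chisq.test(car.data, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```

If the simulated p-value is also well below 0.05, the conclusion does not depend on the approximation that triggered the warning.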

This brings us to the end of the blog. This Tecklearn ‘Chi Square Test in R’ blog covers questions that are commonly asked if you are looking for a job in Data Science. If you wish to learn the R language and build a career in the Data Science domain, check out our interactive Data Science using R Language Training, which comes with 24*7 support to guide you throughout your learning period. The course details are available at:

https://www.tecklearn.com/course/data-science-training-using-r-language/
