Statistics Concepts in Python

Last updated on Jan 23 2023
Prabhas Ramanathan

Table of Contents

Python – Measuring Central Tendency

Mathematically central tendency means measuring the center or distribution of location of values of a data set. It gives an idea of the average value of the data in the data set and also an indication of how widely the values are spread in the data set. That in turn helps in evaluating the chances of a new input fitting into the existing data set and hence probability of success.
There are three main measures of central tendency which can be calculated using the methods in pandas python library.
• Mean – It is the Average value of the data which is a division of sum of the values with the number of values.
• Median – It is the middle value in distribution when the values are arranged in ascending or descending order.
• Mode – It is the most commonly occurring value in a distribution.

Calculating Mean and Median

The pandas functions can be directly used to calculate these values.
import pandas as pd

 

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print "Mean Values in the Distribution"
print df.mean()
print "*******************************"
print "Median Values in the Distribution"
print df.median()

 

Its output is as follows −
Mean Values in the Distribution
Age 31.833333
Rating 3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age 29.50
Rating 3.79
dtype: float64

Calculating Mode

Mode may or may not be available in a distribution depending on whether the data is continous or whether there are values which has maximum frquency. We take a simple distribution below to find out the mode. Here we have a value which has maximum frequency in the distribution.
import pandas as pd

 

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46])}
#Create a DataFrame
df = pd.DataFrame(d)

print df.mode()

p 21

Its output is as follows −

Age Name
0 25.0 Andres
1 NaN Chanchal
2 NaN Gasper
3 NaN Jack
4 NaN James
5 NaN Lee
6 NaN Naviya
7 NaN Ricky
8 NaN Smith
9 NaN Steve
10 NaN Tom
11 NaN Vin

 

Python – Measuring Variance

In statistics, variance is a measure of how far a value in a data set lies from the mean value. In other words, it indicates how dispersed the values are. It is measured by using standard deviation. The other method commonly used is skewness.
Both of these are calculated by using functions available in pandas library.

Measuring Standard Deviation

Standard deviation is square root of variance. variance is the average of squared difference of values in a data set from the mean value. In python we calculate this value by using the function std() from pandas library.
import pandas as pd

 

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)

# Calculate the standard deviation
print df.std()

Its output is as follows −

Age 7.265527
Rating 0.661628
dtype: float64

Measuring Skewness

It used to determine whether the data is symmetric or skewed. If the index is between -1 and 1, then the distribution is symmetric. If the index is no more than -1 then it is skewed to the left and if it is at least 1, then it is skewed to the right
import pandas as pd

 

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print df.skew()

Its output is as follows −
Age 1.443490
Rating -0.153629
dtype: float64
So the distribution of age rating is symmetric while the distribution of age is skewed to the right.

Python – Normal Distribution

The normal distribution is a form presenting data by arranging the probability distribution of each value in the data.Most values remain around the mean value making the arrangement symmetric.
We use various functions in numpy library to mathematically calculate the values for a normal distribution. Histograms are created over which we plot the probability distribution curve.

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
np.exp( - (bins - mu)**2 / (2 * sigma**2) ), linewidth=3, color='y')
plt.show()

Its output is as follows −

p 22

 

Python – Binomial Distribution

The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments. For example, tossing of a coin always gives a head or a tail. The probability of finding exactly 3 heads in tossing a coin repeatedly for 10 times is estimated during the binomial distribution.
We use the seaborn python library which has in-built functions to create such probability distribution graphs. Also, the scipy package helps is creating the binomial distribution.
from scipy.stats import binom
import seaborn as sb

binom.rvs(size=10,n=20,p=0.8)

data_binom = binom.rvs(n=20,p=0.8,loc=0,size=1000)
ax = sb.distplot(data_binom,
kde=True,
color=’blue’,
hist_kws={“linewidth”: 25,’alpha’:1})
ax.set(xlabel=’Binomial’, ylabel=’Frequency’)
Its output is as follows −

p 23

 

Python – Poisson Distribution

A Poisson distribution is a distribution which shows the likely number of times that an event will occur within a pre-determined period of time. It is used for independent events which occur at a constant rate within a given interval of time. The Poisson distribution is a discrete function, meaning that the event can only be measured as occurring or not as occurring, meaning the variable can only be measured in whole numbers.
We use the seaborn python library which has in-built functions to create such probability distribution graphs. Also the scipy package helps is creating the binomial distribution.
from scipy.stats import poisson
import seaborn as sb

data_binom = poisson.rvs(mu=4, size=10000)
ax = sb.distplot(data_binom,
kde=True,
color=’green’,
hist_kws={“linewidth”: 25,’alpha’:1})
ax.set(xlabel=’Poisson’, ylabel=’Frequency’)
Its output is as follows −

p 24

 

Python – Bernoulli Distribution

The Bernoulli distribution is a special case of the Binomial distribution where a single experiment is conducted so that the number of observation is 1. So, the Bernoulli distribution therefore describes events having exactly two outcomes.
We use various functions in numpy library to mathematically calculate the values for a bernoulli distribution. Histograms are created over which we plot the probability distribution curve.
from scipy.stats import bernoulli
import seaborn as sb

data_bern = bernoulli.rvs(size=1000,p=0.6)
ax = sb.distplot(data_bern,
kde=True,
color=’crimson’,
hist_kws={“linewidth”: 25,’alpha’:1})
ax.set(xlabel=’Bernouli’, ylabel=’Frequency’)
Its output is as follows −

p 25

 

Python – P-Value

The p-value is about the strength of a hypothesis. We build hypothesis based on some statistical model and compare the model’s validity using p-value. One way to get the p-value is by using T-test.
This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations ‘a’ is equal to the given population mean, popmean. Let us consider the following example.
from scipy import stats
rvs = stats.norm.rvs(loc = 5, scale = 10, size = (50,2))
print stats.ttest_1samp(rvs,5.0)
The above program will generate the following output.
Ttest_1sampResult(statistic = array([-1.40184894, 2.70158009]),
pvalue = array([ 0.16726344, 0.00945234]))

Comparing two samples

In the following examples, there are two samples, which can come either from the same or from different distribution, and we want to test whether these samples have the same statistical properties.
ttest_ind − Calculates the T-test for the means of two independent samples of scores. This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
We can use this test, if we observe two independent samples from the same or different population. Let us consider the following example.
from scipy import stats
rvs1 = stats.norm.rvs(loc = 5,scale = 10,size = 500)
rvs2 = stats.norm.rvs(loc = 5,scale = 10,size = 500)
print stats.ttest_ind(rvs1,rvs2)
The above program will generate the following output.
Ttest_indResult(statistic = -0.67406312233650278, pvalue = 0.50042727502272966)
You can test the same with a new array of the same length, but with a varied mean. Use a different value in loc and test the same.

Python – Correlation

Correlation refers to some statistical relationships involving dependence between two data sets. Simple examples of dependent phenomena include the correlation between the physical appearance of parents and their offspring, and the correlation between the price for a product and its supplied quantity.
We take example of the iris data set available in seaborn python library. In it we try to establish the correlation between the length and the width of the sepals and petals of three species of iris flower. Based on the correlation found, a strong model could be created which easily distinguishes one species from another.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset(‘iris’)

#without regression
sns.pairplot(df, kind=”scatter”)
plt.show()
Its output is as follows −

p 26 1

 

Python – Chi-Square Test

Chi-Square test is a statistical method to determine if two categorical variables have a significant correlation between them. Both those variables should be from same population and they should be categorical like − Yes/No, Male/Female, Red/Green etc. For example, we can build a data set with observations on people’s ice-cream buying pattern and try to correlate the gender of a person with the flavour of the ice-cream they prefer. If a correlation is found we can plan for appropriate stock of flavours by knowing the number of gender of people visiting.
We use various functions in numpy library to carry out the chi-square test.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig,ax = plt.subplots(1,1)

linestyles = [‘:’, ‘–‘, ‘-.’, ‘-‘]
deg_of_freedom = [1, 4, 7, 6]
for df, ls in zip(deg_of_freedom, linestyles):
ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls)

plt.xlim(0, 10)
plt.ylim(0, 0.4)

plt.xlabel(‘Value’)
plt.ylabel(‘Frequency’)
plt.title(‘Chi-Square Distribution’)

plt.legend()
plt.show()
Its output is as follows −

p 27

 

Python – Linear Regression

In Linear Regression these two variables are related through an equation, where exponent (power) of both these variables is 1. Mathematically a linear relationship represents a straight line when plotted as a graph. A non-linear relationship where the exponent of any variable is not equal to 1 creates a curve.
The functions in Seaborn to find the linear regression relationship is regplot. The below example shows its use.

import seaborn as sb
from matplotlib import pyplot as plt
df = sb.load_dataset(‘tips’)
sb.regplot(x = “total_bill”, y = “tip”, data = df)
plt.show()
Its output is as follows −

p 28 1

 

So, this brings us to the end of blog. This Tecklearn ‘Statistics Concepts in Python’ blog helps you with commonly asked questions if you are looking out for a job in Python Programming. If you wish to learn Python and build a career in Python Programming domain, then check out our interactive, Python with Data Science Training, that comes with 24*7 support to guide you throughout your learning period.

Python with Data Science Training

About the Course

Python with Data Science training lets you master the concepts of the widely used and powerful programming language, Python. This Python Course will also help you master important Python programming concepts such as data operations, file operations, object-oriented programming and various Python libraries such as Pandas, NumPy, Matplotlib which are essential for Data Science. You will work on real-world projects in the domain of Python and apply it for various domains of Big Data, Data Science and Machine Learning.

Why Should you take Python with Data Science Training?

• Python is the preferred language for new technologies such as Data Science and Machine Learning.
• Average salary of Python Certified Developer is $123,656 per annum – Indeed.com
• Python is by far the most popular language for data science. Python held 65.6% of the data science market.

What you will Learn in this Course?

Introduction to Python

• Define Python
• Understand the need for Programming
• Know why to choose Python over other languages
• Setup Python environment
• Understand Various Python concepts – Variables, Data Types Operators, Conditional Statements and Loops
• Illustrate String formatting
• Understand Command Line Parameters and Flow control

Python Environment Setup and Essentials

• Python installation
• Windows, Mac & Linux distribution for Anaconda Python
• Deploying Python IDE
• Basic Python commands, data types, variables, keywords and more

Python language Basic Constructs

• Looping in Python
• Data Structures: List, Tuple, Dictionary, Set
• First Python program
• Write a Python Function (with and without parameters)
• Create a member function and a variable
• Tuple
• Dictionary
• Set and Frozen Set
• Lambda function

OOP (Object Oriented Programming) in Python

• Object-Oriented Concepts

Working with Modules, Handling Exceptions and File Handling

• Standard Libraries
• Modules Used in Python (OS, Sys, Date and Time etc.)
• The Import statements
• Module search path
• Package installation ways
• Errors and Exception Handling
• Handling multiple exceptions

Introduction to NumPy

• Introduction to arrays and matrices
• Indexing of array, datatypes, broadcasting of array math
• Standard deviation, Conditional probability
• Correlation and covariance
• NumPy Exercise Solution

Introduction to Pandas

• Pandas for data analysis and machine learning
• Pandas for data analysis and machine learning Continued
• Time series analysis
• Linear regression
• Logistic Regression
• ROC Curve
• Neural Network Implementation
• K Means Clustering Method

Data Visualisation

• Matplotlib library
• Grids, axes, plots
• Markers, colours, fonts and styling
• Types of plots – bar graphs, pie charts, histograms
• Contour plots

Data Manipulation

• Perform function manipulations on Data objects
• Perform Concatenation, Merging and Joining on DataFrames
• Iterate through DataFrames
• Explore Datasets and extract insights from it

Scikit-Learn for Natural Language Processing

• What is natural language processing, working with NLP on text data
• Scikit-Learn for Natural Language Processing
• The Scikit-Learn machine learning algorithms
• Sentimental Analysis – Twitter

Introduction to Python for Hadoop

• Deploying Python coding for MapReduce jobs on Hadoop framework.
• Python for Apache Spark coding
• Deploying Spark code with Python
• Machine learning library of Spark MLlib
• Deploying Spark MLlib for Classification, Clustering and Regression

Got a question for us? Please mention it in the comments section and we will get back to you.

0 responses on "Statistics Concepts in Python"

Leave a Message

Your email address will not be published. Required fields are marked *