Python offers several ways to perform aggregations on data, most commonly through the pandas and NumPy libraries. The data must already be in a DataFrame, or be converted to one, before the aggregation functions can be used.

Applying Aggregations on DataFrame

Let us create a DataFrame and apply aggregations thereon.

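import pandas as pd
import numpy as np

# 10 rows of random values, indexed by consecutive days
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

# Rolling window of 3 rows; min_periods=1 allows shorter windows at the start
r = df.rolling(window=3, min_periods=1)
print(r)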

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.790670 -0.387854 -0.668132  0.267283
2000-01-03 -0.575523 -0.965025  0.060427 -2.179780
2000-01-04  1.669653  1.211759 -0.254695  1.429166
2000-01-05  0.100568 -0.236184  0.491646 -0.466081
2000-01-06  0.155172  0.992975 -1.205134  0.320958
2000-01-07  0.309468 -0.724053 -1.412446  0.627919
2000-01-08  0.099489 -1.028040  0.163206 -1.274331
2000-01-09  1.639500 -0.068443  0.714008 -0.565969
2000-01-10  0.326761  1.479841  0.664282 -1.361169
Rolling [window=3,min_periods=1,center=False,axis=0]

We can aggregate by passing a function to the entire DataFrame, or select a column via the standard getitem method (square-bracket indexing).
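
Both options can be tried on a small, fixed DataFrame whose rolling sums are easy to check by hand. This is only a sketch: the column values are made up for illustration, and the function is passed by its name, 'sum'.

```python
import pandas as pd

# Small, fixed data (illustrative values, not from the examples in this post)
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2000-01-01", periods=4),
)

# window=3 covers the current row and the two rows before it;
# min_periods=1 lets the first rows aggregate over fewer values.
r = df.rolling(window=3, min_periods=1)

whole = r.aggregate("sum")         # aggregate every column
only_a = r["A"].aggregate("sum")   # aggregate one column, selected with []

print(whole)
print(only_a)
```

Column A becomes 1, 3, 6, 9: each value is the sum of the current row and up to two rows before it.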

Apply Aggregation on an Entire DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

# Sum each column over a rolling window of up to 3 rows
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.790670 -0.387854 -0.668132  0.267283
2000-01-03 -0.575523 -0.965025  0.060427 -2.179780
2000-01-04  1.669653  1.211759 -0.254695  1.429166
2000-01-05  0.100568 -0.236184  0.491646 -0.466081
2000-01-06  0.155172  0.992975 -1.205134  0.320958
2000-01-07  0.309468 -0.724053 -1.412446  0.627919
2000-01-08  0.099489 -1.028040  0.163206 -1.274331
2000-01-09  1.639500 -0.068443  0.714008 -0.565969
2000-01-10  0.326761  1.479841  0.664282 -1.361169

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  1.879182 -1.038796 -3.215581 -0.299575
2000-01-03  1.303660 -2.003821 -3.155154 -2.479355
2000-01-04  1.884801 -0.141119 -0.862400 -0.483331
2000-01-05  1.194699  0.010551  0.297378 -1.216695
2000-01-06  1.925393  1.968551 -0.968183  1.284044
2000-01-07  0.565208  0.032738 -2.125934  0.482797
2000-01-08  0.564129 -0.759118 -2.454374 -0.325454
2000-01-09  2.048458 -1.820537 -0.535232 -1.212381
2000-01-10  2.065750  0.383357  1.541496 -3.201469
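
The first two rows of the rolling sums are computed from fewer than three values, which is what min_periods=1 permits. The sketch below, using made-up values, contrasts that with the default behaviour, where a window of 3 needs three observations before it produces a result.

```python
import pandas as pd

# Illustrative values only
s = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

# min_periods=1: partial windows at the start are still summed
with_min = s.rolling(window=3, min_periods=1).aggregate("sum")

# Default min_periods for an integer window equals the window size,
# so the first two rows are NaN
default = s.rolling(window=3).aggregate("sum")

print(with_min)
print(default)
```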

Apply Aggregation on a Single Column of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

# Select column 'A' from the rolling object, then sum over each window
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.790670 -0.387854 -0.668132  0.267283
2000-01-03 -0.575523 -0.965025  0.060427 -2.179780
2000-01-04  1.669653  1.211759 -0.254695  1.429166
2000-01-05  0.100568 -0.236184  0.491646 -0.466081
2000-01-06  0.155172  0.992975 -1.205134  0.320958
2000-01-07  0.309468 -0.724053 -1.412446  0.627919
2000-01-08  0.099489 -1.028040  0.163206 -1.274331
2000-01-09  1.639500 -0.068443  0.714008 -0.565969
2000-01-10  0.326761  1.479841  0.664282 -1.361169

2000-01-01    1.088512
2000-01-02    1.879182
2000-01-03    1.303660
2000-01-04    1.884801
2000-01-05    1.194699
2000-01-06    1.925393
2000-01-07    0.565208
2000-01-08    0.564129
2000-01-09    2.048458
2000-01-10    2.065750
Freq: D, Name: A, dtype: float64
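
Selecting a column on the rolling object should give the same result as selecting the column first and then rolling over it. The following sketch checks this; the fixed seed is an assumption added here so that the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
df = pd.DataFrame(rng.standard_normal((10, 4)),
                  index=pd.date_range("2000-01-01", periods=10),
                  columns=["A", "B", "C", "D"])

r = df.rolling(window=3, min_periods=1)

# Select on the Rolling object, then aggregate
via_getitem = r["A"].aggregate("sum")

# Select the column first, then roll and sum
via_series = df["A"].rolling(window=3, min_periods=1).sum()

print(type(via_getitem).__name__, via_getitem.name)
print(np.allclose(via_getitem, via_series))
```

Either way the result is a Series named after the selected column.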

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

# Pass a list of column names to aggregate only those columns
r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.790670 -0.387854 -0.668132  0.267283
2000-01-03 -0.575523 -0.965025  0.060427 -2.179780
2000-01-04  1.669653  1.211759 -0.254695  1.429166
2000-01-05  0.100568 -0.236184  0.491646 -0.466081
2000-01-06  0.155172  0.992975 -1.205134  0.320958
2000-01-07  0.309468 -0.724053 -1.412446  0.627919
2000-01-08  0.099489 -1.028040  0.163206 -1.274331
2000-01-09  1.639500 -0.068443  0.714008 -0.565969
2000-01-10  0.326761  1.479841  0.664282 -1.361169

                   A         B
2000-01-01  1.088512 -0.650942
2000-01-02  1.879182 -1.038796
2000-01-03  1.303660 -2.003821
2000-01-04  1.884801 -0.141119
2000-01-05  1.194699  0.010551
2000-01-06  1.925393  1.968551
2000-01-07  0.565208  0.032738
2000-01-08  0.564129 -0.759118
2000-01-09  2.048458 -1.820537
2000-01-10  2.065750  0.383357
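
A list of column names keeps only those columns in the result. As a small extension beyond what this post covers, aggregate also accepts a dictionary that maps column names to functions, so each column can get a different aggregation. The sketch below uses made-up values.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
    "C": [5.0, 5.0, 5.0, 5.0],
})
r = df.rolling(window=3, min_periods=1)

# List of names: only A and B appear in the result
subset = r[["A", "B"]].aggregate("sum")

# Dictionary: rolling sum for A, rolling mean for B
mixed = r.aggregate({"A": "sum", "B": "mean"})

print(subset)
print(mixed)
```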

Python - Bernoulli Distribution

The Bernoulli distribution is a special case of the binomial distribution in which a single experiment (trial) is conducted, so the number of observations is 1. The Bernoulli distribution therefore describes events having exactly two outcomes.
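
The relationship between the two distributions can be checked numerically. This sketch compares the probability mass functions from scipy.stats for a single trial; the value p = 0.6 is simply the one used in the sampling example in this post.

```python
from scipy.stats import bernoulli, binom

p = 0.6  # probability of the outcome 1 ("success")

# A Bernoulli variable takes only the values 0 and 1
pmf_bernoulli = [bernoulli.pmf(k, p) for k in (0, 1)]

# A binomial variable with n=1 trial has the same probabilities
pmf_binomial = [binom.pmf(k, 1, p) for k in (0, 1)]

print(pmf_bernoulli)
print(pmf_binomial)
```

Both lists hold the probabilities 1 - p and p, up to floating-point rounding.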

We use functions from the scipy.stats library to draw samples from a Bernoulli distribution, and seaborn to plot them. A histogram is created, over which we plot the probability distribution curve.

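from scipy.stats import bernoulli
import seaborn as sb

# Draw 1000 samples (each 0 or 1) with success probability p = 0.6
data_bern = bernoulli.rvs(size=1000, p=0.6)

# histplot is used here because distplot is deprecated in recent seaborn releases
ax = sb.histplot(data_bern,
                 kde=True,
                 color='crimson',
                 alpha=1)
ax.set(xlabel='Bernoulli', ylabel='Frequency')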

Its output is as follows −

[Figure: histogram of the Bernoulli samples with the distribution curve overlaid]
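
The samples can also be inspected without a plot. As a rough sketch, the share of ones in the sample should land close to p; the random_state argument is an assumption added here so that the draw is repeatable.

```python
from scipy.stats import bernoulli

# Same parameters as the plotting example; random_state fixes the draw
data = bernoulli.rvs(p=0.6, size=1000, random_state=0)

ones = int(data.sum())       # number of samples equal to 1
zeros = len(data) - ones     # number of samples equal to 0
share_of_ones = data.mean()  # empirical estimate of p

print(zeros, ones, share_of_ones)
```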

This brings us to the end of the blog. This Tecklearn ‘Data Aggregation and Bernoulli Distribution in Python’ blog covers commonly asked questions for anyone looking for a job in Python programming. If you wish to learn Python and build a career in the Data Science domain, check out our interactive Python with Data Science Training, which comes with 24*7 support to guide you throughout your learning period. Course details are available at the following link:

https://www.tecklearn.com/course/python-with-data-science/

