Top Data Science Interview Questions and Answers

Last updated on Feb 18 2022
Shankar Trivedi


What is logistic regression in Data Science?

Logistic regression is also called the logit model. It is a method for forecasting a binary outcome from a linear combination of predictor variables.

Name three types of biases that can occur during sampling

In the sampling process, there are three types of biases, which are:

  • Selection bias
  • Undercoverage bias
  • Survivorship bias

Discuss Decision Tree algorithm

A decision tree is a popular supervised machine learning algorithm. It is mainly used for Regression and Classification. It breaks a dataset down into smaller subsets. A decision tree can handle both categorical and numerical data.

How do you build a random forest model?

A random forest is built up of a number of decision trees. If you split the data into different packages and build a decision tree on each of the different groups of data, the random forest brings all those trees together.

Steps to build a random forest model:

  1. Randomly select ‘k’ features from a total of ‘m’ features where k << m
  2. Among the ‘k’ features, calculate the node D using the best split point
  3. Split the node into daughter nodes using the best split
  4. Repeat steps two and three until leaf nodes are finalized
  5. Build the forest by repeating steps one to four ‘n’ times to create ‘n’ trees
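A minimal sketch of this procedure using scikit-learn’s RandomForestClassifier, with synthetic data standing in for a real data set (all parameter values are illustrative, not taken from the article):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real data set
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of trees ('n'); max_features="sqrt" samples 'k' of the 'm' features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))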

How can you avoid overfitting your model?

Overfitting refers to a model that is fit to only a very small amount of data and ignores the bigger picture. There are three main methods to avoid overfitting:

  1. Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
  2. Use cross-validation techniques, such as k-fold cross-validation
  3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
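To make the last two points concrete, here is a hedged scikit-learn sketch combining k-fold cross-validation with LASSO regularization (data and parameter values are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# LASSO adds an L1 penalty that shrinks weak coefficients toward zero
model = Lasso(alpha=1.0)

# k-fold cross-validation gives a less optimistic estimate of generalization error
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())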

What are the differences between supervised and unsupervised learning?

dd1

How is logistic regression done?

Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).

The image shown below depicts how logistic regression works:

dd2
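A minimal sketch of the underlying sigmoid computation, with illustrative weights and features (not taken from the article):

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.8, -1.2])   # illustrative coefficients
bias = 0.3
features = np.array([2.0, 1.5])

# Linear combination of the predictors, squashed through the sigmoid
probability = sigmoid(np.dot(weights, features) + bias)
print(probability)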

Explain the steps in making a decision tree.

  1. Take the entire data set as input
  2. Calculate entropy of the target variable, as well as the predictor attributes
  3. Calculate the information gain of all attributes (we gain information on sorting different objects from each other)
  4. Choose the attribute with the highest information gain as the root node
  5. Repeat the same procedure on every branch until the decision node of each branch is finalized

For example, let’s say you want to build a decision tree to decide whether you should accept or decline a job offer. The decision tree for this case is as shown:

dd3

It is clear from the decision tree that an offer is accepted if:

  • The salary is greater than a given amount
  • The commute is less than an hour
  • Incentives are offered

Differentiate between univariate, bivariate, and multivariate analysis.

Univariate

Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.

Example: height of students

The patterns can be studied by drawing conclusions using mean, median, mode, dispersion or range, minimum, maximum, etc.

Bivariate

Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.

Example: temperature and ice cream sales in the summer season

Here, the relationship is visible from the data: temperature and sales are directly proportional to each other. The hotter the temperature, the better the sales.

Multivariate

Data involving three or more variables is categorized as multivariate. It is similar to bivariate analysis but contains more than one dependent variable.

Example: data for house price prediction

The patterns can be studied by drawing conclusions using mean, median, mode, dispersion or range, minimum, maximum, etc. You can start describing the data and using it to guess what the price of the house will be.

What are the feature selection methods used to select the right variables?

There are two main methods for feature selection: filter methods and wrapper methods.

Filter Methods

This involves:

  • Linear discriminant analysis
  • ANOVA
  • Chi-Square

The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.

Wrapper Methods

This involves:

  • Forward Selection: We test one feature at a time and keep adding them until we get a good fit
  • Backward Selection: We test all the features and start removing them to see what works better
  • Recursive Feature Elimination: Recursively looks through all the different features and how they pair together

Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
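As an illustration of a wrapper method, here is a hedged sketch of recursive feature elimination with scikit-learn’s RFE (data and parameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=1)

# RFE repeatedly fits the model and drops the weakest feature until 5 remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features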

You are given a data set consisting of variables with a high percentage of missing values. How will you deal with them?

The following are ways to handle missing data values:

If the data set is large, we can simply remove the rows with missing data values. This is the quickest way; we then use the rest of the data to predict the values.

For smaller data sets, we can substitute missing values with the mean or average of the rest of the data using a pandas DataFrame in Python, for example with df.fillna(df.mean()).
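A small sketch of the mean-imputation approach with pandas, using an illustrative DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 42, np.nan],
                   "salary": [50000, 62000, np.nan, 71000, 58000]})

# Fill each column's missing values with that column's mean
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)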

For two given points, plot1 and plot2, how will you calculate the Euclidean distance in Python?

plot1 = [x1, y1]

plot2 = [x2, y2]

The Euclidean distance can be calculated as follows:

euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
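A runnable version with illustrative coordinates (the example values from the original question are not shown above):

from math import sqrt

plot1 = [1, 3]   # illustrative points
plot2 = [2, 5]

euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
print(euclidean_distance)   # about 2.236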

What are dimensionality reduction and its benefits?

Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) to convey similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
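Principal component analysis (PCA) is one common way to perform this reduction; a hedged scikit-learn sketch on an illustrative data set:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data              # 4 original dimensions
pca = PCA(n_components=2)         # project onto 2 dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (150, 2)
print(pca.explained_variance_ratio_.sum())    # share of variance retained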

How will you calculate the eigenvalues and eigenvectors of a 3 x 3 matrix?

Start from the characteristic equation:

det(A − λI) = 0

Expanding the determinant gives a cubic polynomial in λ. The roots of this polynomial are the eigenvalues. By hand, a common approach is to find one root by hit and trial, factor out the corresponding (λ − root) term, and then solve the remaining quadratic for the other two eigenvalues.

For each eigenvalue λ, substitute it back into

(A − λI) v = 0

and solve the resulting system of linear equations for the components of v (X, Y, Z). Any non-trivial solution is an eigenvector corresponding to that eigenvalue. Similarly, we can calculate the eigenvectors for the remaining eigenvalues.
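In practice the same computation can be checked numerically; a sketch with NumPy on an illustrative matrix (not the matrix from the original question):

import numpy as np

# Illustrative 3 x 3 matrix
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # roots of det(A - lambda*I) = 0
print(eigenvectors)   # each column is the eigenvector for the matching eigenvalue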

How should you maintain a deployed model?

The steps to maintain a deployed model are:

Monitor

Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it’s doing what it’s supposed to do.

Evaluate

Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.

Compare

The new models are compared to each other to determine which model performs the best.

Rebuild

The best performing model is re-built on the current state of data.

What are recommender systems?

A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:

Collaborative Filtering

As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: “Users who bought this also bought…”

Content-based Filtering

As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.

What are RMSE and MSE in a linear regression model?

RMSE and MSE are two of the most common measures of accuracy for a linear regression model.

RMSE indicates the Root Mean Square Error.

MSE indicates the Mean Square Error.

How can you select k for k-means? 

We use the elbow method to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of ‘k’ (the number of clusters) and, for each value, compute the within-cluster sum of squares. The value of k where the curve bends (the “elbow”) is chosen.

The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
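A short sketch of the elbow method with scikit-learn, printing WSS (exposed as inertia_) for a range of k values (data and range are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compute WSS for each candidate k and look for the 'elbow' where the drop flattens out
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)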

What is the significance of p-value?

p-value typically ≤ 0.05

This indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

p-value typically > 0.05

This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

p-value at the cutoff of 0.05

This is considered to be marginal, meaning it could go either way.

How can outlier values be treated?

You can drop outliers only if they are garbage values.

Example: height of an adult = abc ft. This cannot be true, as the height cannot be a string value. In this case, outliers can be removed.

If the outliers have extreme values, they can be removed. For example, if all the data points are clustered in a narrow range but one point lies far outside it, we can remove that point.

If you cannot drop outliers, you can try the following:

  • Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
  • Try normalizing the data. This way, the extreme data points are pulled to a similar range.
  • You can use algorithms that are less affected by outliers; an example would be random forests.

How can time-series data be declared stationary?

It is stationary when the variance and mean of the series are constant with time.

Here is a visual example:

dd4

In the first graph, the variance is constant with time. Here, X is the time factor and Y is the variable. The value of Y goes through the same points all the time; in other words, it is stationary.

In the second graph, the waves get bigger, which means it is non-stationary and the variance is changing with time.

How can you calculate accuracy using a confusion matrix?

Consider this confusion matrix:

dd5

You can see the values for total data, actual values, and predicted values.

The formula for accuracy is:

Accuracy = (True Positive + True Negative) / Total Observations

Substitute the true positive and true negative counts from the matrix into this formula and divide by the total number of observations; the result, expressed as a percentage, is the model’s accuracy.

Write the equation and calculate the precision and recall rate.

Consider the same confusion matrix used in the previous question.

dd6

Precision = (True Positive) / (True Positive + False Positive)

Recall Rate = (True Positive) / (True Positive + False Negative)

Substitute the corresponding counts from the confusion matrix into these formulas to get the precision and the recall rate.
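Assuming scikit-learn is available, the same measures can be computed directly from the actual and predicted labels (the labels below are illustrative):

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative actual values
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # illustrative predicted values

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total observations
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)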

‘People who bought this also bought…’ recommendations seen on Amazon are a result of which algorithm?

The recommendation engine is accomplished with collaborative filtering. Collaborative filtering explains the behavior of other users and their purchase history in terms of ratings, selection, etc.

The engine makes predictions on what might interest a person based on the preferences of other users. In this algorithm, item features are unknown.

For example, a sales page shows that a certain number of people buy a new phone and also buy tempered glass at the same time. Next time, when a person buys a phone, he or she may see a recommendation to buy tempered glass as well.

Write a basic SQL query that lists all orders with customer information.

Usually, we have order tables and customer tables that contain the following columns:

Order Table

  • OrderId
  • CustomerId
  • OrderNumber
  • TotalAmount

Customer Table

  • Id
  • FirstName
  • LastName
  • City
  • Country

The SQL query is:

SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country

FROM Order

JOIN Customer

ON Order.CustomerId = Customer.Id

You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn’t you be happy with your model performance? What can you do about it?

Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as the measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection, and can greatly improve a patient’s prognosis.

Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier.

Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?

  • K-means clustering
  • Linear regression
  • K-NN (k-nearest neighbor)
  • Decision trees

The k-nearest neighbor (k-NN) algorithm can be used because, when a value is missing, it imputes it based on the values of the nearest neighbors found using all the other features.

When you’re dealing with K-means clustering or linear regression, you need to handle missing values in your pre-processing; otherwise, these algorithms will fail. Decision trees also have the same problem, although there is some variance.

Below are the eight actual values of the target variable in the train file. What is the entropy of the target variable?

The target variable here is binary, so if p of the eight values belong to one class and n to the other (p + n = 8), the formula for calculating the entropy is:

Entropy = -( p/8 log2(p/8) + n/8 log2(n/8) )

Substituting the class counts observed in the train file into this formula gives the answer.
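A small Python sketch of the same calculation, using an illustrative list of eight labels (the original values are not reproduced above):

import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a list of class labels, in bits
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy([0, 0, 0, 1, 1, 1, 1, 1]))   # about 0.954 bits for a 3-versus-5 split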

We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case?

Choose the correct option:

  1. Logistic Regression
  2. Linear Regression
  3. K-means clustering
  4. Apriori algorithm

The most appropriate algorithm for this case is A, logistic regression.

After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?

Choose the correct option:

  1. K-means clustering
  2. Linear regression
  3. Association rules
  4. Decision trees

As we are looking to group people by four different types of similarity, the number of groups indicates the value of k (k = 4). Therefore, K-means clustering (answer A) is the most appropriate algorithm for this study.

You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

Choose the right answer:

  1. {banana, apple, grape, orange} must be a frequent itemset
  2. {banana, apple} => {orange} must be a relevant rule
  3. {grape} => {banana, apple} must be a relevant rule
  4. {grape, apple} must be a frequent itemset

The answer is option 4: {grape, apple} must be a frequent itemset, because it is a subset of the frequent itemset {banana, apple, grape}.

Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?

  1. One-way ANOVA
  2. K-means clustering
  3. Association rules
  4. Student’s t-test

The answer is A: One-way ANOVA

What are the feature vectors?

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that’s easy to analyze.

What are the steps in making a decision tree?

  1. Take the entire data set as input.
  2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
  3. Apply the split to the input data (divide step).
  4. Re-apply steps one and two to the divided data.
  5. Stop when you meet any stopping criteria.
  6. Clean up the tree if you went too far doing splits; this step is called pruning.

What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its deduction from the problem-fault-sequence averts the final undesirable event from recurring.

What is logistic regression?

Logistic regression is also known as the logit model. It is a technique used to forecast the binary outcome from a linear combination of predictor variables.

What are recommender systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e., a validation data set) to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
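A minimal sketch of k-fold cross-validation with scikit-learn (data set and model are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on four folds, validate on the held-out fold, then rotate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())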

What is collaborative filtering?

Most recommender systems use this filtering process to find patterns and information by collaborating perspectives, numerous data sources, and several agents.

Do gradient descent methods always converge to similar points?

They do not, because in some cases, they reach a local minimum or local optima point. You would not reach the global optima point. This is governed by the data and the starting conditions.

What is the goal of A/B Testing?

This is statistical hypothesis testing for randomized experiments with two variables, A and B. The objective of A/B testing is to detect any changes to a web page to maximize or increase the outcome of a strategy.

What are the drawbacks of the linear model?

  • The assumption of linearity of the errors
  • It can’t be used for count outcomes or binary outcomes
  • There are overfitting problems that it can’t solve

What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment very frequently. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to what they are trying to estimate.

What are the confounding variables?

These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. The estimate fails to account for the confounding factor.

What is star schema?

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

How regularly must an algorithm be updated?

You will want to update an algorithm when:

  • You want the model to evolve as data streams through infrastructure
  • The underlying data source is changing
  • There is a case of non-stationarity

What are eigenvalue and eigenvector?

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; eigenvalues are the factors by which the transformation stretches or compresses along those directions.

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix.

Why is resampling done?

Resampling is done in any of these cases:

  • Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
  • Substituting labels on data points when performing significance tests
  • Validating models by using random subsets (bootstrapping, cross-validation)

What is selection bias?

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.

What are the types of biases that can occur during sampling?

  1. Selection bias
  2. Undercoverage bias
  3. Survivorship bias

What is survivorship bias?

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

How do you work towards a random forest?

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:

  1. Build several decision trees on bootstrapped training samples of data
  2. On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
  3. Rule of thumb: at each split, m = √p
  4. Predictions: take the majority vote across the trees

What is Selection Bias?

Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is usually associated with research where the selection of participants isn’t random. It is sometimes referred to as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.

The types of selection bias include:

  1. Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
  2. Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
  3. Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
  4. Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

What is bias-variance trade-off?

Bias: Bias is an error introduced in your model due to oversimplification of the machine learning algorithm. It can lead to underfitting. When you train your model, it makes simplified assumptions to make the target function easier to understand.

Low bias machine learning algorithms: Decision Trees, k-NN, and SVM. High bias machine learning algorithms: Linear Regression and Logistic Regression.

Variance: Variance is an error introduced in your model due to an overly complex machine learning algorithm; your model also learns noise from the training data set and performs badly on the test data set. It can lead to high sensitivity and overfitting.

Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

dd7

Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.

  1. The k-nearest neighbour algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
  2. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.

There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.

What is a confusion matrix?

The confusion matrix is a 2 x 2 table that contains outputs provided by the binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it.

A data set used for performance evaluation is called a test data set. It should contain the correct labels and predicted labels.

The predicted labels will be exactly the same if the performance of a binary classifier is perfect.

The predicted labels usually match with part of the observed labels in real-world scenarios.

A binary classifier predicts all data instances of a test data set as either positive or negative. This produces four outcomes:

  1. True-positive(TP) — Correct positive prediction
  2. False-positive(FP) — Incorrect positive prediction
  3. True-negative(TN) — Correct negative prediction
  4. False-negative(FN) — Incorrect negative prediction

Basic measures derived from the confusion matrix

  1. Error Rate = (FP+FN)/(P+N)
  2. Accuracy = (TP+TN)/(P+N)
  3. Sensitivity(Recall or True positive rate) = TP/P
  4. Specificity(True negative rate) = TN/N
  5. Precision(Positive predicted value) = TP/(TP+FP)
  6. F-Score (harmonic mean of precision and recall) = (1 + b²)(PREC · REC) / (b² · PREC + REC), where b is commonly 0.5, 1, or 2
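A plain-Python sketch of these measures, computed from the four counts (the counts passed in are illustrative):

def confusion_measures(tp, fp, tn, fn, beta=1.0):
    # Basic measures derived from the four confusion-matrix counts
    p, n = tp + fn, tn + fp
    error_rate = (fp + fn) / (p + n)
    accuracy = (tp + tn) / (p + n)
    recall = tp / p                    # sensitivity, true positive rate
    specificity = tn / n               # true negative rate
    precision = tp / (tp + fp)
    f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return error_rate, accuracy, recall, specificity, precision, f_score

print(confusion_measures(tp=40, fp=10, tn=45, fn=5))   # illustrative counts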

What is the difference between “long” and “wide” format data?

In the wide-format, a subject’s repeated responses will be in a single row, and each response is in a separate column. In the long-format, each row is a one-time point per subject. You can recognize data in wide format by the fact that columns generally represent groups.

dd8
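A short pandas sketch of converting wide-format data to long format (the frame is illustrative):

import pandas as pd

# Wide format: one row per subject, one column per measurement
wide = pd.DataFrame({"subject": ["A", "B"], "test1": [10, 12], "test2": [14, 11]})

# Long format: one row per subject per measurement
long_format = wide.melt(id_vars="subject", var_name="test", value_name="score")
print(long_format)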

What do you understand by the term Normal Distribution?

Data is usually distributed in different ways with a bias to the left or to the right or it can all be jumbled up.

However, there are chances that data is distributed around a central value without any bias to the left or right and reaches normal distribution in the form of a bell-shaped curve.

Figure: Normal distribution in a bell curve

The random variables are distributed in the form of a symmetrical, bell-shaped curve.

Properties of Normal Distribution are as follows;

  1. Unimodal (one mode)
  2. Symmetrical (left and right halves are mirror images)
  3. Bell-shaped (maximum height, the mode, at the mean)
  4. Mean, mode, and median are all located at the center
  5. Asymptotic

What is correlation and covariance in statistics?

Covariance and Correlation are two mathematical concepts; these two approaches are widely used in statistics. Both Correlation and Covariance establish the relationship and also measure the dependency between two random variables. Though the work is similar between these two in mathematical terms, they are different from each other.

dd9

Correlation: Correlation is considered or described as the best technique for measuring and also for estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related.

Covariance: In covariance, two items vary together; it is a measure that indicates the extent to which two random variables change in tandem. It is a statistical term that explains the systematic relation between a pair of random variables, wherein a change in one variable is reciprocated by a corresponding change in the other variable.
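A quick NumPy sketch of both quantities on illustrative data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

print(np.cov(x, y)[0, 1])        # covariance: unit-dependent measure of joint variation
print(np.corrcoef(x, y)[0, 1])   # correlation: strength of the relationship, scaled to [-1, 1]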

What is the difference between Point Estimates and Confidence Interval?

Point Estimation gives us a particular value as an estimate of a population parameter. Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.

A confidence interval gives us a range of values which is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likeliness or probability is called the Confidence Level or Confidence Coefficient and is represented by 1 − alpha, where alpha is the level of significance.

What is the goal of A/B Testing?

It is a hypothesis testing for a randomized experiment with two variables A and B.

The goal of A/B Testing is to identify any changes to the web page to maximize or increase the outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads

An example of this could be identifying the click-through rate for a banner ad.

What is p-value?

When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is a number between 0 and 1. Based on the value, it denotes the strength of the results. The claim which is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates strength against the null hypothesis, which means we can reject the null hypothesis. A high p-value (≥ 0.05) indicates strength for the null hypothesis, which means we fail to reject the null hypothesis. A p-value of 0.05 indicates the hypothesis could go either way. To put it another way:

High p-values: your data are likely with a true null. Low p-values: your data are unlikely with a true null.

In any given time interval, there is a fixed probability p that you will see at least one shooting star. What is the proba­bility that you see at least one shooting star in the period of an hour?

Let n be the number of such intervals in an hour.

Probability of not seeing any shooting star in one interval

=   1 – P( seeing at least one shooting star )   =   1 – p

Probability of not seeing any shooting star in the period of one hour

=   (1 – p) ^ n

Probability of seeing at least one shooting star in the one hour

=   1 – P( not seeing any star )   =   1 – (1 – p) ^ n

How can you generate a random number between 1 and 7 with only a die?

  • Any die has six sides, numbered 1 to 6. There is no way to get seven equal outcomes from a single roll of a die. If we roll the die twice and consider the event of two rolls, we now have 36 different outcomes.
  • To get our 7 equal outcomes we have to reduce this to a number divisible by 7. We can thus consider only 35 outcomes and exclude the other one.
  • A simple scenario can be to exclude the combination (6, 6), i.e., to roll the die again if 6 appears twice.
  • All the remaining combinations, from (1, 1) to (6, 5), can be divided into 7 parts of 5 each. This way, all the seven sets of outcomes are equally likely. A short simulation of this scheme is sketched below.
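A hedged simulation of this rejection scheme in Python:

import random

def roll_1_to_7():
    # Re-roll whenever the excluded combination (6, 6) appears,
    # then map the remaining 35 outcomes into 7 equal groups of 5
    while True:
        first, second = random.randint(1, 6), random.randint(1, 6)
        outcome = (first - 1) * 6 + (second - 1)   # 0..35
        if outcome < 35:                           # 35 corresponds to (6, 6)
            return outcome // 5 + 1                # uniform over 1..7

print([roll_1_to_7() for _ in range(10)])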

A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?

In the case of two children, there are 4 equally likely possibilities:

BB, BG, GB and GG;

where B = Boy and G = Girl, and the first letter denotes the first child.

From the question, we can exclude the first case of BB. Thus, from the remaining possibilities of BG, GB, and GG, we have to find the probability of the case with two girls.

Thus, P(having two girls given at least one girl)   =   1/3

A jar contains a number of coins, all but one of which are fair and one of which is double-headed. Pick a coin at random, and toss it several times. Given that you see only heads, what is the probability that the next toss of that coin is also a head?

There are two ways of choosing the coin. One is to pick a fair coin and the other is to pick the one with two heads. This is an application of Bayes’ theorem. Let k be the number of heads observed.

Probability of selecting a fair coin = (number of fair coins) / (total number of coins)
Probability of selecting the unfair coin = 1 / (total number of coins)

P (A)  =  P( selecting a fair coin and then seeing k heads in a row )  =  P( fair coin ) * (1/2)^k
P (B)  =  P( selecting the unfair coin and then seeing k heads in a row )  =  P( unfair coin ) * 1

P( A / A + B )  =  P(A) / ( P(A) + P(B) )
P( B / A + B )  =  P(B) / ( P(A) + P(B) )

Probability of the next toss being a head  =  P(A/A+B) * 1/2  +  P(B/A+B) * 1

What do you understand by statistical power of sensitivity and how do you calculate it?

Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.).

Sensitivity is nothing but “Predicted True events/ Total events”. True events here are the events which were true and model also predicted them as true.

Calculation of sensitivity is pretty straightforward.

Sensitivity = (True Positives) / (Positives in Actual Dependent Variable)

Why Is Re-sampling Done?

Resampling is done in any of these cases:

  • Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
  • Substituting labels on data points when performing significance tests
  • Validating models by using random subsets (bootstrapping, cross-validation)

What are the differences between over-fitting and under-fitting?

In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on general untrained data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted, has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model too would have poor predictive performance.

How to combat Overfitting and Underfitting?

To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and by having a validation dataset to evaluate the model.

What is regularisation? Why is it useful?

Regularisation is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a constant multiple of the weight vector’s norm as a penalty term, commonly the L1 norm (Lasso) or the L2 norm (ridge). The model predictions should then minimize the loss function calculated on the regularized training set.

What Is the Law of Large Numbers?

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample means, the sample variance and the sample standard deviation converge to what they are trying to estimate.

What Are Confounding Variables?

In statistics, a confounder is a variable that influences both the dependent variable and independent variable.

For example, if you are researching whether a lack of exercise leads to weight gain,

lack of exercise = independent variable

weight gain = dependent variable.

A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.

What Are the Types of Biases That Can Occur During Sampling?

  • Selection bias
  • Undercoverage bias
  • Survivorship bias

What is Survivorship Bias?

It is the logical error of focusing on aspects that support surviving some process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

What is selection Bias?

Selection bias occurs when the sample obtained is not representative of the population intended to be analysed.

Explain how a ROC curve works?

The ROC curve is a graphical representation of the contrast between true positive rates and false-positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity(true positive rate) and false-positive rate.

dd10

What is TF/IDF vectorization?

TF-IDF is short for term frequency–inverse document frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

The TF–IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
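Assuming scikit-learn is available, a minimal sketch of computing TF-IDF weights for a tiny illustrative corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats are pets"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))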

Why do we generally use the Softmax non-linearity function as the last operation in a network?

It is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever; there are no constraints).

Then the i-th component of Softmax(x) is:

Softmax(x)_i = exp(x_i) / sum_j exp(x_j)

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.
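A small NumPy sketch of the softmax computation (the input scores are illustrative):

import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability without changing the result
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores), softmax(scores).sum())   # probabilities that sum to 1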

Python or R – Which one would you prefer for text analytics?

We will prefer Python because of the following reasons:

  • Python would be the best option because it has Pandas library that provides easy to use data structures and high-performance data analysis tools.
  • R is more suitable for machine learning than just text analysis.
  • Python performs faster for all types of text analytics.

How does data cleaning play a vital role in analysis?

Data cleaning can help in analysis because:

  • Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
  • Data Cleaning helps to increase the accuracy of the model in machine learning.
  • It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
  • It might take up a large share of the project time just to clean the data, making it a critical part of the analysis task.

Differentiate between univariate, bivariate and multivariate analysis.

Univariate analyses are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales based on territory involves only one variable, and the analysis can be referred to as univariate analysis.

The bivariate analysis attempts to understand the difference between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales and spending can be considered an example of bivariate analysis.

Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.

Explain Star Schema.

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.

Can you cite some examples where a false positive is more important than a false negative?

Let us first understand what false positives and false negatives are.

  • False Positives are the cases where you wrongly classified a non-event as an event a.k.a Type I error.
  • False Negatives are the cases where you wrongly classify events as non-events, a.k.a Type II error.

Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to the hospital and tests positive for cancer based on the lab prediction, but he actually doesn’t have cancer. This is a case of a false positive. It is extremely dangerous to start chemotherapy on this patient when he actually does not have cancer. In the absence of cancerous cells, chemotherapy will damage his normal healthy cells and might lead to severe illness, even cancer.

Example 2: Let’s say an e-commerce company decided to give a gift voucher to the customers whom they expect to purchase a large amount worth of items. They send the voucher mail directly to customers without any minimum purchase condition because they assume they will still make a profit on the items those customers buy. The issue arises if the gift vouchers are sent to customers who have not actually purchased anything but were wrongly marked as high-value purchasers.

Can you cite some examples where a false negative is more important than a false positive?

Example 1: Assume there is an airport ‘A’ which has received high-security threats, and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk positives by their predictive model. What will happen if a true threat customer is flagged as non-threat by the airport model?

Example 2: What if a jury or judge decides to let a criminal go free?

Example 3: What if you declined to marry a very good person based on your predictive model, and you happen to meet him/her after a few years and realize that you had a false negative?

Can you cite some examples where both false positive and false negatives are equally important?

In the Banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.

Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.

What is Cluster Sampling?

Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.

For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research through simple or systematic random sampling.

Let’s continue our Data Science Interview Questions blog with some more statistics questions.

What is Systematic Sampling?

Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is the equal-probability method.

What are Eigenvectors and Eigenvalues?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.

Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.

Can you explain the difference between a Validation Set and a Test Set?

A validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built.

On the other hand, a Test Set is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters (i.e., the weights), and the test set is used to assess the performance of the model, i.e., to evaluate its predictive power and generalization.

Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a portion of the data to test the model in the training phase (i.e., a validation data set) in order to limit problems like overfitting and get an insight into how the model will generalize to an independent data set.

What is Machine Learning?

Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics and is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.

What is Supervised Learning?

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.

Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks

E.g. If you built a fruit classifier, the labels will be “this is an orange, this is an apple and this is a banana”, based on showing the classifier examples of apples, oranges and bananas.

What is Unsupervised learning?

Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses.

Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models

E.g. In the same example, a fruit clustering algorithm will categorize fruits as “fruits with soft skin and lots of dimples”, “fruits with shiny hard skin” and “elongated yellow fruits”.

What are the various classification algorithms?

The diagram lists the most important classification algorithms.

What is ‘Naive’ in a Naive Bayes?

The Naive Bayes Algorithm is based on the Bayes Theorem. Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

The Algorithm is ‘naive’ because it makes assumptions that may or may not turn out to be correct.

Explain SVM algorithm in detail.

SVM stands for support vector machine, it is a supervised machine learning algorithm which can be used for both Regression and Classification. If you have n features in your training data set, SVM tries to plot it in n-dimensional space with the value of each feature being the value of a particular coordinate. SVM uses hyperplanes to separate out different classes based on the provided kernel function.

dd11

What are the different kernels in SVM?

There are four types of kernels in SVM.

  1. Linear Kernel
  2. Polynomial kernel
  3. Radial basis kernel
  4. Sigmoid kernel

Explain Decision Tree algorithm in detail.

A decision tree is a supervised machine learning algorithm mainly used for Regression and Classification. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.

dd12

What are Entropy and Information gain in Decision tree algorithm?

The core algorithm for building a decision tree is called ID3. ID3 uses Entropy and Information Gain to construct a decision tree.

Why do you need to perform resampling?

Resampling is done in below-given cases:

  • Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of the data point or using as subsets of accessible data
  • Substituting labels on data points when performing necessary tests
  • Validating models by using random subsets

List out the libraries in Python used for Data Analysis and Scientific Computations.

  • SciPy
  • Pandas
  • Matplotlib
  • NumPy
  • SciKit
  • Seaborn

What is Power Analysis?

Power analysis is an integral part of experimental design. It helps you determine the sample size required to detect an effect of a given size from a cause with a specific level of assurance. It also allows you to deploy a particular probability in a sample size constraint.

Explain Collaborative filtering

Collaborative filtering is used to search for correct patterns by collaborating viewpoints, multiple data sources, and various agents.

What is bias?

Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm. It can lead to underfitting.

Discuss ‘Naive’ in a Naive Bayes algorithm?

The Naive Bayes Algorithm model is based on the Bayes Theorem. It describes the probability of an event. It is based on prior knowledge of conditions which might be related to that specific event.

What is a Linear Regression?

Linear regression is a statistical programming method where the score of a variable ‘A’ is predicted from the score of a second variable ‘B’. B is referred to as the predictor variable and A as the criterion variable.

State the difference between the expected value and mean value

There are not many differences, but these two terms are used in different contexts. Mean value is generally referred to when you are discussing a probability distribution, whereas expected value is referred to in the context of a random variable.

What is the aim of conducting A/B Testing?

A/B testing is used to conduct randomized experiments with two variables, A and B. The goal of this testing method is to find out which changes to a web page maximize or increase the outcome of a strategy.

What is Ensemble Learning?

An ensemble is a method of combining a diverse set of learners together to improve the stability and predictive power of the model. Two types of ensemble learning methods are:

Bagging

The bagging method helps you implement similar learners on small sample populations and then combine their predictions, which helps you make more accurate predictions.

Boosting

Boosting is an iterative method which allows you to adjust the weight of an observation depending upon the last classification. Boosting decreases the bias error and helps you build strong predictive models.

Discuss Artificial Neural Networks

Artificial Neural Networks (ANN) are a special set of algorithms that have revolutionized machine learning. They adapt according to changing input, so the network generates the best possible result without redesigning the output criteria.

What is Back Propagation?

Back-propagation is the essence of neural net training. It is the method of tuning the weights of a neural net based upon the error rate obtained in the previous epoch. Proper tuning of the weights helps you reduce error rates and make the model reliable by increasing its generalization.

What is the K-means clustering method?

K-means clustering is an important unsupervised learning method. It is the technique of classifying data into a certain number of clusters, called K clusters. It is deployed for grouping data to find out the similarity within it.

Explain the difference between Data Science and Data Analytics

Data scientists need to slice data to extract valuable insights that a data analyst can apply to real-world business scenarios. The main difference between the two is that data scientists have more technical knowledge than data analysts. Moreover, they do not need the understanding of the business that is required for data visualization.

Explain the method to collect and analyze data to use social media to predict the weather condition.

You can collect social media data using the Facebook, Twitter, and Instagram APIs. For example, for Twitter, we can construct features from each tweet such as the tweet date, retweets, list of followers, etc. Then you can use a multivariate time series model to predict the weather condition.

When do you need to update the algorithm in Data science?

You need to update an algorithm in the following situation:

  • You want your data model to evolve as data streams through the infrastructure
  • The underlying data source is changing
  • There is a case of non-stationarity

Explain the benefits of using statistics by Data Scientists

Statistics help data scientists get a better idea of customers’ expectations. Using statistical methods, data scientists can gain knowledge regarding consumer interest, behavior, engagement, retention, and so on. Statistics also help you build powerful data models to validate certain inferences and predictions.

Name various types of Deep Learning Frameworks

  • Pytorch
  • Microsoft Cognitive Toolkit
  • TensorFlow
  • Caffe
  • Chainer
  • Keras

Explain Auto-Encoder

Autoencoders are learning networks that help you transform inputs into outputs with as few errors as possible. This means that the output will be as close to the input as possible.

Define Boltzmann Machine

A Boltzmann machine is a simple learning algorithm. It helps you discover the features that represent complex regularities in the training data. This algorithm allows you to optimize the weights and the quantities for the given problem.

Explain why Data Cleansing is essential and which method you use to maintain clean data

Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail.

What is skewed Distribution & uniform distribution?

A skewed distribution occurs when data is concentrated on one side of the plot, whereas a uniform distribution is identified when the data is spread equally across the range.

When does underfitting occur in a statistical model?

Underfitting occurs when a statistical model or machine learning algorithm is not able to capture the underlying trend of the data.

What is reinforcement learning?

Reinforcement Learning is a learning mechanism for mapping situations to actions so as to maximize the reward signal. In this method, a learner is not told which action to take but instead must discover which actions offer the maximum reward, as this method is based on a reward/penalty mechanism.

Name commonly used algorithms.

The four most commonly used algorithms by data scientists are:

  • Linear regression
  • Logistic regression
  • Random Forest
  • KNN

What is precision?

Precision is the most commonly used error metric in classification. Its range is from 0 to 1, where 1 represents 100%.

What is a univariate analysis?

An analysis that is applied to one attribute at a time is known as univariate analysis. The boxplot is a widely used univariate model.

How do you overcome challenges to your findings?

To overcome challenges to your findings, you need to encourage discussion, demonstrate leadership, and respect different opinions.

Explain cluster sampling technique in Data science

A cluster sampling method is used when it is challenging to study the target population spread across a wide area and simple random sampling can’t be applied.

State the difference between a Validation Set and a Test Set

A validation set is mostly considered part of the training set, as it is used for parameter selection, which helps you avoid overfitting of the model being built.

While a Test Set is used for testing or evaluating the performance of a trained machine learning model.

Explain the term Binomial Probability Formula?

The binomial distribution contains the probabilities of every possible number of successes in N trials for independent events that each have a probability π of occurring. The probability of exactly k successes is P(X = k) = C(N, k) · π^k · (1 − π)^(N − k).

What is a recall?

Recall is the ratio of true positives to all actual positives, i.e., TP / (TP + FN). It ranges from 0 to 1.

Discuss normal distribution

A normal distribution is symmetrically distributed, so the mean, median, and mode are equal.

While working on a data set, how can you select important variables? Explain

You can use the following methods of variable selection:

  • Remove the correlated variables before selecting important variables
  • Use linear regression and select variables based on their p-values.
  • Use Backward, Forward Selection, and Stepwise Selection
  • Use Xgboost, Random Forest, and plot variable importance chart.
  • Measure information gain for the given set of features and select top n features accordingly.

Is it possible to capture the correlation between continuous and categorical variable?

Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association between continuous and categorical variables.

Would treating a categorical variable as a continuous variable result in a better predictive model?

Yes, but only when the categorical variable is ordinal in nature; in that case, treating it as continuous can result in a better predictive model.

What is exploding gradients ?

Gradient:

Gradient is the direction and magnitude calculated during training of a neural network that is used to update the network weights in the right direction and by the right amount.

“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of weights can become so large as to overflow and result in NaN values.

This has the effect of making your model unstable and unable to learn from your training data.

What is selection Bias ?

Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.

Explain SVM machine learning algorithm in detail.

SVM stands for Support Vector Machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM tries to plot the data in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM uses hyperplanes to separate the different classes based on the provided kernel function.
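
As a quick illustration (a minimal scikit-learn sketch on a toy dataset; not part of the original answer, and the kernel and data choices are just examples):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy data: 4 features per sample, 3 classes
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The kernel defines how the separating hyperplane is constructed in feature space
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))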

What are support vectors in SVM?

Support vectors are the data points that lie closest to the separating hyperplane. In a typical SVM diagram, thinner lines mark the distance from the classifier to these closest (darkened) data points, the support vectors, and the distance between the two thin lines is called the margin.

What is Prior probability and likelihood?

Prior probability is the proportion of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation in the presence of some other variable.

Explain Recommender Systems?

It is a subclass of information filtering techniques. It helps you predict the preferences or ratings that users are likely to give to a product.

Name three disadvantages of using a linear model

Three disadvantages of the linear model are:

  • The assumption of linearity of the errors.
  • You can’t use this model for binary or count outcomes
  • There are plenty of overfitting problems that it can’t solve

What are the different kernels functions in SVM ?

There are four types of kernels in SVM.

  1. Linear Kernel
  2. Polynomial kernel
  3. Radial basis kernel
  4. Sigmoid kernel

What is pruning in Decision Tree ?

When we remove sub-nodes of a decision node, the process is called pruning; it is the opposite of splitting.

What is Ensemble Learning ?

Ensemble learning is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. There are many types of ensemble learning, but two of the more popular techniques are described below.

Bagging

Bagging trains similar learners on small sample populations and then takes the mean of all the predictions. In generalized bagging, you can use different learners on different populations. As you would expect, this helps reduce the variance error.

Boosting

Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, boosting increases its weight, and vice versa. Boosting generally decreases the bias error and builds strong predictive models; however, it may overfit the training data.
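
A minimal scikit-learn sketch of both ideas (the synthetic dataset and estimator counts are chosen only for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    # Bagging: many trees trained on bootstrap samples, predictions combined by voting
    bagging = BaggingClassifier(n_estimators=50, random_state=0)  # default base learner is a decision tree

    # Boosting: trees added sequentially, each one focusing on the previous errors
    boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

    print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
    print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())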

What cross-validation technique would you use on a time series data set?

Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data; it is inherently ordered chronologically.

In the case of time series data, you should use techniques like forward chaining, where you build the model on past data and then look at forward-facing data.

fold 1: training [1], test [2]

fold 2: training [1 2], test [3]

fold 3: training [1 2 3], test [4]

fold 4: training [1 2 3 4], test [5]
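
scikit-learn's TimeSeriesSplit implements exactly this forward-chaining scheme; a minimal sketch (the ten dummy observations are just for illustration):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(10).reshape(-1, 1)   # 10 time-ordered observations
    y = np.arange(10)

    tscv = TimeSeriesSplit(n_splits=4)
    for i, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
        # The training window always precedes the test window in time
        print(f"fold {i}: train={list(train_idx)}, test={list(test_idx)}")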

What is logistic regression? Or State an example when you have used logistic regression recently.

Logistic regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win the election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (Win/Lose). The predictor variables here would be the amount of money spent on election campaigning for a particular candidate, the amount of time spent campaigning, etc.

What is a Box Cox Transformation?

The dependent variable in a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape. Most statistical techniques assume normality, so if the given data is not normal, applying a Box Cox transformation means you can run a broader number of tests.

dd13

A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques, if your data isn’t normal, applying a Box-Cox means that you are able to run a broader number of tests. The Box Cox transformation is named after statisticians George Box and Sir David Roxbee Cox who collaborated on a paper and developed the technique.
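
In practice the transformation is a one-liner; a minimal SciPy sketch (the exponential sample data is only an example of a right-skewed variable):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    skewed = rng.exponential(scale=2.0, size=1000)      # strictly positive, right-skewed data

    transformed, fitted_lambda = stats.boxcox(skewed)   # lambda is estimated by maximum likelihood
    print("estimated lambda:", round(fitted_lambda, 3))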

How will you define the number of clusters in a clustering algorithm?

Though the clustering algorithm is not specified, this question is mostly asked in reference to K-Means clustering, where "K" defines the number of clusters. For example, a scatter plot of the data might show three distinct groups.

The Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you get a curve generally known as the Elbow Curve.

The bending point of the curve, i.e. the number of clusters after which you don't see any significant decrease in WSS, is taken as K in K-Means. This is the most widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
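
A minimal sketch of producing the Elbow Curve with scikit-learn, where WSS is exposed as inertia_ (the blob data and the range of k values are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    wss = []
    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wss.append(km.inertia_)          # within-cluster sum of squares

    plt.plot(range(1, 10), wss, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("WSS (inertia)")
    plt.show()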

What is deep learning?

Deep learning is a subfield of machine learning inspired by the structure and function of the brain, called an artificial neural network. Machine learning includes many algorithms, such as linear regression, SVM, and neural networks, and deep learning is an extension of neural networks. In ordinary neural nets we consider a small number of hidden layers, but in deep learning algorithms we consider a huge number of hidden layers to better understand the input-output relationship.

What are Recurrent Neural Networks (RNNs)?

Recurrent nets are a type of artificial neural network designed to recognize patterns in sequences of data, such as time series, stock market data, and so on. To understand recurrent nets, you first have to understand the basics of feedforward nets. Both RNNs and feedforward networks are named after the way they channel information through a series of mathematical operations performed at the nodes of the network. One feeds information straight through (never touching the same node twice), while the other cycles it through a loop; the latter are called recurrent.

Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time. In a typical diagram, BTSXPE at the bottom represents the input example in the current moment, and the CONTEXT UNIT represents the output of the previous moment. The decision a recurrent neural network reached at time t−1 affects the decision it will reach one moment later at time t. So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life.

The error they generate will return via back propagation and be used to adjust their weights until error can’t go any lower. Remember, the purpose of recurrent nets is to accurately classify sequential input. We rely on the back propagation of error and gradient descent to do so.

Back propagation in feed forward networks moves backward from the final error through the outputs, weights and inputs of each hidden layer, assigning those weights responsibility for a portion of the error by calculating their partial derivatives — ∂E/∂w, or the relationship between their rates of change. Those derivatives are then used by our learning rule, gradient descent, to adjust the weights up or down, whichever direction decreases error.

Recurrent networks rely on an extension of back propagation called back propagation through time, or BPTT. Time, in this case, is simply expressed by a well-defined, ordered series of calculations linking one-time step to the next, which is all back propagation needs to work.

What is the difference between machine learning and deep learning?

Machine learning:

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorized into the following three categories:

  1. Supervised machine learning,
  2. Unsupervised machine learning,
  3. Reinforcement learning

 

Deep learning:

Deep Learning is a sub field of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.

What is reinforcement learning ?

Reinforcement learning

 

Reinforcement Learning is learning what to do and how to map situations to actions. The end result is to maximize the numerical reward signal. The learner is not told which action to take, but instead must discover which action will yield the maximum reward. Reinforcement learning is inspired by the learning of human beings; it is based on a reward/penalty mechanism.

Explain what regularization is and why it is useful.

Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a penalty proportional to the norm of the weight vector; this penalty is often the L1 norm (Lasso) or the L2 norm (Ridge). The model predictions should then minimize the loss function calculated on the regularized training set.
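
For instance, with scikit-learn (a sketch only; alpha plays the role of the regularization constant, and the synthetic data is made up):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
    lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

    print("non-zero ridge coefficients:", (ridge.coef_ != 0).sum())
    print("non-zero lasso coefficients:", (lasso.coef_ != 0).sum())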

What is TF/IDF vectorization?

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently in general.
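
A minimal sketch with scikit-learn's TfidfVectorizer (the three example sentences are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "data science is fun",
        "data science requires statistics",
        "statistics is fun",
    ]

    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)         # sparse matrix: documents x vocabulary
    print(vec.get_feature_names_out())      # the learned vocabulary
    print(tfidf.toarray().round(2))         # tf-idf weight of each word in each document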

What are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

What is the difference between Regression and classification ML techniques.

Both regression and classification machine learning techniques come under supervised machine learning algorithms. In supervised machine learning, we have to train the model using a labelled data set; while training, we explicitly provide the correct labels, and the algorithm tries to learn the pattern from input to output. If the labels are discrete values, it is a classification problem (e.g. classes A, B, etc.), but if the labels are continuous values, it is a regression problem (e.g. 1.5, 2.7, etc.).

If your machine has far less RAM than the size of the data set you want to train on (say, a few GB of RAM against a much larger data set), how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?

First of all, you have to ask which ML model you want to train.

For neural networks: batching with memory-mapped NumPy arrays will work.

Steps:

  1. Load the whole data set as a memory-mapped NumPy array. A memory map references the data on disk instead of loading the complete data set into memory (see the sketch after these steps).
  2. Pass indices to the NumPy array to get the required data.
  3. Pass this data to the neural network.
  4. Keep the batch size small.
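
A minimal sketch of this idea with numpy.memmap (the file name, array shape, and batch size are invented for illustration, and the training call is only indicated as a comment):

    import numpy as np

    # Assume the features were saved earlier as a raw float32 file of shape (1_000_000, 100)
    n_rows, n_cols, batch_size = 1_000_000, 100, 256
    X = np.memmap("features.dat", dtype="float32", mode="r", shape=(n_rows, n_cols))

    for start in range(0, n_rows, batch_size):
        batch = np.asarray(X[start:start + batch_size])   # only this slice is read into RAM
        # model.train_on_batch(batch, ...)                 # hypothetical call feeding the batch to the network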

For SVM: Partial fit will work

Steps:

  1. Divide the one big data set into small data sets.
  2. Use a partial-fit method, which requires only a subset of the complete data set at a time (see the sketch after these steps).
  3. Repeat step 2 for the other subsets.
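
Note that scikit-learn's kernel SVC does not expose partial_fit; a linear SVM trained with SGDClassifier(loss="hinge") does, so a sketch of the out-of-core idea might look like this (load_chunks() is a hypothetical generator yielding small subsets):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    classes = np.array([0, 1])                     # all classes must be declared up front
    clf = SGDClassifier(loss="hinge")              # hinge loss corresponds to a linear SVM

    for X_chunk, y_chunk in load_chunks():         # hypothetical generator over small subsets
        clf.partial_fit(X_chunk, y_chunk, classes=classes)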

What is p-value?

When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is a number between 0 and 1; based on its value, it denotes the strength of the results. The claim that is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates strength against the null hypothesis, which means we can reject the null hypothesis. A high p-value (≥ 0.05) indicates strength for the null hypothesis, which means we can accept the null hypothesis. A p-value of 0.05 indicates the hypothesis could go either way. To put it another way:

High p-values: your data are likely with a true null. Low p-values: your data are unlikely with a true null.

What are different ranking algorithms?

Traditional ML algorithms solve a prediction problem (classification or regression) on a single instance at a time. E.g. if you are doing spam detection on email, you will look at all the features associated with that email and classify it as spam or not. The aim of traditional ML is to come up with a class (spam or no-spam) or a single numerical score for that instance.

Ranking algorithms like LTR (learning to rank) solve a ranking problem on a list of items. The aim of LTR is to come up with an optimal ordering of those items. As such, LTR doesn't care much about the exact score that each item gets, but cares more about the relative ordering among all the items. RankNet, LambdaRank, and LambdaMART are all LTR algorithms developed by Chris Burges and his colleagues at Microsoft Research.

  1. RankNet — The cost function for RankNet aims to minimize the number of inversions in ranking. RankNet optimizes the cost function using Stochastic Gradient Descent.
  2. LambdaRank — Burges et al. found that during the RankNet training procedure, you don't need the costs, only the gradients (λ) of the cost with respect to the model score. You can think of these gradients as little arrows attached to each document in the ranked list, indicating the direction we'd like those documents to move. They further found that scaling the gradients by the change in NDCG obtained by swapping each pair of documents gave good results. The core idea of LambdaRank is to use this new cost function for training a RankNet. On experimental datasets, this shows both speed and accuracy improvements over the original RankNet.
  3. LambdaMart — LambdaMART combines LambdaRank and MART (Multiple Additive Regression Trees). While MART uses gradient boosted decision trees for prediction tasks, LambdaMART uses gradient boosted decision trees using a cost function derived from LambdaRank for solving a ranking task. On experimental datasets, LambdaMART has shown better results than LambdaRank and the original RankNet.

Can you enumerate the various differences between Supervised and Unsupervised Learning?

Answer: Supervised learning is a type of machine learning where a function is inferred from labeled training data. The training data contains a set of training examples.

Unsupervised learning, on the other hand, is a type of machine learning where inferences are drawn from datasets containing input data without labeled responses. Following are the various other differences between the two types of machine learning:

  • Algorithms Used – Supervised learning makes use of Decision Trees, K-nearest Neighbor algorithm, Neural Networks, Regression, and Support Vector Machines. Unsupervised learning uses Anomaly Detection, Clustering, Latent Variable Models, and Neural Networks.
  • Enables – Supervised learning enables classification and regression, whereas unsupervised learning enables clustering, dimension reduction, and density estimation
  • Use – While supervised learning is used for prediction, unsupervised learning finds use in analysis

What do you understand by the Selection Bias? What are its various types?

Answer: Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect.

In other words, selection bias is a distortion of statistical analysis that results from the sample collecting method. When selection bias is not taken into account, some conclusions made by a research study might not be accurate. Following are the various types of selection bias:

  • Sampling Bias – A systematic error resulting due to a non-random sample of a populace causing certain members of the same to be less likely included than others that results in a biased sample.
  • Time Interval – A trial might be ended at an extreme value, usually due to ethical reasons, but the extreme value is most likely to be reached by the variable with the most variance, even though all variables have a similar mean.
  • Data – Results when specific data subsets are selected for supporting a conclusion or rejection of bad data arbitrarily.
  • Attrition – Caused due to attrition, i.e. loss of participants, discounting trial subjects or tests that didn’t run to completion.

Please explain the goal of A/B Testing.

Answer: A/B Testing is a statistical hypothesis testing meant for a randomized experiment with two variables, A and B. The goal of A/B Testing is to maximize the likelihood of an outcome of some interest by identifying any changes to a webpage.

A highly reliable method for finding out the best online marketing and promotional strategies for a business, A/B Testing can be employed for testing everything, ranging from sales emails to search ads and website copy.

How will you calculate the Sensitivity of machine learning models?

Answer: In machine learning, Sensitivity is used for validating the accuracy of a classifier, such as Logistic, Random Forest, and SVM. It is also known as REC (recall) or TPR (true positive rate).

Sensitivity can be defined as the ratio of predicted true events and total events i.e.:

Sensitivity = True Positives / Positives in Actual Dependent Variable

Here, true events are those events that were correctly predicted as true by the machine learning model. The best sensitivity is 1.0 and the worst sensitivity is 0.0.

Could you draw a comparison between overfitting and underfitting?

Answer: In order to make reliable predictions on general untrained data in machine learning and statistics, it is required to fit a (machine learning) model to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.

Following are the various differences between overfitting and underfitting:

  • Definition – A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.
  • Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. Example of a complex model is one having too many parameters when compared to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.
  • Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.

What do you mean by cluster sampling and systematic sampling?

Answer: When studying the target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample, in which each of the sampling units is a collection or cluster of elements.

Following the technique of systematic sampling, elements are chosen from an ordered sampling frame. The list is advanced in a circular fashion. This is done in such a way so that once the end of the list is reached, the same is progressed from the start, or top, again.

Can you compare the validation set with the test set?

Answer: A validation set is part of the training set used for parameter selection as well as for avoiding overfitting of the machine learning model being developed. On the contrary, a test set is meant for evaluating or testing the performance of a trained machine learning model.

What do you understand by linear regression and logistic regression?

Answer: Linear regression is a form of statistical technique in which the score of some variable Y is predicted on the basis of the score of a second variable X, referred to as the predictor variable. The Y variable is known as the criterion variable.

Also known as the logit model, logistic regression is a statistical technique for predicting the binary outcome from a linear combination of predictor variables.

Please explain Recommender Systems along with an application.

Answer: Recommender Systems is a subclass of information filtering systems, meant for predicting the preferences or ratings awarded by a user to some product.

An application of a recommender system is the product recommendations section in Amazon. This section contains items based on the user’s search history and past orders.

What are outlier values and how do you treat them?

Answer: Outlier values, or simply outliers, are data points in statistics that don’t belong to a certain population. An outlier value is an abnormal observation that is very much different from other values belonging to the set.

Identification of outlier values can be done by using univariate or some other graphical analysis method. A few outlier values can be assessed individually, but assessing a large set of outlier values requires substituting them with either the 99th or the 1st percentile values.

There are two popular ways of treating outlier values:

  1. To change the value so that it can be brought within a range
  2. To simply remove the value

Note: – Not all extreme values are outlier values.

Please enumerate the various steps involved in an analytics project.

Answer: Following are the numerous steps involved in an analytics project:

  • Understanding the business problem
  • Exploring the data and familiarizing with the same
  • Preparing the data for modeling by means of detecting outlier values, transforming variables, treating missing values, et cetera
  • Running the model and analyzing the result for making appropriate changes or modifications to the model (an iterative step that repeats until the best possible outcome is gained)
  • Validating the model using a new dataset
  • Implementing the model and tracking the result for analyzing the performance of the same

Could you explain how to define the number of clusters in a clustering algorithm?

Answer: The primary objective of clustering is to group together similar entities in such a way that, while entities within a group are similar to each other, the groups remain different from one another.

Generally, the Within Sum of Squares is used for explaining the homogeneity within a cluster. For defining the number of clusters in a clustering algorithm, WSS is plotted for a range pertaining to a number of clusters. The resultant graph is known as the Elbow Curve.

The Elbow Curve graph contains a point after which there aren't any significant decrements in the WSS. This is known as the bending point and represents K in K-Means.

Although the aforementioned is the widely-used approach, another important approach is the Hierarchical clustering. In this approach, dendrograms are created first and then distinct groups are identified from there.

 What do you understand by Deep Learning?

Answer: Deep Learning is a paradigm of machine learning that displays a great degree of analogy with the functioning of the human brain. It is a neural-network-based method that makes use of architectures such as convolutional neural networks (CNNs).

Deep learning has a wide array of uses, ranging from social network filtering to medical image analysis and speech recognition. Although Deep Learning has been present for a long time, it’s only recently that it has gained worldwide acclaim. This is mainly due to:

  • An increase in the amount of data generation via various sources
  • The growth in hardware resources required for running Deep Learning models

Caffe, Chainer, Keras, Microsoft Cognitive Toolkit, Pytorch, and TensorFlow are some of the most popular Deep Learning frameworks as of today.

 Please explain Gradient Descent.

Answer: The degree of change in the output of a function relating to the changes made to the inputs is known as a gradient. It measures the change in all weights with respect to the change in error. A gradient can also be comprehended as the slope of a function.

Gradient Descent can be thought of as descending to the bottom of a valley, as opposed to climbing up a hill. It is a minimization algorithm meant for minimizing a given cost function.
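
A minimal sketch of the idea for a single-variable function f(w) = w², whose gradient is 2w (the starting point and learning rate are purely illustrative):

    # Minimize f(w) = w**2 by repeatedly stepping against the gradient
    w = 5.0                # arbitrary starting point
    learning_rate = 0.1

    for step in range(50):
        gradient = 2 * w                   # slope of f at the current point
        w = w - learning_rate * gradient   # move a small step downhill

    print(w)   # very close to 0, the minimum of f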

How does Backpropagation work? Also, state its various variants.

Answer: Backpropagation refers to a training algorithm used for multilayer neural networks. Following the backpropagation algorithm, the error is moved from an end of the network to all weights inside the network. Doing so allows for efficient computation of the gradient.

Backpropagation works in the following way:

  • Forward propagation of training data
  • The output and target are used for computing the error and its derivatives
  • Backpropagation computes the derivative of the error with respect to the output activation
  • Previously calculated derivatives are used to compute derivatives for the earlier (hidden) layers
  • The weights are updated

Following are the various variants of Backpropagation:

  • Batch Gradient Descent – The gradient is calculated on the complete dataset and an update is performed on each iteration
  • Mini-batch Gradient Descent – Mini-batch samples are used for calculating the gradient and updating the parameters (a variant of the Stochastic Gradient Descent approach)
  • Stochastic Gradient Descent – Only a single training example is used to calculate the gradient and update the parameters

What do you know about Autoencoders?

Answer: Autoencoders are simplistic learning networks used for transforming inputs into outputs with the minimum possible error. This means that the resulting outputs are very close to the inputs.

A couple of layers are added between the input and the output with the size of each layer smaller than the size pertaining to the input layer. An autoencoder receives unlabeled input that is encoded for reconstructing the output.

Please explain the concept of a Boltzmann Machine.

Answer: A Boltzmann Machine features a simple learning algorithm that enables it to discover fascinating features representing complex regularities present in the training data. It is basically used for optimizing the weights and the quantity for a given problem.

The simple learning algorithm involved in a Boltzmann Machine is very slow in networks that have many layers of feature detectors.

What are the skills required as a Data Scientist that could help in using Python for data analysis purposes?

Answer: The skills required of a Data Scientist that help in using Python for data analysis purposes are listed below:

  1. Expertise in Pandas DataFrames, Scikit-learn, and N-dimensional NumPy arrays.
  2. Skill in applying element-wise vector and matrix operations on NumPy arrays (see the short sketch after this list).
  3. Ability to understand built-in data types, including tuples, sets, dictionaries, and various others.
  4. Familiarity with the Anaconda distribution and the Conda package manager.
  5. Capability of writing efficient list comprehensions and small, clean functions, and of avoiding traditional for loops.
  6. Knowledge of Python scripting and of optimizing bottlenecks.
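
As a small illustration of points 2 and 5 (element-wise NumPy operations instead of explicit loops; the arrays are made up):

    import numpy as np

    prices = np.array([10.0, 20.0, 30.0])
    quantities = np.array([3, 1, 2])

    # Element-wise (vectorized) operation, no explicit for loop needed
    revenue = prices * quantities
    total = revenue.sum()

    # Equivalent list comprehension on plain Python lists
    revenue_list = [p * q for p, q in zip(prices.tolist(), quantities.tolist())]

    print(revenue, total, revenue_list)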

What is the full form of GAN? Explain GAN?

Answer: The full form of GAN is Generative Adversarial Network. Its task is to take inputs from a noise vector and send them forward to the Generator, and then to the Discriminator, to identify and differentiate real and fake inputs.

What are the vital components of GAN?

Answer: There are two vital components of GAN. These include the following:

  1. Generator: The Generator acts as a forger, which creates fake copies.
  2. Discriminator: The Discriminator acts as a recognizer of fake and unique (real) copies.

What is the Computational Graph?

Answer: A computational graph is the graphical representation that TensorFlow is based on. It has a wide network of different kinds of nodes, wherein each node represents a particular mathematical operation. The edges between these nodes are called tensors. This is the reason the computational graph is called a TensorFlow of inputs. The computational graph is characterized by data flowing in the form of a graph; therefore, it is also called a DataFlow Graph.

What are tensors?

Answer: Tensors are the mathematical objects that represent collections of higher-dimensional data inputs (values such as letters and numerals, together with a rank) fed as inputs to the neural network.

Why are Tensorflow considered a high priority in learning Data Science?

Answer: TensorFlow is considered a high priority in learning Data Science because it provides support for languages such as C++ and Python. This way, it enables various data science processes to achieve faster compilation and completion within the stipulated time frame, faster than the conventional Keras and Torch libraries. TensorFlow supports computing devices including the CPU and GPU for faster input, editing, and analysis of the data.

What is Dropout in Data Science?

Answer: Dropout is a tool in Data Science that is used for dropping out hidden and visible units of a network on a random basis. It prevents overfitting of the data by dropping as much as 20% of the nodes so that the required space can be arranged for the iterations needed to converge the network.

What is Batch normalization in Data Science?

Answer: Batch Normalization in Data Science is a technique through which attempts can be made to improve the performance and stability of the neural network. This is done by normalizing the inputs in each layer so that the mean output activation remains 0, with the standard deviation at 1.

What is the difference between Batch and Stochastic Gradient Descent?

Answer: The difference between Batch and Stochastic Gradient Descent can be displayed as follows:

dd14

What are Auto-Encoders?

Answer: Auto-Encoders are learning networks that are meant to change inputs into outputs with the lowest possible chance of error. They intend to keep the output close to the input. The process requires building layers between the input and output; efforts are made to keep the size of these layers small for faster processing.

What are the various Machine Learning Libraries and their benefits?

Answer: The various machine learning libraries and their benefits are as follows.

  1. NumPy: It is used for scientific computation.
  2. Statsmodels: It is used for time-series analysis.
  3. Pandas: It is used for tabular data analysis.
  4. Scikit-learn: It is used for data modeling and pre-processing.
  5. TensorFlow: It is used for deep learning.
  6. Regular Expressions: They are used for text processing.
  7. PyTorch: It is used for deep learning.
  8. NLTK: It is used for text processing.

What is an Activation function?

Answer: An activation function helps introduce non-linearity into the neural network. This is done to help it learn complex functions. Without an activation function, the neural network would only be able to perform linear functions and linear combinations. The activation function therefore allows complex functions and combinations by applying artificial neurons, which helps in delivering output based on the inputs.

What are vanishing gradients?

Answer: Vanishing gradients is a condition in which the slope becomes too small during the training process of an RNN. The result of vanishing gradients is poor performance, low accuracy, and long training times.

What are exploding gradients?

Answer: Exploding gradients is a condition in which the errors grow at an exponential or otherwise high rate during the training of an RNN. The error gradients accumulate and result in very large updates to the neural network, causing an overflow and resulting in NaN values.

What is the full form of LSTM? What is its function?

Answer: LSTM stands for Long Short-Term Memory. It is a recurrent neural network that is capable of learning long-term dependencies and recalling information for longer periods as part of its default behavior.

What are the different steps in LSTM?

Answer: The different steps in LSTM include the following.

  • Step 1: The network decides which things need to be remembered and which need to be forgotten.
  • Step 2: Cell state values that can be updated are selected.
  • Step 3: The network decides what can be made part of the current output.

What is Pooling on CNN?

Answer: Pooling is a method used to reduce the spatial dimensions of a CNN. It performs downsampling operations to reduce dimensionality and create pooled feature maps. Pooling in a CNN involves sliding a filter matrix over the input matrix.

What is RNN?

Answer: RNN stands for Recurrent Neural Network. RNNs are artificial neural networks designed for sequences of data, such as stock market prices, time series, and various others. To understand how RNNs work, you first need to understand the basics of feedforward nets.

What are the different layers on CNN?

Answer: There are four different layers on CNN. These include the following.

  1. Convolutional Layer: In this layer, several small picture windows are created to go over the data.
  2. ReLU Layer: This layer helps in bringing non-linearity to the network and converts the negative pixels to zero so that the output becomes a rectified feature map.
  3. Pooling Layer: This layer reduces the dimensionality of the feature map.
  4. Fully Connected Layer: This layer recognizes and classifies the objects in the image.

What is an Epoch in Data Science?

Answer: An epoch in Data Science represents one iteration over the entire dataset. It includes everything that is applied to the learning model in one full pass over the data.

What is a Batch in Data Science?

Answer: A batch refers to one of the smaller data sets into which the full data set is divided so it can be passed into the system. Batches are used when the developer cannot pass the entire data set into the neural network at once.

What is the iteration in Data Science? Give an example?

Answer: Iterations in Data Science are the passes performed within an epoch when the data is analyzed batch by batch. For example, if there are 50,000 images and the batch size is 100, then an epoch will run 500 iterations.

What is the cost function?

Answer: A cost function is a tool to evaluate how well the model has performed. It takes into consideration the errors and losses in the output layer during the backpropagation process. In such a case, the errors are moved backward through the neural network, and various other training adjustments are applied.

What are hyperparameters?

Answer: Hyperparameter is a kind of parameter whose value is set before the learning process so that the network training requirements can be identified and the structure of the network can be improved. This process includes recognizing the hidden units, learning rate, epochs, and various others associated.

Which skills are important to become a certified Data Scientist?

Answer: The important skills to become a certified Data Scientist include the following:

  1. Knowledge of built-in data types, including lists, tuples, sets, and related types.
  2. Expertise in N-dimensional NumPy arrays.
  3. Ability to apply Pandas DataFrames.
  4. A strong hold over element-wise vector operations.
  5. Knowledge of matrix operations on NumPy arrays.

What is an Artificial Neural Network in Data Science?

Answer: Artificial Neural Network in Data Science is the specific set of algorithms that are inspired by the biological neural network meant to adapt the changes in the input so that the best output can be achieved. It helps in generating the best possible results without the need to redesign the output methods.

What is Deep Learning in Data Science?

Answer: Deep Learning in Data Science is the name given to a branch of machine learning that displays a great degree of analogy with the functioning of the human brain; in this way, it is a paradigm of machine learning.

What is Ensemble learning?

Answer: Ensemble learning is a process of combining the diverse set of learners that is the individual models with each other. It helps in improving the stability and predictive power of the model.

What are the different kinds of Ensemble learning?

Answer: The different kinds of Ensemble learning include the following.

  1. Bagging: It implements simple learners on small sample populations and takes the mean of the predictions for estimation purposes.
  2. Boosting: It adjusts the weight of observations and thereby classifies the population into different sets before the outcome prediction is made.

What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?

There are many steps that can be taken when data wrangling and data cleaning. Some of the most common steps are listed below (a short pandas sketch follows the list):

  • Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe().
  • Data visualizations: Sometimes, it’s useful to visualize your data with histograms, boxplots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.
  • Syntax error: This includes making sure there’s no white space, making sure letter casing is consistent, and checking for typos. You can check for typos by using .unique() or by using bar graphs.
  • Standardization or normalization: Depending on the dataset you're working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that the different scales of different variables don't negatively impact the performance of your model.
  • Handling null values: There are a number of ways to handle null values including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (eg. unknown), predicting the values, or using machine learning models that can deal with null values.
  • Other things include: removing irrelevant data, removing duplicates, and type conversion.
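
A minimal pandas sketch touching several of these steps (the tiny DataFrame is invented for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Pune ", "pune", "Delhi", None],
        "sales": [100, 100, None, 250],
    })

    print(df.shape)          # data profiling: rows x columns
    print(df.describe())     # summary of the numerical variables

    df["city"] = df["city"].str.strip().str.title()          # fix whitespace and letter casing
    df = df.drop_duplicates()                                 # remove duplicate rows
    df["sales"] = df["sales"].fillna(df["sales"].median())    # handle null values
    df = df.dropna(subset=["city"])                           # drop rows with a missing city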

How to deal with unbalanced binary classification?

There are a number of ways to handle unbalanced binary classification (assuming that you want to identify the minority class); a short sketch follows this list:

  • First, you want to reconsider the metrics that you'd use to evaluate your model. The accuracy of your model might not be the best metric to look at, and I'll use an example to explain why. Let's say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as "not fraudulent", it would have an accuracy of 99%! Therefore, you may want to consider using metrics like precision and recall.
  • Another method to improve unbalanced binary classification is by increasing the cost of misclassifying the minority class. By increasing the penalty of such, the model should classify the minority class more accurately.
  • Lastly, you can improve the balance of classes by oversampling the minority class or by undersampling the majority class. You can read more about it here.
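
One way to apply the second point in code is scikit-learn's class_weight option (a sketch on synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Roughly 95% majority class vs 5% minority class
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight="balanced" raises the cost of misclassifying the minority class
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)

    print(classification_report(y_test, clf.predict(X_test)))   # inspect precision/recall, not just accuracy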

What is the difference between a box plot and a histogram?

Boxplot vs Histogram

While boxplots and histograms are visualizations used to show the distribution of the data, they communicate information differently.

Histograms are bar charts that show the frequency of a numerical variable’s values and are used to approximate the probability distribution of the given variable. It allows you to quickly understand the shape of the distribution, the variation, and potential outliers.

Boxplots communicate different aspects of the distribution of data. While you can’t see the shape of the distribution through a box plot, you can gather other information like the quartiles, the range, and outliers. Boxplots are especially useful when you want to compare multiple charts at the same time because they take up less space than histograms.

Describe different regularization methods, such as L1 and L2 regularization.

dd15

Both L1 and L2 regularization are methods used to reduce the overfitting of training data. Least Squares minimizes the sum of the squared residuals, which can result in low bias but high variance.

L2 Regularization, also called ridge regression, minimizes the sum of the squared residuals plus lambda times the slope squared. This additional term is called the Ridge Regression Penalty. It increases the bias of the model, making the fit worse on the training data, but it also decreases the variance.

If you take the ridge regression penalty and replace it with the absolute value of the slope, you get Lasso regression, or L1 regularization.

L2 is less robust but has a stable solution and always one solution. L1 is more robust but has an unstable solution and can possibly have multiple solutions.
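
Written out (with λ as the regularization strength, βⱼ as the model coefficients, and the sums running over observations i and coefficients j), the two penalized objectives are roughly:

    Ridge (L2): minimize Σ(yᵢ − ŷᵢ)² + λ Σ βⱼ²
    Lasso (L1): minimize Σ(yᵢ − ŷᵢ)² + λ Σ |βⱼ|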

StatQuest has an amazing video on Lasso and Ridge regression here.

Neural Network Fundamentals

A neural network is a multi-layered model inspired by the human brain. Like the neurons in our brain, the circles in a network diagram represent nodes. In a typical diagram, blue circles represent the input layer, black circles represent the hidden layers, and green circles represent the output layer. Each node in the hidden layers represents a function that the inputs go through, ultimately leading to an output in the output layer. These functions are called activation functions; the sigmoid is a common example.

If you want a step by step example of creating a neural network, check out Victor Zhou’s article here.

If you’re a visual/audio learner, BlueBrown has an amazing series on neural networks and deep learning on YouTube here.

How to define/select metrics?

There isn’t a one-size-fits-all metric. The metric(s) chosen to evaluate a machine learning model depends on various factors:

  • Is it a regression or classification task?
  • What is the business objective? Eg. precision vs recall
  • What is the distribution of the target variable?

There are a number of metrics that can be used, including adjusted R-squared, MAE, MSE, accuracy, recall, precision, the F1 score, and the list goes on.

Explain what precision and recall are

Recall attempts to answer “What proportion of actual positives was identified correctly?”

dd16

Precision attempts to answer “What proportion of positive identifications was actually correct?”

 

dd17
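
In terms of true positives (TP), false positives (FP), and false negatives (FN), the standard formulas behind the two figures above are:

    Recall = TP / (TP + FN)
    Precision = TP / (TP + FP)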

Explain what a false positive and a false negative are. Why is it important to distinguish these from each other? Provide examples of when false positives are more important than false negatives, when false negatives are more important than false positives, and when these two types of errors are equally important.

A false positive is an incorrect identification of the presence of a condition when it's absent.

A false negative is an incorrect identification of the absence of a condition when it's actually present.

An example of when false negatives are more important than false positives is when screening for cancer. It’s much worse to say that someone doesn’t have cancer when they do, instead of saying that someone does and later realizing that they don’t.

This is a subjective argument, but false positives can be worse than false negatives from a psychological point of view. For example, a false positive for winning the lottery could be a worse outcome than a false negative because people normally don’t expect to win the lottery anyways.

Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model

There are two main ways that you can do this:

  1. A) Adjusted R-squared.

R Squared is a measurement that tells you to what extent the proportion of variance in the dependent variable is explained by the variance in the independent variables. In simpler terms, while the coefficients estimate trends, R-squared represents the scatter around the line of best fit.

However, every additional independent variable added to a model always increases the R-squared value — therefore, a model with several independent variables may seem to be a better fit even if it isn’t. This is where adjusted R² comes in. The adjusted R² compensates for each additional independent variable and only increases if each given variable improves the model above what is possible by probability. This is important since we are creating a multiple regression model.

  1. B) Cross-Validation

A method common to most people is cross-validation, splitting the data into two sets: training and testing data. See the answer to the first question for more on this.

What does NLP stand for?

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.

When would you use random forests Vs SVM and why?

There are a couple of reasons why a random forest is a better choice of model than a support vector machine:

  • Random forests allow you to determine the feature importance. SVM’s can’t do this.
  • Random forests are much quicker and simpler to build than an SVM.
  • For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.

Why is dimension reduction important?

Dimensionality reduction is the process of reducing the number of features in a dataset. This is important mainly in the case when you want to reduce variance in your model (overfitting).

Wikipedia states four advantages of dimensionality reduction (see here):

  1. It reduces the time and storage space required
  2. Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
  3. It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
  4. It avoids the curse of dimensionality

What is principal component analysis? Explain the sort of problems you would use PCA for.

In its simplest sense, PCA involves projecting higher-dimensional data (e.g. 3 dimensions) onto a smaller space (e.g. 2 dimensions). This results in a lower dimension of data (2 dimensions instead of 3) while keeping all the original variables in the model.

PCA is commonly used for compression purposes, to reduce required memory and to speed up the algorithm, as well as for visualization purposes, making it easier to summarize data.
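
A minimal scikit-learn sketch (the 4-feature iris data and the choice of 2 components are just for illustration):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)       # 150 samples x 4 features

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)        # project onto the top 2 principal components

    print(X_reduced.shape)                  # (150, 2)
    print(pca.explained_variance_ratio_)    # share of variance kept by each component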

Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?

One major drawback of Naive Bayes is that it holds a strong assumption in that the features are assumed to be uncorrelated with one another, which typically is never the case.

One way to improve such an algorithm that uses Naive Bayes is by decorrelating the features so that the assumption holds true.

What are the drawbacks of a linear model?

There are a couple of drawbacks of a linear model:

  • A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity
  • A linear model can’t be used for discrete or binary outcomes.
  • You can’t vary the model flexibility of a linear model.

Do you think small decision trees are better than a large one? Why?

Another way of asking this question is “Is a random forest a better model than a decision tree?” And the answer is yes because a random forest is an ensemble method that takes many weak decision trees to make a strong learner. Random forests are more accurate, more robust, and less prone to overfitting.

Why is mean square error a bad measure of model performance? What would you suggest instead?

Mean Squared Error (MSE) gives a relatively high weight to large errors — therefore, MSE tends to put too much emphasis on large deviations. A more robust alternative is MAE (mean absolute error).

What are the assumptions required for linear regression? What if some of these assumptions are violated?

The assumptions are as follows:

  1. The sample data used to fit the model is representative of the population
  2. The relationship between X and the mean of Y is linear
  3. The variance of the residual is the same for any value of X (homoscedasticity)
  4. Observations are independent of each other
  5. For any value of X, Y is normally distributed.

Extreme violations of these assumptions will make the results redundant. Small violations of these assumptions will result in a greater bias or variance of the estimate.

What is collinearity and what to do with it? How to remove multicollinearity?

Multicollinearity exists when an independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable.

You could use the Variance Inflation Factor (VIF) to determine whether there is any multicollinearity between independent variables — a standard benchmark is that if the VIF is greater than 10 then multicollinearity exists.

How to check if the regression model fits the data well?

There are a couple of metrics that you can use:

R-squared/Adjusted R-squared: Relative measure of fit. This was explained in a previous answer

F-test (F-statistic): Evaluates the null hypothesis that all regression coefficients are equal to zero against the alternative hypothesis that at least one doesn't equal zero

RMSE: Absolute measure of fit.

What is a decision tree?

Decision trees are a popular model used in operations research, strategic planning, and machine learning. Each square in a decision-tree diagram is called a node, and the more nodes you have, the more accurate your decision tree will be (generally). The last nodes of the decision tree, where a decision is made, are called the leaves of the tree. Decision trees are intuitive and easy to build but fall short when it comes to accuracy.

What is a random forest? Why is it good?

Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree. By relying on a “majority wins” model, it reduces the risk of error from an individual tree.

For example, a single decision tree may make a wrong prediction for a given observation, but if we rely on the mode of all the decision trees, the ensemble prediction is much more likely to be correct. This is the power of random forests.

Random forests offer several other benefits, including strong performance, the ability to model non-linear boundaries, no need for cross-validation, and feature importance estimates.

What is a kernel? Explain the kernel trick

A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high-dimensional) feature space, which is why kernel functions are sometimes called a “generalized dot product”.

The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.

 

Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?

When the number of features is greater than the number of observations, then performing dimensionality reduction will generally improve the SVM.

What is overfitting?

 

Overfitting is an error where the model ‘fits’ the data too well, resulting in a model with high variance and low bias. As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.

What is boosting?

Boosting is an ensemble method to improve a model by reducing its bias and variance, ultimately converting weak learners to strong learners. The general idea is to train a weak learner and sequentially iterate and improve the model by learning from the previous learner.

The probability that an item is at location A is 0.6, and 0.8 at location B. What is the probability that the item would be found on the Amazon website?

We need to make some assumptions about this question before we can answer it. Let's assume that there are two possible places to purchase a particular item on Amazon, and the probability of finding it at location A is 0.6 and at location B is 0.8. The probability of finding the item on Amazon can be explained as follows:

We can reword the above as P(A) = 0.6 and P(B) = 0.8. Furthermore, let's assume that these are independent events, meaning that the probability of one event is not impacted by the other. We can then use the formula…

P(A or B) = P(A) + P(B) − P(A and B)
P(A or B) = 0.6 + 0.8 − (0.6 × 0.8)
P(A or B) = 0.92

You randomly draw a coin from 100 coins — 1 unfair coin (head-head), 99 fair coins (head-tail) — and flip it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?

This can be answered using Bayes' Theorem. The extended equation for Bayes' Theorem is the following: P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B∣¬A)P(¬A)].

Assume that the probability of picking the unfair coin is denoted as P(A) and the probability of flipping 10 heads in a row is denoted as P(B). Then P(B|A) is equal to 1, P(B∣¬A) is equal to 0.5¹⁰, P(A) is equal to 0.01, and P(¬A) is equal to 0.99.

If you fill in the equation, then P(A|B) ≈ 0.9118, or about 91.2%.

Difference between convex and non-convex cost function; what does it mean when a cost function is non-convex?

A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum.

A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as “wavy”.

When a cost function is non-convex, it means that there’s a likelihood that the function may find local minima instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Walk through the probability fundamentals

Eight rules of probability

  • Rule #1: For any event A, 0 ≤ P(A) ≤ 1; in other words, the probability of an event ranges from 0 to 1.
  • Rule #2: The sum of the probabilities of all possible outcomes always equals 1.
  • Rule #3: P(not A) = 1 − P(A). This rule explains the relationship between the probability of an event and its complement event; a complement event is one that includes all possible outcomes that aren’t in A.
  • Rule #4: If A and B are disjoint events (mutually exclusive), then P(A or B) = P(A) + P(B); this is called the addition rule for disjoint events.
  • Rule #5: P(A or B) = P(A) + P(B) − P(A and B); this is called the general addition rule.
  • Rule #6: If A and B are two independent events, then P(A and B) = P(A) × P(B); this is called the multiplication rule for independent events.
  • Rule #7: The conditional probability of event B given event A is P(B|A) = P(A and B) / P(A).
  • Rule #8: For any two events A and B, P(A and B) = P(A) × P(B|A); this is called the general multiplication rule.

Counting Methods

Factorial Formula: n! = n × (n − 1) × (n − 2) × … × 2 × 1
Use when the number of items is equal to the number of places available.
Eg. Find the total number of ways 5 people can sit in 5 empty seats.
= 5 × 4 × 3 × 2 × 1 = 120

Fundamental Counting Principle (multiplication)
This method should be used when repetitions are allowed and the number of ways to fill an open place is not affected by previous fills.
Eg. If a menu offers several types of breakfasts, lunches, and desserts, the total number of meal combinations is the product of the three counts (breakfasts × lunches × desserts).

Permutations: P(n,r)= n! / (n−r)!
This method is used when replacements are not allowed and order of item ranking matters.
Eg. A code has 4 digits in a particular order, and the digits range from 0 to 9. How many permutations are there if each digit can only be used once?
P(10, 4) = 10!/(10 − 4)! = (10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1)/(6 × 5 × 4 × 3 × 2 × 1) = 5,040

Combinations Formula: C(n,r)=(n!)/[(n−r)!r!]
This is used when replacements are not allowed and the order in which items are ranked does not matter.
Eg. To win the lottery, you must select the r correct numbers, in any order, from the numbers 1 to n. The number of possible combinations is
C(n, r) = n! / [(n − r)! r!]
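Base R covers all three counting methods directly; the seat, code, and lottery sizes below are example values chosen for illustration:

factorial(5)                         # 5! = 120 ways to seat 5 people in 5 seats
perm <- function(n, r) factorial(n) / factorial(n - r)
perm(10, 4)                          # ordered 4-digit codes with no repeated digit: 5040
choose(49, 6)                        # unordered 6-number lottery picks from 49: 13983816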

Describe Markov chains?

dd18

Brilliant provides a great definition of Markov chains (here):

“A Markov chain is a mathematical system that experiences transitions from one state to another according to certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state is dependent solely on the current state and time elapsed.”

The actual math behind Markov chains requires knowledge of linear algebra and matrices, so it is worth exploring those topics on your own if you want to go deeper.

A box has a certain number of red cards and the same number of black cards. Another, larger box also has equal numbers of red and black cards, but more of each. You want to draw two cards at random from one of the two boxes, one card at a time. Which box has a higher probability of yielding two cards of the same color, and why?

The larger box (the one with more cards of each color) has a higher probability of giving two cards of the same color. Let’s walk through why.

Say the first card you draw from either box is red.

In the smaller box, removing that red card shifts the remaining composition more noticeably toward black, so the conditional probability of drawing a second red falls further below one half.

In the larger box, removing one red card barely changes the composition, so the conditional probability of drawing a second red stays much closer to one half.

Since the conditional probability of matching the first card’s color is higher in the larger box, the box with more cards has the higher probability of producing two cards of the same color.

You are at a casino and have two dice to play with. You win a fixed prize every time you roll a total of five. If you play until you win and then stop, what is the expected payout?

  • Let’s assume that it costs a fixed stake every time you want to play.
  • There are 36 possible combinations with two dice.
  • Of those 36 combinations, 4 result in a total of five: (1,4), (2,3), (3,2), (4,1). This means there is a 4/36, or 1/9, chance of rolling a five on any play.
  • A 1/9 chance of winning means that, on average, you expect to play nine times before you win (losing eight times and winning once).
  • Therefore, your expected payout is the prize from the single win minus the stakes paid over the expected nine plays, which is negative unless the prize is at least nine times the stake.

How can you tell if a given coin is biased?

This isn’t a trick question. The answer is simply to perform a hypothesis test (a minimal R sketch follows the steps below):

  1. The null hypothesis is that the coin is not biased and the probability of flipping heads equals 50% (p = 0.5). The alternative hypothesis is that the coin is biased and p ≠ 0.5.
  2. Flip the coin n times, where n is reasonably large.
  3. Calculate the z-score (if the sample is small, say under 30 flips, you would calculate the t-statistic instead).
  4. Compare against alpha (for a two-tailed test at the 5% level, each tail gets 0.05/2 = 0.025).
  5. If p-value > alpha, the null is not rejected and the coin is not biased.
    If p-value < alpha, the null is rejected and the coin is biased.
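A minimal R sketch of this test; the flip count, the simulated bias, and the 5% alpha are all assumed for illustration:

set.seed(7)
n     <- 500
heads <- rbinom(1, size = n, prob = 0.55)    # pretend the coin is slightly biased

binom.test(heads, n, p = 0.5, alternative = "two.sided")$p.value   # exact test
prop.test(heads, n, p = 0.5)                                       # z-style approximation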

Make an unfair coin fair

Since a coin flip is a binary outcome, you can make an unfair coin fair by flipping it twice. If you flip it twice, there are two outcomes that you can bet on: heads followed by tails or tails followed by heads.

P(heads) * P(tails) = P(tails) * P(heads)

This makes sense since each coin toss is an independent event. This means that if you get heads → heads or tails → tails, you would need to reflip the coin.
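A short simulation of the trick in R, with an assumed P(heads) of 0.7 for the unfair coin:

set.seed(1)
fair_flip <- function(p = 0.7) {
  repeat {
    pair <- rbinom(2, 1, p)                   # two independent biased flips
    if (pair[1] != pair[2]) return(pair[1])   # HT counts as heads, TH as tails
    # HH or TT: discard the pair and flip again
  }
}
mean(replicate(10000, fair_flip()))           # close to 0.5 despite the biased coin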

You are about to get on a plane to London and want to know whether you have to bring an umbrella. You call three of your friends, independently, and ask each one if it’s raining. Each friend tells you the truth with some probability p and plays a prank on you by lying with probability 1 − p. If all three tell you that it is raining, what is the probability that it is actually raining in London?

You can tell that this question is related to Bayesian reasoning because of the last statement, which essentially follows the structure, “What is the probability A is true given B is true?” Therefore we also need the prior probability of it raining in London on a given day; call it P(A).

P(A) = prior probability of rain
P(B) = probability that all three friends say it’s raining
P(A|B) = probability that it’s raining given that all three say it is
P(B|A) = probability that all three say it’s raining given that it is raining = p³ (each tells the truth independently)

Step 1: Solve for P(B)
P(A|B) = P(B|A) × P(A) / P(B) can be rewritten, using the law of total probability, as
P(B) = P(B|A) × P(A) + P(B|not A) × P(not A) = p³ × P(A) + (1 − p)³ × (1 − P(A))

Step 2: Solve for P(A|B)
P(A|B) = p³ × P(A) / [p³ × P(A) + (1 − p)³ × (1 − P(A))]

Therefore, if all three friends say that it’s raining, the posterior probability that it is actually raining is given by the expression above; whenever friends are more likely to tell the truth than to lie (p > 1/2), this posterior is substantially higher than the prior alone.
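A hedged R sketch of the arithmetic; the truth-telling probability and the prior below are assumed values, since the original figures are not preserved in the question:

p_truth <- 2/3    # assumed probability a friend tells the truth
prior   <- 0.25   # assumed prior probability of rain in London

likelihood_rain    <- p_truth^3          # all three say "raining" and it is raining
likelihood_no_rain <- (1 - p_truth)^3    # all three lie and it is not raining

posterior <- likelihood_rain * prior /
  (likelihood_rain * prior + likelihood_no_rain * (1 - prior))
posterior   # about 0.73 (8/11) under these assumptions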

You are given cards with four different colors- Green cards, Red Cards, Blue cards, and Yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find out the probability that the cards picked are not of the same number and same color.

Since these events are not independent, we can use the rule:
P(A and B) = P(A) × P(B|A), which also means that
P(not A and not B) = P(not A) × P(not B | not A)

For example, suppose the first card drawn is a yellow card with some number n. Then:

P(second card is not an n and not yellow) = P(not an n) × P(not yellow | not an n)

Multiplying out those two fractions for the remaining deck gives the answer.

Therefore, the probability that the two cards picked are not of the same number and not of the same color is the product above.

How do you assess the statistical significance of an insight?

You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and the alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. Last, you would set the significance level (alpha); if the p-value is less than alpha, you reject the null, in other words, the result is statistically significant.

Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

Example of a long tail distribution

A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drops off gradually and asymptotically.

Practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best-selling products vs the rest).

It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with some machine learning techniques with the assumption that the data is normally distributed.

What is the Central Limit Theorem? Explain it. Why is it important?

 

Statistics How To provides the best definition of CLT, which is:

“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution.”

The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.
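A quick R simulation makes this concrete: the population below is exponential (heavily skewed), yet the distribution of sample means looks increasingly normal as the sample size grows.

set.seed(123)
sample_means <- function(n, reps = 5000) replicate(reps, mean(rexp(n, rate = 1)))

par(mfrow = c(1, 2))
hist(sample_means(5),  main = "Means of samples of 5",  xlab = "sample mean")
hist(sample_means(50), main = "Means of samples of 50", xlab = "sample mean")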

What is the statistical power?

‘Statistical power’ refers to the power of a binary hypothesis test: the probability that the test rejects the null hypothesis given that the alternative hypothesis is true.

Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

  • sampling bias: a biased sample caused by non-random sampling
  • time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
  • exposure: includes clinical susceptibility bias, protopathic bias, indication bias. 
  • data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
  • attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included
  • observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming the data is not as spread out as it might actually be.

Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?

Observational data comes from observational studies which are when you observe certain variables and try to determine if there is any correlation.

Experimental data comes from experimental studies which are when you control certain variables and hold them constant to determine if there is any causality.

An example of experimental design is the following: split a group into two. The control group lives their lives normally, while the test group is told to drink a glass of wine every night for the duration of the study. Research can then be conducted to see how wine affects sleep.

Is mean imputation of missing data acceptable practice? Why or why not?

Mean imputation is the practice of replacing null values in a data set with the mean of the data.

Mean imputation is generally bad practice because it doesn’t take feature correlation into account. For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we impute the average fitness score across all ages, the eighty-year-old will appear to have a much higher fitness score than he actually should.

Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.
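A small R illustration of the second point, using simulated numbers: after mean-imputing a fifth of the values, the variance of the column visibly shrinks.

set.seed(42)
x <- rnorm(1000, mean = 50, sd = 10)
x_missing <- x
x_missing[sample(1000, 200)] <- NA          # knock out 20% of the values

x_imputed <- ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

var(x)                        # variance of the complete data
var(x_missing, na.rm = TRUE)  # variance of the observed values
var(x_imputed)                # noticeably smaller after mean imputation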

What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.

An outlier is a data point that differs significantly from other observations.

Depending on the cause of the outlier, they can be bad from a machine learning perspective because they can worsen the accuracy of a model. If the outlier is caused by a measurement error, it’s important to remove them from the dataset. There are a couple of ways to identify outliers:

Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed data set lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that fall outside of this range. Likewise, we can calculate the z-score of a given point, and if it is beyond +/- 3, it is an outlier.
Note: there are a few contingencies to consider when using this method; the data must be normally distributed, it is not applicable for small data sets, and the presence of too many outliers can throw off the z-scores themselves.

dd19

Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile (Q3) and the 1st quartile (Q1). A point is then flagged as an outlier if it is less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR. For normally distributed data this works out to approximately 2.7 standard deviations.

dd20

Other methods include DBScan clustering, Isolation Forests, and Robust Random Cut Forests.
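A minimal R sketch of the two screens above, run on simulated data with one obvious outlier appended:

set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 5), 120)   # 100 normal points plus one outlier

# z-score rule: flag points more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]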

An inlier is a data observation that lies within the general range of the rest of the dataset yet is still erroneous or unusual. Because it sits inside the bulk of the data, an inlier is typically harder to spot than an outlier and often requires external data to identify. Should you identify any inliers, you can simply remove them from the dataset.

How do you handle missing data? What imputation techniques do you recommend?

There are several ways to handle missing data:

  • Delete rows with missing data
  • Mean/Median/Mode imputation
  • Assigning a unique value
  • Predicting the missing values
  • Using an algorithm which supports missing values, like random forests

Deleting rows with missing data is the simplest option and avoids injecting artificial values into the dataset, but it is only recommended when there is plenty of data to start with and the percentage of missing values is low; otherwise it throws away information and can itself introduce bias.

You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?

First I would conduct EDA — Exploratory Data Analysis to clean, explore, and understand my data.  As part of my EDA, I could compose a histogram of the duration of calls to see the underlying distribution.

My guess is that the duration of calls would follow a lognormal distribution (see below). The reason I believe it’s positively skewed is that the lower end is bounded at 0, since a call can’t last a negative number of seconds, while on the upper end there is likely to be a small proportion of calls that are extremely long.

Lognormal Distribution Example

Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?

Administrative datasets are typically datasets used by governments or other organizations for non-statistical reasons.

Administrative datasets are usually larger and more cost-efficient than experimental studies. They are also regularly updated, assuming that the organization associated with the dataset is active and functioning. At the same time, administrative datasets may not capture all of the data that one may want, may not be in the desired format, and are prone to quality issues and missing entries.

You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

There are a number of potential reasons for a spike in photo uploads:

  1. A new feature may have been implemented in October which involves uploading photos and gained a lot of traction by users. For example, a feature that gives the ability to create photo albums.
  2. Similarly, it’s possible that the process of uploading photos before was not intuitive and was improved in the month of October.
  3. There may have been a viral social media movement that involved uploading photos that lasted for all of October. Eg. Movember but something more scalable.
  4. It’s possible that the spike is due to people posting pictures of themselves in costumes for Halloween.

The method of testing depends on the cause of the spike, but you would conduct hypothesis testing to determine if the inferred cause is the actual cause.

Give examples of data that does not have a Gaussian distribution, nor log-normal.

  • Any type of categorical data won’t have a gaussian distribution or lognormal distribution.
  • Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.

What is root cause analysis? How to identify a cause vs. a correlation? Give examples

Root cause analysis: a method of problem-solving used for identifying the root cause(s) of a problem.

Correlation measures the strength of the relationship between two variables, ranging from −1 to 1. Causation is when a first event is shown to have caused a second event. Causation looks at direct relationships, while correlation can reflect both direct and indirect relationships.

Example: a higher crime rate is associated with higher sales in ice cream in Canada, aka they are positively correlated. However, this doesn’t mean that one causes another. Instead, it’s because both occur more when it’s warmer outside.

You can test for causation using hypothesis testing or A/B testing.

Give an example where the median is a better measure than the mean

When there are a number of outliers that positively or negatively skew the data.

Given two fair dice, what is the probability of getting scores that sum to 4? To 8?

There are 3 combinations that sum to 4 (1+3, 2+2, 3+1):
P(rolling a 4) = 3/36 = 1/12

There are 5 combinations that sum to 8 (2+6, 3+5, 4+4, 5+3, 6+2):
P(rolling an 8) = 5/36

What is the Law of Large Numbers?

The Law of Large Numbers is a theory that states that as the number of trials increases, the average of the result will become closer to the expected value.

Eg. the proportion of heads from flipping a fair coin a very large number of times (say 100,000) will tend to be closer to 0.5 than the proportion from just a handful of flips.
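A short R simulation of this (the flip counts are arbitrary):

set.seed(2)
flips <- rbinom(100000, 1, 0.5)                 # 100,000 fair coin flips
running_prop <- cumsum(flips) / seq_along(flips)

running_prop[c(10, 100, 10000, 100000)]         # drifts toward 0.5 as flips accumulate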

How do you calculate the needed sample size?

Formula for margin of error: ME = (t/z) × s / sqrt(n), which rearranges to n = ((t/z) × s / ME)^2

You can use the margin of error (ME) formula to determine the desired sample size (a worked R sketch follows the definitions below).

  • t/z = t/z score used to calculate the confidence interval
  • ME = the desired margin of error
  • S = sample standard deviation
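For example, solving the formula for n in R with assumed inputs (the standard deviation and desired margin of error below are placeholders):

z  <- 1.96   # z-score for a 95% confidence level
s  <- 15     # assumed sample standard deviation
ME <- 2      # assumed desired margin of error

n <- (z * s / ME)^2
ceiling(n)   # round up to the next whole observation (217 here)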

When you sample, what bias are you inflicting?

Potential biases include the following:

  • Sampling bias: a biased sample caused by non-random sampling
  • Under coverage bias: sampling too few observations
  • Survivorship bias: error of overlooking observations that did not make it past a form of selection process.

How do you control for biases?

There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.

What are confounding variables?

A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.

What is A/B testing?

A/B testing is a form of hypothesis testing and two-sample hypothesis testing to compare two versions, the control and variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.

How do you prove that males are, on average, taller than females by knowing just gender and height?

You can use hypothesis testing to prove that males are taller on average than females.

The null hypothesis would state that males and females are the same height on average, while the alternative hypothesis would state that the average height of males is greater than the average height of females.

Then you would collect a random sample of heights of males and females and use a t-test to determine if you reject the null or not.

Infection rates at a hospital above a benchmark rate of infections per person-days at risk are considered high. A hospital recorded a certain number of infections over its most recent person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.

Since we are looking at the number of events (number of infections) occurring within a given amount of exposure (person-days at risk), this is a Poisson distribution question.

The probability of observing k events in an interval: P(X = k) = (lambda^k × e^(−lambda)) / k!

Null (H0): the hospital's true rate equals the benchmark rate
Alternative (H1): the hospital's true rate is below the benchmark rate

k (actual) = the observed number of infections
lambda (theoretical) = benchmark rate × person-days at risk
p = P(X ≤ k), calculated using POISSON.DIST() in Excel or ppois() in R

If the p-value is less than alpha (e.g. a 5% level of significance), we reject the null and conclude that the hospital is below the standard.
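A hedged R sketch with assumed figures: a benchmark of 1 infection per 100 person-days, 10 observed infections, and 1787 person-days at risk (none of these numbers are preserved in the question above):

benchmark_rate <- 1 / 100    # assumed benchmark infections per person-day
person_days    <- 1787       # assumed person-days at risk
observed       <- 10         # assumed observed infections

lambda  <- benchmark_rate * person_days   # expected infections under the null
p_value <- ppois(observed, lambda)        # one-sided: P(X <= observed)
p_value                                   # compare against alpha (e.g. 0.05)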

You flip a biased coin (with some probability p of heads) five times. What's the probability of getting three or more heads?

Use the general binomial probability formula to answer this question:

P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)

n = 5
k = 3, 4, 5

P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads), evaluated with the given p.
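In R, with an assumed p of 0.6 for heads (the actual value is not preserved above):

p <- 0.6   # assumed probability of heads
n <- 5

sum(dbinom(3:5, size = n, prob = p))   # P(3 or more heads), about 0.68 when p = 0.6
1 - pbinom(2, size = n, prob = p)      # equivalent calculation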

A random variable X is normal with a given mean and standard deviation. Calculate P(X > x) for a given threshold x.

Using Excel:
p = 1 − NORM.DIST(x, mean, sd, TRUE)

Consider the number of people that show up at a bus station is Poisson with mean ./h. What is the probability that at most three people show up in a four hour period?

x = 3 (at most three people)
mean = hourly rate × 4 hours

Using Excel:

p = POISSON.DIST(3, mean, TRUE)

An HIV test has a sensitivity of .% and a specificity of .%. A subject from a population of prevalence .% receives a positive test result. What is the precision of the test (i.e the probability he is HIV positive)?

Equation for precision (positive predictive value, PV):
PV = (sensitivity × prevalence) / [sensitivity × prevalence + (1 − specificity) × (1 − prevalence)]

Precision = Positive Predictive Value = PV
Plugging the stated sensitivity, specificity, and prevalence into this formula gives the probability that the subject is actually HIV positive; with a low prevalence, the precision can be far lower than the sensitivity alone suggests.

You are running for office and your pollster polled one hundred people. Sixty of them claimed they will vote for you. Can you relax?

  • Assume that there's only you and one other opponent.
  • Also, assume that we want a 95% confidence interval. This gives us a z-score of 1.96.

Confidence interval formula for a proportion: p-hat ± z* × sqrt(p-hat × (1 − p-hat) / n)

p-hat = 60/100 = 0.6
z* = 1.96
n = 100
This gives a confidence interval of roughly [0.50, 0.70]. Therefore, at 95% confidence, the lower end of the interval sits essentially at 50%: if you are okay with the worst-case scenario of roughly tying, you can relax; otherwise, you cannot relax until the lower bound of the interval is comfortably above 50% (a quick R check follows).
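The same interval in R, assuming the 95% level used above:

p_hat <- 60 / 100
n     <- 100
se    <- sqrt(p_hat * (1 - p_hat) / n)

p_hat + c(-1, 1) * 1.96 * se   # roughly [0.50, 0.70]
prop.test(60, 100)             # built-in version with a continuity correction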

Geiger counter records radioactive decays in minutes. Find an approximate % interval for the number of decays per hour.

  • Since this is a Poisson distribution question, mean = lambda = variance, which also means that the standard deviation equals the square root of the mean.
  • A 95% confidence interval implies a z-score of 1.96.
  • One standard deviation = sqrt(observed count).

Therefore the confidence interval = observed count ± 1.96 × sqrt(observed count), scaled from the recording window up to a full hour.

The homicide rate in Scotland fell last year to from the year before. Is this reported change really noteworthy?

  • Since this is a Poisson distribution question, mean = lambda = variance, which also means that the standard deviation equals the square root of the mean.
  • A 95% confidence interval implies a z-score of 1.96.
  • One standard deviation = sqrt(this year's count).

Therefore the confidence interval = this year's count ± 1.96 × sqrt(this year's count). Since the previous year's count falls within this confidence interval, we can assume the change is not very noteworthy.

Consider influenza epidemics for two-parent heterosexual families. Suppose that the probability is % that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is % while the probability that both the mother and father have contracted the disease is %. What is the probability that the mother has contracted influenza?

Using the general addition rule in probability:
P(mother or father) = P(mother) + P(father) − P(mother and father)
Rearranging:
P(mother) = P(mother or father) + P(mother and father) − P(father)
Substituting the three given percentages into this expression yields the probability that the mother has contracted influenza.

Suppose that diastolic blood pressures (DBPs) for men aged – are normally distributed with a mean of (mm Hg) and a standard deviation of . About what is the probability that a random – year old has a DBP less than ?

Since the threshold is one standard deviation below the mean, take the area of the Gaussian distribution to the left of one standard deviation below the mean.

By the empirical rule, that area is approximately 2.5% + 13.5% = 16%.

In a population of interest, a sample of men yielded a sample average brain volume of ,cc and a standard deviation of cc. What is a % Student’s T confidence interval for the mean brain volume in this new population?

Confidence interval for a sample mean: x-bar ± t × (s / sqrt(n))

Given a confidence level of 95% and degrees of freedom equal to n − 1, look up the corresponding t-score.

Confidence interval = sample mean ± t-score × (sample standard deviation / sqrt(sample size))
Plugging in the sample figures gives the 95% Student's T interval for the mean brain volume.

A diet pill is given to subjects over six weeks. The average difference in weight (follow up — baseline) is – pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the % T confidence interval to touch ?

Upper bound = mean + t-score × (standard deviation / sqrt(sample size))
Set the upper bound equal to 0 and solve for the standard deviation:
0 = mean + t-score × (s / sqrt(n))
s = −mean × sqrt(n) / t-score
Therefore, the standard deviation would have to be at least this large for the upper bound of the 95% T confidence interval to reach 0.

In a study of emergency room waiting times, investigators consider a new and the standard triage systems. To test the systems, administrators selected nights and randomly assigned the new triage system to be used on nights and the standard system on the remaining nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was hours with a variance of . while the average MWT for the old system was hours with a variance of .. Consider the % confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System — Old System).

Confidence Interval = mean difference ± t-score × standard error

mean difference = new mean − old mean (a negative number, since the new system has the shorter waiting time)

t-score: from the t distribution with n1 + n2 − 2 degrees of freedom at the 95% level

standard error (pooled, constant variance assumed):
SE = sqrt( ((n1 − 1) × s1^2 + (n2 − 1) × s2^2) / (n1 + n2 − 2) ) × sqrt(1/n1 + 1/n2)

Both endpoints of the resulting interval are negative, i.e. the new system has a lower mean MWT.

To further test the hospital triage system, administrators selected nights and randomly assigned a new triage system to be used on nights and a standard system on the remaining nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was hours with a standard deviation of . hours while the average MWT for the old system was hours with a standard deviation of hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the % independent group confidence interval with unequal variances suggest vis a vis this hypothesis? (Because there’s so many observations per group, just use the Z quantile instead of the T.)

Assuming we subtract in this order (New System — Old System):

 

Confidence interval formula for two independent samples with unequal variances:
CI = (mean1 − mean2) ± z × sqrt(s1^2/n1 + s2^2/n2)

mean difference = new mean − old mean (negative, since the new system has the shorter MWT)

z-score = 1.96 for a 95% confidence interval

standard error = sqrt(s_new^2/n_new + s_old^2/n_old)
lower bound = mean difference − 1.96 × standard error
upper bound = mean difference + 1.96 × standard error

Because the entire interval lies below zero, the data support the hypothesis of a decrease in the mean MWT associated with the new system.

Second Highest Salary

Write a SQL query to get the second highest salary from the Employee table. For example, given the Employee table below, the query should return the second highest salary. If there is no second highest salary, then the query should return null.

+—-+——–+
| Id | Salary |
+—-+——–+
| | |
| | |
| | |
+—-+——–+

SOLUTION A: Using IFNULL, OFFSET

  • IFNULL(expression, alt): IFNULL() returns the alternative value if the expression is null; otherwise it returns the expression. We’ll use this to return null if there’s no second-highest salary.
  • OFFSET: OFFSET is used with the ORDER BY clause to skip the top n rows that you specify. This is useful here because we want the second row (the 2nd highest salary).

SELECT
IFNULL(
(SELECT DISTINCT Salary
FROM Employee
ORDER BY Salary DESC
LIMIT 1 OFFSET 1
), null) as SecondHighestSalary
FROM Employee
LIMIT 1

SOLUTION B: Using MAX()

This query says to choose the MAX salary that isn’t equal to the MAX salary, which is equivalent to saying to choose the second-highest salary!

SELECT MAX(salary) AS SecondHighestSalary
FROM Employee
WHERE salary != (SELECT MAX(salary) FROM Employee)


Duplicate Emails

Write a SQL query to find all duplicate emails in a table named Person.

+—-+———+
| Id | Email |
+—-+———+
| | a@b.com |
| | c@d.com |
| | a@b.com |
+—-+———+

SOLUTION A: COUNT() in a Subquery

First, a subquery is created to show the count of each email. Then the subquery is filtered WHERE the count is greater than 1.

SELECT Email
FROM (
SELECT Email, count(Email) AS count
FROM Person
GROUP BY Email
) as email_count
WHERE count > 1

SOLUTION B: HAVING Clause

  • HAVING is a clause that essentially allows you to use a WHERE statement in conjunction with aggregates (GROUP BY).

SELECT Email
FROM Person
GROUP BY Email
HAVING count(Email) > 1

Rising Temperature

Given a Weather table, write a SQL query to find all dates’ Ids with higher temperature compared to its previous (yesterday’s) dates.

+———+——————+——————+
| Id(INT) | RecordDate(DATE) | Temperature(INT) |
+———+——————+——————+
| | — | |
| | — | |
| | — | |
| | — | |
+———+——————+——————+

SOLUTION: DATEDIFF()

  • DATEDIFF calculates the difference between two dates and is used to make sure we’re comparing today’s temperature to yesterday’s temperature.

In plain English, the query says: select the Ids where the temperature on a given day is greater than the temperature on the previous day.

SELECT DISTINCT a.Id
FROM Weather a, Weather b
WHERE a.Temperature > b.Temperature
AND DATEDIFF(a.Recorddate, b.Recorddate) = 1

Department Highest Salary

The Employee table holds all employees. Every employee has an Id, a salary, and there is also a column for the department Id.

+—-+——-+——–+————–+
| Id | Name | Salary | DepartmentId |
+—-+——-+——–+————–+
| | Joe | | |
| | Jim | | |
| | Henry | | |
| | Sam | | |
| | Max | | |
+—-+——-+——–+————–+

The Department table holds all departments of the company.

+—-+———-+
| Id | Name |
+—-+———-+
| | IT |
| | Sales |
+—-+———-+

Write a SQL query to find employees who have the highest salary in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).

+————+———-+——–+
| Department | Employee | Salary |
+————+———-+——–+
| IT | Max | |
| IT | Jim | |
| Sales | Henry | |
+————+———-+——–+

SOLUTION: IN Clause

  • The IN clause allows you to use multiple OR clauses in a WHERE statement. For example WHERE country = ‘Canada’ or country = ‘USA’ is the same as WHERE country IN (‘Canada’, ’USA’).
  • In this case, we want to filter the Department table to only show the highest Salary per Department (i.e. DepartmentId). Then we can join the two tables WHERE the DepartmentId and Salary is in the filtered Department table.

SELECT
Department.name AS ‘Department’,
Employee.name AS ‘Employee’,
Salary
FROM Employee
INNER JOIN Department ON Employee.DepartmentId = Department.Id
WHERE (DepartmentId , Salary)
IN
( SELECT
DepartmentId, MAX(Salary)
FROM
Employee
GROUP BY DepartmentId
)

Exchange Seats

Mary is a teacher in a middle school and she has a table seat storing students’ names and their corresponding seat ids. The column id is a continuous increment. Mary wants to change seats for the adjacent students.

Can you write a SQL query to output the result for Mary?

+———+———+
| id | student |
+———+———+
| | Abbot |
| | Doris |
| | Emerson |
| | Green |
| | Jeames |
+———+———+

For the sample input, the output is:

+———+———+
| id | student |
+———+———+
| | Doris |
| | Abbot |
| | Green |
| | Emerson |
| | Jeames |
+———+———+

Note:
If the number of students is odd, there is no need to change the last one’s seat.

SOLUTION: CASE WHEN

  • Think of a CASE WHEN THEN statement like an IF statement in coding.
  • The first WHEN statement checks whether there is an odd number of rows, and if so, ensures that the id of the last row does not change.
  • The second WHEN statement adds 1 to each odd id (eg. 1, 3, 5 become 2, 4, 6).
  • Similarly, the ELSE branch subtracts 1 from each even id (2, 4, 6 become 1, 3, 5).

SELECT
CASE
WHEN ((SELECT MAX(id) FROM seat) % 2 = 1) AND id = (SELECT MAX(id) FROM seat) THEN id
WHEN id % 2 = 1 THEN id + 1
ELSE id - 1
END AS id, student
FROM seat
ORDER BY id

If there are 8 marbles of equal weight and 1 marble that weighs a little bit more (for a total of 9 marbles), how many weighings are required to determine which marble is the heaviest?

Two weighings would be required:

  1. Split the nine marbles into three groups of three and weigh two of the groups against each other. If the scale balances, you know the heavy marble is in the third, unweighed group; otherwise, take the group that weighs more.
  2. Repeat the same step with the three remaining marbles, this time weighing one marble against another instead of group against group.

How would the change of prime membership fee affect the market?

I’m not 100% sure about the answer to this question but will give my best shot!

Let’s take the instance where there’s an increase in the prime membership fee — there are two parties involved, the buyers and the sellers.

For the buyers, the impact of an increase in a prime membership fee ultimately depends on the price elasticity of demand for the buyers. If the price elasticity is high, then a given increase in price will result in a large drop in demand and vice versa. Buyers that continue to purchase a membership fee are likely Amazon’s most loyal and active customers — they are also likely to place a higher emphasis on products with prime.

Sellers will take a hit, as there is now a higher cost of purchasing Amazon’s basket of products. That being said, some products will take a harder hit while others may not be impacted. It is likely that premium products that Amazon’s most loyal customers purchase would not be affected as much, like electronics.

If a PM says that they want to double the number of ads in Newsfeed, how would you figure out if this is a good idea or not?

You can perform an A/B test by splitting the users into two groups: a control group with the normal number of ads and a test group with double the number of ads. Then you would choose a metric to define what a “good idea” means. For example, the null hypothesis could be that doubling the number of ads has no impact on the time spent on Facebook, and the alternative hypothesis could be that doubling the number of ads reduces the time spent on Facebook. You could also choose a different metric, like the number of active users or the churn rate. Then you would run the test and check its statistical significance to decide whether or not to reject the null.

What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?

Lift: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.

KPI: stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. Eg. error rate.

Robustness: generally, robustness refers to a system’s ability to handle variability and remain effective.

Model fitting: refers to how well a model fits a set of observations.

Design of experiments: also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions hypothesized to reflect that variation. In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).

80/20 rule: also known as the Pareto principle; it states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.

Define quality assurance, six sigma.

Quality assurance: an activity or set of activities focused on maintaining a desired level of quality by minimizing mistakes and defects.

Six sigma: a specific type of quality assurance methodology composed of a set of techniques and tools for process improvement. A six-sigma process is one in which 99.99966% of all outcomes are free of defects.

If % of Facebook users on iOS use Instagram, but only % of Facebook users on Android use Instagram, how would you investigate the discrepancy?

There are a number of possible variables that can cause such a discrepancy that I would check to see:

  • The demographics of iOS and Android users might differ significantly. For example, according to Hootsuite, % of females use Instagram as opposed to % of men. If the proportion of female users for iOS is significantly larger than for Android then this can explain the discrepancy (or at least a part of it). This can also be said for age, race, ethnicity, location, etc…
  • Behavioural factors can also have an impact on the discrepancy. If iOS users use their phones more heavily than Android users, it’s more likely that they’ll indulge in Instagram and other apps than someone who spent significantly less time on their phones.
  • Another possible factor to consider is how Google Play and the App Store differ. For example, if Android users have significantly more apps (and social media apps) to choose from, that may cause greater dilution of users.
  • Lastly, any differences in the user experience can deter Android users from using Instagram compared to iOS users. If the app is more buggy for Android users than iOS users, they’ll be less likely to be active on the app.

Likes/user and minutes spent on a platform are increasing but total number of users are decreasing. What could be the root cause of it?

Generally, you would want to probe the interviewer for more information but let’s assume that this is the only information that he/she is willing to give.

Focusing on likes per user, there are two reasons why this would have gone up. The first reason is that the engagement of users has generally increased on average over time — this makes sense because as time passes, active users are more likely to be loyal users as using the platform becomes a habitual practice. The other reason why likes per user would increase is that the denominator, the total number of users, is decreasing. Assuming that users that stop using the platform are inactive users, aka users with little engagement and fewer likes than average, this would increase the average number of likes per user.

The explanation above can also be applied to minutes spent on the platform. Active users are becoming more engaged over time, while users with little usage are becoming inactive. Overall, the increase in engagement outweighs the users with little engagement.

To take it a step further, it’s possible that the ‘users with little engagement’ are bots that Facebook has been able to detect. Over time, Facebook has developed algorithms to spot and remove bots; if there were a significant number of bots before, this could be the root cause of the phenomenon.

Facebook sees that likes are up % year over year, why could this be?

The total number of likes in a given year is a function of the total number of users and the average number of likes per user (which I’ll refer to as engagement).

Some potential reasons for an increase in the total number of users are the following: users acquired due to international expansion and younger age groups signing up for Facebook as they get older.

Some potential reasons for an increase in engagement are an increase in usage of the app from users that are becoming more and more loyal, new features and functionality, and an improved user experience.

If we were testing product X, what metrics would you look at to determine if it is a success?

The metrics that determine a product’s success are dependent on the business model and what the business is trying to achieve through the product. The book Lean analytics lays out a great framework that one can use to determine what metrics to use in a given scenario:

Framework from Lean Analytics

So, this brings us to the end of the Data Science Interview Questions blog. This Tecklearn ‘Top Data Science Interview Questions and Answers’ article helps you with commonly asked questions if you are looking out for a job in the Data Science domain. If you wish to learn Data Science and build a career in the Data Science domain, then check out our interactive Data Science Training using R Language, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/data-science-training-using-r-language/

Data Science using R Language Training

About the Course

Tecklearn’s Data Science using R Language Training develops knowledge and skills to visualize, transform, and model data in R language. It helps you to master the Data Science with R concepts such as data visualization, data manipulation, machine learning algorithms, charts, hypothesis testing, etc. through industry use cases, and real-time examples. Data Science course certification training lets you master data analysis, R statistical computing, connecting R with Hadoop framework, Machine Learning algorithms, time-series analysis, K-Means Clustering, Naïve Bayes, business analytics and more. This course will help you gain hands-on experience in deploying Recommender using R, Evaluation, Data Transformation etc.

Why Should you take Data Science Using R Training?

  • The Average salary of a Data Scientist in R is $123k per annum – Glassdoor.com
  • A recent market study shows that the Data Analytics Market is expected to grow at a CAGR of 30.08% from 2020 to 2023, which would equate to $77.6 billion.
  • IBM, Amazon, Apple, Google, Facebook, Microsoft, Oracle & other MNCs worldwide are using data science for their Data analysis.

What you will Learn in this Course?

Introduction to Data Science

  • Need for Data Science
  • What is Data Science
  • Life Cycle of Data Science
  • Applications of Data Science
  • Introduction to Big Data
  • Introduction to Machine Learning
  • Introduction to Deep Learning
  • Introduction to R&R-Studio
  • Project Based Data Science

Introduction to R

  • Introduction to R
  • Data Exploration
  • Operators in R
  • Inbuilt Functions in R
  • Flow Control Statements & User Defined Functions
  • Data Structures in R

Data Manipulation

  • Need for Data Manipulation
  • Introduction to dplyr package
  • select(), filter(), mutate(), sample_n(), sample_frac() & count() functions
  • Getting summarized results with the summarise() function
  • Combining different functions with the pipe operator
  • Implementing sql like operations with sqldf()

Visualization of Data

  • Loading different types of dataset in R
  • Arranging the data
  • Plotting the graphs

Introduction to Statistics

  • Types of Data
  • Probability
  • Correlation and Co-variance
  • Hypothesis Testing
  • Standardization and Normalization

Introduction to Machine Learning

  • What is Machine Learning?
  • Machine Learning Use-Cases
  • Machine Learning Process Flow
  • Machine Learning Categories
  • Supervised Learning algorithm: Linear Regression and Logistic Regression

Logistic Regression

  • Intro to Logistic Regression
  • Simple Logistic Regression in R
  • Multiple Logistic Regression in R
  • Confusion Matrix
  • ROC Curve

Classification Techniques

  • What are classification and its use cases?
  • What is Decision Tree?
  • Algorithm for Decision Tree Induction
  • Creating a Perfect Decision Tree
  • Confusion Matrix
  • What is Random Forest?
  • What is Naive Bayes?
  • Support Vector Machine: Classification

Decision Tree

  • Decision Tree in R
  • Information Gain
  • Gini Index
  • Pruning

Recommender Engines

  • What is Association Rules & its use cases?
  • What is Recommendation Engine & it’s working?
  • Types of Recommendations
  • User-Based Recommendation
  • Item-Based Recommendation
  • Difference: User-Based and Item-Based Recommendation
  • Recommendation use cases

Time Series Analysis

  • What is Time Series data?
  • Time Series variables
  • Different components of Time Series data
  • Visualize the data to identify Time Series Components
  • Implement ARIMA model for forecasting
  • Exponential smoothing models
  • Identifying different time series scenario based on which different Exponential Smoothing model can be applied

Got a question for us? Please mention it in the comments section and we will get back to you.

 

 

 

 
