Probability and Statistics for Data Science
“Facts are stubborn, but statistics are more pliable” – Mark Twain
We frequently ask ourselves: Why do we need to learn statistics and probability for data science?
A data scientist’s primary responsibility is to identify different structures in the data set and predict patterns in order to make intelligent inferences. Prediction and estimates are crucial components of data science. Probability theory and statistics work together to provide the foundation of data science. While statistics helps in creating estimates for further analysis, probability theory aids in performing predictive analysis. They both rely on DATA.
What is Data?
Data is factual information (such as numbers, words, measurements, or observations) collected from different sources and used to analyze something or make decisions.
Why does Data matter?
- Helps in gaining a deeper understanding of a problem by revealing connections between variables.
- Aids in forecasting future values based on historical trends.
- Aids in the detection of fraud by highlighting irregularities.
- Aids in finding possible patterns.
Types of Data
Categorical Data
This data is qualitative in nature and can be sorted into groups or categories with the aid of names or labels.
i. Nominal Data
Nominal data is labelled into mutually exclusive categories within a variable. These categories cannot be ordered in a meaningful way.
For example: Blood type, Hair color
ii. Ordinal Data
Ordinal data is categorical data whose categories can be ranked in a natural order, although the distances between categories are not necessarily equal.
For example: Measuring economic status using the hierarchy: ‘wealthy’, ‘middle income’ or ‘poor’.
Numerical Data
Numerical data is information expressed as numbers that serves as a quantitative measurement of things.
i. Discrete Data
Discrete data usually comes from counting events or items and, as a result, can only take certain separate values. These values are frequently, but not always, integers.
For example: Shoe size of people, Number of children
ii. Continuous Data
Continuous data can take any value within a range, so the number of possible values is infinite.
For example: Weight, Height
Note: Qualitative (Categorical) data can be displayed using a Bar Plot, a Pie Chart, or a Pareto Chart. Histograms, Line Plots, and Scatter Plots can be used to depict Quantitative (Numerical) data.
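As a rough illustration of how the four data types behave differently in practice, here is a minimal Python sketch. The records, field names, and rank mapping are all made up for this example; the point is only which operations make sense for each type.

```python
from collections import Counter

# Hypothetical survey records; every field name and value is invented for illustration.
people = [
    {"blood_type": "A", "economic_status": "middle income", "children": 2, "height_cm": 171.5},
    {"blood_type": "O", "economic_status": "wealthy", "children": 0, "height_cm": 182.3},
    {"blood_type": "A", "economic_status": "poor", "children": 3, "height_cm": 165.0},
]

# Nominal: categories have no order, so counting is the natural summary.
blood_type_counts = Counter(p["blood_type"] for p in people)

# Ordinal: categories have a rank, so sorting by that rank is meaningful.
STATUS_RANK = {"poor": 0, "middle income": 1, "wealthy": 2}  # assumed ordering
by_status = sorted(people, key=lambda p: STATUS_RANK[p["economic_status"]])

# Discrete: whole-number counts can be added up.
total_children = sum(p["children"] for p in people)

# Continuous: measurements can take any value in a range, so averages are natural.
mean_height_cm = sum(p["height_cm"] for p in people) / len(people)

print(blood_type_counts)
print([p["economic_status"] for p in by_status])
print(total_children, round(mean_height_cm, 1))
```

Sorting blood types alphabetically would run without error, but the resulting order would carry no meaning, which is exactly what separates nominal from ordinal data.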
STATISTICS
The statistical approach proposes that we first formulate the problem and then collect the data to answer it. In the machine learning approach, by contrast, we typically already have the data and need to find out what it is telling us.
Population or Sample Data

Population: In statistics, the population is the complete set of items from which data for a statistical study is derived. It can be a gathering of people, a collection of objects, etc. It constitutes the study’s data pool.
Sample: The sample is a subset of the population, ideally selected at random so that it represents the population well. A sample is used in statistical testing when the population is too big to include all members or observations in the test. A well-chosen sample shares the traits of the population and is more manageable.

For Example: Assume you wish to find the average salary of employees in a multinational company (a parameter). We choose a random sample of 500 employees and calculate their mean salary (x̄) to be $25,500 (a statistic). We conclude that the population mean salary (μ) is also likely to be about $25,500.
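The relationship between a parameter and a statistic can be sketched with a small simulation. The population below is synthetic: the salary distribution, its spread, and the population size are assumptions chosen only so that the numbers resemble the example above.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 50,000 salaries centred near $25,500 (made-up numbers).
population = [rng.gauss(25_500, 5_000) for _ in range(50_000)]
mu = statistics.fmean(population)  # parameter: mean of the whole population

# Draw a simple random sample of 500 employees without replacement.
sample = rng.sample(population, 500)
x_bar = statistics.fmean(sample)  # statistic: mean of the sample

print(f"population mean (mu):  {mu:,.0f}")
print(f"sample mean (x-bar):   {x_bar:,.0f}")
print(f"difference:            {abs(x_bar - mu):,.0f}")
```

The sample mean will rarely equal the population mean exactly, but with a random sample of this size it usually lands close to it, which is why we can use x̄ as an estimate of μ.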
Measure of Central Tendency
Measures of central tendency describe the center position of a distribution for a data set. We quantify the “center” of our data using the mean, median, and mode. These measures are used in hypothesis testing, regression, and many other methods.
Mean: It is the average of all the observations in the data. Mean is denoted by x̄.
Mean, x̄ = (∑xᵢ fᵢ)/(∑fᵢ), where fᵢ is the number of times the value xᵢ occurs. When every observation is listed individually, this is simply x̄ = (∑xᵢ)/n.
Median: After sorting the observations, the median is the middle observation when there is an odd number of them; when there is an even number, the median is the mean of the two middle values. It is often a better summary than the mean for skewed data because it is less affected by outliers.
If n is odd, Median = ((n + 1)/2)ᵗʰ observation of the sorted data
If n is even, Median = [(n/2)ᵗʰ obs. + ((n/2) + 1)ᵗʰ obs.]/2 of the sorted data
Mode: The most frequently occurring observation. A dataset can have one mode, multiple modes, or no mode at all.
For Example: For the dataset {12, 14, 14, 15, 15, 16, 17, 18, 90, 95}, the mean is 30.6, the median is 15.5 (the average of the 5th and 6th values, 15 and 16), and the modes are 14 and 15.
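These three measures can be checked with Python's built-in statistics module. This is a small sketch on the ten values from the example; the variable names are our own, and the frequency-weighted mean is included only to show that it agrees with the plain average.

```python
import statistics
from collections import Counter

data = [12, 14, 14, 15, 15, 16, 17, 18, 90, 95]

mean = statistics.mean(data)        # sum of values divided by their count
median = statistics.median(data)    # average of the two middle values, since n is even
modes = statistics.multimode(data)  # every value tied for the highest frequency

# Frequency form of the mean: sum(x_i * f_i) / sum(f_i)
freq = Counter(data)
weighted_mean = sum(x * f for x, f in freq.items()) / sum(freq.values())

print("mean:", mean)
print("median:", median)
print("modes:", modes)
print("mean from frequencies:", weighted_mean)
```

Notice how the two large values, 90 and 95, pull the mean far above the median, while the median stays close to where most of the data sits.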
Measures of Variability
Measures of variability (or measures of spread) help in analyzing how dispersed the distribution is for a set of data. For instance, while a measure of central tendency can provide the average value for a group of data, it cannot characterize how the data is distributed around that value.
Variance:
The variance is defined as the average of the squared differences from the mean. It describes how spread out the data is.
When analyzing sample data, the variance formula is slightly different. If there are n observations in the sample, we divide by n − 1 instead of n:
S² = ∑(xᵢ − x̄)² / (n − 1)
S² : Sample variance
xᵢ : Value of one observation
x̄ : Sample mean
n : Number of observations
Standard Deviation:
The standard deviation measures the dispersion of the data points in a dataset. It is calculated as the square root of the variance and shows how far data points typically lie from the mean.
σ = √( ∑(xᵢ − μ)² / N )
σ : Population standard deviation
xᵢ : Value of one observation
μ : Population mean
N : Size of population
For Example: The heights of 5 dogs are 600 mm, 470 mm, 170 mm, 430 mm and 300 mm. Let's find the mean, the variance, and the standard deviation.

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
Average height is 394 mm.
Now we calculate each dog’s difference from the mean (206, 76, −224, 36 and −94 mm), square each difference, and average the squares:

Variance σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
=> (42436 + 5776 + 50176 + 1296 + 8836) / 5 => 108520 / 5 => 21704
So the Variance is 21,704 mm².
Standard Deviation σ = √21704 = 147.32… ≈ 147 mm

We can now see which heights lie within one standard deviation (147 mm) of the mean, i.e. between 247 mm and 541 mm. The standard deviation therefore gives us a “standard” way of deciding what is typical and what is unusually large or unusually small.
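The dog example can be reproduced with the statistics module. The sketch below treats the five dogs as the whole population (dividing by N), and also shows the sample versions (dividing by n − 1) for comparison; the variable names are our own.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean = statistics.mean(heights_mm)

# Population formulas: divide the sum of squared differences by N.
pop_variance = statistics.pvariance(heights_mm)
pop_sd = statistics.pstdev(heights_mm)

# Sample formulas: divide by n - 1 instead, which gives a larger value.
sample_variance = statistics.variance(heights_mm)
sample_sd = statistics.stdev(heights_mm)

# Heights within one (population) standard deviation of the mean.
typical = [h for h in heights_mm if abs(h - mean) <= pop_sd]

print("mean:", mean)
print("population variance:", pop_variance, "sd:", round(pop_sd, 2))
print("sample variance:", sample_variance, "sd:", round(sample_sd, 2))
print("within one sd of the mean:", typical)
```

Which pair of functions to use depends on whether the data is the entire population or only a sample drawn from it.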
Frequency Distribution
It is a representation, either in a graphical or tabular format, that displays the number of observations within a given interval.
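A tabular frequency distribution can be built in a few lines. This is a minimal sketch that reuses the ten values from the central tendency example; the interval width of 10 is an arbitrary choice made for illustration.

```python
from collections import Counter

data = [12, 14, 14, 15, 15, 16, 17, 18, 90, 95]
BIN_WIDTH = 10  # assumed interval width; other widths give other tables

# Map every value to the lower edge of its interval, then count per interval.
bins = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in data)

# Print a small text table with one '#' per observation.
for start in sorted(bins):
    end = start + BIN_WIDTH - 1
    print(f"{start:>3}-{end:<3} {bins[start]:>2} {'#' * bins[start]}")
```

Most observations fall into the 10-19 interval, with two far away in the 90-99 interval, so the table already hints at a long tail on the right.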
- Positive Skewness: It typically occurs when Mean > Median > Mode. The tail extends to the right in this case, i.e. the outliers lie on the right.
- Negative Skewness: It typically occurs when Mean < Median < Mode. The tail extends to the left, i.e. the outliers lie on the left.
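The direction of skew can be checked numerically. The sketch below uses one common moment-based definition of skewness (the average of the cubed standardized differences); other definitions exist, and the helper function and the mirrored dataset are our own constructions for illustration.

```python
import statistics


def skewness(values):
    """Moment-based skewness: mean of ((x - mean) / sd) cubed, using the population sd."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)


right_skewed = [12, 14, 14, 15, 15, 16, 17, 18, 90, 95]
left_skewed = [100 - x for x in right_skewed]  # mirror image of the same data

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(
        name,
        "mean:", round(statistics.fmean(values), 1),
        "median:", statistics.median(values),
        "skewness:", round(skewness(values), 2),
    )
```

A positive result goes with a mean above the median, and a negative result with a mean below the median.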
PROBABILITY
The term “probability” simply refers to how likely an event is to take place or the chance of the occurrence of an event. It is regarded as a key component of predictive analytics.
For Example: If you throw a fair six-sided die, then the probability of getting 1 is 1/6.
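When all outcomes are equally likely, a probability is just the number of favourable outcomes divided by the total number of outcomes. A minimal sketch, assuming a fair six-sided die (the helper name is our own):

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # faces of a fair die, all equally likely


def probability(event):
    """Favourable outcomes divided by all outcomes, as an exact fraction."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))


p_one = probability(lambda o: o == 1)
p_even = probability(lambda o: o % 2 == 0)

print("P(1) =", p_one)
print("P(even) =", p_even)
```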
Conditional Probability
The probability of an event happening if another event has already happened is known as conditional probability.
For Example: In a group of 100 sports car buyers, 40 bought alarm systems, 30 purchased bucket seats, and 20 purchased an alarm system and bucket seats. If a car buyer chosen at random bought an alarm system, what is the probability they also bought bucket seats?

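Let A be the event that a buyer bought an alarm system and B the event that they bought bucket seats, so P(A) = 40/100 = 0.4 and P(A∩B) = 20/100 = 0.2.
P(B|A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5
Given that a buyer bought an alarm system, the probability that they also bought bucket seats is 50%.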
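The same calculation can be written directly from the counts in the example. This is a small sketch with our own variable names, using exact fractions to avoid rounding.

```python
from fractions import Fraction

total_buyers = 100
bought_alarm = 40   # event A
bought_both = 20    # event A and B (alarm system and bucket seats)

p_a = Fraction(bought_alarm, total_buyers)
p_a_and_b = Fraction(bought_both, total_buyers)

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print("P(B|A) =", p_b_given_a, "=", float(p_b_given_a))
```

Conditioning on A simply restricts attention to the 40 alarm buyers, 20 of whom also bought bucket seats.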
Bayes’ Theorem
Bayes’ Theorem is a way of finding a probability when we know certain other probabilities. It is used to update the probability of a hypothesis (A) after observing some evidence (B):
P(A|B) = P(B|A) × P(A) / P(B)
A, B : Events
P (A|B) : Probability of A given B is true
P (B|A) : Probability of B given A is true
P(A), P(B) : Probabilities of A and of B on their own, without any condition
For Example: You have a picnic planned for today, but the morning is cloudy.
- Oh no! 50% of all rainy days begin cloudy!
- However, cloudy mornings are common (approximately 40% of days begin cloudy)
- This is often a dry month (just 3 of 30 days are rainy, or 10%).
P(Rain) is Probability of Rain = 10%
P(Cloud | Rain) is Probability of Cloud, given that Rain happens = 50%
P(Cloud) is Probability of Cloud = 40%
P(Rain | Cloud) = P(Rain) × P(Cloud | Rain) / P(Cloud) = 0.1 × 0.5 / 0.4 = 0.125
In other words, there is a 12.5% chance of rain. Not to worry, let’s go on a picnic!
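The picnic calculation can be wrapped in a small helper. This is a sketch rather than a library function: the name bayes and its parameter names are our own, and exact fractions are used so that the result is not affected by floating-point rounding.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_b_given_a * p_a / p_b


p_rain = Fraction(10, 100)              # P(Rain)
p_cloud_given_rain = Fraction(50, 100)  # P(Cloud | Rain)
p_cloud = Fraction(40, 100)             # P(Cloud)

p_rain_given_cloud = bayes(p_cloud_given_rain, p_rain, p_cloud)

print("P(Rain | Cloud) =", p_rain_given_cloud, "=", float(p_rain_given_cloud))
```

Seeing clouds raises the chance of rain only from 10% to 12.5%, because cloudy mornings are common even on dry days.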
End Notes
I hope you found the article interesting. Probability and statistics are at the heart of data science. The topics presented in this article form the foundation of numerous algorithms, of the statistical approach to a problem, and of the graphical understanding of data, so they are critical and should not be overlooked.