Distribution in Data Science
![](https://statusneo.com/wp-content/uploads/2021/12/Untitled.drawio.png)
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-5.39.10-PM-1024x487.png)
Discrete
1. Binomial
x successes in n independent trials, each with success probability p
P(x) = nCx · p^x · q^(n−x), with μ = np and σ² = npq, where q = 1 − p
- P(x) = binomial probability
- x = number of times a selected outcome occurs within n trials
- nCx = number of combinations
- p = probability of success on a single trial
- q = probability of failure on a single trial
- n = number of trials
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-5.38.51-PM-1024x486.png)
Note: If n = 1, this is a Bernoulli distribution.
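As a quick check of the binomial mean and variance above, here is a minimal sketch using only Python's standard library. The function name `binomial_pmf` and the values n = 10 and p = 0.3 are illustrative assumptions, not part of the original post.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p  # probability of failure on a single trial
    return comb(n, x) * p**x * q ** (n - x)


n, p = 10, 0.3  # illustrative values
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the probabilities.
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(round(mean, 6), round(variance, 6))  # compare with np and npq
```

With n = 1 the same function gives the Bernoulli case: `binomial_pmf(1, 1, p)` is just p.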
2. Geometric
The geometric distribution is a type of probability distribution based on three key assumptions, listed below.
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-5.39.23-PM-1024x472.png)
- The trials are independent.
- There are only two possible outcomes for each trial: success or failure.
- The probability of success, denoted by p, is the same for every trial.
first success occurs on the nth trial, with success probability p on each trial
P(X = n) = q^(n−1) · p, with µ = 1/p and σ² = (1 − p)/p²
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-7.26.47-PM.png)
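The geometric mean and variance can be checked numerically. The sketch below is only an illustration: the function name `geometric_pmf`, the value p = 0.25, and the cutoff of 2000 terms (used to truncate the infinite sum) are assumptions.

```python
def geometric_pmf(n: int, p: float) -> float:
    """Probability that the first success occurs on trial n (n = 1, 2, 3, ...)."""
    return (1 - p) ** (n - 1) * p


p = 0.25  # illustrative value
trials = range(1, 2001)  # truncated sum; the remaining tail is negligible for this p

total = sum(geometric_pmf(n, p) for n in trials)
mean = sum(n * geometric_pmf(n, p) for n in trials)
variance = sum((n - mean) ** 2 * geometric_pmf(n, p) for n in trials)

print(round(mean, 6), round(variance, 6))  # compare with 1/p and (1 - p)/p**2
```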
3. Negative Binomial
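- The negative binomial distribution (also called the Pascal distribution) describes the random variable in a negative binomial experiment, where independent trials are repeated until a fixed number of successes is reached.
- It gives the number of failures before the r-th success.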
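A minimal sketch of the negative binomial probabilities follows. It assumes the common parameterization in which X counts failures before the r-th success, so P(X = k) = C(k + r − 1, k) · p^r · q^k with mean r·q/p. The function name and the values r = 3, p = 0.5 are illustrative.

```python
from math import comb


def negative_binomial_pmf(k: int, r: int, p: float) -> float:
    """Probability of exactly k failures before the r-th success."""
    q = 1 - p
    return comb(k + r - 1, k) * p**r * q**k


r, p = 3, 0.5  # illustrative values
failures = range(500)  # truncated sum; the tail is negligible for these values

total = sum(negative_binomial_pmf(k, r, p) for k in failures)
mean = sum(k * negative_binomial_pmf(k, r, p) for k in failures)

print(round(total, 6), round(mean, 6))  # compare mean with r * (1 - p) / p
```

With r = 1 this reduces to a geometric distribution counted in failures rather than trials: `negative_binomial_pmf(k, 1, p)` equals q^k · p.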
4. Hypergeometric
- The hypergeometric distribution is very similar to the binomial distribution, except that items are drawn without replacement. In fact, the binomial distribution is a good approximation of the hypergeometric distribution as long as you sample 5% or less of the population.
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-8.11.21-PM-1024x463.png)
![](https://statusneo.com/wp-content/uploads/2021/12/hypergeometric-distribution-formula-1.png)
- K is the number of successes in the population
- k is the number of observed successes
- N is the population size
- n is the number of draws
- X is the number of drawn items that have the feature of interest
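Using the symbols listed above, the hypergeometric probability is C(K, k) · C(N − K, n − k) / C(N, n). The sketch below compares it with a binomial distribution that uses p = K/N. The population numbers (N = 1000, K = 300, n = 20, i.e. a 2% sample) are illustrative assumptions.

```python
from math import comb


def hypergeometric_pmf(k: int, N: int, K: int, n: int) -> float:
    """Probability of k successes in n draws without replacement."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n independent draws (with replacement)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N, K, n = 1000, 300, 20  # illustrative: the sample is 2% of the population
hyper = [hypergeometric_pmf(k, N, K, n) for k in range(n + 1)]
binom = [binomial_pmf(k, n, K / N) for k in range(n + 1)]

# Largest pointwise difference between the two distributions.
max_gap = max(abs(h - b) for h, b in zip(hyper, binom))
print(max_gap)
```

For a small sampling fraction like this one, the two lists of probabilities stay close to each other, which is why the binomial is often used as a stand-in.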
5. Poisson
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-9.29.41-PM.png)
number of successes x in a fixed interval, where successes occur at an average rate λ
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-9.36.53-PM-1024x413.png)
µ = σ² = λ
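The property that the Poisson mean and variance are both λ can be checked with a short sketch. It uses the standard formula P(x) = λ^x · e^(−λ) / x!; the function name, the value λ = 4, and the cutoff of 100 terms are illustrative assumptions.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events when events occur at average rate lam."""
    return lam**x * exp(-lam) / factorial(x)


lam = 4.0  # illustrative value
counts = range(100)  # truncated sum; the tail is negligible for this rate

total = sum(poisson_pmf(x, lam) for x in counts)
mean = sum(x * poisson_pmf(x, lam) for x in counts)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in counts)

print(round(mean, 6), round(variance, 6))  # both should be close to lam
```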
Continuous
1. Uniform
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-5.38.10-PM-1-1024x480.png)
all values between a and b are equally likely
f(x)=1/(b−a)
for a ≤ x ≤ b
The theoretical mean and standard deviation are
μ = (a + b)/2 and σ = √((b − a)²/12)
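The uniform mean and standard deviation can be confirmed by integrating the density numerically. This is a minimal sketch using a midpoint rule; the endpoints a = 2, b = 8 and the step count are illustrative assumptions.

```python
from math import sqrt


def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of the continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0


a, b = 2.0, 8.0  # illustrative endpoints
steps = 100_000
width = (b - a) / steps
midpoints = [a + (i + 0.5) * width for i in range(steps)]

# Midpoint-rule approximations of the mean and variance integrals.
mean = sum(x * uniform_pdf(x, a, b) * width for x in midpoints)
variance = sum((x - mean) ** 2 * uniform_pdf(x, a, b) * width for x in midpoints)
std_dev = sqrt(variance)

print(round(mean, 6), round(std_dev, 6))  # compare with (a+b)/2 and sqrt((b-a)**2/12)
```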
2. Normal/Gaussian
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-14-at-10.03.51-AM-1024x247.png)
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-5.39.47-PM-1024x464.png)
- f(x) = probability density function
- σ = standard deviation
- μ = mean
Central Limit Theorem – the distribution of the sample mean of i.i.d. data approaches a Gaussian distribution as the sample size grows.
Empirical Rule – 68%, 95%, and 99.7% of values lie within one, two, and three standard deviations of the mean.
Normal Approximation – discrete distributions like Binomial and Poisson may be approximated using z-scores when np, nq, and λ are greater than 10
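The empirical rule and the normal approximation can both be checked with `statistics.NormalDist` from the standard library. In the sketch below, the Binomial(100, 0.5) example and the continuity correction (evaluating the normal CDF at 55.5 rather than 55) are illustrative assumptions, not something the original post specifies.

```python
from math import comb, sqrt
from statistics import NormalDist

# Empirical rule: probability mass within k standard deviations of the mean.
standard = NormalDist()  # mean 0, standard deviation 1
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}
print({k: round(v, 4) for k, v in within.items()})  # {1: 0.6827, 2: 0.9545, 3: 0.9973}

# Normal approximation to a binomial: P(X <= 55) for n = 100, p = 0.5.
n, p = 100, 0.5  # np = nq = 50, comfortably above 10
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(56))
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(55.5)
print(exact, approx)
```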
3. Exponential
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-14-at-10.19.03-AM.png)
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-13-at-5.39.33-PM-1024x479.png)
- f(x; λ) = probability density function
- λ = rate parameter
- x = random variable
memoryless time between independent events occurring at an average rate λ → f(x) = λe^(−λx), with µ = 1/λ
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-14-at-10.41.37-AM.png)
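"Memoryless" means that having already waited s units does not change the distribution of the remaining wait: P(X > s + t | X > s) = P(X > t). A minimal sketch of that check, with illustrative values for λ, s, and t:

```python
from math import exp


def exponential_pdf(x: float, lam: float) -> float:
    """Density of the exponential distribution with rate lam."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def exponential_survival(x: float, lam: float) -> float:
    """P(X > x): probability that the wait exceeds x."""
    return exp(-lam * x) if x >= 0 else 1.0


lam, s, t = 0.5, 3.0, 2.0  # illustrative values

# Probability of waiting t more units, given s units have already passed.
conditional = exponential_survival(s + t, lam) / exponential_survival(s, lam)
unconditional = exponential_survival(t, lam)

print(conditional, unconditional)  # the two values should agree
```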
4. Gamma
time until n independent events occurring at a mean rate λ
![](https://statusneo.com/wp-content/uploads/2021/12/Screenshot-2021-12-14-at-10.44.00-AM.png)
where p and x are continuous random variables.
Γ(α) = Gamma function
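Here is a minimal sketch of the gamma density, assuming the shape-rate form f(x) = λ^α · x^(α−1) · e^(−λx) / Γ(α), where α plays the role of the number of events n. The formula shown in the figure may use a different parameterization, so treat the names `alpha` and `lam` and the values below as assumptions.

```python
from math import exp, gamma


def gamma_pdf(x: float, alpha: float, lam: float) -> float:
    """Gamma density in shape-rate form (assumed parameterization)."""
    if x <= 0:
        return 0.0
    return lam**alpha * x ** (alpha - 1) * exp(-lam * x) / gamma(alpha)


alpha, lam = 3, 2.0  # illustrative: wait for 3 events at rate 2
width = 0.001
midpoints = [(i + 0.5) * width for i in range(40_000)]  # covers (0, 40)

# Midpoint-rule approximations of the total probability and the mean.
total = sum(gamma_pdf(x, alpha, lam) * width for x in midpoints)
mean = sum(x * gamma_pdf(x, alpha, lam) * width for x in midpoints)

print(round(total, 4), round(mean, 4))  # compare mean with alpha / lam
```

With `alpha = 1` the function reduces to the exponential density λe^(−λx), which matches the idea of waiting for a single event.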
Concepts
Prediction Error = Bias² + Variance + Irreducible Noise
1. Bias
![](https://statusneo.com/wp-content/uploads/2021/12/463627_1_En_8_Fig3_HTML.jpeg)
wrong assumptions when training → can’t capture underlying patterns → underfit
2. Variance
sensitive to fluctuations in the training data → can’t generalize to unseen data → overfit
The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
– Cross-validation to generalize to unseen data
– Dimension reduction and feature selection
In general, as variance decreases, bias increases.
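The decomposition can be illustrated with a deliberately simple toy model, which is an assumption made here for illustration and not something from the original post: predict a new observation y ~ N(μ, σ²) using c times the mean of n training samples. Shrinking c toward 0 lowers the variance of the prediction but raises its bias.

```python
import random


def error_components(c: float, mu: float, sigma: float, n: int):
    """Closed-form bias², variance, and noise for the predictor c * sample_mean."""
    bias_sq = ((c - 1) * mu) ** 2      # systematic error from shrinking
    variance = c**2 * sigma**2 / n     # spread of the predictor across training sets
    noise = sigma**2                   # irreducible noise in the new observation
    return bias_sq, variance, noise


def simulated_error(c, mu, sigma, n, trials=20_000, seed=0):
    """Monte Carlo estimate of the expected squared prediction error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        new_y = rng.gauss(mu, sigma)
        total += (new_y - c * sample_mean) ** 2
    return total / trials


mu, sigma, n = 2.0, 1.0, 10  # illustrative values
for c in (1.0, 0.8, 0.5):
    bias_sq, variance, noise = error_components(c, mu, sigma, n)
    print(c, round(bias_sq, 3), round(variance, 3), round(bias_sq + variance + noise, 3))

print(simulated_error(0.8, mu, sigma, n))  # should be near the c = 0.8 total above
```

In this toy setup the lowest total error is not necessarily at either extreme, which is the practical point of the tradeoff.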
ML models may be divided into two types:
– Parametric – uses a fixed number of parameters with respect to sample size
– Non-Parametric – uses a flexible number of parameters and doesn’t make particular assumptions about the data
3. Cross-Validation
estimates test error using a held-out subset of the training data, and selects parameters to maximize average performance
– k-fold – divide data into k groups, train on k − 1 of them, and use the remaining one to validate
– leave-p-out – use p samples to validate and the rest to train
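Both splitting schemes can be sketched in plain Python by generating index lists. The helper names `k_fold_indices` and `leave_p_out_indices` are hypothetical and written here for illustration; in practice a library such as scikit-learn provides equivalent splitters.

```python
from itertools import combinations


def k_fold_indices(n_samples: int, k: int):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def leave_p_out_indices(n_samples: int, p: int):
    """Yield (train, validation) index lists for every subset of p samples."""
    all_indices = range(n_samples)
    for held_out in combinations(all_indices, p):
        validation = list(held_out)
        train = [i for i in all_indices if i not in held_out]
        yield train, validation


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)

lpo_splits = list(leave_p_out_indices(5, 2))
print(len(lpo_splits))  # number of ways to choose 2 of 5 samples
```

Leave-p-out produces one split per combination of p samples, so its cost grows quickly with the dataset size, while k-fold always produces exactly k splits.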
Reference: https://github.com/aaronwangy/Data-Science-Cheatsheet