Distributions in Data Science


Discrete
1. Binomial
x successes in n trials, each with success probability p:
P(X = x) = C(n, x) · p^x · q^(n−x), with μ = np and σ² = npq
where:
- P(X = x) = binomial probability
- x = number of times the outcome of interest occurs within n trials
- C(n, x) = number of combinations
- p = probability of success on a single trial
- q = probability of failure on a single trial
- n = number of trials

Note: if n = 1, this reduces to the Bernoulli distribution.
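As a quick illustration, here is a minimal sketch (assuming SciPy is installed; the values n = 10 and p = 0.3 are hypothetical) that checks the binomial PMF, mean, and variance against the formulas above:

```python
from scipy.stats import binom

n, p = 10, 0.3   # hypothetical: 10 trials, 30% success probability
q = 1 - p

# P(X = 4): probability of exactly 4 successes in 10 trials
print(binom.pmf(4, n, p))

# Mean and variance agree with mu = np and sigma^2 = npq
print(binom.mean(n, p), n * p)       # both 3.0
print(binom.var(n, p), n * p * q)    # both 2.1
```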
2. Geometric
The geometric distribution is a type of probability distribution based on three key assumptions:

- The trials performed are independent.
- There are only two possible outcomes for each trial: success or failure.
- The probability of success, denoted p, is the same for every trial.
first success with probability p on the nth trial:
P(X = n) = q^(n−1) · p, with µ = 1/p and σ² = (1 − p)/p²
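A minimal sketch, assuming SciPy and a hypothetical per-trial success probability p = 0.2, that checks the PMF, mean, and variance formulas:

```python
from scipy.stats import geom

p = 0.2          # hypothetical success probability per trial
q = 1 - p

# P(first success on the 5th trial) = q^(5-1) * p
print(geom.pmf(5, p), q**4 * p)   # both 0.08192

# Mean 1/p and variance (1 - p)/p^2
print(geom.mean(p), 1 / p)        # both 5.0
print(geom.var(p), q / p**2)      # both 20.0
```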

3. Negative Binomial
- The negative binomial distribution (also called the Pascal distribution) describes the random variable in a negative binomial experiment:
- the number of failures observed before r successes occur.
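For illustration, a sketch using SciPy's nbinom (the values r = 3 and p = 0.4 are hypothetical), which parameterizes the distribution exactly as the number of failures before the r-th success:

```python
from scipy.stats import nbinom

r, p = 3, 0.4   # hypothetical: count failures before the 3rd success

# P(exactly 5 failures occur before the 3rd success)
print(nbinom.pmf(5, r, p))

# Expected number of failures before r successes: r(1 - p)/p
print(nbinom.mean(r, p), r * (1 - p) / p)   # both 4.5
```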
4. Hypergeometric
- The hypergeometric distribution is very similar to the binomial distribution. In fact, the binomial distribution is an excellent approximation of the hypergeometric distribution as long as you sample 5% or less of the population.


P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n), where:
- K is the number of successes within the population
- k is the number of observed successes
- N is the population size
- n is the number of draws
- X is the number of drawn items with the feature of interest (the observed successes)
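A small sketch, assuming SciPy, with a hypothetical deck-of-cards example (N = 52 cards, K = 13 hearts, n = 5 draws without replacement):

```python
from scipy.stats import hypergeom

N, K, n = 52, 13, 5   # hypothetical: 52 cards, 13 hearts, draw 5

# P(exactly k = 2 hearts in the hand)
# SciPy's argument order is (k, population size, successes, draws)
print(hypergeom.pmf(2, N, K, n))   # ~0.274
```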
5. Poisson

number of successes x in a fixed interval, where successes occur at an average rate λ:
P(X = x) = (λ^x · e^(−λ)) / x!, with µ = σ² = λ
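As a sketch (SciPy assumed; the rate λ = 4 is hypothetical), confirming that the mean and variance are both λ:

```python
from scipy.stats import poisson

lam = 4.0   # hypothetical average rate of 4 events per interval

# P(X = 2): probability of exactly 2 events in the interval
print(poisson.pmf(2, lam))

# Mean and variance are both lambda
print(poisson.mean(lam), poisson.var(lam))   # both 4.0
```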
Continuous
1. Uniform

all values between a and b are equally likely
f(x) = 1/(b − a) for a ≤ x ≤ b
The theoretical mean and standard deviation are:
μ = (a + b)/2 and σ = √((b − a)²/12)
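A minimal check, assuming SciPy and hypothetical endpoints a = 2 and b = 10 (note that SciPy parameterizes the uniform by loc = a and scale = b − a):

```python
from scipy.stats import uniform

a, b = 2.0, 10.0                    # hypothetical interval endpoints
dist = uniform(loc=a, scale=b - a)  # SciPy's parameterization

print(dist.pdf(5.0), 1 / (b - a))             # both 0.125
print(dist.mean(), (a + b) / 2)               # both 6.0
print(dist.std(), ((b - a)**2 / 12) ** 0.5)   # both ~2.309
```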
2. Normal/Gaussian


f(x) = (1/(σ√(2π))) · e^(−(x − μ)²/(2σ²)), where:
- f(x) = probability density function
- σ = standard deviation
- μ = mean
Central Limit Theorem – the sample mean of i.i.d. data approaches a Gaussian distribution as the sample size grows.
Empirical Rule – 68%, 95%, and 99.7% of values lie within one, two, and three standard deviations of the mean.
Normal Approximation – discrete distributions like Binomial and Poisson may be approximated using z-scores when np, nq, and λ are greater than 10
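A sketch of the Empirical Rule and the normal approximation (SciPy assumed; the Binomial(100, 0.5) example is hypothetical, with np = nq = 50 > 10):

```python
from scipy.stats import norm, binom

# Empirical rule: probability mass within 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # ~0.683, 0.954, 0.997

# Normal approximation of Binomial(n=100, p=0.5)
n, p = 100, 0.5
approx = norm(loc=n * p, scale=(n * p * (1 - p)) ** 0.5)
print(binom.cdf(55, n, p), approx.cdf(55.5))   # ~0.864 for both (continuity correction)
```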
3. Exponential


memoryless time between independent events occurring at an average rate λ:
f(x) = λe^(−λx), with µ = 1/λ, where:
- f(x) = probability density function
- λ = rate parameter
- x = random variable
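A sketch of the density and the memoryless property, assuming SciPy and a hypothetical rate λ = 0.5 (SciPy uses scale = 1/λ):

```python
import math
from scipy.stats import expon

lam = 0.5                      # hypothetical rate parameter
dist = expon(scale=1 / lam)    # SciPy parameterizes by scale = 1/lambda

# pdf matches lambda * e^(-lambda * x); mean is 1/lambda
print(dist.pdf(1.0), lam * math.exp(-lam * 1.0))
print(dist.mean(), 1 / lam)    # both 2.0

# Memorylessness: P(X > s + t | X > s) = P(X > t)
s, t = 1.0, 2.0
print(dist.sf(s + t) / dist.sf(s), dist.sf(t))   # equal
```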

4. Gamma
time until n independent events occur at a mean rate λ:
f(x) = (λ^α · x^(α−1) · e^(−λx)) / Γ(α)
where x is a continuous random variable and Γ(α) is the Gamma function
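A sketch, assuming SciPy, of waiting for n = 3 events at a hypothetical rate λ = 2 (shape α = n, scale = 1/λ), also showing that shape 1 recovers the Exponential:

```python
from scipy.stats import gamma, expon

n, lam = 3, 2.0                   # hypothetical: 3 events at rate 2 per unit time
dist = gamma(a=n, scale=1 / lam)  # time until the nth event

print(dist.mean(), n / lam)       # both 1.5

# With shape 1, the Gamma reduces to the Exponential
print(gamma(a=1, scale=1 / lam).pdf(0.7), expon(scale=1 / lam).pdf(0.7))
```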
Concepts
Prediction Error = Bias² + Variance + Irreducible Noise
1. Bias

wrong assumptions when training → can’t capture underlying patterns → underfit
2. Variance
sensitive to fluctuations when training → can’t generalize on unseen data → overfit
The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
– Cross-validation to generalize to unseen data
– Dimension reduction and feature selection
In all cases, as variance decreases, bias increases.
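To make the tradeoff concrete, here is a sketch (assuming scikit-learn and NumPy; the sine data and polynomial degrees are hypothetical) where a low-degree model underfits and a high-degree model overfits:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, 80).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 80)

# Degree 1 underfits (high bias); degree 15 tends to overfit (high variance)
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(degree, mse)   # exact numbers vary; degree 4 should do best here
```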
ML models may be divided into two types:
– Parametric – uses a fixed number of parameters with respect to sample size
– Non-Parametric – uses a flexible number of parameters and doesn’t make particular assumptions about the data
3. Cross-Validation
estimates test error using a subset of the training data, and selects parameters to maximize average performance:
– k-fold – divide data into k groups, and use one to validate
– leave-p-out – use p samples to validate and the rest to train
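A short sketch of both splitters, assuming scikit-learn and a hypothetical 6-sample dataset:

```python
import numpy as np
from sklearn.model_selection import KFold, LeavePOut

X = np.arange(12).reshape(6, 2)   # hypothetical 6-sample dataset

# k-fold: split into k = 3 groups, validating on one group at a time
for train_idx, val_idx in KFold(n_splits=3).split(X):
    print("train:", train_idx, "validate:", val_idx)

# leave-p-out: validate on every subset of p = 2 samples
print(sum(1 for _ in LeavePOut(p=2).split(X)))   # C(6, 2) = 15 splits
```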
Reference: https://github.com/aaronwangy/Data-Science-Cheatsheet