- Topics:
- Zomato Data EDA
- Black Friday Dataset EDA and Feature Engineering
- Flight Price Prediction EDA and Feature Engineering
- Cost Function - Mean Squared Error
- Gradient Descent Formula
- Linear Regression Convergence Formula
- Learning Rate
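The usual forms behind these headings, written in the same $h_\theta$, $\alpha$, $m$ notation used later in these notes (standard textbook forms, stated here as an assumption rather than copied from the original figures):
$$J(\theta) = {1\over 2m}\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
$$\text{Repeat until convergence: } \theta_j := \theta_j - \alpha {\partial\over\partial\theta_j} J(\theta)$$
The learning rate $\alpha$ controls the step size: too small and convergence is slow, too large and the cost can oscillate or diverge.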
Problems in Linear Regression:
| Overfitting | Underfitting | Generalized model |
|---|---|---|
| Train Acc = 90% | Train Acc = 60% | Train Acc = 92% |
| Test Acc = 80% | Test Acc = 58% | Test Acc = 91% |
| Low Bias | High Bias | Low Bias |
| High Variance | Low Variance | Low Variance |
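A minimal scikit-learn sketch of how these train/test gaps are usually checked (the dataset and model below are hypothetical, only to illustrate the comparison):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data just to illustrate the train-vs-test comparison
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# R^2 on both splits: train >> test suggests overfitting (high variance),
# low scores on both suggest underfitting (high bias),
# similar high scores suggest a generalized model
print("Train score:", model.score(X_train, y_train))
print("Test score:", model.score(X_test, y_test))
```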
- Reduce number of features
  - Manually select which features to be used
  - Feature selection
- Regularization
  - Keep all features but reduce the magnitude of the parameters theta
  - Works well when we have a lot of features, each of which contributes a bit to the prediction
- Lasso Regression (L1 Regularization):
  - It adds the "absolute value of magnitude" of the coefficients as a penalty to the loss function
  - Lasso shrinks the less important features' coefficients to near zero, thus removing some features
  - Works as feature selection when we have a large number of features
- Ridge Regression (L2 Regularization):
  - It adds the "squared magnitude" of the coefficients as a penalty to the loss function
  - If lambda is very large then it will add too much weight and can lead to underfitting, so how lambda is chosen matters
  - This technique works very well to avoid over-fitting (see the sketch after this list)
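A minimal scikit-learn sketch of both penalties (the data is synthetic and `alpha` plays the role of lambda; all values here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data with many features, only a few of them informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

# alpha plays the role of lambda: larger alpha = stronger penalty
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to zero (implicit feature selection);
# L2 only shrinks them towards zero
print("Lasso zero coefficients:", (lasso.coef_ == 0).sum())
print("Ridge zero coefficients:", (ridge.coef_ == 0).sum())
```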
- Features follow a Normal (Gaussian) distribution:
  - The model trains well on normally distributed data
  - If any feature does not follow it, we should try a feature transformation
- Standard Scaling:
  - Scale all features onto the same scale
  - Usually we use the Z-score for scaling, so that mu = 0 and sigma = 1
- Linearity:
  - Data with a linear relation between the independent and dependent variables will give good results
  - If the relation is not linear, we can use a non-linear model
- Multicollinearity:
  - No feature should be correlated with another feature (independent variable)
  - Use the Variance Inflation Factor (VIF) to keep only important, non-multicollinear features (see the sketch after this list)
- Homoscedasticity (same variance):
  - The error is constant along the values of the dependent variable
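A minimal sketch of the VIF check mentioned above, assuming statsmodels is available (the feature names and rule-of-thumb threshold are illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature matrix; in practice use your own independent variables
X, _ = make_regression(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(5)])

# Add a constant column so each VIF is computed against an intercept-containing model
X_const = sm.add_constant(X)

# VIF per original feature: values above roughly 5-10 are commonly read as multicollinearity,
# and such features are candidates for removal
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])],
})
print(vif)
```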
- Why not Linear regression for classification?
- Linear regression would need a dynamic threshold (the best cut-off shifts as the fitted line changes), while classification problems call for a fixed threshold, so classification models are the better fit.
- Regression output can go beyond (0, 1), but in classification problems we focus on a binary output y ∈ {0, 1}.
- From Linear Regression we knew that
$$h_\theta(x) = \theta_0+\theta_1x$$
Let $z=\theta_0+\theta_1x$, then
$$h_\theta(x)=g(z)$$
where $g(z)$ is the sigmoid function, given as
$$h_\theta(x)={1\over1+e^{-z}}$$
$$h_\theta(x) = g(\theta^Tx) = {1\over1+e^{-(\theta_0+\theta_1x)}}$$
Here is a plot showing $g(z)$.
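A minimal NumPy sketch of $g(z)$, just to show the (0, 1) squashing behaviour described above:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# The output is always squashed into (0, 1), which is why it can be read as a probability
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx [0.0000454, 0.5, 0.9999546]
```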
- Logistic Regression Cost Function:
$$J(\theta)={1\over m}\sum_{i=1}^m cost(h_\theta(x^{(i)}),y^{(i)})$$
$$cost(h_\theta(x^{(i)}),y^{(i)}) = -y^{(i)} \log(h_\theta(x^{(i)})) - (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))$$
- Gradient Descent - Repeat until convergence:
$$\theta_j := \theta_j - \alpha {1\over m}\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
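A minimal NumPy sketch of this cost and update rule on a tiny made-up dataset (names like `learning_rate` and `n_iters` are illustrative, not from these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny synthetic binary dataset: a bias column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
learning_rate, n_iters = 0.1, 5000
m = len(y)

for _ in range(n_iters):
    h = sigmoid(X @ theta)              # h_theta(x) for every example
    gradient = (X.T @ (h - y)) / m      # (1/m) * sum((h - y) * x_j)
    theta -= learning_rate * gradient   # theta_j := theta_j - alpha * gradient

cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))  # cross-entropy cost J(theta)
print("theta:", theta, "cost:", cost)
```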
- When the focus is on both $FP$ and $FN$ we use $\beta = 1$ (Harmonic Mean):
$$F_1 = {2 \cdot Precision \cdot Recall \over Precision + Recall}$$
- When we want more weight on Precision and less weight on Recall we use $\mathbf{\beta < 1}$. Let $\beta = 0.5$:
$$F_{0.5} = {(1+0.5^2) \cdot Precision \cdot Recall \over 0.5^2 \cdot Precision + Recall} = {1.25 \cdot Precision \cdot Recall \over 0.25 \cdot Precision + Recall}$$
- When we want more weight on Recall and less weight on Precision we use $\mathbf{\beta > 1}$. Let $\beta = 2$:
$$F_2 = {(1+2^2) \cdot Precision \cdot Recall \over 2^2 \cdot Precision + Recall} = {5 \cdot Precision \cdot Recall \over 4 \cdot Precision + Recall}$$
Note - Above measures are used in Binary as well as Multi-Class Classification problems
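A minimal scikit-learn sketch of these measures on hypothetical predictions (the label arrays here are made up for illustration):

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical true labels and model predictions
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print("F1  :", f1_score(y_true, y_pred))               # beta = 1, balances FP and FN
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))   # weights Precision more
print("F2  :", fbeta_score(y_true, y_pred, beta=2))     # weights Recall more
```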
- Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems.
- It is called Naive Bayes (or "idiot Bayes") because the calculation of the probabilities for each class is simplified by the naive assumption that the features are conditionally independent.
- Before we dive into Bayes theorem, let’s review marginal, joint, and conditional probability.
- Marginal Probability: The probability of an event irrespective of the outcomes of other random variables.
  e.g. drawing a random card from a deck of cards: P(3 of diamonds) = 1/52
- Joint Probability: Probability of two (or more) simultaneous events, written P(A and B) or P(A, B).
  e.g. the probability of picking a card that is both red and a 6 is P(6 ⋂ red) = 2/52 = 1/26
- Conditional Probability: Probability of one (or more) event given the occurrence of another event, written P(A given B) or P(A | B).
  e.g. the probability that you get a 6, given that you drew a red card, is P(6 | red) = 2/26 = 1/13
- The conditional probability can be calculated using the joint probability, for example: P(A | B) = P(A ⋂ B) / P(B)
- Bayes Theorem: It is a way of calculating a conditional probability without the joint probability.
- P(A ∣ B) = P(A ⋂ B) / P(B) = P(A)⋅P(B ∣ A) / P(B)
where:
P(A)= The probability of A occurring
P(B)= The probability of B occurring
P(A∣B)=The probability of A given B
P(B∣A)= The probability of B given A
P(A⋂B)= The probability of both A and B occurring
The terms in the Bayes Theorem equation are given names depending on the context where the equation is used.
- P(A): Prior probability.
- P(B): Evidence.
- P(A|B): Posterior probability.
- P(B|A): Likelihood.
- Bayes Theorem can therefore be rewritten as:
Posterior = Likelihood * Prior / Evidence
E.g. What is the probability that there is fire given that there is smoke?
Where P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the evidence:
P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
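A minimal sketch tying this together: the fire/smoke rearrangement with made-up probabilities, then scikit-learn's Gaussian Naive Bayes on a small synthetic dataset (all numbers and names below are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Bayes Theorem by hand: Posterior = Likelihood * Prior / Evidence
# (the probabilities below are made up purely for illustration)
p_fire = 0.01              # Prior P(Fire)
p_smoke = 0.10             # Evidence P(Smoke)
p_smoke_given_fire = 0.90  # Likelihood P(Smoke | Fire)
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print("P(Fire | Smoke) =", p_fire_given_smoke)  # 0.09

# Naive Bayes as a classifier (Gaussian variant for continuous features)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = GaussianNB().fit(X, y)
print("Class probabilities for the first sample:", model.predict_proba(X[:1]))
```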









