Let’s start with the basics and define what regression is. Regression is a method used to determine the strength and character of the relationship between one dependent variable (y) and one or more other variables, known as independent variables (x).
When there is a single independent variable (x), the method is referred to as simple linear regression. When there are multiple independent variables, the method is known as multiple linear regression.
The general form of the linear regression model is:
y = m₁x₁ + m₂x₂ + m₃x₃ + . . . + mₙxₙ + c + e
- y = Regressed / dependent variable
- x = Independent / explanatory variable
- c = Intercept
- e = Random / stochastic error
1. The expression y = mx + c + e states that an individual value yᵢ is equal to the mean of the population of which it is a member (mxᵢ + c, the mean of y for that value of x) plus the random error term ‘e’.
2. ‘c’ is known as the intercept, i.e. the value of the dependent variable y when all independent variables are 0.
3. m₁ … mₙ are known as the slope coefficients; each reflects the change in y for a one-unit change in the corresponding independent variable, with the other variables held constant.
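As a small illustration of the general form, the sketch below evaluates the deterministic part of the model (everything except the error term e). The function name `predict` and the coefficient values are made up for this example and do not come from any fitted model.

```python
def predict(x, m, c):
    """Return m1*x1 + m2*x2 + ... + mn*xn + c for one observation."""
    if len(x) != len(m):
        raise ValueError("x and m must have the same length")
    return sum(mi * xi for mi, xi in zip(m, x)) + c


# Hypothetical slope coefficients, intercept, and one observation.
m = [2.0, -1.0, 0.5]
c = 4.0
x = [1.0, 3.0, 2.0]

y_hat = predict(x, m, c)

# With every independent variable at 0, the prediction is the intercept c.
y_at_zero = predict([0.0, 0.0, 0.0], m, c)

# Raising x1 by one unit, with the others held fixed, changes y by m1.
x_plus = [x[0] + 1.0, x[1], x[2]]
delta = predict(x_plus, m, c) - y_hat
```

Here `y_at_zero` equals the intercept and `delta` equals the first slope coefficient, which matches points 2 and 3 above.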
Assumptions behind Linear Regression
1. y is linearly related to x: the rate of change of y with respect to x (i.e. the slope, or the derivative dy/dx) is independent of the value of x.
- If y = 10x then dy/dx = 10, which is constant for all values of x, so the relationship is linear.
- If y = 10x² then dy/dx = 20x, i.e. the rate of change of y depends on the current value of x, so the relationship is not linear.
2. ‘e’ is the collective representation of the impact of all the other attributes not considered in this relationship between y and x.
3. x is independent of the error term ‘e’: the error should not increase or decrease as x increases or decreases, and the error in prediction in each trial is independent of x.
4. The mean of the error in prediction is zero for any given value of x.
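The linearity condition can be checked numerically for the two examples, y = 10x and y = 10x². The sketch below estimates dy/dx with a central difference; the helper name `slope` and the step size `h` are choices made for this illustration.

```python
def slope(f, x, h=1e-6):
    """Central-difference estimate of dy/dx at the point x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def linear(x):
    return 10 * x


def quadratic(x):
    return 10 * x ** 2


points = [1.0, 2.0, 5.0]

# For y = 10x the slope is about 10 at every point.
linear_slopes = [slope(linear, x) for x in points]

# For y = 10x² the slope is about 20x, so it changes with x.
quadratic_slopes = [slope(quadratic, x) for x in points]
```

The first list stays constant while the second grows with x, which is why only the first relationship counts as linear.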
Preparing Data For Linear Regression
1. Linear Assumption: Linear regression assumes that the relationship between the independent variables and the dependent variable is linear. It does not model anything else, so a non-linear relationship may need a transformation (for example, a log transform) before it can be fitted well.
2. Remove Noise: Linear regression assumes that your independent and dependent variables are not noisy. Consider using data cleaning operations that better expose and clarify the signal in the data, and remove outliers in the dependent variable (y) where possible.
3. Gaussian Distributions: Linear regression will make more reliable predictions if the independent and dependent variables have a Gaussian distribution.
4. Rescale Independent Variables: Linear regression will often make more reliable predictions if the independent variables are rescaled using standardization or normalization.
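The two rescaling options in the last point can be sketched as follows. This is a minimal version, assuming "standardization" means subtracting the mean and dividing by the standard deviation, and "normalization" means min-max scaling to the range 0 to 1; the sample values are invented.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; zero for a constant column
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values for one independent variable.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_scores = standardize(heights_cm)
scaled = normalize(heights_cm)
```

Both helpers divide by zero if every value in the column is identical, so a real pipeline would need to handle that case.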
How to find the best-fit line in Linear Regression?
- Use the least squares method to fit the line.
- Calculate R².
- Calculate the p-value for R².
The residuals are the differences between the real data and the line. The sum of squared residuals is the sum of the squares of these values.
y = ax + b : slope (a), intercept (b)
We want to find the optimal values of ‘a’ and ‘b’ so that we minimize the sum of squared residuals:
((ax₁ + b) − y₁)² + ((ax₂ + b) − y₂)² + . . .
- (ax₁ + b) is the value of the line at x₁ and y₁ is the observed value; their difference is the distance between the line and the actual data point.
- The derivative is used to find the optimal slope and intercept: the sum of squared residuals is smallest where its derivative with respect to ‘a’ and ‘b’ is zero. This point can be solved for directly or approached iteratively.
- The distance from the line to a data point is known as the residual.
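For a single independent variable, setting the derivatives of the sum of squared residuals to zero gives a closed-form answer: the slope is the sum of (x − x̄)(y − ȳ) divided by the sum of (x − x̄)², and the intercept is ȳ minus the slope times x̄. The sketch below applies those formulas; the function names and the data points are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals for y = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of ((a*x + b) - y)² over all data points."""
    return sum(((a * x + b) - y) ** 2 for x, y in zip(xs, ys))


# Made-up observations that lie close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

a, b = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Nudging the slope or the intercept away from the fitted values
# can only increase the sum of squared residuals.
worse_slope = sum_squared_residuals(xs, ys, a + 0.1, b)
worse_intercept = sum_squared_residuals(xs, ys, a, b - 0.1)
```

For these points the fitted slope is about 1.96 and the intercept about 0.22, and both perturbed lines have a larger sum of squared residuals than the fitted one.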
Variance: variance is the expectation of the squared deviation of a variable from its mean. Informally, it measures how far numbers are spread out from their average value. It is computed from the following quantities:
- X = the value of one observation
- X̄ = the mean value of all observations
- n = the number of observations
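A minimal sketch of the calculation using X, X̄ and n follows. The text here does not show which divisor it uses, so this example assumes the population form (divide by n) and also shows the sample form (divide by n − 1) from Python's `statistics` module for comparison. The data values are invented.

```python
from statistics import mean, pvariance, variance


def population_variance(values):
    """Average squared deviation from the mean, dividing by n."""
    x_bar = mean(values)
    n = len(values)
    return sum((x - x_bar) ** 2 for x in values) / n


# Made-up observations with mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = population_variance(data)  # divides by n
library_pop_var = pvariance(data)    # same quantity from the standard library
sample_var = variance(data)          # divides by n - 1 instead
```

The sample variance is slightly larger than the population variance because it divides the same sum of squared deviations by n − 1.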
R-Squared (R²)
R² represents the proportion of the variance of the dependent variable that is explained by the independent variable(s) in the regression. It is also known as the coefficient of determination.
For a least-squares fit with an intercept, evaluated on the data it was fitted to, R-squared is always between 0% and 100%:
- R² of 0% represents a model that does not explain any of the variation in the dependent variable around its mean. In that case, changes in the independent variables are not associated with changes in the dependent variable.
- R² of 100% represents a model that explains all of the variation in the response variable around its mean.
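The two extremes can be reproduced with the usual definition R² = 1 − (sum of squared residuals) / (total sum of squares around the mean). The sketch below assumes that definition; the function name and the data are illustrative.

```python
from statistics import mean


def r_squared(ys, predictions):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of ys."""
    y_bar = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


ys = [2.0, 4.0, 6.0, 8.0]

# Predictions that match the data exactly explain all of the variation.
perfect = r_squared(ys, [2.0, 4.0, 6.0, 8.0])

# Always predicting the mean explains none of the variation.
mean_only = r_squared(ys, [5.0, 5.0, 5.0, 5.0])
```

`perfect` comes out as 1.0 (100%) and `mean_only` as 0.0 (0%).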
R-squared has Limitations
- We cannot use R² to determine whether the coefficient estimates and predictions are biased.
- R² does not indicate if a regression model provides an adequate fit to our data. A good model can have a low R² value. On the other hand, a biased model can have a high R² value.
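To make the last point concrete, the sketch below fits a straight line to data that actually follows y = x². The data and helper names are invented for this example. R² comes out high even though the residuals show a clear pattern, which is the sign of a biased model.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    return a, y_bar - a * x_bar


# The true relationship is curved, not linear.
xs = [float(i) for i in range(11)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
predictions = [a * x + b for x in xs]
residuals = [y - p for y, p in zip(ys, predictions)]

y_bar = mean(ys)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
```

Here `r2` is roughly 0.93, yet the residuals are positive at both ends of the range and negative in the middle, so the line systematically under- and over-predicts. A residual plot would reveal this; R² alone does not.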
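The p-value measures the evidence against a null hypothesis: it is the probability of seeing a result at least as extreme as the one observed if the null hypothesis were true. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis. P-values are expressed as decimals, although they may be easier to understand if you convert them to a percentage; for example, a p-value of 0.03 is 3%.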
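One way to get a feel for a p-value for R² is a permutation test, sketched below. This is a substitute technique chosen because it needs only the standard library; the p-value reported by regression software is normally based on an F-test instead. The null hypothesis here is that x and y are unrelated, and the function names, data, number of permutations, and seed are all assumptions of this example.

```python
import random
from statistics import mean


def r_squared_of_fit(xs, ys):
    """R² of the least-squares line through the points (squared correlation)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of shuffled datasets whose R² is at least as large as the observed R²."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = r_squared_of_fit(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if r_squared_of_fit(xs, shuffled) >= observed:
            hits += 1
    # Count the observed arrangement itself so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Made-up data with a strong linear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.0, 21.2]

p_value = permutation_p_value(xs, ys)
```

Because almost no random shuffling of y produces an R² as high as the real data, `p_value` is very small, which is strong evidence against the null hypothesis that x and y are unrelated.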