Frequently asked questions on linear regression in a Data Science Interview!
Introduction
Linear Regression is a supervised machine learning technique and one of the most commonly used algorithms for predictive analysis. In this article, we will discuss the most important questions on Linear Regression.
Let’s get started!
1. What is the Linear Regression algorithm?
It is a supervised machine learning algorithm that models the linear relationship between the independent and dependent variables. The coefficients are estimated by minimizing the sum of squared residuals, either analytically with the ordinary least squares (OLS) method or iteratively with gradient descent.
2. How to interpret the slope coefficients and intercept in a Linear Regression?
Slope coefficient (β1): It represents the estimated change in y for every one-unit increase in x1, keeping the other variables constant.
Intercept (β0): It represents the estimated value of y when all the independent variables are set to zero.
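To make this concrete, here is a minimal sketch (on made-up data, assuming NumPy and scikit-learn are available) that fits an OLS model and prints the estimated slope coefficients and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # two independent variables x1, x2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)                 # OLS fit
print("slope coefficients (beta1, beta2):", model.coef_)   # close to [2.0, -1.5]
print("intercept (beta0):", model.intercept_)               # close to 3.0
```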
3. What are the assumptions of an OLS linear regression model?
Assumption 1: The regression model is linear in the coefficients and the intercept term
Statistically, a linear regression equation should be linear in the parameters. However, an equation can be linear in the parameters while still containing squares of the independent variables.
Linear models can also contain log and inverse terms and yet continue to be linear in parameters.
Assumption 2: The error term has a population mean of zero
The error term accounts for the variation in the dependent variable that the independent variables do not explain. Its value should be determined by random chance. An average error of zero ensures that the model is unbiased, as it forces the mean of the residuals to be zero.
However, if the constant term is excluded from the regression model, the fitted line is forced through the origin, i.e. the dependent variable must be zero when all the independent variables are zero; if that is not realistic for the study, the estimates might be biased.
Assumption 3 : All independent variables are uncorrelated with the error term
When an independent variable correlates with the error term then the coefficient estimate might overestimate or underestimate the strength of an effect, and the estimated coefficients might have incorrect signs.
Assumption 4: The error terms are uncorrelated with each other
In time series models, if the error term for one observation increases the chance that the following error term is also positive (or also negative), the problem is known as serial correlation or autocorrelation. Serial correlation reduces the precision of the OLS estimates.
Assumption 5: The error term has a constant variance (no heteroscedasticity)
It is assumed that the residual errors have the same variance. This assumption is known as homoscedasticity. The easiest way to check for this assumption is to create a residual versus fitted value plot.
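As a quick illustration, the following sketch (assuming NumPy, scikit-learn and matplotlib; the data is made up) builds such a residual-versus-fitted plot. A roughly horizontal band around zero with no funnel shape suggests homoscedasticity:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 1.0 + 2.5 * X[:, 0] + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")   # reference line at zero residual
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```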
Heteroscedasticity does not cause bias in the coefficient estimates. However, it reduces the precision of the estimates in OLS linear regression as it increases the variance of the coefficient estimates.
Assumption 6: No independent variable is a perfect linear function of other explanatory variables
Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of +1 or -1. For example, games won and games lost have a perfect negative correlation (-1). If a model contains independent variables with perfect correlation then the statistical software cannot fit the model.
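One common diagnostic for (near-)perfect linear relationships among predictors is the variance inflation factor (VIF). It is not the only option, but the hedged sketch below (assuming pandas and statsmodels, with made-up column names) shows how a near-perfectly collinear pair such as games won and games lost produces an enormous VIF:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
games_won = rng.integers(0, 83, size=50).astype(float)
df = pd.DataFrame({
    "games_won": games_won,
    # almost exactly 82 - games_won; a tiny bit of noise keeps the arithmetic finite
    "games_lost": 82 - games_won + rng.normal(scale=0.01, size=50),
    "points_scored": rng.normal(100, 10, size=50),
})

X = sm.add_constant(df)                              # VIF is computed with an intercept
for i, col in enumerate(X.columns[1:], start=1):     # skip the constant column
    print(col, variance_inflation_factor(X.values, i))
# games_won and games_lost get enormous VIFs; points_scored stays close to 1
```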
Assumption 7: The error term is normally distributed (optional)
OLS linear regression does not require the error term to follow a normal distribution. However, this assumption is needed for performing statistical hypothesis tests and generating reliable confidence intervals.
4. What are the evaluation metrics in Linear Regression?
The evaluation metrics of linear regression are mean absolute error, mean squared error, root mean squared error, R-squared and adjusted R-squared.
- The Mean Absolute Error (MAE) represents the average of the absolute differences between the actual and predicted values in the dataset. It measures the average magnitude of the residuals.
- The Mean Squared Error (MSE) represents the average of the squared differences between the actual and predicted values in the dataset. It measures the variance of the residuals.
- The Root Mean Squared Error (RMSE) is the square root of the Mean Squared Error. It measures the standard deviation of the residuals.
- The coefficient of determination, or R-squared, represents the proportion of the variance in the dependent variable that is explained by the linear regression model. It is a scale-free score, i.e. irrespective of the values being small or large, the value of R-squared will be at most one.
- Adjusted R-squared is a modified version of R-squared that is adjusted for the number of independent variables in the model, and it will always be less than or equal to R². It is given by Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1), where n is the number of observations in the data and k is the number of independent variables.
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) penalize large prediction errors more heavily than Mean Absolute Error (MAE). However, RMSE is more widely used than MSE to evaluate and compare regression models because it has the same units as the dependent variable (Y-axis). MSE is a differentiable function, which makes mathematical operations easier than with a non-differentiable function like MAE. Therefore, in many models RMSE is used as the default metric for the loss function, despite being harder to interpret than MAE.
R-squared and adjusted R-squared explain how well the independent variables in the linear regression model account for the variability in the dependent variable. The R-squared value always increases when independent variables are added, which might lead to redundant variables in our model. The adjusted R-squared solves this problem: it takes the number of predictor variables into account and can be used to decide how many independent variables to keep. The value of adjusted R-squared decreases if the increase in R-squared from an additional variable is not significant enough.
The RMSE tells how well a regression model can predict the value of a response variable in absolute terms while R- Squared tells how well the predictor variables can explain the variation in the response variable.
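As a quick reference, here is a minimal sketch (assuming scikit-learn and NumPy, with made-up predictions) that computes these metrics; adjusted R-squared is derived manually from R-squared using the formula above, since scikit-learn does not expose it directly:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 10.6])

n, k = len(y_true), 2                      # k = assumed number of independent variables
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} Adjusted R2={adj_r2:.3f}")
```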
5. What are the disadvantages of the linear regression Algorithm?
Linear regression is sensitive to outliers, and the model assumes that a linear relationship exists between the independent and dependent variables; therefore, it cannot fit complex, non-linear problems well.
6. What is the difference between Correlation and Regression?
Correlation and regression are useful for describing the type and degree of association or relationship between two continuous quantitative variables. The former is used to investigate the strength of the relationship between two variables, while the latter establishes a functional relationship between two variables in order to make future predictions. Regression tries to find the best-fit line (or curve) to predict the value of the dependent variable from the independent variables.
In addition, the correlation coefficient remains the same if X and Y are swapped, while in regression X and Y cannot be swapped, as the function that best predicts Y from X may not be the best predictor of X from Y.
7. Do correlation and regression imply causation?
Correlation and regression do not imply causation. There are instances of unrelated data that pass all the statistical tests because a spurious relationship exists between the variables; hence we cannot conclude that correlation or regression implies causation.
8. What is Gradient Descent?
Gradient Descent is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning algorithms like linear regression to minimize the loss function. An arbitrary point is chosen as the starting point, and from that point we compute the derivative to measure the steepness of the slope. In the beginning the slope will be steeper, but as new parameters are generated, the steepness should gradually reduce until the algorithm reaches the point of convergence.
The goal of gradient descent is to minimize the error/loss between the predicted and actual y. It requires a direction and a learning rate (also referred to as the step size or alpha) to arrive at the local or global minimum. A high learning rate takes larger steps but risks overshooting the minimum, while a low learning rate takes smaller, more precise steps but needs more time and computation to reach the minimum.
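The following bare-bones sketch (NumPy only, with made-up data; the learning rate and iteration count are illustrative choices) shows gradient descent minimizing the mean squared error of a simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 4.0 + 3.0 * x + rng.normal(scale=1.0, size=200)   # true intercept 4.0, slope 3.0

b0, b1 = 0.0, 0.0          # arbitrary starting point
lr = 0.01                  # learning rate (step size / alpha)

for _ in range(5000):
    error = (b0 + b1 * x) - y
    # gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(f"intercept ~ {b0:.2f}, slope ~ {b1:.2f}")      # should approach 4.0 and 3.0
```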
9. What are the types of Gradient Descent?
Batch gradient descent, stochastic gradient descent and mini-batch gradient descent are three types of gradient descent learning algorithms.
Batch gradient descent - It sums up the error for every point in the training set, and the model is updated only after all the training instances have been evaluated. It is computationally efficient, but it can take a long time for large training datasets as all the data has to be held in memory. Batch gradient descent sometimes finds a local minimum rather than the global minimum.
Stochastic gradient descent - It updates the parameters for one training example at a time. These constant parameter updates can offer more speed, but they can reduce computational efficiency compared to batch gradient descent. The frequent updates also produce noisy gradients, which can be helpful for escaping a local minimum and finding the global one.
Mini-batch gradient descent - It combines ideas from both batch and stochastic gradient descent: it splits the training dataset into small batches and performs an update on each of those batches. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
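As a rough sketch of the mini-batch variant (NumPy only, with made-up data and an illustrative batch size), the data is shuffled every epoch and the parameters are updated once per batch; setting the batch size to 1 would turn this into stochastic gradient descent, and setting it to the full dataset size would turn it into batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=500)
y = 4.0 + 3.0 * x + rng.normal(scale=1.0, size=500)

b0, b1, lr, batch_size = 0.0, 0.0, 0.01, 16

for _ in range(200):                          # epochs
    order = rng.permutation(len(x))           # shuffle the data once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (b0 + b1 * x[idx]) - y[idx]
        b0 -= lr * 2 * error.mean()           # one update per mini-batch
        b1 -= lr * 2 * (error * x[idx]).mean()

print(f"intercept ~ {b0:.2f}, slope ~ {b1:.2f}")
```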
10. What are the challenges in Gradient Descent?
For convex problems, gradient descent can find the global minimum with ease; for non-convex problems, however, it may get stuck in a local minimum instead of reaching the global one. Noisy gradients, as in stochastic gradient descent, can help escape local minima.
11. How to determine whether the linear regression algorithm is suitable for a given dataset?
A scatter plot can be made in the case of simple (univariate) linear regression. In the case of multivariate linear regression, two-dimensional pair-wise scatter plots or rotating plots can be used instead.
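For example, a pair-wise scatter plot matrix can be drawn with seaborn. The sketch below assumes seaborn, pandas and matplotlib are available, and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "area": rng.uniform(50, 200, size=100),
    "rooms": rng.integers(1, 6, size=100),
})
df["price"] = 20 + 1.5 * df["area"] + 10 * df["rooms"] + rng.normal(scale=15, size=100)

sns.pairplot(df)      # scatter plot of every pair of variables, distributions on the diagonal
plt.show()
```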
12. How to interpret a Q-Q plot in linear regression?
A Q-Q plot plots two sets of quantiles against each other. The theoretical quantiles of the standard normal variate (a normal distribution with mean zero and standard deviation equal to one) are plotted on the x-axis, and the ordered values of the random variable whose normality we want to check are plotted on the y-axis. A 45-degree reference line is drawn on the Q-Q plot; if the two data sets come from the same distribution, the points will fall on that reference line. In linear regression, the Q-Q plot helps check the normality of the residual/error term.
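A minimal sketch of such a plot (assuming SciPy and matplotlib; the residuals here are simulated rather than taken from a real model) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(6)
residuals = rng.normal(loc=0.0, scale=1.0, size=300)   # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)       # quantiles plus 45-degree reference line
plt.title("Q-Q plot of residuals")
plt.show()
```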
13. When is Gradient Descent method preferred instead of OLS in Linear Regression Algorithm?
Gradient descent needs hyper-parameter tuning, and it is an iterative process. It is preferred in the case of extremely large datasets, while the normal equation (OLS) is a non-iterative process that is ideal for small training datasets.
14. Is Feature Scaling required in Machine Learning?
Feature scaling is one of the most critical steps in pre-processing data for a machine learning (ML) model: without scaling, ML models that calculate distances between data points give higher weight to features with a larger value range. Also, algorithms like gradient descent converge faster with scaling, as θ descends quickly on small ranges and slowly on large ranges.
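As a small illustration (assuming scikit-learn, with made-up feature values), standardization rescales each feature to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1200.0, 2],      # e.g. area in square feet, number of rooms
              [1500.0, 3],
              [ 800.0, 1],
              [2000.0, 4]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)
```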
End Notes
Thanks for reading the article, and I hope it helps you clear the data science interview questions on Linear Regression, one of the most commonly used algorithms in machine learning. Please feel free to contact me at akshita.chugh024@gmail.com and comment below in case you have any questions.