Regression analysis is one of the most commonly used tools for finding a relationship (linear or non-linear) between a response and one or more predictors, and for using that relationship to predict the expected value of the response at given values of the predictor(s) as accurately as possible.

In today’s data-dominated world, technology and data tracking systems let us capture and store massive amounts of data. For example, a car company like Honda can collect an enormous amount of data on different aspects of its business and its environment, such as sales, stock market prices, and crude oil prices. The next step after data collection is data analysis and interpretation. In the Honda example, how the economy of a country affects car sales, and vice versa, is an important question. To answer it, the industry needs to understand the relationship between car sales and various macroeconomic factors.

It also needs to be determined objectively whether these variables are directly related at all, or whether the relationship is driven by one or more unobserved factors. In both cases, what is required is an explicit relationship between the response (the variable of interest) and the predictors (the other variables used to explain it). Such a relationship helps clarify how the variables depend on one another.

Another very important aspect of data analysis, particularly in the above example, is prediction. Honda would like to predict car sales for the future, given the expected measure of economic growth or recession that year, so that it can make informed decisions regarding its business. For this objective, one needs to form a model which explicitly relates the response to the predictors.

Regression is one of the simplest statistical tools used to analyze the dependence of a response on one or more predictors. Two important terms related to regression are given below:

·       Dependent variable or Response: The variable of interest that one wants to model or forecast using one or more other variables whose values are known.

·       Independent variable(s) or Predictor(s): The variables on which the response is presumed to depend. Because of this dependence, a model is developed to show the explicit connection between the response and the predictor(s).
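In symbols, the connection between the response and the predictor(s) described above is commonly written as a linear model. The notation is an assumption made here for illustration and is not taken from the text: y is the response, x_1 to x_p are the predictors, the β terms are unknown coefficients to be estimated, and ε is a random error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

With p = 1 the model has a single predictor; with p > 1 it has several.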

Types of Regression:

Two types of linear regression analysis can be performed:

1.       Simple Linear Regression: The response is assumed to depend linearly on a single predictor.

2.       Multiple Linear Regression: The response is assumed to depend linearly on multiple predictors.

Example

A wine manufacturer wants to invest in new technologies to improve its wine quality. Wine quality depends directly on the amount of alcohol in the wine and on its smoothness, which in turn are controlled by different chemicals either added deliberately during the manufacturing process or produced through chemical reactions. Wine certification and quality assessment are key elements in grading and pricing wine. Wine certification is based on a variety of physicochemical components. The company therefore wants to estimate the percentage (%) of alcohol in a bottle of wine as a function of the wine's chemical components. The requirement can be stated as:

Regress alcohol percentage on the chemical components present in the wines.
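To make "regress the response on several predictors" concrete, here is a minimal sketch of a least-squares fit through the normal equations. The function names are hypothetical, and the numbers are made up for illustration (they are not the wine data). The response is generated without noise from known coefficients so that the fit can be checked by eye.

```python
# Minimal sketch: multiple linear regression via the normal equations.
# The numbers below are made up for illustration; they are NOT wine data.


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Use the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Return [intercept, b1, ..., bp] minimizing the squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 = intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors (two columns) and a response built as
# y = 1 + 2*x1 - 3*x2, so the fit should recover roughly [1, 2, -3].
rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (3.0, 3.0), (4.0, 2.0)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, y)
print(coefficients)
```

For the wine problem, each row would hold the chemical measurements of one wine and y would hold its alcohol percentage. In practice a statistics library would normally be used instead of hand-written elimination.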

Data Description (also known as Data Dictionary)

·       Fixed Acidity (FA): Number of grams of tartaric acid per cubic decimeter (g(tartaric acid)/dm³)

·       Volatile Acidity (VA): Number of grams of acetic acid per cubic decimeter (g(acetic acid)/dm³)

·       Citric Acid (CA): Number of grams of citric acid per cubic decimeter (g/dm³)

·       Residual Sugar (RS): Number of grams of residual sugar per cubic decimeter (g/dm³)

·       Chlorides: Number of grams of sodium chloride per cubic decimeter (g(sodium chloride)/dm³)

·       Free Sulphur Dioxide (FSD): Number of milligrams of free sulphur dioxide per cubic decimeter (mg/dm³)

·       Total Sulphur Dioxide (TSD): Number of milligrams of total sulphur dioxide per cubic decimeter (mg/dm³)

·       Density: Number of grams per cubic centimeter (g/cm³)

·       pH: Acidity on the pH scale, which runs from 0 to 14

·       Sulphates: Number of grams of potassium sulphate per cubic decimeter (g(potassium sulphate)/dm³)

·       Brand (categorical): Three different brands of wine are considered, where 1 represents “Grover Zampa”, 2 represents “Fratelli” and 3 represents “Sula”

·       Alcohol (response): The percentage volume of alcohol in wine (% vol.)
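Because Brand is categorical, its codes 1, 2 and 3 are labels rather than quantities. One common way to use such a variable in a regression is to turn it into 0/1 indicator columns. The sketch below assumes that approach; the text does not say how Brand is handled, and the function name, the column names, and the choice of brand 1 as the baseline are hypothetical.

```python
# Brand codes as listed in the data dictionary.
BRANDS = {1: "Grover Zampa", 2: "Fratelli", 3: "Sula"}


def dummy_code(brand_code, baseline=1):
    """Return 0/1 indicators for every brand except the baseline.

    Dropping one brand (the baseline) avoids an indicator set that is
    perfectly collinear with the intercept column.
    """
    if brand_code not in BRANDS:
        raise ValueError(f"unknown brand code: {brand_code}")
    return {
        f"is_{BRANDS[code]}": int(brand_code == code)
        for code in sorted(BRANDS)
        if code != baseline
    }


encoded = [dummy_code(code) for code in (1, 2, 3)]
for row in encoded:
    print(row)
```

With this coding, the coefficient of each indicator is read as the difference from the baseline brand, holding the other predictors fixed.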

A regression model of alcohol on the predictors can be built, but it is vital first to explore whether there is any dependence among the observed variables. Start by collecting the data for each of the brands.

ID  Brand     FA   VA    CA    RS   chloride  FSD   TSD   density  pH    sulphate  alcohol
1   Grover    7.4  0.70  0.00  1.9  0.076     11.0  34.0  0.9978   3.51  0.56      9.4
2   Fratelli  7.8  0.88  0.00  2.6  0.098     25.0  67.0  0.9968   3.20  0.68      9.8
3   Sula      7.8  0.76  0.04  2.3  0.092     15.0  54.0  0.9970   3.26  0.65      9.8

Then analyze and summarize the variable data.

            count    mean     std    min    25%     50%     75%     max
FA         1599.0    8.32    1.74   4.60   7.10    7.90    9.20   15.90
VA         1599.0    0.53    0.18   0.12   0.39    0.52    0.64    1.58
CA         1599.0    0.27    0.19   0.00   0.09    0.26    0.42    1.00
RS         1599.0    2.54    1.41   0.90   1.90    2.20    2.60   15.50
chloride   1599.0    0.09    0.05   0.01   0.07    0.08    0.09    0.61
FSD        1599.0   15.87   10.46   1.00   7.00   14.00   21.00   72.00
TSD        1599.0   46.47   32.90   6.00  22.00   38.00   62.00  289.00
density    1599.0    1.00    0.00   0.99   1.00    1.00    1.00    1.00
pH         1599.0    3.31    0.15   2.74   3.21    3.31    3.40    4.01
alcohol    1599.0   10.42    1.07   8.40   9.50   10.20   11.10   14.90
sulphate   1599.0    0.66    0.17   0.33   0.55    0.62    0.73    2.00
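A summary of this kind can be sketched with Python's standard library. The example below uses only the FA and alcohol values from the three sample rows shown earlier, so its numbers will not match the table above, which covers all 1599 observations. The function name is hypothetical, and the use of the sample standard deviation and of "inclusive" quartiles is an assumption about how the table was produced.

```python
import statistics

# FA and alcohol values from the three sample rows shown earlier.
sample = {
    "FA": [7.4, 7.8, 7.8],
    "alcohol": [9.4, 9.8, 9.8],
}


def summarize(values):
    """Return the same statistics as the columns of the summary table."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


summary = {name: summarize(values) for name, values in sample.items()}
for name, stats in summary.items():
    print(name, stats)
```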

Next, find the correlation between the different variables.

[Figures: Correlation between alcohol and FA; Correlation between alcohol and pH; Correlation between alcohol and FA]
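As a rough illustration of what such a correlation measures, the sketch below computes the Pearson correlation coefficient by hand. It assumes that Pearson's coefficient is the measure intended, and it uses only the three sample rows shown earlier, so the values say nothing reliable about the full data set. The function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Values from the three sample rows shown earlier.
alcohol = [9.4, 9.8, 9.8]
fa = [7.4, 7.8, 7.8]
ph = [3.51, 3.20, 3.26]

r_fa = pearson_r(alcohol, fa)
r_ph = pearson_r(alcohol, ph)
print(f"alcohol vs FA: {r_fa:.3f}")
print(f"alcohol vs pH: {r_ph:.3f}")
```

A value near +1 or -1 indicates a strong linear association and a value near 0 a weak one. With only three rows, an extreme value is easy to obtain by chance.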

Using all this data, the regression of alcohol on a chosen variable is then calculated. In this example, the regression of alcohol (in percentage) on density is calculated and plotted on a chart.
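The calculation behind a regression of alcohol on density can be sketched with the closed-form least-squares formulas for one predictor. The example uses only the three sample rows shown earlier, so the coefficients are illustrative and are not those of the full 1599-row fit. The function name is hypothetical.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the predictor must take at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return intercept, slope


# Density and alcohol values from the three sample rows shown earlier.
density = [0.9978, 0.9968, 0.9970]
alcohol = [9.4, 9.8, 9.8]

intercept, slope = fit_simple_regression(density, alcohol)
fitted = [intercept + slope * d for d in density]
residuals = [a - f for a, f in zip(alcohol, fitted)]
print(f"alcohol = {intercept:.2f} + ({slope:.2f}) * density")
```

On these rows the slope is negative: the wine with the highest density has the lowest alcohol percentage. The residuals (observed minus fitted values) sum to zero up to rounding, which is a property of any least-squares line that includes an intercept.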