Multiple Linear Regression using Python

Jan 15, 2021

In linear regression, one of the most common ways to train a model is Multiple Linear Regression (MLR). Before we start coding our model, let’s understand a bit more about MLR.

Linear Regression:

As the name suggests, Linear Regression is a linear approach to modeling the relationship between independent and dependent variables.

Multiple Linear Regression:

MLR is also a linear approach, but instead of a single independent variable, as in simple linear regression, it uses multiple independent variables with one dependent variable.

Now that you know a little more about multiple linear regression, I will divide this article into seven coding stages:

Importing the libraries
Importing dataset
Encoding the categorical data
Splitting the data into training and testing sets
Training MLR model
Predicting the result
MLR equation

I will provide the link for the code at the end…

Happy Coding!!!

1. Importing the libraries:

We will use Pandas, NumPy, and Scikit-learn.

Pandas is used for data analysis, NumPy provides an array type that is faster than Python lists for numerical work, and Scikit-learn provides machine learning tools for Python.

Now that you know why we use these libraries, it’s time to import them:

# importing the libraries
import numpy as np
import pandas as pd

2. Importing the dataset:

Before you import your data into a code cell, you need to study it and understand why you are using this particular dataset.

So, before importing it, let’s understand what this data is about.

You can download the dataset from here: https://drive.google.com/file/d/1HRhBI6IDGTX1Tepaw8OIOQc6Ta8Uu0pI/view?usp=sharing

The dataset contains information about bestseller books along with their categories.

The dataset contains seven columns and more than 500 rows. The first column is ‘Name’, the second is ‘Author’, and so on up to the seventh, ‘Price’. The first thing we will do is separate the dataset into independent and dependent variables, so you need to understand which columns are independent and which one is dependent. Sometimes you have to drop a few columns because they carry no useful information for the model; in this dataset, those are the name of the book and the author’s name.
So, we drop the first two columns. The independent variables are columns C to F of the spreadsheet (the third to sixth columns), and the dependent variable is column G, the price of a book.

# importing the dataset
dataset = pd.read_csv('name_of_dataset.csv')
# dropping the unnecessary columns
dataset.drop(['Name', 'Author'], axis=1, inplace=True)
# independent variables (every column except the last)
X = dataset.iloc[:, :-1].values
# dependent variable (the last column, as a 1-D array)
y = dataset.iloc[:, -1].values

3. Encoding the categorical data:

Now that we have separated our dataset into independent and dependent variables, next we will check if our independent data has any missing values, or if it contains any string data.
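One way to run this check is to ask pandas for the number of missing values and the data type of each column. The sketch below uses a small made-up data frame; its column names and values are assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for the book data; columns and values are made up.
sample = pd.DataFrame({
    "User Rating": [4.7, 4.6, 4.8],
    "Reviews": [17350, 2052, 18979],
    "Genre": ["Non Fiction", "Fiction", "Fiction"],
    "Price": [8, 22, 15],
})

# Number of missing values in each column (all zeros means nothing is missing).
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Columns that are not numeric hold string data and will need encoding.
string_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]
print(string_columns)
```

On your own data, replace sample with the data frame returned by pd.read_csv().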

The above dataset doesn't have any missing values, but it does have string data. So we will convert the string data into binary (one-hot encoded) columns, because the model can only work with numbers.

We will use the scikit-learn library and its modules to encode the categorical data.

# importing the scikit-learn classes
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

After importing the classes, we will create an object of the class ColumnTransformer. It takes two parameters: transformers and remainder. The transformers parameter takes a list of tuples, each holding a name string (here ‘encoder’), a OneHotEncoder object, and a list with the index of the column we want to encode. Setting remainder to 'passthrough' keeps the other columns unchanged.

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')

After creating the ColumnTransformer object, we will use its fit_transform() method to turn the string data into binary values.

X = np.array(ct.fit_transform(X))

We wrap the result in NumPy’s array() function so that X is a regular NumPy array for the following steps.

One more thing to notice: when you print X, the new binary columns appear at the start, before the remaining columns.
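To see this column order on a small scale, here is a minimal sketch with three made-up rows. The values and the position of the string column (index 2) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows in the form [rating, reviews, genre]; genre sits at index 2.
X_demo = np.array([
    [4.7, 17350, "Non Fiction"],
    [4.6, 2052, "Fiction"],
    [4.8, 18979, "Fiction"],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [2])],
    remainder="passthrough",
)
X_encoded = np.array(ct_demo.fit_transform(X_demo))

# Two genres give two binary columns, followed by rating and reviews.
print(X_encoded.shape)
print(X_encoded[0])
```

The categories are sorted alphabetically, so the first binary column stands for "Fiction" and the second for "Non Fiction".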

4. Splitting the data into training and testing sets:

Now that our data doesn’t contain any string values or missing values, it’s time to split it into a training set and a testing set.

# importing the necessary function from its module
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In the above code cell, we initialized four variables, X_train, X_test, y_train, and y_test, which contain the training and testing data. With test_size=0.2 in the train_test_split() function, 80% of the data goes to the training set and the remaining 20% to the testing set.
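You can confirm the split by checking the shapes of the results. A minimal sketch, assuming ten made-up rows with two features each:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, and one target value per row.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# With test_size=0.2, 8 of the 10 rows go to training and 2 to testing.
print(X_train_demo.shape, X_test_demo.shape)  # (8, 2) (2, 2)
```

The random_state value only fixes which rows land in each set, so the split is the same every time you run the code.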

5. Training MLR model:

Now it’s time for the one thing you have been waiting for: we will use scikit-learn’s LinearRegression class to train a model on our training data.

# importing the library
from sklearn.linear_model import LinearRegression
# creating an object of class LinearRegression
LR = LinearRegression()
# Fitting our training data to train the model
LR.fit(X_train, y_train)

And that’s it, within three lines of code you have trained your model to predict the price of a bestseller book.

6. Predicting the result:

Let’s see if our model predicts the correct values:

# predicting the test result
y_pred = LR.predict(X_test)

Now, to see both the real values and the predicted values side by side, we will create a new data frame:

# creating a new dataframe (np.ravel makes sure both columns are 1-D)
df = pd.DataFrame({'Real Value': np.ravel(y_test), 'Predicted Value': np.ravel(y_pred)})
df

df shows the real and predicted values so you can compare them.
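Comparing values row by row only goes so far, so a summary number can help. Here is a minimal sketch using scikit-learn’s mean_absolute_error and r2_score functions; the numbers are made up and simply stand in for y_test and y_pred.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values standing in for y_test and y_pred.
y_true_demo = [10, 20, 30, 40]
y_pred_demo = [12, 18, 33, 39]

# Mean absolute error: the average size of the prediction errors.
# Here the absolute errors are 2, 2, 3 and 1, so the mean is 2.0.
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# R squared: 1.0 means perfect predictions; here it is about 0.964.
r2 = r2_score(y_true_demo, y_pred_demo)

print(mae, r2)
```

A lower mean absolute error and an R squared closer to 1 both point to predictions that sit closer to the real values.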

7. MLR equation:

y = B0 + B1*X1 + B2*X2 + … + Bn*Xn

In the above equation, y is the dependent variable and X1, X2, …, Xn are the independent variables. To find the value of y we need the intercept (B0) and the slope (Bi) for each Xi, also known as the coefficients.

# coefficients of X
coefficient = LR.coef_
# intercept of y
interceptor = LR.intercept_

The coefficient and interceptor variables now hold the model’s coefficients and its intercept.
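To check that these values really describe the model, you can rebuild a prediction by hand: the intercept plus each coefficient multiplied by its variable. The sketch below is a small illustration on made-up, noise-free data generated from y = 3 + 2*X1 - 1*X2; none of these numbers come from the book dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3 + 2*X1 - 1*X2 exactly.
X_demo = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 3.0],
    [5.0, 8.0],
])
y_demo = 3 + 2 * X_demo[:, 0] - 1 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand: intercept + sum of coefficient * variable.
manual_pred = model.intercept_ + X_demo @ model.coef_

print(model.coef_)       # close to [ 2. -1.]
print(model.intercept_)  # close to 3.0
print(np.allclose(manual_pred, model.predict(X_demo)))  # True
```

Because the data has no noise, the fitted coefficients and intercept match the numbers used to generate it, up to floating-point rounding.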

And now you have your own machine learning model, ready to deploy in the real world.

You can find all the code in the following link:

https://github.com/RAHUL-KAD/Machine-Learning/blob/main/Regrassion/Code/multiple_linear_regression.py

I hope you can now train your own model on different datasets.

Best of luck!!!