How to Build a Random Forest Model with Python, Scikit-Learn, and Machine Learning

2020-12-27 09:14:54 | #programming #python #ml

Tested On

  • Linux Ubuntu 20.04
  • Windows 10
  • macOS Catalina

Scikit-Learn (also known as sklearn; formerly scikits.learn) is an open-source machine learning library built on NumPy, SciPy, and Matplotlib. Its name is short for "SciPy Toolkit", as it was originally developed as a third-party SciPy extension.

Sklearn is a simple and efficient tool for predictive analysis, data modeling, statistical modeling, clustering, and regression. It is quite robust and supports both supervised and unsupervised learning algorithms.

Sklearn is designed to work with SciPy, NumPy, Pandas, and Matplotlib. While Sklearn is perfectly capable of mathematical computation, it is not intended for it, nor is it intended for loading, manipulating, or visualizing data. Sklearn is focused on machine learning, and provides a diverse set of features and algorithms, including:

  • Data modeling
  • Clustering
  • Cross-validation
  • Ensemble models
  • Manifold learning
  • Feature selection
  • Feature extraction
  • Parameter tuning
  • Dimensionality reduction
  • Neural networks
  • Support Vector Machines
  • Naive Bayes

This tutorial is the second part of a beginner series on machine learning and its associated frameworks. For a more introductory explanation of machine learning in general, check out Introduction to Machine Learning with Python. And for a crash course on Python fundamentals, start with our intro to Python series.

Setting Up a Scikit-Learn Project

How to Create Python Project Files with Windows 10 PowerShell 2.0+

cd ~
New-Item -ItemType "directory" -Path ".\sklearn-project"
cd sklearn-project
virtualenv venv
.\venv\Scripts\activate

To verify that the virtual environment is active, make sure (venv) is in the PowerShell command prompt. For example, (venv) PS C:\Users\username\sklearn-project>

How to Create Python Project Files with Linux Ubuntu 14.04+ or macOS

cd ~
mkdir sklearn-project
cd sklearn-project
virtualenv -p python3 venv
source venv/bin/activate

To verify that the virtual environment is active, make sure (venv) is in the terminal command prompt.

This will create the following files and folders, and activate the virtual environment.

▾ sklearn-project/
  ▸ venv/

Installing Scikit-Learn with Pip

Scikit-learn requires NumPy and SciPy (pip will install them automatically as dependencies). Version 0.24.0 also requires Python 3.6 or above.

pip install -U scikit-learn==0.24.0

You can also install scikit-learn with conda:

conda install scikit-learn==0.24.0
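
Either way, you can verify that the installation succeeded by printing the installed version:

python -c "import sklearn; print(sklearn.__version__)"

This should output 0.24.0.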

Installing Pandas with Pip

Pandas is a necessary dependency for the program we'll write later on.

pip install pandas==1.2.0

For our first scikit-learn project, we will train and tune a random forest regressor to predict a house's price based on various properties, such as the number of bedrooms, the year it was renovated, the number of floors, and the square footage. This tutorial will introduce you to some of the functions and modules used within scikit-learn, and serves as a general guide to how scikit-learn can be used for machine learning.

Before starting out with scikit-learn, you should have basic Python knowledge and some experience with NumPy, SciPy, Pandas, and Matplotlib.

The dataset we will be using is the House Sales in King County, USA dataset, which can be downloaded from Kaggle.

Full Example Code

Here's the full code for our Scikit-Learn-based machine learning program. Add this code to a file called main.py and run it with python main.py.

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

import joblib
import datetime as dt


dataset = pd.read_csv('kc_house_data.csv', index_col=0)
dataset.date = pd.to_datetime(dataset.date)
dataset.date = dataset.date.map(dt.datetime.toordinal)
print(dataset.head(5))
print(dataset.columns)

y = dataset.price
x = dataset.drop('price', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=123)

print(dataset.describe())

pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

hyppara = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
           'randomforestregressor__max_depth': [None, 5, 3, 1]}

clf = GridSearchCV(pipeline, hyppara, cv=10)
clf.fit(x_train, y_train)

# Evaluate model pipeline on test data
pred = clf.predict(x_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

# Saving the model for future use
joblib.dump(clf, 'rf_regressor.pkl')

# Load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')

# Predict the data using loaded model
clf2.predict(x_test)

Explanation of the Machine Learning Program

Importing Dependencies

First, we import all of the required pandas and sklearn modules and functions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

import joblib
import datetime as dt

Reading a CSV Dataset with Pandas

Next, we load the dataset using the read_csv() Pandas function.

dataset = pd.read_csv('kc_house_data.csv', index_col=0)

We can print the first 5 rows of our dataset with dataset.head(5)

date     price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  ...  yr_built  yr_renovated  zipcode      lat     long  sqft_living15  sqft_lot15
id                                                                                ...                                                                              
7129300520  735519  221900.0         3       1.00         1180      5650     1.0  ...      1955             0    98178  47.5112 -122.257           1340        5650
6414100192  735576  538000.0         3       2.25         2570      7242     2.0  ...      1951          1991    98125  47.7210 -122.319           1690        7639
5631500400  735654  180000.0         2       1.00          770     10000     1.0  ...      1933             0    98028  47.7379 -122.233           2720        8062
2487200875  735576  604000.0         4       3.00         1960      5000     1.0  ...      1965             0    98136  47.5208 -122.393           1360        5000
1954400510  735647  510000.0         3       2.00         1680      8080     1.0  ...      1987             0    98074  47.6168 -122.045           1800        7503

[5 rows x 20 columns]

The data values are separated by commas, so there's no need to specify a delimiter. But if your dataset were, for example, semicolon-delimited, you would have to update line 14 with:

dataset = pd.read_csv('kc_house_data.csv', index_col=0, sep=';')

Converting Timestamp Strings to Numerical Data with Pandas

Here, we have to convert the timestamp strings in the date column to numerical data, because scikit-learn estimators only accept numeric input.

dataset.date = pd.to_datetime(dataset.date)
dataset.date = dataset.date.map(dt.datetime.toordinal)
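
To see what this conversion does, consider a single raw timestamp: pd.to_datetime() parses the string into a datetime object, and toordinal() turns it into the number of days since January 1 of year 1, which is why the date column in the output above holds values like 735519. A minimal sketch, assuming the dataset's raw 20141013T000000 timestamp format:

import datetime as dt

import pandas as pd

# Parse a raw timestamp string into a pandas Timestamp (a datetime subclass)
ts = pd.to_datetime('20141013T000000')

# Convert to a proleptic Gregorian ordinal (days since 0001-01-01)
print(dt.datetime.toordinal(ts))  # 735519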

Understanding the Predictive Accuracy of a Machine Learning Program

A crucial step in setting up a supervised machine learning program is determining its predictive accuracy. We do this by training the model on one portion of the data, then checking how accurately it predicts a target value or classification for data it has not seen.

For this example, we want to see if our program can predict a house's price based on its analysis of the other columns in the dataset.

But keep in mind that a variety of factors can negatively impact a machine learning program's accuracy. Having too many variables and a model that fails to generalize leads to overfitting, where the accuracy reported during training is much higher than the program's real-world accuracy on unseen data.

Perhaps the biggest factor, here, is that we're working with just one dataset for testing and evaluation.

A good machine learning program should be able to separate noise from the "signal" (the true pattern you are aiming to predict).

Separating the Data into Training and Test Sets

To compare our machine learning program's accuracy within a training setting vs. a real-world setting, we have to split our data into two sets—a training set and a test set.

y = dataset.price
x = dataset.drop('price', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=123)

Here, we set aside 20% of the data as a test set for evaluating the model. We also pass a fixed random_state seed so that the split, and therefore our results, are reproducible.
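
As a quick sanity check, you can print the shapes of the resulting splits. Assuming the full 21,613-row dataset, the output should look something like this:

print(x_train.shape, x_test.shape)  # (17290, 19) (4323, 19)
print(y_train.shape, y_test.shape)  # (17290,) (4323,)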

We use dataset.describe() to get a better understanding of our data:

date         price      bedrooms     bathrooms   sqft_living  ...       zipcode           lat          long  sqft_living15     sqft_lot15
count   21613.000000  2.161300e+04  21613.000000  21613.000000  21613.000000  ...  21613.000000  21613.000000  21613.000000   21613.000000   21613.000000
mean   735535.193078  5.400881e+05      3.370842      2.114757   2079.899736  ...  98077.939805     47.560053   -122.213896    1986.552492   12768.455652
std       113.048011  3.671272e+05      0.930062      0.770163    918.440897  ...     53.505026      0.138564      0.140828     685.391304   27304.179631
min    735355.000000  7.500000e+04      0.000000      0.000000    290.000000  ...  98001.000000     47.155900   -122.519000     399.000000     651.000000
25%    735436.000000  3.219500e+05      3.000000      1.750000   1427.000000  ...  98033.000000     47.471000   -122.328000    1490.000000    5100.000000
50%    735522.000000  4.500000e+05      3.000000      2.250000   1910.000000  ...  98065.000000     47.571800   -122.230000    1840.000000    7620.000000
75%    735646.000000  6.450000e+05      4.000000      2.500000   2550.000000  ...  98118.000000     47.678000   -122.125000    2360.000000   10083.000000
max    735745.000000  7.700000e+06     33.000000      8.000000  13540.000000  ...  98199.000000     47.777600   -121.315000    6210.000000  871200.000000

[8 rows x 20 columns]

Here, we can see that we're working with an all-numeric dataset, which is ideal for training. However, the features span very different scales, so we standardize them using a pipeline and the StandardScaler() transformer. (Random forests themselves are largely insensitive to feature scaling, but bundling preprocessing into the pipeline is a good habit, and it ensures the scaling is learned only from training data.)

pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

Setting Hyperparameters for Machine Learning

Model Parameters vs. Model Hyperparameters

A model parameter is internal to the model, configurable, and has a value that can be estimated from the data, automatically. Model hyperparameters, on the other hand, must be set manually and tuned, and help in estimating model parameters. In any machine learning algorithm, hyperparameters need to be initialized before training a model.

The format for hyperparameters should be a Python dictionary (data structure for key-value pairs) where keys are the hyperparameter names and values are lists of settings to try.

hyppara = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
           'randomforestregressor__max_depth': [None, 5, 3, 1]}
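
Each key is prefixed with the pipeline step name (randomforestregressor), followed by a double underscore, then the parameter name. If you're ever unsure which names are available, you can list every tunable parameter the pipeline exposes:

# List all parameter names the pipeline exposes for tuning
print(sorted(pipeline.get_params().keys()))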

Setting a Cross-Validation Pipeline

Next, we tune the model using a grid search with 10-fold cross-validation.

clf = GridSearchCV(pipeline, hyppara, cv=10)
clf.fit(x_train, y_train)

Cross-validation estimates how well a model-building procedure performs by repeatedly fitting it on one portion of the training data and evaluating it on the held-out remainder, using the same procedure every time.

Cross-validation is important because it tests the effectiveness of a model on data it was not fit on. A best practice is to include your data preprocessing steps inside the cross-validation loop, which is exactly what passing the whole pipeline to GridSearchCV accomplishes: the scaler is re-fit on each training fold, so no information from the validation folds leaks into training.
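
If you only wanted a cross-validated performance estimate, without the hyperparameter search, a minimal sketch using the same pipeline might look like this (cross_val_score is a standard scikit-learn helper; the R² scoring choice here is our assumption):

from sklearn.model_selection import cross_val_score

# The pipeline is re-fit on each training fold, so the scaler
# never sees the validation fold during fitting
scores = cross_val_score(pipeline, x_train, y_train, cv=10, scoring='r2')
print(scores.mean(), scores.std())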

# Evaluate model pipeline on test data
pred = clf.predict(x_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

This prints the R² score and the mean squared error, something like:

0.8958047936671787
14714527296.347326

Evaluating the model pipeline is quite straightforward. The clf object already performed the hyperparameter tuning during fit(); here we simply predict on the test set with the best model it found.

The metrics imported earlier, r2_score and mean_squared_error, are then used to evaluate the model's performance on the held-out data.
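
You can also inspect which hyperparameter combination the grid search selected:

# Best hyperparameter combination found during cross-validation
print(clf.best_params_)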

Saving and Loading the Machine Learning Model

# Saving the model for future use
joblib.dump(clf, 'rf_regressor.pkl')

Now that we have saved the model, we can reload it from the .pkl file whenever we want:

# Load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')

After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are drivers of the outcome, and their values have a significant effect on the predictions. Conversely, variables with low importance can be excluded from the model, making it simpler and faster to fit and predict.
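
Scikit-learn exposes this through the fitted regressor's feature_importances_ attribute. Here is a minimal sketch, assuming the clf grid search object from above (best_estimator_ holds the refit winning pipeline, and make_pipeline names the forest step randomforestregressor, matching the hyperparameter keys we used earlier):

# Pull the fitted random forest out of the winning pipeline
forest = clf.best_estimator_.named_steps['randomforestregressor']

# Pair each feature name with its importance and sort, highest first
importances = sorted(zip(x.columns, forest.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(f'{name}: {score:.4f}')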

Conclusion

Scikit-learn has many advantages over other machine learning libraries. It provides a consistent interface to machine learning models, it has a rich collection of modules and packages making it easy to train models, it is flexible and it provides tuning parameters with sensible defaults. It remains a top choice for machine learning.
