Introduction to Statistical Graphs with Python and Seaborn

2020-12-12 19:12:26 | #programming #python #dataviz

Tested On

  • Linux Ubuntu 20.04
  • Windows 10
  • macOS Catalina

Seaborn is a powerful Python library for exploring data and visualizing statistical relationships and distributions. It is built on Matplotlib and extends Matplotlib's functionality with a higher-level interface.

By integrating directly with Numpy and Pandas data structures, it takes care of some of the frustrations users have with Matplotlib, such as verbose parameters and the extra work needed to plot straight from DataFrames.

Seaborn Offers Statistical Graphing Functionality

  • Flexible customization of graphics
  • A consistent API abstraction across visualizations
  • Visualizing univariate and bivariate distributions
  • Drawing a variety of plots such as distribution plots, heat maps, matrix plots, grids and regression plots
  • Plotting statistical time series data

What You Need to Get Started with Seaborn

Having knowledge of Matplotlib, Pandas and Numpy makes it significantly easier to work with a library like Seaborn, but is not necessary to get started and complete this tutorial.

How to Set Up a Project Skeleton

How to Create Python Project Files with Windows 10 PowerShell 2.0+

cd ~
New-Item -ItemType "directory" -Path ".\seaborn-project"
cd seaborn-project
virtualenv venv
.\venv\Scripts\activate

To verify that the virtual environment is active, make sure (venv) is in the PowerShell command prompt. For example, (venv) PS C:\Users\username\seaborn-project>

How to Create Python Project Files with Linux Ubuntu 14.04+ or macOS

cd ~
mkdir seaborn-project
cd seaborn-project
virtualenv -p python3 venv
source venv/bin/activate

To verify that the virtual environment is active, make sure (venv) is in the terminal command prompt.

This will create the following files and folders, and activate the virtual environment.

▾ seaborn-project/
  ▸ venv/
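
The prompt prefix is the quickest check, but the same thing can be confirmed from inside Python. The snippet below is a minimal sketch of that idea. The helper name in_virtual_environment is hypothetical and not part of virtualenv; it relies on the fact that a virtual environment changes sys.prefix relative to the base interpreter.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and the
    # standard-library venv module) make sys.base_prefix differ from sys.prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

If this prints False after activation, the shell is most likely still using the system-wide Python.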

Installing Seaborn with Pip

The Seaborn library can be installed using pip with pip3 install seaborn==0.11.0 or, if you are using the Anaconda distribution, using conda with conda install seaborn==0.11.0.

In our code, we'll make referencing Seaborn easier by importing Seaborn with the shorthand sns.

import seaborn as sns

Installing Pandas and Matplotlib with Pip

This tutorial also requires you to install a specific version of Pandas with the pip3 install pandas==1.1.5 command, as well as Matplotlib with pip3 install matplotlib==3.3.3. To get the plot to display in a window, you can install PyQt5 with pip3 install PyQt5==5.15.2.
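
Because the tutorial pins specific versions, it can help to confirm what actually ended up in the virtual environment. The following is a small sketch using only the standard library (importlib.metadata, available from Python 3.8). The helper name installed_versions is hypothetical, chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["seaborn", "pandas", "matplotlib", "PyQt5"])
for name, installed in versions.items():
    print(f"{name}: {installed or 'not installed'}")
```

Any package reported as not installed can be added with the matching pip3 install command above.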

Loading the Datasets

There are two major ways to load datasets. We can use the example datasets that Seaborn provides or we can load our own data with Pandas.

To use an example dataset, all we need to do is call the load_dataset() function with the name of the dataset. Datasets such as titanic and tips are not bundled with the installation. They are downloaded from Seaborn's online data repository the first time they are requested and cached locally afterwards, so this step needs an internet connection.

titanic_data = sns.load_dataset('titanic')

load_dataset() is limited to these example datasets. To work with our own data, we need Pandas.

To use Pandas, we will import the Pandas library and then load the data we want to work with. For this tutorial, we will be working with Pandas and a simple Amazon Bestseller dataset from Kaggle. This will be downloaded and imported into the program.

NOTE: The dataset will be downloaded as a CSV file and must be placed in the same folder as your Python script.

import pandas as pd
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
amzbooks.head(5)
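
To see what index_col=0 does without downloading anything, here is a small sketch that reads CSV text from memory. The rows are made up, and the Name and Author column names are assumptions about the layout of the Kaggle file; only User Rating, Reviews, Price, Year and Genre are used later in this tutorial.

```python
import io

import pandas as pd

# Hypothetical rows that only mimic the layout of the Kaggle file;
# the titles, authors and numbers are invented for illustration.
csv_text = """Name,Author,User Rating,Reviews,Price,Year,Genre
Example Book A,Author One,4.7,1200,8,2016,Non Fiction
Example Book B,Author Two,4.3,560,15,2018,Fiction
Example Book C,Author Three,4.8,3400,11,2019,Fiction
"""

# index_col=0 turns the first column into the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.index.name)     # name of the column used as the index
print(list(sample.columns))  # the remaining columns
print(sample.dtypes)         # pandas infers numeric types where it can
```

With the first column used as the index, it no longer appears among the regular columns, which is why the later plots refer only to the remaining column names.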

Different Plots with Seaborn

Linear Regression Curves

Seaborn allows us to model the relationship between quantitative variables: an explanatory (independent) variable and a response (dependent) variable. The function to visualize linear relationships through regression curves is lmplot().

So let's say you have two variables: an independent variable x, and the dependent variable y. Linear regression models aim to:

  1. Determine how closely related the two variables are. The correlation coefficient that accompanies a linear regression is a number between -1 and 1, indicating the strength and direction of the linear relationship between x and y. 0 means there is no linear relationship. 1 means there's a perfect positive correlation (an increase in x also means an increase in y). -1 means there's a perfect negative correlation (an increase in x means a decrease in y and vice versa).
  2. Predict the values of y for any given value of x, based on our understanding of the relationship between x and y.

Linear regression curves are very suitable for predictive models, forecasting, predictive analysis and error reduction.
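
The two goals above can be illustrated with a few lines of Numpy. This is only a sketch on made-up numbers, not the bestseller data: np.corrcoef gives the correlation coefficient, and np.polyfit with deg=1 fits the straight line used for prediction.

```python
import numpy as np

# Hypothetical toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Goal 1: how closely related are x and y? (value between -1 and 1)
r = np.corrcoef(x, y)[0, 1]

# Goal 2: fit y = slope * x + intercept and predict y for a new x
slope, intercept = np.polyfit(x, y, deg=1)
predicted_y = slope * 6.0 + intercept

# A perfectly decreasing relationship gives a coefficient of -1
r_negative = np.corrcoef(x, -x)[0, 1]

print(f"correlation: {r:.3f}")
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 6: {predicted_y:.2f}")
print(f"correlation of x with -x: {r_negative:.1f}")
```

lmplot() draws this kind of fitted line on top of a scatter plot of the data, along with a shaded confidence band.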

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.lmplot(x="Reviews", y="User Rating", data=amzbooks)
plt.show()

A linear regression curve plotting user ratings against reviews of bestselling Amazon books, showing little to no correlation between them

The output shows a nearly horizontal straight line running through the plot, which suggests that there is little to no correlation between user ratings and reviews in this initial analysis.

Swarm Plots

Swarm plots make for one of the most aesthetically pleasing Seaborn visualizations. They place points along a categorical axis and adjust their positions automatically so that they do not overlap. This style of plotting is similar to the strip plot and is sometimes called a beeswarm, for obvious reasons.

Swarm plots are usually used for small datasets, because they do not scale well to a large number of observations. The inputs can be column names from a dataset or the data vectors themselves.

The swarmplot() function draws the swarm plot and it takes a categorical variable, a numeric variable and the dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.swarmplot(x="Price", y="Genre", data=amzbooks)
plt.show()

A swarm plot depicting how non-fiction and fiction books are both priced mostly in the 0 to 25 dollar range, with non-fiction having some outliers priced towards and above 100 dollars

This draws a categorical plot of book prices against genre. We can further customize the plot by changing its orientation, fonts, sizes, colors and order.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# For a horizontal plot, the numeric variable goes on x and the categories on y
ax = sns.swarmplot(data=amzbooks, x="User Rating", y="Genre", size=5, orient="h")

ax.set_title("A Swarm Plot")
plt.show()

A swarm plot depicting that fiction books have more outliers that are rated higher than non-fiction books

The above shows a swarm plot that is horizontally oriented, has a title and a plot marker size of 5.

We can change the order of the categories on the y-axis by passing the order parameter.

ax = sns.swarmplot(data=amzbooks, x="User Rating", y="Genre", size=5, order=["Non Fiction", "Fiction"])

Box Plots

Box plots are so named because of their box-like appearance. They visualize data through their quartiles. The boxplot is quite a basic, simple plot and it is created using the boxplot() function.

The boxplot() function is used to plot univariate and bivariate distributions of data. It can take one variable or two as parameters.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.boxplot(x='Genre', y='Reviews', data=amzbooks)
plt.show()

A box plot of the number of reviews by genre, with fiction books showing more high outliers than non-fiction books

The vertical lines extending from the boxes are called whiskers.
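
To make the quartiles and whiskers concrete, here is a short sketch that computes them by hand with Pandas. The review counts are made up, and the 1.5 × IQR rule is an assumption that matches the default whis=1.5 setting of boxplot().

```python
import pandas as pd

# Hypothetical review counts, chosen so that one value sits far from the rest
reviews = pd.Series([120, 340, 560, 610, 800, 950, 1200, 9000])

# The box spans the first to the third quartile, with a line at the median
q1 = reviews.quantile(0.25)
median = reviews.quantile(0.50)
q3 = reviews.quantile(0.75)
iqr = q3 - q1

# Assuming whis=1.5: whiskers reach the most extreme data points that
# still lie within 1.5 * IQR of the box edges
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = reviews[(reviews >= lower_limit) & (reviews <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually as outliers
outliers = reviews[(reviews < lower_limit) | (reviews > upper_limit)]

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Passing a different whis value to boxplot() changes how far the whiskers may extend and therefore which points count as outliers.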

We can also pass a whole DataFrame to boxplot() instead of naming individual variables. Here, the drop() method first removes the Price column from our DataFrame.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
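# drop() returns a new DataFrame without the Price column
without_price = amzbooks.drop(['Price'], axis=1)

# With only data= given, boxplot() draws one box per numeric column
sns.boxplot(data=without_price)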
plt.show()

A box plot showing the distributions of user rating, reviews, and year

This demo produces one box for each numeric column that remains in the DataFrame: User Rating, Reviews and Year. Price does not appear because it was dropped.
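
To check which columns are left after drop(), the DataFrame can be inspected before plotting. The sketch below uses a made-up three-row frame rather than the real dataset, and it uses select_dtypes() as a stand-in for the numeric-column filtering that Seaborn applies, as far as I understand, when it is given a whole DataFrame.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this tutorial
toy = pd.DataFrame({
    "Author": ["Author One", "Author Two", "Author Three"],
    "User Rating": [4.7, 4.3, 4.8],
    "Reviews": [1200, 560, 3400],
    "Price": [8, 15, 11],
    "Year": [2016, 2018, 2019],
    "Genre": ["Non Fiction", "Fiction", "Fiction"],
})

# drop() returns a new DataFrame; the original keeps its Price column
without_price = toy.drop(["Price"], axis=1)

# Only the numeric columns are candidates for boxes
numeric_columns = without_price.select_dtypes(include="number").columns.tolist()

print(numeric_columns)
print("Price" in toy.columns)
```

Note that these columns are on very different scales, so in a shared plot the column with the largest values tends to dominate the axis.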

We can further customize the boxplot by adding colors and controlling the width of the boxes using the palette and width parameters.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.boxplot(x='Genre', y='Reviews', data=amzbooks, width=0.3, palette="Blues")
plt.show()

A box plot depicting that fiction books generally have more reviews than non-fiction books

Conclusion

Using the Seaborn library for data visualization opens the door to a wide range of statistical plots.

We touched on a few plots in this article, but Seaborn has many more, each with a wide range of customization options, helping us create visualizations that fit professional, brand and personal preferences. With this tutorial, we believe you can get started on your Seaborn journey.
