Introduction to Statistical Graphs with Python and Seaborn
2020-12-12 19:12:26
- Linux Ubuntu 20.04
- Windows 10
- macOS Catalina
Seaborn is a powerful Python library for exploring and visualizing statistical data. It is built on top of Matplotlib and extends Matplotlib's functionality with a higher-level interface.
By integrating directly with NumPy and pandas data structures, it removes some of the friction users run into with Matplotlib, such as verbose parameters and awkward handling of DataFrames.
Seaborn Offers Statistical Graphing Functionality
- Flexible customization of graphics
- A consistent, high-level API across plot types
- Visualizing univariate and bivariate distributions
- A variety of plot types, including distribution plots, heat maps, matrix plots, grids and regression plots
- Plotting statistical time series data
What You Need to Get Started with Seaborn
Knowing Matplotlib, Pandas and NumPy makes it significantly easier to work with a library like Seaborn, but it is not necessary to get started and complete this tutorial. If you'd like to learn about these libraries in depth, feel free to check out our Data Visualization with Python course.
How to Set Up a Project Skeleton
How to Create Python Project Files with Windows 10 PowerShell 2.0+
cd ~
New-Item -ItemType "directory" -Path ".\seaborn-project"
cd seaborn-project
virtualenv venv
.\venv\Scripts\activate
To verify that the virtual environment is active, make sure (venv) is in the PowerShell command prompt. For example, (venv) PS C:\Users\username\seaborn-project>
How to Create Python Project Files with Linux Ubuntu 14.04+ or macOS
cd ~
mkdir seaborn-project
cd seaborn-project
virtualenv -p python3 venv
source venv/bin/activate
To verify that the virtual environment is active, make sure (venv) is in the terminal command prompt.
This will create the following files and folders, and activate the virtual environment.
▾ seaborn-project/
  ▸ venv/
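The prompt prefix is the quickest check, but the same thing can be confirmed from inside Python. The following is a minimal sketch; the helper name in_virtualenv is made up for illustration. It assumes that environments created by venv or recent virtualenv releases make sys.prefix differ from sys.base_prefix, while older virtualenv releases set sys.real_prefix instead.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # Older virtualenv releases expose sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases point sys.prefix at the environment,
    # while sys.base_prefix keeps pointing at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Running this with the environment activated should report True; running it with the system interpreter should report False.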
Installing Seaborn with Pip
Seaborn can be installed with pip using pip3 install seaborn==0.11.0 or, if you are using the Anaconda distribution, with conda using conda install seaborn==0.11.0.
In our code, we'll make referencing Seaborn easier by importing Seaborn with the shorthand sns.
import seaborn as sns
Installing Pandas and Matplotlib with Pip
This tutorial also requires you to install a specific version of Pandas with the pip3 install pandas==1.1.5 command, as well as Matplotlib with pip3 install matplotlib==3.3.3. To get the plot to display in a window, you can install PyQt5 with pip3 install PyQt5==5.15.2.
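Because the tutorial pins exact versions, it can help to confirm what actually got installed. The sketch below is one possible way to do that with the standard library's importlib.metadata module (Python 3.8 and later); the PINNED dictionary and the installed_versions helper are names made up for this example.

```python
from importlib import metadata  # standard library, Python 3.8+

# Versions pinned by the install commands in this tutorial.
PINNED = {
    "seaborn": "0.11.0",
    "pandas": "1.1.5",
    "matplotlib": "3.3.3",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(PINNED).items():
    status = "ok" if version == PINNED[name] else "differs from tutorial"
    print(f"{name}: installed={version}, pinned={PINNED[name]} ({status})")
```

A different version is not necessarily a problem, but it is worth knowing about if a plot looks different from the one described here.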
Loading the Datasets
There are two major ways to load datasets. We can use the built-in data sets that come installed with the Seaborn library or we can use Pandas.
To use a built-in dataset, all we need to do is call the load_dataset() function. Example datasets such as titanic and tips are fetched by name from Seaborn's online data repository, so the function needs an internet connection the first time a dataset is requested.
titanic_data = sns.load_dataset('titanic')
load_dataset() is limited to Seaborn's example datasets. To work with our own data, we turn to Pandas.
To use Pandas, we will import the Pandas library and then load the data we want to work with. For this tutorial, we will be working with Pandas and a simple Amazon Bestseller dataset from Kaggle. This will be downloaded and imported into the program.
NOTE: The dataset will be downloaded as a CSV file and must be placed in the same folder as your Python script.
import pandas as pd
# index_col=0 uses the first CSV column as the row index
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
amzbooks.head(5)
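To see what index_col=0 does without downloading anything, we can read a tiny CSV from a string. This is only a sketch: the three rows are made up, and the Name and Author column names are an assumption about the file's layout (the other column names are the ones used later in this tutorial).

```python
import io

import pandas as pd

# Made-up rows that mimic the assumed layout of the CSV; not real data.
csv_text = """Name,Author,User Rating,Reviews,Price,Year,Genre
Book A,Author One,4.7,17350,8,2016,Non Fiction
Book B,Author Two,4.6,2052,22,2011,Fiction
Book C,Author Three,4.8,21424,6,2019,Fiction
"""

# index_col=0 turns the first column (here "Name") into the row index,
# so it no longer appears among the regular columns.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.head(5))
print(sample.dtypes)
```

Printing dtypes is a quick way to confirm that the numeric columns were parsed as numbers rather than text before plotting them.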
Different Plots with Seaborn
Linear Regression Curves
Seaborn lets us model the relationship between quantitative variables: an explanatory (independent) variable and a response (dependent) variable. The function used to visualize linear relationships through regression lines is lmplot().
So let's say you have two variables: an independent variable x and a dependent variable y. Linear regression models aim to:
- Determine how closely related the two variables are. The strength of a linear relationship is summarized by the correlation coefficient, a number between -1 and 1. A value of 0 means there is no linear relationship. A value of 1 means a perfect positive correlation (an increase in x goes with an increase in y). A value of -1 means a perfect negative correlation (an increase in x goes with a decrease in y, and vice versa).
- Predict the values of y for any given value of x, based on our understanding of the relationship between x and y.
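Both goals can be sketched in a few lines of plain Python. The example below uses made-up numbers and two hypothetical helpers, pearson_r and fit_line, written from the textbook formulas for the Pearson correlation coefficient and the least-squares line. It is meant to illustrate the idea, not to reproduce what Seaborn does internally.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: y rises with x in one series and falls in the other.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]
ys_down = [10, 8, 6, 4, 2]

print("r for rising data:", pearson_r(xs, ys_up))
print("r for falling data:", pearson_r(xs, ys_down))

# Use the fitted line to predict y for an x value we have not seen.
slope, intercept = fit_line(xs, ys_up)
print("predicted y at x = 6:", slope * 6 + intercept)
```

The first helper answers "how closely related are x and y", and the second gives the line used to predict y for a new x.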
Linear regression is well suited to predictive modeling, forecasting and error reduction.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# Scatter plot of Reviews against User Rating with a fitted regression line
sns.lmplot(x="Reviews", y="User Rating", data=amzbooks)
plt.show()
The output shows a nearly flat regression line running horizontally across the plot, which suggests that there is little to no linear correlation between the number of reviews and user ratings in this initial analysis.
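plt.show() needs a GUI backend, such as the one PyQt5 provides. On a machine without a display, a possible alternative is to save the figure to a file instead. The sketch below assumes Matplotlib's non-interactive Agg backend and uses a handful of made-up rows in place of the CSV; the output file name is arbitrary.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window required

import pandas as pd
import seaborn as sns

# Made-up values for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "Reviews": [500, 2000, 4500, 9000, 15000, 22000],
        "User Rating": [4.6, 4.4, 4.7, 4.5, 4.8, 4.6],
    }
)

# lmplot returns a FacetGrid, which has its own savefig method.
grid = sns.lmplot(x="Reviews", y="User Rating", data=sample)

out_path = os.path.join(tempfile.gettempdir(), "lmplot_sketch.png")
grid.savefig(out_path)
print("saved to", out_path)
```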
Swarm Plots
Swarm plots make for one of the most aesthetically pleasing Seaborn visualizations. Points are placed along a categorical axis and adjusted automatically so that they do not overlap. This style of plotting is similar to the strip plot and is sometimes called a beeswarm, for obvious reasons.
Swarm plots are usually used for small datasets, and the input parameters are not limited to strings or integers.
The swarmplot() function draws the swarm plot. It takes a categorical variable, a numeric variable and a dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# Numeric Price on the x-axis, one row of points per Genre on the y-axis
sns.swarmplot(x="Price", y="Genre", data=amzbooks)
plt.show()
This draws the price of each book against its genre. We can further customize the plot by changing its orientation, fonts, sizes, colors and order.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# orient="h" expects the numeric variable on x and the categorical one on y
ax = sns.swarmplot(data=amzbooks, x="User Rating", y="Genre", size=5, orient="h")
ax.set_title("A Swarm Plot")
plt.show()
The above shows a swarm plot that is horizontally oriented, has a title and a plot marker size of 5.
We can change the order of the categories on the categorical axis by adding the order parameter.
ax = sns.swarmplot(data=amzbooks, x="Genre", y="User Rating", size=5, order=["Non Fiction", "Fiction"])
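To experiment with swarm plots without the CSV file, a small hand-made DataFrame is enough. The sketch below uses made-up prices, assumes the non-interactive Agg backend so that it also runs without a display, and passes an explicit Matplotlib Axes through the ax parameter.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window required

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "Genre": ["Fiction", "Non Fiction", "Fiction",
                  "Non Fiction", "Fiction", "Non Fiction"],
        "Price": [8, 15, 11, 22, 6, 13],
    }
)

fig, ax = plt.subplots()
# Categorical Genre on x, numeric Price on y, "Non Fiction" drawn first.
sns.swarmplot(x="Genre", y="Price", data=sample,
              order=["Non Fiction", "Fiction"], ax=ax)
ax.set_title("Price by genre (made-up data)")

print(ax.get_xlabel(), "vs", ax.get_ylabel())
```

Replacing the last line with plt.show(), and dropping the matplotlib.use("Agg") call, displays the plot in a window instead.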
Box Plots
Box plots are so named because of their box-like appearance. They summarize data through its quartiles. The box plot is a basic, simple plot, and it is created using the boxplot() function.
The boxplot() function can show the distribution of a single variable, or of a numeric variable split by a categorical one, so it accepts either one or two variables.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# One box of Reviews per Genre
sns.boxplot(x='Genre', y='Reviews', data=amzbooks)
plt.show()
The vertical lines extending from the boxes are called whiskers.
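The numbers behind a box plot can be worked out by hand. The sketch below uses made-up values and the standard library's statistics module. It assumes the common 1.5 × IQR rule for the whiskers, which matches the default whis=1.5 setting of boxplot(). Quartile conventions differ slightly between tools, so values on a real plot may differ marginally.

```python
import statistics

# Made-up values with one obvious outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles: the box spans q1 to q3, with a line at the median.
# method="inclusive" interpolates linearly between data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the height of the box

# Whiskers reach the most extreme points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is drawn as an individual outlier point.
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the value 100 falls outside the upper fence, so it would appear as a separate point above the upper whisker.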
We can also pass a whole dataframe through the data parameter and use the dataframe's drop() method to leave out the columns we do not want plotted.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# drop() returns a new dataframe without the Price column
without_price = amzbooks.drop(['Price'], axis=1)
# New boxplot: one box per remaining numeric column
sns.boxplot(data=without_price)
plt.show()
This demo produces one box for each remaining numeric column in the dataframe (User Rating, Reviews and Year), with Price left out.
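To see ahead of time which columns a whole-dataframe boxplot would draw after a column is dropped, we can list the numeric columns that remain. The sketch below uses made-up rows, and it assumes that boxplot() ignores text columns such as Author and Genre when it is given a whole dataframe.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "Author": ["Author One", "Author Two", "Author Three"],
        "User Rating": [4.7, 4.6, 4.8],
        "Reviews": [17350, 2052, 21424],
        "Price": [8, 22, 6],
        "Year": [2016, 2011, 2019],
        "Genre": ["Non Fiction", "Fiction", "Fiction"],
    }
)

# drop() returns a new DataFrame; the original is left unchanged.
without_price = sample.drop(columns=["Price"])

# Only these columns hold numbers, so only they can be drawn as boxes.
numeric_columns = without_price.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Because Reviews and Year are on very different scales from User Rating, plotting them on one shared axis tends to flatten the smaller-valued boxes.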
We can further customize the boxplot by adding colors and controlling the box width using the palette and width parameters.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# Narrower boxes (width=0.3) colored with the "Blues" palette
sns.boxplot(x='Genre', y='Reviews', data=amzbooks, width=0.3, palette="Blues")
plt.show()
The Seaborn library opens the door to a wide range of statistical visualizations.
We touched on a few plots in this article, but Seaborn offers many more, each with a wide range of customization options, helping us create visualizations that fit professional, brand and personal preferences. With this tutorial, you should be ready to get started on your Seaborn journey.
If you're interested in learning about data visualization in depth, take our Real World Data Science with Python course. This course teaches you how to programmatically generate graphs and charts from existing datasets. You'll also learn Python fundamentals and how to use libraries like Matplotlib, Pandas, NumPy and Seaborn.