# Introduction to Statistical Graphs with Python and Seaborn

##### 2020-12-12 19:12:26

## Tested On

- Linux Ubuntu 20.04
- Windows 10
- macOS Catalina

Seaborn is a powerful Python library for exploring data and visualizing statistical relationships and distributions. It is built on top of Matplotlib and extends Matplotlib's functionality with a higher-level interface.

Because it integrates directly with NumPy and Pandas data structures, it removes some of the frustrations users have with Matplotlib, such as verbose parameters and the extra work needed to plot from DataFrames.

## Seaborn Offers Statistical Graphing Functionality

- Flexible customization of graphics
- A high-level API that abstracts over the details of individual visualizations
- Visualizing univariate and bivariate distributions
- Support for a variety of plots, such as distribution plots, heat maps, matrix plots, grids and regression plots
- Plotting statistical time series data

## What You Need to Get Started with Seaborn

Knowledge of Matplotlib, Pandas and NumPy makes it significantly easier to work with a library like Seaborn, but it is not required to get started and complete this tutorial. If you'd like to learn about these libraries in depth, feel free to check out our Data Visualization with Python course.

## How to Set Up a Project Skeleton

### How to Create Python Project Files with Windows 10 PowerShell 2.0+

```
cd ~
New-Item -ItemType "directory" -Path ".\seaborn-project"
cd seaborn-project
virtualenv venv
.\venv\Scripts\activate
```

To verify that the virtual environment is active, make sure `(venv)` appears in the PowerShell command prompt. For example: `(venv) PS C:\Users\username\seaborn-project>`

### How to Create Python Project Files with Linux Ubuntu 14.04+ or macOS

```
cd ~
mkdir seaborn-project
cd seaborn-project
virtualenv -p python3 venv
source venv/bin/activate
```

To verify that the virtual environment is active, make sure (venv) is in the terminal command prompt.

This will create the following files and folders, and activate the virtual environment.

```
▾ seaborn-project/
▸ venv/
```
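Besides looking at the prompt, you can also ask Python itself whether it is running inside a virtual environment. The following is a minimal sketch, not part of the original setup steps; the helper name `in_virtualenv` is our own, and the `real_prefix` check is only there on the assumption that an older virtualenv release might be in use.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # Otherwise, inside an environment sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Virtual environment active:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```

If the first line prints `False`, activate the environment again before installing any packages.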

## Installing Seaborn with Pip

The Seaborn library can be installed with pip using `pip3 install seaborn==0.11.0` or, if you are using the Anaconda distribution, with conda using `conda install seaborn==0.11.0`.

In our code, we'll make referencing Seaborn easier by importing Seaborn with the shorthand sns.

`import seaborn as sns`

## Installing Pandas and Matplotlib with Pip

This tutorial also requires a specific version of Pandas, installed with `pip3 install pandas==1.1.5`, as well as Matplotlib, installed with `pip3 install matplotlib==3.3.3`. To get the plot to display in a window, you can install PyQt5 with `pip3 install PyQt5==5.15.2`.
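To confirm that the pinned versions ended up in your environment, a short check like the one below may help. It is only a sketch: the `PINNED` dictionary and the `check_versions` helper are names we made up for illustration, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata

# Versions pinned in this tutorial.
PINNED = {
    "seaborn": "0.11.0",
    "pandas": "1.1.5",
    "matplotlib": "3.3.3",
    "PyQt5": "5.15.2",
}


def check_versions(pinned):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, matches) in check_versions(PINNED).items():
    status = "ok" if matches else "differs from the pinned version"
    print(f"{name}: {installed or 'not installed'} ({status})")
```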

## Loading the Datasets

There are two main ways to load datasets. We can use the example datasets that Seaborn provides, or we can load our own data with Pandas.

To use an example dataset, all we need to do is call the load_dataset() function with the name of the dataset. Example datasets such as titanic and tips are not bundled with the installation. Instead, load_dataset() downloads the named dataset from Seaborn's online data repository, so it needs an internet connection the first time it runs.

`titanic_data = sns.load_dataset('titanic')`

load_dataset() is limited to these example datasets. To work with our own data, we need Pandas.

To use Pandas, we will import the Pandas library and then load the data we want to work with. For this tutorial, we will be working with a simple Amazon Bestseller dataset from Kaggle, which needs to be downloaded before it can be imported into the program.

NOTE: The dataset is downloaded as a CSV file and must be placed in the same folder as your Python script.

```
import pandas as pd
# index_col=0 uses the first CSV column as the row index
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# print() is needed to see the first five rows when running as a script
print(amzbooks.head(5))
```
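If you want to see what `index_col=0` does before downloading anything, the sketch below reads a tiny inline CSV instead of the real file. The three rows are placeholders we invented, not real bestseller data, and the column layout (Name, Author, User Rating, Reviews, Price, Year, Genre) is our assumption about the Kaggle file.

```python
import io

import pandas as pd

# Placeholder rows (not real data) that mimic the assumed column layout.
csv_text = """Name,Author,User Rating,Reviews,Price,Year,Genre
Book A,Author A,4.7,1200,8,2016,Non Fiction
Book B,Author B,4.5,300,12,2018,Fiction
Book C,Author C,4.8,950,15,2019,Fiction
"""

# index_col=0 turns the first column (Name) into the row index,
# so it no longer appears among the regular columns.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.index.name)
print(sample.columns.tolist())
print(sample.dtypes)
```

Checking `dtypes` early is useful because the plotting functions later in this tutorial treat numeric and text columns differently.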

## Different Plots with Seaborn

### Linear Regression Curves

Seaborn allows us to model the relationship between quantitative variables: an explanatory (independent) variable and a response (dependent) variable. The function to visualize linear relationships through regression lines is lmplot().

So let's say you have two variables: an independent variable x, and the dependent variable y. Linear regression models aim to:

- Determine how closely related the two variables are. This is summarized by the correlation coefficient, a number between -1 and 1 that indicates the strength and direction of the linear relationship between x and y. 0 means there is no linear relationship. 1 means there's a perfect positive correlation (an increase in x also means an increase in y). -1 means there's a perfect negative correlation (an increase in x means a decrease in y and vice versa).
- Predict the values of y for any given value of x, based on our understanding of the relationship between x and y.
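The two goals above can be sketched numerically with NumPy. This is an illustration on a handful of made-up points, not on the bestseller data, and it uses `np.corrcoef` and `np.polyfit` rather than anything Seaborn-specific.

```python
import numpy as np

# Made-up illustrative points that roughly follow y = 2x.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Goal 1: how closely related are x and y?
# np.corrcoef returns the correlation matrix; entry [0, 1] is Pearson's r.
r = np.corrcoef(x, y)[0, 1]

# Goal 2: predict y for a given x.
# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, 1)
prediction = slope * 6 + intercept

print(f"correlation r = {r:.3f}")
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"predicted y at x = 6: {prediction:.2f}")
```

On real data you can get the same correlation figure from Pandas, for example with `amzbooks["Reviews"].corr(amzbooks["User Rating"])`.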

Linear regression curves are well suited to predictive models, forecasting, predictive analysis and error reduction.

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.lmplot(x="Reviews", y="User Rating", data=amzbooks)
plt.show()
```

The output is a scatter plot with a regression line that runs almost horizontally through it, suggesting that, in this initial analysis, there is little to no linear correlation between the number of reviews and user ratings.

### Swarm Plots

Swarm plots make for one of the most aesthetically pleasing Seaborn visualizations. They place points along a categorical axis and automatically adjust their positions so that they do not overlap. This style of plotting is similar to the strip plot and is sometimes called a beeswarm plot, because the points cluster like a swarm of bees.

Swarm plots are usually used for small datasets, since every observation is drawn as its own point. The categorical variable does not have to be a string; numeric values can also be treated as categories.

The swarmplot() function draws the swarm plot. It takes a categorical variable, a numeric variable and the dataset.

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.swarmplot(x="Price", y="Genre", data=amzbooks)
plt.show()
```

This plots the price of each book, grouped by genre. We can further customize the plot by changing its orientation, fonts, sizes, colors and order.

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# A horizontal swarm plot needs the numeric variable on x and the categories on y
ax = sns.swarmplot(data=amzbooks, x="User Rating", y="Genre", size=5, orient="h")
ax.set_title("A Swarm Plot")
plt.show()
```

The above shows a swarm plot that is horizontally oriented, has a title and a plot marker size of 5.

We can change the order of the categories on the categorical axis (here, the x-axis) by passing the order parameter.

`ax = sns.swarmplot(data=amzbooks, x="Genre", y="User Rating", size=5, order=["Non Fiction", "Fiction"])`
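To see how a swarm plot differs from the strip plot mentioned at the start of this section, the sketch below draws both side by side. It uses synthetic random data rather than the bestseller file, and the `build_comparison` helper and column names are our own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def build_comparison():
    """Draw a strip plot and a swarm plot of the same synthetic data."""
    # Synthetic ratings, rounded so that many points share the same value.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "group": np.repeat(["A", "B"], 40),
        "value": np.round(rng.normal(loc=4.5, scale=0.2, size=80), 1),
    })

    fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

    # Without jitter, identical values are drawn on top of each other.
    sns.stripplot(x="group", y="value", data=df, jitter=False, ax=ax_strip)
    ax_strip.set_title("Strip plot (no jitter)")

    # The swarm plot spreads identical values sideways so each point is visible.
    sns.swarmplot(x="group", y="value", data=df, size=4, ax=ax_swarm)
    ax_swarm.set_title("Swarm plot")

    fig.tight_layout()
    return fig


if __name__ == "__main__":
    build_comparison()
    plt.show()
```

The left panel hides how many books share a rating, while the right panel makes the count at each value visible.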

### Box Plots

Box plots are so named because of their box-like appearance. They summarize data through its quartiles: the box spans the first to the third quartile, with a line at the median. The box plot is a simple plot and it is created using the boxplot() function.

The boxplot() function can plot the distribution of a single variable (univariate) or of a numeric variable grouped by a categorical one (bivariate), so it accepts either one or two variables in its parameters.

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.boxplot(x='Genre', y='Reviews', data=amzbooks)
plt.show()
```

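The vertical lines extending from the boxes are called whiskers. They show the spread of the data outside the box, and points beyond the whiskers are drawn individually as outliers.

We can also pass a whole DataFrame through the data parameter, in which case each numeric column gets its own box. Here we first exclude the Price column using the drop() method.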
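As a rough sketch of what the box and whiskers represent, the numbers behind a box plot can be computed by hand. The example assumes the common default of whiskers reaching to the furthest data point within 1.5 times the interquartile range; the `box_stats` helper and the sample values are made up for illustration.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Compute quartiles, whisker ends and outliers for a 1-D sample."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Whiskers stop at the furthest data points inside the fences.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = data[(data >= lower_fence) & (data <= upper_fence)]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": data[(data < lower_fence) | (data > upper_fence)].tolist(),
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this sample the value 50 lies beyond the upper fence, so it would be drawn as a separate outlier point rather than stretching the whisker.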

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
# drop() returns a new DataFrame without the Price column
books_without_price = amzbooks.drop(['Price'], axis=1)
# New boxplot: one box per remaining numeric column
sns.boxplot(data=books_without_price)
plt.show()
```

This demo produces one box for each remaining numeric column in the DataFrame: User Rating, Reviews and Year. The Price column does not appear because we dropped it.

We can further customize the boxplot by adding colors and controlling the box width using the palette and width parameters.
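To preview which columns a call like this will draw, you can list the numeric columns that remain after the drop. This is a sketch on placeholder rows (not real data), and it assumes that boxplot() only draws numeric columns when it is given a whole DataFrame.

```python
import pandas as pd

# Placeholder rows that mimic the assumed column layout of the dataset.
books = pd.DataFrame({
    "Author": ["Author A", "Author B", "Author C"],
    "User Rating": [4.7, 4.5, 4.8],
    "Reviews": [1200, 300, 950],
    "Price": [8, 12, 15],
    "Year": [2016, 2018, 2019],
    "Genre": ["Non Fiction", "Fiction", "Fiction"],
})

# Remove Price, then keep only the columns with a numeric dtype.
without_price = books.drop(columns=["Price"])
numeric_columns = without_price.select_dtypes(include="number").columns.tolist()

print(numeric_columns)
```

Because Reviews spans a much larger range than User Rating, the boxes for the smaller-scale columns may look flattened when they share one axis.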

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
amzbooks = pd.read_csv('bestsellers_with_categories.csv', index_col=0)
sns.boxplot(x='Genre', y='Reviews', data=amzbooks, width=0.3, palette="Blues")
plt.show()
```

## Conclusion

The Seaborn library opens the door to a wide range of statistical data visualizations.

We touched on a few plots in this article, but Seaborn has many more, each with a wide range of customization options, helping us create visualizations that fit professional, brand and personal preferences. With this tutorial, we believe you can get started on your Seaborn journey.

If you're interested in learning about data visualization in depth, take our Real World Data Science with Python course. This course teaches you how to programmatically generate graphs and charts from existing datasets. You'll also learn Python fundamentals, and how to use libraries like Matplotlib, Pandas, NumPy and Seaborn.
