The basics for reading in shapefile data, plotting it, and adding a basemap.
Python is a powerful programming language for spatial analyses. In this tutorial, we learn the basics of plotting shapefiles overlaid on top of a basemap, which gives us spatial context and opens doors for deeper analysis. Understanding where points, lines, and polygons exist in the world empowers data scientists to answer questions about correlated variables, consider new questions that arise throughout the analysis process, and communicate conclusions to audiences in an easily digestible way.
We start with the basics: installing the integrated development environment (IDE) Jupyter Notebook and installing the necessary packages. If you prefer to use another IDE such as Visual Studio Code and have already installed the geopandas package, feel free to jump ahead to the Importing Packages section.
As a conservation biologist and an aspiring data scientist, I’m interested in plotting endangered species’ critical habitat ranges throughout the United States. I sourced my data from the U.S. Fish and Wildlife Service here. I’m a big fan of these federal datasets because they are usually relatively clean and provide comprehensive metadata.
Let’s map the critical habitat ranges of the Florida bonneted bat and many other species of concern on a basemap of the United States.
Transitioning from R and RStudio to Python and Jupyter can be difficult, as is learning any new language or IDE. We have to say goodbye to the exceptionally user-friendly RStudio interface, with its convenient graphical user interface and easily accessible console and terminal windows. Let’s dive into the object-oriented language of Python and the mysterious terminal. Personally, I struggled through every step of this journey with lots of syntax errors, merge conflicts, and endless Googling. I hope this post will help hold your hand as we make this transition together.
First things first: if you’re using Jupyter Notebooks to code in Python, you’re gonna have to install Python and Anaconda Navigator. Install Anaconda into your home directory. (Note: there are plenty of other IDEs for Python, such as Visual Studio Code, which I adopted later in my data science journey.)
As I had a PC when I first started coding in Python, installing Anaconda really threw me and my laptop for a loop. But what better way to learn to use the terminal than struggling to install stuff you desperately need for graduate school? Errors and troubleshooting are what make us stronger data scientists, even if we don’t want to recognize that as we fight the urge to throw our computer out the window. Click here for a great resource to walk you through installing Anaconda. Either Anaconda or Miniconda will work fine whether you have a PC or Mac.
After installing Anaconda, start your journey plotting shapefiles in Python by opening up Jupyter Notebook. On a Mac, I can activate conda in the regular terminal (conda activate). On a PC, I like to do this from the Anaconda Prompt (the Anaconda terminal) because it’s tricky to get my normal command line or Bash to recognize that conda is indeed on my computer. My favorite workflow is as follows:
- Install geopandas with conda install geopandas in your base environment (which is the default). If that doesn’t work (which would not surprise me if you’re on a PC), create a new environment to do so. I found the steps on this website to be helpful: https://medium.com/analytics-vidhya/fastest-way-to-install-geopandas-in-jupyter-notebook-on-windows-8f734e11fa2b. I tried installing geopandas in my base environment, but it was difficult to install all the correct versions of all the dependencies, so I decided to take the easy route and just make a new environment for geopandas and any other spatial analysis packages I’ll need.
- Activate the environment in which you have installed the geopandas package. I named that environment geo_env, so I type activate geo_env.
- Run cd file/path/to/folder to move into the folder that holds your notebook and data. You know this worked if your terminal working directory now has that file path tacked onto the end. This step is only needed if you want to use relative file paths to import data and save your notebook.
- Run jupyter notebook. This will tell your terminal to pop open Jupyter Notebook in your browser with your folder of choice already open and ready to go. Remember that you need to launch it from the geo_env environment to do so.
- Import your packages (modules).
)For plotting shapefiles, you’ll want to import the following packages:
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt # plot our data and manipulate the plot
import contextily as ctx # add a default basemap under our polygons of interest
If you do not already have contextily installed, use a conda-forge command (conda install -c conda-forge contextily) to install it from the terminal in your geospatial environment.
As a proponent of reproducibility and crediting those who provided the data, I like to include a markdown chunk following my package imports that includes a URL link to where I found my data, along with a citation if necessary and any notes about how I downloaded it for myself or anyone else working through my code:
Data source: US Fish and Wildlife
- contains .shp, .dbf, and .shx polygon files that relate to critical species habitat in the United States
- I chose the first zip file you see on this site
- USFW provided great metadata
Let’s import your data! Now is the time you’re gonna thank yourself for placing your data in the same folder as your Jupyter notebook. We will use geopandas to read in the shapefile with your polygons or lines or points of choice (you will not find a combination of these shapes in the same shapefile, because that’s just how the world works). You might take a look at all the data files and feel a little overwhelmed at the choices, because a shapefile and its metadata are stored as separate files (.shp, .dbf, .shx, .xml, and so on). In this example we are importing a shapefile of polygons, so the .shp file is the only one you need to read in; geopandas finds the companion files automatically, as long as they stay in the same folder:
gdf = gpd.read_file('CRITHAB_POLY.shp')

# Take a look at the first rows:
print(gdf.head())

# Ask Python how many rows and columns are in this dataframe:
print(gdf.shape)
My only complaint with the head() function is that it returns the first rows in a plain text format. If you want to see the first and last few rows of the dataframe in a format that looks more familiar (like how RStudio presents dataframes), try just typing the name of the dataframe, gdf, in a cell by itself:
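gdf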
This shapefile I read in contains polygons that designate the critical habitat ranges for many endangered and threatened species in the United States. I chose to name it gdf for geodataframe. Expect that you will be modifying this dataframe as you go through this mapping process (subsetting columns, filtering for certain factor levels, etc.), so you will likely be tacking on more words to gdf to tell these modified versions apart. Start naming things simply and clearly, and get more specific as you process your data.
Practice simple commands like calling all the column names in the dataframe:
print(gdf.columns)
Check the factor levels of the column listing_st:

status_levels = gdf.listing_st.unique()
status_levels
Using U.S. Fish and Wildlife as an example, now that you know the factor levels of a categorical variable, you can subset for only “endangered” species, only “threatened” species, etc.
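For example, a minimal sketch of that subsetting (assuming the level is spelled ‘Endangered’; check the output of status_levels for the exact strings in your copy of the data):

# Keep only the rows whose listing status is 'Endangered' (assumed spelling)
gdf_endangered = gdf[gdf['listing_st'] == 'Endangered']
print(gdf_endangered.shape)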
Play around with your dataframe a bit; Google some of the species, subset the columns, search for NA values, or take the average of a column. After you make a structural change, it’s a good habit to check the status or dimensions of your dataframe.
Check the number of rows and columns, and print the latitude and longitude pairs that make up a particular polygon.
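Here is a sketch of both checks (this assumes the first row’s geometry is a simple Polygon; a MultiPolygon exposes its parts through .geoms):

# Rows and columns after any structural change
print(gdf.shape)

# Longitude/latitude pairs tracing the first polygon's outer boundary
first_geom = gdf.geometry.iloc[0]
print(list(first_geom.exterior.coords))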
As a last step before you plot, you have to make sure you set the data to the desired coordinate reference system (CRS). This is pretty standard for all kinds of spatial data, since your data might come with an unfamiliar CRS or have no CRS at all if you are making a mask, a raster, or executing similar geospatial processes. For information about coordinate reference systems, check out this guide.
Three common CRSs are as follows:
- EPSG 4326: WGS 84, the latitude/longitude system used by GPS.
- EPSG 4269: NAD 83, common for United States federal datasets.
- EPSG 3857: Web Mercator, the projection used by web map tiles, including the default basemap from contextily.
Set the CRS to EPSG 3857. This code may take a minute to run. In Jupyter Notebook, you know that code chunk is still chuggin’ away if you see an asterisk in brackets to the left of the code chunk.
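The conversion itself is one line with geopandas’ to_crs(); the cells that follow assume the reprojected copy is named gdf_3857:

# Reproject the polygons to Web Mercator (EPSG 3857) so they line up with the basemap tiles
gdf_3857 = gdf.to_crs(epsg=3857)
print(gdf_3857.crs)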
Use the plot() function from matplotlib and make the polygon color represent each species:
gdf_3857.plot(column='comname',
              figsize=(20,20))
Note that you did not have to call the package to use the plot() function. Instead, you name the dataframe you want to plot, gdf_3857 in this case, then call plot() and add arguments and supplemental plot structure changes as you go.
The figsize can be whatever you want; 10 to 20 is usually good enough, and you have finer control over the degree of zoom with the xlim() and ylim() arguments anyway. These polygons are just floating in space, so let’s add a basemap to give us geographical context:
crit_hab_polys = gdf_3857.plot(column='comname',
                               figsize=(20,20))
Notice that I used an argument in the plot function, setting column='comname', which is a column within the gdf_3857 geodataframe that specifies the common name of the species in that row. This argument assigns a unique color to each common name, which will help us tell the difference between each species’ habitat on the map, even if one species’ habitat is composed of multiple polygons.
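If you want a key mapping those colors back to species, geopandas’ plot() also accepts a legend argument, though with this many species the legend can get unwieldy (a sketch):

# Same plot, with a legend keyed to common name (may be very long for this dataset)
gdf_3857.plot(column='comname', figsize=(20,20), legend=True)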
ctx.add_basemap(crit_hab_polys)
# Set the axis limits to zoom in on just the lower 48 states, rather than viewing the entire world:
plt.ylim([2350000, 6350000])
Since the basemap within the contextily package is of the entire world, we need to specify the x-limits and y-limits for our map so we can zoom in on the United States to best understand our data. The default x and y units were in the millions, so I specified my units in millions, too. When deciding whether to plug in positive or negative values, I considered the way coordinate reference systems are designed, with positive values for north and east and negative values for south and west. The United States is north of the equator, so the min and max y should be positive. As for the magnitude of my values, I simply looked at the map for a starting point and played around with different numbers until I got the view I wanted.
plt.xlim([-14000000, -6000000])
Notice that these values are negative. Following the same logic we used for the y-limits, these negative values are the result of how coordinate systems are designed. Consider the prime meridian (which lies at 0 degrees in the east-west direction), with west being negative. Since the United States is west of the prime meridian, we know that the x-range for our zoom should be negative. As for the magnitude, I just played around with the numbers until I got the east-west window that encompassed the United States. Use the show() function in matplotlib to tell Python to show the most recently created graph:
plt.show()
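For reference, here is the whole plotting cell assembled in one place, exactly as built up above:

# Plot the reprojected polygons, colored by species common name
crit_hab_polys = gdf_3857.plot(column='comname', figsize=(20,20))

# Add the contextily basemap underneath the polygons
ctx.add_basemap(crit_hab_polys)

# Zoom in on the lower 48 states (units are Web Mercator meters)
plt.ylim([2350000, 6350000])
plt.xlim([-14000000, -6000000])

plt.show()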
You did it! Welcome to the wonderful world of geospatial data in Python.
With this basic skill mastered, you can now dive deeper into this data to determine if variables are correlated across space. Considering state borders, you might ask which endangered species occupy critical habitat in your home state, and investigate the different wildlife protection policies across the United States. Alternatively, you could approach this data from an environmental perspective and ask which natural biomes contain the most critical habitat for these endangered species. Are these habitats degrading at a faster rate than those that contain less critical habitat?
Living in Santa Barbara, I’m interested in the critical habitat of the Southern California steelhead trout. The steelhead is a beautiful fish that interbreeds with resident rainbow trout in many coastal regions throughout California, and its populations are split into distinct population segments.
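To map just that habitat, you can filter the geodataframe by common name (a sketch; the exact comname string for the Southern California steelhead needs to be checked against gdf.comname.unique()):

# Hypothetical filter: match any common name containing 'steelhead'
steelhead = gdf_3857[gdf_3857['comname'].str.contains('steelhead', case=False, na=False)]
ax = steelhead.plot(figsize=(10, 10))
ctx.add_basemap(ax)
plt.show()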
I was lucky enough to conduct field work with these migratory fish in 2020-2021 through the California Department of Fish and Wildlife and the Pacific States Marine Fisheries Commission. I grew to appreciate their vital role in ecosystem processes and in the culture of indigenous people who have interacted with them for centuries. This fish is currently in the process of being listed as a California endangered species starting in December of 2021, which will hopefully expand its critical habitat range, increase monitoring of the populations, and strengthen enforcement of regulations against illegal fishing and pollution in its habitat. We can check out the steelhead’s critical habitat range before and after 2021 to see how it expands over time and space.