Vector data

Authors: David Shean & Scott Henderson

SnowEx Hackweek July 13, 2021

Learning Objectives

A 30 minute guide to vector data for SnowEX Hackweek

  • find, visualize, interpret vector data formats

  • use Python raster libraries geopandas

GeoPandas

pandas is great, but what if we want to do some geospatial operations - like reproject our vector data or compute the intersection between Point and Polygon features?

Enter Geopandas - all the great things about pandas, plus geo! (http://geopandas.org/).

“GeoPandas is an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. Geopandas further depends on fiona for file access and descartes and matplotlib for plotting.”

“GeoPandas enables you to easily do operations in python that would otherwise require a spatial database such as PostGIS.”

Under the hood, GeoPandas is pandas plus some other core geospatial packages:

Under those hoods are lower-level geospatial libraries (GEOS, GDAL/OGR, PROJ), written in C/C++/Java and compiled (fast!), that provide a foundation for most GIS software (open-source and commercial). I encourage you to explore these - I guarantee you will learn something valuable.

For now, let’s explore some basic geopandas functionality.

import geopandas as gpd
import hvplot.pandas
import holoviews as hv

GeoDataFrame

gpd.GeoDataFrame?

Coordinate Reference Systems (CRS)

These are essential for geospatial analysis, but can be confusing (even for seasoned GIS professionals). Fortunately, most applications use common CRS with a shared database of EPSG codes. https://en.wikipedia.org/wiki/EPSG_Geodetic_Parameter_Dataset

Here are some common CRS you will encounter:

  • EPSG:4326 - WGS84, geodetic latitude and longitude

  • EPSG:3857 - “web mercator”, used by most tiled basemaps and online/mobile mapping platforms, https://en.wikipedia.org/wiki/Web_Mercator_projection

    • Usually OK for display purposes, maybe some analysis at lower latitudes, but considerable distortion at high latitudes

  • UTM zones - convenient compromise projections for smaller areas (<500 km)

    • EPSG:326XX - where XX is zone number (e.g., 12), WGS84 ellipsoid

    • EPSG:269XX - where XX is zone number, NAD83(2011) ellipsoid

Key takeaways for now:

  • Make sure your dataset has a CRS defined (consult metadata if unspecified)

  • While ArcGIS/QGIS can reproject datasets to a common CRS “on the fly”

  • GeoPandas/Shapely/GEOS performs calculations using 2D Cartesian coordinates (e.g., computing distance with Pythagorean theorem, not true 3D geodetic distance on surface of the Earth). Choice of CRS for these calculations is important (e.g., use equal-area projection for area calculations)

  • Most of the time, you will use one of the above CRS definitions. For larger regional analysis, consider a custom projection defined by your domain and planned analysis: https://projectionwizard.org/

Let’s use some Polygons for US States

Hmmm, let’s see. Two choices:

  1. We could go to ESRI or the U.S. Census website, identify and download a shapefile, unzip 4+ files, copy/paste the appropriate *.shp filename into the notebook. Wait, how can I download on a remote server? OK, maybe run something like wget http://..., unzip, provide absolute path
    - OR -

  2. Give geopandas a url string that points to a GeoJSON file somewhere on the web, and read dynamically

Yeah, let’s go with #2

Let’s use the US States 5M GeoJSON here: http://eric.clst.org/tech/usgeojson/

  • JSON (JavaScript Object Notation) is a structured text-based data format. A Jupyter notebook is a json text file.

  • GeoJSON extends this format to include geospatial information. It’s pretty great. If you are unfamiliar, take a moment to read about GeoJSON: https://en.wikipedia.org/wiki/GeoJSON

Take a look at the 5M file contents in your browser or download and open with a text editor. Note organization structure. How does this compare to, say, a Python dictionary object?

This is a GeoJSON file!

Read the file using GeoPandas by passing the url to gpd.read_file().

states_url = 'http://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_040_00_5m.json'
#states_url = 'http://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_040_00_500k.json'
states_gdf = gpd.read_file(states_url)
states_gdf.head()
GEO_ID STATE NAME LSAD CENSUSAREA geometry
0 0400000US01 01 Alabama 50645.326 MULTIPOLYGON (((-88.12466 30.28364, -88.08681 ...
1 0400000US02 02 Alaska 570640.950 MULTIPOLYGON (((-166.10574 53.98861, -166.0752...
2 0400000US04 04 Arizona 113594.084 POLYGON ((-112.53859 37.00067, -112.53454 37.0...
3 0400000US05 05 Arkansas 52035.477 POLYGON ((-94.04296 33.01922, -94.04304 33.079...
4 0400000US06 06 California 155779.220 MULTIPOLYGON (((-122.42144 37.86997, -122.4213...
states_gdf.plot()
<AxesSubplot:>
../../_images/vector_10_1.png

Check the CRS

  • Note that this was defined when we opened the file with GeoPandas - by default, a GeoJSON is assumed to use EPSG:4326 coordinate system for geodetic latitude and longitude

states_gdf.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
states_gdf.hvplot()
states_gdf.hvplot(geo=True)