PHC6194 SPATIAL EPIDEMIOLOGY

Spatial Data Engineering and Linkage

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

January 30, 2019

Spatial Data Engineering and Linkage
 

Lab: PostGIS Part 2

Spatial Data Engineering

  • Geometry and geography functions
     
  • Geometry relationships
     
  • Proximity analysis
     
  • Geometry and geography processing

Geometry and Geography Functions

  • Output functions
    - output data in various standard formats
     
  • Constructor functions
    - create PostGIS objects from well-known formats
     
  • Accessor and setter functions
    - work against a single spatial object and return or set attributes of the object
     
  • Measurement functions
    - return scalar measurements of a spatial object
     
  • Decomposition functions
    - extract other spatial objects from an input object

Common Formats and Output Functions

  • Well-known binary (WKB) and well-known text (WKT)
    -  the most common formats for spatial objects
    -  WKT: ST_AsText and ST_AsEWKT
    -  WKB:  ST_AsBinary and ST_AsEWKB
     
  • Keyhole Markup Language (KML)
    -  an XML-based format, used by Google
    -  SRS is always SRID 4326
    -  ST_AsKML
     
  • Geography Markup Language (GML)
    -  an XML-based format used in Web Feature Service
    -  ST_AsGML
     
  • Geographic JavaScript Object Notation (GeoJSON)
    -  a format based on JSON
    -  ST_AsGeoJSON
     
  • Scalable Vector Graphics (SVG)
    -  popular among high-end rendering or drawing tools
    -  ST_AsSVG
     
  • Extensible 3D Graphics (X3D)
    -  ST_AsX3D
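
As a rough illustration of what some of these output formats look like, the following plain-Python sketch (not PostGIS itself) serializes a single lon/lat point to WKT, EWKT, and GeoJSON. The function names and the sample coordinates are hypothetical and chosen only for this example.

```python
import json

def point_to_wkt(x, y):
    """WKT text for a 2D point, similar in spirit to ST_AsText."""
    return f"POINT({x} {y})"

def point_to_ewkt(x, y, srid):
    """EWKT prefixes the WKT with the SRID, similar in spirit to ST_AsEWKT."""
    return f"SRID={srid};{point_to_wkt(x, y)}"

def point_to_geojson(x, y):
    """GeoJSON geometry object as a string, similar in spirit to ST_AsGeoJSON."""
    return json.dumps({"type": "Point", "coordinates": [x, y]})

print(point_to_wkt(-82.34, 29.64))
print(point_to_ewkt(-82.34, 29.64, 4326))
print(point_to_geojson(-82.34, 29.64))
```

The only difference between the WKT and EWKT strings is the SRID prefix, which is why the extended variants are preferred when the spatial reference must travel with the data.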

Shapefiles

.shp - the file that stores the geometry of the feature

.shx - the file that stores the index of the feature geometry

.dbf - the dBASE file that stores the attribute information

.prj  - the file that defines the shapefile's projection

.html, .htm, .xml - the files that usually contain metadata

.sbn and .sbx - store additional indices 

Constructor Functions

  • Two common ways:
    -  Build new spatial objects from scratch using raw data in various formats
    -  Utilize existing spatial objects and decompose, splice, slice, dice, or morph them to form new ones
     
  • Create geometries from text and binary formats
    -  ST_GeomFromText, ST_GeomFromWKB, ST_GeomFromEWKB, ST_GeomFromGML, ST_GeomFromGeoJSON, ST_GeomFromKML
     
  • Create geographies from text and binary formats
    -  ST_GeogFromText, ST_GeogFromWKB
    -  for other formats such as KML, GML, and GeoJSON, use the corresponding geometry constructor and cast the result to geography

Accessor and Setter Functions

  • Any function that accesses or sets the intrinsic properties of an object
     
  • A few defining characteristics of spatial objects:
    -  spatial reference identifiers: SRID
    -  subtype: the finer categorization of geometry and geography types, such as points, polygons, etc.
    -  coordinate dimension: the dimension of the vector space in which your geometry lives, which can be 2, 3, or 4
    -  geometric dimension: minimal dimension of the vector space necessary to fully contain the geometry, which can be 0 (points), 1 (linestrings), or 2 (polygons)

Accessor and Setter Functions
SRID and Transformation for Geometry

  • ST_SRID and ST_SetSRID
    -  retrieves and sets the SRID
     
  • ST_Transform
    -  transform geometry to different spatial references
    -  e.g. take a geometry in lon/lat and transform it to a planar SRS so that you can take meaningful measurements
     
  • Differences between ST_SetSRID and ST_Transform
    -  ST_SetSRID doesn't change the coordinates of a geometry; it only sets the SRID attribute, which is useful when you realize that you assigned the wrong SRID during data import
    -  ST_Transform reprojects the geometry, changing its coordinates to match the new spatial reference
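
The difference between relabeling and reprojecting can be sketched in plain Python. This is not PostGIS code: the dictionary representation, the function names, and the sample point are assumptions for illustration, and the projection shown is the standard spherical Web Mercator formula (SRID 3857).

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by Web Mercator

def set_srid(geom, srid):
    """Relabel only: the coordinates are untouched (the idea behind ST_SetSRID)."""
    return {"srid": srid, "coords": geom["coords"]}

def transform_4326_to_3857(geom):
    """Reproject lon/lat degrees to Web Mercator meters (the idea behind ST_Transform)."""
    if geom["srid"] != 4326:
        raise ValueError("this sketch only handles SRID 4326 input")
    lon, lat = geom["coords"]
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return {"srid": 3857, "coords": (x, y)}

pt = {"srid": 0, "coords": (-82.34, 29.64)}  # imported without an SRID
fixed = set_srid(pt, 4326)                   # same numbers, corrected label
projected = transform_4326_to_3857(fixed)    # new numbers, now in meters
print(fixed)
print(projected)
```

Relabeling is only correct when the coordinates already are in the target system; otherwise the geometry has to be transformed.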

Accessor and Setter Functions
Using transformation with the geography type

  • The geography type does not have ST_Transform, ST_SetSRID, or ST_SRID functions
    -  because it always uses WGS84 lon/lat
     
  • However, the ST_Transform function is crucial when working with geography type
    -  e.g. if you want to use geometry functions that are not available for geography, you can cast objects to geometry, transform them to a planar SRS, use the geometry function, and then transform back to WGS84 lon/lat and cast back to geography

Accessor and Setter Functions
Geometry Type Function

  • When importing data with heterogeneous geometry columns, you may not be aware of the geometry types.
     
  • GeometryType and ST_GeometryType

Accessor and Setter Functions
Geometry and Coordinate Dimensions

  • ST_CoordDim
    -  coordinate dimension
    -  the dimension of the space that the geometry lives in
     
  • ST_Dimension
    -  geometry dimension
    -  the smallest dimensional space that will fully contain the geometry

Measurement Functions

  • Planar measurements
    -  treats the earth as essentially flat
    -  generally in units of meters or feet
    -  better supported by GIS tools and are faster to process
     
  • Geodetic measurements
    -  needed once measurements start to span continents and oceans
    -  consider the spherical nature of the earth

Measurement Functions
Geometry Planar Measurements

  • All the planar measurement functions return values in the units of the SRS that's defined for the geometry
     
  • Common functions:
    -  ST_Length and ST_3DLength
    -  ST_Area and ST_3DArea (ST_3DArea requires the SFCGAL extension)
    -  ST_Perimeter and ST_3DPerimeter: calculate the length of all the rings for multi-ringed polygons
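
The planar measurements above reduce to simple coordinate arithmetic. The following plain-Python sketch (an illustration, not the PostGIS implementation) computes length, area, and perimeter for a square with a square hole; the function names and the sample polygon are hypothetical. Results are in whatever units the coordinates use.

```python
import math

def length(line):
    """Sum of segment lengths, in the units of the coordinates."""
    return sum(math.dist(a, b) for a, b in zip(line, line[1:]))

def ring_area(ring):
    """Unsigned shoelace area of a closed ring (first vertex == last vertex)."""
    s = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(s) / 2

def polygon_area(exterior, holes=()):
    """Area of the exterior ring minus the area of every interior ring."""
    return ring_area(exterior) - sum(ring_area(h) for h in holes)

def polygon_perimeter(exterior, holes=()):
    """Length of all rings, exterior and interior."""
    return length(exterior) + sum(length(h) for h in holes)

outer = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]
print(polygon_area(outer, [hole]))       # 96.0
print(polygon_perimeter(outer, [hole]))  # 48.0
```

Note that the hole reduces the area but adds to the perimeter, which matches the remark that the perimeter covers all rings.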

Measurement Functions
Geodetic Measurements

  • Measurement functions applied to geography type objects return geodetic measurements, which account for the curvature of the earth
     
  • If you have geometry type objects in lon/lat, you can use the sphere and spheroid family of geometry functions to take advantage of geodetic computation
    -  e.g. ST_LengthSpheroid
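
To see why planar arithmetic on lon/lat coordinates is misleading, the sketch below compares a naive planar distance in degrees with a great-circle distance in meters. It uses the haversine formula on a sphere with an assumed mean radius, which is a simpler model than the spheroidal computation PostGIS can perform; the function names are hypothetical.

```python
import math

MEAN_EARTH_RADIUS_M = 6371008.8  # assumed mean radius; a sphere, not a spheroid

def planar_distance(a, b):
    """Euclidean distance in coordinate units (degrees for lon/lat input)."""
    return math.dist(a, b)

def haversine_m(a, b):
    """Great-circle distance in meters between (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * MEAN_EARTH_RADIUS_M * math.asin(math.sqrt(h))

# One degree of longitude at the equator and at 60 degrees north.
equator = ((0.0, 0.0), (1.0, 0.0))
north = ((0.0, 60.0), (1.0, 60.0))
print(planar_distance(*equator), planar_distance(*north))  # both 1.0 "degree"
print(haversine_m(*equator))  # roughly 111 km
print(haversine_m(*north))    # roughly half of that
```

The planar result is identical for both pairs, while the geodetic result shrinks by about half at 60 degrees north.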

Decomposition Functions
Bounding Box of Geometries

  • Often when comparing the relative spatial relationships of two or more geometries, the question can be sufficiently answered much more quickly by comparing the bounding boxes of the geometries
    -  you only need to work with rectangles and can ignore the details of the geometries within
     
  • The bounding box of a 2D geometry is a box2d object (there is also a box3d object for 3D geometries)
     
  • All geometries have boxes, even points
    -  boxes are not geometries, but you can cast boxes into geometries
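
The reason bounding-box comparisons are cheap can be shown in a few lines of plain Python. This is an illustrative sketch with hypothetical names and sample data, not PostGIS code: once each geometry is reduced to four numbers, an overlap test needs only four comparisons.

```python
def bbox(coords):
    """Return (xmin, ymin, xmax, ymax) for a sequence of (x, y) vertices."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

def bboxes_overlap(a, b):
    """True when two boxes share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

triangle = [(1, 5), (4, 2), (3, 8)]
far_square = [(10, 10), (12, 10), (12, 12), (10, 12)]
single_point = [(3, 4)]

print(bbox(triangle))      # (1, 2, 4, 8)
print(bbox(single_point))  # (3, 4, 3, 4): even a point has a (degenerate) box
print(bboxes_overlap(bbox(triangle), bbox(far_square)))    # False
print(bboxes_overlap(bbox(triangle), bbox(single_point)))  # True
```

A False result settles the question immediately; a True result only means the detailed geometries still need to be compared.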

Decomposition Functions
Boundaries and Converting Polygons to Linestrings

  • ST_Boundary
    -  returns the geometry that determines the separation between the points in the geometry and the rest of the coordinate space
    -  a common use is to break apart polygons and multipolygons into their constituent rings
     
  • ST_ExteriorRing and ST_InteriorRingN

Decomposition Functions
Centroid and Point on Surface

  • ST_Centroid
    -  you can think of the centroid of a geometry as the center of gravity, as if every point in the geometry had equal mass
    -  the centroid may not lie within the geometry itself
     
  • ST_PointOnSurface
    -  returns a point that is guaranteed to lie on the geometry, unlike the centroid
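
A centroid falling outside its geometry is easy to reproduce. The plain-Python sketch below (hypothetical names, not the PostGIS implementation) computes the area-weighted centroid of a U-shaped polygon and then checks, with a standard ray-casting test, that the centroid lands in the notch rather than inside the polygon.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a closed ring (first vertex == last vertex)."""
    area2 = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        cross = x1 * y2 - x2 * y1
        area2 += cross                 # twice the signed area
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return (cx / (3 * area2), cy / (3 * area2))

def point_in_polygon(pt, ring):
    """Ray casting: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A 3x3 square with a 1x2 notch cut from the top, forming a "U".
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3), (0, 0)]
c = polygon_centroid(u_shape)
print(c)
print(point_in_polygon(c, u_shape))           # False: the centroid sits in the notch
print(point_in_polygon((0.5, 2.0), u_shape))  # True: a point in the left arm
```

This is the situation in which a point-on-surface function is the safer choice, for example when placing a label that must fall inside the polygon.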

Decomposition Functions
Returning points defining a geometry

  • ST_PointN
    -  only works with linestrings and circularstrings
    -  returns the nth point on the linestring, with indexing starting at 1
     
  • ST_DumpPoints
    -  if you want to extract all or many points of a geometry
    -  returns a set of geometry_dump objects which have two components: a one-dimensional path array (lists the sequence in which the points were dumped) and a geometry (always a point in this case)

Decomposition Functions
Decomposing Multi-geometries and Geometry Collections

  • ST_Dump
    -  recursively dumps all contained geometries
    -  returns a set of geometry_dump objects
     
  • ST_GeometryN
    -  drills down only a single level
    -  extracts the nth geometry from a multi-geometry or collection geometry
    -  returns a single extracted geometry, doesn't recurse, and therefore doesn't report depth
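
The contrast between a recursive dump and a single-level lookup can be sketched in plain Python. The nested-tuple representation and the function names below are assumptions made for this example; the path arrays mimic the 1-based paths described for geometry_dump objects.

```python
COLLECTION_TYPES = {"MULTIPOINT", "MULTILINESTRING", "MULTIPOLYGON", "GEOMETRYCOLLECTION"}

def dump(geom, path=()):
    """Recursively yield (path, simple_geometry) pairs, like the idea behind ST_Dump."""
    kind, body = geom
    if kind in COLLECTION_TYPES:
        for i, child in enumerate(body, start=1):
            yield from dump(child, path + (i,))
    else:
        yield list(path), geom

def geometry_n(geom, n):
    """Return the nth direct child (1-based) without recursing, or None if out of range."""
    kind, body = geom
    if kind not in COLLECTION_TYPES:
        return geom if n == 1 else None
    return body[n - 1] if 1 <= n <= len(body) else None

multipoint = ("MULTIPOINT", [("POINT", (1, 1)), ("POINT", (2, 2))])
collection = ("GEOMETRYCOLLECTION", [("POINT", (0, 0)), multipoint])

for path, g in dump(collection):
    print(path, g)
print(geometry_n(collection, 2))  # the whole MULTIPOINT, not its points
```

The dump reaches the two points nested inside the multipoint and reports their depth through two-element paths, while the single-level lookup stops at the multipoint itself.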

Lab: PostGIS Part 2

Moving beyond Single Geometries
Geometry Relationships

  • Bounding boxes
     
  • Intersections
     
  • Relationships
     
  • The meaning of equality

Bounding Box

  • Bounding boxes
    -  the smallest rectangular box with edges parallel to the axes of the coordinate plane that completely encloses the object
    -  box-based comparisons embedded in PostGIS make relationship queries really fast
     
  • Example:
    -  check whether the state of Washington is northwest of Florida

Geometry Comparators

Intersections

  • Interior, exterior, and boundary of a geometry
    -  Interior: the space inside a geometry and not on the boundary
    -  Exterior: the space outside a geometry and not on the boundary
    -  Boundary: the space that's neither interior nor exterior
     
  • Intersections
    -  two geometries intersect when they have interior or boundary points in common
    -  the set of all shared points is called the intersection
    -  ST_Intersects: returns true or false
    -  ST_Intersection: returns the geometry of the intersected region

Relating Two Geometries

  • Contains
    -  when geometry A contains geometry B, no points of B lie in the exterior of A, and at least one point of B must lie in the interior of A
    -  if B lies only on the boundary of A, A does NOT contain B
    -  ST_Contains
     
  • Within
    -  contains and within are inverse relationships
    -  if A is within geometry B, then B contains A
    -  ST_Within

Relating Two Geometries (cont'd)

  • Covers
    -  like contains, but the boundary counts
    - when geometry A covers geometry B, no points of B lie in the exterior of A, and at least one point of B must lie in the interior or boundary of A
    -  if B lies only on the boundary of A, A DOES cover B
    -  ST_Covers
     
  • Covered by
    -  covers and covered by are inverse relationships
    -  ST_CoveredBy
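
The boundary case that separates contains from covers can be made concrete with a point and an axis-aligned rectangle. This plain-Python sketch uses hypothetical names and handles only that simple pairing; it is meant to illustrate the definitions, not to reproduce ST_Contains or ST_Covers in general.

```python
def locate(pt, rect):
    """Classify pt as 'interior', 'boundary', or 'exterior' of rect (xmin, ymin, xmax, ymax)."""
    x, y = pt
    xmin, ymin, xmax, ymax = rect
    if x < xmin or x > xmax or y < ymin or y > ymax:
        return "exterior"
    if x in (xmin, xmax) or y in (ymin, ymax):
        return "boundary"
    return "interior"

def contains(rect, pt):
    """A point is contained only when it lies in the interior."""
    return locate(pt, rect) == "interior"

def covers(rect, pt):
    """A point is covered when it does not lie in the exterior."""
    return locate(pt, rect) != "exterior"

rect = (0, 0, 10, 10)
for pt in [(5, 5), (10, 5), (11, 5)]:
    print(pt, locate(pt, rect), contains(rect, pt), covers(rect, pt))
```

The point (10, 5) on the edge is the interesting row: it is covered but not contained.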

Relating Two Geometries (cont'd)

  • Overlapping geometries
    -  two geometries overlap when they have the same geometry dimension, they intersect, and neither is completely contained in the other
    - ST_Overlaps
     
  • Touching geometries
    -  two geometries touch if they have at least one point in common and none of the common points lie in the interior of both geometries
    -  ST_Touches
     
  • Crossing geometries
    -  two geometries cross each other if they have some interior points in common but not all
    -  ST_Crosses
     
  • Disjoint geometries
    -  the antithesis of the intersects relationship
    -  two geometries are disjoint if they share no interior or boundary points
    -  ST_Disjoint
    -  ST_Disjoint cannot use an index and is therefore usually slower than ST_Intersects

Equality

  • Bounding-box equality
    -  the bounding boxes of the two geometries share the same space
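    -  in PostGIS versions before 2.4, this is what is tested when you use the = operator; since 2.4, = tests geometric equality and ~= tests bounding-box equality
    -  in those earlier versions, this also applies to deduping operations such as UNION, DISTINCT, and GROUP BY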
     
  • Spatial equality
    -  two geometries occupy the same space
    -  e.g. a linestring that starts at point A and runs to point B is spatially equal to a linestring that starts at point B and runs to point A
    -  ST_Equals
     
  • Geometric equality
    -  stronger than spatial equality and means that two geometries occupy the same space and have the same underlying representation
    -  important for routing
    -  ST_OrderingEquals
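
The three levels of equality can be told apart with three short linestrings. The plain-Python sketch below uses hypothetical names, and its spatial-equality check is a deliberate simplification: it only recognizes the same vertices in forward or reverse order, whereas a full implementation would also ignore redundant vertices.

```python
def bbox(line):
    xs = [x for x, _ in line]
    ys = [y for _, y in line]
    return (min(xs), min(ys), max(xs), max(ys))

def bbox_equal(a, b):
    """Weakest test: the bounding boxes coincide."""
    return bbox(a) == bbox(b)

def spatially_equal(a, b):
    """Simplified: same vertices, direction ignored (assumption for this sketch)."""
    return list(a) == list(b) or list(a) == list(reversed(b))

def ordering_equal(a, b):
    """Strongest test: same vertices in the same order."""
    return list(a) == list(b)

ab = [(0, 0), (2, 2)]
ba = [(2, 2), (0, 0)]     # same line, opposite direction
other = [(0, 2), (2, 0)]  # the other diagonal of the same box

for name, line in [("ba", ba), ("other", other)]:
    print(name, bbox_equal(ab, line), spatially_equal(ab, line), ordering_equal(ab, line))
```

The reversed line passes the first two tests but fails the third, which is why direction-sensitive work such as routing needs the strictest comparison. The other diagonal passes only the bounding-box test.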

Proximity Analysis

  • How far something is located from something else:
    -  how far is my house from the nearest expressway?
    -  how many burger joints are within a mile drive?
    -  what's the average distance that people have to commute to work?
     
  • Nearest neighbor searches
     
  • KNN distance operators
     
  • Using KNN with geography
     
  • Geotagging

Nearest Neighbor Searches

  • Which places are within X distance?
    -  ST_DWithin
    -  can be used on both geometry and geography types
     
  • What are the N closest places?
    -  use ST_DWithin with ST_Distance (through ORDER BY)
     
  • Find the closest locations
    -  use ST_DWithin and DISTINCT ON to find closest locations
    -  DISTINCT ON performs an implicit GROUP BY, but it's not limited to returning just the fields that you grouped on
    -  DISTINCT ON (expression) only keeps the first row of each set of rows where the given expression evaluates to equal
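
The "filter by distance, sort, keep the first row per group" pattern can be mimicked in plain Python. The names and the sample houses and clinics below are hypothetical; the comments map each step to the SQL construct it stands in for.

```python
import math

def closest_within(origins, candidates, radius):
    """For each origin, return the nearest candidate no farther than radius."""
    pairs = []
    for o_id, o in origins.items():
        for c_id, c in candidates.items():
            d = math.dist(o, c)
            if d <= radius:                    # plays the role of ST_DWithin
                pairs.append((o_id, c_id, d))
    pairs.sort(key=lambda p: (p[0], p[2]))     # ORDER BY origin, distance
    closest = {}
    for o_id, c_id, d in pairs:
        closest.setdefault(o_id, (c_id, d))    # DISTINCT ON (origin): first row wins
    return closest

houses = {"h1": (0, 0), "h2": (10, 0), "h3": (100, 100)}
clinics = {"c1": (1, 0), "c2": (8, 0), "c3": (3, 0)}
print(closest_within(houses, clinics, radius=5))
```

House h3 has no clinic within the radius, so it does not appear in the result, just as a row with no ST_DWithin match would drop out of an inner join.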

Nearest Neighbor Searches (cont'd)

  • Intersects with tolerance
    -  use ST_DWithin to check for intersections when you have two geometries that fail to intersect because of differences caused by the number of significant digits
    -  e.g. LINESTRING(1 2, 3 4) and POINT(3.00001 4.00001)
    -  this is used very often when working with real data where not everything lines up perfectly
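
The example above can be checked numerically. This plain-Python sketch (hypothetical names, not PostGIS code) computes the distance from the point to the linestring and compares an exact test against a test with a small tolerance; the tolerance value is an arbitrary choice for illustration.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.dist(p, a)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def distance_to_line(p, line):
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))

def dwithin(p, line, tolerance):
    """Plays the role of ST_DWithin for a point and a linestring."""
    return distance_to_line(p, line) <= tolerance

line = [(1, 2), (3, 4)]
point = (3.00001, 4.00001)
d = distance_to_line(point, line)
print(d)                             # tiny, but not zero
print(d == 0)                        # False: an exact intersection test fails
print(dwithin(point, line, 0.0001))  # True: within tolerance
```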

KNN Distance Operators

  • Finding N closest places using KNN distance bounding-box operators
    -  good enough for geometries that tend to fill up their bounding boxes or that are very small
     
  • <#>
    - this is the KNN bounding-box distance operator
    -  A <#> B returns the minimum distance between the bounding boxes of A and B
     
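  • <->
    -  the KNN distance operator
    -  in PostGIS 2.1 and earlier, A <-> B returns the distance between the centroids of the bounding boxes of A and B
    -  since PostGIS 2.2 (on PostgreSQL 9.5+), A <-> B returns the true distance between A and B
     
  • <#> works only with the geometry type; <-> also accepts geography since PostGIS 2.2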
     
  • Much faster than ST_Distance
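
The two bounding-box measures, minimum distance between boxes and distance between box centroids, are simple to compute, which is where the speed comes from. The plain-Python sketch below uses hypothetical names and sample data, and also shows why box-based distances are only approximations for geometries that do not fill their boxes.

```python
import math

def bbox(coords):
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

def bbox_distance(a, b):
    """Minimum distance between two boxes (0 when they overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return math.hypot(dx, dy)

def bbox_centroid_distance(a, b):
    """Distance between the centers of two boxes."""
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return math.dist(ca, cb)

diagonal = bbox([(0, 0), (10, 10)])  # a thin line with a large box
poi = bbox([(9, 3)])                 # a point away from the line, inside its box
print(bbox_distance(diagonal, poi))           # 0.0 although the point is not on the line
print(bbox_centroid_distance(diagonal, poi))  # sqrt(20), about 4.47
print(bbox_distance((0, 0, 1, 1), (4, 5, 6, 7)))  # 5.0 for two separated boxes
```

A common pattern is to use the cheap box-based ordering to shortlist candidates and then rank the shortlist by exact distance.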

Use KNN with Geography Types

  • In PostGIS 2.1 and earlier, KNN distance operators cannot be used with geography types directly
     
  • Steps:
    -  create a functional geometry index on the table
    -  temporarily convert the geography to geometry
    -  use the KNN operators
    -  finally convert the results back to geography

Geotagging

  • Situate points within the context of another geometry
     
  • Region tagging:
    -  tag a geometry, such as a point of interest, with the name of a region it's in, such as a state
     
  • Linear referencing
    -  refer to a point of interest by its closest point along a linestring (the tag can be the closest point on the linestring, or a measure such as a mile marker or fractional percent measured from the start of the linestring to the point on the linestring closest to your point of interest)
    -  steps: 1) use ST_DWithin to narrow choices, 2) for every pairing of point and linestring, use ST_ClosestPoint to pinpoint the closest point on the linestring, and 3) use DISTINCT ON and ST_Distance to keep only the paired point and linestring that are closest
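
The core of linear referencing, finding the closest point on a linestring and expressing it as a fraction of the line's length, can be sketched in plain Python. The function name and sample road are hypothetical; in PostGIS the corresponding pieces are ST_ClosestPoint and, for the fraction, ST_LineLocatePoint.

```python
import math

def locate_along(line, p):
    """Return (closest_point_on_line, fraction_of_length_from_start)."""
    total = sum(math.dist(a, b) for a, b in zip(line, line[1:]))
    if total == 0:
        raise ValueError("degenerate linestring")
    best = None
    walked = 0.0  # length of the segments already passed
    for a, b in zip(line, line[1:]):
        seg_len = math.dist(a, b)
        if seg_len > 0:
            # Project p onto this segment and clamp to its endpoints.
            t = ((p[0] - a[0]) * (b[0] - a[0]) + (p[1] - a[1]) * (b[1] - a[1])) / seg_len ** 2
            t = max(0.0, min(1.0, t))
            q = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            d = math.dist(p, q)
            if best is None or d < best[0]:
                best = (d, q, (walked + t * seg_len) / total)
        walked += seg_len
    return best[1], best[2]

road = [(0, 0), (10, 0), (10, 10)]
print(locate_along(road, (3, 4)))   # ((3.0, 0.0), 0.15)
print(locate_along(road, (12, 5)))  # ((10.0, 5.0), 0.75)
```

Multiplying the fraction by the road's real-world length gives a mile-marker style measure for the point of interest.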

Geometry and Geography Processing

  • Aggregation
    -  rolling up several rows of data into one
    -  COUNT, SUM, MIN, MAX, AVG
     
  • Spatial aggregation
    -  ST_MakeLine
    -  ST_Union: the most commonly used one
    -  ST_Collect
    -  ST_Polygonize
     
  • There are no spatial aggregates for the geography type, so you need to cast geography to geometry first

Geometry and Geography Processing (cont'd)

  • Clipping
    -  remove unwanted sections of a geometry
    -  ST_Difference(A,B): returns the portion of A that's not shared with B
    -  ST_SymDifference(A,B): returns the portions of A and B that are not shared

Geometry and Geography Processing (cont'd)

  • Splitting
    -  use a linestring to slice a polygon
    -  ST_Split: can only be used with single geometries, not collections, and the blade you use to cut has to be one dimension lower than what you are cutting up

Lab: PostGIS Part 2

(continued)