
class: center, middle, inverse

## Data Science for the Flint Water Crisis

### Jacob Abernethy and Eric Schwartz

#### July 21, 2016

---
class: top

## What is the Flint Water Crisis?

History in Headlines

---
class: top

### February 2011

---
class: top
### November 2011

---
class: top

### March 2012

---
class: top

### April 2014

---
class: top

### April 2014

---
class: top

### September 2015

---
class: top

### September 2015

---
class: top

### October 2015

---
class: top

### October 2015

---
class: top

### May 2016

---
class:top

# What kind of crisis is the Flint Water Crisis?

* Public Health?
* Social and Environmental Justice?
* Civil Engineering and Infrastructure?
* Political?
--

* Informational?
  * What do we know? What can we learn?

# Enter Data Science ...

* Statistical Machine Learning
* Prediction and Causal Inference

---
class:top

## Quick Quiz and Preview: Flint by the Numbers

(Audience participation!)
--

* What portion of Flint homes have water lead levels above the federal EPA 'action level' (15 parts per billion)?
--

.red[ 7 - 10%] <br>
--

* Where are those at-risk homes located in the city?
--

.red[ highly scattered ] <br>
--

* What are the most important risk factors predicting high water lead levels?
--

.red[ there are many: age of property, service line material, home value, city precinct ] <br>
--

* What percentage of homes have lead 'service lines'? 
--

.red[ 4 - 20% ] <br>
--

* Do we know where the lead is located throughout the infrastructure?
--

.red[ not very well ] <br>
--

* What percentage of tested children's blood lead levels are above the federal CDC monitoring level of 5 µg/dL?
--

.red[ 3 - 6% ]

---
class:top

# Letting the data speak

The nature of the problem may differ from what the headlines suggest.

There is danger and concern in Flint.

--
But not every home is in danger.

The risk levels vary widely across homes.

The problem is: there are many serious cases, and they are scattered around the city.

Finding which homes are in greatest danger is .red[critical],

and that is where statistics, machine learning, and data science help most.

---
class:top

# Today

1. How did we get involved?

1. Tell a story of data and people.

1. Ask questions and use data to find answers.
  * Where are the homes with highest water lead risk?
  * Which homes have lead pipes underground?

1. Guide decision makers and inform residents.

---
class:top

# How did we get involved?

---
class:top

# Talent Pool: The Michigan Data Science Team

.left-column[

A student group founded by J.A. and PhD student Jonathan Stroud. Now co-advised by E.S. and J.A.

]

.right-column[

We took a field trip to Flint in April and devoted most of the summer to Flint.

<center>
<img src="../../assets/flint_field_trip_water_plan.jpg" width="70%">
</center>
]

---
class:top

# Finding and combining diverse data sources

## Investigative data science

* Water testing
  1. Residential tests (Department of Environmental Quality, DEQ)
  1. Sentinel site tests (DEQ and Governor's office)
  1. Sequential testing (EPA)

* Parcel information (City of Flint)

* Water infrastructure
  1. City records (via U-M Flint)
  1. In-home inspection (DEQ)
  1. Service lines replaced (Flint Mayor's office)

---
class:top

# Water testing results are public data

---
class:top

# Water testing kit

---
class:top

# Water testing data

* Residential testing data
  * Over 15,000 homes submitting over 20,000 water tests.
  * Cross-sectional, with some repeated measures (self-selection)

* Sentinel site testing data
  * About 700 homes selected to be tested 5 times.  
  * Longitudinal repeated measures over time (random selection?)

* Parcel information
  * Detailed property information on over 40,000 parcels in Flint (address, year built, vacancy, condition, unique key: Parcel ID)
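
Because Parcel ID is the unique key, the testing data can be attached to the parcel table with a simple join. The sketch below is only an illustration: the field names (`parcel_id`, `lead_ppb`, `year_built`, `vacant`) and all values are made up, not the actual schema or data.

```python
from collections import defaultdict

# Toy records; every field name and value here is hypothetical.
parcels = {
    "P-001": {"year_built": 1928, "vacant": False},
    "P-002": {"year_built": 1955, "vacant": False},
    "P-003": {"year_built": 1972, "vacant": True},
}
water_tests = [
    {"parcel_id": "P-001", "lead_ppb": 22.0},
    {"parcel_id": "P-001", "lead_ppb": 4.0},  # repeated measure, same home
    {"parcel_id": "P-002", "lead_ppb": 1.0},
]


def join_tests_to_parcels(parcels, water_tests):
    """Attach every water test to its parcel, keyed by Parcel ID."""
    readings = defaultdict(list)
    for test in water_tests:
        readings[test["parcel_id"]].append(test["lead_ppb"])
    # Left join: keep every parcel, including those with no tests yet.
    return {
        pid: {**attrs, "lead_ppb": readings.get(pid, [])}
        for pid, attrs in parcels.items()
    }


joined = join_tests_to_parcels(parcels, water_tests)
# Parcels with no readings are the ones a model would need to predict for.
untested = [pid for pid, row in joined.items() if not row["lead_ppb"]]
```

Keeping untested parcels in the joined table matters here, since those are the homes where a prediction is most useful.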

---
class:top

# Map of water lead readings

Fewer than 40% of homes have had their water tested.

What about the rest?

---
class:top

### Predicting the most at-risk homes in Flint

We can predict lead levels with an AUROC score of 0.677.
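
For readers unfamiliar with the metric: AUROC is the probability that a randomly chosen high-lead home receives a higher risk score than a randomly chosen low-lead home, so 0.5 is chance and 1.0 is a perfect ranking. A minimal sketch of that definition, using made-up labels and scores rather than our data:

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 1/2)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = reading above 15 ppb, 0 = at or below.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]
toy_auroc = auroc(labels, scores)
```

In this toy example, 5 of the 6 positive-negative pairs are ranked correctly. The pairwise loop is quadratic and is meant to show the definition, not to be efficient.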

---
class:top

# How are we predicting water lead levels?

## Classification task

* General Model
  * Y := Is the water lead level greater than 15 ppb?
  * X := Observable property attributes

* Methods: Ensemble of several algorithms.
  * Logistic regression
  * Classification trees and forests
  * Bagging, Bootstrapping, Boosting
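
One simple way to combine several classifiers is to average their predicted probabilities ("soft voting"). The sketch below assumes that combination rule for illustration only, and the three probability lists are hypothetical outputs of already-fitted models, not real results.

```python
ACTION_LEVEL_PPB = 15.0  # EPA action level used to define Y


def make_label(lead_ppb):
    """Y := 1 if the reading is greater than the action level, else 0."""
    return 1 if lead_ppb > ACTION_LEVEL_PPB else 0


def soft_vote(probabilities_by_model, weights=None):
    """Average per-home risk probabilities across models (weighted if given)."""
    if weights is None:
        weights = [1.0] * len(probabilities_by_model)
    total = sum(weights)
    n_homes = len(probabilities_by_model[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, probabilities_by_model)) / total
        for i in range(n_homes)
    ]


# Hypothetical outputs of three fitted models for four homes.
logistic = [0.10, 0.60, 0.30, 0.80]
forest = [0.20, 0.70, 0.10, 0.90]
boosted = [0.00, 0.50, 0.20, 1.00]

ensemble = soft_vote([logistic, forest, boosted])
# Home indices ordered from highest to lowest predicted risk.
ranked = sorted(range(len(ensemble)), key=lambda i: ensemble[i], reverse=True)
```

The ranked list is what drives a "most at-risk homes" map: the ordering matters more than the exact probabilities.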

---
class:top

# What have we done with those predictions?

We shared these maps and a list of the most at-risk properties with:

* Flint Mayor's office
* DEQ
* EPA
* NGOs
* U-M Flint

and we continue to update the results with new data.

Goal: To help decision makers allocate resources for public health communication efforts, targeted testing (water and blood lead levels), and prioritizing infrastructure investment.

---
class:top

# Risk factors: What predicts lead in the water?

According to the model...

--
.left-column[
### Age of house

<img src="../../assets/loglead_by_year.png" width=70%>
]

.right-column[
### Service Line Material

<img src="../../assets/lead_levels_by_sltype.png" width=90%>
]

---
class:top

### What is a water Service Line?

### And where can you find Lead in the Service Line?

---
class:top

## State of MI contributed $27M towards SL replacement

.left-column[
General Michael McDaniel is heading up a task force to replace ~5,000 service lines

<img src="../../assets/mcdaniel_appointed_service_lines.png" width=80%>
]

.right-column[
Problem: Where should we dig?

<img src="../../assets/headline_sl_replacement_no_lead.png" width=80%>
]

---
class:top

## Expensive to determine Service Line material?

.left-column[
### Private portion service line

Check basement pipes. (~ free)
]

.right-column[
### Public portion service line

Hire a company with a hydrovac and start digging (~ $1000)
]

---
class:top

# City should have records, right?

* The City of Flint did not maintain accurate service line info
* Officials began to search for records as soon as the crisis erupted
--

* Records found in basement of City Hall!<br>
<img src="../../assets/service_line_record_screenshot.png" width="60%"><br>
* These handwritten maps were digitized by Marty Kaufman's group at UM Flint

---
class:top

### City Records are *Highly* Noisy

How do we know?

* A recent pilot SL replacement project replaced 33 lines
--

* Michigan DEQ inspected about 3,700 *private* service lines, and provided the results.
--

* Confusion matrix of RECORD vs. INSPECTION
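
A confusion matrix of this kind simply counts, for each home with both a city record and an inspection, which (record, inspection) pair of materials was observed. A minimal sketch with made-up materials for six homes (not real Flint data):

```python
from collections import Counter


def confusion_matrix(recorded, inspected):
    """Count (record, inspection) material pairs, one pair per home."""
    return Counter(zip(recorded, inspected))


def agreement_rate(matrix):
    """Share of homes where the city record matches the inspection."""
    total = sum(matrix.values())
    agree = sum(n for (rec, insp), n in matrix.items() if rec == insp)
    return agree / total


# Hypothetical service line materials, aligned by home.
recorded = ["lead", "copper", "lead", "copper", "unknown", "copper"]
inspected = ["copper", "copper", "lead", "lead", "copper", "copper"]

matrix = confusion_matrix(recorded, inspected)
rate = agreement_rate(matrix)
```

Off-diagonal cells such as `("copper", "lead")` are the costly ones: the record says the line is safe, but the inspection found lead.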

---
class:top

## Machine Learning to the Rescue

* At the request of Gen. McDaniel, we have applied learning algorithms to predict service line materials
--

* Model:
  *  Y = *Service line material type* (via DEQ inspection)
  *  X = *Property attributes*, *City Records*, *Water Readings*

* We are able to estimate lead with an AUROC score of 0.86 using an ensemble of XGBoost and Random Forest classifiers

---
class:top

# What's next?

## Aiding policy makers

* Continue to guide policy and decision makers to help Flint residents.

* Adaptive sampling for lead detection and service line replacement: Which homes and neighborhoods should be examined next?

* Other cities: Flint has attracted national attention to infrastructure and lead issues. How can the methods and findings be applied elsewhere?

---
class:top

# What's next?

## Collaboration with experts

* U-M Public health and medicine
  * Lead exposure in 2014-2016 can be used for longitudinal studies in Flint

* U-M Civil engineering
  * Connecting the data analysis to physical tests of pipes, and other chemicals and pathogens in water

* And others...

---
class: middle

.footnote[A huge thanks to Jared Webb, Jonathan Stroud, Guangsha Shi, Chengyu Dai, and many, many more of the amazing Michigan Data Science Team]