class: center, middle, inverse

## Data Science for the Flint Water Crisis

### Jacob Abernethy and Eric Schwartz

#### July 21, 2016

---
class: top

## What is the Flint Water Crisis? History in Headlines

---
class: top

### February 2011
---
class: top

### November 2011

---
class: top

### March 2012

---
class: top

### April 2014

---
class: top

### April 2014

---
class: top

### September 2015

---
class: top

### September 2015

---
class: top

### October 2015

---
class: top

### October 2015

---
class: top

### May 2016
---
class:top

# What kind of crisis is the Flint Water Crisis?

* Public Health?
* Social and Environmental Justice?
* Civil Engineering and Infrastructure?
* Political?

--

* Informational?
  * What do we know? What can we learn?

--

# Enter Data Science ...

* Statistical Machine Learning
* Prediction and Causal Inference

---
class:top

## Quick Quiz and Preview: Flint by the Numbers

(Audience participation!)

--

* What portion of Flint homes have water lead levels above federal EPA 'action level' (15 parts per billion)?

--

.red[ 7 - 10%]
--

* Where are those at-risk homes located in the city?

--

.red[ highly scattered ]

--

* What are the most important risk factors predicting high water lead levels?

--

.red[ there are many: age of property, service line material, home value, city precinct ]

--

* What percentage of homes have lead 'service lines'?

--

.red[ 4 - 20% ]

--

* Do we know where the lead is located throughout the infrastructure?

--

.red[ not very well ]
--

* What percentage of tested children's blood lead levels are above the federal CDC monitoring level of 5 µg/dL?

--

.red[ 3 - 6% ]

---
class:top

# Letting the data speak

The nature of the problem may be different from what the headlines suggest.

--

There is danger and concern in Flint.

--

But not every home is in danger. The risk levels vary widely across homes.

--

The problem is: there are many serious cases, and they are scattered around the city. Finding which homes are in greatest danger is .red[critical], and that is where statistics, machine learning, and data science help most.

---
class:top

# Today

1. How did we get involved?
1. Tell a story of data and people.
1. Ask questions and use data to find answers.
   * Where are the homes with highest water lead risk?
   * Which homes have lead pipes underground?
1. Guide decision makers and inform residents.

---
class:top

# How did we get involved?

--
---
class:top

# Talent Pool: The Michigan Data Science Team

.left-column[
A student group founded by J.A. and PhD student Jonathan Stroud. Now co-advised by E.S. and J.A.
]

--

.right-column[
We took a field trip to Flint in April and devoted most of the summer to Flint.
]

---
class:top

# Finding and combining diverse data sources

## Investigative data science

* Water testing
  1. Residential tests (Department of Environmental Quality, DEQ)
  1. Sentinel site tests (DEQ and Governor's office)
  1. Sequential testing (EPA)
* Parcel information (City of Flint)
* Water infrastructure
  1. City records (via U-M Flint)
  1. In-home inspection (DEQ)
  1. Service lines replaced (Flint Mayor's office)

---
class:top

# Water testing results are public data
---
class:top

# Water testing kit

---
class:top

# Water testing data

* Residential testing data
  * Over 15,000 homes submitting over 20,000 water tests.
  * Cross-sectional, with some repeated measures (self-selection)
* Sentinel site testing data
  * About 700 homes selected to be tested 5 times.
  * Longitudinal repeated measures over time (random selection?)
* Parcel information
  * Detailed property information about over 40,000 parcels in Flint (address, year built, vacancy, condition, unique key: Parcel ID)

---
class:top

# Map of water lead readings
--

Fewer than 40% of homes have had their water tested. What about the rest?

---
class:top

### Predicting the most at-risk homes in Flint
We can predict lead levels with an AUROC score of 0.677.

---
class:top

# How are we predicting water lead levels?

## Classification task

* General Model
  * Y := Is the water lead level greater than 15 ppb?
  * X := Observable property attributes
* Methods: Ensemble of several algorithms.
  * Logistic regression
  * Classification trees and forests
  * Bagging, Bootstrapping, Boosting

---
class:top

# What have we done with those predictions?

We shared these maps and lists of the most at-risk properties with:

* Flint Mayor's office
* DEQ
* EPA
* NGOs
* U-M Flint

and we continue to update the results with new data.

Goal: To help decision makers allocate resources for public health communication efforts, targeted testing (water and blood lead levels), and prioritizing infrastructure investment.

---
class:top

# Risk factors: What predicts lead in the water?

According to the model...

--

.left-column[
### Age of house
]

--

.right-column[
### Service Line Material
]

---
class:top

### What is a water Service Line?
--

### And where can you find Lead in the Service Line?

---
class:top

## State of MI contributed $27M towards SL replacement

--

.left-column[
General Michael McDaniel is heading up a task force to replace ~5,000 service lines
]

--

.right-column[
Problem: Where should we dig?
]

---
class:top

## Expensive to determine Service Line material?

--

.left-column[
### Private portion service line

Check basement pipes. (~ free)
]

--

.right-column[
### Public portion service line

Hire a company with a hydrovac and start digging (~ $1000)
]

---
class:top

# City should have records, right?

* The City of Flint did not maintain accurate service line info
* Officials began to search for records as soon as the crisis erupted

--

* Records found in basement of City Hall!
* These handwritten maps were digitized by Marty Kaufman's group at U-M Flint

---
class:top

### City Records are *Highly* Noisy

How do we know?

--

* A recent pilot SL replacement project replaced 33 lines

--

* Michigan DEQ inspected about 3700 *private* service lines, and provided the results.

--

* Confusion matrix of RECORD vs. INSPECTION
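
---
class:top

### Sketch: tallying RECORD vs. INSPECTION

A confusion matrix of this kind can be tallied in a few lines. The sketch below is illustrative only: the material labels and the six example homes are made up, and `confusion_matrix` and `agreement_rate` are hypothetical helper names, not the team's code.

```python
from collections import Counter

# Hypothetical labels and homes for illustration only; this is not the Flint data.
LABELS = ["copper", "lead", "other"]
records = ["lead", "copper", "copper", "lead", "other", "copper"]
inspections = ["lead", "lead", "copper", "copper", "other", "copper"]


def confusion_matrix(recorded, inspected, labels):
    """Count homes for each (recorded material, inspected material) pair."""
    counts = Counter(zip(recorded, inspected))
    # Rows follow the recorded material, columns the inspected material.
    return [[counts[(r, i)] for i in labels] for r in labels]


def agreement_rate(matrix):
    """Share of homes where the record matches the inspection (the diagonal)."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    return sum(matrix[k][k] for k in range(len(matrix))) / total


matrix = confusion_matrix(records, inspections, LABELS)
for label, row in zip(LABELS, matrix):
    print(f"{label:>7}: {row}")
print(f"agreement: {agreement_rate(matrix):.2f}")
```

Rows are the recorded material and columns are the inspected material, so off-diagonal counts are homes where the city record disagrees with the inspection.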
---
class:top

## Machine Learning to the Rescue

* At the request of Gen. McDaniel, we have applied learning algorithms to predict service line materials

--

* Model:
  * Y = *Service line material type* (via DEQ inspection)
  * X = *Property attributes*, *City Records*, *Water Readings*

--
* We are able to estimate lead with an AUROC score of 0.86 using an ensemble of XGBoost and Random Forest classifiers

---
class:top

# What's next?

## Aiding policy makers

* Continue to guide policy and decision makers to help Flint residents.
* Adaptive sampling for lead detection and service line replacement: Which homes and neighborhoods should be examined next?
* Other cities: Flint has attracted national attention to infrastructure and lead issues. How can the methods and findings be applied elsewhere?

---
class:top

# What's next?

## Collaboration with experts

* U-M Public health and medicine
  * Lead exposure in 2014-2016 can be used for longitudinal studies in Flint
* U-M Civil engineering
  * Connecting the data analysis to physical tests of pipes, and other chemicals and pathogens in water
* And others...

---
class: middle
$\sim \sim$ FIN $\sim \sim$
.footnote[A huge thanks to Jared Webb, Jonathan Stroud, Guangsha Shi, Chengyu Dai, and many many more of the amazing Michigan Data Science Team]
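
---
class:top

### Backup: what an AUROC score measures

The AUROC scores quoted in this talk (0.677 and 0.86) can be read as the chance that a randomly chosen positive case gets a higher predicted risk than a randomly chosen negative one. The sketch below computes that definition directly. It uses made-up labels and scores rather than Flint data, and `auroc` is a hypothetical helper written for this slide, not the team's evaluation code.

```python
def auroc(y_true, scores):
    """AUROC as the chance a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 means the home tested above 15 ppb,
# and each score is a made-up predicted risk for that home.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.7]
value = auroc(y_true, scores)
print(f"AUROC = {value:.3f}")
```

A score of 0.5 corresponds to ranking homes at random, and 1.0 to ranking every positive above every negative. This pairwise form is quadratic in the number of homes, so it suits a small illustration more than a full dataset.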