Working with Wisconsin Justice System Data

Branden Dupont

Data Analyst

Medical College of Wisconsin

 Datashare @ MCW IHE = Local IDS






Documented Lack of Quality in Race/Ethnicity Justice Data

Scope of the Problem - U.S.

  • A 2016 Urban Institute report reviewing 40 states found that only 15 states' arrest records reported ethnicity.
  • Race/ethnicity is often not self-identified but assigned by an official
  • The assigned value often feeds into the front end of the system (jail, pretrial, prosecutor)
  • Even when reported, the quality is suspect, e.g. SPP found that only 37% of Hispanic-affiliated names were correctly classified in Texas

Locally, Milwaukee Shares This Problem

  • Officers assign race and sometimes ethnicity
  • Data from law enforcement feeds into the system, in particular PROTECT
  • Stakeholders often lack confidence in the quality of race/ethnicity data
  • Currently no systems in place to assess the size and direction of potential misclassification
  • Mapping home addresses raises questions about data quality (officer assigned vs census reported)

Implications for Measuring Racial and Ethnic Disparities


  • "[F]ailing to account for Hispanics in white and black estimates tends to inflate white proportions and deflate black proportions of arrests, admissions, and prison population estimates, masking the “true” black and white racial disproportionality." - Harris CT et al


Technical Implications

  • Stakeholders want to be confident a measure is accurate for defining the size of the problem and measuring impact of a program
  • Need to get a sense of the magnitude of the misclassification
  • Even with an immediate fix, we still need to benchmark performance against the past and assess quality on an ongoing basis
  • 15.1% of Milwaukee County residents are Hispanic/Latino, representing ~40% of all Hispanic/Latino residents of Wisconsin.
  • Community trust in analysis results

Program and Policy Implications

Methods to Impute Hispanic from White in Criminal Justice Data

Existing Standards

  • The Stanford Policing Project follows an established benchmark of reclassifying an individual as Hispanic if 75% or more of the people with the same last name are Hispanic affiliated. (Melendres v. Arpaio, 2009; Word and Perkins, 1996).[8]
  • In their analysis of racial disparities in policing stops, individuals labeled as white who meet the 75% threshold are changed to Hispanic. They note that 90% of people with Hispanic-affiliated names identify as Hispanic.
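
The 75% surname benchmark can be sketched in a few lines. This is a minimal illustration, not the Stanford project's code: the surname table, its values, and the function name are all hypothetical stand-ins for a real surname list.

```python
# Hypothetical surname table: share of people with each surname who
# identify as Hispanic. Values are illustrative, not census figures.
SURNAME_PCT_HISPANIC = {
    "GARCIA": 0.92,
    "SMITH": 0.02,
    "SILVA": 0.55,
}

THRESHOLD = 0.75  # the 75% benchmark described above


def reclassify_by_surname(recorded_race: str, last_name: str) -> str:
    """Return 'Hispanic' for a white-labeled record whose surname meets the threshold."""
    share = SURNAME_PCT_HISPANIC.get(last_name.strip().upper())
    if recorded_race.lower() == "white" and share is not None and share >= THRESHOLD:
        return "Hispanic"
    # Records with other recorded values, or surnames below the threshold
    # or missing from the table, are left unchanged.
    return recorded_race


print(reclassify_by_surname("White", "Garcia"))
print(reclassify_by_surname("White", "Silva"))
print(reclassify_by_surname("Black", "Garcia"))
```

Only records recorded as white are candidates for reclassification; everything else passes through untouched.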


  • Ethnicolr is a Python package used to predict race and ethnicity from first and last name.


  • Ethnicolr's model was validated on Florida voter registration data, with a precision of .83 and recall of .84 when both first and last name are used


  • Higher quality probabilistic output compared to another classifier often used in the criminal justice system: offender risk assessment. At their best, these models reach an AUC of .72 (COMPAS and LSI-R: .66).


  • Easy, automated way to correct improper classification, fix missing entries, and gauge the overall quality of race/ethnicity data.

Approach to PROTECT Race/Ethnicity Error Correction


  • MacArthur project evaluating disparities in prosecutorial decision-making


  • A race/ethnicity of white is reclassified as Hispanic if the first and last name in PROTECT have a Hispanic-affiliated predicted score of 75% or greater using Ethnicolr


  • At this threshold, defendants reclassified from white to Hispanic are more likely to be correctly assigned, but the rule will miss more instances where white should be reclassified as Hispanic (precision vs recall trade-off).
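
The precision vs recall trade-off can be made concrete with a toy example. The scores and true labels below are invented for illustration only; they are not PROTECT data or Ethnicolr output, and `precision_recall` is a hypothetical helper.

```python
# Toy records: (hypothetical Hispanic score from a name classifier, true label).
# Scores and labels are invented for illustration only.
RECORDS = [
    (0.95, True), (0.88, True), (0.80, True), (0.70, True),
    (0.60, True), (0.78, False), (0.55, False), (0.30, False),
    (0.10, False), (0.05, False),
]


def precision_recall(records, threshold):
    """Precision and recall when every score >= threshold is reclassified as Hispanic."""
    tp = sum(1 for score, is_hisp in records if score >= threshold and is_hisp)
    fp = sum(1 for score, is_hisp in records if score >= threshold and not is_hisp)
    fn = sum(1 for score, is_hisp in records if score < threshold and is_hisp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.50, 0.75, 0.90):
    p, r = precision_recall(RECORDS, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold raises precision and lowers recall, which is the pattern described above. Running the same calculation on a hand-verified sample would be one way to gauge the magnitude of misclassification.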




  • Using a predicted score of 75% or greater, imputation with Ethnicolr classifies 5,848 more defendants as Hispanic instead of non-Hispanic (NH) white, for a total of 13,215 in a 5-year cohort of DA referrals (2014-2018).


  • This is an increase of ~80% in defendants classified as Hispanic, all of them reclassified from NH white.
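
The ~80% figure can be checked with simple arithmetic. This sketch assumes that 13,215 is the Hispanic total after imputation, so the pre-imputation count is the difference; the variable names are illustrative.

```python
total_hispanic_after = 13_215   # assumed: total classified as Hispanic after imputation
reclassified = 5_848            # moved from NH white to Hispanic

# Assumed baseline: defendants already recorded as Hispanic before imputation.
hispanic_before = total_hispanic_after - reclassified
pct_increase = 100 * reclassified / hispanic_before

print(hispanic_before)
print(round(pct_increase, 1))
```

Under that reading, the increase is relative to the number of defendants originally recorded as Hispanic.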

Sentencing Features: Type and Length

Sentencing Type

  • Calculating sentence length from circuit court data is complex
  • Start with meaningful variables that can be built with confidence
  • Generate custodial sentence for Milwaukee DA and other researchers

Custodial Sentence

  1. Goal: Identify jail and prison sentences
    • Jail and prison combined = custodial sentence
  2. Drop sentenced counts that are:
    • Imposed and stayed
    • Not custodial related (e.g. firearm related)
    • Prison or Jail/HOC with zero days
  3. Exceptions:
    • Check court record of events - time served disposition
    • Probation sentence with jail condition time
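
The drop rules and exceptions can be sketched as a single filter. All field names here are hypothetical, and the sketch assumes the two exceptions mean those counts are kept as custodial (time served found in the court record of events, and probation with jail condition time). The actual project logic may differ.

```python
from dataclasses import dataclass


@dataclass
class SentencedCount:
    # All field names are hypothetical; real circuit court extracts differ.
    sentence_type: str             # e.g. "prison", "jail/hoc", "probation"
    days: int = 0                  # confinement days recorded on this count
    imposed_and_stayed: bool = False
    time_served: bool = False      # time served disposition in the record of events
    jail_condition_days: int = 0   # jail time ordered as a condition of probation


CUSTODIAL_TYPES = {"prison", "jail/hoc"}


def is_custodial(count: SentencedCount) -> bool:
    kind = count.sentence_type.lower()
    # Exceptions (assumed here to be kept as custodial).
    if count.time_served:
        return True
    if kind == "probation" and count.jail_condition_days > 0:
        return True
    # Drops: imposed and stayed, non-custodial types, zero-day sentences.
    if count.imposed_and_stayed:
        return False
    if kind not in CUSTODIAL_TYPES:
        return False
    return count.days > 0


counts = [
    SentencedCount("prison", days=1095),
    SentencedCount("prison", days=730, imposed_and_stayed=True),
    SentencedCount("jail/hoc", days=0, time_served=True),
    SentencedCount("probation", jail_condition_days=60),
]
print([is_custodial(c) for c in counts])
```

Checking the exceptions before the drops keeps a zero-day jail count when the record shows time served.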

Sentence Length:

A Work in Progress

Sentence Length Can Get Complex

  • Image: one individual case with two convicted counts
    • (1) Felony burglary
      • Prison (3 yrs) and extended supervision, E.S. (2 yrs)
      • Concurrent to count 2
    • (2) Felon in possession of a firearm
      • Prison (2 yrs) and E.S. (2 yrs)
      • Concurrent to count 1 and any other sentence
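
Why the concurrent/consecutive label matters can be shown with the prison terms from this case. This is a simplified sketch: it ignores sentence credit, differing start dates, and extended supervision, and the function name is hypothetical.

```python
def total_confinement_years(terms, consecutive=False):
    """Concurrent terms overlap (the longest controls); consecutive terms add up."""
    return sum(terms) if consecutive else max(terms)


prison_years = [3, 2]  # count 1 (burglary), count 2 (felon in possession)

print(total_confinement_years(prison_years))                    # concurrent, as ordered
print(total_confinement_years(prison_years, consecutive=True))  # if they had been consecutive
```

The same two counts give 3 years when concurrent and 5 years when consecutive, so a misread label changes the sentence length directly.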

"Any Other Sentence"

  • That same individual has another case from 20 days earlier in the same county, with one convicted count:
    • (1) Felon in possession of a firearm
      • Prison (3 yrs) and E.S. (2 yrs)

Whether a Count is Concurrent or Consecutive Lives in Free-Form Text

  • "concurrent to count 1, two, and 3 in this case and consecutive to case 13cf04"
  • "concurr with 15cf3316"
  • "Concurrent with: Concurrent to count two and consecutive to count three. *Credit of 180 days as to count one and two."

Task Isn't Insurmountable

  • Only need to identify counts that are consecutive to another count and sentences that are consecutive to another case
  • Presumption is concurrent
  • Cases in same time horizon can be linked by SID and general record linkage
  • Counts and case numbers can be extracted using NLP/Regex
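
A rough sketch of the regex side of this extraction, using only Python's standard `re` module. The patterns and the `parse_relations` helper are illustrative assumptions, not the pattern used in the project, and they have only been checked against the three free-text examples quoted earlier.

```python
import re

# Illustrative patterns only; real free-text fields need a much larger test set.
# A clause starts at a concurrent/consecutive keyword and runs to the next
# keyword, a period or asterisk, or the end of the text.
CLAUSE = re.compile(r"\b(concurr\w*|consec\w*)(.*?)(?=\bconcurr|\bconsec|[.*]|$)", re.I | re.S)
CASE_NO = re.compile(r"\b\d{2}[a-z]{2}\d+\b", re.I)        # e.g. 15cf3316
COUNT_LIST = re.compile(r"\bcounts?\b(.*)", re.I | re.S)    # text after "count(s)"
NUMBER = re.compile(r"\b(\d+|one|two|three|four|five|six|seven|eight|nine|ten)\b", re.I)
WORD_TO_INT = {w: i for i, w in enumerate(
    "one two three four five six seven eight nine ten".split(), start=1)}


def parse_relations(text):
    """Return one dict per concurrent/consecutive clause that names a count or case."""
    relations = []
    for keyword, body in CLAUSE.findall(text):
        kind = "consecutive" if keyword.lower().startswith("consec") else "concurrent"
        cases = [c.lower() for c in CASE_NO.findall(body)]
        counts = []
        after_count = COUNT_LIST.search(body)
        if after_count:
            for token in NUMBER.findall(after_count.group(1)):
                token = token.lower()
                counts.append(int(token) if token.isdigit() else WORD_TO_INT[token])
        if cases or counts:
            relations.append({"kind": kind, "counts": counts, "cases": cases})
    return relations


text = "concurrent to count 1, two, and 3 in this case and consecutive to case 13cf04"
consecutive_only = [r for r in parse_relations(text) if r["kind"] == "consecutive"]
print(consecutive_only)
```

Because the presumption is concurrent, only the clauses tagged consecutive need to be carried forward. One known limitation of this sketch: any stray number that follows the word "count" inside the same clause would be read as a count number.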

Consecutive Regex Pattern

Dependency Parse

Any Questions?

DOJ Presentation

By Branden DuPont
