[SADIS - ECSA 2025] Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain

Making a Pipeline Production-Ready

https://renatocf.xyz/sadis25-slides

2025

Renato Cordeiro Ferreira

Institute of Mathematics and Statistics (IME)
University of São Paulo (USP) – Brazil

Jheronimus Academy of Data Science (JADS)
Technical University of Eindhoven (TUe) / Tilburg University (TiU) – The Netherlands

Paper

Slides

Challenges and Lessons Learned in the Healthcare Domain

Former Principal ML Engineer at Elo7 (BR)

4 years of industry experience designing, building, and operating ML products with multidisciplinary teams

B.Sc. and M.Sc. at University of São Paulo (BR)

Theoretical and practical experience with Machine Learning and Software Engineering

Scientific Programmer at JADS (NL)

Currently participating in the MARIT-D European project, using ML techniques for more secure seas

Ph.D. candidate at USP + JADS (BR + NL)

Research about SE4AI, in particular about MLOps and the software architecture of ML-Enabled Systems

Renato Cordeiro Ferreira

https://renatocf.xyz/contacts

Renato Cordeiro Ferreira

Lucas Quaresma Lam

Daniel Lawand

renatocf@ime.usp.br

lucasqml08@alumni.usp.br

daniel.lawand@alumni.usp.br

Alfredo Goldman

gold@ime.usp.br

Marcelo Finger

mfinger@ime.usp.br

Roberto Oliveira Bolgheroni

robertobolgheroni@alumni.usp.br

Our paper describes
challenges and lessons learned
on evolving the training pipeline of SPIRA:
from a BIG BALL OF MUD (v1)
to a MODULAR MONOLITH (v2)
to a set of MICROSERVICES (v3).

System
Architecture

Research Track - ECSA 2025

MLOps in Practice: Requirements and a Reference Architecture from Industry

https://renatocf.xyz/ecsa25-paper

Doctoral Symposium - CAIN 2025

A Metrics-Oriented Architectural Model
to Characterize Complexity on
Machine Learning-Enabled Systems

https://renatocf.xyz/cain25-paper

Doctoral Symposium - CAIN 2025

A Metrics-Oriented Architectural Model
to Characterize Complexity on
Machine Learning-Enabled Systems

https://renatocf.xyz/cain25-paper

Data Collection App
Scientific Initiation 2021
Francisco Wernke

Streaming
Prediction Server
+ Client API / App
Capstone Project 2022
Vitor Tamae

Highly Availability
with Kubernetes
Capstone Project 2023
Vitor Guidi

Redesign Continuous Training Subsystem
Capstone Project 2023
Daniel Lawand

CI/CD/CD4ML on
Training Pipeline
Capstone Project 2024
Lucas Quaresma
+ Roberto Bolgheroni

Pipeline
Reimplementation

random.seed(c.train_config["seed"])
torch.manual_seed(c.train_config["seed"])
torch.cuda.manual_seed(c.train_config["seed"])
np.random.seed(c.train_config["seed"])
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

self.c = c
self.ap = ap
self.train = train
self.test = test
self.test_insert_noise = test_insert_noise
self.num_test_additive_noise = num_test_additive_noise
self.num_test_specaug = num_test_specaug
self.dataset_csv = \
  c.dataset["train_csv"] if train else c.dataset["eval_csv"]

assert os.path.isfile(self.dataset_csv), \
  "Test or Train CSV file don't exists! Fix it in config.json")

accepted_tc = [ 'overlapping', 'padding', 'one_window' ]
assert self.c.dataset['temporal_control'] in accepted_tc), \
  "You cannot use the padding_with_max_length option with the \
   split_wav_using_overlapping option, disable one of them !!")

self.control_class = c.dataset['control_class']
self.patient_class = c.dataset['patient_class']

self.dataset_list = \
  pd.read_csv(self.dataset_csv, sep=',') \
  .replace({'?': -1}) \
  .replace({'negative': self.control_class}, regex=True) \
  .replace({'positive': self.patient_class}, regex=True) \
  .values

Misplaced

Responsabilities

Lines 1-7 handle random number generation

Lines 8-16 assign values to
attributes

Lines 18-19 and 22-24
handle assertions

Line 30 handles data loading

# Setup

config_path = ValidPath.from_str("/app/spira/spira.json")
config = load_config(config_path)

operation_mode = OperationMode.TRAIN
randomizer = initialize_random(config, operation_mode)

# Data Loading

patients_paths = read_valid_paths_from_csv(config.patients_csv)
controls_paths = read_valid_paths_from_csv(config.controls_csv)
noises_paths = read_valid_paths_from_csv(config.noises_csv)

patients_inputs = Audios.load(
  patients_paths, config.audio, config.dataset
)
controls_inputs = Audios.load(
  controls_paths, config.audio, config.dataset
)
noises = Audios.load(noises_paths, config)
  noises_path, config.audio, config.dataset
)

# Feature Engineering

audio_processor = create_audio_processor(config.audio)

patient_feature_transformer = create_audio_feature_transformer(
    randomizer, audio_processor, config, noises,
)
control_feature_transformer = create_audio_feature_transformer(
    randomizer, audio_processor, config, noises,
)

Design-Pattern

Modularization

Lines 15-23 handles data loading by using the
Audio ADAPTER

Line 27 builds an audio_processor via a CHAIN OF RESPONSIBILITY

Lines 29-34 build two
feature_transformer's
via a STRATEGY

By bringing ML Engineers to
work with Data Scientists and
employing automated testing
since the beginning,
projects may reach production sooner