The universe of Data Science

Nicolas Rochet  - 2025

A brief tour

Warm up activity

What words come to your mind?

Go here to participate:

www.wooclap.com/HEETHC

DATA

A computer science definition

DATA

INFORMATION

Sequence of symbols coded as numbers

interpreted data
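As a minimal sketch of the distinction above, the same sequence of numbers only becomes information once an interpretation is chosen. The byte values and the two interpretations below are illustrative assumptions, not taken from the slides.

```python
# Data: a sequence of symbols coded as numbers (values chosen for illustration).
raw = bytes([68, 65, 84, 65])

# The bare numbers carry no meaning on their own.
as_numbers = list(raw)

# Information: the same numbers, interpreted under a convention.
as_text = raw.decode("ascii")                      # read as ASCII characters
as_integer = int.from_bytes(raw, byteorder="big")  # read as one big-endian integer

print(as_numbers)
print(as_text)
print(as_integer)
```

The two readings differ only in the convention applied, which is the sense in which information is interpreted data.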

A variety of data types

IMAGES

LANGUAGE

NUMBER SERIES

SOUNDS

Professional

photos

voices

recordings

reviews

comments

chats

weather

sensors

METADATA

web

applications

software

tweets

sales

stock

business data

videos

Social networks

friends

sharing

likes

...

music

IoT

messages

electricity consumption

... from various sources

Open Data

Internal data

From the web

Data markets

scraping

API(s)

social network

platforms

data scientist communities

public organisations

brokers

statistical reports

from a work domain

from software

from organisations

...

the rise of (BIG) DATA

the ability to produce, collect, store, structure, access, and present digital data

BIG DATA

DATA SETS

SMALL DATA

Volume

Velocity

Variety

3V of Big Data

the quest for structure

unstructured

unstructured

structured

low quantity

high quantity

big data

data lake

data warehouse

one database

several types of databases

single type data sets

data streams

different types of data sets

data mesh
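To illustrate what moving from unstructured to structured data can look like in practice, here is a minimal sketch that extracts records from free text. The messages, the field names, and the pattern are hypothetical and only serve the example.

```python
import re

# Hypothetical unstructured input: free-text sensor messages.
messages = [
    "2025-01-03 sensor A12 reported 21.5 C",
    "2025-01-03 sensor B07 reported 19.0 C",
    "maintenance visit, nothing to report",
]

# Assumed pattern describing the structure we expect to find.
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) sensor (?P<sensor>\w+) reported (?P<value>[\d.]+) C"
)

def structure(message):
    """Return a structured record (dict) or None when the text does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    record = match.groupdict()
    record["value"] = float(record["value"])
    return record

records = [r for r in map(structure, messages) if r is not None]
for record in records:
    print(record)
```

Messages that do not fit the expected pattern are left out of the structured result, which is one reason unstructured sources are usually kept alongside the structured ones.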

 A huge consumption of data

World data consumption (ZB): 8 in 2015, 40 in 2020, 200 in 2025

1 ZB = 10^{21} bytes
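To give a sense of scale, the sketch below converts the three chart values using the relation 1 ZB = 10^21 bytes. Reading the unit as bytes is an assumption, and the constant names are chosen for the example.

```python
ZETTABYTE = 10**21  # bytes (assumes the chart unit denotes bytes)
TERABYTE = 10**12   # bytes

# Values shown on the chart, in zettabytes.
chart_values_zb = [8, 40, 200]

for zb in chart_values_zb:
    terabytes = zb * ZETTABYTE // TERABYTE
    print(f"{zb} ZB = {terabytes:,} TB")

# Growth factor between consecutive chart values.
growth = [later / earlier
          for earlier, later in zip(chart_values_zb, chart_values_zb[1:])]
print(growth)
```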

with a big ecological footprint

Big data

pollution

Big data benefits

Big Data market size

The advantages of data for organizations

Exploiting data with data science

Communication

Data storytelling

Dashboarding

Data Visualisation

Data Analysis

Computer science

Statistics

AI

Coding language

Data

structures

Different domains

Data Science

Reporting

Data mining

Business

Intelligence

Decision

APIs

Automation

tools

Decision Science

Game theory

Data mining & KDD

Data mining is one step of a more general process: Knowledge Discovery in Databases (KDD)

Data Sources

Data mining

Knowledge extraction

Exploitation & deployment

Structuring

Models

Patterns

Information

Enrichment

Databases

Data warehouse

Files

Documents
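The KDD stages named on this slide (data sources, structuring, data mining, knowledge extraction) can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the purchase records are invented, and counting co-occurring item pairs stands in for a real data mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Data sources: hypothetical raw purchase records (invented for illustration).
raw_sources = [
    "bread, butter, jam",
    "bread, butter",
    "coffee, milk",
    "bread, butter, coffee",
]

def structure(lines):
    """Structuring: turn each raw line into a sorted list of distinct items."""
    return [sorted({item.strip() for item in line.split(",")}) for line in lines]

def mine_patterns(baskets, min_support=2):
    """Data mining: keep item pairs that co-occur in at least `min_support` baskets."""
    pair_counts = Counter()
    for basket in baskets:
        pair_counts.update(combinations(basket, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

def extract_knowledge(patterns):
    """Knowledge extraction: phrase each pattern as a readable statement."""
    return [f"{a} and {b} were bought together {n} times"
            for (a, b), n in sorted(patterns.items())]

baskets = structure(raw_sources)
patterns = mine_patterns(baskets)
for statement in extract_knowledge(patterns):
    print(statement)
```

Data mining is only the middle function here; the structuring before it and the interpretation after it are what make the overall process KDD.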

The rise of data science

and Machine Learning

For whom?
By whom?

Some actors and users

Big companies

Research

Labs

Citizens

Institutions

Communities

CNIL

Europe

Governments

Companies

Academics

OpenAI

UN

Kaggle

Start ups

Non profits

GitHub

INRIA

Element AI

GAFAM

BATX

Small companies

Data science employers

Data Science & organizations in 2021

Data Science & AI in 2021

Conclusions

45% of organizations haven't adopted Data Science & AI

Organizations need to devise the right strategies to incorporate a data culture

In general, Northern countries are more advanced

The level of adoption varies widely among organizations

How to build and use data science ?

A data science project lifecycle

 Need to ensure ethics-by-design !

idea

project definition

PoC

develop & deploy

sharing & feedback

ethical watch

measure of project adoption

Data science professions

Project's steps

Researcher

Data visualiser

Communicator

Data Analyst

Data Architect

Data Manager

Data Scientist

Decision makers

Ethicist

DPO

Manager

Data engineer

Designer

Characteristic steps

data collection & management

Data preparation

Data processing

Deployment

Data exploration

A need

A Problem to solve

Communication

Visualizations

Report

Product & service

Decision making

cleaned data

Use case

Data mining

Data collection

Data preparation

Data Analysis

Deployment

Need / problem to solve

Understanding of data

Modeling

Pattern identification

Evaluation

Understanding of the domain

Inspired by the CRISP-DM method

Exploratory Analysis

How to organize a data science project ?

Describe your project

Define what kind of data product you want

Build a detailed user story

Summarize & pitch your project

How to organize a data science project ?

State of the art

Gather sources on your topic: websites, articles, ...

Make an inventory of data sources

available ones (open data, domain data, ...)

needed ones

Are there similar projects ?

How to organize a data science project ?

Design a mock-up

Imagine the user experience (UI & UX)

Organize visually which information to display and how

How to organize a data science project ?

Domain knowledge

Exploratory data analysis

Select your data & analyses

which variables to keep or exclude?

what kind of preprocessing to apply?

which final analytics processing chain to apply?

are there specific methods/algorithms to use?

what types of plots to produce?
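The questions above about which variables to keep can be explored with a few lines of code. The sketch below is a minimal example using only the standard library; the dataset, the variable names, and the 50% missing-value threshold are assumptions made for illustration.

```python
from statistics import mean, median

# Hypothetical dataset: one dict per observation, None marks a missing value.
rows = [
    {"age": 34, "income": 2800, "notes": None},
    {"age": 51, "income": None, "notes": None},
    {"age": 27, "income": 2100, "notes": "call back"},
    {"age": 45, "income": 3900, "notes": None},
]

def missing_rate(rows, variable):
    """Share of observations where `variable` is missing."""
    return sum(row[variable] is None for row in rows) / len(rows)

def select_variables(rows, max_missing=0.5):
    """Keep variables whose missing rate does not exceed `max_missing` (assumed threshold)."""
    return [v for v in rows[0] if missing_rate(rows, v) <= max_missing]

def summarize(rows, variable):
    """Basic descriptive statistics over the non-missing values of a numeric variable."""
    values = [row[variable] for row in rows if row[variable] is not None]
    return {"n": len(values), "mean": mean(values), "median": median(values)}

kept = select_variables(rows)
print(kept)
for variable in kept:
    print(variable, summarize(rows, variable))
```

In a real project the threshold and the statistics would be chosen with domain knowledge rather than fixed in advance.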

How to organize a data science project ?

Organize your data & code

build a modularized template for your preprocessing chain

build a modularized template for your main processing chain

code the graphical interface

organize the deployment of your data product

structure your data with information system tools
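One possible shape for a modularized preprocessing and processing chain is a list of small functions applied in order. The sketch below is only one way to do it; the step names, the record fields, and the sample data are hypothetical.

```python
from functools import reduce

def strip_whitespace(records):
    """Preprocessing step: trim surrounding spaces in every text field."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def drop_incomplete(records):
    """Preprocessing step: discard records with an empty or missing field."""
    return [r for r in records if all(v not in ("", None) for v in r.values())]

def parse_amount(records):
    """Preprocessing step: convert the 'amount' field from text to float."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def total_by_city(records):
    """Main processing step: aggregate amounts per city."""
    totals = {}
    for r in records:
        totals[r["city"]] = totals.get(r["city"], 0.0) + r["amount"]
    return totals

def run_chain(data, steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda current, step: step(current), steps, data)

PREPROCESSING = [strip_whitespace, drop_incomplete, parse_amount]
PROCESSING = [total_by_city]

raw = [
    {"city": " Nice ", "amount": "12.5"},
    {"city": "Lyon", "amount": ""},
    {"city": "Nice", "amount": "7.5"},
]
print(run_chain(raw, PREPROCESSING + PROCESSING))
```

Keeping each step as an independent function makes it easy to reorder, test, or replace a step without touching the rest of the chain.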

... and popular tools

RapidMiner

KNIME

R

Python

Jupyter notebooks

Tableau Software

Software & Platforms

Data Preparation

SAP

Microsoft Power BI

QlikView

Hadoop

Google Cloud Platform

Amazon Web Services

Microsoft Azure

scikit-learn

TensorFlow

Pandas

ERP

Data management

Automation & Deployment

SQL

NoSQL

Data warehouse

Data lake

Databases

Data structures

ETL

Exploration & communication

Processing

IBM Cloud

CRM

Salesforce

SAS

API

...

LLM-based tools

The boom of Large Language Models

The generative AI boom

Strong enthusiasm from research, the public, and companies

Generative AI?

In recent years, research advances have produced AIs capable of generating realistic data

Images

videos

drawings

illustrations

photorealistic

...

Text

Code

paragraphs

questions / answers

Lists

summaries

...

Large Language Models (LLMs)

Sounds

voice

music

...

Very large neural networks

Trained for a very long time on gigantic datasets ...

... to predict each next token of a text

The network learns complex representations (embeddings)

Simplified example of text generation

Design & Training

"Machine learning is a branch of AI

Good generalization capabilities

Image credits: neural network icon created by Mohamed Mb, learning icon created by Gregor Cresnar (from the Noun Project)

The trained network has learned generalizable representations
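The training objective described here, predicting each next token, can be illustrated with a toy model that simply counts which token follows which. This is a deliberately simplified sketch: it is not a neural network, it learns no embeddings, and the corpus is invented for the example.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real models train on vastly larger datasets.
corpus = ("machine learning is a branch of ai . "
          "deep learning is a branch of ai .")

def train(text):
    """'Training': count, for each token, which token follows it."""
    tokens = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]

def generate(model, start, max_tokens=6):
    """'Inference': generate token by token, always taking the likeliest next token."""
    output = [start]
    for _ in range(max_tokens):
        nxt = predict_next(model, output[-1])
        if nxt is None:
            break
        output.append(nxt)
    return " ".join(output)

model = train(corpus)
print(generate(model, "machine"))
```

An LLM replaces the count table with a very large neural network, which is what lets it generalize to sequences it has never seen.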

Simplified example of text generation

Inference

Text generation token by token

Retraining on specific data

Select a set of documents to provide as context

Image credits: icon created by Yana Sapegina, learning icon created by Gregor Cresnar (from the Noun Project)

fine-tuning

As a data science expert programming in Python ...

context: prompt

Retrieval-Augmented Generation (RAG)

+
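The idea of selecting documents to provide as context and combining them with a prompt can be sketched without any model at all. In the example below, the documents, the word-overlap scoring, and the prompt wording are assumptions for illustration; real systems typically rank documents by embedding similarity and then send the prompt to an LLM.

```python
import re

# Hypothetical document collection (invented for illustration).
documents = [
    "Pandas is a Python library for tabular data manipulation.",
    "A data lake stores raw data of many different types.",
    "Churn prediction estimates which customers are likely to leave.",
]

def tokenize(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, docs, k=1):
    """Rank documents by the number of words shared with the question; keep the top k."""
    question_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(question_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, docs, k=1):
    """Assemble the prompt: role instruction, retrieved context, then the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    return (
        "As a data science expert programming in Python, answer using the context.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

prompt = build_prompt("What does a data lake store?", documents)
print(prompt)
```

Unlike fine-tuning, this approach leaves the model unchanged: only the text given to it at inference time differs.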

Image-generating AIs

DALL-E

Stable Diffusion

Midjourney

The best known

Video-generating AIs

Sora

Veo

Meta Movie Gen

Meta AI

The best known

Text-generating AIs

Closed foundation models (the best known)

GPT-4

Generative Pre-trained Transformer

PaLM 2

Pathways Language Model

Claude

Text-generating AIs

Open foundation models (the best known)

Falcon

Mixtral

Mixture of Experts

Llama 3

Large Language Model Meta AI

Sound-generating AIs

Example: Stable Audio

Generate sound from a text instruction (prompt)

AIs with multi-task capabilities

...

Applications & use cases

Success stories & failures

Map of some applications

INDUSTRY

BANK

RETAIL
MARKETING

MEDICINE

ARTS

MEDIA

TRANSPORT

...

Predictive maintenance

Robots

Flow management

Credit scoring

Fraud detection

Automatic trading

Sentiment analysis

Discovery of treatments

Prediction of treatment success

Traffic analysis

Generative design

Sound generation

Image generation

Resource planning

Assisted diagnosis

Autonomous vehicles

Automatic summary

Text generation

Product recommendation

Content recommendation

Automatic captioning

Churn prediction

customer behavior prediction

Use cases

Well-known success stories

Recommendations

Online retail

Streaming platforms

Social networks

Product / service

Content

People

Market places

Advertising

Use cases

Marketing

chat bots

Churn prediction

sentiment analysis

Predictive marketing

Customer segmentation

A/B testing

optimization

Still growing

Use cases

Widely used in the future?

Drug discovery

Diagnostic assistance

Health care

Stock optimisation

Patient allocation

Health medical record analysis

Smart city

Pollution prediction

Traffic prediction

Optimisation of building consumption

Facial recognition

Use cases

Should be avoided?

"Smart" city

Predictive policing

Facial recognition

Prediction of recidivism

Justice

Prediction of criminality

Citizen surveillance

Data ethics principles

Privacy

Transparency

Interpretability

Ecological footprint

Impact on users