World Food Facts

Andreas Kammerloher, Muhammad Triwindu Prasetya, Vivek Sethia

 

Motivation

  • Nutritional values
  • Exploring different categories
  • Eating habits across the world
  • Distribution of specific ingredients (sugar, fats, etc.)
  • Compare product brands and switch to healthier food

Overview

  • Introduction
  • Problems
  • Cleaning & Preprocessing
  • Interpretations
  • Predictions
  • Conclusion

Introduction

  • Dataset of food products, created by everyone for everyone
  • Total: 119415 products
  • Size: 267 MB
  • Format: .csv
  • Attributes: 159
  • Missing values: almost 13 million
  • For most attributes, almost 80% of the values are missing
  • The Open Food Facts Project was started by Stéphane Gigandet
  • Like on Wikipedia, anyone can create an account and add / edit food items

Problems

  • Biased data: most entries (90%) come from France
  • Many missing values
  • Translation issues
  • Stop words
  • The same product (same brand and quantity) appears with different nutrition facts

Cleaning

  • Missing values
  • Stopwords and punctuation
  • Translation issues
  • Case sensitivity ("Lait" and "lAit")
  • Inappropriate values in cells

Missing values

  • Many attributes have missing values
product_name        : 14787
generic_name        : 60005
quantity            : 21450
packaging           : 39153
packaging_tags      : 39153
brands              : 18298
brands_tags         : 18304
categories          : 35132
categories_tags     : 35153
ingredients_text    : 44506
allergens           : 83246
allergens_en        : 106340
additives_n         : 44542
additives           : 45032
additives_tags      : 73536
additives_en        : 73536
fat_100g            : 53180
saturated.fat_100g  : 57684
X.butyric.acid_100g : 106340
X.sucrose_100g      : 106283
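
Counts like the ones above can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the sample rows are invented, and the set of markers treated as "missing" is an assumption, not something the dataset documents here.

```python
import csv
import io

# Tiny stand-in for the real CSV: column names follow the slide, rows are invented.
SAMPLE = """product_name,brands,fat_100g
Lait,,3.5
,Acme,
Chocolat,Acme,NA
"""

# Assumption: these strings are how a missing cell is encoded.
MISSING_MARKERS = {"", "NA", "NaN"}


def count_missing(csv_text):
    """Return {column: number of missing cells} for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            value = row[name]
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE)
for column, n in missing.items():
    print(f"{column:<15}: {n}")
```

For the real file, the same function could be fed the file contents instead of SAMPLE.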

Stopwords and Punctuation

  • R libraries used for cleaning:
    • library(NLP)
    • library(tm)
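
To illustrate the cleaning step itself, here is a small standard-library Python sketch of stopword and punctuation removal. It does not use or imitate the tm API; the stopword list is a hypothetical, deliberately tiny example.

```python
import string

# Hypothetical, deliberately tiny stopword list (French and English mixed).
STOPWORDS = {"de", "la", "le", "et", "au", "the", "and", "of", "with"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_tokens(text):
    """Lowercase the text, drop punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower().translate(PUNCT_TO_SPACE)
    return [tok for tok in lowered.split() if tok not in STOPWORDS]


print(clean_tokens("Chocolat au lait, sucre et beurre de cacao."))
```

The remaining tokens are what a word cloud would then be built from.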

Stopwords and Punctuation

  • Wordcloud before removing stopwords and punctuation

Figure 1

Stopwords and Punctuation

  • Wordcloud after removing:
    • Stopwords
    • Punctuation
  • Also after translating words

Figure 2

Translation issues

  • We could not find a suitable library for translation
  • Non-English words were manually substituted with English words
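
A manual substitution of this kind can be sketched as a plain lookup table. The word list below is hypothetical; the authors' actual substitutions are not shown on the slides.

```python
# Hypothetical French-to-English lookup table (not the authors' actual list).
TRANSLATIONS = {"lait": "milk", "sucre": "sugar", "blé": "wheat", "eau": "water"}


def translate_tokens(tokens, table=TRANSLATIONS):
    """Replace each known word by its English equivalent; keep unknown words."""
    return [table.get(tok, tok) for tok in tokens]


print(translate_tokens(["lait", "sucre", "cacao"]))
```

Words that are not in the table pass through unchanged, so the table can be extended step by step.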

Case Sensitivity

  • Because anyone can insert data into the Open Food Facts database, the input is inconsistent.
  • For example: blé and Blé
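
One way to merge such variants is to normalize every word before counting. This is a minimal Python sketch under the assumption that case folding plus Unicode NFC normalization is sufficient; the word list is invented.

```python
import unicodedata
from collections import Counter


def normalize(word):
    """Trim whitespace, unify the Unicode representation, and fold case."""
    return unicodedata.normalize("NFC", word.strip()).casefold()


words = ["blé", "Blé", "BLÉ", "Lait", "lAit"]
counts = Counter(normalize(w) for w in words)
print(counts)
```

The NFC step also merges an "e" followed by a combining accent with the precomposed "é", which otherwise look identical but compare as different strings.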

Inappropriate Value

  • Many cells contain inappropriate values, for example in the countries_en attribute.

Inspecting Sweets

  • Searched database for sweets-related tags
  • Found 661 items (in a database of over 100000 foods)
  • Obviously there must be missing (untagged) sweets!

Inspecting Sweets

All nutritional values per 100g

Figure 3

Inspecting Sweets

All nutritional values per 100g

Figure 4

Inspecting Beverages

Searched the dataset for the beverage tag

  • Found 71187 entries
  • Beverage categories, based on their tags:
    • Morning drink
      • Milk, Coffee, Chocolate Milk
    • Soda drink
      • Sprite, Coca Cola, Pepsi, Diet Coke
    • Healthy drink
      • Tea, Juice, Smoothie
    • Protein and energy drink
      • Protein shake, Power drink
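
A grouping like this can be sketched as keyword matching. The group names and example drinks come from the list above; the whole-word matching rule and the function name are assumptions made for illustration.

```python
import re

# Groups and example keywords taken from the slide; matching rule is an assumption.
BEVERAGE_GROUPS = {
    "Morning drink": ["milk", "coffee", "chocolate milk"],
    "Soda drink": ["sprite", "coca cola", "pepsi", "diet coke"],
    "Healthy drink": ["tea", "juice", "smoothie"],
    "Protein and energy drink": ["protein shake", "power drink"],
}


def beverage_group(text):
    """Return the first group whose keyword occurs as a whole word, else None."""
    lowered = text.casefold()
    for group, keywords in BEVERAGE_GROUPS.items():
        for keyword in keywords:
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                return group
    return None


print(beverage_group("Orange Juice 1L"))
print(beverage_group("Diet Coke 33cl"))
```

Matching whole words avoids false hits such as "tea" inside "steak".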

Inspecting Beverages

  • Formula:
    • Sent to the Open Food Facts contributors by the team of Prof. Hercberg
    • The formula has been the subject of studies and adaptations for the French market
    • There are 2 formulas:
      • One for scoring solid food
      • One for scoring beverages
        • Note: according to the website, the nutrition score of milk must be calculated with the solid-food formula

Clustering beverage products

Figure 5

Beverage Nutrition Grade

Threshold 	Grade

Beverage Nutrition Grade

Points  Energy (kJ)  Sugars (g)            Fruits, vegetables (%)
0       0            0                     ≤ 40
1       ≤ 30         ≤ 1.5 or sweeteners
2       > 30         > 1.5                 > 40
3       > 60         > 3
4       > 90         > 4.5                 > 60
5       > 120        > 6
6       > 150        > 7.5
7       > 180        > 9
8       > 210        > 10.5
9       > 240        > 12
10      > 270        > 13.5                > 80
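
The point table above can be read as a threshold lookup. The following Python sketch only computes the points per component; how the points are combined into a final grade is not shown here and is left out. The function names and the handling of sweeteners are assumptions.

```python
from bisect import bisect_left

# Bounds copied from the table: the points are the number of bounds a value exceeds.
ENERGY_BOUNDS_KJ = [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
SUGAR_BOUNDS_G = [0, 1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 13.5]


def threshold_points(value, bounds):
    """Count how many bounds lie strictly below value (0 to 10 points)."""
    return bisect_left(bounds, value)


def energy_points(energy_kj):
    return threshold_points(energy_kj, ENERGY_BOUNDS_KJ)


def sugar_points(sugars_g, has_sweeteners=False):
    points = threshold_points(sugars_g, SUGAR_BOUNDS_G)
    # Assumed reading of "≤ 1.5 or sweeteners": sweeteners give at least 1 point.
    return max(points, 1) if has_sweeteners else points


def fruit_points(fruit_percent):
    if fruit_percent > 80:
        return 10
    if fruit_percent > 60:
        return 4
    if fruit_percent > 40:
        return 2
    return 0


# Invented example values for a sugary soft drink.
print(energy_points(180), sugar_points(10.6), fruit_points(0))
```

With these example values the drink gets 6 energy points (more than 150 kJ, not more than 180 kJ) and 8 sugar points (more than 10.5 g).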

Figure 6

Meat product - Types

Figure 7

Meat product - Additives used

Figure 8

Predicting Chocolate

The first attempt

  • Trained a random forest with items tagged as chocolate
  • Used sugar, fat, salt and energy content
  • Accuracy was high only because chocolate items are a small share of the data -> a better quality measure is needed
  • F1-score: 0.66
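
Why accuracy is misleading for a rare class can be shown with a toy example. The numbers below are invented (20 chocolate items among 1000), and the "classifier" simply never predicts chocolate.

```python
def accuracy(y_true, y_pred):
    """Share of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented data: 20 chocolate items (1) among 1000 products.
y_true = [1] * 20 + [0] * 980
never_chocolate = [0] * 1000

print("accuracy:", accuracy(y_true, never_chocolate))
print("F1-score:", f1_score(y_true, never_chocolate))
```

The useless predictor reaches 98% accuracy but an F1-score of 0, which is why the F1-score is the more informative measure here.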

Predicting Chocolate

Ideas to improve the predictions

  • Predict subtypes of chocolate
  • Use more / different attributes
  • Experiment with the number of trees in the forest

Predicting Chocolate

Dark Chocolate

Figure 9

Predicting Chocolate

Dark Chocolate

  • Predicting Dark Chocolate only
  • Using Proteins and Carbohydrates in addition to the previous attributes

 

  • F1-score: 0.77                                                

Predicting Chocolate

Milk Chocolate

Figure 10

Predicting Chocolate

Milk Chocolate

  • Better F1-score when leaving out carbohydrates
  • The only chocolate prediction that improved with a higher number of trees (500 instead of 100)

 

  • Still only an F1-score of 0.56

Predicting Chocolate

White Chocolate

  • Tried different combinations of attributes
  • Still only an F1-score of 0.23

  • Why?

Predicting Chocolate

White Chocolate

  • Duplicated every white chocolate item in the dataset 20 times
  • Held out half of the copies before training and added them back for testing
  • Split the other half randomly between the training and test sets

 

  • Result: F1-score of 0.88                               
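
The procedure above can be sketched as follows. This is one possible reading of the steps, not the authors' code: the 60% training fraction, the fixed seed, and all names are assumptions, and the items are invented.

```python
import random


def oversample_and_split(items, is_positive, copies=20, train_fraction=0.6, seed=0):
    """Duplicate the rare class, hold out half of the copies, split the rest."""
    rng = random.Random(seed)
    positives = [item for item in items if is_positive(item)]
    negatives = [item for item in items if not is_positive(item)]

    duplicated = positives * copies          # every rare item appears `copies` times
    rng.shuffle(duplicated)
    half = len(duplicated) // 2
    held_out, remaining = duplicated[:half], duplicated[half:]

    pool = negatives + remaining             # the other half joins the common items
    rng.shuffle(pool)
    cut = int(len(pool) * train_fraction)    # assumed split ratio
    train = pool[:cut]
    test = pool[cut:] + held_out             # held-out copies are used for testing only
    return train, test


# Invented data: 5 white chocolates among 100 products.
items = [("white", i) for i in range(5)] + [("other", i) for i in range(95)]
train, test = oversample_and_split(items, lambda item: item[0] == "white")
print(len(train), len(test))
```

Because the copies are identical, the same item can end up in both the training and the test set, so a score obtained this way may be more optimistic than it would be on unseen products.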

Predicting Juice in Whole dataset

 

  • The data consists of (after removing empty categories_en cells):
    • 1963 categorized as Juice
    • 69234 categorized as non-Juice
  • Split the data into:
    • Train set (60%)
    • Test set (40%)
  • Using randomForest:
    • Number of trees: 500
    • F1-score: 0.695

Predicting Juice in Beverage dataset

 

  • The data consists of (after removing non-Beverage items):
    • 1925 categorized as Juice
    • 25852 categorized as non-Juice
  • Split the data into:
    • Train set (60%), Test set (20%), Validation set (20%)
  • Attributes: fat, sugar, energy, sodium, fruits, fibre, protein
  • Compared 2 methods to see which performs better:
    • randomForest 
    • naiveBayes
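
A 60/20/20 split such as the one listed above can be sketched with the standard library alone. The shuffling seed and the function name are assumptions; the row count is the sum of the two class sizes given above.

```python
import random


def split_60_20_20(rows, seed=42):
    """Shuffle a copy of the rows and cut it into train, test and validation parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = (n * 60) // 100
    n_test = (n * 20) // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]   # takes whatever remains
    return train, test, validation


rows = list(range(1925 + 25852))  # stand-in for the beverage rows
train, test, validation = split_60_20_20(rows)
print(len(train), len(test), len(validation))
```

A plain random split keeps roughly the same share of juice items in each part only on average; a stratified split would guarantee it.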

Predicting Juice in Beverage dataset 

 

Result

  • Naive Bayes: 0.22
  • Random Forest: 0.73

Predicting missing Plant-based products

 

  • Data:
    • Entries with a missing category: 23443
    • Attributes used: protein, fat, sugar, carbohydrate, energy
  • Split the data into:
    • Train set (60%), Test set (20%), Validation set (20%)
  • Method: Random Forest Regressor
  • Result:
    • Recall: 0.79
    • Precision: 0.85
    • F1-score: 0.82
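
The three numbers above are consistent with each other, since the F1-score is the harmonic mean of precision and recall. A quick check in Python:

```python
def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_from(precision=0.85, recall=0.79)
print(round(f1, 2))
```

Rounded to two decimals, this gives the reported 0.82.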

Predicting missing Plant-based products

 

  • Total missing values: 2315 (counting only entries with non-NaN values for protein, sugar, carbohydrate, fat and energy)

  • Labelled: 434
  • We manually checked 30 values, and 25 of them were plant-based.

Conclusion & Lessons-learnt

  • We learnt about different data mining techniques
  • We learnt how to apply machine learning techniques
  • We learnt R and Python for data exploration
  • We used different tools, such as Python Notebook

Tools 

THANK YOU

World Food Facts

By Vivek Sethia
