How to work with the data:

Kaggle approach

Andrey Lukyanenko

MLE @ Meta

About me

  • ~4 years as ERP-system consultant
  • DS/MLE since 2017
  • Led a medical chatbot project and an R&D CV team
  • Senior DS at Careem: anti-fraud, recommendation system, LLM-based products
  • MLE at Meta: Monetization team
  • Google Developer Expert: Kaggle category

Home Credit Default Risk

  • The goal is to predict default risk
  • Default risk is correlated with the interest rate
  • Interest rate wasn't present in the data
  • Based on AMT_CREDIT, AMT_ANNUITY, and CNT_PAYMENT we can derive the interest rate (see the sketch below)
  • AMT_ANNUITY * CNT_PAYMENT / AMT_CREDIT = (1 + ir) ^ (CNT_PAYMENT / 12)
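
A minimal sketch of that inversion, assuming a dataframe that already holds all three columns (in the raw data, CNT_PAYMENT sits in previous_application):

```python
import pandas as pd

def add_interest_rate(df: pd.DataFrame) -> pd.DataFrame:
    # Total repaid over principal: AMT_ANNUITY * CNT_PAYMENT / AMT_CREDIT
    ratio = df["AMT_ANNUITY"] * df["CNT_PAYMENT"] / df["AMT_CREDIT"]
    # Invert (1 + ir) ^ (CNT_PAYMENT / 12) = ratio for the annual rate ir
    df["interest_rate"] = ratio ** (12 / df["CNT_PAYMENT"]) - 1
    return df
```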

Sberbank Russian Housing Market

  • Predict prices of Russian apartments
  • The economy drives prices -> oil/gas prices are very important; information on sanctions is influential too
  • Cleaning the data was crucial because the raw data was very noisy: fixing outliers was a key step
  • Proxy of "market temperature": number of sales per period (see the sketch below)

Rossmann Store Sales

  • Predict daily sales of stores
  • Test data included the end of summer break
  • Weather, holidays, macro indicators, and search trends are important -> need to get external data
  • Store ID -> State -> Weather Station (see the join sketch below)
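
A minimal sketch of the external-data join; the helper file names and layouts are assumptions (a store-to-state mapping plus per-state daily weather collected outside the competition data).

```python
import pandas as pd

# Hypothetical external files: store -> state mapping and per-state
# daily weather, both gathered outside the official competition data.
store_states = pd.read_csv("store_states.csv")              # Store, State
weather = pd.read_csv("weather.csv", parse_dates=["Date"])  # State, Date, ...

train = pd.read_csv("train.csv", parse_dates=["Date"])
train = train.merge(store_states, on="Store", how="left")
train = train.merge(weather, on=["State", "Date"], how="left")
```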

Predicting Red Hat Business Value

  • Binary classification: predict customer potential value
  • Data: train and test activity files, profile info
  • Some people appeared in both train and test, so their test labels were already known from train
  • One categorical column (a group id) almost always had the same label for all its customers on the same date (see the sketch below)
  • Train a model only for the remaining customers
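
A minimal sketch of exploiting that pattern, assuming the standard competition files (act_train.csv, act_test.csv, people.csv) with the group column group_1:

```python
import pandas as pd

# Join the activity files with people.csv so every row carries group_1.
people = pd.read_csv("people.csv")[["people_id", "group_1"]]
train = pd.read_csv("act_train.csv").merge(people, on="people_id")
test = pd.read_csv("act_test.csv").merge(people, on="people_id")

# Mean outcome per (group, date): in this competition it was almost
# always exactly 0 or 1, i.e. the pair determines the label.
key = ["group_1", "date"]
leak = train.groupby(key)["outcome"].mean().rename("leak").reset_index()
test = test.merge(leak, on=key, how="left")

# Use the leaked label where available; train a model for the rest.
needs_model = test["leak"].isna()
```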

Intel & MobileODT Cervical Cancer Screening

  • Predict cervix type from images
  • The same patient could have images in both the train and the test set
  • Patients identified by image matching or file name prefix
  • It was crucial to use patient-based validation (see the sketch below)
  • Manually removing low-quality images from the train data vs. data augmentation
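
A minimal sketch of patient-based validation with scikit-learn's GroupKFold; the arrays are dummy stand-ins for the real images, labels, and recovered patient ids.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Dummy stand-ins: in practice these come from the image files, with
# patient ids recovered via image matching or file-name prefixes.
images = np.random.rand(20, 8)
labels = np.random.randint(0, 3, size=20)
patient_ids = np.repeat(np.arange(5), 4)

# Images of one patient never land in both the train and valid folds.
gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(images, labels, groups=patient_ids):
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[valid_idx])
```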

IEEE-CIS Fraud Detection

  • No user_id in the data
  • User ID Proxy: card info, billing zip, email, device info
  • Create user-level features
  • User-based validation
  • All of a user's transactions should get the same prediction (see the sketch below)
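
A minimal sketch of a user-id proxy; the exact column choice is an assumption, any reasonably stable identity columns would do.

```python
import pandas as pd

def add_user_proxy(df: pd.DataFrame) -> pd.DataFrame:
    # Concatenate relatively stable identity columns into a pseudo
    # user id (column names follow the competition data).
    cols = ["card1", "card2", "addr1", "P_emaildomain", "DeviceInfo"]
    df["uid"] = df[cols].astype(str).agg("_".join, axis=1)
    return df

# With a uid in place: user-level aggregates, user-based validation,
# and one prediction per user, e.g. averaging transaction-level scores:
# df["pred"] = df.groupby("uid")["pred"].transform("mean")
```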

Quora Question Pairs

  • Determine if two questions are duplicates
  • It turned out that there was an overlap between train and test
  • Many questions appeared multiple times
  • The key: build a graph on train + test questions, then traverse it with Union-Find or BFS (see the sketch below)
  • Neighbour statistics over this graph made strong features
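
A minimal union-find sketch over the question graph; toy question ids stand in for the real train + test pairs.

```python
# Path-compressed union-find over question ids; merging confirmed
# duplicates makes the label transitive (A~B and B~C implies A~C).
parent = {}

def find(x):
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:  # path compression
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# Toy pairs in place of the real train + test files.
pairs = [("q1", "q2", 1), ("q2", "q3", 1), ("q4", "q5", 0)]
for q1, q2, is_dup in pairs:
    if is_dup:  # only confirmed duplicates merge components
        union(q1, q2)

assert find("q1") == find("q3")  # the duplicate label propagates
```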

Two Sigma Connect: Rental Listing Inquiries

  • Predict interest (low, medium, high) for the listing
  • The folder with images had timestamps
  • Timestamps were correlated with the classes: newer listings had low interest, older listings had high interest (see the sketch below)
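
One way such a feature could be extracted, assuming the image archive is unpacked into one folder per listing (the exact archive layout is an assumption):

```python
import os
import pandas as pd

# Each listing's image folder carries a file-system timestamp, which
# turned out to correlate with the interest level.
records = [
    {"listing_id": entry.name, "folder_time": entry.stat().st_mtime}
    for entry in os.scandir("images")
    if entry.is_dir()
]
time_feature = pd.DataFrame(records)
```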

Instant Gratification

  • Synthetic data, 258 columns
  • Participants discovered that the data was generated using Gaussian Mixture Models
  • Per group, only 33‑47 features had noticeably larger variance; the rest looked like pure N(0, 1) noise
  • wheezy-copper-turtle-magic has 512 unique values
  • Train 512 GMM models with 3 clusters per class (see the sketch below)
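
A minimal sketch of the per-subset modelling idea, keeping only the high-variance features; the 1.5 cut-off and the equal class priors are assumptions that match common public solutions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def predict_subset(X_tr, y_tr, X_te, n_clusters=3):
    # Keep features whose spread exceeds the N(0, 1) noise columns
    # (std ~ 1); the 1.5 threshold is an assumption.
    useful = X_tr.std(axis=0) > 1.5
    X_tr, X_te = X_tr[:, useful], X_te[:, useful]
    # One GMM per class, n_clusters components each.
    gmms = {}
    for cls in (0, 1):
        gm = GaussianMixture(n_components=n_clusters, random_state=0)
        gm.fit(X_tr[y_tr == cls])
        gmms[cls] = gm
    # Likelihood ratio -> probability of class 1 (equal priors assumed).
    ll0 = gmms[0].score_samples(X_te)
    ll1 = gmms[1].score_samples(X_te)
    return 1.0 / (1.0 + np.exp(ll0 - ll1))
```

In practice this runs once per value of wheezy-copper-turtle-magic (512 times), splitting both train and test on that column.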

Conclusions

  • Create domain-based features
  • Think about what additional information could be useful
  • Find patterns in the behavior
  • Build a robust validation approach
  • Try to reformulate the problem
  • Think about how the data was collected

Contacts

By Andrey Lukyanenko
