How to work with data: the Kaggle approach
Kaggle approach
Andrey Lukyanenko
MLE @ Meta

About me
- ~4 years as an ERP-system consultant
- DS/MLE since 2017
- Led a medical chatbot project and an R&D CV team
- Senior DS at Careem: anti-fraud, recommendation systems, LLM-based products
- MLE at Meta: Monetization team
- Google Developer Expert: Kaggle category

Home Credit Default Risk
- The goal is to predict default risk
- Default risk is correlated with the interest rate
- The interest rate wasn't present in the data
- Based on AMT_CREDIT, AMT_ANNUITY, and CNT_PAYMENT we can derive it
- AMT_ANNUITY × CNT_PAYMENT / AMT_CREDIT = (1 + ir)^(CNT_PAYMENT / 12)
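
A minimal sketch of deriving the rate, assuming a pandas DataFrame that already contains all three columns (CNT_PAYMENT actually lives in the previous-application table, so a join may be needed first). Rearranging the relation above gives ir = (AMT_ANNUITY × CNT_PAYMENT / AMT_CREDIT)^(12 / CNT_PAYMENT) − 1:

```python
import pandas as pd

def add_interest_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate annual interest rate from the annuity relation
    AMT_ANNUITY * CNT_PAYMENT / AMT_CREDIT = (1 + ir) ** (CNT_PAYMENT / 12)."""
    overpayment = df["AMT_ANNUITY"] * df["CNT_PAYMENT"] / df["AMT_CREDIT"]
    # Solve for ir; the clip guards against zero/negative ratios in dirty rows.
    df["interest_rate"] = overpayment.clip(lower=1e-9) ** (12.0 / df["CNT_PAYMENT"]) - 1.0
    return df
```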

Sberbank Russian Housing Market
- Predict prices of Russian apartments
- The economy drives prices -> oil/gas prices are very important, and information on sanctions is influential too
- Cleaning the data was crucial (e.g., fixing outliers); otherwise it was too noisy
- Proxy of "market temperature": number of sales per period
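
A hedged sketch of such a proxy, assuming one row per sale with a datetime `timestamp` column (the column names are illustrative):

```python
import pandas as pd

def add_market_temperature(df: pd.DataFrame) -> pd.DataFrame:
    """Attach the number of sales in each calendar month to every row,
    as a proxy for how 'hot' the market was at the time of the sale."""
    df = df.copy()
    month = df["timestamp"].dt.to_period("M")
    df["sales_per_month"] = df.groupby(month)["timestamp"].transform("size")
    return df
```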
Rossmann Store Sales
- Predict daily sales of stores
- Test data included the end of summer break
- Weather, holidays, macro indicators, and search trends are important -> external data had to be collected
- Store ID -> State -> Weather Station
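
A hedged sketch of attaching weather through that mapping; the file names and weather columns are assumptions (a store-to-state mapping was shared by participants, and weather can be pulled per state from public sources):

```python
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["Date"])
store_state = pd.read_csv("store_states.csv")               # Store, State
weather = pd.read_csv("weather.csv", parse_dates=["Date"])  # State, Date, temp, rain, ...

# Store ID -> State -> weather for that state on that day.
train = train.merge(store_state, on="Store", how="left")
train = train.merge(weather, on=["State", "Date"], how="left")
```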

Predicting Red Hat Business Value
- Binary classification: predict customer potential value
- Data: train and test activity files, profile info
- Some people appeared in both train and test and shared the same label
- One categorical column almost always had the same label for all customers within it on the same date
- Train a model only for the remaining customers
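
A hedged sketch of exploiting both leaks, assuming the competition's people_id/outcome schema and a `group` column merged in from the people table (the exact column names are assumptions):

```python
import pandas as pd

train = pd.read_csv("act_train.csv", parse_dates=["date"])
test = pd.read_csv("act_test.csv", parse_dates=["date"])

# Leak 1: people seen in train keep their known label in test.
person_label = train.groupby("people_id")["outcome"].first()
test["leak_pred"] = test["people_id"].map(person_label)

# Leak 2: group/date combinations with a single outcome in train.
grp = (train.groupby(["group", "date"])["outcome"]
            .agg(["nunique", "first"]).reset_index())
grp = grp.loc[grp["nunique"] == 1, ["group", "date", "first"]]
test = test.merge(grp.rename(columns={"first": "group_date_label"}),
                  on=["group", "date"], how="left")
test["leak_pred"] = test["leak_pred"].fillna(test["group_date_label"])
# Rows still missing leak_pred go to a regular trained model.
```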
Intel & MobileODT Cervical Cancer Screening
- Predict the cervix type from images
- The same patient could have images in both the train and test sets
- Patients were identified by image matching or file name prefixes
- It was crucial to use patient-based validation
- Manually removing bad-quality images from the train data vs. data augmentation
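
A minimal sketch of patient-based validation with scikit-learn's GroupKFold; the feature matrix and patient IDs are placeholders (in practice they would come from the image matching described above):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 16)            # placeholder image features
y = np.random.randint(0, 3, size=100)  # cervix type labels
patient_ids = np.random.randint(0, 20, size=100)

gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, y, groups=patient_ids):
    # Every patient's images fall entirely into train or validation,
    # so the score is not inflated by near-duplicate images.
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[valid_idx])
```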
IEEE-CIS Fraud Detection
- No user_id in the data
- User ID Proxy: card info, billing zip, email, device info
- Create user-level features
- User-based validation
- All transactions of the same user should get the same prediction
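
A hedged sketch using real column names from the competition tables; the exact combination of columns that defines a "user" was a key modelling decision and varied between teams:

```python
import pandas as pd

def add_user_proxy(df: pd.DataFrame) -> pd.DataFrame:
    # Proxy user ID from card, billing address, and email columns.
    df["uid"] = (df["card1"].astype(str) + "_"
                 + df["addr1"].astype(str) + "_"
                 + df["P_emaildomain"].astype(str))
    # User-level aggregates as features.
    df["uid_amt_mean"] = df.groupby("uid")["TransactionAmt"].transform("mean")
    df["uid_amt_std"] = df.groupby("uid")["TransactionAmt"].transform("std")
    return df

# Post-processing: one prediction per proxy user.
# df["pred"] = df.groupby("uid")["pred"].transform("mean")
```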
Quora Question Pairs
- Determine if two questions are duplicates
- It turned out that there was an overlap between train and test
- Many questions appeared multiple times
- The key: build a graph of questions over train + test, then find connected components with Union-Find or BFS
- Neighbour statistics (e.g., the number of common neighbours of the two questions) made strong features
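
A minimal sketch with networkx, assuming train/test DataFrames with question1/question2 columns:

```python
import networkx as nx
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# One graph over all question pairs, train and test together.
pairs = pd.concat([train[["question1", "question2"]],
                   test[["question1", "question2"]]])
G = nx.Graph()
G.add_edges_from(pairs.itertuples(index=False, name=None))

def common_neighbours(q1: str, q2: str) -> int:
    """Neighbour statistic: questions sharing many neighbours
    are much more likely to be duplicates."""
    return len(set(G[q1]) & set(G[q2]))
```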
Two Sigma Connect: Rental Listing Inquiries
- Predict interest (low, medium, high) for the listing
- Folder with images had timestamps
- Timestamps were correlated with the classes: newer listings had low interest, older listings had high interest
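
A hedged sketch of harvesting that leak, assuming the images are unpacked into one folder per listing and the archive preserved folder modification times (the layout is an assumption):

```python
import os
import pandas as pd

records = []
for listing_id in os.listdir("images"):
    folder = os.path.join("images", listing_id)
    # The folder's modification time leaked when the listing was created.
    records.append({"listing_id": listing_id,
                    "img_timestamp": os.path.getmtime(folder)})

leak = pd.DataFrame(records).sort_values("img_timestamp")
```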

Instant Gratification
- Synthetic data, 258 columns
- Participants discovered that the data was generated using Gaussian Mixture Models
- Per group, only 33-47 features had noticeably larger variance; the rest looked like pure N(0, 1) noise
- wheezy-copper-turtle-magic has 512 unique values
- Train 512 GMM models with 3 clusters per class
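
A hedged sketch for one magic subset; the full solution loops over all 512 values of wheezy-copper-turtle-magic. The variance-based feature selection and 3 components per class follow the bullets above; the rest of the details are assumptions:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.mixture import GaussianMixture

def predict_subset(X_tr, y_tr, X_te, clusters_per_class=3):
    # Informative columns have clearly larger variance than the N(0, 1) noise.
    sel = VarianceThreshold(threshold=1.5).fit(X_tr)
    X_tr, X_te = sel.transform(X_tr), sel.transform(X_te)

    # One GMM per class, each with 3 components.
    gmms = [GaussianMixture(n_components=clusters_per_class, random_state=0)
                .fit(X_tr[y_tr == cls]) for cls in (0, 1)]

    # Turn the two log-likelihoods into P(class 1), assuming equal priors.
    ll = np.column_stack([gm.score_samples(X_te) for gm in gmms])
    ll -= ll.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(ll)
    return probs[:, 1] / probs.sum(axis=1)
```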


Conclusions
- Create domain-based features
- Think about what additional information could be useful
- Find patterns in the behavior
- Build a robust validation approach
- Try to reformulate the problem
- Think about how the data was collected