Technical Challenge

Data Scientist Vacancy

André Ferreira

Solar forecast requirements

A hindcast showing the expected performance of your model on a historical dataset

A script for retraining your model

An inference script for making the production forecast


Planning

Workflow: Planning (read assignment, define goals & tasks) → Experimentation (explore data, experiment with models) → Engineering (write scripts & classes, write README, test pipeline)


Exploratory data analysis


* without considering Sun-related features like hour & season
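The footnote qualifies a correlation finding from the EDA; a minimal sketch of that kind of check, with a hypothetical file path and column names, could be:

import pandas as pd

# Rank raw weather features by absolute correlation with power output,
# deliberately excluding Sun-related features like hour & season.
df = pd.read_parquet("data/solar.parquet")  # hypothetical path
weather_cols = [c for c in df.columns if c not in ("power", "hour", "season")]
print(
    df[weather_cols]
    .corrwith(df["power"])
    .abs()
    .sort_values(ascending=False)
    .head(10)
)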

Modelling experiments

Model architecture options: XGBoost, Linear Regression
Feature options: original cloud cover, time-aggregated cloud cover, Sun position

Best model: XGBoost + original cloud cover + Sun position
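How the Sun position features are computed isn't shown in the deck; a common approach is pvlib's solar position calculations, sketched here assuming a UTC DatetimeIndex and hypothetical site coordinates:

import pandas as pd
from pvlib.solarposition import get_solarposition

def add_sun_position_features(df: pd.DataFrame, lat: float, lon: float) -> pd.DataFrame:
    # Append solar elevation and azimuth for each timestamp.
    solpos = get_solarposition(df.index, latitude=lat, longitude=lon)
    df = df.copy()
    df["sun_elevation"] = solpos["apparent_elevation"].values
    df["sun_azimuth"] = solpos["azimuth"].values
    return df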

Model class

# preprocess_data: rolling-window aggregations of the raw weather features
for feature in self.features_to_agg:
    for time_window in self.time_windows:
        for agg_func in self.agg_funcs:
            X_processed[
                f"{feature}_agg_{time_window}_{agg_func}"
            ] = (
                X_processed[feature].rolling(time_window).agg(agg_func)
            )

# calculate_metrics: RMSE overall and per quantile bucket of the target
df["squared_error"] = (df.preds - df.y) ** 2
df["quantile_category"] = pd.cut(
    df.y,
    bins=self.quantile_values,
    include_lowest=True,
    labels=self.quantile_labels,
)
metrics_df = (
    df
    .groupby("quantile_category")
    .squared_error
    .mean()
    .apply(np.sqrt)
)
metrics_df["all"] = np.sqrt(df.squared_error.mean())
metrics_df.rename("rmse", inplace=True)
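How the quantile bin edges and labels are built isn't shown; one plausible construction from the training target, given the quantiles attribute, would be (an assumption, not the author's code):

# Sketch: bin edges from empirical quantiles of the training target.
# `quantiles` might be e.g. [0, 0.25, 0.5, 0.75, 1.0]; only the attribute
# names come from the deck, the construction itself is assumed.
self.quantile_values = np.quantile(y_train, self.quantiles)
self.quantile_labels = [
    f"q{lo:.2f}-q{hi:.2f}"
    for lo, hi in zip(self.quantiles[:-1], self.quantiles[1:])
]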

SolarForecastModel

Attributes: n_estimators, early_stopping_rounds, eval_metric, features_to_agg, time_windows, agg_funcs, sun_position_features, model_type, quantiles, quantile_labels, quantile_values, explainer, model, kwargs

Methods: preprocess_data, calculate_metrics, fit, predict, feature_importances, save, load
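Read together, a minimal skeleton of the class might look like the following (only the names come from the deck; defaults and bodies are assumptions):

class SolarForecastModel:
    def __init__(self, model_type="xgboost", n_estimators=1000,
                 early_stopping_rounds=50, eval_metric="rmse", **kwargs):
        # Hyperparameters and feature-engineering config (defaults assumed).
        self.model_type = model_type
        self.n_estimators = n_estimators
        self.early_stopping_rounds = early_stopping_rounds
        self.eval_metric = eval_metric
        self.kwargs = kwargs
        self.model = None      # fitted XGBoost or linear regression model
        self.explainer = None  # e.g. a SHAP explainer backing feature_importances

    def preprocess_data(self, df): ...          # rolling aggregations + Sun position
    def fit(self, X_train, y_train, X_val, y_val): ...
    def predict(self, X): ...
    def calculate_metrics(self, preds, y): ...  # RMSE overall and per quantile
    def feature_importances(self): ...
    def save(self, path): ...
    def load(self, path): ...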

Relevant code snippets

Scripts

train.py

# train.py: fit the model on a date window and save the weights
df = load_data(input_path)
df = df.loc[start_date:end_date]
model = SolarForecastModel(**kwargs)
df = model.preprocess_data(df)
train_df, val_df = split_data(df)
X_train, y_train = train_df.drop(columns=["power"]), train_df["power"]
X_val, y_val = val_df.drop(columns=["power"]), val_df["power"]
model.fit(X_train, y_train, X_val, y_val)
model.save(output_path)
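load_data and split_data live in src/utils.py and aren't shown; a minimal chronological split, sketched here with an assumed validation fraction, might be:

def split_data(df, val_fraction=0.2):
    # Chronological split: shuffling would leak future weather into training.
    cutoff = int(len(df) * (1 - val_fraction))
    return df.iloc[:cutoff], df.iloc[cutoff:]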

predict.py

# predict.py: load saved weights and write predictions for a date window
df = load_data(input_path)
df = df.loc[start_date:end_date]
model = SolarForecastModel(**kwargs)
model.load(model_weights_path)
df = model.preprocess_data(df)
df["prediction"] = model.predict(df.drop(columns=["power"]))
df.to_parquet(output_path)

evaluate.py

# evaluate.py: retrain on the train window, forecast the test window, score it
# (assumes train() returns the fitted model and predict() returns the
# predictions DataFrame it also writes to disk)
model = train(
    input_path=input_path,
    start_date=train_start_date,
    end_date=train_end_date,
    output_path=model_output_path,
    kwargs=kwargs,
)
preds = predict(
    model_weights_path=model_output_path,
    input_path=input_path,
    start_date=test_start_date,
    end_date=test_end_date,
    output_path=preds_output_path,
    kwargs=kwargs,
)
metrics = model.calculate_metrics(preds=preds.prediction, y=preds.power)
return metrics

Added a script to evaluate models on the test set more easily and with extra metrics.

Final results

2x better model than the linear baseline

3 scripts for training, inference, and evaluation

├── notebooks
│   ├── 01_eda.ipynb
│   ├── 02_modelling.ipynb
│   └── 03_evaluation.ipynb
├── scripts
│   ├── train.py
│   ├── predict.py
│   └── evaluate.py
├── src
│   ├── models.py
│   └── utils.py
├── .python-version
├── README.md
└── requirements.txt

code setup in a GitHub repository
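Given the snippets above, requirements.txt plausibly includes at least the following (the shap and pvlib entries follow from the explainer attribute and the Sun-position sketch, and are assumptions; versions omitted):

# requirements.txt (assumed contents)
pandas
numpy
xgboost
shap
pvlib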

Future work

Unit tests

Data validation

Discussion with domain experts

UI via Streamlit

Past power generation features (e.g. lags; see the sketch after this list)

Time series models

Ensembles

Feature selection

Versioning & experiment tracking (code, data, model)
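As a sketch of the past-power-generation idea, lagged values of the target could be appended during preprocessing (the helper name and lag choices are assumptions):

def add_power_lags(df, lags=(1, 2, 24)):
    # Hypothetical helper: lagged power features for an hourly series.
    df = df.copy()
    for lag in lags:
        df[f"power_lag_{lag}h"] = df["power"].shift(lag)
    return df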
