Technical Challenge
Data Scientist Vacancy
André Ferreira

Solar forecast requirements
A hindcast showing the expected performance of your model on a historical dataset
A script for retraining your model
An inference script for making the production forecast
Final results
2x better model than the linear baseline
3 scripts for training, inference and evaluation
Code setup in a GitHub repository:
├── notebooks
│ ├── 01_eda.ipynb
│ ├── 02_modelling.ipynb
│ └── 03_evaluation.ipynb
├── scripts
│ ├── train.py
│ ├── predict.py
│ └── evaluate.py
├── src
│ ├── models.py
│ └── utils.py
├── .python-version
├── README.md
└── requirements.txt
Planning
Planning: read assignment, define goals & tasks
Experimentation: explore data, experiment with models
Engineering: write scripts & classes, write README, test pipeline
Exploratory data analysis
* without considering Sun-related features like hour & season
Modelling experiments
Model architecture options: XGBoost, Linear Regression
Feature options: Original cloud cover, Time aggregated, Sun position
Best model: XGBoost + Original cloud cover + Sun position
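The slides list Sun position among the feature options but do not show how it was computed. The following is a minimal sketch, assuming UTC timestamps and a known site location, that approximates solar elevation with the standard declination and hour-angle formulas. The function name `approximate_sun_elevation` and the coordinates are hypothetical, and the approximation ignores the equation of time and atmospheric refraction; a dedicated solar-position library would likely be more accurate.

```python
import numpy as np
import pandas as pd


def approximate_sun_elevation(
    index: pd.DatetimeIndex, latitude_deg: float, longitude_deg: float
) -> pd.Series:
    """Approximate solar elevation (degrees) for timestamps given in UTC."""
    day_of_year = index.dayofyear.to_numpy()
    utc_hour = index.hour.to_numpy() + index.minute.to_numpy() / 60.0
    # Shift UTC by longitude to get an approximate local solar time.
    solar_hour = (utc_hour + longitude_deg / 15.0) % 24
    # Simple cosine approximation of the solar declination.
    declination = np.radians(
        -23.44 * np.cos(2 * np.pi * (day_of_year + 10) / 365.0)
    )
    # The hour angle is 0 at solar noon and changes 15 degrees per hour.
    hour_angle = np.radians(15.0 * (solar_hour - 12.0))
    latitude = np.radians(latitude_deg)
    sin_elevation = (
        np.sin(latitude) * np.sin(declination)
        + np.cos(latitude) * np.cos(declination) * np.cos(hour_angle)
    )
    elevation = np.degrees(np.arcsin(np.clip(sin_elevation, -1.0, 1.0)))
    return pd.Series(elevation, index=index, name="sun_elevation")


# Hypothetical site (roughly Amsterdam) on the June solstice, hourly in UTC.
timestamps = pd.date_range("2023-06-21 00:00", periods=24, freq="h")
elevation = approximate_sun_elevation(timestamps, 52.37, 4.90)
```

Negative values mean the Sun is below the horizon, so a feature like this lets the model separate night from day without learning it indirectly from the hour and the season.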
Model class
SolarForecastModel
Attributes
n_estimators
early_stopping_rounds
eval_metric
features_to_agg
time_windows
agg_funcs
sun_position_features
model_type
quantiles
quantile_labels
quantile_values
explainer
model
kwargs
Methods
preprocess_data
calculate_metrics
fit
predict
feature_importances
save
load
Relevant code snippets
Time-aggregated features:
    # One rolling aggregate per (feature, window, function) combination
    for feature in self.features_to_agg:
        for time_window in self.time_windows:
            for agg_func in self.agg_funcs:
                X_processed[
                    f"{feature}_agg_{time_window}_{agg_func}"
                ] = (
                    X_processed[feature].rolling(time_window).agg(agg_func)
                )
RMSE per quantile of the target:
    df["squared_error"] = (df.preds - df.y) ** 2
    # Bin each observation by the quantile range its target value falls in
    df["quantile_category"] = pd.cut(
        df.y,
        bins=self.quantile_values,
        include_lowest=True,
        labels=self.quantile_labels,
    )
    # RMSE within each quantile bin
    metrics_df = (
        df
        .groupby("quantile_category")
        .squared_error
        .mean()
        .apply(np.sqrt)
    )
    # Overall RMSE across all observations
    metrics_df["all"] = np.sqrt(df.squared_error.mean())
    metrics_df.rename("rmse", inplace=True)
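The per-quantile RMSE snippet relies on `quantile_values` and `quantile_labels`, whose construction is not shown on the slide. Below is a self-contained sketch of the same idea on toy data, assuming the bin edges are taken from the empirical quantiles of the target. The function name `rmse_by_quantile` and the label format are illustrative, not the author's code.

```python
import numpy as np
import pandas as pd


def rmse_by_quantile(y, preds, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """RMSE within each quantile range of the target, plus the overall RMSE."""
    df = pd.DataFrame({"y": np.asarray(y), "preds": np.asarray(preds)})
    df["squared_error"] = (df.preds - df.y) ** 2
    # Assumption: bin edges are empirical quantiles of the target.
    # pd.cut raises if edges repeat, which can happen when many targets
    # share one value (for example zero power at night).
    quantile_values = df.y.quantile(list(quantiles)).to_numpy()
    quantile_labels = [
        f"q{int(lo * 100)}-q{int(hi * 100)}"
        for lo, hi in zip(quantiles[:-1], quantiles[1:])
    ]
    df["quantile_category"] = pd.cut(
        df.y,
        bins=quantile_values,
        include_lowest=True,
        labels=quantile_labels,
    )
    rmse = (
        df.groupby("quantile_category", observed=False)
        .squared_error.mean()
        .apply(np.sqrt)
    )
    # Use plain string labels so the extra "all" entry can be appended.
    rmse.index = rmse.index.astype(str)
    rmse["all"] = np.sqrt(df.squared_error.mean())
    return rmse.rename("rmse")


# Toy data: the error grows with the size of the target.
y = np.arange(1.0, 9.0)
preds = y + np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
metrics = rmse_by_quantile(y, preds)
```

Splitting the error by target range shows whether a model that looks good overall is still weak on high-generation periods, which a single aggregate RMSE would hide.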
Scripts
train.py
    df = load_data(input_path)
    df = df.loc[start_date:end_date]
    model = SolarForecastModel(**kwargs)
    df = model.preprocess_data(df)
    train_df, val_df = split_data(df)
    X_train, y_train = train_df.drop(columns=["power"]), train_df["power"]
    X_val, y_val = val_df.drop(columns=["power"]), val_df["power"]
    model.fit(X_train, y_train, X_val, y_val)
    model.save(output_path)
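The training script calls `split_data`, but its body does not appear on the slides. For time-indexed data a common choice is a chronological split, so that validation rows are always later than training rows. The sketch below assumes that behaviour; the `val_fraction` parameter and its default are hypothetical.

```python
import numpy as np
import pandas as pd


def split_data(df: pd.DataFrame, val_fraction: float = 0.2):
    """Chronological split: the most recent rows form the validation set."""
    df = df.sort_index()
    n_val = int(round(len(df) * val_fraction))
    split_at = len(df) - n_val
    return df.iloc[:split_at], df.iloc[split_at:]


# Toy hourly frame with a "power" target column.
toy_df = pd.DataFrame(
    {"cloud_cover": np.linspace(0.0, 1.0, 10), "power": np.arange(10.0)},
    index=pd.date_range("2023-01-01", periods=10, freq="h"),
)
train_df, val_df = split_data(toy_df)
```

A random split would leak future information into training through neighbouring timestamps, so the validation score used for early stopping would look better than the true out-of-sample error.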
predict.py
    df = load_data(input_path)
    df = df.loc[start_date:end_date]
    model = SolarForecastModel(**kwargs)
    model.load(model_weights_path)
    df = model.preprocess_data(df)
    df["prediction"] = model.predict(df.drop(columns=["power"]))
    df.to_parquet(output_path)
evaluate.py
    model = train(
        input_path=input_path,
        start_date=train_start_date,
        end_date=train_end_date,
        output_path=model_output_path,
        kwargs=kwargs,
    )
    preds = predict(
        model_weights_path=model_output_path,
        input_path=input_path,
        start_date=test_start_date,
        end_date=test_end_date,
        output_path=preds_output_path,
        kwargs=kwargs,
    )
    metrics = model.calculate_metrics(preds=preds.prediction, y=preds.power)
    return metrics
Added a script to evaluate models on the test set more easily and with extra metrics.
Future work
Unit tests
Data validation
Discussion with domain experts
UI via Streamlit
Past power generation features
Time series models
Ensembles
Feature selection
Versioning of code, data and model
Experiment tracking
Dexter Energy Technical Interview
By André Cristóvão Neves Ferreira