A Performance Study of LLM-Generated Code on Leetcode

Tristan Coignion, Clément Quinton, Romain Rouvoy

Green Days 2024 - Toulouse

New Shiny Things

GitHub Copilot

Some definitions

Large Language Model (LLM):

An artificial intelligence capable of generating text

Code LLM: LLMs specialized in writing code

Code Assistant: Code LLMs integrated into the IDE

LLM + Green = 💔

LLMs need a lot of computing resources

Is it really worth the cost?

Training StarCoder2-7B

=> 100,000 kWh

=> 30,000 kg CO2eq

How fast is the code generated by LLMs?

Is it worth it?

  • Measure the environmental impact of the LLM

  • Measure the time gained by the developer

  • Measure the energy saved in the software

On the model's temperature

The temperature of a model is a parameter regulating the "creativity" and the randomness of the model's generations.
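As an illustration, here is a minimal Python sketch of how temperature rescales a model's output distribution at sampling time (the logits are made up for the example):

import numpy as np

def sample_with_temperature(logits, temperature, rng=np.random.default_rng(0)):
    # Divide the logits by the temperature before the softmax:
    # low temperature sharpens the distribution (near-greedy decoding),
    # high temperature flattens it (more random, more "creative" output).
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]                     # made-up token scores
print(sample_with_temperature(logits, 0.2))  # almost always token 0
print(sample_with_temperature(logits, 2.0))  # much more varied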

Our research questions

  1. Can Leetcode (a public repository of algorithmic problems) be used as a dataset and a benchmark platform for evaluating LLMs?

  2. Are there notable differences between the performance of the code generated by different LLMs?

  3. Is there an effect of the temperature parameter of the LLM on the code's performance?

  4. How efficient are the solutions generated by LLMs compared to those written by humans?

The task

Leetcode: a competitive programming platform hosting algorithmic problems

+ Practical for performance testing

+ Practical for evaluating LLMs

The dataset

2 datasets of problems:

  • New dataset: 204 problems published after January 1st, 2023

  • Old dataset (RQ1): 300 of the most liked problems on Leetcode

LLMs under study

Generating the solutions

10 solutions generated per problem, temperature, and model

210,120 generated solutions on the new dataset
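A minimal sketch of this sampling protocol, with illustrative stand-ins (the model names, temperature grid, and problems below are not those of the study, and generate() is a hypothetical placeholder for the real inference call):

from itertools import product

models = ["model-a", "model-b"]              # illustrative names only
temperatures = [0.0, 0.5, 1.0]               # illustrative grid only
problems = [{"id": 1, "prompt": "Two Sum"},  # stand-ins for the dataset
            {"id": 2, "prompt": "LRU Cache"}]

SAMPLES_PER_CONFIG = 10  # 10 solutions per (problem, temperature, model)

def generate(model, prompt, temperature):
    # Hypothetical stand-in for the actual LLM inference call.
    return f"# solution by {model} at T={temperature} for: {prompt}"

solutions = [
    (m, t, p["id"], generate(m, p["prompt"], t))
    for m, t, p in product(models, temperatures, problems)
    for _ in range(SAMPLES_PER_CONFIG)
]
print(len(solutions))  # |models| x |temperatures| x |problems| x 10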

Results

RQ1: Can Leetcode be used as a dataset and a benchmark platform for evaluating LLMs?

LLMs' success rate on:

  • old problems: 37% of valid solutions
  • new problems (published after training): 3% of valid solutions

Why are the LLMs 10x worse on newer questions?

RQ1: Can Leetcode be used as a dataset and a benchmark platform for evaluating LLMs?

Data contamination

=> Harder to reproduce and generalize research

=> Calls into question previous research done using Leetcode



RQ1: Can Leetcode be used as a dataset and a benchmark platform for evaluating LLMs?

Leetcode provides useful measures like:

  • run time
  • memory usage
  • ranking (based on run time)

BUT

  • Very high variance (inability to differentiate solutions of different time complexities)
  • Ranking evolves over time, thus is unreliable
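To make the variance point concrete, a minimal sketch with made-up run times, showing how repeated measurements of two solutions can overlap enough that a single Leetcode run time cannot rank them:

import statistics

solution_a = [52, 61, 48, 70, 55]  # made-up run times (ms), e.g. O(n log n)
solution_b = [58, 49, 66, 51, 73]  # made-up run times (ms), e.g. O(n^2)

for name, times in [("A", solution_a), ("B", solution_b)]:
    mean = statistics.mean(times)
    cv = statistics.stdev(times) / mean  # coefficient of variation
    print(f"solution {name}: mean = {mean:.1f} ms, CV = {cv:.2f}")

# With spreads this large the two distributions overlap, so one
# measurement per solution cannot reliably separate the two.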

RQ2: Are there notable differences in performances between LLMs?

Very small differences (Cohen's d < 0.05), thus negligible.

LLMs seem to converge towards the same kinds of solutions

(not necessarily the best ones)

Almost no problems (<5%) where one LLM is consistently better than another.
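For reference, a minimal sketch of Cohen's d on two illustrative run-time samples (not the study's data); absolute values below roughly 0.2 are conventionally read as negligible:

import statistics

def cohens_d(a, b):
    # Difference of means divided by the pooled standard deviation.
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

llm_a = [50, 52, 51, 49, 53]  # made-up run times (ms) for one LLM
llm_b = [51, 50, 52, 50, 53]  # made-up run times (ms) for another
print(abs(cohens_d(llm_a, llm_b)))  # ~0.14: negligible by convention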

RQ2: Are there notable differences in performances between LLMs?

Better LLMs ⇏ Faster code

RQ3: Is there an effect of the temperature on the code’s performance?

Higher temperatures => higher variance of the performance of the code

=> Higher temperatures can help in searching for faster solutions.

Temperature: parameter controlling the "creativity" of the model
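A minimal sketch of the search idea this enables, with a hypothetical generate() standing in for the real inference and benchmarking pipeline (the run-time model is simulated, not measured):

import random

def generate(prompt, temperature):
    # Hypothetical stand-in: returns a simulated run time (ms) whose
    # spread grows with the temperature, mimicking the observed effect.
    return max(10.0, random.gauss(100.0, 40.0 * temperature))

def fastest_of_n(prompt, temperature, n=10):
    # Higher temperature -> more varied samples -> a better chance that
    # the fastest of the n samples beats a low-temperature baseline.
    return min(generate(prompt, temperature) for _ in range(n))

random.seed(0)
print(fastest_of_n("two-sum", 0.2))  # close to the 100 ms mean
print(fastest_of_n("two-sum", 1.0))  # often well below it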


RQ4: How fast is code generated by LLMs compared to humans*?

On average, the generated solutions are faster than 73% of the other submissions on Leetcode

* assuming the other submissions on Leetcode were made by humans
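A minimal sketch of how such a percentile is computed, assuming the run times of the other submissions are available as a list (the numbers are illustrative; Leetcode reports this ranking directly):

def faster_than_fraction(generated_runtime, other_runtimes):
    # Fraction of submissions strictly slower than the generated solution.
    slower = sum(1 for t in other_runtimes if t > generated_runtime)
    return slower / len(other_runtimes)

other_runtimes = [40, 55, 60, 70, 80, 90, 100, 120, 150, 200]  # made-up (ms)
print(faster_than_fraction(65, other_runtimes))  # 0.7 -> faster than 70%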

Conclusions

Leetcode should be used cautiously when evaluating LLMs, because of issues with measurement stability and data contamination

Performance of generated code is largely similar across different models, regardless of their size, training data, or architecture

Increasing the temperature parameter leads to a greater variance in performance

Perspectives

  • Extend the study to other kinds of problems

  • How to make LLMs produce greener code?

  • What is the energy consumption of a code assistant?

Thanks for listening!

Any questions?
