Coignion Tristan, Quinton Clément, Rouvoy Romain
EASE 2024 - Salerno
Find the slides online:
Large Language Model (LLM):
An artificial intelligence capable of generating text
Code LLM: LLMs specialized in writing code
Code Assistant: Code LLMs integrated into the IDE
LLMs need a lot of computing resources
Is it really worth the cost?
Training StarCoder2-7B
=> 100,000 kWh
=> 30,000 kgCO2eq
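As a sanity check, the two training figures are consistent with a grid carbon intensity of roughly 0.3 kgCO2eq per kWh; this intensity value is an assumption for illustration, not a number from the slides:

```python
# Back-of-the-envelope check of the StarCoder2-7B training footprint.
energy_kwh = 100_000       # training energy reported on the slide
carbon_intensity = 0.3     # assumed kgCO2eq per kWh (varies widely by grid)

emissions_kg = energy_kwh * carbon_intensity
print(emissions_kg)        # matches the ~30,000 kgCO2eq reported
```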
Measure the energy saved on the software side
The temperature of a model is a parameter regulating the "creativity" and randomness of the model's generations.
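Concretely, temperature rescales the model's next-token scores before sampling. A minimal sketch with toy logits (the standard softmax-with-temperature formula; the values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened (low T) or flattened (high T)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                     # toy next-token scores
print(softmax_with_temperature(logits, 0.2)) # low T: near-deterministic choice
print(softmax_with_temperature(logits, 2.0)) # high T: closer to uniform
```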
A competitive programming platform hosting algorithmic problems
+ Practical for performance testing
+ Practical for evaluating LLMs
2 datasets of problems:
210,120 generated solutions
LLMs' success rate on:
Why are the LLMs 10x worse on newer questions?
=> Harder to reproduce and generalize research
=> Questions the previous research done using Leetcode
Leetcode provides useful measures:
run time
memory usage
ranking (based on run time)
BUT
Very high variance (inability to differentiate solutions of different time complexities)
Ranking evolves over time and is thus unreliable
BUT
Very small differences (Cohen's d < 0.05), thus negligible.
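Cohen's d is the difference of means divided by the pooled standard deviation; values well below 0.2 are conventionally considered negligible. A minimal sketch on synthetic runtime samples (illustrative data, not the paper's measurements):

```python
import statistics

def cohens_d(a, b):
    """Effect size: difference of means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Two hypothetical runtime distributions (ms) from two LLMs' solutions
runtimes_a = [50, 52, 51, 49, 50, 53, 48, 51]
runtimes_b = [51, 50, 52, 49, 51, 50, 52, 50]
print(cohens_d(runtimes_a, runtimes_b))  # near zero => negligible difference
```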
LLMs seem to converge towards the same kinds of solutions
(not necessarily the best ones)
Almost no problems (<5%) where one LLM is consistently better than another.
Better LLMs ≠ faster code
Higher temperatures => higher variance of the performance of the code
=> Higher temperatures can help in searching for faster solutions.
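The claim can be illustrated by sampling from a toy score distribution at different temperatures: the spread of the sampled values grows with T. The runtime buckets and logits below are assumptions for the simulation, not data from the study:

```python
import math
import random
import statistics

random.seed(0)

def sample_runtimes(temperature, n=2000):
    """Sample hypothetical solution runtimes from a toy softmax distribution."""
    runtimes = [10, 20, 40, 80]     # hypothetical runtime buckets (ms)
    logits = [3.0, 2.0, 1.0, 0.0]   # the model prefers the fastest bucket
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return [random.choices(runtimes, weights=weights)[0] for _ in range(n)]

for t in (0.2, 1.0, 2.0):
    print(t, statistics.stdev(sample_runtimes(t)))  # spread rises with temperature
```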
Temperature : Parameter controlling the "creativity" of the model
On average, the generated solutions are faster than 73%* of the other submissions on Leetcode
* assuming the other submissions on Leetcode were made by humans
Leetcode should be used cautiously when evaluating LLMs because of issues of measure stability and data contamination
Performance of generated code is largely similar across different models, regardless of their size, training data, or architecture
Increasing the temperature parameter leads to a greater variance in performance
Any questions?