Building a Real-World Benchmark for Code Optimization
Bachelor & Master Thesis
Code optimization has become an increasingly important research topic, especially with the rise of large language models that aim to improve code performance automatically. However, most existing datasets in this area are built on LeetCode-style problems: short, self-contained code snippets designed for algorithmic challenges. These datasets lack the complexity, structure, and constraints of real-world software systems, making them a poor proxy for evaluating optimization techniques in practical scenarios.

This project aims to construct a benchmark dataset based on real-world open-source projects, focusing on performance-related code changes that occur in the evolution of mature software. The dataset will consist of code revision pairs mined from GitHub repositories, where one version is less efficient and a subsequent revision improves its performance. Alongside the code, we will collect metadata such as commit messages, runtime profiles, and contextual information from surrounding files.

The resulting benchmark will not only reflect the diversity and messiness of real-world code but also capture a broader spectrum of optimization patterns, ranging from algorithmic improvements to low-level refactorings such as loop unrolling, memory layout changes, and system call batching. By building this benchmark, we can enable more realistic training and evaluation of optimization models and open up new research directions in learning from real-world software evolution.
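As a rough illustration of the mining step, performance-related commits could be located by a keyword heuristic over commit messages, pairing each match with its parent revision. The sketch below is an assumption about how such a filter might look (the keyword list, the `mine_revision_pairs` helper, and the commit data shape are all illustrative, not part of the project specification):

```python
import re

# Illustrative keywords that often signal a performance-related change.
PERF_PATTERN = re.compile(
    r"\b(optimi[sz]e|speed.?up|faster|perf(ormance)?|latency|"
    r"throughput|cache|hot.?path)\b",
    re.IGNORECASE,
)


def is_perf_commit(message: str) -> bool:
    """Heuristically decide whether a commit message describes a performance change."""
    return bool(PERF_PATTERN.search(message))


def mine_revision_pairs(commits):
    """Yield (parent_sha, sha, message) triples for performance-related commits.

    `commits` is an iterable of dicts with 'sha', 'parent', and 'message' keys;
    the pre-change (less efficient) revision is the parent, and the commit
    itself is the improved revision.
    """
    for commit in commits:
        if commit["parent"] is not None and is_perf_commit(commit["message"]):
            yield (commit["parent"], commit["sha"], commit["message"])


# Example: only the first commit below would be selected as a revision pair.
history = [
    {"sha": "b1", "parent": "a0", "message": "Optimize hot path in parser"},
    {"sha": "c2", "parent": "b1", "message": "Fix typo in docs"},
]
pairs = list(mine_revision_pairs(history))
```

In practice such a keyword filter would only be a first pass; the project would still need to verify each candidate pair, for example by profiling both revisions, since commit messages alone are a noisy signal.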
Required knowledge:
- Strong programming background, especially proficiency in Python.
- Familiarity with static analysis techniques.