Modern software projects often incorporate some form of performance testing into their development cycle, aiming to detect performance changes between commits or releases. Performance testing generally relies on experimental evaluation using various benchmark workloads.
To detect performance changes reliably, benchmarks must be executed many times to account for variability in the measurement results. While this is considered best practice, repeated execution can become prohibitively expensive as the number of versions and benchmark workloads grows.
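As a minimal sketch of the underlying measurement problem (assuming plain Python and generic sample lists, not anything from the paper), detecting a change between two versions typically means collecting many samples per version and applying a statistical test to the difference:

```python
import random
import statistics

def detect_change(old_samples, new_samples, n_boot=10_000, alpha=0.05):
    """Illustrative bootstrap test for a shift in mean benchmark time.

    old_samples, new_samples: repeated measurements of the same
    benchmark on two software versions.
    Returns True if the (1 - alpha) bootstrap confidence interval
    of the mean difference excludes zero, i.e. a change is detected.
    """
    diffs = []
    for _ in range(n_boot):
        # Resample each version's measurements with replacement.
        old = random.choices(old_samples, k=len(old_samples))
        new = random.choices(new_samples, k=len(new_samples))
        diffs.append(statistics.mean(new) - statistics.mean(old))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo > 0 or hi < 0  # interval excludes zero
```

The cost problem follows directly: the number of samples this test needs to reach useful power multiplies across every (version, benchmark) pair.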
To reduce the cost of performance testing, we propose an approach for the early stopping of non-productive experiments, that is, experiments unlikely to detect a performance bug in a particular benchmark. The stopping conditions are based on benchmark-specific thresholds determined from historical data, modified to emulate the potential effects of software changes on benchmark performance.
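As a rough illustration of the idea, and not the paper's actual implementation, an early-stopping check might look as follows; the `threshold` parameter stands in for a benchmark-specific value assumed to be derived from historical data, and the decision rule and margin are hypothetical:

```python
import statistics

def should_stop_early(old_runs, new_runs, threshold, min_runs=10):
    """Hypothetical early-stopping check (illustrative only).

    old_runs, new_runs: per-iteration measurements so far.
    threshold: benchmark-specific effect-size threshold, assumed
               to be derived from historical measurements.
    Returns True when the observed relative change is so far below
    the threshold that continuing is unlikely to detect a change.
    """
    if min(len(old_runs), len(new_runs)) < min_runs:
        return False  # too few samples to decide anything yet
    old_mean = statistics.mean(old_runs)
    new_mean = statistics.mean(new_runs)
    observed_change = abs(new_mean - old_mean) / old_mean
    # Abort if the observed change sits well below the benchmark's
    # detection threshold; the 0.5 safety margin is arbitrary here.
    return observed_change < 0.5 * threshold
```

In the approach described above, the threshold itself would come from historical measurements perturbed to mimic plausible effects of software changes; the fixed margin in this sketch is purely illustrative.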
We evaluate the approach on the GraalVM benchmarking project and show that it can eliminate about 50% of the experiments if we can afford to ignore about 15% of the least significant performance changes.