Benchmark data for reverse complement benchmark introduces biases
CHANGE
Description
I propose that fasta1000000001.txt be used instead of fasta1000000000.txt as benchmark data for the reverse_complement benchmark challenge.
The primary benefit is that fasta1000000001.txt does not contain any sequence whose number of characters is exactly divisible by 60, whereas fasta1000000000.txt contains one such sequence. As a consequence, some programs include an optimization: they check the length of the sequence, and if it is divisible by 60 they skip the reformatting step and simply reverse the sequence. This works because, with 60-character lines, every line of such a sequence is full, so the line breaks are already in the right places after reversal.
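To make the shortcut concrete, here is a minimal sketch of the idea. It is not the code from #517; the function names, the complement table, and the assumption that the input is well formed (every line 60 characters wide except possibly the last) are mine.

```python
# Complement table for nucleotide codes; bytes not listed (such as the
# newline) map to themselves.
COMPLEMENT = bytes.maketrans(
    b"ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    b"TGCAAKYWSRMBDHVNTGCAAKYWSRMBDHVN",
)
LINE_WIDTH = 60


def reverse_complement_general(body: bytes) -> bytes:
    """Slow path: drop the newlines, reverse, then re-wrap at LINE_WIDTH."""
    seq = body.translate(COMPLEMENT, b"\n")[::-1]
    lines = [seq[i:i + LINE_WIDTH] for i in range(0, len(seq), LINE_WIDTH)]
    return b"\n".join(lines) + b"\n"


def reverse_complement(body: bytes) -> bytes:
    """Reverse-complement the sequence lines of one record (no header line).

    Assumes well-formed input: only the last line may be shorter than
    LINE_WIDTH.
    """
    stripped = body.rstrip(b"\n")
    seq_len = len(stripped) - stripped.count(b"\n")
    if seq_len % LINE_WIDTH == 0:
        # Shortcut: every line is full, so reversing the bytes keeps the
        # newlines at multiples of LINE_WIDTH and no re-wrapping is needed.
        return stripped.translate(COMPLEMENT)[::-1] + b"\n"
    return reverse_complement_general(body)
```

Both paths produce the same output; the shortcut only avoids the re-wrapping work, which is why it pays off on data that happens to contain a sequence with a length divisible by 60.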
For example, the Reverse_Complement.py from #517 (closed) takes, on my system:
- 3.2 cpu seconds on fasta1000000000.txt
- 3.9 cpu seconds on fasta1000000001.txt
By comparison, Python program number 2 takes:
- 5.9 cpu seconds on fasta1000000000.txt
- 5.9 cpu seconds on fasta1000000001.txt
What makes the benchmarks game website interesting in my opinion:
- different languages
- different programming techniques (and how these techniques are possible in some languages and not in others)
Shortcuts that are specifically tuned to the test data are simply not interesting from a "learning about programming languages" perspective. I therefore propose using test data that does not allow for these shortcuts.
Benefits & Costs
The cost is trivial, as it only requires changing a 0 to a 1 in the test data generation script. The benefit is significant, as it allows a fairer comparison between programming languages and programming techniques, without the shortcut as a confounding factor.
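Whether a given data file still allows the shortcut can be checked with a few lines of Python. This is a rough sketch; the helper names are hypothetical and it assumes a plain FASTA file in which header lines start with ">".

```python
def sequence_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


def records_allowing_shortcut(lines, width=60):
    """Return the headers of records whose length is divisible by width."""
    return [h for h, n in sequence_lengths(lines) if n % width == 0]
```

Passing an open file object, for example `records_allowing_shortcut(open("fasta1000000001.txt"))`, should return an empty list if the file contains no sequence with a length divisible by 60.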
P.S. I want to make it clear that I in no way intend to throw shade on @JZerfas-guest's Python solution. I think it is great that he tries to squeeze every bit of performance out of the program. Recognizing that work can be avoided with an extremely low-cost check is very good. That is why I argue for changing the benchmark data instead. The shortcut can remain in the code, but the new data allows a fairer comparison with other languages and other techniques, which were written by people who did not realize this shortcut was possible.