Benchmark data for reverse complement benchmark introduces biases

CHANGE

Description

I propose that the reverse_complement benchmark challenge use fasta1000000001.txt as benchmark data instead of fasta1000000000.txt.

The primary benefit is that fasta1000000001.txt does not contain any sequence whose number of characters is exactly divisible by 60, whereas fasta1000000000.txt contains one such sequence. As a consequence, some programs include an optimization: they check the length of the sequence and, if the number of characters is divisible by 60, skip the reformatting step and simply reverse the sequence. This works because every line of such a sequence is full, so the line breaks fall at the same positions after reversal.
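To illustrate the kind of shortcut I mean, here is a minimal sketch. It is not the code from #517; the function names and the simplified complement table (plain ACGT only) are my own assumptions, and it assumes each record body consists of 60-character lines that each end with a newline.

```python
LINE_WIDTH = 60

# Simplified table for illustration only: the real benchmark also
# handles IUPAC ambiguity codes. Bytes not listed (such as b"\n") map to themselves.
COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement_general(body: bytes) -> bytes:
    """Strip the newlines, reverse-complement, then re-wrap at 60 columns."""
    seq = body.replace(b"\n", b"").translate(COMPLEMENT)[::-1]
    return b"".join(
        seq[i:i + LINE_WIDTH] + b"\n" for i in range(0, len(seq), LINE_WIDTH)
    )


def reverse_complement_with_shortcut(body: bytes) -> bytes:
    """Same result, but skips the re-wrapping when every line is full."""
    if body and len(body) % (LINE_WIDTH + 1) == 0:
        # Every line holds exactly 60 bases plus a newline, so reversing
        # everything except the final newline leaves the remaining
        # newlines at the same offsets. No reformatting is needed.
        return body[:-1].translate(COMPLEMENT)[::-1] + b"\n"
    return reverse_complement_general(body)
```

With fasta1000000000.txt, one sequence takes the first branch; with fasta1000000001.txt, every sequence falls through to the general path, so both kinds of program do the same amount of work.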

For example, on my system the Reverse_Complement.py from #517 (closed) uses:

3.2 cpu seconds on fasta1000000000.txt
3.9 cpu seconds on fasta1000000001.txt

By comparison, Python program number 2 uses:

5.9 cpu seconds on fasta1000000000.txt
5.9 cpu seconds on fasta1000000001.txt

What makes the benchmarks game website interesting in my opinion:

different languages
different programming techniques (and how these techniques are possible in some languages and not in others)

Shortcuts that are specifically tuned to the test data are simply not interesting from a "learning about programming languages" perspective. As such, I propose that the test data simply not allow for these shortcuts.

Benefits & Costs

The cost is trivial, as it only requires changing a 0 to a 1 in the test data generation script. The benefit is big, as it allows a fairer comparison between programming languages and programming techniques without the shortcut as a confounding factor.

P.S. I want to make it clear that I in no way intend to throw shade on @JZerfas-guest's Python solution. I think it is great that he tries to squeeze every bit of performance out of the program. Recognizing that an extremely low-cost check can avoid doing work is very good. That is why I argue for changing the benchmark data instead. The shortcut can remain in the code, but the new data allows a fairer comparison with other languages and other techniques that were written by people who did not realize this shortcut was possible.

Edited Aug 17, 2022 by Isaac Gouy