
Benchmark data for reverse complement benchmark introduces biases

CHANGE

Description

I propose that the reverse_complement benchmark challenge use fasta1000000001.txt as its benchmark data instead of fasta1000000000.txt.

The primary benefit is that fasta1000000001.txt does not contain any sequence whose number of characters is exactly divisible by 60, whereas fasta1000000000.txt contains one such sequence. Some programs exploit this with an optimization: they check the length of the sequence and, if it is divisible by 60, skip the reformatting step and simply reverse the sequence.
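To make the shortcut concrete, here is a minimal sketch of the idea. It is not the code from #517; the function names, the simplified complement table, and the assumption that the sequence body ends with a newline are mine. It relies on the data being wrapped at 60 characters per line: when the last line is also full, reversing the raw bytes leaves the line breaks in the right places.

```python
LINE_WIDTH = 60  # line width of the wrapped sequence data

# Simplified complement table (assumption: a real entry also maps the ambiguity codes).
COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCATGCA")


def revcomp_general(body: bytes) -> bytes:
    """Strip newlines, reverse-complement, then re-wrap at LINE_WIDTH columns."""
    seq = body.translate(COMPLEMENT, b"\n")[::-1]
    return b"".join(seq[i:i + LINE_WIDTH] + b"\n" for i in range(0, len(seq), LINE_WIDTH))


def revcomp_with_shortcut(body: bytes) -> bytes:
    """Same result, but skips the re-wrapping when every line is full."""
    # Assumes body ends with b"\n" and every line but the last is LINE_WIDTH wide,
    # so the byte count is a multiple of LINE_WIDTH + 1 only if the last line is full too.
    if body and len(body) % (LINE_WIDTH + 1) == 0:
        # Reversing the raw bytes keeps the newlines on 60-column boundaries;
        # only the newline that lands at the front has to move to the end.
        return body.translate(COMPLEMENT)[::-1][1:] + b"\n"
    return revcomp_general(body)
```

Both functions return the same bytes for any well-formed input; the shortcut only changes how much work is done, which is why it shows up in the timings but not in the output.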

For example, Reverse_Complement.py from #517 (closed) uses on my system:

  • 3.2 cpu seconds on fasta1000000000.txt
  • 3.9 cpu seconds on fasta1000000001.txt

By comparison, Python program number 2 uses:

  • 5.9 cpu seconds on fasta1000000000.txt
  • 5.9 cpu seconds on fasta1000000001.txt

What makes the benchmarks game website interesting in my opinion:

  • different languages
  • different programming techniques (and how these techniques are possible in some languages and not in others)

Shortcuts that are specifically tuned to the test data are simply not interesting from a "learning about programming languages" perspective. I therefore propose test data that does not allow for these shortcuts.

Benefits & Costs

The cost is trivial, as it only requires changing a 0 to a 1 in the test data generation script. The benefit is substantial: it allows a fairer comparison between programming languages and programming techniques, without the shortcut as a confounding factor.
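As a quick sanity check of why changing that one digit is enough, the arithmetic can be sketched as follows. This assumes that the number in the file name is the size argument n of the fasta generator and that the generator emits three sequences of 2n, 3n and 5n characters; the function name and the sequence labels are only illustrative.

```python
LINE_WIDTH = 60


def full_line_sequences(n: int) -> list[str]:
    """Return the labels of the sequences whose length is divisible by LINE_WIDTH."""
    # Assumption: the generator produces sequences of 2n, 3n and 5n characters.
    lengths = {"ONE": 2 * n, "TWO": 3 * n, "THREE": 5 * n}
    return [name for name, length in lengths.items() if length % LINE_WIDTH == 0]


print(full_line_sequences(1_000_000_000))  # ['TWO']
print(full_line_sequences(1_000_000_001))  # []
```

Under these assumptions only the 3n sequence of the current data is divisible by 60, which matches the single affected sequence described above, and none of the sequences in the proposed data are.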

P.S. I want to make it clear that I in no way intend to throw shade on @JZerfas-guest's Python solution. I think it is great that he tries to squeeze every bit of performance out of the program. Recognizing that an extremely low-cost check can avoid doing work is very good. That is why I argue for changing the benchmark data instead. The shortcut can remain in the code, but the changed data allows for a fairer comparison with other languages and other techniques, which were written by people who did not realize this shortcut was possible.

Edited by Isaac Gouy