Relates to #99 (closed).
Spam detection
This MR implements spam detection in two ways.
Time between when the registration form is requested and submitted
The following command shows the time spent on the registration page:
declare -A hash; (grep '/accounts/register' /var/log/apache2/access-debexpo.log /var/log/apache2/access-debexpo.log.1; zgrep '/accounts/register' /var/log/apache2/access-debexpo.log.*.gz) | sort -k 4 |cut -d ' ' -f 1,4,6 | sed -e 's/^[^:]*://g' |tr -d '["' | while read ip time method; do if [[ $method == "GET" ]]; then hash[${ip}]="${time}"; else echo $ip ${hash[${ip}]} $(( $(date +%s -d "$(echo "$time" | sed -e 's,/,-,g' -e 's,:, ,')") - $(date +%s -d "$(echo "${hash[${ip}]}" | sed -e 's,/,-,g' -e 's,:, ,')") )); fi; done | cut -d ' ' -f 3 | sort | uniq -c | sort -n
Output:
COUNT SECONDS
1 10
1 11
1 12
1 2662
1 27
1 290
1 34
1 4
1 48
1 83
1 95
2 23
2 3
2 6
2 7
10 2
154 1
2256 0
We can see that most spammers spend less than 2 seconds before submitting the form. I manually compared these times against the activated accounts in the db, and it appears that a human spends at least 6-7 seconds (for the fastest), and usually closer to 10 to 20 seconds.
In this MR, when the form is requested, a timestamp is stored in the session (server-side, stored in the db). On submission, the time elapsed since that timestamp must be greater than REGISTRATION_MIN_ELAPSED, otherwise the registration is rejected. I set the default to 5 seconds.
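The timing check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this MR: the session is modelled as a plain dict, and the key name `registration_requested_at`, the helper names, and the choice to treat a missing timestamp as suspicious are all assumptions.

```python
import time

REGISTRATION_MIN_ELAPSED = 5  # seconds, the default chosen in this MR

# Hypothetical session key; the real name used in the MR may differ.
SESSION_KEY = "registration_requested_at"


def mark_form_requested(session, now=None):
    """Record when the registration form was served (GET)."""
    session[SESSION_KEY] = time.time() if now is None else now


def submitted_too_fast(session, now=None):
    """Return True when the form is submitted too quickly to be human."""
    now = time.time() if now is None else now
    requested_at = session.pop(SESSION_KEY, None)
    if requested_at is None:
        # Assumption: a POST with no recorded GET is treated as suspicious.
        return True
    # The elapsed time must be strictly greater than the minimum.
    return (now - requested_at) <= REGISTRATION_MIN_ELAPSED
```

With the log data above, the 2256 submissions at 0 seconds and the 154 at 1 second would be rejected, while the fastest humans (6-7 seconds) would pass.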
Maximum number of registrations per IP
The following command shows the number of POST requests per IP and per day:
declare -A hash; (grep '/accounts/register' /var/log/apache2/access-debexpo.log /var/log/apache2/access-debexpo.log.1; zgrep '/accounts/register' /var/log/apache2/access-debexpo.log.*.gz) | grep POST | sort -k 4 |cut -d ' ' -f 1,4 | sed -e 's/^[^:]*://g' -e 's/:.*//g' |tr -d '["' | while read ip day; do echo $(echo $ip | sha256sum | cut -c 1-7) $day; done | sort | uniq -c | sort -n
Output (IP is hashed for privacy):
COUNT IP DAY
2 e194673 12/Sep/2020
2 e194673 14/Sep/2020
2 e194673 15/Sep/2020
2 e194673 18/Sep/2020
2 e85186a 06/Sep/2020
2 ea292cf 06/Sep/2020
2 ea292cf 09/Sep/2020
39 a2f40e8 19/Sep/2020
40 082b225 15/Sep/2020
49 082b225 07/Sep/2020
59 bea6192 19/Sep/2020
62 a2f40e8 12/Sep/2020
105 a2f40e8 18/Sep/2020
112 bea6192 12/Sep/2020
143 bea6192 18/Sep/2020
163 a2f40e8 11/Sep/2020
226 a2f40e8 07/Sep/2020
247 a2f40e8 15/Sep/2020
298 bea6192 11/Sep/2020
363 bea6192 15/Sep/2020
409 bea6192 07/Sep/2020
In this MR, each IP is stored (hashed for privacy) in the cache for REGISTRATION_CACHE_TIMEOUT (defaults to 1 day). Each time a registration is processed, the cached key is incremented; once it reaches the limit defined by REGISTRATION_PER_IP, further registrations from that IP are rejected.
Both measures should prevent most spammers from using mentors while having no effect on human users. If a single IP legitimately needs to register more than 5 accounts a day, contacting support would be enough to get the settings tweaked.
Spam detection can be disabled with REGISTRATION_SPAM_DETECTION = False.
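For reference, a settings override might look like the fragment below. The setting names are the ones introduced by this MR; the values are a sketch: 5 seconds is the stated default, the limit of 5 per IP is inferred from the text above, and expressing the timeout in seconds is an assumption.

```python
# Hypothetical local settings override (values partly assumed, see above).
REGISTRATION_SPAM_DETECTION = True         # set to False to disable both checks
REGISTRATION_MIN_ELAPSED = 5               # seconds between form request and submission
REGISTRATION_PER_IP = 5                    # registrations allowed per IP per window
REGISTRATION_CACHE_TIMEOUT = 24 * 60 * 60  # 1 day, assuming the unit is seconds
```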