split data/CVE/list considerations
In !1 (merged) I drifted into the question of whether data/CVE/list can be usefully split up into smaller pieces.
In SVN, that file wasn't such a problem: SVN would keep only the latest copy and changes were relatively fast. It's a bit more of a problem in git, as each change to the file creates a new blob that git can't easily deduplicate. (It does deduplicate at the packing stage, but it still means lots of operations like clone, push or branching are needlessly slow.) Even editing the file can be a problem: Emacs warns that the file is too big, for example, even though it can usually edit it correctly if you have any decent amount of RAM.
Still, I've looked at what would be involved in splitting that file up into one file per CVE. At the code level, !1 (merged) implements merging multiple files, so that shouldn't be a problem. I was curious to see what the impact would be on the performance of the git repository, however.
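Just to illustrate the idea (this is not the !1 code, only a sketch under the assumption that the split files would live under a data/CVEs/ tree), reading the split layout back as a single stream of entries could look like this:

import os
import os.path


def gen_split_entries(root='data/CVEs'):
    """Yield CVE entries from a hypothetical one-file-per-CVE layout.

    Assumes the data/CVEs/CVE-YYYY/CVE-YYYY-NNNN layout produced by the
    splitting script below; the actual merging code lives in !1 and may
    well differ.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # sort so the merged output is deterministic
        dirnames.sort()
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name)) as f:
                yield f.read()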
A first dumb implementation means we end up with ~100k files in a directory, which takes a whopping 408MB of disk space right off the bat, whereas the original CVE/list file is currently 16MB, because each file takes up its own 4k block. This can obviously be tweaked at the filesystem level. git can deal with those files fairly well, but committing them creates a ~400MB .git directory, for a total of 800MB on disk for the checkout. After git gc, the .git directory gets compressed down to 26MB, however, for a total of 430MB of disk space for the compressed git checkout. Compare this to the 353MB that the security-tracker repository currently takes, and keep in mind that figure includes the whole history of the file!
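To put numbers on that 4k-block overhead, one quick way (assuming, again, that the split files live under data/CVEs/) is to walk the tree and compare the apparent size of the files with the space the filesystem actually allocates for them:

import os


def disk_usage(root='data/CVEs'):
    """Return (apparent_bytes, allocated_bytes) for a directory tree.

    On Linux, st_blocks counts 512-byte blocks actually allocated by
    the filesystem, so the difference shows the per-file block overhead.
    """
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated


apparent, allocated = disk_usage()
print('apparent: {:.0f}MB, allocated: {:.0f}MB'.format(
    apparent / 1024 / 1024, allocated / 1024 / 1024))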
One of the problems with that layout is that any file operation with wildcards (e.g. *) will fail because there are too many files. So it might actually be better to split the CVEs into dated directories (e.g. CVE-YYYY/CVE-YYYY-XXXXX). In this layout, we save a tiny bit of disk space (4MB, for a total of 404MB before the git commit). The resulting .git directory is 420MB and gets compressed to 27MB after git gc. So not much gain there in terms of disk usage, but wildcards become usable again.
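As a sketch of that dated layout (the data/CVEs/ root and the helper name are mine, not a settled convention), mapping a CVE id to its path would be straightforward:

import os.path
import re


def cve_path(cve_id, root='data/CVEs'):
    """Map an id like CVE-2017-1234 to root/CVE-2017/CVE-2017-1234."""
    m = re.match(r'(CVE-\d+)-\d+$', cve_id)
    if not m:
        raise ValueError('not a CVE id: {}'.format(cve_id))
    return os.path.join(root, m.group(1), cve_id)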
I'm not sure this is worth it after all. The .git directory is smaller, but that could simply be because the history is much, much shorter. We would need to do a filter-branch with the script to get better figures, and that would be quite an undertaking.
Another interesting conclusion from this work is that we are obviously getting more and more data in the file each year, with 2017, for example, taking as much as double the space of 2016 (so probably double the number of CVEs). Here are the stats per year (a small sketch to recompute them follows the list):
6,1M CVE-1999
4,9M CVE-2000
6,1M CVE-2001
9,3M CVE-2002
6,0M CVE-2003
11M CVE-2004
19M CVE-2005
28M CVE-2006
26M CVE-2007
28M CVE-2008
20M CVE-2009
20M CVE-2010
18M CVE-2011
21M CVE-2012
23M CVE-2013
32M CVE-2014
30M CVE-2015
34M CVE-2016
51M CVE-2017
18M CVE-2018
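Those sizes were measured on the per-year directories of the split layout; a rough Python equivalent of that measurement (same data/CVEs/ assumption, counting allocated blocks the way du does) would be:

import os

root = 'data/CVEs'
for year_dir in sorted(os.listdir(root)):
    path = os.path.join(root, year_dir)
    total = 0
    for name in os.listdir(path):
        # count allocated blocks, like du, rather than file length
        total += os.stat(os.path.join(path, name)).st_blocks * 512
    print('{:.0f}M {}'.format(total / 1024 / 1024, year_dir))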
And here's the program to generate those files:
import os
import os.path
import re


def gen_entries(list_file):
    """Tokenize the CVE/list file, grouped by CVE entry.

    Each entry starts with a line beginning with "CVE-"; everything up
    to the next such line belongs to the same entry.
    """
    buf = ''
    with open(list_file) as f:
        for line in f:
            if line.startswith('CVE-'):
                # a new entry starts: flush the previous one
                yield buf
                buf = ''
            buf += line
    if buf:
        yield buf


dates = set()
for buf in gen_entries('data/CVE/list'):
    if not buf:
        continue
    # group 1 is the full CVE id, group 2 the CVE-YYYY prefix
    m = re.search(r'((CVE-\d+)-\d+) ', buf)
    if not m:
        continue
    cve_id = m.group(1)
    cve_date = m.group(2)
    if cve_date not in dates:
        # makedirs also creates data/CVEs itself on the first entry
        os.makedirs('data/CVEs/{}'.format(cve_date))
        dates.add(cve_date)
    with open('data/CVEs/{}/{}'.format(cve_date, cve_id), 'w') as f:
        f.write(buf)
Keep in mind that the script might drop some data and shouldn't be used as-is in a migration. For example, it doesn't take into account CVE-XXX issues. But it's sufficient to make the above observations.