split data/CVE/list considerations
In !1 (merged) I drifted into the question of whether data/CVE/list can be usefully split up into smaller pieces.
In SVN, that file wasn't such a problem: SVN would keep only the latest copy and changes were relatively fast. It's a bit more of a problem in git, as each change to the file creates a new blob that git can't easily deduplicate. (It does deduplicate at the packing stage, but it still means lots of operations like clone, push or branching are needlessly slow.) Even editing the file can be a problem: Emacs warns that the file is too big, for example, even though it can usually edit it correctly if you have any decent amount of RAM.
Still, I've looked at what would be involved in splitting that file up into one file per CVE. At the code level, !1 (merged) implements merging multiple files, so that shouldn't be a problem. I was curious to see what the impact would be on the performance of the git repository, however.
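Just to illustrate the idea (this is not the !1 code, only a sketch under the assumption that the split files would live under a data/CVEs/ tree), reading the split layout back as a single stream of entries could look like this:

import os
import os.path


def gen_split_entries(root='data/CVEs'):
    """Yield CVE entries from a hypothetical one-file-per-CVE layout.

    Assumes the data/CVEs/CVE-YYYY/CVE-YYYY-NNNN layout produced by the
    splitting script below; the actual merging code lives in !1 and may
    well differ.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # sort so the merged output is deterministic
        dirnames.sort()
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name)) as f:
                yield f.read()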
A first dumb implementation means we end up with ~100k files in a directory, which takes a whopping 408MB of disk space right off the bat, whereas the original CVE/list file is currently 16MB, because each file takes up its own 4k block. This can obviously be tweaked at the filesystem level. git can deal with those files fairly well, but committing them creates a ~400MB .git directory, for a total of 800MB on disk for the checkout. After git gc, the .git directory gets compressed down to 26MB, however, for a total of 430MB of disk space for the compressed git checkout. Compare this to the 353MB that the security-tracker repository currently takes, and keep in mind that figure includes the whole history of the file!
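To put numbers on that 4k-block overhead, one quick way (assuming, again, that the split files live under data/CVEs/) is to walk the tree and compare the apparent size of the files with the space the filesystem actually allocates for them:

import os


def disk_usage(root='data/CVEs'):
    """Return (apparent_bytes, allocated_bytes) for a directory tree.

    On Linux, st_blocks counts 512-byte blocks actually allocated by
    the filesystem, so the difference shows the per-file block overhead.
    """
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated


apparent, allocated = disk_usage()
print('apparent: {:.0f}MB, allocated: {:.0f}MB'.format(
    apparent / 1024 / 1024, allocated / 1024 / 1024))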
One of the problems with that layout is that any file operation with wildcards (e.g. *) will fail because there are too many files. So it might actually be better to split the CVEs into dated directories (e.g. CVE-YYYY/CVE-YYYY-XXXXX). In this layout, we save a tiny bit of disk space (4MB, for a total of 404MB before the git commit). The resulting .git directory is 420MB and gets compressed to 27MB after git gc. So not much gain there in terms of disk usage, but wildcards become usable again.
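As a sketch of that dated layout (the data/CVEs/ root and the helper name are mine, not a settled convention), mapping a CVE id to its path would be straightforward:

import os.path
import re


def cve_path(cve_id, root='data/CVEs'):
    """Map an id like CVE-2017-1234 to root/CVE-2017/CVE-2017-1234."""
    m = re.match(r'(CVE-\d+)-\d+$', cve_id)
    if not m:
        raise ValueError('not a CVE id: {}'.format(cve_id))
    return os.path.join(root, m.group(1), cve_id)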
I'm not sure this is worth it after all. The .git directory is smaller, but that could simply be because the history is much, much shorter. We would need to do a filter-branch with the script to get better figures, and that would be quite an undertaking.
Another interesting conclusion from this work is that we are obviously getting more and more data in the file each year, with 2017, for example, taking as much as double the space of 2016 (so probably double the number of CVEs). Here are the stats per year (a small sketch to recompute them follows the list):
6,1M CVE-1999
4,9M CVE-2000
6,1M CVE-2001
9,3M CVE-2002
6,0M CVE-2003
11M CVE-2004
19M CVE-2005
28M CVE-2006
26M CVE-2007
28M CVE-2008
20M CVE-2009
20M CVE-2010
18M CVE-2011
21M CVE-2012
23M CVE-2013
32M CVE-2014
30M CVE-2015
34M CVE-2016
51M CVE-2017
18M CVE-2018
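Those sizes were measured on the per-year directories of the split layout; a rough Python equivalent of that measurement (same data/CVEs/ assumption, counting allocated blocks the way du does) would be:

import os

root = 'data/CVEs'
for year_dir in sorted(os.listdir(root)):
    path = os.path.join(root, year_dir)
    total = 0
    for name in os.listdir(path):
        # count allocated blocks, like du, rather than file length
        total += os.stat(os.path.join(path, name)).st_blocks * 512
    print('{:.0f}M {}'.format(total / 1024 / 1024, year_dir))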
And here's the program to generate those files:
import os
import os.path
import re


def gen_entries(list_file):
    """Tokenize the CVE/list file, grouped by CVE entry.

    Each entry starts with a line beginning with "CVE-"; everything up
    to the next such line belongs to the same entry.
    """
    buf = ''
    with open(list_file) as f:
        for line in f:
            if line.startswith('CVE-'):
                # a new entry starts: flush the previous one
                yield buf
                buf = ''
            buf += line
    if buf:
        yield buf


dates = set()
for buf in gen_entries('data/CVE/list'):
    if not buf:
        continue
    # group 1 is the full CVE id, group 2 the CVE-YYYY prefix
    m = re.search(r'((CVE-\d+)-\d+) ', buf)
    if not m:
        continue
    cve_id = m.group(1)
    cve_date = m.group(2)
    if cve_date not in dates:
        # makedirs also creates data/CVEs itself on the first entry
        os.makedirs('data/CVEs/{}'.format(cve_date))
        dates.add(cve_date)
    with open('data/CVEs/{}/{}'.format(cve_date, cve_id), 'w') as f:
        f.write(buf)
Keep in mind that the script might drop some data and shouldn't be used as-is in a migration. For example, it doesn't take into account CVE-XXX issues. But it's sufficient to make the above observations.