    Massive rearchitecturing: give each file type its own class · 5c02e000
    Jérémy Bobbio authored
    Much of the comparator code is now based on classes instead of
    methods. Each file type gets its own class.
    
    The base class, File, is an abstract class that can represent files
    on the filesystem but also files that can be extracted from an archive.
    This design makes room for future implementation of fuzzy-matching.
    
    Each file type class implements a class method recognizes() that
    receives an unspecialized File instance. This is much more flexible than
    the old, constrained regex table approach. The new identification method
    used for Haskell interfaces is a good illustration. Calls to libmagic
    methods are cached, as they are still used frequently and tend to be
    rather slow.
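
    A rough sketch of the caching idea, not the code from this commit:
    slow_detect_type() is a hypothetical stand-in for an expensive
    libmagic call, and detected_type is an invented attribute name. It
    assumes Python 3.8 or later for functools.cached_property.

```python
from functools import cached_property


def slow_detect_type(content):
    """Hypothetical stand-in for an expensive type-detection call."""
    slow_detect_type.calls += 1
    return "gzip" if content[:2] == b"\x1f\x8b" else "data"


# Call counter, used only to show that the result is computed once.
slow_detect_type.calls = 0


class File:
    def __init__(self, content):
        self.content = content

    @cached_property
    def detected_type(self):
        # Computed on first access, then stored on the instance, so every
        # recognizes() implementation can consult it without paying again.
        return slow_detect_type(self.content)
```

    Several recognizes() implementations can then read detected_type on
    the same File without repeating the slow call.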
    
    An unspecialized File object is then cast into the class that
    recognized it. If no class recognizes it, the File class falls back to
    binary comparison.
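
    The recognition and specialization steps could look roughly like the
    following. This is a minimal sketch under assumptions: GzipFile,
    TextFile, FILE_CLASSES, specialize() and has_same_content_as() are
    illustrative names, and the in-memory content attribute stands in
    for real file access.

```python
class File:
    """Unspecialized file; falls back to a byte-for-byte comparison."""

    def __init__(self, name, content):
        self.name = name
        self.content = content  # bytes; stands in for data on disk or in an archive

    @classmethod
    def recognizes(cls, file):
        # The base class never claims a file for itself.
        return False

    def has_same_content_as(self, other):
        return self.content == other.content


class GzipFile(File):
    @classmethod
    def recognizes(cls, file):
        # gzip streams start with the magic bytes 1f 8b.
        return file.content[:2] == b"\x1f\x8b"


class TextFile(File):
    @classmethod
    def recognizes(cls, file):
        # Arbitrary logic is allowed here, unlike with a regex table.
        try:
            file.content.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True


# Order matters: the first class that recognizes the file wins.
FILE_CLASSES = [GzipFile, TextFile]


def specialize(file):
    """Switch `file` to the first class that recognizes it, in place."""
    for file_class in FILE_CLASSES:
        if file_class.recognizes(file):
            file.__class__ = file_class
            break
    return file
```

    A file that no class recognizes keeps the File class and its binary
    comparison.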
    
    Instead of redefining the compare() method which returns a single
    Difference or None, file type classes can implement compare_details()
    which returns an array of “inside” differences. An empty array means no
    differences were found.
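
    As a hedged illustration of the compare() / compare_details() split:
    the Difference class below is a simplified stand-in, TextFile is a
    hypothetical file type, and the base class folds the binary fallback
    into compare_details() for brevity, which may differ from the real
    code.

```python
class Difference:
    """Simplified stand-in for the real Difference class."""

    def __init__(self, source1, source2, comment=None):
        self.source1 = source1
        self.source2 = source2
        self.comment = comment
        self.details = []  # nested "inside" differences


class File:
    def __init__(self, name, content):
        self.name = name
        self.content = content

    def compare_details(self, other):
        # Default: plain content comparison. File types override this.
        if self.content == other.content:
            return []
        return [Difference(self.name, other.name, "content differs")]

    def compare(self, other):
        # Wrap the details in one Difference; an empty list means None.
        details = self.compare_details(other)
        if not details:
            return None
        difference = Difference(self.name, other.name)
        difference.details.extend(details)
        return difference


class TextFile(File):
    def compare_details(self, other):
        details = []
        mine = self.content.splitlines()
        theirs = other.content.splitlines()
        if len(mine) != len(theirs):
            details.append(
                Difference(self.name, other.name, "line count differs"))
        for number, (a, b) in enumerate(zip(mine, theirs), start=1):
            if a != b:
                details.append(
                    Difference(self.name, other.name, f"line {number} differs"))
        return details
```

    A file type class only has to describe what differs inside the file;
    the wrapping into a single Difference stays in one place.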
    
    This new approach makes room to handle special file types better. As an
    example, device files can now be compared directly as their extraction
    from archives is problematic without root access.
    
    To cut down on boilerplate code, the Container class and its
    subclass Archive have been introduced to represent anything that
    “contains” more files to be compared. While the API might still be
    improved, this has already made a good amount of code more
    consistent. It should also make it fairly straightforward to implement
    parallel processing in the near future.
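
    One possible shape for such an abstraction, as a sketch only: the
    method names get_member_names() and get_member() are assumptions,
    zip is used purely because the standard library can read it from
    memory, and make_zip() is a demo helper.

```python
import io
import zipfile
from abc import ABC, abstractmethod


class Container(ABC):
    """Anything that holds named members which can be compared pairwise."""

    @abstractmethod
    def get_member_names(self):
        """Return the names of all members."""

    @abstractmethod
    def get_member(self, name):
        """Return the content of one member as bytes."""

    def compare(self, other):
        """Return (name, reason) pairs for differing members, sorted by name."""
        mine = set(self.get_member_names())
        theirs = set(other.get_member_names())
        differences = []
        for name in sorted(mine | theirs):
            if name not in theirs:
                differences.append((name, "only in first"))
            elif name not in mine:
                differences.append((name, "only in second"))
            elif self.get_member(name) != other.get_member(name):
                differences.append((name, "content differs"))
        return differences


class Archive(Container):
    """Container backed by an archive; zip is used here only for brevity."""

    def __init__(self, data):
        self._zip = zipfile.ZipFile(io.BytesIO(data))

    def get_member_names(self):
        return self._zip.namelist()

    def get_member(self, name):
        return self._zip.read(name)


def make_zip(members):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

    Because the member loop lives in Container.compare(), a new archive
    format only needs the two accessor methods, and the per-member
    comparisons are independent of each other.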
    
    Some archive formats (at least cpio and iso9660) were pretty annoying
    to work with. To get rid of some painful code, we now use libarchive
    (through the ctypes-based wrapper libarchive-c) to handle these
    archives in a generic manner. One downside is that libarchive is very
    stream-oriented, which is not really suited to our random-access model.
    We'll see how this impacts performance in the future.
    
    Other less crucial changes:
    
     - `find` is now used to compare directory listings.
     - The fallback code in case the `rpm` module cannot be found has been
       isolated to a `comparators.rpm_fallback` module.
     - Symlinks and devices are now compared in a consistent manner.
     - `md5sums` files are now only recognized when they are part of a
       Debian package.
     - Files in squashfs are now extracted one by one.
     - Text files with different encodings can be compared and this difference
       is recorded as well.
     - Test coverage is now at 92% for comparators.
    
    Sincere apologies for this unreviewable commit.