Commit 5c02e000 authored by Jérémy Bobbio

Massive rearchitecting: give each file type its own class

A good amount of the code for comparators is now based on classes
instead of functions. Each file type gets its own class.

The base class, File, is an abstract class that can represent files
on the filesystem but also files that can be extracted from an archive.
This design makes room for future implementation of fuzzy-matching.

Each file type class implements a class-level recognizes() method that
receives an unspecialized File instance. This is far more flexible than
the old, constrained regex table approach. The new identification method
used for Haskell interfaces is a good illustration. Calls to libmagic
methods are cached appropriately, as they are still frequently used
and tend to be rather slow.
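
As a rough illustration of the paragraph above, the sketch below pairs a
lazily cached file type lookup with two recognizes() implementations. It
is a simplified stand-in, not the code from this commit: the `detect`
callable, the `lookups` counter and `fake_detect` are hypothetical and
only replace libmagic so that the example runs on its own.

```python
import re


class File(object):
    """Simplified, hypothetical stand-in for the unspecialized File class."""

    def __init__(self, name, detect):
        self.name = name
        self._detect = detect  # hypothetical replacement for the libmagic call
        self.lookups = 0       # counts how often the slow lookup really runs

    @property
    def magic_file_type(self):
        # Compute the description once, then reuse the cached value.
        if not hasattr(self, '_magic_file_type'):
            self.lookups += 1
            self._magic_file_type = self._detect(self.name)
        return self._magic_file_type


class Bzip2File(File):
    RE_FILE_TYPE = re.compile(r'^bzip2 compressed data\b')

    @staticmethod
    def recognizes(file):
        # Identification based on the cached, libmagic-style description.
        return bool(Bzip2File.RE_FILE_TYPE.match(file.magic_file_type))


class CpioFile(File):
    RE_FILE_TYPE = re.compile(r'\bcpio archive\b')

    @staticmethod
    def recognizes(file):
        return bool(CpioFile.RE_FILE_TYPE.search(file.magic_file_type))


def fake_detect(name):
    # Assumption for the example: pretend every file is a bzip2 stream.
    return 'bzip2 compressed data, block size = 900k'


candidate = File('data.bz2', fake_detect)
matches = [cls.__name__ for cls in (CpioFile, Bzip2File)
           if cls.recognizes(candidate)]
```

Both classes query `magic_file_type`, yet the slow lookup runs only once
because the result is stored on the instance.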

An unspecialized File object is then cast into the class that
recognized it. If no class recognizes it, the File class falls back to
binary comparison.
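
One way such a cast can be done in Python is by reassigning the
instance's `__class__`. The sketch below assumes this technique; the
`specialize()` function, the `FILE_CLASSES` list and the `TextFile`
class are hypothetical names chosen for the example and are not taken
from the visible parts of this diff.

```python
class File(object):
    """Hypothetical, simplified base class with a binary fallback."""

    def __init__(self, name):
        self.name = name

    def compare(self, other):
        return 'binary comparison'


class TextFile(File):
    @staticmethod
    def recognizes(file):
        return file.name.endswith('.txt')

    def compare(self, other):
        return 'text comparison'


# Hypothetical registry of file type classes, tried in order.
FILE_CLASSES = [TextFile]


def specialize(file):
    """Turn an unspecialized File into the first class that recognizes it."""
    for cls in FILE_CLASSES:
        if cls.recognizes(file):
            file.__class__ = cls  # the "cast": same object, new behaviour
            return file
    return file  # nobody recognized it: File.compare() stays in charge


text = specialize(File('README.txt'))
blob = specialize(File('firmware.bin'))
```

Here `text` ends up using `TextFile.compare()`, while `blob` remains a
plain `File` and keeps the binary fallback.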

Instead of redefining the compare() method which returns a single
Difference or None, file type classes can implement compare_details()
which returns an array of “inside” differences. An empty array means no
differences were found.
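
The contract can be sketched as follows. This is a reduced model under
stated assumptions: `Difference` here is a minimal stand-in for the real
class, `KeyValueFile` and its `fields` dictionary are hypothetical, and
the binary fallback that the real compare() performs when no inside
difference is found is left out.

```python
class Difference(object):
    """Hypothetical, minimal stand-in for debbindiff's Difference."""

    def __init__(self, source, details=None):
        self.source = source
        self.details = details or []


class File(object):
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields  # hypothetical parsed content: key -> value

    def compare(self, other, source=None):
        # Generic wrapper: collect the "inside" differences, drop None.
        details = [d for d in self.compare_details(other, source)
                   if d is not None]
        if not details:
            return None  # empty list: nothing differs
        return Difference(source or self.name, details)

    def compare_details(self, other, source=None):
        return []


class KeyValueFile(File):
    def compare_details(self, other, source=None):
        # A file type only has to list what differs inside the file.
        details = []
        for key in sorted(set(self.fields) | set(other.fields)):
            if self.fields.get(key) != other.fields.get(key):
                details.append(Difference(key))
        return details


old = KeyValueFile('a.changes', {'Version': '1.0', 'Source': 'foo'})
new = KeyValueFile('b.changes', {'Version': '1.1', 'Source': 'foo'})
result = old.compare(new)
same = old.compare(old)
```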

This new approach makes room to handle special file types better. As an
example, device files can now be compared directly, since extracting
them from archives is problematic without root access.

To reduce a good amount of boilerplate code, the Container class and its
Archive subclass have been introduced to represent anything that
“contains” more files to be compared. While the API might still be
improved, this already helped a good amount of code become more
consistent. This will also make it pretty straightforward to implement
parallel processing in the near future.
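
A minimal sketch of the idea, assuming a container only needs to list
its members and hand them out by name. The in-memory `members`
dictionary is hypothetical, and compare() returns member names here
rather than the Difference objects used by the real code.

```python
class Container(object):
    """Hypothetical, in-memory model of something that contains files."""

    def __init__(self, members):
        self._members = members  # assumption: member name -> bytes

    def get_member_names(self):
        return list(self._members)

    def get_member(self, member_name):
        return self._members[member_name]

    def compare(self, other):
        # Only members present on both sides can be compared pairwise.
        common = sorted(set(self.get_member_names())
                        & set(other.get_member_names()))
        return [name for name in common
                if self.get_member(name) != other.get_member(name)]


first = Container({'control.tar.gz': b'abc', 'data.tar.xz': b'123',
                   'debian-binary': b'2.0\n'})
second = Container({'control.tar.gz': b'abc', 'data.tar.xz': b'456',
                    'debian-binary': b'2.0\n'})
changed = first.compare(second)
```

Because the pairing logic lives in one place, each archive format only
has to say how members are listed and extracted.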

Some archive formats (at least cpio and iso9660) were pretty annoying
to work with. To get rid of some painful code, we now use libarchive
(through the ctypes-based wrapper libarchive-c) to handle these
archives in a generic manner. One downside is that libarchive is very
stream-oriented, which is not really suited to our random-access model.
We'll see how this impacts performance in the future.

Other less crucial changes:

 - `find` is now used to compare directory listings.
 - The fallback code in case the `rpm` module cannot be found has been
   isolated to a `comparators.rpm_fallback` module.
 - Symlinks and devices are now compared in a consistent manner.
 - `md5sums` files are now only recognized when they are part of a
   Debian package.
 - Files in squashfs are now extracted one by one.
 - Text files with different encodings can be compared and this difference
   is recorded as well.
 - Test coverage is now at 92% for comparators.
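
To illustrate the item about text encodings in the list above, here is a
small sketch. It assumes the encoding of each file is already known (in
this commit it is guessed through libmagic); the `compare_text()`
function and its plain-string notes are hypothetical and stand in for
the real Difference objects.

```python
def compare_text(data1, encoding1, data2, encoding2):
    """Return a list of notes; an empty list means no difference."""
    notes = []
    # Record the encoding mismatch as a difference of its own.
    if encoding1 != encoding2:
        notes.append('encoding: %s vs %s' % (encoding1, encoding2))
    # Decode both sides first, so that the text itself is compared.
    if data1.decode(encoding1) != data2.decode(encoding2):
        notes.append('content differs')
    return notes


latin1 = u'caf\u00e9\n'.encode('iso-8859-1')
utf8 = u'caf\u00e9\n'.encode('utf-8')
notes = compare_text(latin1, 'iso-8859-1', utf8, 'utf-8')
```

The two byte strings differ, but once decoded the text is identical, so
only the encoding change is reported.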

Sincere apologies for this unreviewable commit.
parent 71da3ffa
......@@ -104,7 +104,7 @@ def main():
if parsed_args.debug:
logger.setLevel(logging.DEBUG)
set_locale()
difference = debbindiff.comparators.compare_files(
difference = debbindiff.comparators.compare_root_paths(
parsed_args.file1, parsed_args.file2)
if difference:
if parsed_args.html_output:
......
......@@ -2,7 +2,7 @@
#
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2014 Jérémy Bobbio <lunar@debian.org>
# Copyright © 2014-2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
......@@ -38,6 +38,7 @@ class RequiredToolNotFound(Exception):
, 'cpio': { 'debian': 'cpio' }
, 'diff': { 'debian': 'diffutils' }
, 'file': { 'debian': 'file' }
, 'find': { 'debian': 'findutils' }
, 'getfacl': { 'debian': 'acl' }
, 'ghc': { 'debian': 'ghc' }
, 'gpg': { 'debian': 'gnupg' }
......
......@@ -2,7 +2,7 @@
#
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2014 Jérémy Bobbio <lunar@debian.org>
# Copyright © 2014-2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
......@@ -17,11 +17,17 @@
# You should have received a copy of the GNU General Public License
# along with debbindiff. If not, see <http://www.gnu.org/licenses/>.
from abc import ABCMeta, abstractproperty, abstractmethod
from binascii import hexlify
from contextlib import contextmanager
import os
import os.path
import re
from stat import S_ISCHR, S_ISBLK
import subprocess
import magic
from debbindiff.difference import Difference
from debbindiff import tool_required, RequiredToolNotFound
from debbindiff import tool_required, RequiredToolNotFound, logger
@contextmanager
......@@ -56,8 +62,147 @@ def compare_binary_files(path1, path2, source=None):
comment = 'xxd not available in path. Falling back to Python hexlify.\n'
return Difference.from_unicode(hexdump1, hexdump2, path1, path2, source, comment)
SMALL_FILE_THRESHOLD = 65536 # 64 kiB
@tool_required('cmp')
def are_same_binaries(path1, path2):
return 0 == subprocess.call(['cmp', '--silent', path1, path2],
shell=False, close_fds=True)
# decorator for methods which need to access the file content
# (and so require a path to be set)
def needs_content(original_method):
def wrapper(self, other, *args, **kwargs):
with self.get_content(), other.get_content():
return original_method(self, other, *args, **kwargs)
return wrapper
class File(object):
__metaclass__ = ABCMeta
@classmethod
def guess_file_type(self, path):
if not hasattr(self, '_mimedb'):
self._mimedb = magic.open(magic.NONE)
self._mimedb.load()
return self._mimedb.file(path)
@classmethod
def guess_encoding(self, path):
if not hasattr(self, '_mimedb_encoding'):
self._mimedb_encoding = magic.open(magic.MAGIC_MIME_ENCODING)
self._mimedb_encoding.load()
return self._mimedb_encoding.file(path)
def __repr__(self):
return '<%s %s %s>' % (self.__class__, self.name, self.path)
# Path should only be used when accessing the file content (through get_content())
@property
def path(self):
return self._path
# This might be different from path and is used to do file extension matching
@property
def name(self):
return self._name
@property
def magic_file_type(self):
if not hasattr(self, '_magic_file_type'):
with self.get_content():
self._magic_file_type = File.guess_file_type(self.path)
return self._magic_file_type
@abstractmethod
@contextmanager
def get_content(self):
raise NotImplementedError
@abstractmethod
def is_directory(self):
raise NotImplementedError
@abstractmethod
def is_symlink(self):
raise NotImplementedError
@abstractmethod
def is_device(self):
raise NotImplementedError
@needs_content
def compare_bytes(self, other, source=None):
return compare_binary_files(self.path, other.path, source)
def _compare_using_details(self, other, source):
details = [d for d in self.compare_details(other, source) if d is not None]
if len(details) == 0:
return None
difference = Difference(None, self.name, other.name, source=source)
difference.add_details(details)
return difference
@tool_required('cmp')
@needs_content
def has_same_content_as(self, other):
logger.debug('%s has_same_content %s', self, other)
# try comparing small files directly first
my_size = os.path.getsize(self.path)
other_size = os.path.getsize(other.path)
if my_size == other_size and my_size <= SMALL_FILE_THRESHOLD:
if file(self.path).read() == file(other.path).read():
return True
return 0 == subprocess.call(['cmp', '--silent', self.path, other.path],
shell=False, close_fds=True)
# To be specialized directly, or by implementing compare_details
@needs_content
def compare(self, other, source=None):
if hasattr(self, 'compare_details'):
try:
difference = self._compare_using_details(other, source)
# no differences detected inside? let's at least do a binary diff
if difference is None:
difference = self.compare_bytes(other, source=source)
if difference is None:
return None
difference.comment = (difference.comment or '') + \
"No differences found inside, yet data differs"
except subprocess.CalledProcessError as e:
difference = self.compare_bytes(other, source=source)
output = re.sub(r'^', ' ', e.output, flags=re.MULTILINE)
cmd = ' '.join(e.cmd)
difference.comment = (difference.comment or '') + \
"Command `%s` exited with %d. Output:\n%s" \
% (cmd, e.returncode, output)
except RequiredToolNotFound as e:
difference = self.compare_bytes(other, source=source)
difference.comment = (difference.comment or '') + \
"'%s' not available in path. Falling back to binary comparison." % e.command
package = e.get_package()
if package:
difference.comment += "\nInstall '%s' to get a better output." % package
return difference
return self.compare_bytes(other, source)
class FilesystemFile(File):
def __init__(self, path):
self._path = None
self._name = path
@contextmanager
def get_content(self):
if self._path is not None:
yield
else:
self._path = self._name
yield
self._path = None
def is_directory(self):
return not os.path.islink(self._name) and os.path.isdir(self._name)
def is_symlink(self):
return os.path.islink(self._name)
def is_device(self):
mode = os.lstat(self._name).st_mode
return S_ISCHR(mode) or S_ISBLK(mode)
......@@ -2,7 +2,7 @@
#
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2014 Jérémy Bobbio <lunar@debian.org>
# Copyright © 2014-2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
......@@ -19,33 +19,49 @@
from contextlib import contextmanager
import os.path
import re
import subprocess
import debbindiff.comparators
from debbindiff.comparators.utils import binary_fallback, returns_details, make_temp_directory
from debbindiff.difference import get_source
from debbindiff import tool_required
@contextmanager
@tool_required('bzip2')
def decompress_bzip2(path):
with make_temp_directory() as temp_dir:
if path.endswith('.bz2'):
temp_path = os.path.join(temp_dir, os.path.basename(path[:-4]))
else:
temp_path = os.path.join(temp_dir, "%s-content" % path)
with open(temp_path, 'wb') as temp_file:
from debbindiff.comparators.binary import File, needs_content
from debbindiff.comparators.utils import Archive, get_compressed_content_name
from debbindiff import logger, tool_required
class Bzip2Container(Archive):
@property
def path(self):
return self._path
def open_archive(self, path):
self._path = path
return self
def close_archive(self):
self._path = None
def get_member_names(self):
return [get_compressed_content_name(self.path, '.bz2')]
@tool_required('bzip2')
def extract(self, member_name, dest_dir):
dest_path = os.path.join(dest_dir, member_name)
logger.debug('bzip2 extracting to %s' % dest_path)
with open(dest_path, 'wb') as fp:
subprocess.check_call(
["bzip2", "--decompress", "--stdout", path],
shell=False, stdout=temp_file, stderr=None)
yield temp_path
@binary_fallback
@returns_details
def compare_bzip2_files(path1, path2, source=None):
with decompress_bzip2(path1) as new_path1:
with decompress_bzip2(path2) as new_path2:
return [debbindiff.comparators.compare_files(
new_path1, new_path2,
source=[os.path.basename(new_path1), os.path.basename(new_path2)])]
["bzip2", "--decompress", "--stdout", self.path],
shell=False, stdout=fp, stderr=None)
return dest_path
class Bzip2File(File):
RE_FILE_TYPE = re.compile(r'^bzip2 compressed data\b')
@staticmethod
def recognizes(file):
return Bzip2File.RE_FILE_TYPE.match(file.magic_file_type)
@needs_content
def compare_details(self, other, source=None):
with Bzip2Container(self).open() as my_container, \
Bzip2Container(other).open() as other_container:
return my_container.compare(other_container, source)
......@@ -3,6 +3,7 @@
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2015 Reiner Herrmann <reiner@reiner-h.de>
# 2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
......@@ -17,58 +18,36 @@
# You should have received a copy of the GNU General Public License
# along with debbindiff. If not, see <http://www.gnu.org/licenses/>.
import re
import subprocess
import os.path
import debbindiff.comparators
from debbindiff import logger, tool_required
from debbindiff.comparators.utils import binary_fallback, returns_details, make_temp_directory, Command
from debbindiff.comparators.binary import File, needs_content
from debbindiff.comparators.libarchive import LibarchiveContainer
from debbindiff.comparators.utils import Command
from debbindiff.difference import Difference
class CpioContent(Command):
@tool_required('cpio')
def cmdline(self):
return ['cpio', '-tvF', self.path]
@tool_required('cpio')
def get_cpio_names(path):
cmd = ['cpio', '--quiet', '-tF', path]
return subprocess.check_output(cmd, stderr=subprocess.PIPE, shell=False).splitlines(False)
@tool_required('cpio')
def extract_cpio_archive(path, destdir):
cmd = ['cpio', '--no-absolute-filenames', '--quiet', '-idF',
os.path.abspath(path.encode('utf-8'))]
logger.debug("extracting %s into %s", path.encode('utf-8'), destdir)
p = subprocess.Popen(cmd, shell=False, cwd=destdir)
p.communicate()
p.wait()
if p.returncode != 0:
logger.error('cpio exited with error code %d', p.returncode)
return ['cpio', '--quiet', '-tvF', self.path]
@binary_fallback
@returns_details
def compare_cpio_files(path1, path2, source=None):
differences = []
differences.append(Difference.from_command(
CpioContent, path1, path2, source="file list"))
class CpioFile(File):
RE_FILE_TYPE = re.compile(r'\bcpio archive\b')
# compare files contained in archive
content1 = get_cpio_names(path1)
content2 = get_cpio_names(path2)
with make_temp_directory() as temp_dir1:
with make_temp_directory() as temp_dir2:
extract_cpio_archive(path1, temp_dir1)
extract_cpio_archive(path2, temp_dir2)
for member in sorted(set(content1).intersection(set(content2))):
in_path1 = os.path.join(temp_dir1, member)
in_path2 = os.path.join(temp_dir2, member)
if not os.path.isfile(in_path1) or not os.path.isfile(in_path2):
continue
differences.append(debbindiff.comparators.compare_files(
in_path1, in_path2, source=member))
@staticmethod
def recognizes(file):
return CpioFile.RE_FILE_TYPE.search(file.magic_file_type)
return differences
@needs_content
def compare_details(self, other, source=None):
differences = []
differences.append(Difference.from_command(
CpioContent, self.path, other.path, source="file list"))
with LibarchiveContainer(self).open() as my_container, \
LibarchiveContainer(other).open() as other_container:
differences.extend(my_container.compare(other_container, source))
return differences
......@@ -2,7 +2,7 @@
#
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2014 Jérémy Bobbio <lunar@debian.org>
# Copyright © 2014-2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
......@@ -19,54 +19,75 @@
from __future__ import absolute_import
import re
import os.path
from debian.arfile import ArFile
from debbindiff import logger
from debbindiff.difference import Difference, get_source
from debbindiff.difference import Difference
import debbindiff.comparators
from debbindiff.comparators.binary import are_same_binaries
from debbindiff.comparators.binary import File, needs_content
from debbindiff.comparators.utils import \
binary_fallback, returns_details, make_temp_directory, get_ar_content
@binary_fallback
@returns_details
def compare_deb_files(path1, path2, source=None):
differences = []
# look up differences in content
ar1 = ArFile(filename=path1)
ar2 = ArFile(filename=path2)
with make_temp_directory() as temp_dir1:
with make_temp_directory() as temp_dir2:
logger.debug('content1 %s', ar1.getnames())
logger.debug('content2 %s', ar2.getnames())
for name in sorted(set(ar1.getnames())
.intersection(ar2.getnames())):
logger.debug('extract member %s', name)
member1 = ar1.getmember(name)
member2 = ar2.getmember(name)
in_path1 = os.path.join(temp_dir1, name)
in_path2 = os.path.join(temp_dir2, name)
with open(in_path1, 'w') as f1:
f1.write(member1.read())
with open(in_path2, 'w') as f2:
f2.write(member2.read())
differences.append(
debbindiff.comparators.compare_files(
in_path1, in_path2, source=name))
os.unlink(in_path1)
os.unlink(in_path2)
# look up differences in file list and file metadata
content1 = get_ar_content(path1)
content2 = get_ar_content(path2)
differences.append(Difference.from_unicode(
content1, content2, path1, path2, source="metadata"))
return differences
def compare_md5sums_files(path1, path2, source=None):
if are_same_binaries(path1, path2):
return None
return Difference(None, path1, path2,
source=get_source(path1, path2),
comment="Files in package differs")
Archive, ArchiveMember, get_ar_content
AR_EXTRACTION_BUFFER_SIZE = 32768
class ArContainer(Archive):
def open_archive(self, path):
return ArFile(filename=path)
def close_archive(self):
# ArFile objects don't have to be closed
pass
def get_member_names(self):
return self.archive.getnames()
def extract(self, member_name, dest_dir):
logger.debug('ar extracting %s to %s', member_name, dest_dir)
member = self.archive.getmember(member_name)
dest_path = os.path.join(dest_dir, os.path.basename(member_name))
member.seek(0)
with open(dest_path, 'w') as fp:
for buf in iter(lambda: member.read(AR_EXTRACTION_BUFFER_SIZE), b''):
fp.write(buf)
return dest_path
class DebContainer(ArContainer):
pass
class DebFile(File):
RE_FILE_TYPE = re.compile(r'^Debian binary package')
@staticmethod
def recognizes(file):
return DebFile.RE_FILE_TYPE.match(file.magic_file_type)
@needs_content
def compare_details(self, other, source=None):
differences = []
my_content = get_ar_content(self.path)
other_content = get_ar_content(other.path)
differences.append(Difference.from_unicode(
my_content, other_content, self.path, other.path, source="metadata"))
with DebContainer(self).open() as my_container, \
DebContainer(other).open() as other_container:
differences.extend(my_container.compare(other_container, source))
return differences
class Md5sumsFile(File):
@staticmethod
def recognizes(file):
return isinstance(file, ArchiveMember) and \
file.name == './md5sums' and \
isinstance(file.container.source, ArchiveMember) and \
isinstance(file.container.source.container.source, ArchiveMember) and \
file.container.source.container.source.name.startswith('control.tar.')
def compare(self, other, source=None):
if self.has_same_content_as(other):
return None
return Difference(None, self.path, other.path, source='md5sums',
comment="Files in package differs")
......@@ -2,7 +2,7 @@
#
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2014 Jérémy Bobbio <lunar@debian.org>
# Copyright © 2014-2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
......@@ -17,11 +17,15 @@
# You should have received a copy of the GNU General Public License
# along with debbindiff. If not, see <http://www.gnu.org/licenses/>.
from contextlib import contextmanager
import os.path
import re
import sys
from debbindiff import logger
from debbindiff.changes import Changes
import debbindiff.comparators
from debbindiff.comparators.utils import binary_fallback, returns_details
from debbindiff.comparators.binary import File, needs_content
from debbindiff.comparators.utils import Container
from debbindiff.difference import Difference, get_source
......@@ -33,51 +37,84 @@ DOT_CHANGES_FIELDS = [
]
@binary_fallback
@returns_details
def compare_dot_changes_files(path1, path2, source=None):
try:
dot_changes1 = Changes(filename=path1)
dot_changes1.validate(check_signature=False)
dot_changes2 = Changes(filename=path2)
dot_changes2.validate(check_signature=False)
except IOError, e:
logger.critical(e)
sys.exit(2)
differences = []
for field in DOT_CHANGES_FIELDS:
differences.append(Difference.from_unicode(
dot_changes1[field].lstrip(),
dot_changes2[field].lstrip(),
path1, path2, source=field))
files_difference = Difference.from_unicode(
dot_changes1.get_as_string('Files'),
dot_changes2.get_as_string('Files'),
path1, path2,
source='Files')
if not files_difference:
return differences
class DotChangesMember(File):
def __init__(self, container, member_name):
self._container = container
self._name = member_name
self._path = None
@property
def container(self):
return self._container
@property
def name(self):
return self._name
@contextmanager
def get_content(self):
if self._path is not None:
yield
else:
with self.container.source.get_content():
self._path = os.path.join(os.path.dirname(self.container.source.path), self.name)
yield
self._path = None
def is_directory(self):
return False
def is_symlink(self):
return False
def is_device(self):
return False
class DotChangesContainer(Container):
@contextmanager
def open(self):
yield self
differences.append(files_difference)
# we are only interested in file names
files1 = dict([(d['name'], d) for d in dot_changes1.get('Files')])
files2 = dict([(d['name'], d) for d in dot_changes2.get('Files')])
for filename in sorted(set(files1.keys()).intersection(files2.keys())):
d1 = files1[filename]
d2 = files2[filename]
if d1['md5sum'] != d2['md5sum']:
logger.debug("%s mentioned in .changes have "
"differences", filename)
differences.append(
debbindiff.comparators.compare_files(
dot_changes1.get_path(filename),
dot_changes2.get_path(filename),
source=get_source(dot_changes1.get_path(filename),
dot_changes2.get_path(filename))))
return differences
def get_member_names(self):
return [d['name'] for d in self.source.changes.get('Files')]
def get_member(self, member_name):
return DotChangesMember(self, member_name)
class DotChangesFile(File):
RE_FILE_EXTENSION = re.compile(r'\.changes$')
@staticmethod
def recognizes(file):
if not DotChangesFile.RE_FILE_EXTENSION.search(file.name):
return False
with file.get_content():
changes = Changes(filename=file.path)
changes.validate(check_signature=False)
file._changes = changes
return True
@property
def changes(self):
return self._changes
@needs_content
def compare_details(self, other, source=None):
differences = []
for field in DOT_CHANGES_FIELDS:
differences.append(Difference.from_unicode(
self.changes[field].lstrip(),
other.changes[field].lstrip(),
self.path, other.path, source=field))
# compare Files as string
differences.append(Difference.from_unicode(self.changes.get_as_string('Files'),
other.changes.get_as_string('Files'),
self.path, other.path, source='Files'))
with DotChangesContainer(self).open() as my_container, \
DotChangesContainer(other).open() as other_container:
differences.extend(my_container.compare(other_container, source))
return differences
# -*- coding: utf-8 -*-
#
# debbindiff: highlight differences between two builds of Debian packages
#
# Copyright © 2015 Jérémy Bobbio <lunar@debian.org>
#
# debbindiff is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# debbindiff is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with debbindiff. If not, see <http://www.gnu.org/licenses/>.
from contextlib import contextmanager
import os