......@@ -49,6 +49,8 @@ tests:debian:
tests:fedora:
image: fedora
stage: test
tags:
- whitewhale
script:
- dnf install -y python3 python3-mutagen python3-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 gdk-pixbuf2-modules cairo-gobject cairo python3-cairo perl-Image-ExifTool mailcap
- gdk-pixbuf-query-loaders-64 > /usr/lib64/gdk-pixbuf-2.0/2.10.0/loaders.cache
......@@ -57,6 +59,8 @@ tests:fedora:
tests:archlinux:
image: archlinux/base
stage: test
tags:
- whitewhale
script:
- pacman -Sy --noconfirm python-mutagen python-gobject gdk-pixbuf2 poppler-glib gdk-pixbuf2 python-cairo perl-image-exiftool python-setuptools mailcap
- python3 setup.py test
Julien (jvoisin) Voisin <julien.voisin+mat2@dustri.org> totallylegit <totallylegit@dustri.org>
Julien (jvoisin) Voisin <julien.voisin+mat2@dustri.org> jvoisin <julien.voisin@dustri.org>
Julien (jvoisin) Voisin <julien.voisin+mat2@dustri.org> jvoisin <jvoisin@riseup.net>
Daniel Kahn Gillmor <dkg@fifthhorseman.net> dkg <dkg@fifthhorseman.net>
# 0.4.0 - 2018-10-03
- There is now a policy, for advanced users, to deal with unknown embedded fileformats
- Improve the documentation
- Various minor refactoring
- Improve how corrupted PNGs are handled
- Dangerous/advanced CLI options no longer have short versions
- Significant improvements to office files anonymisation
- Archive members are sorted lexicographically
- XML attributes are sorted lexicographically too
- RSIDs are now stripped
- Dangling references in [Content_types].xml are now removed
- Significant improvements to office files support
- Anonymised office files can now be opened by MS Office without warnings
- The CLI isn't threaded anymore, for it was causing issues
- Various misc typo fixes
# 0.3.1 - 2018-09-01
- Document how to install MAT2 for various distributions
......
......@@ -24,10 +24,13 @@ Since MAT2 is written in Python3, please conform as much as possible to the
1. Update the [changelog](https://0xacab.org/jvoisin/mat2/blob/master/CHANGELOG.md)
2. Update the version in the [mat2](https://0xacab.org/jvoisin/mat2/blob/master/mat2) file
3. Update the version in the [setup.py](https://0xacab.org/jvoisin/mat2/blob/master/setup.py) file
4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat.1)
4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat2.1)
5. Commit the changelog, man page, mat2 and setup.py files
6. Create a tag with `git tag -s $VERSION`
7. Push the commit with `git push origin master`
8. Push the tag with `git push --tags`
9. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
10. Do the secret release dance
9. Create the tarball with `git archive --format=tar.xz --prefix=mat-$VERSION/ $VERSION > mat-$VERSION.tar.xz`
10. Sign the tarball with `gpg --armor --detach-sign mat-$VERSION.tar.xz`
11. Upload the result on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
12. Tell the [downstreams](https://0xacab.org/jvoisin/mat2/blob/master/INSTALL.md) about it
13. Do the secret release dance
......@@ -38,13 +38,14 @@ $ ./mat2
and if you want to install the über-fancy Nautilus extension:
```
# apt install python-gi-dev
# apt install gnome-common gtk-doc-tools libnautilus-extension-dev python-gi-dev
$ git clone https://github.com/GNOME/nautilus-python
$ cd nautilus-python
$ PYTHON=/usr/bin/python3 ./autogen.sh
$ make
# make install
$ cp ./nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
$ mkdir -p ~/.local/share/nautilus-python/extensions/
$ cp ../nautilus/mat2.py ~/.local/share/nautilus-python/extensions/
$ PYTHONPATH=/home/$USER/mat2 PYTHON=/usr/bin/python3 nautilus
```
......@@ -52,3 +53,7 @@ $ PYTHONPATH=/home/$USER/mat2 PYTHON=/usr/bin/python3 nautilus
Thanks to [Francois_B](https://www.sciunto.org/), there is a package available on
[Arch Linux's AUR](https://aur.archlinux.org/packages/mat2/).
## Gentoo
MAT2 is available in the [torbrowser overlay](https://github.com/MeisterP/torbrowser-overlay).
......@@ -44,22 +44,33 @@ $ python3 -m unittest discover -v
# How to use MAT2
```bash
usage: mat2 [-h] [-v] [-l] [-s | -L] [files [files ...]]
usage: mat2 [-h] [-v] [-l] [--check-dependencies] [-V]
[--unknown-members policy] [-s | -L]
[files [files ...]]
Metadata anonymisation toolkit 2
positional arguments:
files
files the files to process
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-l, --list list all supported fileformats
-s, --show list all the harmful metadata of a file without removing
them
--check-dependencies check if MAT2 has all the dependencies it needs
-V, --verbose show more verbose status information
--unknown-members policy
how to handle unknown members of archive-style files
(policy should be one of: abort, omit, keep)
-s, --show list harmful metadata detectable by MAT2 without
removing them
-L, --lightweight remove SOME metadata
```
Note that MAT2 **will not** clean files in-place: for a file named
"myfile.png", it will produce a cleaned version named "myfile.cleaned.png".
# Notes about detecting metadata
While MAT2 is doing its very best to display metadata when the `--show` flag is
......@@ -78,12 +89,15 @@ be cleaned or not.
tries to deal with *printer dots* too.
- [pdfparanoia](https://github.com/kanzure/pdfparanoia), which removes
watermarks from PDFs.
- [Scrambled Exif](https://f-droid.org/packages/com.jarsilio.android.scrambledeggsif/),
an open-source Android application to remove metadata from pictures.
# Contact
If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues).
If you think that a more private contact is needed (eg. for reporting security issues),
you can email Julien (jvoisin) Voisin at `julien.voisin+mat@dustri.org`,
If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues)
or the [mailing list](https://mailman.boum.org/listinfo/mat-dev).
Should a more private contact be needed (eg. for reporting security issues),
you can email Julien (jvoisin) Voisin at `julien.voisin+mat2@dustri.org`,
using the gpg key `9FCDEE9E1A381F311EA62A7404D041E8171901CC`.
# License
......
version=4
opts="pgpmode=next" https://0xacab.org/jvoisin/mat2/tags (?:.*/)mat-@ANY_VERSION@\.tar\.xz
opts="pgpmode=next" https://0xacab.org/jvoisin/mat2/tags (?:.*/)mat2-@ANY_VERSION@\.tar\.xz
opts="pgpmode=previous" https://0xacab.org/jvoisin/mat2/tags (?:.*/)mat-@ANY_VERSION@@SIGNATURE_EXT@
opts="pgpmode=previous" https://0xacab.org/jvoisin/mat2/tags (?:.*/)mat2-@ANY_VERSION@@SIGNATURE_EXT@
......@@ -61,3 +61,11 @@ Images handling
When possible, images are handled like PDF: rendered on a surface, then saved
to the filesystem. This ensures that all metadata is removed.
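As a rough illustration of this render-and-resave approach (a sketch only, assuming pycairo is available; the real PNGParser in libmat2/images.py also validates its input and reports metadata):

```python
# Rough sketch of the "render on a surface, then save" idea for PNGs;
# only the pixel data survives the round-trip.
import cairo

surface = cairo.ImageSurface.create_from_png("input.png")
surface.write_to_png("input.cleaned.png")
```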
XML attacks
-----------
Since our threat model conveniently excludes files crafted to specifically
bypass MAT2, fileformats containing harmful XML are out of our scope.
But since MAT2 is using [etree](https://docs.python.org/3/library/xml.html#xml-vulnerabilities)
to process XML, it's "only" vulnerable to DoS, and not memory corruption:
odds are that the user will notice that the cleaning didn't succeed.
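The import fallback used in libmat2/office.py (visible further down in this diff) reflects that stance: defusedxml is preferred when present, and the standard library's etree is accepted as a fallback:

```python
# Prefer defusedxml, which mitigates entity-expansion DoS,
# and fall back to the standard library's etree when it isn't installed.
try:
    from defusedxml import ElementTree as ET  # type: ignore
except ImportError:
    import xml.etree.ElementTree as ET  # type: ignore
```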
.TH MAT2 "1" "September 2018" "MAT2 0.3.1" "User Commands"
.TH MAT2 "1" "October 2018" "MAT2 0.4.0" "User Commands"
.SH NAME
mat2 \- the metadata anonymisation toolkit 2
.SH SYNOPSIS
mat2 [\-h] [\-v] [\-l] [\-c] [\-s | \-L]\fR [files [files ...]]
\fBmat2\fR [\-h] [\-v] [\-l] [\-V] [-s | -L] [\fIfiles\fR [\fIfiles ...\fR]]
.SH DESCRIPTION
.B mat2
removes metadata from various fileformats. It supports a wide variety of file
formats: audio, office, images, …
Careful: mat2 does not clean files in-place; instead, it produces a file with the word
"cleaned" between the filename and its extension, for example "filename.cleaned.png"
for a file named "filename.png".
.SH OPTIONS
.SS "positional arguments:"
.TP
......@@ -27,9 +31,15 @@ show program's version number and exit
\fB\-l\fR, \fB\-\-list\fR
list all supported fileformats
.TP
\fB\-c\fR, \fB\-\-check\-dependencies\fR
\fB\-\-check\-dependencies\fR
check if MAT2 has all the dependencies it needs
.TP
\fB\-V\fR, \fB\-\-verbose\fR
show more verbose status information
.TP
\fB\-\-unknown-members\fR \fIpolicy\fR
how to handle unknown members of archive-style files (policy should be one of: abort, omit, keep)
.TP
\fB\-s\fR, \fB\-\-show\fR
list harmful metadata detectable by MAT2 without
removing them
......
......@@ -2,6 +2,7 @@
import os
import collections
import enum
import importlib
from typing import Dict, Optional
......@@ -35,16 +36,16 @@ DEPENDENCIES = {
'mutagen': 'Mutagen',
}
def _get_exiftool_path() -> Optional[str]:
def _get_exiftool_path() -> Optional[str]: # pragma: no cover
exiftool_path = '/usr/bin/exiftool'
if os.path.isfile(exiftool_path):
if os.access(exiftool_path, os.X_OK): # pragma: no cover
if os.access(exiftool_path, os.X_OK):
return exiftool_path
# ArchLinux
exiftool_path = '/usr/bin/vendor_perl/exiftool'
if os.path.isfile(exiftool_path):
if os.access(exiftool_path, os.X_OK): # pragma: no cover
if os.access(exiftool_path, os.X_OK):
return exiftool_path
return None
......@@ -62,3 +63,9 @@ def check_dependencies() -> dict:
ret[value] = False # pragma: no cover
return ret
@enum.unique
class UnknownMemberPolicy(enum.Enum):
ABORT = 'abort'
OMIT = 'omit'
KEEP = 'keep'
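For illustration, this is how the mat2 command-line tool (further down in this diff) maps the `--unknown-members` string onto the enum; an invalid value raises `ValueError` before any file is processed:

```python
# Sketch: string-to-enum lookup, as done in the CLI's main() with
# UnknownMemberPolicy(args.unknown_members).
policy = UnknownMemberPolicy('omit')
assert policy is UnknownMemberPolicy.OMIT
UnknownMemberPolicy('recycle')  # raises ValueError: not a valid UnknownMemberPolicy
```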
import zipfile
import datetime
import tempfile
import os
import logging
import shutil
from typing import Dict, Set, Pattern
from . import abstract, UnknownMemberPolicy, parser_factory
# Make pyflakes happy
assert Set
assert Pattern
class ArchiveBasedAbstractParser(abstract.AbstractParser):
""" Office files (.docx, .odt, …) are zipped files. """
def __init__(self, filename):
super().__init__(filename)
# Those are the files that have a format that _isn't_
# supported by MAT2, but that we want to keep anyway.
self.files_to_keep = set() # type: Set[Pattern]
# Those are the files that we _do not_ want to keep,
# no matter if they are supported or not.
self.files_to_omit = set() # type: Set[Pattern]
# what should the parser do if it encounters an unknown file in
# the archive?
self.unknown_member_policy = UnknownMemberPolicy.ABORT # type: UnknownMemberPolicy
try: # better fail here than later
zipfile.ZipFile(self.filename)
except zipfile.BadZipFile:
raise ValueError
def _specific_cleanup(self, full_path: str) -> bool:
""" This method can be used to apply specific treatment
to files present in the archive."""
# pylint: disable=unused-argument,no-self-use
return True # pragma: no cover
@staticmethod
def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
zipinfo.create_system = 3 # Linux
zipinfo.comment = b''
zipinfo.date_time = (1980, 1, 1, 0, 0, 0) # this is as early as a zipfile can be
return zipinfo
@staticmethod
def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
metadata = {}
if zipinfo.create_system == 3: # this is Linux
pass
elif zipinfo.create_system == 2:
metadata['create_system'] = 'Windows'
else:
metadata['create_system'] = 'Weird'
if zipinfo.comment:
metadata['comment'] = zipinfo.comment # type: ignore
if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
return metadata
def remove_all(self) -> bool:
# pylint: disable=too-many-branches
with zipfile.ZipFile(self.filename) as zin,\
zipfile.ZipFile(self.output_filename, 'w') as zout:
temp_folder = tempfile.mkdtemp()
abort = False
# Since the file order is a fingerprinting factor,
# we're iterating over (and thus inserting) the members in lexicographic order.
for item in sorted(zin.infolist(), key=lambda z: z.filename):
if item.filename[-1] == '/': # `is_dir` is added in Python3.6
continue # don't keep empty folders
zin.extract(member=item, path=temp_folder)
full_path = os.path.join(temp_folder, item.filename)
if self._specific_cleanup(full_path) is False:
logging.warning("Something went wrong during deep cleaning of %s",
item.filename)
abort = True
continue
if any(map(lambda r: r.search(item.filename), self.files_to_keep)):
# those files aren't supported, but we want to add them anyway
pass
elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
continue
else: # supported files that we want to first clean, then add
tmp_parser, mtype = parser_factory.get_parser(full_path) # type: ignore
if not tmp_parser:
if self.unknown_member_policy == UnknownMemberPolicy.OMIT:
logging.warning("In file %s, omitting unknown element %s (format: %s)",
self.filename, item.filename, mtype)
continue
elif self.unknown_member_policy == UnknownMemberPolicy.KEEP:
logging.warning("In file %s, keeping unknown element %s (format: %s)",
self.filename, item.filename, mtype)
else:
logging.error("In file %s, element %s's format (%s) " +
"isn't supported",
self.filename, item.filename, mtype)
abort = True
continue
if tmp_parser:
tmp_parser.remove_all()
os.rename(tmp_parser.output_filename, full_path)
zinfo = zipfile.ZipInfo(item.filename) # type: ignore
clean_zinfo = self._clean_zipinfo(zinfo)
with open(full_path, 'rb') as f:
zout.writestr(clean_zinfo, f.read())
shutil.rmtree(temp_folder)
if abort:
os.remove(self.output_filename)
return False
return True
......@@ -62,9 +62,13 @@ class PNGParser(_ImageParser):
def __init__(self, filename):
super().__init__(filename)
if imghdr.what(filename) != 'png':
raise ValueError
try: # better fail here than later
cairo.ImageSurface.create_from_png(self.filename)
except MemoryError:
except MemoryError: # pragma: no cover
raise ValueError
def remove_all(self):
......
import logging
import os
import re
import shutil
import tempfile
import datetime
import zipfile
import logging
from typing import Dict, Set, Pattern
try: # protect against DoS
from defusedxml import ElementTree as ET # type: ignore
except ImportError:
import xml.etree.ElementTree as ET # type: ignore
from .archive import ArchiveBasedAbstractParser
from . import abstract, parser_factory
# pylint: disable=line-too-long
# Make pyflakes happy
assert Set
assert Pattern
def _parse_xml(full_path: str):
""" This function parse XML, with namespace support. """
""" This function parses XML, with namespace support. """
namespace_map = dict()
for _, (key, value) in ET.iterparse(full_path, ("start-ns", )):
# The ns[0-9]+ namespaces are reserved for internal usage, so
# we have to use another nomenclature.
if re.match('^ns[0-9]+$', key, re.I): # pragma: no cover
key = 'mat' + key[2:]
namespace_map[key] = value
ET.register_namespace(key, value)
return ET.parse(full_path), namespace_map
class ArchiveBasedAbstractParser(abstract.AbstractParser):
""" Office files (.docx, .odt, …) are zipped files. """
# Those are the files that have a format that _isn't_
# supported by MAT2, but that we want to keep anyway.
files_to_keep = set() # type: Set[str]
def _sort_xml_attributes(full_path: str) -> bool:
""" Sort xml attributes lexicographically,
because it's possible to fingerprint producers (MS Office, LibreOffice, …),
since they all use different orders.
"""
tree = ET.parse(full_path)
# Those are the files that we _do not_ want to keep,
# no matter if they are supported or not.
files_to_omit = set() # type: Set[Pattern]
for c in tree.getroot():
c[:] = sorted(c, key=lambda child: (child.tag, child.get('desc')))
def __init__(self, filename):
super().__init__(filename)
try: # better fail here than later
zipfile.ZipFile(self.filename)
except zipfile.BadZipFile:
raise ValueError
tree.write(full_path, xml_declaration=True)
return True
def _specific_cleanup(self, full_path: str) -> bool:
""" This method can be used to apply specific treatment
to files present in the archive."""
# pylint: disable=unused-argument,no-self-use
return True # pragma: no cover
@staticmethod
def _clean_zipinfo(zipinfo: zipfile.ZipInfo) -> zipfile.ZipInfo:
zipinfo.create_system = 3 # Linux
zipinfo.comment = b''
zipinfo.date_time = (1980, 1, 1, 0, 0, 0) # this is as early as a zipfile can be
return zipinfo
class MSOfficeParser(ArchiveBasedAbstractParser):
mimetypes = {
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}
content_types_to_keep = {
'application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml', # /word/endnotes.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml', # /word/footnotes.xml
'application/vnd.openxmlformats-officedocument.extended-properties+xml', # /docProps/app.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml', # /word/document.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml', # /word/fontTable.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml', # /word/footer.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml', # /word/header.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml', # /word/styles.xml
'application/vnd.openxmlformats-package.core-properties+xml', # /docProps/core.xml
# Do we want to keep the following ones?
'application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml',
# See https://0xacab.org/jvoisin/mat2/issues/71
'application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml', # /word/numbering.xml
}
@staticmethod
def _get_zipinfo_meta(zipinfo: zipfile.ZipInfo) -> Dict[str, str]:
metadata = {}
if zipinfo.create_system == 3: # this is Linux
pass
elif zipinfo.create_system == 2:
metadata['create_system'] = 'Windows'
else:
metadata['create_system'] = 'Weird'
if zipinfo.comment:
metadata['comment'] = zipinfo.comment # type: ignore
def __init__(self, filename):
super().__init__(filename)
if zipinfo.date_time != (1980, 1, 1, 0, 0, 0):
metadata['date_time'] = str(datetime.datetime(*zipinfo.date_time))
self.files_to_keep = set(map(re.compile, { # type: ignore
r'^\[Content_Types\]\.xml$',
r'^_rels/\.rels$',
r'^word/_rels/document\.xml\.rels$',
r'^word/_rels/footer[0-9]*\.xml\.rels$',
r'^word/_rels/header[0-9]*\.xml\.rels$',
return metadata
# https://msdn.microsoft.com/en-us/library/dd908153(v=office.12).aspx
r'^word/stylesWithEffects\.xml$',
}))
self.files_to_omit = set(map(re.compile, { # type: ignore
r'^customXml/',
r'webSettings\.xml$',
r'^docProps/custom\.xml$',
r'^word/printerSettings/',
r'^word/theme',
# we have a whitelist in self.files_to_keep,
# so we can trash everything else
r'^word/_rels/',
}))
def remove_all(self) -> bool:
with zipfile.ZipFile(self.filename) as zin,\
zipfile.ZipFile(self.output_filename, 'w') as zout:
if self.__fill_files_to_keep_via_content_types() is False:
raise ValueError
temp_folder = tempfile.mkdtemp()
def __fill_files_to_keep_via_content_types(self) -> bool:
""" There is a suer-handy `[Content_Types].xml` file
in MS Office archives, describing what each other file contains.
The self.content_types_to_keep member contains a type whitelist,
so we're using it to fill the self.files_to_keep one.
"""
with zipfile.ZipFile(self.filename) as zin:
if '[Content_Types].xml' not in zin.namelist():
return False
xml_data = zin.read('[Content_Types].xml')
for item in zin.infolist():
if item.filename[-1] == '/': # `is_dir` is added in Python3.6
continue # don't keep empty folders
self.content_types = dict() # type: Dict[str, str]
try:
tree = ET.fromstring(xml_data)
except ET.ParseError:
return False
for c in tree:
if 'PartName' not in c.attrib or 'ContentType' not in c.attrib:
continue
elif c.attrib['ContentType'] in self.content_types_to_keep:
fname = c.attrib['PartName'][1:] # remove leading `/`
re_fname = re.compile('^' + re.escape(fname) + '$')
self.files_to_keep.add(re_fname) # type: ignore
return True
zin.extract(member=item, path=temp_folder)
full_path = os.path.join(temp_folder, item.filename)
@staticmethod
def __remove_rsid(full_path: str) -> bool:
""" The method will remove "revision session ID". We're '}rsid'
instead of proper parsing, since rsid can have multiple forms, like
`rsidRDefault`, `rsidR`, `rsids`, …
if self._specific_cleanup(full_path) is False:
shutil.rmtree(temp_folder)
os.remove(self.output_filename)
logging.warning("Something went wrong during deep cleaning of %s",
item.filename)
return False
We're removing rsid tags in two passes, because we can't modify
the xml while we're iterating over it.
if item.filename in self.files_to_keep:
# those files aren't supported, but we want to add them anyway
pass
elif any(map(lambda r: r.search(item.filename), self.files_to_omit)):
continue
else:
# supported files that we want to clean then add
tmp_parser, mtype = parser_factory.get_parser(full_path) # type: ignore
if not tmp_parser:
shutil.rmtree(temp_folder)
os.remove(self.output_filename)
logging.error("In file %s, element %s's format (%s) " +
"isn't supported",
self.filename, item.filename, mtype)
For more details, see
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/
"""
try:
tree, namespace = _parse_xml(full_path)
except ET.ParseError:
return False
tmp_parser.remove_all()
os.rename(tmp_parser.output_filename, full_path)
zinfo = zipfile.ZipInfo(item.filename) # type: ignore
clean_zinfo = self._clean_zipinfo(zinfo)
with open(full_path, 'rb') as f:
zout.writestr(clean_zinfo, f.read())
shutil.rmtree(temp_folder)
# rsid tags and attributes are always under the `w` namespace
if 'w' not in namespace.keys():
return True
parent_map = {c:p for p in tree.iter() for c in p}
class MSOfficeParser(ArchiveBasedAbstractParser):
mimetypes = {
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'application/vnd.openxmlformats-officedocument.presentationml.presentation'
}
files_to_keep = {
'[Content_Types].xml',
'_rels/.rels',
'word/_rels/document.xml.rels',
'word/document.xml',
'word/fontTable.xml',
'word/settings.xml',
'word/styles.xml',
}
files_to_omit = set(map(re.compile, { # type: ignore
'^docProps/',
}))
elements_to_remove = list()
for item in tree.iterfind('.//', namespace):
if '}rsid' in item.tag.strip().lower(): # rsid as tag
elements_to_remove.append(item)
continue
for key in list(item.attrib.keys()): # rsid as attribute
if '}rsid' in key.lower():
del item.attrib[key]
for element in elements_to_remove:
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True)
return True
@staticmethod
def __remove_revisions(full_path: str) -> bool:
......@@ -152,7 +169,8 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
"""
try:
tree, namespace = _parse_xml(full_path)
except ET.ParseError:
except ET.ParseError as e:
logging.error("Unable to parse %s: %s", full_path, e)
return False
# Revisions are either deletions (`w:del`) or
......@@ -182,13 +200,100 @@ class MSOfficeParser(ArchiveBasedAbstractParser):
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True)
return True
def __remove_content_type_members(self, full_path: str) -> bool:
""" The method will remove the dangling references
from the [Content_Types].xml file, since MS Office doesn't like them
"""
try:
tree, namespace = _parse_xml(full_path)
except ET.ParseError: # pragma: no cover
return False
if len(namespace.items()) != 1:
return False # there should be only one namespace for Types
removed_fnames = set()
with zipfile.ZipFile(self.filename) as zin:
for fname in [item.filename for item in zin.infolist()]:
for file_to_omit in self.files_to_omit:
if file_to_omit.search(fname):
matches = map(lambda r: r.search(fname), self.files_to_keep)
if any(matches): # the file is whitelisted
continue
removed_fnames.add(fname)
break
root = tree.getroot()
for item in root.findall('{%s}Override' % namespace['']):
name = item.attrib['PartName'][1:] # remove the leading '/'
if name in removed_fnames:
root.remove(item)
tree.write(full_path, xml_declaration=True)
return True
def _specific_cleanup(self, full_path: str) -> bool:
if full_path.endswith('/word/document.xml'):
# pylint: disable=too-many-return-statements
if os.stat(full_path).st_size == 0: # Don't process empty files
return True
if not full_path.endswith('.xml'):
return True
if full_path.endswith('/[Content_Types].xml'):
# this file contains references to files that we might
# remove, and MS Office doesn't like dangling references
if self.__remove_content_type_members(full_path) is False:
return False
elif full_path.endswith('/word/document.xml'):
# this file contains the revisions
return self.__remove_revisions(full_path)
if self.__remove_revisions(full_path) is False:
return False
elif full_path.endswith('/docProps/app.xml'):
# This file must be present and valid,
# so we're removing as much as we can.
with open(full_path, 'wb') as f:
f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
f.write(b'<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">')
f.write(b'</Properties>')
elif full_path.endswith('/docProps/core.xml'):
# This file must be present and valid,
# so we're removing as much as we can.
with open(full_path, 'wb') as f:
f.write(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>')
f.write(b'<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">')
f.write(b'</cp:coreProperties>')
if self.__remove_rsid(full_path) is False:
return False
try:
_sort_xml_attributes(full_path)
except ET.ParseError as e: # pragma: no cover
logging.error("Unable to parse %s: %s", full_path, e)
return False
# This is awful, I'm sorry.
#
# Microsoft Office isn't happy when we have the `mc:Ignorable`
# tag containing namespaces that aren't present in the xml file,
# so instead of trying to remove this specific tag with etree,
# we're removing it with a regexp.
#
# Since we're the ones producing this file, via the call to
# _sort_xml_attributes, there won't be any "funny tricks".
# Worst case, the tag isn't present, and everything is fine.
#
# see: https://docs.microsoft.com/en-us/dotnet/framework/wpf/advanced/mc-ignorable-attribute
with open(full_path, 'rb') as f:
text = f.read()
out = re.sub(b'mc:Ignorable="[^"]*"', b'', text, 1)
with open(full_path, 'wb') as f:
f.write(out)
return True
def get_meta(self) -> Dict[str, str]:
......@@ -223,26 +328,31 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
'application/vnd.oasis.opendocument.formula',
'application/vnd.oasis.opendocument.image',
}
files_to_keep = {
'META-INF/manifest.xml',
'content.xml',
'manifest.rdf',
'mimetype',
'settings.xml',
'styles.xml',
}
files_to_omit = set(map(re.compile, { # type: ignore
def __init__(self, filename):
super().__init__(filename)
self.files_to_keep = set(map(re.compile, { # type: ignore
r'^META-INF/manifest\.xml$',
r'^content\.xml$',
r'^manifest\.rdf$',
r'^mimetype$',
r'^settings\.xml$',
r'^styles\.xml$',
}))
self.files_to_omit = set(map(re.compile, { # type: ignore
r'^meta\.xml$',
'^Configurations2/',
'^Thumbnails/',
r'^Configurations2/',
r'^Thumbnails/',
}))
@staticmethod
def __remove_revisions(full_path: str) -> bool:
try:
tree, namespace = _parse_xml(full_path)
except ET.ParseError:
except ET.ParseError as e:
logging.error("Unable to parse %s: %s", full_path, e)
return False
if 'office' not in namespace.keys(): # no revisions in the current file
......@@ -253,12 +363,22 @@ class LibreOfficeParser(ArchiveBasedAbstractParser):
text.remove(changes)
tree.write(full_path, xml_declaration=True)
return True
def _specific_cleanup(self, full_path: str) -> bool:
if os.stat(full_path).st_size == 0: # Don't process empty files
return True
if os.path.basename(full_path).endswith('.xml'):
if os.path.basename(full_path) == 'content.xml':
return self.__remove_revisions(full_path)
if self.__remove_revisions(full_path) is False:
return False
try:
_sort_xml_attributes(full_path)
except ET.ParseError as e:
logging.error("Unable to parse %s: %s", full_path, e)
return False
return True
def get_meta(self) -> Dict[str, str]:
......
......@@ -118,7 +118,6 @@ class PDFParser(abstract.AbstractParser):
document.save('file://' + os.path.abspath(out_file))
return True
@staticmethod
def __parse_metadata_field(data: str) -> dict:
metadata = {}
......
......@@ -21,7 +21,6 @@ class TorrentParser(abstract.AbstractParser):
metadata[key.decode('utf-8')] = value
return metadata
def remove_all(self) -> bool:
cleaned = dict()
for key, value in self.dict_repr.items():
......
#!/usr/bin/python3
#!/usr/bin/env python3
import os
from typing import Tuple
import sys
import itertools
import mimetypes
import argparse
import multiprocessing
import logging
try:
from libmat2 import parser_factory, UNSUPPORTED_EXTENSIONS, check_dependencies
from libmat2 import parser_factory, UNSUPPORTED_EXTENSIONS
from libmat2 import check_dependencies, UnknownMemberPolicy
except ValueError as e:
print(e)
sys.exit(1)
__version__ = '0.3.1'
__version__ = '0.4.0'
def __check_file(filename: str, mode: int=os.R_OK) -> bool:
if not os.path.exists(filename):
......@@ -37,10 +36,13 @@ def create_arg_parser():
version='MAT2 %s' % __version__)
parser.add_argument('-l', '--list', action='store_true',
help='list all supported fileformats')
parser.add_argument('-c', '--check-dependencies', action='store_true',
parser.add_argument('--check-dependencies', action='store_true',
help='check if MAT2 has all the dependencies it needs')
parser.add_argument('-V', '--verbose', action='store_true',
help='show more verbose status information')
parser.add_argument('--unknown-members', metavar='policy', default='abort',
help='how to handle unknown members of archive-style files (policy should' +
' be one of: %s)' % ', '.join(p.value for p in UnknownMemberPolicy))
info = parser.add_mutually_exclusive_group()
......@@ -67,8 +69,8 @@ def show_meta(filename: str):
except UnicodeEncodeError:
print(" %s: harmful content" % k)
def clean_meta(params: Tuple[str, bool]) -> bool:
filename, is_lightweight = params
def clean_meta(params: Tuple[str, bool, UnknownMemberPolicy]) -> bool:
filename, is_lightweight, unknown_member_policy = params
if not __check_file(filename, os.R_OK|os.W_OK):
return False
......@@ -76,6 +78,7 @@ def clean_meta(params: Tuple[str, bool]) -> bool:
if p is None:
print("[-] %s's format (%s) is not supported" % (filename, mtype))
return False
p.unknown_member_policy = unknown_member_policy
if is_lightweight:
return p.remove_all_lightweight()
return p.remove_all()
......@@ -133,12 +136,16 @@ def main():
return 0
else:
p = multiprocessing.Pool()
mode = (args.lightweight is True)
l = zip(__get_files_recursively(args.files), itertools.repeat(mode))
unknown_member_policy = UnknownMemberPolicy(args.unknown_members)
if unknown_member_policy == UnknownMemberPolicy.KEEP:
logging.warning('Keeping unknown member files may leak metadata in the resulting file!')
no_failure = True
for f in __get_files_recursively(args.files):
if clean_meta([f, args.lightweight, unknown_member_policy]) is False:
no_failure = False
return 0 if no_failure is True else -1
ret = list(p.imap_unordered(clean_meta, list(l)))
return 0 if all(ret) else -1
if __name__ == '__main__':
sys.exit(main())
......@@ -104,7 +104,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
box.add(self.__create_treeview())
window.show_all()
@staticmethod
def __validate(fileinfo) -> Tuple[bool, str]:
""" Validate if a given file FileInfo `fileinfo` can be processed.
......@@ -115,7 +114,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
return False, "Not writeable"
return True, ""
def __create_treeview(self) -> Gtk.TreeView:
liststore = Gtk.ListStore(GdkPixbuf.Pixbuf, str, str)
treeview = Gtk.TreeView(model=liststore)
......@@ -148,7 +146,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
treeview.show_all()
return treeview
def __create_progressbar(self) -> Gtk.ProgressBar:
""" Create the progressbar used to notify that files are currently
being processed.
......@@ -211,7 +208,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
processing_queue.put(None) # signal that we processed all the files
return True
def __cb_menu_activate(self, menu, files):
""" This method is called when the user clicked the "clean metadata"
menu item.
......@@ -228,7 +224,6 @@ class ColumnExtension(GObject.GObject, Nautilus.MenuProvider, Nautilus.LocationW
thread.daemon = True
thread.start()
def get_background_items(self, window, file):
""" https://bugzilla.gnome.org/show_bug.cgi?id=784278 """
return None
......
......@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
setuptools.setup(
name="mat2",
version='0.3.1',
version='0.4.0',
author="Julien (jvoisin) Voisin",
author_email="julien.voisin+mat2@dustri.org",
description="A handy tool to trash your metadata",
......@@ -20,7 +20,7 @@ setuptools.setup(
'pycairo',
],
packages=setuptools.find_packages(exclude=('tests', )),
classifiers=(
classifiers=[
"Development Status :: 3 - Alpha",
"Environment :: Console",
"License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)",
......@@ -28,7 +28,7 @@ setuptools.setup(
"Programming Language :: Python :: 3 :: Only",
"Topic :: Security",
"Intended Audience :: End Users/Desktop",
),
],
project_urls={
'bugtracker': 'https://0xacab.org/jvoisin/mat2/issues',
},
......