issues detecting XML files not named .xml

This bug was originally reported by Paul Wise (pabs@debian.org) in Debian bug #999438:

Package: diffoscope
Version: 190
Severity: normal

There are two issues with XML files not named *.xml:

1. They don't get reformatted before comparison, resulting in a diff of
   the plain text, instead of a diff of the reformatted XML.

2. When comparing them with XML files named *.xml, a comparison of the
   bytes is done, resulting in a diff of two hex dumps, instead of a diff
   of the reformatted XML or a diff of the plain text. The reformatted XML
   would be the best thing to diff, but plain text should be a fallback.

The xmllint tool can reformat them just fine and the file tool can
detect them as XML and detect their MIME type, so this issue is likely
to be a problem in the diffoscope code.

   $ head -vn-0 test-{old,new}.xml
   ==> test-old.xml <==
   <?xml version="1.0" encoding="UTF-8"?>
   <test>
   <foo>
   <bar>
   </bar>
   </foo>
   </test>

   ==> test-new.xml <==
   <?xml version="1.0" encoding="UTF-8"?>
   <test>
   <foo>
   <bar>
   <baz>
   </baz>
   </bar>
   </foo>
   </test>

   $ diffoscope test-{old,new}.xml
   --- test-old.xml
   +++ test-new.xml
   │   --- test-old.xml
   ├── +++ test-new.xml
   │ @@ -1,6 +1,8 @@
   │  <?xml version="1.0" encoding="utf-8"?>
   │  <test>
   │    <foo>
   │ -    <bar/>
   │ +    <bar>
   │ +      <baz/>
   │ +    </bar>
   │    </foo>
   │  </test>

   $ cp test-new.xml test-new.not-xml

   $ cp test-old.xml test-old.not-xml

   $ diffoscope test-{old,new}.not-xml
   --- test-old.not-xml
   +++ test-new.not-xml
   @@ -1,7 +1,9 @@
    <?xml version="1.0" encoding="UTF-8"?>
    <test>
    <foo>
    <bar>
   +<baz>
   +</baz>
    </bar>
    </foo>
    </test>

   $ diffoscope test-old.xml test-new.not-xml
   --- test-old.xml
   +++ test-new.not-xml
   @@ -1,5 +1,6 @@
    00000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
    00000010: 2e30 2220 656e 636f 6469 6e67 3d22 5554  .0" encoding="UT
    00000020: 462d 3822 3f3e 0a3c 7465 7374 3e0a 3c66  F-8"?>.<test>.<f
   -00000030: 6f6f 3e0a 3c62 6172 3e0a 3c2f 6261 723e  oo>.<bar>.</bar>
   -00000040: 0a3c 2f66 6f6f 3e0a 3c2f 7465 7374 3e0a  .</foo>.</test>.
   +00000030: 6f6f 3e0a 3c62 6172 3e0a 3c62 617a 3e0a  oo>.<bar>.<baz>.
   +00000040: 3c2f 6261 7a3e 0a3c 2f62 6172 3e0a 3c2f  </baz>.</bar>.</
   +00000050: 666f 6f3e 0a3c 2f74 6573 743e 0a         foo>.</test>.

   $ diffoscope test-old.not-xml test-new.xml
   --- test-old.not-xml
   +++ test-new.xml
   @@ -1,5 +1,6 @@
    00000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
    00000010: 2e30 2220 656e 636f 6469 6e67 3d22 5554  .0" encoding="UT
    00000020: 462d 3822 3f3e 0a3c 7465 7374 3e0a 3c66  F-8"?>.<test>.<f
   -00000030: 6f6f 3e0a 3c62 6172 3e0a 3c2f 6261 723e  oo>.<bar>.</bar>
   -00000040: 0a3c 2f66 6f6f 3e0a 3c2f 7465 7374 3e0a  .</foo>.</test>.
   +00000030: 6f6f 3e0a 3c62 6172 3e0a 3c62 617a 3e0a  oo>.<bar>.<baz>.
   +00000040: 3c2f 6261 7a3e 0a3c 2f62 6172 3e0a 3c2f  </baz>.</bar>.</
   +00000050: 666f 6f3e 0a3c 2f74 6573 743e 0a         foo>.</test>.

   $ xmllint --format test-old.xml
   <?xml version="1.0" encoding="UTF-8"?>
   <test>
     <foo>
       <bar>
   </bar>
     </foo>
   </test>

   $ xmllint --format test-new.xml
   <?xml version="1.0" encoding="UTF-8"?>
   <test>
     <foo>
       <bar>
         <baz>
   </baz>
       </bar>
     </foo>
   </test>

   $ xmllint --format test-old.not-xml
   <?xml version="1.0" encoding="UTF-8"?>
   <test>
     <foo>
       <bar>
   </bar>
     </foo>
   </test>

   $ xmllint --format test-new.not-xml
   <?xml version="1.0" encoding="UTF-8"?>
   <test>
     <foo>
       <bar>
         <baz>
   </baz>
       </bar>
     </foo>
   </test>

   $ file test-*
   test-new.not-xml: XML 1.0 document, ASCII text
   test-new.xml:     XML 1.0 document, ASCII text
   test-old.not-xml: XML 1.0 document, ASCII text
   test-old.xml:     XML 1.0 document, ASCII text

   $ file --mime test-*
   test-new.not-xml: text/xml; charset=us-ascii
   test-new.xml:     text/xml; charset=us-ascii
   test-old.not-xml: text/xml; charset=us-ascii
   test-old.xml:     text/xml; charset=us-ascii