Issues detecting XML files not named *.xml
This bug was originally reported by Paul Wise (pabs@debian.org) in Debian bug #999438:
Package: diffoscope
Version: 190
Severity: normal
There are two issues with XML files not named *.xml:

 * They don't get reformatted before comparison, resulting in a diff of
   the plain text instead of a diff of the reformatted XML.

 * When comparing them with XML files named *.xml, a comparison of the
   bytes is done, resulting in a diff of two hex dumps instead of a diff
   of the reformatted XML or a diff of the plain text. The reformatted XML
   would be the best thing to diff, but plain text should be a fallback.
The xmllint tool reformats these files just fine regardless of their
name, and the file tool detects them as XML and reports their MIME type
as text/xml, so the issue is likely to be a problem in the diffoscope
code.
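
The preference order requested above (reformatted XML first, plain text as the fallback, no hex dump for well-formed XML) could be sketched roughly as follows. This is only an illustration and not diffoscope's actual code: it decides by content rather than by file name, it uses Python's standard xml.dom.minidom in place of xmllint, and the names reformat_or_plain, compare, OLD and NEW are hypothetical.

```python
import difflib
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def _strip_blank_text(node):
    """Drop whitespace-only text nodes so the pretty-printed output is stable."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            _strip_blank_text(child)


def reformat_or_plain(data: bytes) -> str:
    """Pretty-print data if it parses as XML, otherwise return it as plain text."""
    try:
        dom = xml.dom.minidom.parseString(data)
    except ExpatError:
        # Not well-formed XML: fall back to the plain text.
        return data.decode("utf-8", errors="replace")
    _strip_blank_text(dom)
    return dom.toprettyxml(indent="  ")


def compare(old: bytes, new: bytes) -> str:
    """Unified diff of the normalised forms; file names are never consulted."""
    a = reformat_or_plain(old).splitlines(keepends=True)
    b = reformat_or_plain(new).splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, "old", "new"))


# Same content as the test files in the transcript of this report.
OLD = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<test>\n<foo>\n<bar>\n</bar>\n</foo>\n</test>\n')
NEW = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<test>\n<foo>\n<bar>\n<baz>\n</baz>\n</bar>\n</foo>\n</test>\n')

print(compare(OLD, NEW), end="")
```

Because the decision is made from the bytes alone, the same diff would be produced whether the inputs were named *.xml or *.not-xml.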
$ head -vn-0 test-{old,new}.xml
==> test-old.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
</bar>
</foo>
</test>
==> test-new.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
<baz>
</baz>
</bar>
</foo>
</test>
$ diffoscope test-{old,new}.xml
--- test-old.xml
+++ test-new.xml
│ --- test-old.xml
├── +++ test-new.xml
│ @@ -1,6 +1,8 @@
│ <?xml version="1.0" encoding="utf-8"?>
│ <test>
│ <foo>
│ - <bar/>
│ + <bar>
│ + <baz/>
│ + </bar>
│ </foo>
│ </test>
$ cp test-new.xml test-new.not-xml
$ cp test-old.xml test-old.not-xml
$ diffoscope test-{old,new}.not-xml
--- test-old.not-xml
+++ test-new.not-xml
@@ -1,7 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
+<baz>
+</baz>
</bar>
</foo>
</test>
$ diffoscope test-old.xml test-new.not-xml
--- test-old.xml
+++ test-new.not-xml
@@ -1,5 +1,6 @@
00000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231 <?xml version="1
00000010: 2e30 2220 656e 636f 6469 6e67 3d22 5554 .0" encoding="UT
00000020: 462d 3822 3f3e 0a3c 7465 7374 3e0a 3c66 F-8"?>.<test>.<f
-00000030: 6f6f 3e0a 3c62 6172 3e0a 3c2f 6261 723e oo>.<bar>.</bar>
-00000040: 0a3c 2f66 6f6f 3e0a 3c2f 7465 7374 3e0a .</foo>.</test>.
+00000030: 6f6f 3e0a 3c62 6172 3e0a 3c62 617a 3e0a oo>.<bar>.<baz>.
+00000040: 3c2f 6261 7a3e 0a3c 2f62 6172 3e0a 3c2f </baz>.</bar>.</
+00000050: 666f 6f3e 0a3c 2f74 6573 743e 0a foo>.</test>.
$ diffoscope test-old.not-xml test-new.xml
--- test-old.not-xml
+++ test-new.xml
@@ -1,5 +1,6 @@
00000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231 <?xml version="1
00000010: 2e30 2220 656e 636f 6469 6e67 3d22 5554 .0" encoding="UT
00000020: 462d 3822 3f3e 0a3c 7465 7374 3e0a 3c66 F-8"?>.<test>.<f
-00000030: 6f6f 3e0a 3c62 6172 3e0a 3c2f 6261 723e oo>.<bar>.</bar>
-00000040: 0a3c 2f66 6f6f 3e0a 3c2f 7465 7374 3e0a .</foo>.</test>.
+00000030: 6f6f 3e0a 3c62 6172 3e0a 3c62 617a 3e0a oo>.<bar>.<baz>.
+00000040: 3c2f 6261 7a3e 0a3c 2f62 6172 3e0a 3c2f </baz>.</bar>.</
+00000050: 666f 6f3e 0a3c 2f74 6573 743e 0a foo>.</test>.
$ xmllint --format test-old.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
</bar>
</foo>
</test>
$ xmllint --format test-new.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
<baz>
</baz>
</bar>
</foo>
</test>
$ xmllint --format test-old.not-xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
</bar>
</foo>
</test>
$ xmllint --format test-new.not-xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<foo>
<bar>
<baz>
</baz>
</bar>
</foo>
</test>
$ file test-*
test-new.not-xml: XML 1.0 document, ASCII text
test-new.xml: XML 1.0 document, ASCII text
test-old.not-xml: XML 1.0 document, ASCII text
test-old.xml: XML 1.0 document, ASCII text
$ file --mime test-*
test-new.not-xml: text/xml; charset=us-ascii
test-new.xml: text/xml; charset=us-ascii
test-old.not-xml: text/xml; charset=us-ascii
test-old.xml: text/xml; charset=us-ascii