Files reported as 0% similar when they are nearly entirely identical
I find it convenient to compare source tarballs for different versions of a software package with diffoscope to see what changed between them. It doesn't always work, though.
https://github.com/gavinhoward/bc/releases/download/7.0.1/bc-7.0.1.tar.xz https://github.com/gavinhoward/bc/releases/download/7.0.2/bc-7.0.2.tar.xz
With diffoscope, I get this:
│ │ --- bc-7.0.1/manuals/bc/H.1.md
│ ├── +++ bc-7.0.2/manuals/bc/H.1.md
│ │┄ Files 0% similar despite different names
│ │ @@ -3327,2496 +3327,2496 @@
│ │ 0000cfe0: 6465 6e74 616c 2066 756e 6374 696f 6e20 dental function
│ │ 0000cff0: 2873 6565 2074 6865 202a 5472 616e 7363 (see the *Transc
│ │ 0000d000: 656e 6465 6e74 616c 2046 756e 6374 696f endental Functio
│ │ 0000d010: 6e73 2a0a 2020 2020 7375 6273 6563 7469 ns*. subsecti
│ │ 0000d020: 6f6e 2062 656c 6f77 292e 0a0a 2a2a 6672 on below)...**fr
│ │ 0000d030: 616e 6428 7029 2a2a 0a0a 3a20 2020 4765 and(p)**..: Ge
│ │ 0000d040: 6e65 7261 7465 7320 6120 7073 6575 646f nerates a pseudo
│ │ -0000d050: 2d72 616e 646f 6d20 696e 7465 6765 7220 -random integer
│ │ -0000d060: 6265 7477 6565 6e20 2a2a 302a 2a20 2869 between **0** (i
│ │ -0000d070: 6e63 6c75 7369 7665 2920 616e 6420 2a2a nclusive) and **
│ │ -0000d080: 312a 2a0a 2020 2020 2865 7863 6c75 7369 1**. (exclusi
If I instead manually unpack them and run git diff --no-index bc-7.0.*:
diff --git a/bc-7.0.1/manuals/bc/H.1.md b/bc-7.0.2/manuals/bc/H.1.md
index aa313cd..fbc0658 100644
--- a/bc-7.0.1/manuals/bc/H.1.md
+++ b/bc-7.0.2/manuals/bc/H.1.md
@@ -1433,7 +1433,7 @@ The extended library is a **non-portable extension**.
**frand(p)**
-: Generates a pseudo-random integer between **0** (inclusive) and **1**
+: Generates a pseudo-random number between **0** (inclusive) and **1**
(exclusive) with the number of decimal digits after the decimal point equal
to the truncated absolute value of **p**. If **p** is not **0**, then
calling this function will change the value of **seed**. If **p** is **0**,
@@ -1441,7 +1441,7 @@ The extended library is a **non-portable extension**.
**ifrand(i, p)**
-: Generates a pseudo-random integer that is between **0** (inclusive) and the
+: Generates a pseudo-random number that is between **0** (inclusive) and the
truncated absolute value of **i** (exclusive) with the number of decimal
digits after the decimal point equal to the truncated absolute value of
**p**. If the absolute value of **i** is greater than or equal to **2**, and
In this case diffoscope has somehow decided that comparison as text files would yield a suboptimal diff -- why? They are markdown.
Oh, well, file reports:
exported SGML document, ASCII text, with very long lines (512)
(It's not an SGML document, but it certainly is ASCII text. The top of the file has used <!--- to encapsulate the copyright header.)
Using hexdump to report fine-grained differences between binary files makes a lot of sense. I wonder, however, whether there's some way to tune diffoscope so that hexdump isn't used for files that are "human-readable". "Contains only printable ASCII" would be a good criterion here; Unicode starts asking some pretty tough questions.
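To make the "contains only printable ASCII" criterion concrete, here is a minimal Python sketch of such a check. It is not how diffoscope decides anything today; the names is_printable_ascii and looks_human_readable are hypothetical, and treating tab, newline and carriage return as "printable" is my own assumption.

```python
# Hypothetical helper, not part of diffoscope.
# Assumption: "printable ASCII" means bytes 0x20-0x7E plus tab, LF and CR.
PRINTABLE_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r"


def is_printable_ascii(data: bytes) -> bool:
    """Return True if every byte in data is printable ASCII or common whitespace."""
    # Delete all allowed bytes; anything left over is non-printable or non-ASCII.
    return not data.translate(None, PRINTABLE_BYTES)


def looks_human_readable(path: str, chunk_size: int = 65536) -> bool:
    """Scan a file in chunks and stop at the first disallowed byte."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            if not is_printable_ascii(chunk):
                return False
    return True
```

With a check like this, a file such as manuals/bc/H.1.md would count as text no matter what file guesses about its type, while anything containing NUL bytes or bytes above 0x7E (including UTF-8) would not, which is exactly where the tough Unicode questions begin.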