Files reported as 0% similar when they are nearly entirely identical

I find it convenient to try comparing source tarballs for different versions of a software package with diffoscope to see how they change. It doesn't always work though.

https://github.com/gavinhoward/bc/releases/download/7.0.1/bc-7.0.1.tar.xz https://github.com/gavinhoward/bc/releases/download/7.0.2/bc-7.0.2.tar.xz

With diffoscope, I get this:

│ │   --- bc-7.0.1/manuals/bc/H.1.md
│ ├── +++ bc-7.0.2/manuals/bc/H.1.md
│ │┄ Files 0% similar despite different names
│ │ @@ -3327,2496 +3327,2496 @@
│ │  0000cfe0: 6465 6e74 616c 2066 756e 6374 696f 6e20  dental function 
│ │  0000cff0: 2873 6565 2074 6865 202a 5472 616e 7363  (see the *Transc
│ │  0000d000: 656e 6465 6e74 616c 2046 756e 6374 696f  endental Functio
│ │  0000d010: 6e73 2a0a 2020 2020 7375 6273 6563 7469  ns*.    subsecti
│ │  0000d020: 6f6e 2062 656c 6f77 292e 0a0a 2a2a 6672  on below)...**fr
│ │  0000d030: 616e 6428 7029 2a2a 0a0a 3a20 2020 4765  and(p)**..:   Ge
│ │  0000d040: 6e65 7261 7465 7320 6120 7073 6575 646f  nerates a pseudo
│ │ -0000d050: 2d72 616e 646f 6d20 696e 7465 6765 7220  -random integer 
│ │ -0000d060: 6265 7477 6565 6e20 2a2a 302a 2a20 2869  between **0** (i
│ │ -0000d070: 6e63 6c75 7369 7665 2920 616e 6420 2a2a  nclusive) and **
│ │ -0000d080: 312a 2a0a 2020 2020 2865 7863 6c75 7369  1**.    (exclusi

If I instead manually unpack them and run git diff --no-index bc-7.0.*:

diff --git a/bc-7.0.1/manuals/bc/H.1.md b/bc-7.0.2/manuals/bc/H.1.md
index aa313cd..fbc0658 100644
--- a/bc-7.0.1/manuals/bc/H.1.md
+++ b/bc-7.0.2/manuals/bc/H.1.md
@@ -1433,7 +1433,7 @@ The extended library is a **non-portable extension**.
 
 **frand(p)**
 
-:   Generates a pseudo-random integer between **0** (inclusive) and **1**
+:   Generates a pseudo-random number between **0** (inclusive) and **1**
     (exclusive) with the number of decimal digits after the decimal point equal
     to the truncated absolute value of **p**. If **p** is not **0**, then
     calling this function will change the value of **seed**. If **p** is **0**,
@@ -1441,7 +1441,7 @@ The extended library is a **non-portable extension**.
 
 **ifrand(i, p)**
 
-:   Generates a pseudo-random integer that is between **0** (inclusive) and the
+:   Generates a pseudo-random number that is between **0** (inclusive) and the
     truncated absolute value of **i** (exclusive) with the number of decimal
     digits after the decimal point equal to the truncated absolute value of
     **p**. If the absolute value of **i** is greater than or equal to **2**, and

In this case diffoscope has somehow decided that comparing them as text files would yield a suboptimal diff -- why? They are Markdown.

Oh, well, file reports:

exported SGML document, ASCII text, with very long lines (512)

(It's not an SGML document, but it certainly is ASCII text. The top of the file uses <!--- to wrap the copyright header, which is presumably why file guesses SGML.)

Using hexdump to report fine-grained differences between binary files makes a lot of sense. I wonder, though, whether there's some way to tune it so that it isn't used for files that are "human-readable". "Contains only printable ASCII" would be a good criterion here; Unicode raises much tougher questions.
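
To make the "contains only printable ASCII" criterion concrete, here is a minimal sketch in Python. It is not diffoscope's actual code or API: the function name `looks_like_plain_text` is hypothetical, and treating tabs, newlines and other ASCII whitespace as "printable" is an assumption on my part.

```python
import string

# Assumption: "human-readable" means printable ASCII plus ASCII whitespace
# (space, tab, newline, carriage return, vertical tab, form feed).
_TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def looks_like_plain_text(data: bytes) -> bool:
    """Return True if every byte in data is printable ASCII or whitespace.

    Hypothetical helper for illustration only; not part of diffoscope.
    """
    return all(byte in _TEXT_BYTES for byte in data)
```

A file like H.1.md would pass this check regardless of what file guesses from the leading <!--- comment, while anything containing NUL bytes or non-ASCII bytes (including UTF-8 text) would not, which is exactly where the Unicode questions start.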
