README 23.6 KB
Newer Older
1 2 3
Introduction
--------------------------------------------------------------------------
jdupes is a program for identifying and taking actions upon duplicate
4
files.
5 6 7 8 9 10 11 12 13

A WORD OF WARNING: jdupes IS NOT a drop-in compatible replacement for
fdupes! Do not blindly replace fdupes with jdupes in scripts and expect
everything to work the same way. Option availability and meanings differ
between the two programs. For example, the -I switch in jdupes means
"isolate" and blocks intra-argument matching, while in fdupes it means
"immediately delete files during scanning without prompting the user."


14
Why use jdupes instead of the original fdupes or other duplicate finders?
15 16 17 18
--------------------------------------------------------------------------
The biggest reason is raw speed. In testing on various data sets, jdupes is
over 7 times faster than fdupes-1.51 on average.

19
jdupes provides a native Windows port. Most duplicate scanners built on
20 21 22 23 24
Linux and other UNIX-like systems do not compile for Windows out-of-the-box
and even if they do, they don't support Unicode and other Windows-specific
quirks and features.

jdupes is generally stable. All releases of jdupes are compared against a
25
known working reference versions of fdupes or jdupes to be certain that
26 27 28 29 30 31 32
output does not change. You get the benefits of an aggressive development
process without putting your data at increased risk.

Code in jdupes is written with data loss avoidance as the highest priority.
If a choice must be made between being aggressive or careful, the careful
way is always chosen.

33
jdupes includes features that are not always found elsewhere. Examples of
34 35 36 37 38 39 40 41
such features include btrfs block-level deduplication and control over
which file is kept when a match set is automatically deleted. jdupes is
not afraid of dropping features of low value; a prime example is the -1
switch which outputs all matches in a set on one line, a feature which was
found to be useless in real-world tests and therefore thrown out.

While jdupes maintains some degree of compatibility with fdupes from which
it was originally derived, there is no guarantee that it will continue to
42 43 44 45
maintain such compatibility in the future. However, compatibility will be
retained between minor versions, i.e. jdupes-1.6 and jdupes-1.6.1 should
not have any significant differences in results with identical command
lines.
46

47 48 49
If the program eats your dog or sets fire to your lawn, the authors cannot
be held responsible. If you notice a bug, please report it.

50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

What jdupes is not: a similar (but not identical) file finding tool
--------------------------------------------------------------------------
Please note that jdupes ONLY works on 100% exact matches. It does not have
any sort of "similarity" matching, nor does it know anything about any
specific file formats such as images or sounds. Something as simple as a
change in embedded metadata such as the ID3 tags in an MP3 file or the EXIF
information in a JPEG image will not change the sound or image presented to
the user when opened, but technically it makes the file no longer identical
to the original.

Plenty of excellent tools already exist to "fuzzy match" specific file types
using knowledge of their file formats to help. There are no plans to add
this type of matching to jdupes.

65 66 67 68 69 70 71
There are some match options available in jdupes that enable dangerous file
matching based on partial or likely but not 100% certain matching. These
are considered expert options for special situations and are clearly and
loudly documented as being dangerous. The -Q and -T options are notable
examples, and the extreme danger of the -T option is safeguarded by a
requirement to specify it twice so it can't be used accidentally.

72 73 74 75 76

Usage
--------------------------------------------------------------------------
Usage: jdupes [options] DIRECTORY...

77 78
Duplicate file sets will be printed by default unless a different action
option is specified (delete, summarize, link, dedupe, etc.)
79
 -@ --loud              output annoying low-level debug info while running
80
 -0 --printnull         output nulls instead of CR/LF (like 'find -print0')
81 82 83
 -1 --one-file-system   do not match files on different filesystems/devices
 -A --nohidden          exclude hidden files from consideration
 -B --dedupe            Send matches to btrfs for block-level deduplication
84
 -C --chunksize=#       override I/O chunk size (min 4096, max 16777216)
85 86 87 88 89 90 91 92 93 94 95 96 97
 -d --delete            prompt user for files to preserve and delete all
                        others; important: under particular circumstances,
                        data may be lost when using this option together
                        with -s or --symlinks, or when specifying a
                        particular directory more than once; refer to the
                        documentation for additional information
 -D --debug             output debug statistics after completion
 -f --omitfirst         omit the first file in each set of matches
 -h --help              display this help message
 -H --hardlinks         treat any linked files as duplicate files. Normally
                        linked files are treated as non-duplicates for safety
 -i --reverse           reverse (invert) the match sort order
 -I --isolate           files in the same specified directory won't match
98
 -l --linksoft          make relative symlinks for duplicates w/o prompting
99
 -L --linkhard          hard link all duplicate files without prompting
100
                        Windows allows a maximum of 1023 hard links per file
101
 -m --summarize         summarize dupe information
102
 -M --printwithsummary  will print matches and --summarize at the end
103 104 105 106 107 108 109 110
 -N --noprompt          together with --delete, preserve the first file in
                        each set of duplicates and delete the rest without
                        prompting the user
 -o --order=BY          select sort order for output, linking and deleting; by
 -O --paramorder        Parameter order is more important than selected -O sort
                        mtime (BY=time) or filename (BY=name, the default)
 -p --permissions       don't consider files with different owner/group or
                        permission bits as duplicates
111
 -P --print=type        print extra info (partial, early, fullhash)
112 113 114
 -q --quiet             hide progress indicator
 -Q --quick             skip byte-by-byte duplicate verification. WARNING:
                        this may delete non-duplicates! Read the manual first!
115 116 117 118 119 120
 -r --recurse           for every directory, process its subdirectories too
 -R --recurse:          for each directory given after this option follow
                        subdirectories encountered within (note the ':' at
                        the end of the option, manpage for more details)
 -s --symlinks          follow symlinks
 -S --size              show size of duplicate files
121
 -t --no-tocttou        disable security check for file changes (aka TOCTTOU)
122 123 124
 -T --partial-only      match based on partial hashes only. WARNING:
                        EXTREMELY DANGEROUS paired with destructive actions!
                        -T must be specified twice to work. Read the manual!
125 126 127 128
 -v --version           display jdupes version and license information
 -x --xsize=SIZE        exclude files of size < SIZE bytes from consideration
    --xsize=+SIZE       '+' specified before SIZE, exclude size > SIZE
 -X --exclude=spec:info exclude files based on specified criteria
129
                        specs: size+-=
130
                        Exclusions are cumulative: -X dir:abc -X dir:efg
131
 -z --zeromatch         consider zero-length files to be duplicates
132
 -Z --softabort         If the user aborts (i.e. CTRL-C) act on matches so far
133
                        You can send SIGUSR1 to the program to toggle this
134 135

For sizes, K/M/G/T/P/E[B|iB] suffixes can be used (case-insensitive)
136

137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
The -t/-no-tocttou option disables checks for file changes during and after
scanning. This opens a security vulnerability that is called a TOCTTOU (time
of check to time of use) vulnerability. The program normally runs checks
immediately before scanning or taking action upon a file to see if the file
has changed in some way since it was last checked. With this option enabled,
the program will not run any of these checks, making the algorithm slightly
faster, but also increasing the risk that the program scans a file, the file
is changed after the scan, and the program still acts like the file was in
its previous state. This is particularly dangerous when considering actions
such as linking and deleting. In the most extreme case, a file could be
deleted during scanning but match other files prior to that deletion; if the
file is the first in the list of duplicates and auto-delete is used, all of
the remaining matched files will be deleted as well. This option was added
due to user reports of some filesystems (particularly network filesystems)
changing the reported file information inappropriately, rendering the entire
program unusable on such filesystems.

154 155 156
The -n/--noempty option was removed for safety. Matching zero-length files as
duplicates now requires explicit use of the -z/--zeromatch option instead.

157
Duplicate files are listed together in groups with each file displayed on a
158
separate line. The groups are then separated from each other by blank lines.
159

160 161
The -s/--symlinks option will treat symlinked files as regular files, but
direct symlinks will be treated as if they are hard linked files and the
162
-H/--hardlinks option will apply to them in the same manner.
163 164 165 166 167 168 169 170 171 172

When using -d or --delete, care should be taken to insure against accidental
data loss. While no information will be immediately lost, using this option
together with -s or --symlink can lead to confusing information being
presented to the user when prompted for files to preserve. Specifically, a
user could accidentally preserve a symlink while deleting the file it points
to. A similar problem arises when specifying a particular directory more
than once. All files within that directory will be listed as their own
duplicates, leading to data loss should a user preserve a file without its
"duplicate" (the file itself!)
173

174 175 176 177 178 179 180 181 182 183
Using -1 or --one-file-system prevents matches that cross filesystems, but a
more relaxed form of this option may be added that allows cross-matching for
all filesystems that each parameter is present on.

-Z or --softabort used to be --hardabort in jdupes prior to v1.5 and had the
opposite behavior. Defaulting to taking action on abort is probably not what
most users would expect. The decision to invert rather than reassign to a
different option was made because this feature was still fairly new at the
time of the change.

184 185 186 187 188 189 190 191 192 193 194
On non-Windows platforms that support SIGUSR1, you can toggle the state of
the -Z option by sending a SIGUSR1 to the program. This is handy if you want
to abort jdupes, didn't specify -Z, and changed your mind and don't want to
lose all the work that was done so far. Just do 'killall -USR1 jdupes' and
you wll be able to abort with -Z. This works in reverse: if you want to
prevent a -Z from happening, a SIGUSR1 will toggle it back off. That's a lot
less useful because you can just stop and kill the program to get the same
effect, but it's there if you want it for some reason. Sending the signal
twice while the program is stopped will behave as if it was only sent once,
as per normal POSIX signal behavior.

195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229
The -O or --paramorder option allows the user greater control over wha
appears in the first position of a match set, specifically for keeping the -N
option from deleting all but one file in a set in a seemingly random way. All
directories specified on the command line will be used as the sorting order
of result sets first, followed by the sorting algorithm set by the -o or
--order option. This means that the order of all match pairs for a single
directory specification will retain the old sorting behavior even if this
option is specified.

When used together with options -s or --symlink, a user could accidentally
preserve a symlink while deleting the file it points to.

The -Q or --quick option only reads each file once, hashes it, and performs
comparisons based solely on the hashes. There is a small but significant risk
of a hash collision which is the purpose of the failsafe byte-for-byte
comparison that this option explicitly bypasses. Do not use it on ANY data
set for which any amount of data loss is unacceptable. You have been warned!

The -T or --partial-only option produces results based on a hash of the first
block of file data in each file, ignoring everything else in the file.
Partial hash checks have always been an important exclusion step in the
jdupes algorithm, usually hashing the first 4096 bytes of data and allowing
files that are different at the start to be rejected early. In certain
scenarios it may be a useful heuristic for a user to see that a set of files
has the same size and the same starting data, even if the remaining data does
not match; one example of this would be comparing files with data blocks that
are damaged or missing such as an incomplete file transfer or checking a data
recovery against known-good copies to see what damaged data can be deleted in
favor of restoring the known-good copy. This option is meant to be used with
informational actions and can result in EXTREME DATA LOSS if used with
options that delete files, create hard links, or perform other destructive
actions on data based on the matching output. Because of the potential for
massive data destruction, this option MUST BE SPECIFIED TWICE to take effect
and will error out if it is only specified once.

230 231 232 233 234 235
The -I/--isolate option attempts to block matches that are contained in
the same specified directory parameter on the command line. Due to the
underlying nature of the jdupes algorithm, a lot of matches will be
blocked by this option that probably should not be. This code could use
improvement.

236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
The -C/--chunksize option overrides the size of the I/O "chunk" used for all
file operations. Larger numbers will increase the amount of data read at
once from each file and may improve performance when scanning lots of files
that are larger than the default chunk size by reducing "thrashing" of the
hard disk heads. Smaller numbers may increase algorithm speed depending on
the characteristics of your CPU but will usually increase I/O and system
call overhead as well. The nubmer also directly affects memory usage: I/O
chunk size is used for at least three allocations in the program, so using
a chunk size of 16777216 (16 MiB) will require 48 MiB of RAM. The default
is usually between 32768 and 65536 which results in the fastest raw speed
of the algorithm and generally good all-around performance. Feel free to
experiment with the number on your data set and report your experiences
(preferably with benchmarks and info on your data set.)

Using -P/--print will cause the program to print extra information that
may be useful but will pollute the output in a way that makes scripted
handling difficult. Its current purpose is to reveal more information about
the file matching process by printing match pairs that pass certain steps
of the process prior to full file comparison. This can be useful if you
have two files that are passing early checks but failing after full checks.

257

258
Hard and soft (symbolic) linking status symbols and behavior
259
--------------------------------------------------------------------------
260
A set of arrows are used in file linking to show what action was taken on
261 262 263 264
each link candidate. These arrows are as follows:

----> File was hard linked to the first file in the duplicate chain

265 266
-@@-> File was symlinked to the first file in the chain

267 268
-==-> Already a hard link to the first file in the chain

269 270
-//-> File linking failed due to an error during the linking process

271
If your data set has linked files and you do not use -H to always consider
272 273 274 275
them as duplicates, you may still see linked files appear together in match
sets. This is caused by a separate file that matches with linked files
independently and is the correct behavior. See notes below on the "triangle
problem" in jdupes for technical details.
276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291


Microsoft Windows platform-specific notes
--------------------------------------------------------------------------
The Windows port does not support Unicode, only ANSI file names. This is
because Unicode support on Windows is difficult to add to existing code
without making it very messy or breaking things. Support is eventually
planned for Unicode on Windows.

Windows has a hard limit of 1024 hard links per file. There is no way to
change this. The documentation for CreateHardLink() states: "The maximum
number of hard links that can be created with this function is 1023 per
file. If more than 1023 links are created for a file, an error results."
(The number is actually 1024, but they're ignoring the first file.)


292
The current jdupes algorithm's "triangle problem"
293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330
--------------------------------------------------------------------------
Pairs of files are excluded individually based on how the two files compare.
For example, if --hardlinks is not specified then two files which are hard
linked will not match one another for duplicate scanning purposes. The
problem with only examining files in pairs is that certain circumstances
will lead to the exclusion being overridden.

Let's say we have three files with identical contents:

a/file1
a/file2
a/file3

and 'a/file1' is linked to 'a/file3'. Here's how 'jdupes a/' sees them:

---
Are 'a/file1' and 'a/file2' matches? Yes
[point a/file1->duplicates to a/file2]

Are 'a/file1' and 'a/file3' matches? No (hard linked already, -H off)

Are 'a/file2' and 'a/file3' matches? Yes
[point a/file2->duplicates to a/file3]
---

Now you have the following duplicate list:

a/file1->duplicates ==> a/file2->duplicates ==> a/file3

The solution is to split match sets into multiple sets, but doing this
will also remove the guarantee that files will only ever appear in one
match set and could result in data loss if handled improperly. In the
future, options for "greedy" and "sparse" may be introduced to switch
between allowing triangle matches to be in the same set vs. splitting
sets after matching finishes without the "only ever appears once"
guarantee.


331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359
Does jdupes meet the "Good Practice when Deleting Duplicates" by rmlint?
--------------------------------------------------------------------------
Yes. If you've not read this list of cautions, it is available at
http://rmlint.readthedocs.io/en/latest/cautions.html

Here's a breakdown of how jdupes addresses each of the items listed.

"Backup your data"
"Measure twice, cut once"
These guidelines are for the user of duplicate scanning software, not the
software itself. Back up your files regularly. Use jdupes to print a list
of what is found as duplicated and check that list very carefully before
automatically deleting the files.

"Beware of unusual filename characters"
The only character that poses a concern in jdupes is a newline '\n' and
that is only a problem because the duplicate set printer uses them to
separate file names. Actions taken by jdupes are not parsed like a
command line, so spaces and other weird characters in names aren't a
problem. Escaping the names properly if acting on the printed output is a
problem for the user's shell script or other external program.

"Consider safe removal options"
This is also an exercise for the user.

"Traversal Robustness"
jdupes tracks each directory traversed by dev:inode pair to avoid adding
the contents of the same directory twice. This prevents the user from
being able to register all of their files twice by duplicating an entry
360 361 362 363
on the command line. Symlinked directories are only followed if they
weren't already followed earlier. Files are renamed to a temporary name
before any linking is done and if the link operation fails they are renamed
back to the original name.
364 365 366 367

"Collision Robustness"
jdupes uses jodyhash for file data hashing. This hash is extremely fast
with a low collision rate, but it still encounters collisions as any hash
368
function will ("secure" or otherwise) due to the pigeonhole principle. This
369
is why jdupes performs a full-file verification before declaring a match.
370
It's slower than matching by hash only, but the pigeonhole principle puts
371 372 373 374 375
all data sets larger than the hash at risk of collision, meaning a false
duplicate detection and data loss. The slower completion time is not as
important as data integrity. Checking for a match based on hashes alone
is irresponsible, and using secure hashes like MD5 or the SHA families
is orders of magnitude slower than jodyhash while still suffering from
376 377
the risk brought about by the pigeonholing. An example of this problem is
as follows: if you have 365 days in a year and 366 people, the chance of
378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410
having at least two birthdays on the same day is guaranteed; likewise,
even though SHA512 is a 512-bit (64-byte) wide hash, there are guaranteed
to be at least 256 pairs of data streams that causes a collision once any
of the data streams being hashed for comparison is 65 bytes (520 bits) or
larger.

"Unusual Characters Robustness"
jdupes does not protect the user from putting ASCII control characters in
their file names; they will mangle the output if printed, but they can
still be operated upon by the actions (delete, link, etc.) in jdupes.

"Seek Thrash Robustness"
jdupes uses an I/O chunk size that is optimized for reading as much as
possible from disk at once to take advantage of high sequential read
speeds in traditional rotating media drives while balancing against the
significantly higher rate of CPU cache misses triggered by an excessively
large I/O buffer size. Enlarging the I/O buffer further may allow for
lots of large files to be read with less head seeking, but the CPU cache
misses slow the algorithm down and memory usage increases to hold these
large buffers. jdupes is benchmarked periodically to make sure that the
chosen I/O chunk size is the best compromise for a wide variety of data
sets.

"Memory Usage Robustness"
This is a very subjective concern considering that even a cell phone in
someone's pocket has at least 1GB of RAM, however it still applies in the
embedded device world where 32MB of RAM might be all that you can have.
Even when processing a data set with over a million files, jdupes memory
usage (tested on Linux x86_64 with -O3 optimization) doesn't exceed 2GB.
A low memory mode can be chosen at compile time to reduce overall memory
usage with a small performance penalty.


411 412 413
Contact Information
--------------------------------------------------------------------------
For all jdupes inquiries, contact Jody Bruchon <jody@jodybruchon.com>
414
Please DO NOT contact Adrian Lopez about issues with jdupes.
415 416


417
Legal Information and Software License
418
--------------------------------------------------------------------------
419 420 421
jdupes is Copyright (C) 2015-2018 by Jody Bruchon <jody@jodybruchon.com>
Derived from the original 'fdupes' (C) 1999-2018 by Adrian Lopez
Includes other code libraries which are (C) 2015-2018 by Jody Bruchon
422 423

The MIT License
424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation files
(the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
443