Commit 77cc5a46 authored by Emmanuel Bouthenot

New upstream version 0.11

parent 7d2d06d2
Frank DENIS <j at pureftpd.org>
/*
* Copyright (c) 2007, 2008, 2009 Frank DENIS <j at pureftpd.org>
*
* Permission to use, copy, modify, and distribute this software for any
* purpose with or without fee is hereby granted, provided that the above
* copyright notice and this permission notice appear in all copies.
*
* THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
* WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
* MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
* ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/
AUTOMAKE_OPTIONS = gnu
EXTRA_DIST = \
THANKS \
README-PHP
SUBDIRS = \
src \
man \
php
.:. LIBPUZZLE .:.
http://libpuzzle.pureftpd.org
------------------------ BLURB ------------------------
The Puzzle library is designed to quickly find visually similar images (gif,
png, jpg), even if they have been resized, recompressed, recolored or slightly
modified.
The library is free, lightweight yet very fast, configurable, easy to use and
it has been designed with security in mind. This is a C library, but it also
comes with a command-line tool and PHP bindings.
------------------------ REFERENCE ------------------------
The Puzzle library is an implementation of "An image signature for any kind of
image", by H. CHI WONG, Marschall BERN and David GOLDBERG.
------------------------ COMPILATION ------------------------
In order to load images, the library relies on the GD2 library.
You need to install gdlib2 and its development headers before compiling
libpuzzle.
The GD2 library is available as a pre-built package for most operating systems.
Debian and Ubuntu users should install the "libgd2-dev" or the "libgd2-xpm-dev"
package.
Gentoo users should install "media-libs/gd".
OpenBSD, NetBSD and DragonflyBSD users should install the "gd" package.
MacPorts users should install the "gd2" package.
X11 support is not required for the Puzzle library.
Once GD2 has been installed, configure the Puzzle library as usual:
./configure
This is a standard autoconf script. If you're not familiar with it, please
have a look at the INSTALL file.
Compile the beast:
make
Try the built-in tests:
make check
If everything looks fine, install the software:
make install
If anything goes wrong, please submit a bug report to:
libpuzzle [at] pureftpd [dot] org
------------------------ USAGE ------------------------
The API is documented in the libpuzzle(3) and puzzle_set(3) man pages.
You can also play with the puzzle-diff test application.
See puzzle-diff(8) for more info about the puzzle-diff application.
In order to be thread-safe, every exported function of the library requires a
PuzzleContext object. That object stores various run-time tunables.
Out of a bitmap picture, the Puzzle library can fill a PuzzleCvec object:
PuzzleContext context;
PuzzleCvec cvec;
puzzle_init_context(&context);
puzzle_init_cvec(&context, &cvec);
puzzle_fill_cvec_from_file(&context, &cvec, "directory/filename.jpg");
The PuzzleCvec structure holds two fields:
signed char *vec: a pointer to the first element of the vector
size_t sizeof_vec: the number of elements
The size depends on the "lambdas" value (see puzzle_set(3)).
PuzzleCvec structures can be compared:
d = puzzle_vector_normalized_distance(&context, &cvec1, &cvec2, 1);
d is the normalized distance between both vectors. If d is below 0.6, pictures
are probably similar.
If you need further help, feel free to subscribe to the mailing-list (see
below).
------------------------ INDEXING ------------------------
How to quickly find similar pictures if there are millions of records?
The original paper has a simple, yet efficient answer.
Cut the vector in fixed-length words. For instance, let's consider the
following vector:
[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
With a word length (K) of 10, you can get the following words:
[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1
Then, index your vector with a compound index of (word + position).
Even with millions of images, K = 10 and N = 100 should be enough to have very
few entries sharing the same index.
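The word extraction described above can be sketched in C as follows. This is
only an illustration: sig_word_at() and sig_index_key() are hypothetical
helpers, not functions provided by libpuzzle, and storing the position as a
single leading byte is an assumption made to keep the example short (it limits
positions to 0..255, which is enough for N = 100).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libpuzzle: copy the word of length k
 * that starts at position pos of a signature into out.
 * Returns 0 on success, -1 if the word would run past the end. */
static int sig_word_at(const signed char *vec, size_t sizeof_vec,
                       size_t pos, size_t k, signed char *out)
{
    assert(vec != NULL && out != NULL);
    if (k == 0 || k > sizeof_vec || pos > sizeof_vec - k) {
        return -1;
    }
    memcpy(out, vec + pos, k);
    return 0;
}

/* Hypothetical helper: build the compound key (position + word) as one
 * byte string. The position is stored in the first byte, so this sketch
 * only supports positions below 256. key must hold k + 1 bytes. */
static int sig_index_key(const signed char *vec, size_t sizeof_vec,
                         size_t pos, size_t k, unsigned char *key)
{
    if (pos > 255) {
        return -1;
    }
    key[0] = (unsigned char) pos;
    return sig_word_at(vec, sizeof_vec, pos, k, (signed char *) (key + 1));
}
```

Calling sig_index_key() for every position from 0 to N-1 yields the keys to
store for one signature. To look up a picture, build the same keys from its
signature and fetch the signatures sharing at least one key, then confirm the
candidates with puzzle_vector_normalized_distance().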
Here's a very basic sample database schema:
+-----------------------------+
| signatures |
+-----------------------------+
| sig_id | signature | pic_id |
+--------+-----------+--------+
+--------------------------+
| words |
+--------------------------+
| pos_and_word | fk_sig_id |
+--------------+-----------+
I'd recommend splitting at least the "words" table into multiple tables and/or
servers.
By default (lambdas=9) signatures are 544 bytes long. In order to save storage
space, they can be compressed to one third of their original size through the
puzzle_compress_cvec() function. Before use, they must be uncompressed with
puzzle_uncompress_cvec().
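To give an intuition of why a factor of three is reachable, here is a minimal
sketch. It assumes that every element of a signature takes one of five values,
-2 to 2, as in the reference paper; this README does not state that range.
Three such values fit in one byte, because 5 * 5 * 5 = 125 is below 256.
pack_base5() and unpack_base5() are hypothetical helpers, and the encoding
actually used by puzzle_compress_cvec() may differ.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libpuzzle: pack n values taken from
 * {-2, ..., 2} three per byte, as base-5 digits. out must hold
 * (n + 2) / 3 bytes. Returns the number of bytes written, or 0 if a
 * value is out of range (0 is also returned for an empty input). */
static size_t pack_base5(const signed char *vec, size_t n, unsigned char *out)
{
    size_t i, written = 0;

    for (i = 0; i < n; i += 3) {
        unsigned int byte = 0, scale = 1;
        size_t j;

        for (j = i; j < i + 3 && j < n; j++) {
            if (vec[j] < -2 || vec[j] > 2) {
                return 0;
            }
            byte += (unsigned int) (vec[j] + 2) * scale;
            scale *= 5;
        }
        out[written++] = (unsigned char) byte;
    }
    return written;
}

/* Hypothetical helper: reverse of pack_base5(); n is the number of
 * values to recover, not the number of packed bytes. */
static void unpack_base5(const unsigned char *in, size_t n, signed char *vec)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned int byte = in[i / 3];
        size_t j;

        for (j = 0; j < i % 3; j++) {
            byte /= 5;
        }
        vec[i] = (signed char) ((int) (byte % 5) - 2);
    }
}
```

With this scheme a 544-element signature would need 182 bytes. In real code,
use puzzle_compress_cvec() and puzzle_uncompress_cvec() rather than a
hand-written packer, so that stored signatures stay compatible with the
library.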
------------------------ PUZZLE-DIFF ------------------------
A command-line tool is also available for scripting or testing.
It is installed as "puzzle-diff" and comes with a man page.
Sample usage:
- Output distance between two images:
$ puzzle-diff pic-a-0.jpg pic-a-1.jpg
0.102286
- Compare two images, exit with 10 if they look the same, exit with 20 if
they don't (may be useful for scripts):
$ puzzle-diff -e pic-a-0.jpg pic-a-1.jpg
$ echo $?
10
- Compute the distance without cropping, using the average intensity of the
whole blocks:
$ puzzle-diff -p 1.0 -c pic-a-0.jpg pic-a-1.jpg
0.0523151
------------------------ COMPARING IMAGES WITH PHP ------------------------
A PHP extension is bundled with the Libpuzzle package, and it provides PHP
bindings to most functions of the library.
Documentation for the Libpuzzle PHP extension is available in the README-PHP
file.
------------------------ APPS USING LIBPUZZLE ------------------------
Here are third-party projects using libpuzzle:
* ftwin - http://jok.is-a-geek.net/ftwin.php
ftwin is a tool useful to find duplicate files according to their content on
your file system.
------------------------ CONTACT ------------------------
The main web site for the project is: http://libpuzzle.pureftpd.org
If you need to share ideas with other users, or if you need help, feel free to
subscribe to the mailing-list.
In order to subscribe, just send a mail with random content to:
listpuzzle-subscribe at pureftpd dot org
For anything else, you can get in touch with me at:
libpuzzle at pureftpd dot org
If you are interested in bindings for Ruby, Python, PHP, etc. just ask!
Thank you,
-Frank.
.:. LIBPUZZLE - PHP EXTENSION .:.
http://libpuzzle.pureftpd.org
------------------------ PHP EXTENSION ------------------------
The Puzzle library can also be used through PHP, using a native extension.
Prerequisites are the PHP headers, libtool, autoconf and automake.
Here are the basic steps in order to install the extension:
(on OpenBSD: export AUTOMAKE_VERSION=1.9 ; export AUTOCONF_VERSION=2.61)
cd php/libpuzzle
phpize
./configure --with-libpuzzle
make clean
make
make install
If libpuzzle is installed in a non-standard location, use:
./configure --with-libpuzzle=/base/directory/for/libpuzzle
Then edit your php.ini file and add:
extension=libpuzzle.so
------------------------ USAGE ------------------------
The PHP extension provides bindings for the following tuning functions:
- puzzle_set_max_width()
- puzzle_set_max_height()
- puzzle_set_lambdas()
- puzzle_set_noise_cutoff()
- puzzle_set_p_ratio()
- puzzle_set_contrast_barrier_for_cropping()
- puzzle_set_max_cropping_ratio()
- puzzle_set_autocrop()
Have a look at the puzzle_set man page for more info about those.
Getting the signature of a picture is as simple as:
$signature = puzzle_fill_cvec_from_file($filename);
In order to compute the similarity between two pictures using their
signatures, use:
$d = puzzle_vector_normalized_distance($signature1, $signature2);
The result is between 0.0 and 1.0, with 0.6 being a good threshold to detect
visually similar pictures.
The PUZZLE_CVEC_SIMILARITY_THRESHOLD, PUZZLE_CVEC_SIMILARITY_HIGH_THRESHOLD,
PUZZLE_CVEC_SIMILARITY_LOW_THRESHOLD and PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD
constants can also be used to get common thresholds:
if ($d < PUZZLE_CVEC_SIMILARITY_THRESHOLD) {
echo "Pictures look similar\n";
}
Before storing a signature into a database, you can compress it in order to
save some storage space:
$compressed_signature = puzzle_compress_cvec($signature);
Before use, those compressed signatures must be uncompressed with:
$signature = puzzle_uncompress_cvec($compressed_signature);
Xerox Research Center
H. CHI WONG
Marschall BERN
David GOLDBERG
Sameh CHAFIK
Gregory MAXWELL
/* config.h.in. Generated from configure.ac by autoheader. */
/* Define to 1 if you have the <dlfcn.h> header file. */
#undef HAVE_DLFCN_H
/* Define to 1 if you have the <inttypes.h> header file. */
#undef HAVE_INTTYPES_H
/* Define to 1 if you have the `gd' library (-lgd). */
#undef HAVE_LIBGD
/* Define to 1 if you have the `math' library (-lmath). */
#undef HAVE_LIBMATH
/* Define to 1 if you have the <limits.h> header file. */
#undef HAVE_LIMITS_H
/* Define to 1 if your system has a GNU libc compatible `malloc' function, and
to 0 otherwise. */
#undef HAVE_MALLOC
/* Define to 1 if you have the <memory.h> header file. */
#undef HAVE_MEMORY_H
/* Define to 1 if your system has a GNU libc compatible `realloc' function,
and to 0 otherwise. */
#undef HAVE_REALLOC
/* Define to 1 if you have the <stddef.h> header file. */
#undef HAVE_STDDEF_H
/* Define to 1 if you have the <stdint.h> header file. */
#undef HAVE_STDINT_H
/* Define to 1 if you have the <stdlib.h> header file. */
#undef HAVE_STDLIB_H
/* Define to 1 if you have the <strings.h> header file. */
#undef HAVE_STRINGS_H
/* Define to 1 if you have the <string.h> header file. */
#undef HAVE_STRING_H
/* Define to 1 if you have the `strtoul' function. */
#undef HAVE_STRTOUL
/* Define to 1 if you have the <sys/stat.h> header file. */
#undef HAVE_SYS_STAT_H
/* Define to 1 if you have the <sys/types.h> header file. */
#undef HAVE_SYS_TYPES_H
/* Define to 1 if you have the <unistd.h> header file. */
#undef HAVE_UNISTD_H
/* Define to the sub-directory in which libtool stores uninstalled libraries.
*/
#undef LT_OBJDIR
/* Name of package */
#undef PACKAGE
/* Define to the address where bug reports for this package should be sent. */
#undef PACKAGE_BUGREPORT
/* Define to the full name of this package. */
#undef PACKAGE_NAME
/* Define to the full name and version of this package. */
#undef PACKAGE_STRING
/* Define to the one symbol short name of this package. */
#undef PACKAGE_TARNAME
/* Define to the version of this package. */
#undef PACKAGE_VERSION
/* Define to 1 if you have the ANSI C header files. */
#undef STDC_HEADERS
/* Version number of package */
#undef VERSION
/* Define to empty if `const' does not conform to ANSI C. */
#undef const
/* Define to rpl_malloc if the replacement function should be used. */
#undef malloc
/* Define to `long int' if <sys/types.h> does not define. */
#undef off_t
/* Define to rpl_realloc if the replacement function should be used. */
#undef realloc
/* Define to `unsigned int' if <sys/types.h> does not define. */
#undef size_t
/* Define to `int' if <sys/types.h> does not define. */
#undef ssize_t
# -*- Autoconf -*-
# Process this file with autoconf to produce a configure script.
AC_PREREQ(2.61)
AC_INIT(libpuzzle, 0.11, bugs@pureftpd.org)
AC_CONFIG_SRCDIR([src/puzzle.h])
AC_CONFIG_HEADER([config.h])
AM_INIT_AUTOMAKE([1.9 dist-bzip2])
AM_MAINTAINER_MODE
# Checks for programs.
AC_PROG_CXX
AC_PROG_CC
AC_PROG_CPP
AC_PROG_INSTALL
AC_PROG_LN_S
AC_PROG_MAKE_SET
AC_PATH_PROG(GDLIBCONFIG, [gdlib-config])
CPPFLAGS="$CPPFLAGS -D_GNU_SOURCE=1"
CPPFLAGS="$CPPFLAGS `$GDLIBCONFIG --cflags`"
LDFLAGS="$LDFLAGS `$GDLIBCONFIG --ldflags`"
LDADD="$LDADD `$GDLIBCONFIG --libs`"
# Checks for libraries.
AC_CHECK_LIB([gd], [gdImageCreateFromGd2],,
AC_ERROR([libgd2 development files not found]))
# Checks for header files.
AC_HEADER_STDC
AM_PROG_LIBTOOL
AC_CHECK_HEADERS([limits.h memory.h stddef.h stdlib.h string.h unistd.h])
# Checks for typedefs, structures, and compiler characteristics.
AC_C_CONST
AC_TYPE_SIZE_T
AC_TYPE_SSIZE_T
AC_TYPE_OFF_T
# Checks for library functions.
AC_FUNC_MALLOC
AC_FUNC_REALLOC
AC_FUNC_MEMCMP
AC_CHECK_FUNC([floor], ,[AC_CHECK_LIB([math], [floor])])
AC_CHECK_FUNC([round], ,[AC_CHECK_LIB([math], [round])])
AC_CHECK_FUNCS([strtoul])
AC_SUBST([MAINT])
AC_CONFIG_FILES([Makefile
man/Makefile
src/Makefile
src/pics/Makefile
php/Makefile
php/libpuzzle/Makefile
php/libpuzzle/include/Makefile
php/libpuzzle/modules/Makefile
php/libpuzzle/build/Makefile
php/libpuzzle/tests/Makefile
php/libpuzzle/tests/pics/Makefile
php/examples/Makefile
php/examples/similar/Makefile
])
AC_OUTPUT
AC_MSG_NOTICE([+-------------------------------------------------------+])
AC_MSG_NOTICE([| You can subscribe to the Libpuzzle users mailing-list |])
AC_MSG_NOTICE([| to ask for help and to stay informed of new releases. |])
AC_MSG_NOTICE([| Go to http://libpuzzle.pureftpd.org/ml/ now! |])
AC_MSG_NOTICE([+-------------------------------------------------------+])
man_MANS = \
libpuzzle.3 \
puzzle_set.3 \
puzzle-diff.8
EXTRA_DIST = \
$(man_MANS)
.\"
.\" Copyright (c) 2007 Frank DENIS <j at pureftpd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: September 24 2007 $
.Dt LIBPUZZLE 3
.Sh NAME
.Nm puzzle_init_cvec ,
.Nm puzzle_init_dvec ,
.Nm puzzle_fill_dvec_from_file ,
.Nm puzzle_fill_cvec_from_file ,
.Nm puzzle_fill_cvec_from_dvec ,
.Nm puzzle_free_cvec ,
.Nm puzzle_free_dvec ,
.Nm puzzle_init_compressed_cvec ,
.Nm puzzle_free_compressed_cvec ,
.Nm puzzle_compress_cvec ,
.Nm puzzle_uncompress_cvec ,
.Nm puzzle_vector_normalized_distance
.Nd compute comparable signatures of bitmap images.
.Sh SYNOPSIS
.Fd #include <puzzle.h>
.Ft int
.Fn puzzle_init_context "PuzzleContext *context"
.Ft int
.Fn puzzle_free_context "PuzzleContext *context"
.Ft int
.Fn puzzle_init_cvec "PuzzleContext *context" "PuzzleCvec *cvec"
.Ft int
.Fn puzzle_init_dvec "PuzzleContext *context" "PuzzleDvec *dvec"
.Ft void
.Fn puzzle_fill_dvec_from_file "PuzzleContext *context" "PuzzleDvec * dvec" "const char *file"
.Ft void
.Fn puzzle_fill_cvec_from_file "PuzzleContext *context" "PuzzleCvec * cvec" "const char *file"
.Ft void
.Fn puzzle_fill_cvec_from_dvec "PuzzleContext *context" "PuzzleCvec * cvec" "const PuzzleDvec *dvec"
.Ft void
.Fn puzzle_free_cvec "PuzzleContext *context" "PuzzleCvec *cvec"
.Ft void
.Fn puzzle_free_dvec "PuzzleContext *context" "PuzzleDvec *dvec"
.Ft void
.Fn puzzle_init_compressed_cvec "PuzzleContext *context" "PuzzleCompressedCvec * compressed_cvec"
.Ft void
.Fn puzzle_free_compressed_cvec "PuzzleContext *context" "PuzzleCompressedCvec * compressed_cvec"
.Ft int
.Fn puzzle_compress_cvec "PuzzleContext *context" "PuzzleCompressedCvec * compressed_cvec" "const PuzzleCvec * cvec"
.Ft int
.Fn puzzle_uncompress_cvec "PuzzleContext *context" "PuzzleCompressedCvec * compressed_cvec" "PuzzleCvec * const cvec"
.Ft double
.Fn puzzle_vector_normalized_distance "PuzzleContext *context" "const PuzzleCvec * cvec1" "const PuzzleCvec * cvec2" "const int fix_for_texts"
.Sh DESCRIPTION
The Puzzle library computes a signature out of a bitmap picture.
Signatures are comparable and similar pictures have similar signatures.
.Pp
After a picture has been loaded and uncompressed, featureless parts of
the image are skipped (autocrop), unless that step has been explicitly
disabled, see
.Xr puzzle_set 3
.Sh LIBPUZZLE CONTEXT
Every public function requires a
.Va PuzzleContext
object, which stores all required tunables.
.Pp
Any application using libpuzzle should initialize a
.Va PuzzleContext
object with
.Fn puzzle_init_context
and free it after use with
.Fn puzzle_free_context
.Bd \-literal \-offset indent
PuzzleContext context;
puzzle_init_context(&context);
...
puzzle_free_context(&context);
.Ed
.Sh DVEC AND CVEC VECTORS
The next step is to divide the cropped image into a grid and to compute
the average intensity of soft\(hyedged pixels in every block. The result is a
.Va PuzzleDvec
object.
.Pp
.Va PuzzleDvec
objects should be initialized before use, with
.Fn puzzle_init_dvec
and freed after use with
.Fn puzzle_free_dvec
.Pp
The
.Va PuzzleDvec
structure has two important fields:
.Va vec
is the pointer to the first element of the array containing the average
intensities, and
.Va sizeof_compressed_vec
is the number of elements.
.Pp
.Va PuzzleDvec
objects are not comparable, so what you usually want is to transform these
objects into
.Va PuzzleCvec
objects.
.Pp
A
.Va PuzzleCvec
object is a vector with relationships between adjacent blocks from a
.Va PuzzleDvec
object.
.Pp
The
.Fn puzzle_fill_cvec_from_dvec
fills a
.Va PuzzleCvec
object from a
.Va PuzzleDvec
object.
.Pp
But just like the other structure,
.Va PuzzleCvec
objects must be initialized and freed with
.Fn puzzle_init_cvec
and
.Fn puzzle_free_cvec
.Pp
.Va PuzzleCvec
objects have a vector whose first element is in the
.Va vec
field, and the number of elements is in the
.Va sizeof_vec
field.
.Sh LOADING PICTURES
.Va PuzzleDvec
and
.Va PuzzleCvec
objects can be computed from a bitmap picture file, with
.Fn puzzle_fill_dvec_from_file
and
.Fn puzzle_fill_cvec_from_file
.Pp
.Em GIF
,
.Em PNG
and
.Em JPEG
file formats are currently supported and automatically recognized.
.Pp
Here's a simple example that creates a
.Va PuzzleCvec
object out of a file.
.Bd \-literal \-offset indent
PuzzleContext context;
PuzzleCvec cvec;
puzzle_init_context(&context);
puzzle_init_cvec(&context, &cvec);
puzzle_fill_cvec_from_file(&context, &cvec, "test\-picture.jpg");
...
puzzle_free_cvec(&context, &cvec);
puzzle_free_context(&context);
.Ed
.Sh COMPARING VECTORS
In order to check whether two pictures are similar, you need to compare their
.Va PuzzleCvec
signatures, using
.Fn puzzle_vector_normalized_distance
.Pp
That function returns a distance between 0.0 and 1.0. The smaller the
distance, the more similar the pictures.
.Pp
Tests on common pictures show that a normalized distance below 0.6 (also defined as
.Va PUZZLE_CVEC_SIMILARITY_THRESHOLD
) means that both pictures are probably visually similar.
.Pp
If that threshold is not right for your set of pictures, you can experiment
with
.Va PUZZLE_CVEC_SIMILARITY_HIGH_THRESHOLD
,
.Va PUZZLE_CVEC_SIMILARITY_LOW_THRESHOLD
and
.Va PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD
or with your own value.
.Pp
If the
.Fa fix_for_texts
of
.Fn puzzle_vector_normalized_distance
is
.Em 1
, a fix is applied to the computation in order to deal with bitmap pictures
that contain text. That fix is recommended, as it allows using the same
threshold for that kind of picture as for generic pictures.
.Pp
If
.Fa fix_for_texts
is
.Em 0
, that special way of computing the normalized distance is disabled.
.Bd \-literal \-offset indent
PuzzleContext context;
PuzzleCvec cvec1, cvec2;
double d;
puzzle_init_context(&context);
puzzle_init_cvec(&context, &cvec1);
puzzle_init_cvec(&context, &cvec2);
puzzle_fill_cvec_from_file(&context, &cvec1, "test\-picture\-1.jpg");
puzzle_fill_cvec_from_file(&context, &cvec2, "test\-picture\-2.jpg");
d = puzzle_vector_normalized_distance(&context, &cvec1, &cvec2, 1);
if (d < PUZZLE_CVEC_SIMILARITY_THRESHOLD) {
puts("Pictures are similar");
}
puzzle_free_cvec(&context, &cvec2);
puzzle_free_cvec(&context, &cvec1);
puzzle_free_context(&context);
.Ed
.Sh CVEC COMPRESSION
In order to reduce storage needs,
.Va PuzzleCvec
objects can be compressed to 1/3 of their original size.
.Pp
.Va PuzzleCompressedCvec
structures hold the compressed data. Before and after use, these structures
have to be passed to
.Fn puzzle_init_compressed_cvec
and
.Fn puzzle_free_compressed_cvec
.Pp
.Fn puzzle_compress_cvec