Provided by: magicrescue_1.1.10+dfsg-2build2_amd64

NAME

       dupemap - Creates a database of file checksums and uses it to eliminate duplicates

SYNOPSIS

       dupemap [ options ] [ -d database ] operation path...

DESCRIPTION

       dupemap recursively scans each path to find checksums of file contents.  Directories are searched through
       in no particular order.  Its actions depend on whether the -d option is given, and on the operation
       parameter, which must be a comma-separated list of scan, report, delete:

   Without -d
       dupemap will take action when it sees the same checksum more than once, i.e. it simply finds duplicates
       recursively.  The action depends on operation:

       report Report what files are encountered more than once, printing their names to standard output.

       delete[,report]
              Delete files that are encountered more than once.  Print their names if report is also given.

              WARNING: use the report operation first to see what will be deleted.

              WARNING:  You are advised to make a backup of the target first, e.g. with "cp -al" (for GNU cp) to
              create hard links recursively.
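
       For example, a cautious sequence in this mode (using the same example directory as the "USAGE" section
       below) is to preview the duplicates, take a hard-link snapshot, and only then delete:

           $ dupemap report /tmp/recovered-files
           $ cp -al /tmp/recovered-files /tmp/recovered-files.bak
           $ dupemap delete /tmp/recovered-files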

   With -d
       The database argument to -d will denote a database file (see the "DATABASE" section in  this  manual  for
       details)  to read from or write to.  In this mode, the scan operation should be run on one path, followed
       by the report or delete operation on another (not the same!) path.

       scan   Add the checksum of each file to database.  This operation must be run  initially  to  create  the
              database.   To  start  over,  you  must  manually  delete the database file(s) (see the "DATABASE"
              section).

       report Print each file name if its checksum is found in database.

       delete[,report]
              Delete each file if its checksum is found in database.  If report is also present, print the  name
              of each deleted file.

               WARNING: if you run dupemap delete on the same path you just ran dupemap scan on, it will delete
               every file!  The idea of these operations is to scan one path and then delete files in a second
               path.

              WARNING: use the report operation first to see what will be deleted.

              WARNING: You are advised to make a backup of the target first, e.g. with "cp -al" (for GNU cp)  to
              create hard links recursively.
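
       As a sketch of this mode (the database name and all paths below are placeholders), scan one tree, take a
       hard-link backup of the second tree, preview what would be removed from it, and only then delete:

           $ dupemap -d reference.map scan /path/to/reference
           $ cp -al /path/to/target /path/to/target.bak
           $ dupemap -d reference.map report /path/to/target
           $ dupemap -d reference.map delete,report /path/to/target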

OPTIONS

       -d database
              Use database as an on-disk database to read from or write to.  See the "DESCRIPTION" section above
              about how this influences the operation of dupemap.

       -I file
               Read input files from file in addition to those listed on the command line.  If file is "-", read
               from standard input.  Each line is interpreted as a file name.

               The paths given here will NOT be scanned recursively.  Directories will be ignored and symlinks
               will be followed.  (See the example after this option list.)

       -m minsize
              Ignore files below this size.

       -M maxsize
              Ignore files above this size.
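
       For example, a sketch combining -I with a size limit (the paths and file names are placeholders, and the
       -m argument is assumed to be a plain byte count):

           $ find /mnt/extra -type f > extra-files.txt
           $ dupemap -I extra-files.txt -m 1024 report /tmp/recovered-files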

USAGE

   General usage
       The easiest operations to understand are those used when the -d option is not given.  To delete all
       duplicate files in /tmp/recovered-files, do:

           $ dupemap delete /tmp/recovered-files

       Often, dupemap scan is run to produce a checksum database of all files in a directory tree.  Then dupemap
       delete  is run on another directory, possibly following dupemap report.  For example, to delete all files
       in /tmp/recovered-files that already exist in $HOME, do this:

           $ dupemap -d homedir.map scan $HOME
           $ dupemap -d homedir.map delete,report /tmp/recovered-files

   Usage with magicrescue
       The main application for dupemap is to  take  some  pain  out  of  performing  undelete  operations  with
       magicrescue(1).   The  reason is that magicrescue will extract every single file of the specified type on
       the block device, so undeleting files requires you to find a few files out of hundreds, which can take  a
       long  time if done manually.  What we want to do is to only extract the documents that don't exist on the
       file system already.

       In the following scenario, you have accidentally deleted some important Word documents  in  Windows.   If
       this  were  a  real-world scenario, then by all means use The Sleuth Kit.  However, magicrescue will work
       even when the directory entries were overwritten, i.e. more files were stored in the same folder later.

       You boot into Linux and change to  a  directory  with  lots  of  space.   Mount  the  Windows  partition,
       preferably read-only (especially with NTFS), and create the directories we will use.

           $ mount -o ro /dev/hda1 /mnt/windows
           $ mkdir healthy_docs rescued_docs

       Extract  all the healthy Word documents with magicrescue and build a database of their checksums.  It may
       seem a little redundant to send all the documents through magicrescue first, but the reason is that  this
       process may modify them (e.g. by stripping trailing garbage), and therefore their checksums will not match
       those of the original documents.  Also, it will find documents embedded inside other files, such as
       uncompressed zip archives or files with the wrong extension.

           $ find /mnt/windows -type f \
             |magicrescue -I- -r msoffice -d healthy_docs
           $ dupemap -d healthy_docs.map scan healthy_docs
           $ rm -rf healthy_docs

       Now rescue all "msoffice" documents from the block device and get rid of everything that's not a *.doc.

           $ magicrescue -Mo -r msoffice -d rescued_docs /dev/hda1 \
             |grep -v '\.doc$'|xargs rm -f

       Remove all the rescued documents that also appear on the file system, and remove duplicates.

           $ dupemap -d healthy_docs.map delete,report rescued_docs
           $ dupemap delete,report rescued_docs

       The rescued_docs folder should now contain only a few files.  These will be the undeleted files and some
       documents that were not stored in contiguous blocks (use that defragger ;-)).

   Usage with fsck
       In this scenario (based on a true story), you have a hard disk that's gone bad.  You have managed  to  dd
       about 80% of the contents into the file diskimage, and you have an old backup from a few months ago.  The
       disk is using reiserfs on Linux.

       First,  use  fsck to make the file system usable again.  It will find many nameless files and put them in
       lost+found.  You need to make sure there is some free space on the disk image, so fsck has  something  to
       work with.

           $ cp diskimage diskimage.bak
           $ dd if=/dev/zero bs=1M count=2048 >> diskimage
           $ reiserfsck --rebuild-tree diskimage
           $ mount -o loop diskimage /mnt
           $ ls /mnt/lost+found
           (tons of files)

       Our  strategy will be to restore the system with the old backup as a base and merge the two other sets of
       files (/mnt/lost+found and /mnt) into the backup after eliminating duplicates.   Therefore  we  create  a
       checksum database of the directory we have unpacked the backup in.

           $ dupemap -d backup.map scan ~/backup

       Next, we eliminate all the files from the rescued image that are also present in the backup.

           $ dupemap -d backup.map delete,report /mnt

       We  also  want  to  remove  duplicates from lost+found, and we want to get rid of any files that are also
       present in the other directories in /mnt.

           $ dupemap delete,report /mnt/lost+found
            $ ls -d /mnt/*|grep -v lost+found|xargs dupemap -d mnt.map scan
           $ dupemap -d mnt.map delete,report /mnt/lost+found

       This should leave only the files in /mnt that have changed  since  the  last  backup  or  got  corrupted.
       Particularly,  the contents of /mnt/lost+found should now be reduced enough to manually sort through them
       (or perhaps use magicsort(1)).

   Primitive intrusion detection
       You can use dupemap to see what files change on your system.  This is one of the more  exotic  uses,  and
       it's only included for inspiration.

       First, you map the whole file system.

           $ dupemap -d old.map scan /

       Then  you come back a few days/weeks later and run dupemap report.  This will give you a view of what has
       not changed.  To see what has changed, you need a list of the whole file system.  You can get  this  list
       along with preparing a new map easily.  Both lists need to be sorted to be compared.

           $ dupemap -d old.map report /|sort > unchanged_files
           $ dupemap -d current.map scan /|sort > current_files

       All that's left to do is to compare these lists and prepare for next week.  The rename below assumes that
       the dbm library appends the ".db" extension to database files.

           $ diff unchanged_files current_files > changed_files
           $ mv current.map.db old.map.db

DATABASE

       The actual database file(s) written by dupemap will have some relation to the database argument, but
       most implementations append an extension.  For example, Berkeley DB names the file database.db, while
       Solaris and GDBM create both a database.dir and a database.pag file.
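
       For example, to see which file(s) your build actually created after a scan (the homedir.map name is taken
       from the "USAGE" section above; the .db suffix shown below assumes a Berkeley DB backed build):

           $ dupemap -d homedir.map scan $HOME
           $ ls homedir.map*
           homedir.map.db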

       dupemap depends on a database library for storing the checksums.  It currently requires the POSIX-
       standardized ndbm library, which must be present on XSI-compliant UNIXes.  Implementations are not
       required to handle hash key collisions, and a failure to do that could make dupemap delete too many
       files.  I haven't heard of such an implementation, though.

       The  current  checksum  algorithm  is the file's CRC32 combined with its size.  Both values are stored in
       native byte order, and because of varying type sizes the database is not portable  across  architectures,
       compilers and operating systems.

SEE ALSO

       magicrescue(1), weeder(1)

       This  tool  does  the same thing weeder does, except that weeder cannot seem to handle many files without
       crashing, and it has no largefile support.

BUGS

       There is a tiny chance that two different files can have the same checksum and size.  The probability  of
       this happening is around 1 in 10^14, and since dupemap is part of the Magic Rescue package, which deals
       with disaster recovery, that chance becomes an insignificant part of the game.  You should consider  this
       if you apply dupemap to other applications, especially if they are security-related (see next paragraph).

       It  is  possible to craft a file to have a known CRC32.  You need to keep this in mind if you use dupemap
       on untrusted data.  A solution to this could be to implement an option for using MD5 checksums instead.

AUTHOR

       Jonas Jensen <jbj@knef.dk>

LATEST VERSION

       This   tool   is   part   of    Magic    Rescue.     You    can    find    the    latest    version    at
       <https://github.com/jbj/magicrescue>

1.1.10                                             2018-10-16                                         DUPEMAP(1)