What's the Best Way to Clean Out Duplicate Files from Your Computer? (4/30/2018)

	Personalized Computer Services	(617) 484-6657

NEWSLETTER

Practical Computer Advice
from Martin Kadansky

Volume 12 Issue 4

April 2018


Summary

Are you looking for an easy-to-use technique or utility program to remove duplicate files from your computer "automatically," i.e., that works perfectly every time with no effort or thought on your part? Then please let me know if you find one, because this is a complex issue with no simple answers or safe and acceptable one-click solutions.

The problems with having duplicate files vs. the problems involved in trying to remove them

The most common reasons that duplicate files can be a problem include:

Confusion: Why is this file in this folder when there's another copy of it in that folder?
Clutter: If you didn't have duplicates, you would have fewer files on your computer.
Wasted space: If you didn't have duplicates, your files would take up less space on your computer.

On the other hand:

Modern disk drives are big, fast, and less expensive per unit storage than ever.
It is a myth that extra files "slow down" or "wear out" your computer. Unless your hard drive is truly full (i.e., more than 95% of its capacity), extra files will probably not affect the computer's general operating speed at all. They might make operations that involve looking through your entire hard drive take a little more time (antivirus scans, backups, etc.), but that's a different (and only occasional) issue.
Unless you already know that you have lots of duplicated files, if you are really concerned about clutter, wasted space, and unnecessary software, your time and effort are probably better spent working on those issues directly rather than looking for duplicates.
In the worst case, blindly deleting files from your computer without regard to what they are or where they're located, whether they're duplicates or not, can destroy your operating system and render your computer unusable.

How to use a utility program to quickly find and delete duplicate files, and the major problems that you will probably create

There are many free and commercial "duplicate finder" utility programs that give you a mechanical tool to help you identify sets of duplicate files on your computer. They generally work as follows:

They scan your computer for duplicate files (typically looking for files with exactly the same size and contents, plus other options). Such a scan can take a fair amount of time.
They display what they found in a list, with each set of duplicates (triplicates, etc.) grouped together.
You can open any file in the list, visit the folder where it's located, and more.
You can select and delete any of the files in the list.
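
As a rough illustration of the scanning steps above, here is a minimal Python sketch. It is not how any particular utility works; it assumes that "duplicate" means identical size and identical contents, and it treats two files with the same SHA-256 digest as having the same contents. The function names are my own, chosen for illustration. It only reports what it finds and never deletes anything.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file's full contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted groups of files under root with identical size and contents."""
    # Step 1: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size group, compare contents by digest.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Reading every candidate file in full is what makes such a scan slow on a large drive, which is why the size comparison comes first.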

However, any such utility knows nothing about why the duplicated files are present, much less the consequences of removing them. And while the process sounds simple, if you're not careful you might not only disrupt the logical integrity of the collections of files that you (or your software) have organized, you might also corrupt or destroy your computer's operating system, software, or valuable data, simply because they contain duplicate files that are there by design and should never be removed. In other words, you might render certain programs or even your entire computer completely unusable.

Here's an analogy: Imagine you had a robot that you told to look over your office and remove any duplicate items. When the robot finished, you would not be happy to discover that it threw away all of your unused paper (keeping only one blank sheet), and all but one of your file folders, pens, paper clips, ink cartridges, batteries, cordless phone handsets, light bulbs, light switches, backup drives, etc. But that's what you said you wanted it to do, right?

And, since most duplicate-finding utility programs use strict criteria (usually identical file size and contents) to detect duplicates, they might ignore many sets of documents that you would consider duplicates but are technically different, so ironically they may not do as complete a job as you might expect. This is most often due to differences in metadata; see below for more information.

Also, such utility programs will not find "similar" files, like two Word files that would be the same except for a few changes that you made in one vs. the other.

How to use a utility program to carefully find and delete duplicate files, and hopefully avoid the major problems that you might create

If you really want to take on such a project, I recommend:

Do a thorough backup of your computer before you start.
Read reviews of any utility program before you start using it to find out how it works and what options it offers.
Don't let the utility scan your computer's entire internal drive. Limit the scope of the scan to your own data folders and their subfolders: "C:\Users\(username)" on Windows and "Macintosh HD:Users:(username)" on Macintosh. In particular, stay away from folders that contain your operating system and software, as well as any other folders that contain files that you don't recognize or understand.
Once it shows you the list of duplicates, don't just blindly delete them. Instead, examine them carefully. Look at the folders in which they're located and think about the larger context. Are those duplicates there for a good reason, e.g., historical record, backup, collections that highlight your "best of" or "favorite" files, etc.?
As you work your way down the list, you may see entire folders with files that you don't understand, as well as complicated folders that you're not ready to deal with yet. Tell the utility to skip those folders and re-scan.
Expect this project to require a fair amount of your time and attention.
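
To make the "limit the scope" and "skip those folders" advice concrete, here is a small Python sketch that lists only the files under your own user folder and leaves out folders you have chosen to skip. The skip lists are assumptions for illustration only ("AppData" on Windows and "Library" on Macintosh typically hold program data, and "desktop.ini" and ".DS_Store" are folder-settings files); adjust them to your own situation. The function name is hypothetical, and nothing here deletes anything.

```python
import os

# Hypothetical skip lists for illustration; adjust them to your own computer.
SKIP_FOLDERS = {"AppData", "Library", "Applications"}
SKIP_FILES = {"desktop.ini", ".DS_Store"}


def user_files(root=None, skip_folders=SKIP_FOLDERS, skip_files=SKIP_FILES):
    """Yield paths of regular files under root, never entering skipped folders."""
    if root is None:
        root = os.path.expanduser("~")  # the current user's own folder
    for folder, subfolders, names in os.walk(root):
        # Editing the list in place stops os.walk from descending into them.
        subfolders[:] = [d for d in subfolders if d not in skip_folders]
        for name in names:
            path = os.path.join(folder, name)
            if name not in skip_files and os.path.isfile(path):
                yield path
```

When you come across a folder you don't understand, you would add its name to the skip list and run the listing again, which mirrors the "skip and re-scan" step above.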

How to use your computer's own search function to find potentially duplicate files

A utility program that mechanically finds files with identical contents can save you time by narrowing down the files for you to consider removing, especially if you have hundreds or thousands of files to examine, but you could also simply search for all files of a given type (.doc, .jpg, .pdf, .xls, etc.), sort the results by size, and then manually review any files with the same size to see if they really have the same contents. Files with the same type, size, and modification date are more likely to have the same contents, but as above, I recommend doing a full backup first and then being careful before you delete anything.
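
The manual approach above (search by type, sort by size, review files with matching sizes) can be sketched in Python as follows. This is only an illustration under my own assumptions: the function name is made up, the file type is matched by the end of the file name regardless of upper or lower case, and matching size alone proves nothing, so each group is just a list of candidates for you to open and compare yourself.

```python
import os
from collections import defaultdict


def same_size_candidates(root, extension):
    """Return (size, paths) pairs for same-size files of one type, largest first."""
    by_size = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if name.lower().endswith(extension.lower()) and os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)
    # Keep only sizes shared by two or more files, sorted from largest to smallest.
    shared = sorted((s for s, p in by_size.items() if len(p) > 1), reverse=True)
    return [(size, sorted(by_size[size])) for size in shared]
```

For example, calling same_size_candidates on your Pictures folder with ".jpg" would return the groups of photos worth a closer look, starting with the largest ones.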

Background: What makes one file a "duplicate" of another? How metadata makes this more complicated

If two files have identical contents, then they should be considered duplicates, right? If the contents are the same, they should also have exactly the same size, right? While both of those statements sound logical, the problem is that many file types have two separate "sides" to their contents:

Regular data: The text, pictures, sound, video, etc. (plus attributes like fonts, colors, margins, etc.) that you can see and edit, and
Metadata (literally, "data about other data"): This is extra data stored in a document that you can't normally see, which can include the name of the file's author or artist, the number of revisions, the date and time a photo was taken, the type of camera used, GPS location information, etc. Here's an analogy: You might have a 4x6 photo printed on photo paper. The picture side is like the regular data in a digital photo file, and the writing on the back ("Family trip to California in 2015") is like the metadata.

Two documents might have identical regular data, which means that you might reasonably consider them duplicates, but depending on their editing history, their metadata may be different. Since most duplicate-scanning programs only look at the file as a whole (including the metadata), they will treat these files as different and skip them.

When does copying a file preserve its metadata, and when does copying change it?

Imagine that you copied a file using any of these methods:

You made a copy of a file (or an entire folder) using Windows Explorer or the Finder on Macintosh, either within the same folder or from one folder to another, typically using click-and-drag, Copy-and-Paste, or a Duplicate command.
You copied a file from a portable device or external drive onto your computer more than once. This can easily happen, for example, if you copy all of the photos from your digital camera, smartphone, or tablet onto your computer, and then (without cleaning out the device) you take more photos, and then later copy everything to your computer again, creating duplicates or triplicates or more.
You restored a file or folder from a backup.
You downloaded a file from a web site multiple times, especially in frustration if you didn't know into which folder your browser was downloading it.
You saved the attachment from the same email multiple times, or received exactly the same attachment more than once in different emails.

Files copied in these ways (which don't involve opening the files) would have exactly the same size, regular contents, and metadata as the originals. However, the copies' names, creation dates, and modification dates might be different from the originals.

On the other hand, imagine that you copied a file using one of these methods:

You opened a Microsoft Word document, and then immediately did "File->Save as" to make a copy.
You made an exact duplicate of a Word document using one of the techniques above, and then you opened the new document, typed in a few characters, saved it, deleted those extra characters, and then saved it again.
You opened a Word document, did "Edit->Select All" to select all the text, started a new document, Pasted in the text, Saved it, and then updated the new file's margins etc. to match the original file.

Copies of documents created using these methods (which involve opening the files) would have the same regular contents but their metadata would be different, so they would not be fully identical to the originals. Also, their file sizes might be the same or slightly different, depending on how the metadata is stored.

Unfortunately, I have not yet seen a general-purpose duplicate-finding utility with the ability to ignore metadata, which would enable it to find "equivalent" files like these, i.e., with identical regular content but different metadata. So, such a utility probably won't do as thorough a job as you might like.
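
To show what "ignoring metadata" could look like, here is a narrow Python sketch for one file type only: newer Word documents (.docx), which are ZIP packages that keep the main text in a part named "word/document.xml" and properties such as author and revision count in separate parts. It does not apply to older .doc files, the function names are my own, and it is an illustration of the idea rather than a reliable tool: pictures, styles, and headers live in other parts that it ignores, and Word can record revision identifiers inside the main part itself, so real-world copies may still compare as different.

```python
import hashlib
import zipfile

BODY_PART = "word/document.xml"  # main text part inside a .docx package


def whole_file_digest(path):
    """Digest of the entire file, metadata included (what most scanners compare)."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def body_digest(path):
    """Digest of only the main document part, ignoring the other parts."""
    with zipfile.ZipFile(path) as package:
        return hashlib.sha256(package.read(BODY_PART)).hexdigest()


def same_body(path_a, path_b):
    """True if two .docx files have byte-identical main document parts."""
    return body_digest(path_a) == body_digest(path_b)
```

Two files can have different whole-file digests while same_body reports True, which is exactly the "equivalent but not identical" situation described above.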

Some duplicates are normal and should not be removed

There are many types of duplicate files in your computer that should not be removed:

Apart from your own documents, there are hundreds of thousands of files that comprise your computer's operating system and programs. Among those there are a number of files that are a normal part of the infrastructure and which also just happen to be duplicates. Removing them may make your computer or software stop working or crash, so you should leave those files (and their folders) alone.
Among your own data, you might be using a program that stores its data in a special way that naturally has duplicate files. Arbitrarily removing those duplicates might make that program crash or corrupt its data.
You might have downloaded some digital music files and then added them to your computer's iTunes library. If your iTunes software is configured correctly, it makes copies of those files in its library folders so it won't depend on your original files staying exactly where they were. If you later start removing duplicated music files, you should be careful to keep the ones located in your iTunes library to avoid compromising its integrity.
You might have one folder containing the older version of a project and another containing the current one. If some of the documents didn't change in the new version of the project, then they are duplicates. Assuming that you want to keep both folders, each one should be complete, so it makes no sense to delete those duplicates from either folder.
Similarly, you might have saved two different web pages from the same web site into two folders, with each saved page consisting of multiple files (text, pictures, etc.). While each folder has a number of files that are different, they might have some other files that are the same, especially if the pages have similar designs. It makes no sense to delete the duplicate files from either folder.
You might have intentionally created duplicate copies of certain files because they don't fit into just a single category. For example, you might have decided to put all of your daughter's wedding photos into a folder named "Mary's wedding," and you might also have copied your favorite 10 photos into a separate folder called "Best wedding photos."
You might have organized some documents into a particular folder, and then later decided to share some of them with other people via Dropbox. Rather than disturbing that folder by moving those documents into your Dropbox folder, you decide to put copies into your Dropbox.
When you customize how a given folder is displayed, your operating system creates special hidden files that store the settings you've chosen. Two folders with the same settings will probably have identical hidden files, and removing those special files will change those folders back to their default display settings. Those files are called "desktop.ini" on Windows and ".DS_Store" on Macintosh.

In all of these cases, you should not delete these naturally duplicated files.

Other similarities among files that might suggest they're duplicates when they're not

Sometimes a pair of files may appear to be duplicates, but they aren't:

Shortcut files on Windows (or aliases on Macintosh) that contain no data themselves vs. the originals that they "point to"
Documents with similar or identical names and file types but different contents, e.g., abc1.doc vs. abc2.doc
Documents with similar or identical names but different file types, e.g., abc.doc vs. abc.pdf vs. abc.jpg, even if the abc.pdf or abc.jpg file was actually generated from abc.doc

Where to go from here

Ask yourself whether cleaning out duplicates is really worth the time, effort, and potential risk.
To research duplicate-finding utilities, google: remove duplicates [to which you should add keywords like Windows, Macintosh, documents, photos, etc. as appropriate]
If you find a utility program called "xyz" which interests you, find online reviews of it by googling: xyz review OR compare [be sure to capitalize the "OR" keyword]
Most of the time, being overly concerned about "cleaning" your computer is a waste of time. Before trying to fix or remove things that might not make any significant difference, spend the time and effort instead to set up a scheduled and thorough backup system to protect against computer failure, and then periodically confirm that your backup still works properly.
For more background on metadata, see: http://en.wikipedia.org/wiki/Metadata

How to contact me:
email: martin@kadansky.com
phone: (617) 484-6657
web: http://www.kadansky.com

On a regular basis I write about real issues faced by typical computer users. To subscribe to this newsletter, please send an email to martin@kadansky.com and I'll add you to the list, or visit http://www.kadansky.com/newsletter

Did you miss a previous issue? You can find it in my newsletter archive: http://www.kadansky.com/newsletter

Your privacy is important to me. I do not share my newsletter mailing list with anyone else, nor do I rent it out.

Copyright (C) 2018 Kadansky Consulting, Inc. All rights reserved.

I love helping people learn how to use their computers better! Like a "computer driving instructor," I work 1-on-1 with small business owners and individuals to help them find a more productive and successful relationship with their computers and other high-tech gadgets.
