Physically de-dupe the library

I am finding that there are duplicate copies in the library itself. Now, obviously PhotoStructure recognizes that it’s the same picture, but it still copied all of them into the library.

I understand after reading the FAQ that this is likely because there is some minute metadata difference between the files (added by Google or other software), so PhotoStructure errs on the side of caution and copies them all into the library.

So that’s my feature request: I would like the ability to generate a physical copy of the library where only the “best” version of each picture is saved. Just as PhotoStructure decides which “best” picture to display, I’d hope that at some point the same determination could be applied to the “plus” library.

My library looks like yours: tons of dupes.

I’ve been hesitant to delete files from the library that aren’t considered the “primary,” given

In thinking about this more, though, I think I could do the following and still be “safe”:

  1. Only copy new asset variations into the library if they are the new “best” variation
  2. Only remove prior-copied variations from the library if there’s an existing copy of the file that PhotoStructure has found on a different volume, the volume is mounted, and the file’s prior SHA matches the current SHA (to ensure that there’s no data loss by deleting the copy)
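Rule 2 above could be sketched roughly like this in shell. Everything here is a hypothetical stand-in (the paths, the “mounted volume,” and the “recorded SHA,” which in practice would come from the library database) — not PhotoStructure’s actual implementation:

```shell
#!/bin/sh
# Sketch of rule 2's safety check: only remove a library copy when a
# source copy still exists on a mounted volume AND its current SHA
# matches the SHA recorded at import time. All paths are stand-ins.
tmp="$(mktemp -d)"
mkdir -p "$tmp/library" "$tmp/oldNas"
printf 'fake-image-bytes' > "$tmp/library/photo.jpg"
printf 'fake-image-bytes' > "$tmp/oldNas/photo.jpg"  # copy on a "mounted volume"

library_copy="$tmp/library/photo.jpg"
source_copy="$tmp/oldNas/photo.jpg"
# Stand-in for the SHA the library database recorded at import time:
recorded_sha="$(sha256sum "$library_copy" | cut -d' ' -f1)"

decision="keep"
if [ -f "$source_copy" ] && \
   [ "$(sha256sum "$source_copy" | cut -d' ' -f1)" = "$recorded_sha" ]; then
  decision="remove"
fi
echo "library copy: $decision"
rm -rf "$tmp"
```

The SHA comparison is what makes the removal safe: if the source file was edited (or the volume is unmounted, so the file test fails), the library copy is kept.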

What do you think?

(Edit: in re-reading your post, I may have misunderstood: are you wanting PhotoStructure to delete duplicate files that are outside of your library?)

Your initial reading was correct. I don’t want PhotoStructure to delete anything outside of the “plus” library. In fact, I mount the scanned folders read-only just to be extra sure.

I’d like the “plus” library to have only the “best” version of every asset: basically a “physical” de-duping, not just a logical de-duping. Of course, all of the duplicates in the source paths would still be there should I disagree with a decision that PhotoStructure made, so there really is no data loss should PhotoStructure do something stupid.

So your ideas sound ok to me.

An additional idea that just came to mind - maybe worth exploring: could one specify (either through UI or configuration) a source folder that should always take precedence? Or even a ranking/weight for each folder? Thinking about my situation: there is one source path that I actively manage (edit metadata, post new pictures) while the other paths are more historical. You can see in my screenshot 3 paths: “ApplePhotos”, “GoogleTakeout” and “oldNas”. Really, ApplePhotos is the copy that should always win in my book, with “oldNas” coming second and “GoogleTakeout” last. So the paths could be given extra consideration in the heuristic.


I could add a “volume precedence” setting which would just be a list of volumes, but specifying volumes by mountpoints is problematic. I could accept volume labels, volshas, and mountpoints, I guess?
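For illustration, the precedence lookup itself would be simple. Here is a rough shell sketch; the precedence list and the variant paths are hypothetical examples, not a real PhotoStructure setting:

```shell
#!/bin/sh
# Hypothetical sketch: choose the "winning" variant of one asset by
# walking a precedence list of volume labels and keeping the first match.
precedence="ApplePhotos oldNas GoogleTakeout"   # highest priority first
variants="/mnt/GoogleTakeout/a.jpg /mnt/oldNas/a.jpg /mnt/ApplePhotos/a.jpg"

best=""
for vol in $precedence; do
  for v in $variants; do
    case "$v" in
      "/mnt/$vol/"*) best="$v"; break 2 ;;
    esac
  done
done
echo "winner: ${best:-none}"
```

In practice the precedence would presumably act as a tiebreaker layered on top of the existing “best variant” heuristics, rather than replacing them.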

Let’s add this suggestion as a new feature:

If/when this is implemented, it would be slick to have a “Help me deduplicate” experience that displays the “best” asset alongside a variant and asks the user to confirm deleting the worse one (or deleting the “best” one and replacing it). The majority of the related thread Deleting / hiding photos - #3 by mrm seems to address deleting entire assets, not variants of assets, but @codepoet did mention:

I’d like to be able to verify the duplicates in a side-by-side and then delete the “other” one

I think you could use the same “Delete” action you already made for managing entire assets, but I do not think the “Archive” or “Remove” actions would apply.

And of course the variants would not actually be deleted from disk until you click the “Empty trash” button from the “View trash” search.

Lastly I think the “View trash” search should have some indicator differentiating between variants and whole assets. And the variants should include a button to compare again with the “best” variant.


Thanks for sharing those thoughts!

The v2.1 implementation of hide/remove/delete is actually only at the asset level: searches don’t know about asset files yet (which is also why you can’t search for filenames yet).

I could add a “delete now” option to each file in the asset info panel that wouldn’t offer undo, but even that seems like it’d be dangerous. I’ll think about how this could work.

Along the same lines as “de-dupe”: give us the option to move all photos, not just copy them. I’ve got so many apps that put photos in their own “special” folder. I want everything in one location, de-duped and searchable.

Thanks!

Howdy @YoungDave , welcome to PhotoStructure!

Yup–especially egregious are the apps that use “content-addressable” paths (based on the SHA, so you get stuff like /A8F3/A34G921E7831C.JPG), like what Apple Photos does (or did) with the Masters directory.

In case you missed it, you can disable automatic organization, and PhotoStructure will leave all the files where it found them, but this sounds like you do want auto-organization:

Before you run your first sync, take a look at the assetPathnameFormat setting. As of today, here’s the documentation for that:

# +---------------------------------------------------+
# |  PS_ASSET_PATHNAME_FORMAT or assetPathnameFormat  |
# +---------------------------------------------------+
#
# If you opt into "automatic organization" (see the setting
# "copyAssetsToLibrary"), they will be copied into <originals
# directory>/<result of assetPathnameFormat>.
#
# - See the originalsDir system setting for what your <originals directory> is
# (it defaults to your library root directory).
#
# - Please encode this path with forward-slashes, even if you're on Windows.
#
# - If any patterns resolve to including forward-slashes, know that they will
# be interpreted as subdirectories.
#
# - If you want to add a static path, escape the pathname with single quotes
# (like "'photos'/y/MM/dd").
#
# - The result of this will always be interpreted as a relative path from your
# PhotoStructure originals directory.
#
# - Use token "BASE" as a shorthand for the original basename ("photo.jpg" for
# "/path/to/photo.jpg").
#
# - Use token "NAME" as a shorthand for the original filename, without the
# file extension ("photo" for "/path/to/photo.jpg").
#
# - Use token "PARENT" as a shorthand for the original file's parent directory
# name ("to" for "/path/to/photo.jpg").
#
# - Use token "GRANDPARENT" as a shorthand for the original file's grandparent
# directory name ("path" for "/path/to/photo.jpg").
#
# - Use token "EXT" for the filename's extension without the "." prefix (like
# "jpg" for "/path/to/photo.jpg").
#
# - Use token "ISO" as a shorthand for "yyyy-MM-dd'T'HH:mm:ss.SSSZZ".
#
# - You can escape other static text by wrapping with single quotes.
#
# - For other tokens, see
# <https://moment.github.io/luxon/#/formatting?id=table-of-tokens>.
#
# - See
# https://forum.photostructure.com/t/how-to-change-the-naming-structure/1184/2?u=mrm
# for more details.
#
# PS_ASSET_PATHNAME_FORMAT="y/y-MM-dd/BASE"

The idea is that if you’re not happy with “y/y-MM-dd/BASE”, change this to suit your taste before you start your sync.
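For instance, sticking to the tokens documented above, these are a few illustrative (untested) alternatives — verify against your own library before starting a sync:

```shell
# Default: year / date / original basename
PS_ASSET_PATHNAME_FORMAT="y/y-MM-dd/BASE"

# Static "photos" prefix, year/month folders, original basename:
PS_ASSET_PATHNAME_FORMAT="'photos'/y/MM/BASE"

# Keep the original parent-folder name under each year:
PS_ASSET_PATHNAME_FORMAT="y/PARENT/BASE"
```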

Once the sync is finished, you can pipe that command’s output into | xargs rm -v, but before you do:

  1. Make sure you have a full backup of all the files you’re about to delete, and that the backup is offline. It’s easy to get commands just a bit wrong, and rm can be terrifyingly fast.

  2. Make sure you’ve at least scanned through the sync report and deduplication results, for example by leaving the asset info panel open while browsing your library, and then clicking through the asset variations to verify PhotoStructure did the right thing. Deduplication is an inexact science (which I touched on above).
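Before piping anything into rm, a dry run is cheap insurance. Here’s the general pattern — the file list below is a self-contained stand-in; substitute your real duplicate report:

```shell
#!/bin/sh
# Dry-run-first deletion pattern. These temp files stand in for the
# duplicate paths you actually intend to remove.
tmp="$(mktemp -d)"
printf 'dupe' > "$tmp/dupe1.jpg"
printf 'dupe' > "$tmp/dupe2.jpg"
printf '%s\n' "$tmp/dupe1.jpg" "$tmp/dupe2.jpg" > "$tmp/to-delete.txt"

# 1. Dry run: print what WOULD be removed, but delete nothing.
xargs -I{} echo "would rm {}" < "$tmp/to-delete.txt"

# 2. Only after reviewing the dry run, actually delete:
xargs -I{} rm -v {} < "$tmp/to-delete.txt"
```

Reviewing the `echo` output first catches the classic mistakes (wrong directory, an unexpectedly broad list) before rm does anything irreversible.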

Also: know that the next build includes a number of bugfixes and improvements to image and video deduplication (including 2 new image hashing algorithms and a number of date-related changes): https://photostructure.com/about/2023-release-notes/#v210-alpha8

Perhaps another option to deal with de-duping a library could be a special view that shows the dupes, with some additional information to help decide which to keep. All of this would just aid us in manually removing the files. I have a library with a couple hundred thousand files, so manually de-duping it will take some time.

Perhaps where you have “View archived”, etc., you could have “View duplicates”.
