Generate list of filenames for Assets with duplicates

Thomas_Hollier · February 18, 2022, 5:43am

Hi,

My library has 170,000 files from a variety of legacy sources and once the de-duplication happens, I end up with 112,000 assets. I studied the SQL file for a while and can see the assetID of each asset, the id of each file associated with that asset, which of these associated file is shown, its uri and fileSize.
All good so far…

What I’d like to do is to reclaim as much space as possible for the least amount of effort. While PS does great work catching duplicates, it is not (yet?) set up as a way to manage these duplicates and delete redundant data.

Rather than use a de-duplicator on the various input libraries (and introduce yet another software package), I’m looking for a way to fold the great de-duping work already done by PS back into the source. My thought is that I can use a csv of all the assets that have multiple files associated with them and write a script to encode some kind of keyword into my input files with exiftool and refresh those updated files in Lightroom to do the necessary visual checks before deleting the duplicates.

My question is: is there a simple way to get that specific data out of the sql file into a format more flexible to work with in a script?

mrm · February 18, 2022, 6:40am

PhotoStructure for Servers (you can install it next to your PhotoStructure for Desktops installation) has a bunch of command-line tools to do just this sort of data exploration and mass file changes.

I’ve cooked up a bunch of recipes here:

https://photostructure.com/server/tools/#show-me-all-the-shown-filenames-for-each-asset

Standard disclaimers:

PhotoStructure may not have deduplicated things exactly as you’d expect/hope.
PhotoStructure may have either aggregated too many files into a single asset (false positives) or too few (false negatives). You can see what it did by browsing your library and opening the “info” panel: the bold filename is the “primary”, or “shown” asset file variation. *
PhotoStructure may have not picked the “best” variant for a given asset that you’d rather have. Check the docs for how it figures out which is the “best” or (“shown”).
Please take a full backup before you do any file deleting, just for insurance.

* If you ever find that PhotoStructure’s deduper isn’t doing the right thing, please send an example of the original files (as long as they don’t have private content) to support@photostructure.com and I can see what’s going on.

Thomas_Hollier · February 18, 2022, 9:47pm

Thanks for the reply. I did study these and they gave me a starting point.

I realize my request may be beyond the scope of what you are currently taking on for photostructure but what I am trying to do is this:

First, I would like to generate a document from the library database that shows me all the assets that contain duplicates, as well as the uri to each file associated with that asset, which one is tagged as the primary, and the file size. Full disclosure: I am not a database guy and I don’t speak SQL query. My hunch is there is a simple way to look that up directly from a bash or python script without generating an intermediary file. If you can point me in that direction, I’ll take any hint you can give.

Second, in an ideal world, for each asset, I would be able to bring up the originals for each representation, re-tag specific ones as primary representations if needed, and tag redundant ones in some way that would allow me to delete the original file. This second part is probably less thought out and complicated from the implementation standpoint but maybe it helps describe my intent: to go through all my libraries from all the different approaches I’ve taken over the years, and remove all the redundancies. At this point, I figure I have close to 1 TB of duplicates that I don’t want to keep around.

Thomas_Hollier · February 18, 2022, 11:20pm

Okay, google (sometimes) being my friend, this seems to isolate the assetId’s that have multiple fileId’s associated with them. This also lists the shown version listed first. It also sorts the list based on the largest file first so you can clean those up first. I’m putting this here but use with care: I don’t know what I’m doing. I am testing this ON A COPY of the library database with the “DB Browser for SQLite” app on the mac.

SELECT assetId, GROUP_CONCAT(uri, " "), GROUP_CONCAT(shown  , ''), GROUP_CONCAT(fileSize  , ' ')
FROM (
   SELECT assetId, uri, shown, fileSize
   FROM assetFile
   ORDER BY assetId, shown DESC
   )
GROUP BY assetid
HAVING GROUP_CONCAT(shown , '') LIKE '%1%0%'
ORDER BY fileSize DESC;

mrm · February 19, 2022, 9:50pm

For this step, you want all assets with more than 1 associated asset file:

SELECT
  Asset.id AS assetId,
  count(AssetFile.id) AS afCount
FROM
  Asset
  JOIN AssetFile ON AssetFile.assetId = Asset.id
WHERE Asset.shown = 1 -- (don't return "work-in-progress" assets)
GROUP BY
  assetId
HAVING
  afCount > 1 -- (only those assets with more than one asset file)

Converting from URI to native filesystem path is actually non-trivial, which is why I wrote the list tool, https://photostructure.com/server/tools/#path-names-in-your-library. Unfortunately, although it supports custom WHERE clauses, it doesn’t (currently) support gluing in HAVING or GROUP BY, so you’d have to use a WHERE IN clause, which SQLite limits to 1024 items, I believe.

So, I think the real solution is to add a new option to the list tool that returns only those assets with duplicates, perhaps?