Deduplicate shenanigans

nuk · November 6, 2022, 5:04pm

Expected Behavior

These five images should be combined into one or two assets (including according to info ), but they’re not actually displayed that way on UI.

/20200427_225538.jpg = original portrait
/20200427_225616.jpg = original landscape
/107918631_157209.jpg = compressed portrait
/806525515_15743.jpg = compressed landscape
/review/806525515_15743.jpg = review copy of compressed landscape from parent directory, only change being the additional tag “review20221105”

Current Behavior

I actually see four assets on the UI; only the “compressed portrait” and “compressed landscape” images were combined.

Per PhotoStructure | What do you mean by “deduplicate”? I review the info command output and found that PhotoStructure does indeed think the two compressed images should be combined – but it also thinks many other asset variations should also be combined. And it does not think the two original images should be combined.

Expected output from info queries:

Set alias

xxxxx@xxxxx:/$ alias phstr='sudo docker exec -u node -it photostructure-server-alpha-20221106 ./photostructure'

Don't combine original portrait & original landscape

xxxxx@xxxxx:/$ phstr info /pictures/20200427_225538.jpg /pictures/20200427_225616.jpg
{
  fileComparison: 'These files represent different assets: captured-at 2020042722553840±10ms != 2020042722561648±10ms',
  primary: undefined,
  imageHashComparison: {
    imageCorr: 0.74,
    aRotation: 90,
    colorCorr: 0.97,
    meanCorr: 0.82,
    greyscale: false
  },
...

Do combine original portrait & compressed portrait

xxxxx@xxxxx:/$ phstr info /pictures/20200427_225538.jpg /pictures/107918631_157209.jpg
{
  fileComparison: 'These two files will be aggregated into a single asset.',
  primary: '/pictures/20200427_225538.jpg',
  imageHashComparison: {
    imageCorr: 0.91,
    aRotation: 90,
    colorCorr: 1,
    meanCorr: 0.94,
    greyscale: false
  },
...

Do combine original landscape & compressed landscape

xxxxx@xxxxx:/$ phstr info /pictures/20200427_225616.jpg /pictures/806525515_15743.jpg
{
  fileComparison: 'These two files will be aggregated into a single asset.',
  primary: '/pictures/20200427_225616.jpg',
  imageHashComparison: {
    imageCorr: 0.93,
    aRotation: 0,
    colorCorr: 1,
    meanCorr: 0.95,
    greyscale: false
  },
...

Do combine compressed landscape & review copy

xxxxx@xxxxx:/$ phstr info /pictures/review/806525515_15743.jpg /pictures/806525515_15743.jpg
{
  fileComparison: 'These two files will be aggregated into a single asset.',{
  fileComparison: 'These two files will be aggregated into a single asset.',
  primary: '/pictures/review/806525515_15743.jpg',
  imageHashComparison: {
    imageCorr: 1,
    aRotation: 0,
    colorCorr: 1,
    meanCorr: 1,
    greyscale: false
  },
...

Unexpected but acceptable output from info queries:

Do combine compressed landscape & compressed portrait

xxxxx@xxxxx:/$ phstr info /pictures/107918631_157209.jpg /pictures/806525515_15743.jpg
{
  fileComparison: 'These two files will be aggregated into a single asset.',
  primary: '/pictures/806525515_15743.jpg',
  imageHashComparison: {
    imageCorr: 0.7,
    aRotation: 0,
    colorCorr: 0.97,
    meanCorr: 0.79,
    greyscale: false
  },
...

Steps to Reproduce

See the five images in zip attachment to support@photostructure.com. I don’t believe the metadata has anything I want to keep private, including GPS data, but it’s over 4MB so the forum won’t let me attach it.

Extract the five images
Run PhotoStructure and let it build the library
(probably not related, but I did “Shutdown” and change environment variables a few times to try to get it to process faster, then after starting PhotoStructure again clicked “Resync” to get it to continue with the import process)
Look for the images and see they aren’t combined. I checked both by searching via When, and also via Keyword “review20221105”.

Environment

Synology DS920+, DSM 7.1.1-42962 Update 1
PhotoStructure 2.1.0-alpha.7 on Docker

mrm · November 6, 2022, 5:11pm

A+ bug report, thanks a ton! I’ll look into this right now.

mrm · November 6, 2022, 8:39pm

Those are interesting exemplars!

I just ran the code that will be alpha.8 against these images, and the results are different–your landscape and portrait are properly kicked into different assets.

Seems that the captured-at time can only be gleaned from stat, which is not reliable.

PhotoStructure creates a perceptual image hash in CIELAB space by squooshing pixels into a square. I had considered different rotations for the given images (the “aRotation” field), but I’d not taken into account images that had been rotated and re-rastered–I’d only thought of image rotation via the Orientation metadata flag. Re-rastered rotations would be common for image edits, though. I’ll have to apply the new image rotation discount if the aspect ratio doesn’t match.

Here’s the comparison between 1079* and 8065*:

info ~/Downloads/nuk/pictures/107918631_157209.jpg ~/Downloads/nuk/pictures/806525515_15743.jpg --filter capturedAt --filter imageHash --filter imageHashComparison
{
  imageHashComparison: {
    imageCorr: 0.74,
    minImageCorr: 0.8,
    imageCoeffDelta: 0,
    colorCorr: 0.98,
    minColorCorr: 0.65,
    colorCoeffDelta: 0,
    isGreyscale: false,
    aRotation: 0,
    isSimilar: false,
    whyNotSimilar: 'different image content'
  },
  a: {
    nativePath: '/home/mrm/Downloads/nuk/pictures/107918631_157209.jpg',
    capturedAt: {
      date: 2020-04-27T16:46:46.000Z,
      localCentiseconds: 2020042709464600,
      src: 'stat',
      toLocal: 2020042709464600
    },
    imageHash: {
      dominantColors: [
        { name: 'Black', pct: 88, rgb: '#0B0B0D' },
        { name: 'Bistre', pct: 5, rgb: '#2D2617' },
        { name: 'Shadow green', pct: 4, rgb: '#1E2B16' },
        { name: 'Dark yellow-green', pct: 1, rgb: '#3E461D' },
        { name: 'Yellow-orange', pct: 1, rgb: '#F09F1E' }
      ],
      dominantColorsDescription: 'dominantColorsFromModes: {"uniqColors":37,"mergedColors":14,"pixelCount":1036,"pctOmitted":1}',
      isGreyscale: false,
      meanHash: 'HR0TBg0Y//6qyuD5+v4AABwdEwYAAA==',
      mimetype: 'image/jpeg',
      mode0: 658,
      mode0pct: 88,
      mode1: 691,
      mode1pct: 5,
      mode2: 689,
      mode2pct: 4,
      mode3: 920,
      mode3pct: 1,
      mode4: 3694,
      mode4pct: 1
    }
  },
  b: {
    nativePath: '/home/mrm/Downloads/nuk/pictures/806525515_15743.jpg',
    capturedAt: {
      date: 2020-04-27T16:46:46.000Z,
      localCentiseconds: 2020042709464600,
      src: 'stat',
      toLocal: 2020042709464600
    },
    imageHash: {
      dominantColors: [
        { name: 'Black', pct: 89, rgb: '#0B0B0D' },
        { name: '90% black', pct: 6, rgb: '#201A06' },
        { name: 'Shadow green', pct: 3, rgb: '#121E06' },
        { name: 'Forest green', pct: 1, rgb: '#394631' },
        { name: 'Dark yellow', pct: 1, rgb: '#A78D38' },
        { name: 'Yellow-orange', pct: 1, rgb: '#DD8E00' }
      ],
      dominantColorsDescription: 'dominantColorsFromModes: {"uniqColors":33,"mergedColors":12,"pixelCount":1036,"pctOmitted":-1}',
      isGreyscale: false,
      meanHash: 'gAAYHRoHBwn//quooPj4/YAAHB0OBgIA',
      mimetype: 'image/jpeg',
      mode0: 658,
      mode0pct: 89,
      mode1: 663,
      mode1pct: 6,
      mode2: 661,
      mode2pct: 3,
      mode3: 913,
      mode3pct: 1,
      mode4: 2774,
      mode4pct: 1,
      mode5: 3690,
      mode5pct: 1
    }
  }
}

nuk · November 6, 2022, 9:24pm

Ok, thank you. Comparisons can always be improved and it’s great the new alpha.8 won’t combine the compressed landscape & compressed portrait into a single asset.

But to me, the bigger alpha.7 issue is the UI. How could the UI show four distinct assets, when info recognizes relations between all 5 images? And for the one asset that did merge two images, the comparison scores between those two images were the worst of 5 comparisons I made above.

mrm · November 7, 2022, 8:00pm

Asset aggregation in certain circumstances can be nondeterministic when there’s insufficient reliable asset metadata.

There are a number of things I do to try to ensure aggregation is deterministic, but with incremental imports, there are sequences that can happen that could result in different aggregations. I’ve made many improvements to asset aggregation, but it still conceivably could happen with assets that have inferred, or “fuzzy” dates (from stat or from parsing dates out of the filename).

If you’re interested, I can elaborate with an example.

nuk · November 7, 2022, 9:14pm

Am I understanding you correctly?

For original portrait & compressed portrait above, info command returned These two files will be aggregated into a single asset..
UI did not aggregate the original & compressed portrait.
This is expected behavior with the current code.

If I start a brand-new docker image with just the items in the zip folder, alpha.7 creates four assets out of the five images in spite of what info says when comparing the images. And this is expected behavior with the current code?

mrm · November 8, 2022, 6:11pm

So what I think should be the expected outcome, eventually, is two assets, where the landscapes and portraits are aggregated. I’d expect v1.1 to aggregate these all together, and, because the colors match reasonably but the image hash is borderline, v2.1-alpha.7 will aggregate randomly.

FWIW, if these same images had trustworthy captured-at times, stored in tags (rather than relying on stat’s mtime or birthtime), aggregation should “just work” with both v1.1 and v2.1-alpha.7.

The current code doesn’t take into account aspect ratio: I added that support last night. I also added a feature to info so that it will show what expected clusters will result when given more than 1 file.

As I said before, these clusters may not be what results from an import, as the info tool has the advantage of sorting the photos beforehand by “primary variant sort order”, which ensures the result is deterministic. See more here: https://photostructure.com/faq/what-do-you-mean-by-deduplicate/#how-does-photostructure-pick-which-file-to-show

There’s a directory scanner process that feeds a work queue, and sync pops filenames off of that queue during sync/import. To make the import of a directory deterministic, I could do the same sort of thing: wait for all the filename candidates to be found, then sort them all, then process in that order, but as soon as you import multiple volumes, the sorting issue hits us again.

nuk · November 8, 2022, 11:29pm

I’ve read through all of What do you mean by deduplicate and see no mention of randomness or indeterminacy for either aggregation/deduplication/clustering process or in the “picking best variant” process.

To me, aggregation/deduplication is one of Photostructure’s most important features. Once aggregation is done, picking the best variant for default display doesn’t matter nearly as much, the algorithm you already documented is more than sufficient.

If the directory scanner and info tool can’t easily and deterministically handle deduplication during importing, then please let users manually trigger a deduplication process for the current library. I want maximally de-duplicated assets.

mrm · November 9, 2022, 12:12am

Apologies–this is a bit hard to explain. Maybe an example can help:

When PhotoStructure is visiting a file it’s never seen before, it tries to “adopt” an existing asset (based on captured-at time). If this is a typical asset from a smartphone or digital camera, the time is accurate to the second (and sometimes fractional millisecond!). The likelihood that your library has more than 1 asset taken at the same millisecond is quite low, so the query results are typically either 0 or 1 row.

If the captured-at time is “fuzzy” (say, when the time is from mtime, or we only have a year, month, and day), then there may be thousands of “adoption candidate” assets in your library (all taken in the same year, or same month).

PhotoStructure only looks at the top 256 candidates, sorted by nearest captured-at time, to prevent the file import process from timing out. I’d hard-coded this 256 value–along with a // TODO: good luck explaining this–but here I am explaining it–so I just added this new setting: maxContemporaryAdoptionAssets. Here’s the description:

To handle photos and videos with “fuzzy” captured-at times (those that are missing second, minute, hour, or even day resolution), how many previously-imported assets with nearby captured-at times should PhotoStructure look for in your library to find an adoption candidate?
Higher values will slow down imports, but may result in more accurate de-duplication results.

So, here’s where this issue comes into play: if none of the candidates are relevant, but the “correct” asset to adopt the file was, say, 257 assets away from the candidate’s captured-at-time, you’d get 2 assets in your library that actually are the same thing.

The “library rebuild” process reconsiders asset aggregations automatically.

nuk · November 9, 2022, 3:52am

Thank you for your patience, I really appreciate it!

Ok, I will try setting maxContemporaryAdoptionAssets to 100,000. In my example five images, somehow the fifth image got a new date (Nov 2022 instead of Apr 2020) when I copy-pasted it into the review folder. I still want PhotoStructure to aggregate visually similar assets regardless of date, so that’s the best option for me at this time.

In the future, perhaps you could index an “average color score” for images, then search for duplicates not just with nearest 256 images by date, but also nearest 256 images by “average color score”. Here’s a random library I found that talks about “average hashing” which is maybe what I’m thinking. I guess the overall concept is “procedural hashing”
https://idealo.github.io/imagededup/

mrm · November 9, 2022, 5:45pm

I actually do already! Check your AssetFile table–those mode0, mode1, … values are color modes, in CIELAB space, encoded with the same sort of bitzip algorithm that geohashes use.

Perceptual hashing (phash) is a variant of mhash. Both mhash and phash use greyscales, so they’re “color blind.” PhotoStructure uses a CIELAB phash to avoid colorblindness.

I tried wavelet hashing, and, at least for the image benchmarking set that I have (several hundred images picked to represent common “image exemplars”, like selfies, portraits, foodie shots, group portraits, pets, nature, sunsets, …), I was surprised that it didn’t do better than the current CIELAB phash.

I hadn’t thought of using Mobilenet embeddings (what this package calls “CNN”) for image deduplication though–that’s something I’ll put on my to-check-out-later list. Thanks for the link!

nuk · November 9, 2022, 11:39pm

Awesome, glad to hear you’re already thinking about it and have tried a few different phashes!

In the future, I suggest considering candidates not just by nearby indexed date, but also nearby indexed perceptual hash (perceptual makes more sense than procedureal, thanks for noting that!).

To filter the additional candidates from indexed phash, you might add new config parameters adoptionCandidatesPhashMaxQuantity and adoptionCandidatesPhashMaxDifference, to limit quantity and/or quality of candidates.

mrm · November 10, 2022, 6:55pm

I do look for assets with similar image content, but only if the captured-at time is not exact.

I have a bunch of the same images I’ve taken over the years to show seasonal progression, for example, and I don’t want those images merged.

mrm · August 14, 2023, 8:23pm

OK, circling back (thanks for the reminder on Discord!

There are a bunch of new features/changes in v23.8 to handle these cases and help debug them in the future:

The new assetAggregation setting, which should avoid the a-matches-b-matches-c-but-a-doesn’t-match-c nondeterminism issue.

# ----------------------------------------
# PS_ASSET_AGGREGATION or assetAggregation
# ----------------------------------------
#
# How should assets be aggregated?
#
# - "union" will allow asset file variants to join an asset if they match
# *any* existing variant.
#
# - "intersection" will only allow asset file variants to join an asset if
# they match *all* existing variants.
#
# Versions prior to 23.8 defaulted to "union" behavior.

The new allowFuzzyDateImageHashMatches setting, which will allow stat-based captured-at photos to try to match against similar images. Note that the default is false, though, as it will cause scanned images to be aggregated possibly too aggressively. You’ll want to set this to true.

# ------------------------------------------------------------------------
# PS_ALLOW_FUZZY_DATE_IMAGE_HASH_MATCHES or allowFuzzyDateImageHashMatches
# ------------------------------------------------------------------------
#
# For images that don't have a reliable precise captured-at time (say, from
# "stat" or datestamp from pathname), can we aggregate assets purely by exact
# image hash matches?
#
# See https://forum.photostructure.com/t/deduplicate-shenanigans/1732/11 for
# more details.

The info tool now handles more than 2 files, by automatically adding a “clusters” field, that lets you try out different deduping settings:

$ ./photostructure info $(find '/tmp/nuk' -type f) --filter clusters
{
  clusters: [
    [ 'psfile://2NMQsMVCK/tmp/nuk/20200427_225616.jpg' ],
    [ 'psfile://2NMQsMVCK/tmp/nuk/20200427_225538.jpg' ],
    [ 'psfile://2NMQsMVCK/tmp/nuk/review/806525515_15743.jpg' ],
    [ 'psfile://2NMQsMVCK/tmp/nuk/806525515_15743.jpg' ],
    [ 'psfile://2NMQsMVCK/tmp/nuk/107918631_157209.jpg' ]
  ]
}

$ allowFuzzyDateImageHashMatches=1 ./photostructure info $(find '/tmp/nuk' -type f) --filter clusters
{
  clusters: [
    [
      'psfile://2NMQsMVCK/tmp/nuk/20200427_225616.jpg',
      'psfile://2NMQsMVCK/tmp/nuk/review/806525515_15743.jpg',
      'psfile://2NMQsMVCK/tmp/nuk/806525515_15743.jpg'
    ],
    [
      'psfile://2NMQsMVCK/tmp/nuk/20200427_225538.jpg',
      'psfile://2NMQsMVCK/tmp/nuk/107918631_157209.jpg'
    ]
  ]
}

(so I believe that aggregation is what you’re expecting, correct?)

nuk · August 16, 2023, 12:34am

“landscape and portrait are properly kicked into different assets”
That’s what I would expect, and what is accomplished by default according to your latest output from info.

allowFuzzyDateImageHashMatches=1 sounds like a good feature, but I don’t think it should completely ignore aspect ratio and thus I don’t think it should make a difference for these particular exemplars. Clearly it should not default to True based on your example of images with seasonal progression.