Photos not properly deduped due to lensId issue

I have two copies of the same photo, and PhotoStructure did not correctly dedupe them:

Original (6.26 MB): http://d.ls/photostructure/dupebug1.jpg
Version from Google Photos takeout (2.73 MB): http://d.ls/photostructure/dupebug2.jpg

It seems like Google messed with the metadata slightly, as the word “Sony” is missing from the lensId of the Google version:

PS D:\PhotoStructure> docker run --mount type=bind,source=D:\PhotoStructure\,target=/data photostructure/server photostructure info "/data/2014/2014-07-18/DSC00717-1.jpg" "/data/2014/2014-07-18/DSC00717.JPG"
{
  message: 'These files differ: Different lensId: LensID:E PZ 16-50mm F3.5-5.6 OSS != LensID:Sony E PZ 16-50mm F3.5-5.6 OSS',
  similarImages: false,
  meanHamm: 0.97,
  labModesCorr: 0.96,
  a: {
    '$ctor': 'models.AssetFile',
    aperture: 9,
    capturedAtLocal: 2014071803411300,
    capturedAtPrecisionMs: 0,
    capturedAtSrc: 'tags:DateTimeOriginal',
    fileSize: 2726930,
    focalLength: '16.0 mm',
    height: 4912,
    iso: 100,
    lensId: 'LensID:E PZ 16-50mm F3.5-5.6 OSS',
    make: 'Sony',
    meanHash: '+fnhgcMjdzPg4J683FD4mAMHPz8ff///',
    mimetype: 'image/jpeg',
    mode0: 3136,
    mode1: 3616,
    mode2: 1380,
    mode3: 3620,
    mode4: 1572,
    mode5: 1792,
    mode6: 3584,
    model: 'NEX-6',
    mountpoint: '/data',
    mtime: 1609577680000,
    sha: 'pRfLsCwxinTpggM14RsYcw5kh/NYJcUL',
    shutterSpeed: '1/250',
    uri: 'psfile://2jPPUQgqo/2014/2014-07-18/DSC00717-1.jpg',
    version: 11,
    width: 3264
  },
  b: {
    '$ctor': 'models.AssetFile',
    aperture: 9,
    cameraId: 'InternalSerialNumber:1700990c',
    capturedAtLocal: 2014071803411300,
    capturedAtPrecisionMs: 0,
    capturedAtSrc: 'tags:DateTimeOriginal',
    fileSize: 6258688,
    focalLength: '16.0 mm',
    height: 4912,
    imageId: 'ShutterCount:772',
    iso: 100,
    lensId: 'LensID:Sony E PZ 16-50mm F3.5-5.6 OSS',
    make: 'Sony',
    meanHash: '+fnhgcMjNzPAwL683FD4OAMHPz8ff///',
    mimetype: 'image/jpeg',
    mode0: 3136,
    mode1: 3616,
    mode2: 3620,
    mode3: 1572,
    mode4: 1792,
    mode5: 3584,
    mode6: 3588,
    model: 'NEX-6',
    mountpoint: '/data',
    mtime: 1405683672000,
    rotation: 270,
    sha: 'Bf4kAn0QAz9g7NBhdDQSfg3yqrFFc6Xb',
    shutterSpeed: '1/250',
    uri: 'psfile://2jPPUQgqo/2014/2014-07-18/DSC00717.JPG',
    version: 11,
    width: 3264
  }
}

(a is the Google compressed version; b is the original)

Notice that in the original, the lensId is “Sony E PZ 16-50mm F3.5-5.6 OSS”, whereas in the Google version it’s just “E PZ 16-50mm F3.5-5.6 OSS” (no “Sony”).

There are quite a few photos like this, so I now have a large number of duplicates in my library. Is there a workaround (e.g. some way to ignore lensId when checking for dupes)? Alternatively, if/once you fix this issue, would a sync properly dedupe them?

Thanks!

Ugh, sorry about that!

There’s no workaround that I can think of, sorry. I added quite a few more settings for image deduping in v1.0.0, but that version isn’t released yet.

A “rebuild” re-aggregates assets, so that’s what you’ll want. A “sync” is much faster, as it makes the assumption that current asset aggregations are correct.

No worries :slight_smile: as a developer myself, I totally understand that there are always edge cases with complex features like this.

Will this particular issue be fixed in v1.0.0?

Would you consider adding some way to manually mark two images as duplicates for cases where the algorithm doesn’t correctly detect it? A command-line tool for that would be fine.

I’ve just made the cameraId, imageId, and lensId look for matches that may or may not include a make. (Hacks like this make me feel a bit :nauseated_face: but anything is better than having dupes…)

Here’s the result: :tada:

$ ./photostructure info ~/Desktop/merge/dupebug*
{
  fileComparison: 'These two files will be aggregated into a single asset.',
  variant: true,
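
For readers curious what “matches that may or may not include a make” could mean in practice, here is a minimal sketch. It is not PhotoStructure’s actual code: `stripMake` and `idsMatchIgnoringMake` are hypothetical names, and the rule (drop a leading make from the value after the `Label:` prefix, then compare case-insensitively) is an assumption based only on the two lensId values shown earlier in this thread.

```typescript
// Hypothetical helper, NOT PhotoStructure's implementation.
// Splits "Label:value", drops a leading "<make> " from the value if present,
// and returns a lowercased form suitable for equality checks.
function stripMake(id: string, make: string): string {
  const colon = id.indexOf(":");
  const label = colon >= 0 ? id.slice(0, colon + 1) : "";
  let value = (colon >= 0 ? id.slice(colon + 1) : id).trim();
  const prefix = make.trim().toLowerCase() + " ";
  if (make.trim() !== "" && value.toLowerCase().startsWith(prefix)) {
    value = value.slice(prefix.length);
  }
  return (label + value.trim()).toLowerCase();
}

// True when the two IDs are equal once an optional make prefix is ignored.
function idsMatchIgnoringMake(a: string, b: string, make: string): boolean {
  return stripMake(a, make) === stripMake(b, make);
}

// The two lensId values from the `photostructure info` output above:
const googleCopy = "LensID:E PZ 16-50mm F3.5-5.6 OSS";
const original = "LensID:Sony E PZ 16-50mm F3.5-5.6 OSS";
console.log(idsMatchIgnoringMake(googleCopy, original, "Sony")); // true
```

The same comparison would apply to cameraId and imageId values, since they share the `Label:value` shape.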

Sure: but what would that look like?

Presumably this tool would take 2 or more paths, generate a UUID, and add that value as a tag to each file so future imports would aggregate them appropriately.

(There’s an ImageUniqueID tag in EXIF, but that’s already being happily erased and edited by Google Photos and Adobe products, which is why PhotoStructure completely ignores it.)

There’s MasterDocumentID, ShortDocumentID, and UniqueDocumentID (see IPTC tags): I think any of those would be more promising, and the tag itself could be a Setting.
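
A rough sketch of the tagging step described above, under stated assumptions: `buildTagCommand` is a hypothetical name, `IPTC:UniqueDocumentID` is just one of the candidate tags mentioned (picked here as a default), and the function only builds an ExifTool argument list using ExifTool’s standard `-TAG=VALUE` write syntax. It does not run anything.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: builds (but does not execute) an ExifTool command
// that writes one shared ID into every given file.
function buildTagCommand(
  paths: string[],
  tag: string = "IPTC:UniqueDocumentID", // assumed default; could come from a setting
  id: string = randomUUID() // fresh UUID unless the caller supplies one
): string[] {
  if (paths.length < 2) {
    throw new Error("Need at least two files to mark as duplicates");
  }
  return ["exiftool", `-${tag}=${id}`, ...paths];
}

// Example: prints the command that would tag both copies with the same UUID.
console.log(buildTagCommand(["DSC00717.JPG", "DSC00717-1.jpg"]).join(" "));
```

Because the ID lives in the files themselves, the pairing would survive a library rebuild and would not depend on PhotoStructure’s database.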

Basically the AssetFileComparator would then need to immediately match up asset files if this magic tag had matching contents.
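
As an illustration only, the short-circuit could look something like the sketch below. `TaggedFile`, `duplicateGroupId`, and `compareByMagicTag` are hypothetical names; the real AssetFileComparator and AssetFile model are not shown in this thread and certainly differ.

```typescript
// Hypothetical shape, not PhotoStructure's AssetFile model.
interface TaggedFile {
  uri: string;
  duplicateGroupId?: string; // value read from the configured "magic" tag, if any
}

type Verdict = "same-asset" | "undecided";

// Returns "same-asset" as soon as both files carry the same non-empty ID.
// Otherwise returns "undecided" so the usual metadata and image-hash
// comparisons can still run.
function compareByMagicTag(a: TaggedFile, b: TaggedFile): Verdict {
  const idA = a.duplicateGroupId?.trim();
  const idB = b.duplicateGroupId?.trim();
  if (idA && idB && idA === idB) return "same-asset";
  return "undecided";
}
```

Note that in this sketch a mismatching or missing tag never forces files apart; it only declines to decide, which keeps untagged libraries behaving as they do today.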

If you want this (or something else like it, if you have better ideas), feel free to add it as a feature request!

Thank you! Sounds great. Looking forward to trying it out :slight_smile:

Heh, I know very little about EXIF, so I didn’t think about it too much :sweat_smile:

I was thinking they could be marked as dupes only in the PhotoStructure database, or perhaps in a sidecar file, but I’m not sure if that’s something you’d want to do. I guess it’s better to have some permanent marker directly in the EXIF data such that it’s not coupled to PhotoStructure itself?

Yeah, that would be my thinking.

There will be metadata that’s only in the PhotoStructure library database (like sharing access tokens), but I’d like to minimize that as much as possible.