Support manually editing capture time

@mrm this discussion has me questioning the very foundation of my workflow!

I’ve indeed discovered PhotoStructure re-adding photos that were already there–because I retroactively tagged version A of the photo in the PS library with digiKam.

PS circled back some time later, found a different SHA for that image, and brought it in as *-1.jpg. I have 6035 of these now!

I’m trying to figure out how to keep PhotoStructure in my workflow–I love its deduplication and presentation style. digiKam is great for deep management, but not great for display.

It would be neat to configure PS to overlook EXIF data when deduplicating, but I imagine that would take an entirely different approach than just hashing the file. I’ve seen some Python modules that do funky math magic to create a sort of common-pixel fingerprint on just the image data… that dedup work was able to identify the same image even if it was resized.
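In case it helps to see what I mean, here’s a minimal sketch using the Python Pillow and imagehash packages (the file names and the distance threshold are just made up for illustration):

```python
from PIL import Image
import imagehash

# "Mean hash": a 64-bit fingerprint computed from the pixel content only,
# so metadata edits never change it.
original = imagehash.average_hash(Image.open("IMG_1234.jpg"))

# The same shot after being resized and re-saved somewhere else.
resized = imagehash.average_hash(Image.open("IMG_1234-small.jpg"))

# Subtracting two hashes gives the Hamming distance: 0 means identical
# fingerprints, and small values mean "very likely the same image".
distance = original - resized
print(f"hash distance: {distance}")
if distance <= 5:  # threshold is a guess; tune for your own library
    print("probably the same photo")
```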

Do you have any suggestions on a sequence or strategy that allows retroactive tagging of things in a PS library?

Oof, sorry for PhotoStructure adding those duplicates!

Unfortunately, v2.1.0-alpha.7 only had one image hash algorithm (a “mean hash” triplet in L*a*b colorspace), which (by design) would collide with very similar image content.

Prior versions just looked at the image hash, which resulted in quite-differently-timestamped images being aggregated together if the shot was basically the same.

The next build includes two new image hash algorithms–DCT and gradient/diff hashes–so if all three hashes match, the image content is very similar.
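If you want to play with those ideas outside of PhotoStructure, the Python imagehash package implements both hash families–this is only a sketch of the concept (with made-up file names), not PhotoStructure’s actual code:

```python
from PIL import Image
import imagehash

img_a = Image.open("variant-a.jpg")
img_b = Image.open("variant-b.jpg")

# DCT hash: fingerprints the low-frequency structure of the image.
dct_a, dct_b = imagehash.phash(img_a), imagehash.phash(img_b)

# Gradient/difference hash: fingerprints brightness changes between
# neighboring pixels, so it fails in different ways than the DCT hash.
diff_a, diff_b = imagehash.dhash(img_a), imagehash.dhash(img_b)

# Requiring every hash to agree makes accidental collisions much rarer
# than relying on any single hash.
if dct_a - dct_b == 0 and diff_a - diff_b == 0:
    print("very similar image content")
else:
    print("different images")
```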

So–what would be the “clever” solution for PhotoStructure to not just generate more duplicates in your library? Basically–how do I do the right thing if one variation of a photo’s captured-at time was edited to be correct, but there are other variations with the old (incorrect) time?

Approach 0: PhotoStructure edits the captured-at time, and therefore knows what’s going on, and can do the right thing.

Approach 1: I could SHA just the image content (basically SHA’ing the image after stripping all tags), but this can be “tricked” by editing the image contents.

Approach 2: Some software applies a unique identifier (a “uuid”) to all images–and all edits, as long as the uuid persists, are considered the “same” image. The issue with this approach is that it relies on other external software retaining these UUID tag values when editing/re-saving. Some image formats (like Apple’s live-photo HEIC) already include UUIDs.

Can you think of another approach? I’m happy to talk it through (either here or on Discord).

Approach 1 seems the most reasonable of the three to me because:

  • PS already has a SHA hashing framework in place, so regression testing would be brief
  • EXIF is generally standardized in its placement and structure, so an EXIF stripper ought not be too difficult to implement in a way that doesn’t hurt sync durations
  • I think if the image itself changes (and thus the image SHA), it would be reasonable to consider it a different photo. Even something as relatively ordinary as a white-balance correction would produce an image that looks different from its parent. Despite that, bloating a date with 10 near-alike pictures because a user was tinkering with filters will feel burdensome later when they’re just looking through the album with friends. A PS option to group very-similar items together and visually display the most recent/oldest/biggest/EXIF-heaviest might be a nice way to coexist with such duplicates and reduce visual clutter, while retaining recognition of their differences should that level of distinction be desired now and then.
  • I think most users would find it useful to be able to retroactively edit the metadata of their library. PS is a very fine cataloging and display application that stakes no claim on classification, which leaves a complementary place open for a tool like digiKam.

One thing I forgot to mention is that I’m still using v1.1.0. I’ve tried 2.* versions with snapshots of this library and both times ended in ruin (though I no longer remember the specifics). I can give it another go if there’s a particular chunk of telemetry you’d be interested in seeing when PS 2.x is placed in front of 180,000 pictures.

Just to make sure I’m following along, the issue you’re describing is only relevant if you use Automatic Organization, right?

I do not use Automatic Organization, and I have never seen any problems with editing tags (date or otherwise) causing duplicates in PhotoStructure.

@tkohhh: Yes–Automatic Organization has to be enabled for PhotoStructure to make copies of a given photo or video.

Automatic organization (as it stands) ensures there’s a copy of every unique file SHA in your library. Unfortunately, every metadata edit will result in a “unique” new file SHA.
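To make that concrete, here’s a quick sketch (plain Python plus the exiftool CLI, with a made-up file name) of why a captured-at edit looks like a brand-new file to a whole-file SHA:

```python
import hashlib
import subprocess

def file_sha(path: str) -> str:
    """SHA-256 of the raw file bytes, metadata and all."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

before = file_sha("IMG_1234.jpg")

# Edit only the captured-at time (exiftool keeps an *_original backup by default).
subprocess.run(
    ["exiftool", "-DateTimeOriginal=2019:06:01 12:00:00", "IMG_1234.jpg"],
    check=True,
)

after = file_sha("IMG_1234.jpg")
print(before == after)  # False: same pixels, but a "new" file as far as SHAs go
```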

I’m warming up to “approach 1” above. ExifTool just added ImageDataMD5 support for several filetypes (unfortunately, it doesn’t handle HEIC/HEIF or many video types yet, but every bit helps), and I just pulled that into exiftool-vendored (which is the library PhotoStructure uses to do metadata I/O):

https://github.com/photostructure/exiftool-vendored.js/blob/main/CHANGELOG.md#v2150

For unsupported filetypes, I’ll need to do the work myself to strip metadata and then SHA the result.
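Something along these lines, conceptually–decode the image and hash only the pixel buffer (a rough Python sketch of the idea with a made-up file name, not the actual implementation):

```python
import hashlib
from PIL import Image

def image_content_sha(path: str) -> str:
    """Hash only the decoded pixel data, ignoring every metadata tag."""
    with Image.open(path) as img:
        # Convert to a fixed mode so the hash doesn't depend on the
        # stored color mode (RGBA, palette, etc.).
        pixels = img.convert("RGB").tobytes()
    return hashlib.sha256(pixels).hexdigest()

# A captured-at edit changes the file's SHA, but not this one:
print(image_content_sha("IMG_1234.jpg"))
```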

I was thinking last night that this could be combined, very much like I do with volume UUID SHAs, with the “unique id” extracted from:

  • DocumentId
  • OriginalDocumentID
  • BurstUUID
  • MediaGroupUUID
  • BurstId
  • CameraBurstID
  • InstanceId

(all of which may have UUIDs associated with them)
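Conceptually, the extraction side would look something like this (a Python sketch shelling out to the exiftool CLI rather than going through exiftool-vendored; tag names copied from the list above, and exiftool is forgiving about their exact capitalization):

```python
import json
import subprocess

# Tags that may carry a stable UUID across edits and re-saves.
UUID_TAGS = [
    "DocumentId",
    "OriginalDocumentID",
    "BurstUUID",
    "MediaGroupUUID",
    "BurstId",
    "CameraBurstID",
    "InstanceId",
]

def unique_ids(path: str) -> dict:
    """Return whichever of the UUID-ish tags this file actually carries."""
    out = subprocess.run(
        ["exiftool", "-json", *[f"-{tag}" for tag in UUID_TAGS], path],
        capture_output=True, text=True, check=True,
    )
    tags = json.loads(out.stdout)[0]
    # exiftool reports canonical tag names, so compare case-insensitively.
    wanted = {t.lower() for t in UUID_TAGS}
    return {k: v for k, v in tags.items() if k.lower() in wanted}

print(unique_ids("IMG_1234.HEIC"))
```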
