Duplicate files in library with different extension capitalization

I have duplicates of what should be identical photos in my PhotoStructure library (/ps/library). The photos were imported to my computer by two different programs (digiKam and Rapid Photo Downloader) into two different directories, each of which is mounted into PhotoStructure via my Docker Compose settings.

They have the exact same filename, except that the extension on one is an uppercase .JPG while the other is a lowercase .jpg. Other than that, the photos should be identical.

Running photostructure info to compare the two yields the following:

{
  message: 'These files are similar',
  similarImages: true,
  meanHamm: 1,
  labModesCorr: 1,
  a: {
    '$ctor': 'models.AssetFile',
    aperture: 2,
    cameraId: 'SerialNumber:12A00336',
    capturedAtLocal: 2021071016504700,
    capturedAtPrecisionMs: 0,
    capturedAtSrc: 'tags:DateTimeOriginal',
    fileSize: 14288033,
    focalLength: '23.0 mm',
    height: 4160,
    iso: 160,
    make: 'Fujifilm',
    meanHash: 'PhofHh4eDgY9Pz0jI6TgIn9DAhweHx8P',
    mimetype: 'image/jpeg',
    mode0: 658,
    mode1: 1097,
    mode2: 662,
    mode3: 2747,
    mode4: 2751,
    mode5: 2745,
    mode6: 2749,
    model: 'X100V',
    mountpoint: '/ps/library',
    mtime: 1625950248000,
    rotation: 0,
    sha: 'yUNWECck6Nv/w3HR3q9tSvHq+3BNVphe',
    shutterSpeed: '1/1300',
    uri: 'pslib:/2021/2021-07-10/DSCF0156.jpg',
    version: 11,
    width: 6240
  },
  b: {
    '$ctor': 'models.AssetFile',
    aperture: 2,
    cameraId: 'SerialNumber:12A00336',
    capturedAtLocal: 2021071016504700,
    capturedAtPrecisionMs: 0,
    capturedAtSrc: 'tags:DateTimeOriginal',
    fileSize: 14278209,
    focalLength: '23.0 mm',
    height: 4160,
    iso: 160,
    make: 'Fujifilm',
    meanHash: 'PhofHh4eDgY9Pz0jI6TgIn9DAhweHx8P',
    mimetype: 'image/jpeg',
    mode0: 658,
    mode1: 1097,
    mode2: 662,
    mode3: 2747,
    mode4: 2751,
    mode5: 2745,
    mode6: 2749,
    model: 'X100V',
    mountpoint: '/ps/library',
    mtime: 1626293114000,
    rotation: 0,
    sha: 'U0jC4Mqg1kL/d43sAOoFHZxw1AyZrRon',
    shutterSpeed: '1/1300',
    uri: 'pslib:/2021/2021-07-10/DSCF0156.JPG',
    version: 11,
    width: 6240
  }
}

I see that they have slightly different file sizes and SHA hashes (though the mean hash is the same). I have no idea how that's possible, given that they were imported from the same memory card.

Environment

PhotoStructure v0.9.1, running via Docker Compose on an Alpine Linux VM.

Howdy, thanks for reporting this.

To clarify, a “mean hash” is a coarse view of image contents. The file SHA will match if and only if the file contents are the same.
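
If you’re curious what a mean hash boils down to, here’s a rough sketch of the classic 8×8 “average hash” using ImageMagick and awk (use convert instead of magick on ImageMagick 6). This is only an illustration of the idea, not PhotoStructure’s actual meanHash, which is longer and computed a bit differently: shrink the image to a tiny grayscale thumbnail, then record which pixels are brighter than the mean.

$ magick DSCF0156.jpg -resize 8x8! -colorspace Gray -depth 8 gray:- \
    | od -An -tu1 \
    | awk '{ for (i = 1; i <= NF; i++) px[n++] = $i }  # collect the 64 gray pixels
           END {
             for (i = 0; i < n; i++) sum += px[i]
             mean = sum / n
             # one bit per pixel: brighter than the mean, or not
             for (i = 0; i < n; i++) printf "%d", (px[i] > mean)
             print ""
           }'

Two visually similar files produce bitstrings that differ in only a few positions, which is what the small Hamming distance (the meanHamm: 1 in your output above) reflects.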

If the file SHAs are different, they’re two different files.
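
If you want to see exactly what differs between your two copies, diffing ExifTool dumps usually points straight at it. Something like this, run from /ps/library (-a includes duplicated tags, -G1 labels each tag’s group, -s uses short tag names):

$ exiftool -a -G1 -s 2021/2021-07-10/DSCF0156.jpg > a.txt
$ exiftool -a -G1 -s 2021/2021-07-10/DSCF0156.JPG > b.txt
$ diff a.txt b.txt

The System group lines (file name, file dates) will always differ; anything else in the diff is a real metadata difference between the two files.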

Does the different file name, normally stored in the EXIF metadata, potentially cause the hash difference?

File names aren’t typically stored in metadata (at least, not when files come straight off a device): if you’re looking at ExifTool’s output, know that a bunch of its fields come from the dirent or the inode, or are inferred from the encoded image.

As an example, here’s PhotoStructure’s 1-pixel PNG that’s been stripped of metadata:

$ exiftool -j public/images/1.png 
[{
  "SourceFile": "public/images/1.png",
  "ExifToolVersion": 11.88,
  "FileName": "1.png",
  "Directory": "public/images",
  "FileSize": "68 bytes",
  "FileModifyDate": "2021:06:01 11:46:30-07:00",
  "FileAccessDate": "2021:07:19 10:24:10-07:00",
  "FileInodeChangeDate": "2021:06:01 11:46:30-07:00",
  "FilePermissions": "rw-rw-r--",
  "FileType": "PNG",
  "FileTypeExtension": "png",
  "MIMEType": "image/png",
  "ImageWidth": 1,
  "ImageHeight": 1,
  "BitDepth": 8,
  "ColorType": "Grayscale with Alpha",
  "Compression": "Deflate/Inflate",
  "Filter": "Adaptive",
  "Interlace": "Noninterlaced",
  "ImageSize": "1x1",
  "Megapixels": 0.000001
}]
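
If you re-run that with -G1, ExifTool labels where each field comes from. Everything in the System group is read from the dirent and inode, not from bytes inside the file:

$ exiftool -j -G1 public/images/1.png
# "System:FileName", "System:Directory", "System:FileSize", and the three
# file dates all come back prefixed with "System:"; they're filesystem
# metadata, not anything embedded in the PNG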

Here’s the whole file (it’s short!):

$ hexdump -C  public/images/1.png 
00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000010  00 00 00 01 00 00 00 01  08 04 00 00 00 b5 1c 0c  |................|
00000020  02 00 00 00 0b 49 44 41  54 78 9c 63 fa 6f 0c 00  |.....IDATx.c.o..|
00000030  02 3a 01 35 3e 7b f1 a8  00 00 00 00 49 45 4e 44  |.:.5>{......IEND|
00000040  ae 42 60 82                                       |.B`.|
00000044
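
(For the curious: those 68 bytes are just the 8-byte PNG signature plus three chunks, IHDR, IDAT, and IEND. The width, height, bit depth, and color type that ExifTool reported are all parsed from the 13-byte IHDR payload; everything else in its output came from the filesystem.)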

But you’re correct: if you change metadata within the file, that will change the SHA.

(I had thought about SHAing the image content itself, by creating a new image with all metadata stripped, but it turns out you can produce the same pixels from any number of different byte streams, thanks to subtly different image encoding and compression libraries. That’s why I ended up with the mean-hash approach.)
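
You can see why that’s slippery by stripping metadata from the same file with two different tools and comparing the results (a sketch, reusing the file name from above; exiftool only removes metadata segments, while magick decodes and re-encodes the image itself):

$ exiftool -all= -o stripped-exiftool.jpg DSCF0156.jpg
$ magick DSCF0156.jpg -strip stripped-magick.jpg
$ sha1sum stripped-exiftool.jpg stripped-magick.jpg
# two "metadata-free" copies of the same photo, two different byte streams;
# magick's copy was also re-encoded, so even its pixels may no longer
# match the original's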

I’ve seen libraries that hash just the image data (I played with some code to try to dedupe pictures a while back), but they always seemed pretty obscure and not widely supported.

It’s odd to me that this isn’t more common… I care most about the image itself; the metadata is important, but much more likely to have been modified somewhere along the way.

Agreed

I haven’t been hit by this scenario because I store most of my changes as sidecars: the original image files don’t change when I edit metadata, so the file SHA doesn’t change, either.
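
For anyone unfamiliar, a sidecar layout looks roughly like this (naming conventions vary by tool):

$ ls /ps/library/2021/2021-07-10/
DSCF0156.jpg  DSCF0156.jpg.xmp
# edits land in the .xmp sidecar; the JPEG's bytes (and its SHA) never change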

If people aren’t using sidecars, though, I can see how this would be an issue. The fact that you can re-encode an image and produce a dozen different byte streams (and a dozen different “image SHAs”) is what gives me pause.
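
(For what it’s worth, ImageMagick can hash decoded pixels for you, via identify’s %# format, but that signature has the same fragility: it depends on the decoder and build that produced the pixels.)

$ identify -format '%#\n' DSCF0156.jpg DSCF0156.JPG
# %# is a hash of the decoded pixel data, ignoring metadata; a different
# JPEG decoder (or ImageMagick build) can decode to slightly different
# pixels, and therefore a different signature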

I haven’t been bitten by re-encoding so much as by the various tools that don’t use sidecars…