Improve Import Performance over Network

AndreasFreund · February 18, 2022, 9:28pm

I’m currently importing 50,000 images into PhotoStructure for Windows. The images are stored on a network share (accessed directly via the full path \\host\share) and the import is running a lot slower than expected.

The import is progressing at ~5 images per second, while the network connection is fully utilized at 98 MB/s. As the images are ~3 MB on average, this means that every image is read 98 / 3 / 5 ≈ 6.5 times over the network. I assume one of the reads is the hash calculation, while the other ~5.5 reads are the thumbnail creation. Windows is probably not caching the image files immediately (or at all), leading to bad performance.

A simple optimization could be to create only the largest thumbnail from the original file and then use that thumbnail as the source for all other thumbnails. This would speed up the import roughly threefold in this specific case, if my assumptions above are correct.
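The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the import is purely network-bound, and that after the proposed change each image would be read in full exactly twice (once for the hash, once to build the largest thumbnail). The variable names are made up for illustration.

```python
# Figures quoted in the post; everything below is a rough estimate.
link_mb_per_s = 98   # observed network throughput
avg_image_mb = 3     # approximate average image size
images_per_s = 5     # observed import rate

# How many full copies of each image cross the wire right now.
observed_reads = link_mb_per_s / avg_image_mb / images_per_s

# Assumption: one read for the hash plus one read for the largest
# thumbnail; the smaller thumbnails are derived locally from that one.
remaining_reads = 2

# If the network is the only bottleneck, throughput scales inversely
# with the number of reads per image.
speedup = observed_reads / remaining_reads

print(f"{observed_reads:.1f} reads per image now")
print(f"~{speedup:.1f}x faster with {remaining_reads} reads per image")
```

If something other than the network becomes the bottleneck first (CPU time for decoding, for example), the real gain would be smaller than this figure.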

mrm · February 19, 2022, 10:34pm

Welcome to PhotoStructure, @AndreasFreund !

PhotoStructure tries to be light-footed on disk reads, and should only fully read any given image twice: once to compute a SHA, and once again to decode the image. Reading the metadata tags should only need a seek and a partial read of the header. For RAW images, if an embedded TIFF or JPG is available, I’ll use that instead, so that’s another seek-and-partial-read.

The file SHA is used as a second step (after stat-ing the file for file size and mtime) for determining if the file has been touched since the last import, and for finding exact file matches in the future. The SHA is also used as one of the deduplication heuristics, and for validating that the file was copied to your library without errors. Unfortunately, this results in your network tossing any given file over the wire 2+ times (or 4+ times if you have “automatic organization” enabled!)
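To make the stat-then-SHA idea concrete, here is a minimal sketch in Python. It is not PhotoStructure's actual code: `FileRecord`, `snapshot`, and `has_changed` are hypothetical names, SHA-256 is an assumption (the post only says "SHA"), and the exact order of checks is one plausible reading of the description above.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical record of what was seen at the previous import."""
    size: int
    mtime_ns: int
    sha: str


def sha_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks. This is the step that reads every byte."""
    digest = hashlib.sha256()  # assumption: the SHA variant is not stated
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path: str) -> FileRecord:
    """Record size, mtime, and content hash for later comparison."""
    st = os.stat(path)
    return FileRecord(st.st_size, st.st_mtime_ns, sha_of(path))


def has_changed(path: str, previous: FileRecord) -> bool:
    """Cheap stat check first; fall back to the hash only when needed."""
    st = os.stat(path)
    if st.st_size != previous.size:
        return True   # different size: content must differ, no read needed
    if st.st_mtime_ns == previous.mtime_ns:
        return False  # size and mtime both match: treat as untouched
    # Same size but a new mtime: only the content hash can tell.
    return sha_of(path) != previous.sha
```

On a network share the `os.stat` call is cheap, while `sha_of` pulls the whole file over the wire, which is why the hash is kept as the second step.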

You can avoid one of these reads by disabling SHA file copy validation: set the library setting verifyFileCopies to false.

As for creating the largest thumbnail first and deriving the smaller ones from it: that’s actually how PhotoStructure already builds all previews (both original-aspect and square).

The resulting JPGs still need to get pushed over the network wire to get stored (if your library is on a remote filesystem).

If you only visit PhotoStructure on small-resolution devices, you can skip the larger preview sizes by setting the previewResolutions setting to include only smaller sizes. Here are the docs for that setting:

# +----------------------+
# |  previewResolutions  |
# +----------------------+
#
# This controls the resolutions that PhotoStructure creates for every asset.
# Note that resolutions will be skipped if there already is a preview value
# with 2.5x the megapixels, so even though there are a lot of sizes here,
# you'll only see 3-4 images on your disk per asset.
#
# environment: "PS_PREVIEW_RESOLUTIONS"
# validValues: "uhd8k", "uhd5k", "uhd4k", "qhd", "fhd", "hd", "wvga", "qvga"
# or "qqvga"
#
previewResolutions = [
#  "uhd4k", # so you could skip this if you don't have any 4k displays...
  "fhd",
  "wvga",
  "qqvga"
]

More information about this is here: https://photostructure.com/getting-started/how-much-disk-space-do-i-need-for-my-photostructure-library/#how-can-i-minimize-the-disk-space-that-photostructure-uses

All this said: I could add a setting that auto-enables for large-memory-endowed servers to read files into an LRU/FIFO cache once, and only operate from in-memory caches. I’ll think about that.
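For anyone curious what such a cache might look like, here is a rough Python sketch of a byte-bounded LRU read-through cache. It only illustrates the idea and is not an existing PhotoStructure feature: `FileBytesCache`, `read_from_disk`, and the byte budget are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


def read_from_disk(path: str) -> bytes:
    """The expensive step: pulls the whole file over the network."""
    with open(path, "rb") as f:
        return f.read()


class FileBytesCache:
    """Keeps recently read files in memory, up to a fixed byte budget."""

    def __init__(self, max_bytes: int,
                 read_file: Callable[[str], bytes] = read_from_disk):
        self.max_bytes = max_bytes
        self.read_file = read_file
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0

    def get(self, path: str) -> bytes:
        if path in self._entries:
            # Cache hit: mark as most recently used, no network read.
            self._entries.move_to_end(path)
            return self._entries[path]
        data = self.read_file(path)
        if len(data) <= self.max_bytes:
            self._entries[path] = data
            self._used += len(data)
            # Evict least recently used entries until back under budget.
            while self._used > self.max_bytes:
                _, evicted = self._entries.popitem(last=False)
                self._used -= len(evicted)
        # Files larger than the whole budget are returned uncached.
        return data
```

With something like this, the hashing step and the decoding step would both call `cache.get(path)`, so each file crosses the network once as long as it is still in the cache when the second step runs.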

Cheers,

Matthew