Initial library build - processing pauses

Wanderweger · January 23, 2022, 5:12am

I’ve been trying to build my initial library of 500K+ photos on an external drive and am about halfway through after a few days. At the beginning PS was continuously processing images but now I’m noticing the following behavior:

1. Shows the processing message and processes 1-10 images.
2. The processing message vanishes and the library build seems to pause (no external disk noise).
3. After 30-60 seconds the processing message reappears and I can hear the disk writing again.
4. After processing 1-10 images, it pauses again and the cycle repeats.

I have set skipLibraryOpenLocks=true. Would you have any other suggestions? BTW, I vaguely recall running into a similar issue a year ago with a much earlier version, but I can’t remember how that was solved.

(NOTE: I just found my previous post from Jan 21 - “photo processing bar”, sorry if this is a repost. I’ve also sent you the logs).

mrm · January 24, 2022, 8:14pm

Sorry for this! TL;DR: the next release should address this issue.

If you’re interested in more details, read on:

Last year I stress-tested Version 1.x of PhotoStructure with my 500k+ personal library on my 2-core Intel NUC with an SSD.

Since releasing v1.1, I found that larger or slower libraries (100k+ asset files, or a library on an HDD) had issues on larger-core systems: sync progress would slow down or even stop altogether.

I discovered through system and node profiling that PhotoStructure was “stuck” in disk I/O, and specifically, SQLite mutex operations.

I designed version 1.x with this architecture:

This architecture requires web, sync, and all instances of sync-file to have an open connection to SQLite and the library database.

This works fine as long as the database isn’t large and is on a fast SSD.

Once the library gets larger, or the number of sync-file instances grows, this approach degrades, because each process keeps its own copy of the entire database in sync with the disk: tons of CPU is used just for database bookkeeping.

So, last November I tried a different approach: sync would be the only db writer, and smaller subtasks, like image hashing and preview generation, would be offloaded to threads:
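As a rough illustration of the single-writer idea (a minimal sketch, not PhotoStructure's actual code): workers only return plain results, and one writer object is the only thing that touches the database. Every name here (`AssetResult`, `processFile`, `SingleWriter`, `batchSize`) is hypothetical, the `Map` is a stand-in for the SQLite library database, and the "digest" is a toy checksum rather than a real image hash.

```typescript
// Hypothetical result shape returned by a worker; field names are illustrative.
interface AssetResult {
  path: string;
  digest: string;
}

// Stand-in for CPU-heavy subtasks (hashing, preview generation).
// It never touches the database: it only returns a plain result.
function processFile(path: string): AssetResult {
  let h = 0;
  for (let i = 0; i < path.length; i++) {
    h = (h * 31 + path.charCodeAt(i)) >>> 0; // toy checksum, not a real hash
  }
  return { path, digest: h.toString(16) };
}

// The only component allowed to write. Results are buffered and applied
// in batches, so there is a single writer and fewer, larger commits.
class SingleWriter {
  private pending: AssetResult[] = [];
  private readonly batchSize: number;
  readonly rows = new Map<string, string>(); // stand-in for a database table
  commits = 0; // number of batch commits performed

  constructor(batchSize: number) {
    this.batchSize = batchSize;
  }

  enqueue(result: AssetResult): void {
    this.pending.push(result);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  flush(): void {
    if (this.pending.length === 0) return;
    for (const r of this.pending) this.rows.set(r.path, r.digest);
    this.pending = [];
    this.commits++;
  }
}

const writer = new SingleWriter(2);
for (const p of ["a.jpg", "b.jpg", "c.jpg"]) writer.enqueue(processFile(p));
writer.flush(); // commit whatever is left over
console.log(writer.rows.size, writer.commits); // 3 2
```

In a real system the results would arrive as messages from threads or child processes, but the point is the same: no worker needs its own database connection.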

This was promising, but I quickly found that Node.js’s worker_threads implementation still has show-stopping concurrency bugs.

I was able to switch to child processes (thanks to batch-cluster), and that helped, but it took another month of profiling and hotspot remediation to get things working well.

I also added automatic concurrency throttling based on soft timeout rates, so if PhotoStructure finds, say, your NAS doesn’t reply quickly to file operations, it will import fewer files concurrently.
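To make the throttling idea concrete, here is a minimal sketch under stated assumptions, not the actual PhotoStructure implementation: file operations are timed, and once a window of samples is full, the concurrency limit is halved if too many exceeded a soft timeout, or raised by one if none did. The class name, option names, and the halve/increment policy are all hypothetical.

```typescript
// Hypothetical options; names and defaults are illustrative only.
interface ThrottleOptions {
  maxConcurrency: number; // upper bound on concurrent imports
  minConcurrency: number; // never go below this
  softTimeoutMs: number;  // operations slower than this count as "slow"
  windowSize: number;     // number of samples per decision
  maxTimeoutRate: number; // slow fraction above which we back off
}

class AdaptiveConcurrency {
  private readonly opts: ThrottleOptions;
  private current: number;
  private samples: boolean[] = []; // true = exceeded the soft timeout

  constructor(opts: ThrottleOptions) {
    this.opts = opts;
    this.current = opts.maxConcurrency;
  }

  // How many files may be imported concurrently right now.
  get limit(): number {
    return this.current;
  }

  // Record how long one file operation took, and adjust the limit
  // each time a full window of samples has been collected.
  record(elapsedMs: number): void {
    this.samples.push(elapsedMs > this.opts.softTimeoutMs);
    if (this.samples.length < this.opts.windowSize) return;

    const slow = this.samples.filter((s) => s).length;
    const rate = slow / this.samples.length;
    this.samples = [];

    if (rate > this.opts.maxTimeoutRate) {
      // Storage is struggling: back off quickly.
      this.current = Math.max(this.opts.minConcurrency, Math.floor(this.current / 2));
    } else if (slow === 0) {
      // Storage is keeping up: recover slowly.
      this.current = Math.min(this.opts.maxConcurrency, this.current + 1);
    }
  }
}

const throttle = new AdaptiveConcurrency({
  maxConcurrency: 8,
  minConcurrency: 1,
  softTimeoutMs: 1000,
  windowSize: 4,
  maxTimeoutRate: 0.5,
});
for (const ms of [1500, 2000, 1200, 100]) throttle.record(ms);
console.log(throttle.limit); // 4
```

Backing off quickly and recovering slowly is one common design choice for this kind of control loop; other policies would work too.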

The release notes for the next release include some other related work, as well: https://photostructure.com/about/2022-release-notes/#v210-alpha1

Wanderweger · January 25, 2022, 2:25am

Thanks for the update, looking forward to the next version!

Const · January 25, 2022, 3:46pm

Thanks for the important “plumbing” update. Without solid underlying infrastructure we would not have “flashy” face detection, geotagging, and other “must haves”.

Cheers,
Konstantin.