2.1.0-alpha7 sync stops after a while

Frustrated with hard-to-diagnose CPU load issues in version 1, I updated to 2.1.0-alpha7.

I am running PhotoStructure in Docker, and while upgrading I got rid of my old container and ALL serialized state, so 2.1.0-alpha7 started from scratch.

Here is my issue: it seems to scan all my 150K+ files just fine, but then imports only a fraction of them (500-2000), at which point it stops. It looks like sync restarts once a day, triggers a scan that cheerfully discovers 100K+ items to do, processes another handful, and stops again. It has been going on like this for a while (I went on vacation and did not keep close track), and after weeks and weeks of running it is up to about 35K items imported.

Sync logs with level=info are very verbose and nothing jumps out at me. How can I find out why sync is stopping, and what can I do to make it retry more persistently?

EDIT: Or, maybe, there is at least a way to restart just the processing, without triggering a re-scan (which also takes a while)?

It looks like this happens at the end of the sync:

2023-01-17T03:29:51.860Z sync warn  work.CpuUsage Current system load is too high to schedule more work { currentBusyPct: 100, max: 100 }                                        
2023-01-17T03:30:00.426Z main warn  ChildService(sync) healthCheck for ChildService(sync): unhealthy⁶⁵, restarting (Used memory used by sync (2.5 GB) is high⁵,RSS memory used by sync (1.6 GB) is high⁵)                                   
2023-01-17T03:30:00.474Z worker-2268 warn  async.PromiseTimer error counts: { 'fs.unlink': 352, 'fs.rmdir': 43, 'fs.readJson': 21 }                                              
2023-01-17T03:30:02.470Z worker-2290 warn  async.PromiseTimer error counts: { 'fs.unlink': 193, 'fs.rmdir': 17, 'fs.readJson': 15 }

Apparently “restarting” does not actually lead to a restart; it terminates sync completely.

The machine has 32 GB of RAM, and both the About page and the “info” subcommand correctly report that 28 GB are free, so I don’t know why the 2.5 GB/1.6 GB used by sync are considered high…

Ok, setting PS_MAX_MEMORY_MB and PS_MAX_RSS_MEMORY_MB both to 8000 allows me to see more progress, but sync eventually stops with a recurring error that looks like this:

{"ts":1673928004318,"l":"warn","ctx":"work.CpuUsage","msg":"Current system load is too high to schedule more work","meta":{"currentBusyPct":100,"max":150}}
{"ts":1673928012346,"l":"warn","ctx":"time(tag.read())","msg":"unsettled for 3626ms"}

As you can see, I set PS_CPU_MAX_LOAD_PERCENT to 150 in an effort to cajole PhotoStructure into continuing, but to no avail.
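For reference, here is roughly how I am passing these overrides into the container; the image tag, port mapping, and host paths below are placeholders rather than my exact setup:

```shell
# Sketch of the docker invocation with the overrides discussed above
# (image tag, port mapping, and host paths are placeholders):
docker run -d \
  --name photostructure \
  -e PS_MAX_MEMORY_MB=8000 \
  -e PS_MAX_RSS_MEMORY_MB=8000 \
  -e PS_CPU_MAX_LOAD_PERCENT=150 \
  -v /path/to/library:/ps/library \
  -p 1787:1787 \
  photostructure/server:alpha
```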

After that it just periodically prints an error that also occurs throughout the sync, where it seemingly does not affect progress:

{"ts":1673955914566,"l":"warn","ctx":"Error","msg":"onError(): ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)","meta":{"event":"nonFatal","message":"unhandledRejection"}}
{"ts":1673955921561,"l":"warn","ctx":"async.Deferred(WritableToBuffer)","msg":".reject()","meta":"ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)"}
{"ts":1673955921562,"l":"warn","ctx":"Error","msg":"onError(): ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)","meta":{"event":"nonFatal","message":"unhandledRejection"}}
{"ts":1673955956564,"l":"warn","ctx":"async.Deferred(WritableToBuffer)","msg":".reject()","meta":"ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)"}
{"ts":1673955956565,"l":"warn","ctx":"Error","msg":"onError(): ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)","meta":{"event":"nonFatal","message":"unhandledRejection"}}
{"ts":1673955963566,"l":"warn","ctx":"async.Deferred(WritableToBuffer)","msg":".reject()","meta":"ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)"}
{"ts":1673955963567,"l":"warn","ctx":"Error","msg":"onError(): ENOENT: no such file or directory, open '/ps/library/.photostructure/cache-1026/readdircache/8d/60/bhg3gvfdtz7z8cczyfvn.json.gz' at (missing stack)","meta":{"event":"nonFatal","message":"unhandledRejection"}}

At this point, “photostructure sync” is sitting there consuming 50% of a single CPU core, making no progress and writing no more logs.

I also noticed that every sync log begins with:

{"ts":1673957968735,"l":"warn","ctx":"Sentry","msg":"Failed to set up sentry","meta":"TypeError: Cannot read properties of undefined (reading 'init') at t.installSentry (/ps/app/bin/sync.js:9:884914); t.Service.setupErrorHandling (/ps/app/bin/sync.js:9:893928); t.Service._setup (/ps/app/bin/sync.js:9:892186)…t (/ps/app/bin/sync.js:9:890629)"}

Yep, I tried three more times; it seems like “system load” is the show-stopper now:

2023-01-17T13:00:03.660Z sync warn  work.CpuUsage Current system load is too high to schedule more work { currentBusyPct: 100, max: 150 }
2023-01-17T13:01:32.311Z main warn  fs.FileCleanup(imageCacheCleanup) cleanup(): starting { root: '/ps/library/.photostructure/cache-1026/imgcache', staleMs: '15m' }
2023-01-17T13:09:02.296Z main warn  fs.FileCleanup(imageCacheCleanup) cleanup(): starting { root: '/ps/library/.photostructure/cache-1026/imgcache', staleMs: '15m' }
2023-01-17T13:16:32.298Z main warn  fs.FileCleanup(sharedFileCleanup) cleanup(): starting { root: '/ps/library/.photostructure/docker-config/shared-state', staleMs: '1h' }
2023-01-17T13:16:32.299Z main warn  fs.FileCleanup(previewWipCleanup) cleanup(): starting { root: '/ps/library/.photostructure/previews', staleMs: '1h' }
2023-01-17T13:16:32.300Z main warn  fs.FileCleanup(advisoryLockWipCleanup) cleanup(): starting { root: '/ps/library/.photostructure/docker-config/advisory-locks', staleMs: '1h' }
2023-01-17T13:16:32.312Z main warn  fs.FileCleanup(imageCacheCleanup) cleanup(): starting { root: '/ps/library/.photostructure/cache-1026/imgcache', staleMs: '15m' }
2023-01-17T13:19:29.876Z sync warn  DbBackup(/ps/library/.photostructure/models/db.sqlite3) backup(): no-op, db.mutex already has writers

So I guess if I were to distill this whole thread, I would say that my complaint is two-fold:

  • Sync stops when I want it to continue (I certainly don’t find the CPU load or load average in any way exceptional, but sync apparently does).
  • When it stops, the sync process keeps spinning and polling something, consuming CPU until the daily sync restart happens.

(I’m pretty sure you know this already, but) PhotoStructure automatically throttles work based on system load to keep from basically DDoS’ing the machine it runs on. If something else on the system hammers the CPU, it should automatically pause (indefinitely).

Is load high on the system when “system load is too high”? It’s certainly possible that there’s a bug in the current-CPU-utilization code.

Unfortunately, the info tool doesn’t currently check the current cpu usage percent. I’ve just added that to the alpha 8 build, so as soon as that is released, it’ll be easier to debug what’s going on here.

Thanks for bringing this to my attention, and for your patience here!

Hmm. I am not sure how to quantify the “too high” here. Yes, there is other software running on this box, and yes, sometimes the load average briefly spikes to 4-8-10 (which, given the nature of load average, does not really tell us a whole lot). I certainly do not perceive the box as loaded, not by any stretch.

I feel that if I thought PhotoStructure were the problem here, causing high system load, I could just use Docker facilities to throttle its access to CPU or IO or both. Let’s say that I am perfectly content with PhotoStructure continuing to do what it is doing (like versions 1.x did). Is there any way to take the training wheels off and let me manage the resource consumption? :slight_smile:
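For instance, something like this would let me cap the container from the outside (a sketch only; the values and paths are arbitrary examples):

```shell
# Sketch: throttle the container externally instead of relying on the
# in-app load check. --cpus is a hard ceiling on CPU time; --cpu-shares
# is a soft scheduling weight that only matters under contention.
# Image tag, values, and paths here are arbitrary examples.
docker run -d \
  --name photostructure \
  --cpus="2.0" \
  --cpu-shares=512 \
  -v /path/to/library:/ps/library \
  photostructure/server:alpha
```

Docker can similarly cap IO with flags like --device-read-bps, so all the knobs I would need seem to already exist at the container level.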

Even if the CPU is 100% utilized, I don’t really see this being an issue - I paid for the whole CPU, I might as well be using the whole CPU :slight_smile:

If you’ve only got 4 CPU cores, a load average of 10 is rough: the system is probably thrashing caches.

I wrote more about this topic here:

You could set cpuLoadPercent to something like 200, and maxConcurrentImports to something like 3 (if you have a 4-core machine)

I’d like to disagree here. I could (for the sake of argument) start ten parallel “find / -print”s, which would give me a load average of 10, and the box would still be perfectly responsive. Again, nothing else on this box feels unbearably sluggish or stuttery when this happens (which I am entirely happy with).
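To make that concrete: my assumption is that currentBusyPct is roughly the 1-minute load average divided by the core count, times 100. A minimal sketch of that arithmetic (the real formula inside PhotoStructure may differ):

```shell
#!/bin/sh
# Rough sketch of a load-average-based "busy percent": 1-minute load
# average divided by core count, times 100. This is my guess at what a
# metric like currentBusyPct measures; the real formula may differ.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
busy_pct=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%d", l / c * 100 }')
echo "cores=${cores} load1=${load1} busyPct=${busy_pct}"
```

The catch is that on Linux the load average also counts tasks blocked in uninterruptible IO, so ten IO-bound find processes can push a number like this past 200 while the CPUs stay mostly idle and the box stays responsive.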

I will try your suggestion of cpuLoadPercent of 200 (or even more) and maxConcurrentImports of 3 now.

Reporting back after my experiments. I tried cpuLoadPercent of 150, 200, and 300. It looks like it is capped at 200, as even with 300 I am still getting:

2023-01-18T09:00:10.372Z sync warn  work.CpuUsage Current system load is too high to schedule more work { currentBusyPct: 100, max: 200 }

I noticed that this message is emitted several times before sync finally stops (with all three cpuLoadPercent settings).

I also tried maxConcurrentImports set to 3, 2, and 1. Overall, I can’t say I saw a noticeable difference in behaviour: sync still stops after 200-500 (rarely 700+) items, whereas versions 1.x were able to complete the import without any fuss.

Is there anything else I can try?

PS:
This is what “top” inside the container looks like once sync has stopped (this was taken some time after the last attempt, so things have definitely settled down by now):

  PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND                                                                                                         
    1 root      20   0    0.8m   0.0m   0.0   0.0   0:00.17 S /sbin/tini -- /ps/app/docker-entrypoint.sh                                                                      
    8 node      20   0  251.0m  28.6m   0.0   0.1   0:00.23 S  `- /usr/local/bin/node /ps/app/photostructure                                                                  
   28 node      20   0  398.9m 142.8m 103.3   0.4   3:33.51 S      `- photostructure main                                                                                     
   60 node      20   0  401.4m 133.8m   0.7   0.4   1:25.56 S          `- photostructure web                                                                                  
   62 node      20   0    0.0m   0.0m   0.0   0.0   0:00.00 Z          `- [sh] <defunct>                                                                                      
  120 node      29   9  619.2m 378.8m  21.3   1.2  50:25.69 S          `- photostructure sync                                                                                 
  139 node      29   9    0.0m   0.0m   0.0   0.0   0:00.00 Z              `- [sh] <defunct>                                                                                  
 4288 node      29   9    1.7m   1.2m   0.0   0.0   0:00.01 S              `- findmnt --poll                                                                                  
 4954 node      29   9   21.4m  19.4m  16.7   0.1   0:01.53 S              `- /usr/bin/perl -w /ps/app/node_modules/exiftool-vendored.pl/bin/exiftool -stay_open True -@ -    

As you can see, main, sync, and exiftool keep churning CPU; I think they keep polling exiftool?