Best settings for performance on a large photo/video library full of duplicates and wrong metadata

Hello, I’m looking for tips from experienced users (or the creator) about which library setting overrides are best for my use case:

I have a very large home library of 3.23 TB with 88,077 folders and 1,277,615 files.

This backup holds photos and videos, and many, if not all, of them have been duplicated over and over again. The worst part is the wrong or missing metadata, such as the files’ creation dates. There are old, low-resolution photos and videos from old machines and mobile phones.

Why did this happen? Inexperience, several users handling the same data (human error), repeated and redundant backups for safety, original metadata lost to different OS and software issues, re-compressed copies from WhatsApp/Telegram/other apps, photo editing, etc.

My goal is to identify and keep the original files and eliminate duplicates without losing old photos and videos or the correct timeline. I need to know which settings are best for performance (reducing import times), and which settings are best for identifying original files and removing all the duplicates.

The worst problems/difficulties I’m trying to solve:

1 - best import performance for my setup

2 - identifying duplicate files that lost their creation date and now appear older than the original files (big problem)

3 - identifying duplicate files with a later date but a higher resolution/size than the original file (which one is which?)

4 - identifying and renaming different photos that share the exact same date (lost or wrong metadata)

5 - identifying identical photos with different filenames and dates, and sometimes different sizes/resolutions

6 - not losing photo bursts to over-sensitive duplicate matching (if I have to make trade-offs, this is the least important)

My setup is:

Unraid (hypervisor)

PhotoStructure running in Docker

CPU: Intel® Core™ i5-9500T @ 2.20 GHz, 6 cores / 6 threads

RAM: 32 GiB DDR4

Nvidia GTX 1060 3 GB

Cache: Crucial P5 Plus 2 TB PCIe NVMe, for the PhotoStructure library import and appdata

ZFS stripe of 5 × Western Digital 4 TB Purple (5400 rpm, SATA III, 64 MB cache) for my old, unorganized library

This system can be 90% dedicated to PhotoStructure because it’s my priority.

I have good read/write bandwidth and enough CPU power for this application.

This is my latest library import config:

  • 28 library setting overrides

Source:

/ps/library/.photostructure/settings.toml

Settings:

  • PS_ENABLE_ARCHIVE=true

  • PS_ENABLE_DELETE=true

  • PS_ENABLE_EMPTY_TRASH=true

  • PS_ENABLE_REMOVE=true

  • PS_ENABLE_REMOVE_ASSETS=true

  • PS_KEYWORD_BLOCKLIST=[]

  • PS_MAX_ASSET_FILE_SIZE_BYTES=10000000000

  • PS_MIN_ASSET_FILE_SIZE_BYTES=15000

  • PS_MIN_IMAGE_DIMENSION=240

  • PS_MIN_VIDEO_DIMENSION=120

  • PS_MIN_VIDEO_DURATION_SEC=1

  • PS_REJECT_RATINGS_LESS_THAN=0

  • PS_FUZZY_YEAR_PARSING=true

  • PS_USE_STAT_TO_INFER_DATES=false

  • PS_VARIANT_SORT_CRITERIA_POWER=0.3

  • PS_ALLOW_USER_AGENT=true

  • PS_EMAIL=xxxxx@xxxxx.com

  • PS_REPORT_ERRORS=true

  • PS_MAX_ERRORS_PER_DAY=5

  • PS_MATCH_SIDECARS_FUZZILY=true

  • PS_WRITE_METADATA_TO_SIDECARS_IF_IMAGE=false

  • PS_WRITE_METADATA_TO_SIDECARS_IF_SIDECAR_EXISTS=false

  • PS_AUTO_REFRESH_LICENSE=true

  • PS_PICK_PLAN_ON_WELCOME=true

  • PS_ASSET_PATHNAME_FORMAT=y/MM/yMMdd_HHmmss.EXT

  • PS_EXCLUDE_NO_MEDIA_ASSETS_ON_REBUILD=false

  • PS_SYNC_REPORT_RETENTION_COUNT=30

  • PS_AUTO_UPDATE_CHECK=true

  • no system setting overrides

  • Some volumes are missing UUIDs

  • Storage volumes are OK*

  • PhotoStructure is not running as root

  • PhotoStructure is up to date


  • CPU utilization is 3%

  • Operating system is OK

Tools

  • SQLite is OK

  • ExifTool is OK

  • jpegtran is OK

  • Sharp is OK

  • Node.js is OK

  • HEIF images will be imported

  • Videos will be imported

Summary:

Are some of my settings wrong for my use case?

Which settings are recommended?

Which settings are best practice?

Which settings are best for full-power performance without import errors or server crashes?

Thanks to the creator, great professional and human, and thanks to the community!

While importing the library, my CPU is at 50% and my NVMe and array are just sleeping…


System information

Version 2024.3.3-beta
Edition PhotoStructure for Docker
Health checks All critical health checks pass
Subscription plus
Licensed to xxxxxx@xxxxxx.com
Expires or renews 2024-11-30
OS Debian GNU/Linux 12 (bookworm) on x64 (Docker)
CPUs 6 × Intel(R) Core™ i5-9500T CPU @ 2.20GHz
System load 50% busy
Concurrency Target system use: 75% (3 concurrent imports, 1 gfx/process)
Web uptime 3 hours, 29 minutes
Current user node
Library path /ps/library
Library metrics 11,577 assets
25,462 image files
18 video files
553 tags
Log directory /ps/library/.photostructure/logs
Log level error

Sync information

path status last started last completed
/mnt/to-import-photos todo 3 hours, 29 minutes ago

A full import takes days.

Welcome to the PhotoStructure forum, @paqmac! Thanks for the kind words.

In general, the more CPU cores you have and the faster the disk you can put the PhotoStructure library on (at least $library/.photostructure/models!), the better.

The Intel® Core™ i5-9500T doesn’t have hyperthreading, so PhotoStructure will only schedule work on 4 of those cores, and sync takes a core, so you should only see 3 concurrent file imports. Your system may be OK with setting PS_CPU_BUSY_PERCENT=85 or even 90. Check your sync report for timeouts to validate you aren’t overscheduling and getting hammered by iowait.
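One plausible reading of that arithmetic, matching the “Concurrency” line in the system information below (these constants are purely illustrative; this is not PhotoStructure’s actual scheduler code):

```javascript
// 6 physical cores at a 75% target system use:
const cores = 6;
const target = 0.75;
const workers = Math.floor(cores * target); // 4 schedulable workers
const concurrentImports = workers - 1;      // 3, since sync takes one
console.log({ workers, concurrentImports });
```

Raising PS_CPU_BUSY_PERCENT nudges that target upward, which is why 85–90% can yield an extra worker on some systems.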

I scanned through your settings, and those seem, for the most part, just fine, except the following:

Note that this will omit photos that have had metadata stripped from them. Ideally, they will be in a directory that encodes the day (or month): ideally, .../YYYY-mm-dd/..., but there are a ton of different patterns I try to match against.
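As a minimal illustration of that kind of path-based date inference (PhotoStructure’s real matcher supports far more patterns than these two regexes):

```javascript
// Look for .../YYYY-mm-dd/ (or .../YYYY-mm/) and .../YYYY/mm/ directory
// patterns; returns undefined when nothing date-like is found.
function dateFromPath(pathname) {
  const m =
    pathname.match(/(?:^|\/)(\d{4})-(\d{2})(?:-(\d{2}))?(?=\/|$)/) ??
    pathname.match(/(?:^|\/)(\d{4})\/(\d{2})(?=\/|$)/);
  if (m == null) return undefined;
  const [, y, mo, d] = m;
  if (+mo < 1 || +mo > 12) return undefined; // reject non-month numbers
  return { year: +y, month: +mo, day: d === undefined ? undefined : +d };
}
```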

If sync is stuck like that, there should be some clue as to what’s going wrong in the sync reports and the logfiles.

This is tricky, given that the current PhotoStructure schema doesn’t really track photo bursts correctly–they are lumped together as a single set of variations. Ideally, all the assets assigned to a “burst” would be given a common “burst ID”, but I haven’t found a metadata tag that is consistent between cameras (so I’ll have to resort to some sort of “set of synonyms” like I have had for so many other things).

If processing takes about 10 seconds per asset (which may be on average about right, especially if you have HEVC and videos that need transcoding), that pencils out:

```js
require("./dist/core/date/DurationFormat")
  .fmtFullDuration(25_000 * 10_000)
// → '2d21h26m40s'
```
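For anyone without a PhotoStructure checkout handy, the same back-of-the-envelope arithmetic in self-contained Node (my own formatter, not the internal `fmtFullDuration`):

```javascript
// 25,000 assets × ~10,000 ms each, rendered as days/hours/minutes/seconds:
function fmtFullDuration(ms) {
  let s = Math.floor(ms / 1000);
  const d = Math.floor(s / 86400); s %= 86400;
  const h = Math.floor(s / 3600);  s %= 3600;
  const m = Math.floor(s / 60);    s %= 60;
  return `${d}d${h}h${m}m${s}s`;
}
console.log(fmtFullDuration(25_000 * 10_000)); // → 2d21h26m40s
```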

The next release improves the function timers, so setting PS_EMIT_TIMINGS_ON_EXIT=true will enumerate all higher-level functions, how often they are being called, and what is taking up most of the time. Running sync against a single file or a handful of problematic files can be enlightening.

Mine was already at 85; I’ve increased it to 90 now, and I’ve set processPriority = “Normal”.
I’ve read everything in settings.toml about performance and memory management, but my understanding is that with my NVMe SSD all the default settings are fine, and the only settings that make a real difference are the ones related to CPU, cores, and workers. Am I wrong?

About PS_USE_STAT_TO_INFER_DATES=false: my idea was to not mess with file dates, especially when reading the Windows file-creation date, and risk setting wrong dates.

> # When enabled, and the “captured-at” time isn’t found in metadata,
> # PhotoStructure will also look for the captured-at datetime encoded in the
> # file “birthtime” (on Windows), or the lesser value of “mtime” and “ctime”
> # (on macOS and Linux). Note that these values are not very reliable, as file
> # transfers and backups frequently don’t retain these values correctly.

After reading it carefully, I understand the advantage of inferring a date when the metadata has none.

Any other advice?

When I make a change to the library settings.toml, should I rebuild, or just restart sync?

I’m thinking about safely renaming and reorganizing my 3.26 TB library into y/MM folders with y/MM/yMMdd_HHmmss.EXT filenames (iterating -001 for duplicates), to better understand this large mess and help PhotoStructure’s imports.
That way I could permanently delete the old, unorganized original library and shrink it by deleting obvious duplicates by hand, helping future imports and increasing the chances of an optimal organization. What do you think?
Can I use PhotoStructure or other software for a “pre” file/folder date organization before analyzing the assets? Or is this redundant, and should I just trust PhotoStructure with this config?
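If you do try a pre-organization pass, the renaming scheme described above can be sketched like this (a dry-run helper of my own invention: it only computes target paths, it moves nothing):

```javascript
// Build a y/MM/yMMdd_HHmmss.EXT path from a capture date, appending
// -001, -002, … when the slot is already taken by another file.
function targetPath(date, ext, taken) {
  const p = (n, w = 2) => String(n).padStart(w, "0");
  const y = date.getFullYear();
  const mm = p(date.getMonth() + 1);
  const base =
    `${y}/${mm}/${y}${mm}${p(date.getDate())}_` +
    `${p(date.getHours())}${p(date.getMinutes())}${p(date.getSeconds())}`;
  let candidate = `${base}.${ext}`;
  for (let i = 1; taken.has(candidate); i++) {
    candidate = `${base}-${p(i, 3)}.${ext}`;
  }
  taken.add(candidate);
  return candidate;
}
```

That said, renaming before import discards the original filenames, which can themselves carry date hints, so keeping the untouched originals until PhotoStructure has imported them seems the safer order of operations.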

Do you have an estimated date for the next version’s release?