My right or your right?
- 0 Posts
- 6 Comments
hoppolito@mander.xyz to Selfhosted@lemmy.world • Searching through a bulk of pdf files • English 5 · 28 days ago
For the OCR process you can probably wrangle up a simple bash pipeline with ocrmypdf and just let it run in the background once until all your PDFs have a text layer.
With that tool it should be doable with something like a simple while loop:
    find . -type f -name '*.pdf' -print0 |
        while IFS= read -r -d '' file; do
            echo "Processing $file ..."
            ocrmypdf "$file" "$file"
            # ocrmypdf "$file" "${file%.pdf}_ocr.pdf" # if you want a new file instead of overwriting the old
        done
If you need additional languages or other options you’ll have to delve a little deeper into the ocrmypdf documentation but this should be enough duct tape to just whip up a full OCR cycle.
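If you would rather keep the originals, here is a rough sketch of the same loop that writes to a new file and can be re-run safely. The helper names `ocr_output_name` and `ocr_all` are made up for illustration, and adding `--skip-text` (which leaves pages that already have a text layer alone) is an assumption about what you want, not part of the loop above.

```shell
# Hypothetical helper: "dir/paper.pdf" -> "dir/paper_ocr.pdf".
ocr_output_name() {
  printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# Hypothetical helper: OCR every PDF under a directory (default: current one)
# into a sibling *_ocr.pdf, skipping files that were already processed.
# Needs bash, because of `read -d ''`.
ocr_all() {
  local dir="${1:-.}" file out
  # -print0 plus read -d '' keeps filenames with spaces or newlines intact.
  find "$dir" -type f -name '*.pdf' ! -name '*_ocr.pdf' -print0 |
    while IFS= read -r -d '' file; do
      out="$(ocr_output_name "$file")"
      if [ -e "$out" ]; then
        echo "Skipping $file (already processed)"
        continue
      fi
      echo "Processing $file ..."
      # A failure on one file is reported, but the loop keeps going.
      # </dev/null stops ocrmypdf from eating the file list on stdin.
      ocrmypdf --skip-text "$file" "$out" </dev/null || echo "Failed: $file" >&2
    done
}
```

Calling `ocr_all ~/papers` would then walk that folder. Since finished files are skipped, you can interrupt it and start it again later.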
hoppolito@mander.xyz to Selfhosted@lemmy.world • Searching through a bulk of pdf files • English 11 · 28 days ago
In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga) which lets you search through a whole bunch of files like PDFs quickly. And it’s cached, so while the first indexing takes a moment, any further search is lightning fast.
It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on) but I mostly used it to quickly search through thousands of research papers. It takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
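For a rough idea of what a search looks like, here is a small sketch. The wrapper name `search_papers` is made up for illustration. It relies on rga taking a pattern and a path like rg does, and on it handing the usual ripgrep flags through.

```shell
# Hypothetical wrapper: list the files under a directory (default: current one)
# that contain the pattern, ignoring case.
search_papers() {
  local pattern="$1" dir="${2:-.}"
  # --ignore-case and --files-with-matches are plain ripgrep flags.
  rga --ignore-case --files-with-matches "$pattern" "$dir"
}
```

Something like `search_papers 'transformer' ~/papers` would then print only the matching file names. Drop `--files-with-matches` if you want to see the matching lines as well.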
On the other hand I would, as it is really interesting to me the kind of day-to-day behind the scenes of a community place like this.
As would evidently, or perhaps has, Cris_Color@lemmy.world, so I don’t think it’s an unreasonable article hook/assumption to make.
hoppolito@mander.xyz to Gaming@lemmy.world • Valve confirms credit card companies pressured it to delist certain adult games from Steam • English 0 · 2 months ago
Wank faster, my tax report is due!
I also switched to gonic over navidrome (even though I liked it a lot) because iirc I couldn’t get navidrome to pull artist pictures in correctly. With gonic I could just connect to lastfm and everything worked, and I could still connect to listenbrainz for my actual scrobbling.