PDF OCR Linux command line

1/12/2024

This script determines whether the PDF files selected in the Finder have been OCR'd. It extracts page one of each PDF to a temp file and then attempts to extract text from it. On my system it takes 1 minute 15 seconds to scan 502 PDF files with an aggregate file size of 962.3 MB. That's not super fast, but it's not horribly slow either. I haven't compared this to the command line utilities yet, but I probably will.

Excerpts from the script:

    # Task: Determine if Selected PDF Files have been OCR'd.
    set tempFilePath to (POSIX path of (path to temporary items folder from user domain)) & "temp.pdf"
    # Selected files in the Finder are the target.
    set finderSelectionList to selection as alias list
    if length of finderSelectionList = 0 then error "No files were selected in the Finder!"
    set contents of i to POSIX path of (contents of i)
    repeat with pdfFilePath in finderSelectionList
    # Extract page 1 of the selected PDF file to a temp file.
    set pdfTempFilePath to (my extractPages:1 thruTo:1 ofPDFDocAt:pdfFilePath usingTempFile:tempFilePath)
    set pdfText to (its pdf2Text:pdfTempFilePath)
    if pdfText = "" then set end of nonOcrList to contents of pdfFilePath

    on extractPages:firstPage thruTo:lastPage ofPDFDocAt:posixPath usingTempFile:tempFilePath
    set inNSURL to current application's class "NSURL"'s fileURLWithPath:posixPath
    set theDoc to current application's PDFDocument's alloc()'s initWithURL:inNSURL
    repeat with i from (firstPage - 1) to 1 by -1
    (theDoc's removePageAtIndex:(i - 1)) -- zero-based indexes
    end extractPages:thruTo:ofPDFDocAt:usingTempFile:

    set theText to current application's NSMutableString's |string|()
    set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
    set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
    set theCount to theDoc's pageCount() as integer
    set thePage to (theDoc's pageAtIndex:(i - 1))
    (theText's appendString:(thePage's |string|()))

I'm not entirely satisfied with the text-length test in the appended script, but when run it finds all the non-OCR'd PDFs in my 500-file test set in only 25 seconds (less than half the time of the first script). My first script had a number of false negatives due to \uFFFC characters showing up in non-OCR'd files. I'd still like a better test than text length, but at least this script is much less clumsy than the first one.

Excerpts from the second script:

    repeat with theFile in finderSelectionList
    set (contents of theFile) to POSIX path of (contents of theFile)
    repeat with thePath in finderSelectionList
    if (its pdfHasBeenOcrd:thePath) = false then
    set end of nonOcrList to contents of thePath

    set thePage to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
    # Test for Text Content (I'm not satisfied with this yet).

Chris, thanks for a great script, but you've got a slow Mac. When I run your script, but also using Shane's MetaLib lib to search a folder recursively that contains PDFs and other stuff, it took only 0.012 sec/PDF.

Here's the script I used:

    property ptyScriptName : "Detect Whether a PDF has been OCRd - Chris"
    property ptyScriptAuthor : "Christopher Stone" -- mod by JMichaelTX

    Detect Whether a PDF has been OCRd - Chris.
    RETURNS: List of PDFs (Posix Path) that Need to be OCR'd
    TAGS: The following were used in some way in the writing of this script.
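For readers working outside AppleScript, the text-length test and the \uFFFC problem described in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the posts: the function names (`meaningful_length`, `looks_ocrd`, `needs_ocr`) and the default threshold of one character are hypothetical, and the sketch assumes the page text has already been extracted by some other tool (on Linux, for example, `pdftotext` from poppler-utils).

```python
# A minimal sketch of a text-length OCR check.
# All names and the default threshold are assumptions made for illustration;
# they do not come from the scripts discussed in the thread.

OBJECT_REPLACEMENT = "\uFFFC"  # placeholder character that can appear in image-only pages
MIN_TEXT_CHARS = 1             # assumed threshold; tune it for your own files


def meaningful_length(page_text):
    """Count the characters left after dropping U+FFFC and all whitespace."""
    return sum(
        1 for ch in page_text
        if ch != OBJECT_REPLACEMENT and not ch.isspace()
    )


def looks_ocrd(page_texts, min_chars=MIN_TEXT_CHARS):
    """Return True if any page has at least min_chars meaningful characters."""
    return any(meaningful_length(text) >= min_chars for text in page_texts)


def needs_ocr(docs, min_chars=MIN_TEXT_CHARS):
    """Given {path: [page_text, ...]}, return the paths that look un-OCR'd."""
    return [
        path for path, pages in docs.items()
        if not looks_ocrd(pages, min_chars)
    ]


if __name__ == "__main__":
    # Hypothetical extracted text for two files.
    sample = {
        "scan.pdf": ["\uFFFC\n \uFFFC"],      # only placeholders and whitespace
        "report.pdf": ["Quarterly results"],  # real text layer
    }
    print(needs_ocr(sample))
```

Filtering out U+FFFC before measuring length is what keeps an image-only page from being counted as text. Raising `min_chars` trades false negatives for false positives, so the right value depends on the files being scanned.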
