Requires Ghostscript and Tesseract.
PowerShell 7, using ForEach-Object -Parallel. Files whose extensions match tiff?|jpe?g|pdf|png are recognized.
#Requires -Version 7.0

# https://stackoverflow.com/questions/4695695/convert-pdf-to-jpg-or-png-using-c-sharp-or-command-line
# & "$tesseract" --help-extra

<#
# Show resolution and DPI of an image
# & C:\scripts\ImageMagick\identify.exe -format "%w x %h %x x %y" "D:\temp\ocr\pict.png"
# https://askubuntu.com/questions/760993/how-to-programmatically-determine-dpi-of-images-in-pdf-file
[int]$dpi = & C:\scripts\ImageMagick\identify.exe -format "%x" "C:\Pictures\pic.jpg"
if ($dpi -lt 72) {$dpi = 72}
#>

$ghostScript = "C:\scripts\Ghostscript\bin\gswin64c.exe"
$tesseract = "C:\scripts\Tesseract\tesseract.exe"
$path = "C:\temp\ocr"

# quit if no appropriate files in the folder
$sources = dir "$path" |? Extension -match 'tiff?$|jpe?g$|png$|pdf$'
if (!$sources) {exit}

$temp = "C:\temp\ocr$((get-date).ToString("yyyyMMddHHmmss"))"

# CPU threads (a single number even on multi-socket machines)
$threads = [Environment]::ProcessorCount

# temp folder, moving files
mkdir "$temp" > $null
$sources |mv -Destination "$temp"
cd "$temp"

# processing
$pdfs,$images = (dir "$temp").where({$_.extension -eq '.pdf'}, 'Split')

# PDFs: render every page to a 300 dpi grayscale PNG named <pdf>-0001.png, <pdf>-0002.png, ...
$pdfs |ForEach-Object -Parallel {
    & $using:ghostScript -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 "-sOutputFile=$($_.basename)-%04d.png" "$($_.fullname)"
} -ThrottleLimit $threads

foreach ($pdf in $pdfs) {
    # match only the page files of this PDF; the base name is escaped because it may contain regex characters
    $pages = "^$([regex]::Escape($pdf.basename))-\d{4}$"
    dir *.png |? BaseName -match $pages |ForEach-Object -Parallel {
        & $using:tesseract ".\$($_.name)" "$($_.basename)" -l rus+eng
    } -ThrottleLimit $threads
    # join the per-page text files (sorted by name) into one Windows-1251 file
    dir *.txt |? BaseName -match $pages |gc -Encoding UTF8 |Out-File "$path\$($pdf.basename).txt" -Encoding 1251
}

# Images
$images |ForEach-Object -Parallel {
    & $using:tesseract ".\$($_.name)" "$($_.basename)" -l rus+eng
    gc ".\$($_.basename).txt" -Encoding UTF8 |Out-File (Join-Path $using:path "$($_.basename).txt") -Encoding 1251
} -ThrottleLimit $threads

# remove temp folder (the source files were moved into it, so they are deleted too)
cd "$path"
rmdir "$temp" -Recurse -Force -Confirm:$false
PowerShell 5.1
$ghostScript = "C:\scripts\Ghostscript\bin\gswin64c.exe"
$tesseract = "C:\scripts\Tesseract\tesseract.exe"
$path = "C:\temp\ocr"

# quit if no appropriate files in the folder
$sources = dir "$path" |? Extension -match 'tiff?$|jpe?g$|png$|pdf$'
if (!$sources) {exit}

$temp = "C:\temp\ocr$((get-date).ToString("yyyyMMddHHmmss"))"

# temp folder, moving files
mkdir "$temp" > $null
$sources |mv -Destination "$temp"
cd "$temp"

# processing
$pdfs,$images = (dir "$temp").where({$_.extension -eq '.pdf'}, 'Split')

# PDFs
foreach ($pdf in $pdfs) {
    # render every page to a 300 dpi grayscale PNG named <pdf>-0001.png, <pdf>-0002.png, ...
    & "$ghostScript" -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 "-sOutputFile=$($pdf.basename)-%04d.png" "$($pdf.fullname)"
    # match only the page files of this PDF, so "doc" does not also pick up the pages of "doc2"
    $pages = "^$([regex]::Escape($pdf.basename))-\d{4}$"
    dir *.png |? BaseName -match $pages |% {
        & "$tesseract" ".\$($_.name)" "$($_.basename)" -l rus+eng
    }
    # "default" is the system ANSI code page in PowerShell 5.1 (Windows-1251 on a Russian system)
    dir *.txt |? BaseName -match $pages |gc -Encoding UTF8 |Out-File "$path\$($pdf.basename).txt" -Encoding default
}

# Images
$images |% {
    & "$tesseract" ".\$($_.name)" "$($_.basename)" -l rus+eng
    gc ".\$($_.basename).txt" -Encoding UTF8 |Out-File "$path\$($_.basename).txt" -Encoding default
}

# Removing temp folder (the source files were moved into it, so they are deleted too)
cd "$path"
rmdir "$temp" -Recurse -Force -Confirm:$false
If the PDF contains a text layer rather than page images, OCR is unnecessary and the text can be extracted directly:
& "C:\scripts\Ghostscript\bin\gswin64c.exe" -dBATCH -dNOPAUSE -sDEVICE=txtwrite -o "D:\Downloads\output.txt" "D:\Downloads\input.pdf"
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf
https://stackoverflow.com/questions/6187250/pdf-text-extraction-from-given-coordinates
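The same txtwrite device can also help a script decide whether OCR is needed at all. The following is only a minimal sketch under stated assumptions: the function names Test-TextIsMeaningful and Test-PdfHasTextLayer and the 20-letter threshold are hypothetical and do not come from the scripts above, and the extracted text is assumed to be UTF-8, which is believed to be txtwrite's default output format.

```powershell
# Hypothetical helper: does the extracted text contain enough letters to count as a text layer?
function Test-TextIsMeaningful {
    param(
        [string]$Text,
        [int]$MinLetters = 20   # assumed threshold; tune it for your documents
    )
    # Count Unicode letters only, so whitespace, digits and punctuation do not count as text.
    return ([regex]::Matches($Text, '\p{L}').Count -ge $MinLetters)
}

# Hypothetical wrapper: run txtwrite into a temporary file and inspect the result.
function Test-PdfHasTextLayer {
    param(
        [Parameter(Mandatory)][string]$PdfPath,
        [string]$GhostscriptPath = 'C:\scripts\Ghostscript\bin\gswin64c.exe'
    )
    $out = [System.IO.Path]::GetTempFileName()
    try {
        # Same device as in the command above; -q keeps Ghostscript quiet.
        & $GhostscriptPath -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite "-sOutputFile=$out" $PdfPath | Out-Null
        # -Raw returns $null for an empty file; "$text" turns that into an empty string.
        $text = Get-Content -LiteralPath $out -Raw -Encoding UTF8
        return (Test-TextIsMeaningful -Text "$text")
    }
    finally {
        Remove-Item -LiteralPath $out -ErrorAction SilentlyContinue
    }
}
```

A call such as `Test-PdfHasTextLayer -PdfPath 'D:\Downloads\input.pdf'` would then return $true for a PDF with a real text layer and $false for a pure scan, which could be sent through the OCR script instead. A scanned PDF with a thin or poor hidden text layer can still pass this check, so the threshold is a heuristic, not a guarantee.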
All processing is done with free tools.
gImageReader, a graphical front end for Tesseract, gives good results.
To recognize Russian text, download the Russian language data, plus a spelling dictionary to make proofreading easier. Both can be installed directly from the program's interface.
It also includes a good text editor, where the recognition results are displayed. It can remove end-of-line hyphenation, and so on.
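For the text files produced by the scripts, a similar cleanup can be approximated in PowerShell. This is a rough sketch, not a description of what gImageReader does internally: Join-HyphenatedLines is a hypothetical name, and the rule (a letter, a hyphen, a line break, then a lowercase letter) is an assumption that will also merge genuine compounds that happen to break at a line end.

```powershell
# Hypothetical helper: rejoin words that were hyphenated across a line break.
function Join-HyphenatedLines {
    param([string]$Text)
    # -creplace is case-sensitive, so \p{Ll} really means "lowercase letter".
    # Pattern: hyphen preceded by a letter, optional trailing blanks, a line break,
    # and a lowercase letter right after it. Only the hyphen and the break are removed.
    return ($Text -creplace '(?<=\p{L})-[ \t]*\r?\n(?=\p{Ll})', '')
}
```

Usage might look like `Join-HyphenatedLines (Get-Content .\result.txt -Raw -Encoding UTF8)`, where result.txt is a placeholder file name. Requiring a lowercase letter after the break leaves cases such as a hyphen followed by a capitalized word on the next line untouched.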