Converting Scanned PDFs to Text and Word (Python 3)

 和相品 2020-05-13

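I. Converting a scanned PDF into a PDF with copyable text

1. Use the Baidu API to convert the scanned PDF into a text PDF

Sign-up URL: https://console.bce.baidu.com

After signing in, create a text recognition (OCR) application. The application list then shows the APP_ID, API_KEY, and SECRET_KEY needed to call the API.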
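
The three credentials do not have to be pasted into the script. As a minimal sketch, they could be read from environment variables instead. The variable names below (BAIDU_APP_ID, BAIDU_API_KEY, BAIDU_SECRET_KEY) and the helper name are illustrative assumptions, not something the Baidu SDK defines:

```python
import os


def load_baidu_credentials(env=os.environ):
    """Return (APP_ID, API_KEY, SECRET_KEY) read from environment variables.

    The variable names are illustrative assumptions, not Baidu conventions.
    """
    names = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

The result can then replace the hard-coded values, for example `APP_ID, API_KEY, SECRET_KEY = load_baidu_credentials()`.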

2. Install the following Python modules in order

    pip3 install PyPDF2
    pip3 install baidu-aip
    pip3 install pdfkit
    pip3 install pymupdf
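
Two of these packages are imported under a different name than the one passed to pip: the program imports baidu-aip as `aip` and pymupdf as `fitz`. A small sketch for checking that the imports resolve before running the full program (the helper name `missing_modules` is hypothetical):

```python
import importlib.util


def missing_modules(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the program (pip names: PyPDF2, baidu-aip, pdfkit, pymupdf)
    print(missing_modules(["PyPDF2", "aip", "pdfkit", "fitz"]))
```

An empty list means all four modules were found.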

3. Install the wkhtmltopdf software

Download URL: https:///downloads.html

Note the location of bin/wkhtmltopdf.exe under the installation directory. The path_wk parameter in the program needs this location.
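
A wrong path_wk only surfaces when the first page is written to PDF, after all the OCR calls have already run. As a sketch, the path could be checked up front, with a fallback to whatever `wkhtmltopdf` is on PATH. The helper name `resolve_wkhtmltopdf` is hypothetical:

```python
import os
import shutil


def resolve_wkhtmltopdf(candidate):
    """Return a usable wkhtmltopdf path, or None if none is found.

    Checks the explicitly configured path first, then falls back to PATH.
    """
    if candidate and os.path.isfile(candidate):
        return candidate
    return shutil.which("wkhtmltopdf")
```

If the function returns None, the program can stop with a clear message before any API calls are made.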

4. Program:

    from PyPDF2 import PdfFileReader, PdfFileWriter
    from aip import AipOcr
    import pdfkit
    import fitz
    import os


    # Working folder, PDF file name, and location of the wkhtmltopdf executable
    pdfpath = r'D:\pdf3'
    pdfname = '水滸傳.pdf'
    path_wk = r'D:/Procedure/wkhtmltopdf/bin/wkhtmltopdf.exe'


    # Credentials from the Baidu console (placeholder values)
    APP_ID = '1234567'
    API_KEY = 'abcdefg'
    SECRET_KEY = 'qwertyuiop'

    # Processing code below ----------------------------------------------------------
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)
    pdfkit_options = {'encoding': 'UTF-8'}
    # Render each PDF page to a PNG image
    def pdf_image():
        pdf = fitz.open(pdfpath + os.sep + pdfname)
        # pageCount, preRotate, getPixmap and writePNG are the older PyMuPDF names;
        # newer releases spell them page_count, prerotate, get_pixmap and Pixmap.save.
        for pg in range(0, pdf.pageCount):
            # Get the page object
            page = pdf[pg]
            # Identity matrix: no scaling, no rotation
            trans = fitz.Matrix(1.0, 1.0).preRotate(0)
            # Render the page to a pixmap
            pm = page.getPixmap(matrix=trans, alpha=False)
            # Save the image, e.g. <name>_001.png
            pm.writePNG(image_path + os.sep + pdfname[:-4] + '_' + '{:0>3d}.png'.format(pg + 1))
        page_range = range(pdf.pageCount)
        pdf.close()
        return page_range


    # Convert the text in each page image to an HTML string
    def read_png_str(page_range):
        # Helper that reads a local image file as bytes
        def get_file_content(filePath):
            with open(filePath, 'rb') as fp:
                return fp.read()

        all_pngstr = []
        image_list = []
        for page_num in page_range:
            # Read the local image
            image = get_file_content(image_path + os.sep + r'{}_{}.png'.format(pdfname[:-4], '%03d' % (page_num + 1)))
            image_list.append(image)

        # Create an AipOcr client
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        options = {}
        options["language_type"] = "CHN_ENG"
        options["detect_direction"] = "false"
        options["detect_language"] = "false"
        options["probability"] = "false"
        for image in image_list:
            # Run text recognition; the result is a dict
            pngjson = client.basicGeneral(image, options)
            pngstr = ''
            for x in pngjson['words_result']:
                # One recognized line per HTML line break
                pngstr = pngstr + x['words'] + '<br>'
            print('Calling the Baidu API: page {} of {}'.format(len(all_pngstr) + 1, len(image_list)))
            all_pngstr.append(pngstr)
        return all_pngstr


    def str2pdf(page_range, all_pngstr):
        # Write each page's string to its own single-page PDF
        for page_num in page_range:
            print('Writing string to PDF: page {} of {}'.format((page_num + 1), len(page_range)))
            pdfkit.from_string((all_pngstr[page_num]), disperse_pdfpath + os.sep + '%s.pdf' % (str(page_num + 1)),
                               configuration=pdfkit_config, options=pdfkit_options)


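    def pdf_merge(page_range):
        # Merge the single-page PDFs into one file
        # PdfFileWriter/PdfFileReader, getPage and addPage are the older PyPDF2 names;
        # PyPDF2 3.0 replaced them with PdfWriter/PdfReader, reader.pages and add_page.
        pdf_output = PdfFileWriter()
        for page_num in page_range:
            print('Merging single pages: page {} of {}'.format((page_num + 1), len(page_range)))
            # The input files must stay open until the merged file has been written
            pdf_input = PdfFileReader(open(disperse_pdfpath + os.sep + '%s.pdf' % (str(page_num + 1)), 'rb'))
            page = pdf_input.getPage(0)
            pdf_output.addPage(page)
        newPdfPath = pdfpath + os.sep + 'new_{}'.format(pdfname)
        with open(newPdfPath, 'wb') as out_file:
            pdf_output.write(out_file)
        return newPdfPath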


    # Folder for the page images
    image_path = pdfpath + os.sep + "image"
    if not os.path.exists(image_path):
        os.mkdir(image_path)

    # Folder for the single-page PDFs
    disperse_pdfpath = pdfpath + os.sep + "pdf"
    if not os.path.exists(disperse_pdfpath):
        os.mkdir(disperse_pdfpath)

    range_count = pdf_image()
    all_th = read_png_str(range_count)
    str2pdf(range_count, all_th)
    pdf_merge(range_count)
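
The stages of this program are linked only by file names and by the `words_result` key of the OCR response, so a mismatch in either breaks the run. The sketch below isolates those two pieces. The helper names are hypothetical, the HTML escaping is an addition the program does not perform, and the idea that a failed response carries `error_msg` instead of `words_result` is an assumption:

```python
import html


def page_image_name(pdfname, page_index):
    """Build the PNG name for a zero-based page index, e.g. book.pdf -> book_001.png."""
    stem = pdfname[:-4]  # drop the ".pdf" suffix, as the program does
    return "{}_{:03d}.png".format(stem, page_index + 1)


def words_to_html(ocr_result):
    """Join recognized lines with <br>, escaping characters that are special in HTML."""
    if "words_result" not in ocr_result:
        # Assumed failure shape: a dict carrying error_code / error_msg
        raise ValueError("OCR failed: {}".format(ocr_result.get("error_msg", ocr_result)))
    return "".join(html.escape(item["words"]) + "<br>" for item in ocr_result["words_result"])
```

Escaping matters because the string is handed to wkhtmltopdf as HTML: a recognized `<` would otherwise be read as the start of a tag.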

 

II. Converting a scanned PDF into a Word document with copyable text

1. With the environment from the previous section in place, install the python-docx Python module

pip3 install python-docx

2. Program:

    from docx import Document
    from aip import AipOcr
    import pdfkit
    import fitz
    import os

    # Working folder, PDF file name, and location of the wkhtmltopdf executable
    pdfpath = r'D:\pdf'
    pdfname = '水滸傳.pdf'
    path_wk = r'D:/Procedure/wkhtmltopdf/bin/wkhtmltopdf.exe'

    # Credentials from the Baidu console (placeholder values)
    APP_ID = '123456789'
    API_KEY = 'abcdefg'
    SECRET_KEY = 'qwertyuiop'

    # ---------------------------------------------------------------------------
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)
    pdfkit_options = {'encoding': 'UTF-8'}


    # Render each PDF page to a PNG image
    def pdf_image():
        pdf = fitz.open(pdfpath + os.sep + pdfname)
        # Older PyMuPDF names; newer releases use page_count, prerotate, get_pixmap and Pixmap.save.
        for pg in range(0, pdf.pageCount):
            # Get the page object
            page = pdf[pg]
            # Identity matrix: no scaling, no rotation
            trans = fitz.Matrix(1.0, 1.0).preRotate(0)
            # Render the page to a pixmap
            pm = page.getPixmap(matrix=trans, alpha=False)
            # Save the image, e.g. <name>_001.png
            pm.writePNG(image_path + os.sep + pdfname[:-4] + '_' + '{:0>3d}.png'.format(pg + 1))
        page_range = range(pdf.pageCount)
        pdf.close()
        return page_range


    # Convert the text in each page image to a string
    def read_png_str(page_range):
        # Helper that reads a local image file as bytes
        def get_file_content(filePath):
            with open(filePath, 'rb') as fp:
                return fp.read()

        allPngStr = []
        image_list = []
        for page_num in page_range:
            # Read the local image
            image = get_file_content(image_path + os.sep + r'{}_{}.png'.format(pdfname[:-4], '%03d' % (page_num + 1)))
            image_list.append(image)

        # Create an AipOcr client
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        # Optional parameters
        options = {}
        options["language_type"] = "CHN_ENG"
        options["detect_direction"] = "false"
        options["detect_language"] = "false"
        options["probability"] = "false"
        for image in image_list:
            # General text recognition; the result is a dict
            pngjson = client.basicGeneral(image, options)
            pngstr = ''
            for x in pngjson['words_result']:
                pngstr = pngstr + x['words'] + '\n'
            print('Calling the Baidu API: page {} of {}'.format(len(allPngStr) + 1, len(image_list)))
            allPngStr.append(pngstr)
        return allPngStr


    def str2word(allPngStr):
        # Write one bulleted paragraph per page into a new Word document
        document = Document()
        for i in allPngStr:
            document.add_paragraph(
                i, style='List Bullet'
            )
        document.save(pdfpath + os.sep + pdfname[:-4] + '.docx')

        print('Done')


    # Folder for the page images
    image_path = pdfpath + os.sep + "image"
    if not os.path.exists(image_path):
        os.mkdir(image_path)

    range_count = pdf_image()
    allPngStr = read_png_str(range_count)
    str2word(allPngStr)
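
Each page costs one network call to the OCR service, and the program indexes `pngjson['words_result']` directly, so a single failed call stops the run with a KeyError. A minimal sketch of a retry wrapper follows. It assumes that a failed call returns a dict without the `words_result` key, and the helper name `ocr_with_retry` is hypothetical:

```python
import time


def ocr_with_retry(recognize, image, retries=3, delay=1.0):
    """Call recognize(image) until the result contains 'words_result'.

    Assumption: a failed call returns a dict without that key.
    """
    result = {}
    for attempt in range(1, retries + 1):
        result = recognize(image)
        if "words_result" in result:
            return result
        if attempt < retries:
            time.sleep(delay)  # brief pause before the next attempt
    raise RuntimeError("OCR failed after {} attempts: {}".format(retries, result))
```

In the program above it could wrap the existing call, for example `pngjson = ocr_with_retry(lambda img: client.basicGeneral(img, options), image)`.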

III. Converting the text in a PDF to a Word document

1. Install the following two Python modules

    pip3 install pdfminer3k

    pip3 install python-docx

2. Program:

    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.layout import LAParams
    from pdfminer.converter import PDFPageAggregator
    from docx import Document
    import warnings


    # PDF to convert; the .docx is written next to it
    filePath = 'D:/pdf/水滸傳.pdf'


    document = Document()
    warnings.filterwarnings("ignore")


    def pdf2word():
        with open(filePath, 'rb') as fn:
            # Link the parser and the document object to each other
            parser = PDFParser(fn)
            doc = PDFDocument()
            parser.set_document(doc)
            doc.set_parser(parser)
            # Layout analysis pipeline
            resource = PDFResourceManager()
            laparams = LAParams()
            device = PDFPageAggregator(resource, laparams=laparams)
            interpreter = PDFPageInterpreter(resource, device)
            for i in doc.get_pages():
                interpreter.process_page(i)
                layout = device.get_result()
                for out in layout:
                    # Only text-bearing layout objects have get_text()
                    if hasattr(out, "get_text"):
                        content = out.get_text().replace(u'\xa0', u' ')
                        document.add_paragraph(
                            content, style='List Bullet'
                        )
        document.save(filePath[:-4] + '.docx')
        print('Done')


    if __name__ == '__main__':
        pdf2word()
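
The program adds every text block to the document as returned, replacing only the non-breaking space. As a sketch, a slightly stricter cleanup could also drop blank lines and skip blocks that contain nothing but whitespace. The helper name `clean_block` is hypothetical:

```python
def clean_block(text):
    """Normalize one extracted text block; return None if nothing printable remains."""
    text = text.replace("\xa0", " ")  # non-breaking space -> plain space, as in the program
    lines = [line.strip() for line in text.splitlines()]
    joined = "\n".join(line for line in lines if line)
    return joined or None
```

Inside the loop, a block would then be added only when `clean_block(out.get_text())` returns a string.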

Reference blog: https://blog.csdn.net/dianepure/article/details/88568761

 
