Pandas and Jinja: Easily Create a PDF Report

F2967527 2022-01-28


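We all know that Pandas is good at processing large amounts of data and summarizing it in a variety of textual and visual representations, and that it can write its results to CSV, Excel, HTML, JSON, and other formats. Combining several pieces of data into a single document, however, is more complicated. For example, to put two DataFrames on one Excel worksheet, you have to build the output by hand with an Excel library. That is possible, but it is not simple. This article describes one way to combine multiple pieces of information into an HTML template and then convert it into a standalone PDF document using Jinja templates and WeasyPrint.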

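Overall process

As shown in the reporting article, it is very convenient to use Pandas to write data to multiple sheets in one Excel file, or to create multiple Excel files from pandas DataFrames. However, if we want to combine multiple pieces of information into a single file, there are few simple ways to do it directly from Pandas. Below we explore one workable, simple approach.

In this article, I use the following process to create a multi-page PDF document: summarize the data with Pandas, render it into an HTML template with Jinja, and convert the resulting HTML to PDF with WeasyPrint.

The nice thing about this approach is that you can substitute your own tools into the workflow. Don't like Jinja? Plug in Mako or any other templating tool.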

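Choosing the tools

First, we use HTML as the templating language, because it is probably the simplest way to generate structured data while still allowing relatively rich formatting.

Second, I chose Jinja because I have experience with Django and Flask, which makes it easy to pick up.

The hardest part of this toolchain is figuring out how to render HTML to PDF. I don't think there is a really good solution yet. I chose WeasyPrint here, but you can also try other tools.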

Processing the data

Import the modules and read in the sales information:

from __future__ import print_function
import pandas as pd
import numpy as np
df = pd.read_excel('sales-funnel.xlsx')
df.head()


Summarize the data with a pivot table:

sales_report = pd.pivot_table(df, index=['Manager', 'Rep', 'Product'], values=['Price', 'Quantity'],
                           aggfunc=['sum', 'mean'], fill_value=0)
sales_report.head()
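To see what the pivot step produces without the Excel file, here is a minimal sketch on a small in-memory DataFrame. The rows, manager names, and rep names are made up for illustration. Only the column names follow the article. It uses the string aggregation names 'sum' and 'mean', which pandas accepts in place of the NumPy functions.

```python
import pandas as pd

# Hypothetical stand-in for sales-funnel.xlsx; only the column names come from the article.
df = pd.DataFrame({
    "Manager": ["Manager A", "Manager A", "Manager A", "Manager B", "Manager B"],
    "Rep": ["Rep 1", "Rep 1", "Rep 1", "Rep 2", "Rep 3"],
    "Product": ["CPU", "CPU", "Software", "CPU", "CPU"],
    "Quantity": [1, 3, 2, 3, 5],
    "Price": [30000, 90000, 10000, 35000, 65000],
})

sales_report = pd.pivot_table(
    df,
    index=["Manager", "Rep", "Product"],
    values=["Price", "Quantity"],
    aggfunc=["sum", "mean"],
    fill_value=0,
)

# Rows are keyed by (Manager, Rep, Product); columns by (aggregation, value).
print(sales_report)
```

The two CPU rows for the same manager and rep collapse into a single row of the pivot table, which is what makes the result suitable for a summary report.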


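Templates

Jinja templates are very powerful and support many advanced features, such as sandboxed execution and automatic escaping.

Another nice feature of Jinja is that it includes a number of built-in filters, which let us format some of our data in ways that are hard to do within Pandas.

To use Jinja in our application, we need to do 3 things:

1. Create a template
2. Add variables to the template context
3. Render the template into HTML

We start by creating a simple template, myreport.html: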

<!DOCTYPE html>
<html>
<head lang='en'>
    <meta charset='UTF-8'>
    <title>{{ title }}</title>
</head>
<body>
    <h2>Sales Funnel Report - National</h2>
     {{ national_pivot_table }}
</body>
</html>

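The two key parts of this code are {{ title }} and {{ national_pivot_table }}. They are essentially placeholders for the variables we will supply when we render the document.

To populate these variables, we need to create a Jinja environment and get our template: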

from jinja2 import Environment, FileSystemLoader
env = Environment(loader=FileSystemLoader('.'))
template = env.get_template('myreport.html')

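In the example above, we assume the template is in the current directory.

The other key component is the creation of env, the Jinja environment through which we load the template. To pass content to the template, we create a dictionary named template_vars that contains all the variables we want to hand over.

The variable names match the ones used in our template: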

template_vars = {'title' : 'Sales Funnel Report - National',
                 'national_pivot_table': sales_report.to_html()}

The final step is to render the HTML with these variables filled into the output. This produces a string that we will eventually pass to our PDF creation engine:

html_out = template.render(template_vars)
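The three steps (create a template, build the context, render) can be tried without any files on disk. The sketch below swaps FileSystemLoader for Jinja's DictLoader and uses a shortened, hypothetical template and a placeholder table string, so it only illustrates the flow and does not reproduce the article's exact report.

```python
from jinja2 import DictLoader, Environment

# Hypothetical in-memory template, so the sketch needs no myreport.html on disk.
templates = {
    "myreport.html": "<title>{{ title }}</title><h2>{{ title }}</h2>{{ national_pivot_table }}"
}
env = Environment(loader=DictLoader(templates))
template = env.get_template("myreport.html")

# The keys must match the placeholder names used in the template.
template_vars = {
    "title": "Sales Funnel Report - National",
    "national_pivot_table": "<table><tr><td>placeholder</td></tr></table>",
}

html_out = template.render(template_vars)
print(html_out)
```

Note that a plain Environment has autoescaping turned off by default, which is why the table markup is inserted as HTML and not as escaped text. With autoescaping enabled, the table variable would need the safe filter.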

Generating the PDF

The PDF creation step is also relatively simple: we import WeasyPrint's HTML class and pass the rendered string to the PDF generator:

from weasyprint import HTML
HTML(string=html_out).write_pdf('report.pdf')

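This command creates the PDF report, report.pdf.

The report is generated, but it looks rather ugly, so let's improve it by adding CSS.

Here we use typography.css from Blueprint as the basis for our style.css. It has the following advantages:

- It is relatively small and easy to understand
- It works in the PDF engine without raising errors or warnings
- It includes basic table formatting that looks quite decent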

HTML(string=html_out).write_pdf('report.pdf', stylesheets=['style.css'])

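As you can see, this one small change makes a big difference in the result.

A more complex template

To generate a more useful report, we will combine the summary statistics shown above and split the report so that each manager gets a separate PDF page.

Let's start with the updated template (myreport.html):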

<!DOCTYPE html>
<html>
<head lang='en'>
    <meta charset='UTF-8'>
    <title>{{ title }} </title>
</head>
<body>
<div class='container'>
    <h2>Sales Funnel Report - National</h2>
     {{ national_pivot_table }}
    {% include 'summary.html' %}
</div>
<div class='container'>
    {% for manager in Manager_Detail %}
        <p style='page-break-before: always' ></p>
        <h2>Sales Funnel Report - {{manager.0}}</h2>
        {{manager.1}}
        {% include 'summary.html' %}
    {% endfor %}
</div>
</body>
</html>

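The first thing to notice is the include statement, which refers to another file. An include lets us bring in a snippet of HTML and reuse it in different parts of the code. In this case, the summary contains some simple national-level statistics that we want in every report, so that managers can compare their performance with the national average.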
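To illustrate how an include behaves, here is a small sketch with two hypothetical in-memory templates loaded through DictLoader. The template names and contents are simplified stand-ins. The point is only that the included snippet is rendered wherever it is referenced and sees the same context variables as the parent template.

```python
from jinja2 import DictLoader, Environment

# Hypothetical, simplified templates held in memory for illustration.
templates = {
    "summary.html": "<p>Average Quantity: {{ CPU.0 }}</p>",
    "report.html": (
        "<h2>National</h2>{% include 'summary.html' %}"
        "<h2>Manager</h2>{% include 'summary.html' %}"
    ),
}
env = Environment(loader=DictLoader(templates))

# CPU is passed once and is visible inside both included copies.
html_out = env.get_template("report.html").render(CPU=[2.5, 40000])
print(html_out)
```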

Here is what summary.html looks like:

<h3>National Summary: CPUs</h3>
    <ul>
        <li>Average Quantity: {{CPU.0|round(1)}}</li>
        <li>Average Price: {{CPU.1|round(1)}}</li>
    </ul>
<h3>National Summary: Software</h3>
    <ul>
        <li>Average Quantity: {{Software.0|round(1)}}</li>
        <li>Average Price: {{Software.1|round(1)}}</li>
    </ul>

In this snippet, we have access to some additional variables: CPU and Software. Each is a Python list containing the average quantity and average price for CPU and software sales.

Notice also that we use the pipe | to round each value to 1 decimal place. This is a concrete example of using a Jinja filter.
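As a quick check of how filters behave, the sketch below renders a made-up price with Jinja's built-in round filter and, for comparison, the built-in format filter. The variable name price and its value are assumptions for illustration.

```python
from jinja2 import Template

# round(1) keeps one decimal place, as in summary.html.
rounded = Template("Average Price: {{ price|round(1) }}").render(price=12345.6789)

# The format filter applies printf-style formatting to the value.
formatted = Template('Average Price: {{ "%.2f"|format(price) }}').render(price=12345.6789)

print(rounded)
print(formatted)
```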

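There is also a for loop that lets us display the details for each manager in the report. Jinja's template language includes only a very small set of constructs that alter control flow.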
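The loop can be sketched in isolation as follows. The manager names and table strings are placeholders. In the real report the second element of each pair would be the HTML produced by to_html().

```python
from jinja2 import Template

template = Template(
    "{% for manager in Manager_Detail %}"
    "<h2>Sales Funnel Report - {{ manager.0 }}</h2>{{ manager.1 }}"
    "{% endfor %}"
)

# Each entry is a [name, html] pair; manager.0 and manager.1 index into it.
manager_detail = [
    ["Manager A", "<table>A</table>"],
    ["Manager B", "<table>B</table>"],
]

html_out = template.render(Manager_Detail=manager_detail)
print(html_out)
```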

Additional statistics

Next we write the functions and code that feed the template.

A simple summary function:

def get_summary_stats(df,product):
    '''
    For certain products we want National Summary level information on the reports
    Return a list of the average quantity and price
    '''
    results = []
    results.append(df[df['Product']==product]['Quantity'].mean())
    results.append(df[df['Product']==product]['Price'].mean())
    return results

Create the manager details:

manager_df = []
for manager in sales_report.index.get_level_values(0).unique():
    manager_df.append([manager, sales_report.xs(manager, level=0).to_html()])
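To make the per-manager split concrete, here is a sketch on a small made-up dataset. The names and numbers are assumptions. It shows that xs(manager, level=0) selects one manager's rows and drops the Manager level, leaving a table indexed by Rep and Product that to_html() turns into markup.

```python
import pandas as pd

# Hypothetical data; only the column names follow the article.
df = pd.DataFrame({
    "Manager": ["Manager A", "Manager A", "Manager B"],
    "Rep": ["Rep 1", "Rep 1", "Rep 2"],
    "Product": ["CPU", "Software", "CPU"],
    "Quantity": [1, 2, 3],
    "Price": [30000, 10000, 35000],
})
sales_report = pd.pivot_table(
    df,
    index=["Manager", "Rep", "Product"],
    values=["Price", "Quantity"],
    aggfunc=["sum", "mean"],
    fill_value=0,
)

manager_df = []
for manager in sales_report.index.get_level_values(0).unique():
    detail = sales_report.xs(manager, level=0)  # rows for this manager only
    manager_df.append([manager, detail.to_html()])

for name, html in manager_df:
    print(name, len(html))
```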

Finally, call the template with the following variables:

template_vars = {'title' : 'National Sales Funnel Report',
                 'CPU' : get_summary_stats(df, 'CPU'),
                 'Software': get_summary_stats(df, 'Software'),
                 'national_pivot_table': sales_report.to_html(),
                 'Manager_Detail': manager_df}
# Render our file and create the PDF using our css style file
html_out = template.render(template_vars)
HTML(string=html_out).write_pdf('report.pdf',stylesheets=['style.css'])

With that, our PDF report is complete.

Complete code:

from __future__ import print_function
import pandas as pd
import numpy as np
import argparse
from jinja2 import Environment, FileSystemLoader
from weasyprint import HTML


def create_pivot(df, infile, index_list=['Manager', 'Rep', 'Product'], value_list=['Price', 'Quantity']):
    '''
    Create a pivot table from a raw DataFrame and return it as a DataFrame
    '''
    table = pd.pivot_table(df, index=index_list, values=value_list,
                           aggfunc=['sum', 'mean'], fill_value=0)
    return table

def get_summary_stats(df,product):
    '''
    For certain products we want National Summary level information on the reports
    Return a list of the average quantity and price
    '''
    results = []
    results.append(df[df['Product']==product]['Quantity'].mean())
    results.append(df[df['Product']==product]['Price'].mean())
    return results

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate PDF report')
    parser.add_argument('infile', type=argparse.FileType('r'),
    help='report source file in Excel')
    parser.add_argument('outfile', type=argparse.FileType('w'),
    help='output file in PDF')
    args = parser.parse_args()

    df = pd.read_excel(args.infile.name)
    sales_report = create_pivot(df, args.infile.name)

    manager_df = []
    for manager in sales_report.index.get_level_values(0).unique():
        manager_df.append([manager, sales_report.xs(manager, level=0).to_html()])

    env = Environment(loader=FileSystemLoader('.'))
    template = env.get_template('myreport.html')
    template_vars = {'title' : 'National Sales Funnel Report',
                     'CPU' : get_summary_stats(df, 'CPU'),
                     'Software': get_summary_stats(df, 'Software'),
                     'national_pivot_table': sales_report.to_html(),
                     'Manager_Detail': manager_df}

    html_out = template.render(template_vars)
    HTML(string=html_out).write_pdf(args.outfile.name,stylesheets=['style.css'])
