[Original] Master Core Pandas Operations in 10 Minutes: Hands-On Data Analysis from Scratch

ml_Py, 2024-10-25, published from Henan

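Hi everyone, I'm Zhang Beihai.

In data analysis, Pandas is an indispensable Python library.

This article walks through the core Pandas operations using a realistic sales data analysis case. Whether you are new to data analysis or an experienced user looking for a systematic refresher, it is worth bookmarking.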

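1. Project Background

Suppose you are a data analyst at a retail chain and need to process and analyze sales data from regions across the country. The data is spread across different files, including:

Sales records (CSV format)
Customer information (JSON format)

Our goal is to bring these datasets together, clean and analyze them, and end up with useful business insights.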

2. Data Preparation

First, let's create the sample data:

# Create sample CSV data
sales_data = """
date,product,price,quantity,region
2024-01-01,A,100,5,North
2024-01-02,B,200,,South
2024-01-03,A,100,3,East
2024-01-04,C,300,4,West
2024-01-05,B,200,2,North
"""

# Create sample JSON data
customer_data = """
{
    "customers": [
        {"id": 1, "name": "張三", "region": "North"},
        {"id": 2, "name": "李四", "region": "South"}
    ]
}
"""

# Save the data to files.
# Writing as UTF-8 keeps the non-ASCII names readable on any platform.
with open('sales.csv', 'w', encoding='utf-8') as f:
    f.write(sales_data)

with open('customers.json', 'w', encoding='utf-8') as f:
    f.write(customer_data)
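
Before moving on, it can help to see how Pandas will interpret the sample rows. The sketch below is not part of the original walkthrough: it parses the same CSV text from memory with io.StringIO (so no file is needed) and counts missing values per column. The names sales_csv, df_preview, and missing_per_column are made up for this illustration.

```python
import io

import pandas as pd

# Same sample rows as the sales data above
sales_csv = """date,product,price,quantity,region
2024-01-01,A,100,5,North
2024-01-02,B,200,,South
2024-01-03,A,100,3,East
2024-01-04,C,300,4,West
2024-01-05,B,200,2,North
"""

# StringIO wraps the text in a file-like object that read_csv accepts
df_preview = pd.read_csv(io.StringIO(sales_csv))

# The empty quantity field on 2024-01-02 is parsed as NaN,
# which makes the whole column float64 instead of int64
missing_per_column = df_preview.isna().sum()
print(missing_per_column)
print(df_preview.dtypes)
```

Checking this up front explains why the quantity column needs attention during cleaning: one missing value changes the dtype of the entire column.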

3. Data Import

Pandas provides a rich set of import functions that handle many data file formats:

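import json

import pandas as pd

# Import the sales data from CSV
df_sales = pd.read_csv('sales.csv')

# Import the customer data from JSON.
# The records are nested under the "customers" key, so load the file with
# json and flatten that list; pd.read_json alone would produce a single
# column of dicts with no 'region' column.
with open('customers.json', encoding='utf-8') as f:
    df_customers = pd.json_normalize(json.load(f)['customers'])

# Create a daily date range (shown for reference; not used in later steps)
date_range = pd.date_range(start='2024-01-01', end='2024-01-05', freq='D')

# Combine the datasets by matching on region.
# A left merge keeps every sale; regions without a customer record get NaN.
# (Concatenating side by side with axis=1 would pair unrelated rows.)
df_combined = df_sales.merge(df_customers, on='region', how='left')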
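
When two tables are combined by a shared column such as region, it is worth checking which rows found no partner. The following is a minimal sketch, assuming small frames rebuilt by hand from the sample data; the variable names sales, customers, checked, and unmatched are illustrative only. It uses the indicator option of merge, which adds a _merge column describing where each row came from.

```python
import pandas as pd

# Hand-built copies of the relevant sample columns
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "region": ["North", "South", "East", "West", "North"],
})
customers = pd.DataFrame({
    "id": [1, 2],
    "name": ["張三", "李四"],
    "region": ["North", "South"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
checked = sales.merge(customers, on="region", how="left", indicator=True)

# Regions that appear in sales but have no customer record
unmatched = checked.loc[checked["_merge"] == "left_only", "region"].tolist()
print(unmatched)
```

With the sample data, the East and West sales have no matching customer, so their id and name columns are NaN after the merge.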

4. Data Cleaning

Data cleaning is one of the most important steps in data analysis. It includes handling missing values, transforming data, and similar operations:

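# 1. Remove invalid data
# Do this before filling: once NaN is filled, no row counts as empty.
df_sales = df_sales.dropna(how='all')  # drop rows where every field is empty

# 2. Handle missing values
# Fill missing quantities with 0, then cast back to int
# (the NaN had forced the column to float).
df_sales['quantity'] = df_sales['quantity'].fillna(0).astype(int)

# 3. Sort the data
# Stored in a separate variable so df_sales stays in date order.
df_by_price = df_sales.sort_values('price')  # sort by price

# 4. Transform the data
# Vectorized multiplication; same result as a row-wise apply, but faster.
df_sales['total'] = df_sales['price'] * df_sales['quantity']

# 5. Group statistics
region_stats = df_sales.groupby('region').agg({
    'total': 'sum',       # revenue per region
    'quantity': 'count'   # number of sales records per region
})

# 6. Merge data
df_merged = df_sales.join(df_customers.set_index('region'), on='region')

# 7. Rename columns
df_sales = df_sales.rename(columns={'quantity': 'sales_volume'})

# 8. Set the index
df_sales = df_sales.set_index('date')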
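
The group statistics step can also be written with named aggregation, which lets each output column carry a descriptive name and makes the difference between summing and counting explicit. This is a sketch under the assumption that missing quantities have already been filled with 0; the frame df and the output names revenue, units, and orders are invented for the example.

```python
import pandas as pd

# Sample rows after cleaning (missing quantity already filled with 0)
df = pd.DataFrame({
    "region": ["North", "South", "East", "West", "North"],
    "price": [100, 200, 100, 300, 200],
    "quantity": [5, 0, 3, 4, 2],
})
df["total"] = df["price"] * df["quantity"]

# Named aggregation: output_name=(input_column, function)
stats = df.groupby("region").agg(
    revenue=("total", "sum"),      # sum of price * quantity
    units=("quantity", "sum"),     # total units sold
    orders=("quantity", "count"),  # number of sales records
)
print(stats)
```

For the North region this gives a revenue of 900 from 7 units across 2 records, which shows why "count" and "sum" on the same column answer different questions.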

5. Statistical Analysis

With the data cleaned, we can run the statistical analysis:

# 1. Preview the data
print("First 5 rows:")
print(df_sales.head())

print("\nBasic information:")
df_sales.info()  # info() prints its report directly and returns None

# 2. Basic statistics
print("\nSummary statistics:")
print(df_sales.describe())

# 3. Detailed statistics
print("\nSpecific metrics:")
print("Average price:", df_sales['price'].mean())
print("Median price:", df_sales['price'].median())
# count() returns the number of rows; use .sum() for total units sold
print("Number of sales records:", df_sales['sales_volume'].count())
print("Price standard deviation:", df_sales['price'].std())
print("Highest price:", df_sales['price'].max())
print("Lowest price:", df_sales['price'].min())
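
The std() call deserves a closer look, because pandas computes the sample standard deviation by default (ddof=1, dividing by n - 1), while some other tools default to the population version (dividing by n). The sketch below recomputes both by hand for the five sample prices; the variable names are illustrative.

```python
import math

import pandas as pd

prices = pd.Series([100, 200, 100, 300, 200])

mean = prices.mean()                          # 180.0
squared_dev = ((prices - mean) ** 2).sum()    # 28000.0

# Sample standard deviation: divide by n - 1 (pandas default, ddof=1)
sample_std = math.sqrt(squared_dev / (len(prices) - 1))

# Population standard deviation: divide by n (ddof=0)
population_std = math.sqrt(squared_dev / len(prices))

print(round(sample_std, 2), round(prices.std(), 2))
print(round(population_std, 2), round(prices.std(ddof=0), 2))
```

For these prices the sample value is about 83.67 and the population value about 74.83, so it matters which one a report quotes.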

6. Sample Results

First 5 rows:
           product  price  sales_volume region  total
date
2024-01-01       A    100             5  North    500
2024-01-02       B    200             0  South      0
2024-01-03       A    100             3   East    300
2024-01-04       C    300             4   West   1200
2024-01-05       B    200             2  North    400

Average price: 180.0
Median price: 200.0
Number of sales records: 5
Price standard deviation: 83.67
Highest price: 300
Lowest price: 100
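
As one way to turn these numbers into a business insight, the totals can be ranked by region. This is only a sketch: the frame results is typed in by hand from the table above, and the names revenue_by_region, top_region, and share are hypothetical.

```python
import pandas as pd

# Totals copied from the results table above
results = pd.DataFrame({
    "product": ["A", "B", "A", "C", "B"],
    "region": ["North", "South", "East", "West", "North"],
    "total": [500, 0, 300, 1200, 400],
})

# Revenue per region, largest first
revenue_by_region = (
    results.groupby("region")["total"].sum().sort_values(ascending=False)
)

# Region with the highest revenue and each region's share of the total
top_region = revenue_by_region.idxmax()
share = revenue_by_region / revenue_by_region.sum()

print(revenue_by_region)
print("Top region:", top_region)
```

In this small sample, West accounts for half of all revenue from a single sale, while South contributes nothing because its quantity was missing and filled with 0. With so few rows, that is a prompt to check the data rather than a firm conclusion.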