Pandas是Python領(lǐng)域強(qiáng)大的數(shù)據(jù)處理與分析的類庫(kù),,而SQL是關(guān)系型數(shù)據(jù)庫(kù)MYSQL,、Oracle的查詢語(yǔ)言,,這兩者都是對(duì)表格型數(shù)據(jù)的查詢和操作,,有很多相似之處,,本文和視頻(視頻在文章最下方)實(shí)現(xiàn)兩者的對(duì)比,。
Pandas和數(shù)據(jù)庫(kù)查詢語(yǔ)言SQL的對(duì)比
Pandas:Python最流行的數(shù)據(jù)處理與數(shù)據(jù)分析的類庫(kù)
SQL:結(jié)構(gòu)化查詢語(yǔ)言,,用于對(duì)MySQL、Oracle等關(guān)系型數(shù)據(jù)庫(kù)的增刪改查
兩者都是對(duì)“表格型”數(shù)據(jù)的操作和查詢,,所以很多語(yǔ)法都能對(duì)應(yīng)起來(lái)
對(duì)比列表:
SELECT數(shù)據(jù)查詢
WHERE按條件查詢
in和not in的條件查詢
groupby分組統(tǒng)計(jì)
JOIN數(shù)據(jù)關(guān)聯(lián)
UNION數(shù)據(jù)合并
Order Limit先排序后分頁(yè)
取每個(gè)分組group的top n
UPDATE數(shù)據(jù)更新
DELETE刪除數(shù)據(jù)
0. 讀取泰坦尼克數(shù)據(jù)集
import pandas as pd
import numpy as np
df = pd.read_csv('./datas/titanic/titanic_train.csv')
df.head()
1. SELECT數(shù)據(jù)查詢
# SQL:
sql = '''
SELECT PassengerId, Sex, Age, Survived
FROM titanic
LIMIT 5;
'''
# Pandas
df[['PassengerId', 'Sex', 'Age', 'Survived']].head(5)
df.head(5)類似select * from table limit 5,,查詢所有的字段
2. WHERE按條件查詢
# SQL:
sql = '''
SELECT *
FROM titanic
where Sex='male' and Age>=20.0 and Age<=40.0
LIMIT 5;
'''
# 使用括號(hào)的方式,級(jí)聯(lián)多個(gè)條件|
condition = (df['Sex']=='male') & (df['Age']>=20.0) & (df['Age']<=40.0)
condition.value_counts()
df[condition].head(5)
3. in和not in的條件查詢
df['Pclass'].unique()
# SQL:
sql = '''
SELECT *
FROM titanic
where Pclass in (1,2)
LIMIT 5;
'''
# in
df[df['Pclass'].isin((1,2))].head()
# not in
df[~df['Pclass'].isin((1,2))].head()
4. groupby分組統(tǒng)計(jì)
4.1 單個(gè)列的聚合
# SQL:
sql = '''
SELECT
-- 分性別的存活人數(shù)
sum(Survived),
-- 分性別的平均年齡
mean(Age)
-- 分性別的平均票價(jià)
mean(Fare)
FROM titanic
group by Sex
'''
df.groupby('Sex').agg({'Survived':np.sum, 'Age':np.mean, 'Fare':np.mean})
4.2 多個(gè)列的聚合
# SQL:
sql = '''
SELECT
-- 不同存活和性別分組的,,平均年齡
mean(Age)
-- 不同存活和性別分組的,,平均票價(jià)
mean(Fare)
FROM titanic
group by Survived, Sex
'''
df.groupby(['Survived', 'Sex']).agg({'Age':np.mean, 'Fare':np.mean})
5. JOIN數(shù)據(jù)關(guān)聯(lián)
# 電影評(píng)分?jǐn)?shù)據(jù)集,評(píng)分表
df_rating = pd.read_csv('./datas/ml-latest-small/ratings.csv')
df_rating.head(5)
# 電影評(píng)分?jǐn)?shù)據(jù)集,,電影信息表
df_movies = pd.read_csv('./datas/ml-latest-small/movies.csv')
df_movies.head(5)
# SQL:
sql = '''
SELECT *
FROM
rating join movies
on(rating.movieId=movies.movieId)
limit 5
'''
df_merged = pd.merge(left=df_rating, right=df_movies, on='movieId')
df_merged.head(5)
6. UNION數(shù)據(jù)合并
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
'rank': range(1, 4)})
df1
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
'rank': [1, 4, 5]})
df2
# SQL:
sql = '''
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
'''
# pandas
pd.concat([df1, df2])
7. Order Limit先排序后分頁(yè)
# SQL:
sql = '''
SELECT *
from titanic
order by Fare
limit 5
'''
df.sort_values('Fare', ascending=False).head(5)
8. 取每個(gè)分組group的top n
# MYSQL不支持
# Oracle有ROW_NUMBER語(yǔ)法
# 按(Survived,,Sex)分組,取Age的TOP 2
df.groupby(['Survived', 'Sex']).apply(
lambda df:df.sort_values('Age', ascending=False).head(2))
9. UPDATE數(shù)據(jù)更新
df.info()
# SQL:
sql = '''
UPDATE titanic
set Age=0
where Age is null
'''
condition = df['Age'].isna()
condition.value_counts()
df[condition] = 0
df['Age'].isna().value_counts()
10. DELETE刪除數(shù)據(jù)
# SQL:
sql = '''
DELETE FROM titanic
where Age=0
'''
df_new = df[df['Age']!=0]
df_new[df_new['Age']==0]