Preface
Excel is the most commonly used tool in data analysis. By comparing Python with Excel feature by feature, this text shows how to use Python's functions to carry out the data processing and analysis work normally done in Excel.
In Python, the pandas library handles data processing. From the 1,787-page pandas documentation we have distilled the 36 most commonly used functions, and through them we show how to complete the most common operations in Python: generating and importing data, cleaning and preprocessing it, and classifying, filtering, summarizing, and pivoting it.
Chapter 1: Generating the Data Table
Excel
There are two common ways to generate a data table: importing external data, or entering the data directly.
Excel's File menu provides a get-external-data feature that supports importing from many sources, including databases, text files, and web pages.
Python
Python supports importing data from many kinds of sources.
Before importing data with Python, first import the pandas library; for convenience we also import numpy.
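A minimal sketch of the two imports, using the conventional aliases pd and np:

```python
import numpy as np   # numerical helpers (e.g. np.nan for missing values)
import pandas as pd  # the data-processing library used throughout
```
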
1. Importing a data table
Below are the methods for importing data from Excel and csv files and creating a data table.
The code shows the simplest form of each call; there are many optional parameters, such as column names, the index column, and data formats.
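A sketch of the two import calls. The file names are hypothetical placeholders, not files supplied with the text, so the runnable part below parses CSV text from an in-memory buffer instead:

```python
import io

import pandas as pd

# With real files the calls would look like this (paths are hypothetical):
# df = pd.read_excel('result.xlsx')
# df = pd.read_csv('result.csv')

# Self-contained demo: read_csv accepts any object with a read() method.
csv_text = "id,city,price\n1001,beijing,1200\n1002,shanghai,3299\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```
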
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
Read CSV (comma-separated) file into DataFrame
Also supports optionally iterating or breaking of the file
into chunks.
Additional help can be found in the `online docs for IO Tools
<http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
Parameters
----------
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file could
be file://localhost/path/to/table.csv
sep : str, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Separators longer than 1 character and different from ``'\s+'`` will
be interpreted as regular expressions, will force use of the python parsing
engine and will ignore quotes in the data. Regex example: ``'\r\t'``
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
used as the sep. Equivalent to setting ``sep='\s+'``. If this option
is set to True, nothing should be passed in for the ``delimiter``
parameter.
.. versionadded:: 0.18.1 support for the Python parser.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data.
Default behavior is as if set to 0 if no ``names`` passed, otherwise
``None``. Explicitly pass ``header=0`` to be able to replace existing
names. The header can be a list of integers that specify row locations for
a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
specified will be skipped (e.g. 2 in this example is skipped). Note that
this parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you
should explicitly pass header=None. Duplicates in this list are not
allowed unless mangle_dupe_cols=True, which is the default.
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
usecols : array-like, default None
Return a subset of the columns. All elements in this array must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.
as_recarray : boolean, default False
DEPRECATED: this argument will be removed in a future version. Please call
`pd.read_csv(...).to_records()` instead.
Return a NumPy recarray instead of a DataFrame after parsing the data.
If set to True, this option takes precedence over the `squeeze` parameter.
In addition, as row indices are not available in such a format, the
`index_col` parameter will be ignored.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than
'X'...'X'. Passing in False will cause data to be overwritten if there
are duplicate names in the columns.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python'). Use `str` or `object` to preserve and
not interpret dtype.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
true_values : list, default None
Values to consider as True
false_values : list, default None
Values to consider as False
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
skip_footer : int, default 0
DEPRECATED: use the `skipfooter` parameter instead, as they are identical
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted as
NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
'foo'
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format
of the datetime strings in the columns, and if it can be inferred, switch
to a faster method of parsing them. In some cases this can increase the
parsing speed by ~5-10x.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses ``dateutil.parser.parser`` to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array
and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with
``get_chunk()``.
chunksize : int, default None
Return TextFileReader object for iteration. `See IO Tools docs for more
information
<http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on
``iterator`` and ``chunksize``.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
'.zip', or '.xz', respectively, and no decompression otherwise. If using
'zip', the ZIP file must contain only one data file to be read in.
Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
thousands : str, default None
Thousands separator
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
float_precision : string, default None
Specifies which converter the C engine should use for floating-point
values. The options are `None` for the ordinary converter,
`high` for the high-precision converter, and `round_trip` for the
round-trip converter.
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : boolean, default ``True``
When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
whether or not to interpret two consecutive quotechar elements INSIDE a
field as a single ``quotechar`` element.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header` but not by
`skiprows`. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'
with `header=0` will result in 'a,b,c' being
treated as the header.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless,
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser)
buffer_lines : int, default None
DEPRECATED: this argument will be removed in a future version because its
value is not respected by the parser
compact_ints : boolean, default False
DEPRECATED: this argument will be removed in a future version
If compact_ints is True, then for any column that is of integer dtype,
the parser will attempt to cast it as the smallest integer dtype possible,
either signed or unsigned depending on the specification from the
`use_unsigned` parameter.
use_unsigned : boolean, default False
DEPRECATED: this argument will be removed in a future version
If integer columns are being compacted (i.e. `compact_ints=True`), specify
whether the column should be compacted to the smallest signed or unsigned
integer dtype.
memory_map : boolean, default False
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
Returns
-------
result : DataFrame or TextParser
2. Creating a data table
The other approach is to generate a table by entering the data directly. In Excel you simply type the values into cells; in Python it is done with the code below.
The function that generates the table is pandas' DataFrame constructor. The table has 6 rows of data, and each row has 6 fields. We deliberately planted some NA values and problematic fields in the data, for example values containing spaces.
These will be dealt with later, in the data-cleaning step.
From here on we will call the DataFrame by the customary short name df.
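A sketch of the constructor call, reconstructed to match the description (6 rows and 6 fields, NA values planted in price, extra spaces and mixed case planted in city); the exact values are assumptions chosen to be consistent with the results quoted in later chapters:

```python
import numpy as np
import pandas as pd

# Sample table: the specific ids, dates, and prices are assumed values.
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})
print(df)
```
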
This is the table we just created. We did not set an index column; the price field contains NA values, and the city field contains some dirty data.
Chapter 2: Inspecting the Data Table
This chapter covers how to inspect a data table.
The data handled in Python is usually fairly large: New York's taxi trips and Citibike's ride data, for example, both run to tens of millions of rows. A table that size cannot be taken in at a glance, so we need methods that surface its key information.
The other goal of inspection is to understand the data's overall profile, such as the table's size, the space it occupies, the data types, whether there are null values, and the actual contents, in preparation for the cleaning and preprocessing that follow.
1. Data dimensions (rows and columns)
In Excel, Ctrl+Down Arrow and Ctrl+Right Arrow jump to the last row number and the last column number.
In Python, the shape attribute reports the table's dimensions, that is, its row and column counts; the returned result (6, 6) means the table has 6 rows and 6 columns.
The code is below.
2. Data table information
The info function shows the table's overall information. The output is fairly rich, including the table's dimensions, column names, data types, and memory usage.
3. Viewing data types
In Excel, you check a value's format by selecting the cell and looking at the number-format box on the Home tab.
In Python, the dtypes attribute returns the data types.
dtypes reports the type of every column in the table at once, or of a single specified column.
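The three checks above (dimensions, table information, data types) sketched against the sample table, which is rebuilt here with assumed values so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.shape)           # dimensions: (6, 6)
df.info()                 # dimensions, column names, dtypes, memory usage
print(df.dtypes)          # dtype of every column at once
print(df["price"].dtype)  # dtype of a single column: float64 (because of the NAs)
```
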
4. Checking for null values
In Excel, nulls are found with the 'Go To Special' feature, which locates the empty cells in a table.
'Go To Special' lives under 'Find & Select' on the Home tab.
isnull is Python's null check. It returns booleans: True where a value is null, False where it is not.
The check can be run on the whole table or on a single column.
5. Viewing unique values
In Excel, unique values are inspected by marking them with a color via 'Conditional Formatting'.
In Python, the unique function shows them.
unique reports the distinct values and works on one specific column of the table at a time.
The code is below; the result is the unique values of that column.
It is similar to the result of Excel's Remove Duplicates.
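Sketches of the null check and the unique-value check, on the sample table rebuilt with assumed values:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.isnull())           # whole-table check: True marks a null
print(df["price"].isnull())  # single-column check: two True entries
print(df["city"].unique())   # distinct values of one column
```
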
6. Viewing the table's values
Python's values attribute shows the values held in the table,
returned as an array without the header information.
7. Viewing the column names
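Sketches of both inspections, on the sample table rebuilt with assumed values:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.values)   # the data as a numpy array, headers excluded
print(df.columns)  # the column names
```
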
8. Viewing the first N rows of data
The head() function shows the first N rows of the table. By default head() shows the first 5 rows; pass a number to choose how many rows to view.
The code below views the first 3 rows.
9. Viewing the last N rows of data
The tail function is the counterpart of head and shows the last N rows of the table. By default tail() shows the last 5 rows, and again a number can be passed to choose how many.
The code below views the last 3 rows.
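The two previews sketched together, on the sample table rebuilt with assumed values:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.head(3))  # first 3 rows
print(df.tail(3))  # last 3 rows
```
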
Chapter 3: Cleaning the Data Table
This chapter covers cleaning up the problems in a data table, mainly the handling of null values, letter case, data types, and duplicate values.
It does not cover logical validation between fields.
1. Handling null values (drop or fill)
When we created the table, we deliberately placed several NA values in the price field.
There are many ways to handle nulls: delete the rows that contain them, or fill them, for example with 0 or with the mean. Nulls can also be inferred from the logic of other fields.
In Excel, nulls can be handled with 'Find and Replace', replacing them uniformly with 0 or the mean, or by locating them first with 'Go To Special'.
Python is more flexible: the dropna function deletes the rows that contain nulls, and the fillna function fills them. In the code and results below you can see that after dropna, the two rows containing NA values are gone; what comes back is a table without nulls.
Alternatively, nulls can be filled with a number; the code below uses fillna to fill the null fields with 0.
Here we choose filling: using the same fillna function, we first compute the current mean of the price column with the mean function, then fill the NAs with it. The two null slots now show 3299.5.
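The three treatments sketched together, on the sample table rebuilt with assumed values (note the mean of the four known prices is 3299.5, matching the result quoted above):

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.dropna(how="any"))  # drop rows containing any null: 4 rows remain
print(df.fillna(value=0))    # fill the null slots with 0
df["price"] = df["price"].fillna(df["price"].mean())  # fill with the column mean
print(df["price"])           # the two filled slots show 3299.5
```
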
2. Cleaning up spaces
Besides null values, stray spaces inside fields are another common problem in data cleaning; below is the code that removes the spaces from the strings.
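A sketch of the space removal, on the sample table rebuilt with assumed values:

```python
import pandas as pd

# Rebuild just the dirty city column from the sample table (assumed values).
df = pd.DataFrame({"city": ["Beijing ", "SH", " guangzhou ",
                            "Shenzhen", "shanghai", "BEIJING "]})

df["city"] = df["city"].map(str.strip)  # strip leading/trailing spaces
print(df["city"])
```
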
3. Case conversion
In English fields, inconsistent letter case is another common problem.
Excel has UPPER, LOWER, and similar functions, and Python has functions of the same names for fixing case. Our table's city column has exactly this problem.
We convert all the letters in the city column to lowercase; the code and result are below.
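A sketch of the conversion, continuing from the space-stripping step (sample values are assumptions):

```python
import pandas as pd

# Rebuild the city column (assumed values).
df = pd.DataFrame({"city": ["Beijing ", "SH", " guangzhou ",
                            "Shenzhen", "shanghai", "BEIJING "]})

df["city"] = df["city"].map(str.strip)  # spaces removed in the previous step
df["city"] = df["city"].str.lower()     # lowercase every letter
print(df["city"])
```
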
4. Changing the data type
In Excel, data types are changed through 'Format Cells'.
In Python, the astype function changes a data type.
dtype inspects a column's current type; its counterpart astype converts it. The code below converts the price field to int.
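A sketch of the conversion (assumed sample values; the nulls must be filled first, since NaN cannot be stored in an int column, and astype truncates the fractional part):

```python
import numpy as np
import pandas as pd

# Rebuild the price column with its two nulls (assumed values).
df = pd.DataFrame({"price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

df["price"] = df["price"].fillna(df["price"].mean())  # fill nulls before converting
df["price"] = df["price"].astype("int")               # 3299.5 truncates to 3299
print(df["price"].dtype)
```
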
5. Renaming columns
rename is the column-renaming function; we rename the table's category column to category-size.
The code and the result of the change are below.
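A sketch of the rename (sample values are assumptions):

```python
import pandas as pd

# Rebuild the category column (assumed values).
df = pd.DataFrame({"category": ["100-A", "100-B", "110-A",
                                "110-C", "210-A", "130-F"]})

df = df.rename(columns={"category": "category-size"})
print(df.columns)
```
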
6. Removing duplicate values
Many tables also contain duplicates. Excel's Data tab provides 'Remove Duplicates' for deleting duplicated values from a table.
By default Excel keeps the first occurrence of a value and deletes the later duplicates.
In Python, the drop_duplicates function removes duplicates.
We take the table's city column as the example; the city field contains duplicate values.
By default drop_duplicates() deletes the later occurrences (the same logic as Excel); with the keep='last' argument it instead deletes the earlier occurrences and keeps the last one.
The code and a comparison of the two results are below.
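A sketch of both variants, on the cleaned city column (assumed values; beijing appears first and last):

```python
import pandas as pd

# Cleaned city column after stripping and lowercasing (assumed values).
city = pd.Series(["beijing", "sh", "guangzhou",
                  "shenzhen", "shanghai", "beijing"])

print(city.drop_duplicates())             # default: keep the first beijing
print(city.drop_duplicates(keep="last"))  # keep the last beijing instead
```
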
With the default drop_duplicates(), the result shows that the first beijing is kept and the last beijing is deleted.
With keep='last', the result is the reverse of the previous one: the first beijing is deleted and the last one is kept.
7. Modifying and replacing values
The last cleaning problem is modifying or replacing values; in Excel this is done with 'Find and Replace'.
In Python, the replace function performs the replacement.
In the table's city field, Shanghai appears in two spellings, shanghai and SH.
We use replace to replace the SH values.
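A sketch of the replacement, continuing from the earlier cleaning steps (assumed values; after lowercasing, SH appears as sh):

```python
import pandas as pd

# Cleaned city column after stripping and lowercasing (assumed values).
df = pd.DataFrame({"city": ["beijing", "sh", "guangzhou",
                            "shenzhen", "shanghai", "beijing"]})

df["city"] = df["city"].replace("sh", "shanghai")
print(df["city"])
```
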
Chapter 4: Preprocessing the Data
This chapter covers preprocessing: organizing the cleaned data in preparation for the statistics and analysis that follow.
That mainly means merging tables, sorting, splitting a column into several, and grouping and flagging data.
1. Merging data tables
First we merge different tables: here we create a new table df1 and merge it with df.
Excel has no feature that merges tables directly in one step; it can be done stepwise with the vlookup function.
In Python, the merge function does it in a single call.
Below we build the df1 table that will be merged with df.
We merge the two tables with the merge function using the inner method, matching the data the two tables have in common into a new table named df_inner.
Besides inner, the merge methods also include left, right, and outer. The differences between them are explained and compared in detail in my other articles.
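A sketch of df1 and the inner merge. df1's fields and values are assumptions; the merge matches on the shared id column, and two of df1's ids (1007, 1008) have no partner in df, so an inner join drops them:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

# df1 shares the id key and adds a few user attributes (assumed values).
df1 = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
     "gender": ["male", "female", "male", "female",
                "male", "female", "male", "female"],
     "pay": ["Y", "N", "Y", "Y", "N", "Y", "N", "Y"],
     "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

# inner keeps only the ids present in both tables.
df_inner = pd.merge(df, df1, how="inner")
print(df_inner.shape)  # (6, 9)
```
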
2. Setting an index column
With the merge complete, we set an index column on df_inner. An index has many uses: extracting and summarizing data, and also filtering it.
The function that sets the index is set_index.
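A sketch of set_index, on a small stand-in for the merged df_inner table (values are assumptions):

```python
import pandas as pd

# Stand-in for the merged, cleaned df_inner table (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

df_inner = df_inner.set_index("id")  # id becomes the row label
print(df_inner.index)
```
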
3. Sorting (by index, by value)
In Excel, a table is sorted directly with the sort buttons on the Data tab, which is fairly simple.
In Python, sorting is done with the sort_values and sort_index functions.
A table can be sorted either by its index or by the values of a specified column.
First we sort the table by the users' ages in the age column.
The function used is sort_values.
The sort_index function sorts the table by the values of its index column.
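Both sorts sketched on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for the merged, cleaned df_inner table (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

by_age = df_inner.sort_values(by=["age"])                        # sort by a column's values
by_index = df_inner.set_index("id").sort_index(ascending=False)  # sort by the index
print(by_age)
print(by_index)
```
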
4. Grouping data
In Excel, values are grouped with an approximate-match vlookup or with a PivotTable.
In Python, the corresponding tool is the where function.
The where function evaluates a condition and groups the data by it: in the code below we test the values of the price column, put the rows that meet the condition in one group and the rest in another, and mark the groups in a group field.
Beyond a single where condition, data can also be grouped by testing several fields at once; the code below flags with 1 the rows where city equals beijing and price is greater than or equal to 4000.
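Both groupings sketched on a stand-in for df_inner (values and the 3000 threshold for the group column are assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in for the merged, cleaned df_inner table (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

# Single condition: label each row high or low by its price.
df_inner["group"] = np.where(df_inner["price"] > 3000, "high", "low")

# Multiple conditions: flag beijing rows with price >= 4000.
df_inner.loc[(df_inner["city"] == "beijing")
             & (df_inner["price"] >= 4000), "sign"] = 1
print(df_inner)
```
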
5. Splitting a column
The opposite of grouping is splitting a value into columns; Excel provides 'Text to Columns' under its Data tab.
In Python, the split function performs the splitting.
The category column carries two pieces of information: the number in front is the category id and the letter behind is the size value, joined by a hyphen. We use split to break this field apart and match the resulting columns back into the original table.
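A sketch of the split and the merge back (sample values are assumptions; the new columns are named category_id and size here to avoid clashing with the original category column):

```python
import pandas as pd

# Stand-in table with the combined category field (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"]})

# Split each value at the hyphen into two new columns.
split = pd.DataFrame([x.split("-") for x in df_inner["category"]],
                     index=df_inner.index,
                     columns=["category_id", "size"])

# Match the split columns back into the original table by row index.
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
print(df_inner)
```
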
Chapter 5: Extracting Data
Data extraction is one of the most common tasks in data analysis.
This part mainly uses three functions: loc, iloc, and ix.
loc extracts by label value;
iloc extracts by position;
ix extracts by label and position at once.
The usage of each is introduced below.
1. Extracting by label: the loc function
loc extracts by the table's index labels; the code below extracts the single row whose index is 3.
A colon bounds the range of the extraction: the value before the colon is the starting label, the value after it is the ending label.
Below we extract the rows labeled 0 through 5.
The reset_index function resets the index; here we set the dates in the date field as the table's new index and extract data by date.
With a colon bounding the range, leaving the part before the colon empty means starting from the beginning.
This extracts all the data up to January 4, 2013.
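The label-based extractions sketched on a stand-in for df_inner (values and dates are assumptions):

```python
import pandas as pd

# Stand-in for df_inner with a default 0..5 index (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

print(df_inner.loc[3])    # the single row whose index label is 3
print(df_inner.loc[0:5])  # label range: rows 0 through 5, end inclusive

df_date = df_inner.set_index("date")  # the dates become the index
print(df_date[:"2013-01-04"])         # everything up to Jan 4, 2013
```
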
2. Extracting by position: the iloc function
iloc extracts data from the table by position: the numbers around the colon are no longer index labels but the positions of the data, counted from 0.
Besides extracting a region, iloc can also pick out individual entries by position: 0, 2, 5 in the first brackets are the positions of the rows, and the numbers in the second brackets are the positions of the columns.
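Both iloc forms sketched on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for df_inner; column positions 4 and 5 are age and price (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

print(df_inner.iloc[:3, :2])             # region: first 3 rows, first 2 columns
print(df_inner.iloc[[0, 2, 5], [4, 5]])  # individual rows 0, 2, 5; columns 4 and 5
```
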
3. Extracting by label and position: the ix function
ix is a hybrid of loc and iloc that can extract both by index label and by position.
In the code below the rows are selected by the date index and the columns by position.
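A sketch of the same extraction. Note that ix, current when this text was written, was later deprecated (pandas 0.20) and removed (pandas 1.0); the modern equivalent chains loc for the label part with iloc for the position part:

```python
import pandas as pd

# Stand-in for df_inner indexed by date (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})
df_date = df_inner.set_index("date")

# Legacy form (pandas < 1.0): df_date.ix[:'2013-01-03', :4]
# Modern equivalent: rows by date label, then the first 4 columns by position.
subset = df_date.loc[:"2013-01-03"].iloc[:, :4]
print(subset)
```
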
4.按條件提?。▍^(qū)域和條件值)
除了按標(biāo)簽和位置提取數(shù)據(jù)以外,,還可以按具體的條件取數(shù)。
下面使用loc和isin兩個(gè)函數(shù)配合使用,,按指定條件對(duì)數(shù)據(jù)進(jìn)行提取,。
使用isin函數(shù)對(duì)city中的值是否為beijing進(jìn)行判斷,。
將isin函數(shù)嵌套到loc的數(shù)據(jù)提取函數(shù)中,將判斷結(jié)果為True數(shù)據(jù)提取出來(lái),。
這里我們把判斷條件改為city值是否為beijing和shanghai,,如果是,,就把這條數(shù)據(jù)提取出來(lái),。
數(shù)據(jù)提取還可以完成類(lèi)似數(shù)據(jù)分列的工作,,從合并的數(shù)值中提取出指定的數(shù)值。
#提取類(lèi)別的字段的前3個(gè)字符
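The conditional extractions sketched on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for df_inner (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

print(df_inner.loc[df_inner["city"].isin(["beijing"])])  # rows where city is beijing
both = df_inner.loc[df_inner["city"].isin(["beijing", "shanghai"])]
print(both)  # 4 rows match either city

# Extract the first 3 characters of the category field.
first3 = pd.DataFrame(df_inner["category"].str[:3])
print(first3)
```
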
Chapter 6: Filtering Data
This chapter uses the three conditions AND, OR, and NOT, combined with greater-than, less-than, and equals, to filter data and then count and sum it.
This parallels Excel's filter feature and its countifs and sumifs functions.
1. Filtering by condition (AND, OR, NOT)
Excel's Data tab provides a 'Filter' feature for filtering a table on different conditions.
In Python, the loc function combined with filter conditions does the filtering.
Together with the sum and count functions it can also reproduce Excel's sumif and countif.
Filtering on an AND condition: age greater than 25 and city equal to beijing. Only one row survives the filter.
Filtering on an OR condition: age greater than 25, or city equal to beijing.
Appending the price field and the sum function to the previous code sums the price values of the filtered rows, the equivalent of Excel's sumifs.
Filtering on a NOT condition: city not equal to beijing. Four rows qualify; the result is sorted by the id column.
Appending the city column and the count function counts the filtered rows, the equivalent of Excel's countifs.
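The three filters sketched on a stand-in for df_inner (values are assumptions, chosen so that exactly one row passes the AND filter and four pass the NOT filter, as described above):

```python
import pandas as pd

# Stand-in for df_inner (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

# AND: age > 25 and city == beijing -> one row.
and_rows = df_inner.loc[(df_inner["age"] > 25) & (df_inner["city"] == "beijing"),
                        ["id", "city", "age", "price"]]
print(and_rows)

# OR: age > 25 or city == beijing, then sum price over the result (like sumifs).
or_rows = df_inner.loc[(df_inner["age"] > 25) | (df_inner["city"] == "beijing"),
                       ["id", "city", "age", "price"]]
print(or_rows["price"].sum())

# NOT: city != beijing, sorted by id; count the rows (like countifs).
not_rows = df_inner.loc[df_inner["city"] != "beijing",
                        ["id", "city", "age"]].sort_values(["id"])
print(not_rows["city"].count())  # 4
```
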
Another way to filter is the query function.
The code and the filtered result are below.
Appending the price field and the sum function sums the filtered price values, again the equivalent of Excel's sumifs.
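A sketch of query and the follow-up sum, on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for df_inner (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

# query takes the filter as an expression string.
result = df_inner.query('city == ["beijing", "shanghai"]')
print(result)                 # 4 rows match
print(result["price"].sum())  # sum over the filtered rows, like sumifs
```
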