Preface
Excel is the most commonly used tool in data analysis. By comparing Python with Excel feature by feature, this text shows how to use Python's functions to carry out the data processing and analysis work normally done in Excel.
In Python, the pandas library handles data processing. From the 1,787-page pandas documentation we have distilled the 36 most commonly used functions, and through them we show how to complete the most common operations in Python: generating and importing data, cleaning and preprocessing it, and classifying, filtering, summarizing, and pivoting it.
Chapter 1: Generating the Data Table
Excel
There are two common ways to generate a data table: importing external data, or entering the data directly.
Excel's File menu provides a get-external-data feature that supports importing from many sources, including databases, text files, and web pages.
Python
Python supports importing data from many kinds of sources.
Before importing data with Python, first import the pandas library; for convenience we also import numpy.
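A minimal sketch of the two imports, using the conventional aliases pd and np:

```python
import numpy as np   # numerical helpers (e.g. np.nan for missing values)
import pandas as pd  # the data-processing library used throughout
```
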
1. Importing a data table
Below are the methods for importing data from Excel and csv files and creating a data table.
The code shows the simplest form of each call; there are many optional parameters, such as column names, the index column, and data formats.
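A sketch of the two import calls. The file names are hypothetical placeholders, not files supplied with the text, so the runnable part below parses CSV text from an in-memory buffer instead:

```python
import io

import pandas as pd

# With real files the calls would look like this (paths are hypothetical):
# df = pd.read_excel('result.xlsx')
# df = pd.read_csv('result.csv')

# Self-contained demo: read_csv accepts any object with a read() method.
csv_text = "id,city,price\n1001,beijing,1200\n1002,shanghai,3299\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```
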
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
Read CSV (comma-separated) file into DataFrame
Also supports optionally iterating or breaking of the file
into chunks.
Additional help can be found in the `online docs for IO Tools
<http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
Parameters
----------
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file could
be file://localhost/path/to/table.csv
sep : str, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Separators longer than 1 character and different from ``'\s+'`` will
be interpreted as regular expressions, will force use of the python parsing
engine and will ignore quotes in the data. Regex example: ``'\r\t'``
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
used as the sep. Equivalent to setting ``sep='\s+'``. If this option
is set to True, nothing should be passed in for the ``delimiter``
parameter.
.. versionadded:: 0.18.1 support for the Python parser.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data.
Default behavior is as if set to 0 if no ``names`` passed, otherwise
``None``. Explicitly pass ``header=0`` to be able to replace existing
names. The header can be a list of integers that specify row locations for
a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
specified will be skipped (e.g. 2 in this example is skipped). Note that
this parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you
should explicitly pass header=None. Duplicates in this list are not
allowed unless mangle_dupe_cols=True, which is the default.
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
usecols : array-like, default None
Return a subset of the columns. All elements in this array must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.
as_recarray : boolean, default False
DEPRECATED: this argument will be removed in a future version. Please call
`pd.read_csv(...).to_records()` instead.
Return a NumPy recarray instead of a DataFrame after parsing the data.
If set to True, this option takes precedence over the `squeeze` parameter.
In addition, as row indices are not available in such a format, the
`index_col` parameter will be ignored.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than
'X'...'X'. Passing in False will cause data to be overwritten if there
are duplicate names in the columns.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python'). Use `str` or `object` to preserve and
not interpret dtype.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
true_values : list, default None
Values to consider as True
false_values : list, default None
Values to consider as False
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
skip_footer : int, default 0
DEPRECATED: use the `skipfooter` parameter instead, as they are identical
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted as
NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
'foo'
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format
of the datetime strings in the columns, and if it can be inferred, switch
to a faster method of parsing them. In some cases this can increase the
parsing speed by ~5-10x.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses ``dateutil.parser.parser`` to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array
and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with
``get_chunk()``.
chunksize : int, default None
Return TextFileReader object for iteration. `See IO Tools docs for more
information
<http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on
``iterator`` and ``chunksize``.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
'.zip', or '.xz', respectively, and no decompression otherwise. If using
'zip', the ZIP file must contain only one data file to be read in.
Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
thousands : str, default None
Thousands separator
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
float_precision : string, default None
Specifies which converter the C engine should use for floating-point
values. The options are `None` for the ordinary converter,
`high` for the high-precision converter, and `round_trip` for the
round-trip converter.
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : boolean, default ``True``
When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
whether or not to interpret two consecutive quotechar elements INSIDE a
field as a single ``quotechar`` element.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header` but not by
`skiprows`. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'
with `header=0` will result in 'a,b,c' being
treated as the header.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless,
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser)
buffer_lines : int, default None
DEPRECATED: this argument will be removed in a future version because its
value is not respected by the parser
compact_ints : boolean, default False
DEPRECATED: this argument will be removed in a future version
If compact_ints is True, then for any column that is of integer dtype,
the parser will attempt to cast it as the smallest integer dtype possible,
either signed or unsigned depending on the specification from the
`use_unsigned` parameter.
use_unsigned : boolean, default False
DEPRECATED: this argument will be removed in a future version
If integer columns are being compacted (i.e. `compact_ints=True`), specify
whether the column should be compacted to the smallest signed or unsigned
integer dtype.
memory_map : boolean, default False
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
Returns
-------
result : DataFrame or TextParser
2. Creating a data table
The other approach is to generate a table by entering the data directly. In Excel you simply type the values into cells; in Python it is done with the code below.
The function that generates the table is pandas' DataFrame constructor. The table has 6 rows of data, and each row has 6 fields. We deliberately planted some NA values and problematic fields in the data, for example values containing spaces.
These will be dealt with later, in the data-cleaning step.
From here on we will call the DataFrame by the customary short name df.
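A sketch of the constructor call, reconstructed to match the description (6 rows and 6 fields, NA values planted in price, extra spaces and mixed case planted in city); the exact values are assumptions chosen to be consistent with the results quoted in later chapters:

```python
import numpy as np
import pandas as pd

# Sample table: the specific ids, dates, and prices are assumed values.
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})
print(df)
```
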
This is the table we just created. We did not set an index column; the price field contains NA values, and the city field contains some dirty data.
Chapter 2: Inspecting the Data Table
This chapter covers how to inspect a data table.
The data handled in Python is usually fairly large: New York's taxi trips and Citibike's ride data, for example, both run to tens of millions of rows. A table that size cannot be taken in at a glance, so we need methods that surface its key information.
The other goal of inspection is to understand the data's overall profile, such as the table's size, the space it occupies, the data types, whether there are null values, and the actual contents, in preparation for the cleaning and preprocessing that follow.
1. Data dimensions (rows and columns)
In Excel, Ctrl+Down Arrow and Ctrl+Right Arrow jump to the last row number and the last column number.
In Python, the shape attribute reports the table's dimensions, that is, its row and column counts; the returned result (6, 6) means the table has 6 rows and 6 columns.
The code is below.
2. Data table information
The info function shows the table's overall information. The output is fairly rich, including the table's dimensions, column names, data types, and memory usage.
3. Viewing data types
In Excel, you check a value's format by selecting the cell and looking at the number-format box on the Home tab.
In Python, the dtypes attribute returns the data types.
dtypes reports the type of every column in the table at once, or of a single specified column.
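The three checks above (dimensions, table information, data types) sketched against the sample table, which is rebuilt here with assumed values so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.shape)           # dimensions: (6, 6)
df.info()                 # dimensions, column names, dtypes, memory usage
print(df.dtypes)          # dtype of every column at once
print(df["price"].dtype)  # dtype of a single column: float64 (because of the NAs)
```
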
4. Checking for null values
In Excel, nulls are found with the 'Go To Special' feature, which locates the empty cells in a table.
'Go To Special' lives under 'Find & Select' on the Home tab.
isnull is Python's null check. It returns booleans: True where a value is null, False where it is not.
The check can be run on the whole table or on a single column.
5. Viewing unique values
In Excel, unique values are inspected by marking them with a color via 'Conditional Formatting'.
In Python, the unique function shows them.
unique reports the distinct values and works on one specific column of the table at a time.
The code is below; the result is the unique values of that column.
It is similar to the result of Excel's Remove Duplicates.
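Sketches of the null check and the unique-value check, on the sample table rebuilt with assumed values:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.isnull())           # whole-table check: True marks a null
print(df["price"].isnull())  # single-column check: two True entries
print(df["city"].unique())   # distinct values of one column
```
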
6. Viewing the table's values
Python's values attribute shows the values held in the table,
returned as an array without the header information.
7. Viewing the column names
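Sketches of both inspections, on the sample table rebuilt with assumed values:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.values)   # the data as a numpy array, headers excluded
print(df.columns)  # the column names
```
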
8. Viewing the first N rows of data
The head() function shows the first N rows of the table. By default head() shows the first 5 rows; pass a number to choose how many rows to view.
The code below views the first 3 rows.
9. Viewing the last N rows of data
The tail function is the counterpart of head and shows the last N rows of the table. By default tail() shows the last 5 rows, and again a number can be passed to choose how many.
The code below views the last 3 rows.
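The two previews sketched together, on the sample table rebuilt with assumed values:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.head(3))  # first 3 rows
print(df.tail(3))  # last 3 rows
```
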
Chapter 3: Cleaning the Data Table
This chapter covers cleaning up the problems in a data table, mainly the handling of null values, letter case, data types, and duplicate values.
It does not cover logical validation between fields.
1. Handling null values (drop or fill)
When we created the table, we deliberately placed several NA values in the price field.
There are many ways to handle nulls: delete the rows that contain them, or fill them, for example with 0 or with the mean. Nulls can also be inferred from the logic of other fields.
In Excel, nulls can be handled with 'Find and Replace', replacing them uniformly with 0 or the mean, or by locating them first with 'Go To Special'.
Python is more flexible: the dropna function deletes the rows that contain nulls, and the fillna function fills them. In the code and results below you can see that after dropna, the two rows containing NA values are gone; what comes back is a table without nulls.
Alternatively, nulls can be filled with a number; the code below uses fillna to fill the null fields with 0.
Here we choose filling: using the same fillna function, we first compute the current mean of the price column with the mean function, then fill the NAs with it. The two null slots now show 3299.5.
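The three treatments sketched together, on the sample table rebuilt with assumed values (note the mean of the four known prices is 3299.5, matching the result quoted above):

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

print(df.dropna(how="any"))  # drop rows containing any null: 4 rows remain
print(df.fillna(value=0))    # fill the null slots with 0
df["price"] = df["price"].fillna(df["price"].mean())  # fill with the column mean
print(df["price"])           # the two filled slots show 3299.5
```
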
2. Cleaning up spaces
Besides null values, stray spaces inside fields are another common problem in data cleaning; below is the code that removes the spaces from the strings.
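A sketch of the space removal, on the sample table rebuilt with assumed values:

```python
import pandas as pd

# Rebuild just the dirty city column from the sample table (assumed values).
df = pd.DataFrame({"city": ["Beijing ", "SH", " guangzhou ",
                            "Shenzhen", "shanghai", "BEIJING "]})

df["city"] = df["city"].map(str.strip)  # strip leading/trailing spaces
print(df["city"])
```
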
3. Case conversion
In English fields, inconsistent letter case is another common problem.
Excel has UPPER, LOWER, and similar functions, and Python has functions of the same names for fixing case. Our table's city column has exactly this problem.
We convert all the letters in the city column to lowercase; the code and result are below.
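A sketch of the conversion, continuing from the space-stripping step (sample values are assumptions):

```python
import pandas as pd

# Rebuild the city column (assumed values).
df = pd.DataFrame({"city": ["Beijing ", "SH", " guangzhou ",
                            "Shenzhen", "shanghai", "BEIJING "]})

df["city"] = df["city"].map(str.strip)  # spaces removed in the previous step
df["city"] = df["city"].str.lower()     # lowercase every letter
print(df["city"])
```
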
4. Changing the data type
In Excel, data types are changed through 'Format Cells'.
In Python, the astype function changes a data type.
dtype inspects a column's current type; its counterpart astype converts it. The code below converts the price field to int.
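A sketch of the conversion (assumed sample values; the nulls must be filled first, since NaN cannot be stored in an int column, and astype truncates the fractional part):

```python
import numpy as np
import pandas as pd

# Rebuild the price column with its two nulls (assumed values).
df = pd.DataFrame({"price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

df["price"] = df["price"].fillna(df["price"].mean())  # fill nulls before converting
df["price"] = df["price"].astype("int")               # 3299.5 truncates to 3299
print(df["price"].dtype)
```
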
5. Renaming columns
rename is the column-renaming function; we rename the table's category column to category-size.
The code and the result of the change are below.
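A sketch of the rename (sample values are assumptions):

```python
import pandas as pd

# Rebuild the category column (assumed values).
df = pd.DataFrame({"category": ["100-A", "100-B", "110-A",
                                "110-C", "210-A", "130-F"]})

df = df.rename(columns={"category": "category-size"})
print(df.columns)
```
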
6. Removing duplicate values
Many tables also contain duplicates. Excel's Data tab provides 'Remove Duplicates' for deleting duplicated values from a table.
By default Excel keeps the first occurrence of a value and deletes the later duplicates.
In Python, the drop_duplicates function removes duplicates.
We take the table's city column as the example; the city field contains duplicate values.
By default drop_duplicates() deletes the later occurrences (the same logic as Excel); with the keep='last' argument it instead deletes the earlier occurrences and keeps the last one.
The code and a comparison of the two results are below.
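A sketch of both variants, on the cleaned city column (assumed values; beijing appears first and last):

```python
import pandas as pd

# Cleaned city column after stripping and lowercasing (assumed values).
city = pd.Series(["beijing", "sh", "guangzhou",
                  "shenzhen", "shanghai", "beijing"])

print(city.drop_duplicates())             # default: keep the first beijing
print(city.drop_duplicates(keep="last"))  # keep the last beijing instead
```
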
With the default drop_duplicates(), the result shows that the first beijing is kept and the last beijing is deleted.
With keep='last', the result is the reverse of the previous one: the first beijing is deleted and the last one is kept.
7. Modifying and replacing values
The last cleaning problem is modifying or replacing values; in Excel this is done with 'Find and Replace'.
In Python, the replace function performs the replacement.
In the table's city field, Shanghai appears in two spellings, shanghai and SH.
We use replace to replace the SH values.
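A sketch of the replacement, continuing from the earlier cleaning steps (assumed values; after lowercasing, SH appears as sh):

```python
import pandas as pd

# Cleaned city column after stripping and lowercasing (assumed values).
df = pd.DataFrame({"city": ["beijing", "sh", "guangzhou",
                            "shenzhen", "shanghai", "beijing"]})

df["city"] = df["city"].replace("sh", "shanghai")
print(df["city"])
```
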
Chapter 4: Preprocessing the Data
This chapter covers preprocessing: organizing the cleaned data in preparation for the statistics and analysis that follow.
That mainly means merging tables, sorting, splitting a column into several, and grouping and flagging data.
1. Merging data tables
First we merge different tables: here we create a new table df1 and merge it with df.
Excel has no feature that merges tables directly in one step; it can be done stepwise with the vlookup function.
In Python, the merge function does it in a single call.
Below we build the df1 table that will be merged with df.
We merge the two tables with the merge function using the inner method, matching the data the two tables have in common into a new table named df_inner.
Besides inner, the merge methods also include left, right, and outer. The differences between them are explained and compared in detail in my other articles.
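A sketch of df1 and the inner merge. df1's fields and values are assumptions; the merge matches on the shared id column, and two of df1's ids (1007, 1008) have no partner in df, so an inner join drops them:

```python
import numpy as np
import pandas as pd

# Rebuild the Chapter 1 sample table (values are assumptions).
df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

# df1 shares the id key and adds a few user attributes (assumed values).
df1 = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
     "gender": ["male", "female", "male", "female",
                "male", "female", "male", "female"],
     "pay": ["Y", "N", "Y", "Y", "N", "Y", "N", "Y"],
     "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

# inner keeps only the ids present in both tables.
df_inner = pd.merge(df, df1, how="inner")
print(df_inner.shape)  # (6, 9)
```
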
2. Setting an index column
With the merge complete, we set an index column on df_inner. An index has many uses: extracting and summarizing data, and also filtering it.
The function that sets the index is set_index.
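A sketch of set_index, on a small stand-in for the merged df_inner table (values are assumptions):

```python
import pandas as pd

# Stand-in for the merged, cleaned df_inner table (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

df_inner = df_inner.set_index("id")  # id becomes the row label
print(df_inner.index)
```
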
3. Sorting (by index, by value)
In Excel, a table is sorted directly with the sort buttons on the Data tab, which is fairly simple.
In Python, sorting is done with the sort_values and sort_index functions.
A table can be sorted either by its index or by the values of a specified column.
First we sort the table by the users' ages in the age column.
The function used is sort_values.
The sort_index function sorts the table by the values of its index column.
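Both sorts sketched on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for the merged, cleaned df_inner table (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

by_age = df_inner.sort_values(by=["age"])                        # sort by a column's values
by_index = df_inner.set_index("id").sort_index(ascending=False)  # sort by the index
print(by_age)
print(by_index)
```
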
4. Grouping data
In Excel, values are grouped with an approximate-match vlookup or with a PivotTable.
In Python, the corresponding tool is the where function.
The where function evaluates a condition and groups the data by it: in the code below we test the values of the price column, put the rows that meet the condition in one group and the rest in another, and mark the groups in a group field.
Beyond a single where condition, data can also be grouped by testing several fields at once; the code below flags with 1 the rows where city equals beijing and price is greater than or equal to 4000.
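Both groupings sketched on a stand-in for df_inner (values and the 3000 threshold for the group column are assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in for the merged, cleaned df_inner table (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

# Single condition: label each row high or low by its price.
df_inner["group"] = np.where(df_inner["price"] > 3000, "high", "low")

# Multiple conditions: flag beijing rows with price >= 4000.
df_inner.loc[(df_inner["city"] == "beijing")
             & (df_inner["price"] >= 4000), "sign"] = 1
print(df_inner)
```
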
5. Splitting a column
The opposite of grouping is splitting a value into columns; Excel provides 'Text to Columns' under its Data tab.
In Python, the split function performs the splitting.
The category column carries two pieces of information: the number in front is the category id and the letter behind is the size value, joined by a hyphen. We use split to break this field apart and match the resulting columns back into the original table.
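A sketch of the split and the merge back (sample values are assumptions; the new columns are named category_id and size here to avoid clashing with the original category column):

```python
import pandas as pd

# Stand-in table with the combined category field (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"]})

# Split each value at the hyphen into two new columns.
split = pd.DataFrame([x.split("-") for x in df_inner["category"]],
                     index=df_inner.index,
                     columns=["category_id", "size"])

# Match the split columns back into the original table by row index.
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
print(df_inner)
```
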
Chapter 5: Extracting Data
Data extraction is one of the most common tasks in data analysis.
This part mainly uses three functions: loc, iloc, and ix.
loc extracts by label value;
iloc extracts by position;
ix extracts by label and position at once.
The usage of each is introduced below.
1. Extracting by label: the loc function
loc extracts by the table's index labels; the code below extracts the single row whose index is 3.
A colon bounds the range of the extraction: the value before the colon is the starting label, the value after it is the ending label.
Below we extract the rows labeled 0 through 5.
The reset_index function resets the index; here we set the dates in the date field as the table's new index and extract data by date.
With a colon bounding the range, leaving the part before the colon empty means starting from the beginning.
This extracts all the data up to January 4, 2013.
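The label-based extractions sketched on a stand-in for df_inner (values and dates are assumptions):

```python
import pandas as pd

# Stand-in for df_inner with a default 0..5 index (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

print(df_inner.loc[3])    # the single row whose index label is 3
print(df_inner.loc[0:5])  # label range: rows 0 through 5, end inclusive

df_date = df_inner.set_index("date")  # the dates become the index
print(df_date[:"2013-01-04"])         # everything up to Jan 4, 2013
```
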
2. Extracting by position: the iloc function
iloc extracts data from the table by position: the numbers around the colon are no longer index labels but the positions of the data, counted from 0.
Besides extracting a region, iloc can also pick out individual entries by position: 0, 2, 5 in the first brackets are the positions of the rows, and the numbers in the second brackets are the positions of the columns.
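Both iloc forms sketched on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for df_inner; column positions 4 and 5 are age and price (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

print(df_inner.iloc[:3, :2])             # region: first 3 rows, first 2 columns
print(df_inner.iloc[[0, 2, 5], [4, 5]])  # individual rows 0, 2, 5; columns 4 and 5
```
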
3. Extracting by label and position: the ix function
ix is a hybrid of loc and iloc that can extract both by index label and by position.
In the code below the rows are selected by the date index and the columns by position.
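A sketch of the same extraction. Note that ix, current when this text was written, was later deprecated (pandas 0.20) and removed (pandas 1.0); the modern equivalent chains loc for the label part with iloc for the position part:

```python
import pandas as pd

# Stand-in for df_inner indexed by date (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "date": pd.date_range("20130102", periods=6),
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})
df_date = df_inner.set_index("date")

# Legacy form (pandas < 1.0): df_date.ix[:'2013-01-03', :4]
# Modern equivalent: rows by date label, then the first 4 columns by position.
subset = df_date.loc[:"2013-01-03"].iloc[:, :4]
print(subset)
```
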
4.按條件提?。▍^(qū)域和條件值)
除了按標(biāo)簽和位置提取數(shù)據(jù)以外,,還可以按具體的條件取數(shù)。
下面使用loc和isin兩個(gè)函數(shù)配合使用,,按指定條件對(duì)數(shù)據(jù)進(jìn)行提取,。
使用isin函數(shù)對(duì)city中的值是否為beijing進(jìn)行判斷,。
將isin函數(shù)嵌套到loc的數(shù)據(jù)提取函數(shù)中,將判斷結(jié)果為True數(shù)據(jù)提取出來(lái),。
這里我們把判斷條件改為city值是否為beijing和shanghai,,如果是,,就把這條數(shù)據(jù)提取出來(lái),。
數(shù)據(jù)提取還可以完成類(lèi)似數(shù)據(jù)分列的工作,,從合并的數(shù)值中提取出指定的數(shù)值。
#提取類(lèi)別的字段的前3個(gè)字符
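The conditional extractions sketched on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for df_inner (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

print(df_inner.loc[df_inner["city"].isin(["beijing"])])  # rows where city is beijing
both = df_inner.loc[df_inner["city"].isin(["beijing", "shanghai"])]
print(both)  # 4 rows match either city

# Extract the first 3 characters of the category field.
first3 = pd.DataFrame(df_inner["category"].str[:3])
print(first3)
```
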
Chapter 6: Filtering Data
This chapter uses the three conditions AND, OR, and NOT, combined with greater-than, less-than, and equals, to filter data and then count and sum it.
This parallels Excel's filter feature and its countifs and sumifs functions.
1. Filtering by condition (AND, OR, NOT)
Excel's Data tab provides a 'Filter' feature for filtering a table on different conditions.
In Python, the loc function combined with filter conditions does the filtering.
Together with the sum and count functions it can also reproduce Excel's sumif and countif.
Filtering on an AND condition: age greater than 25 and city equal to beijing. Only one row survives the filter.
Filtering on an OR condition: age greater than 25, or city equal to beijing.
Appending the price field and the sum function to the previous code sums the price values of the filtered rows, the equivalent of Excel's sumifs.
Filtering on a NOT condition: city not equal to beijing. Four rows qualify; the result is sorted by the id column.
Appending the city column and the count function counts the filtered rows, the equivalent of Excel's countifs.
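The three filters sketched on a stand-in for df_inner (values are assumptions, chosen so that exactly one row passes the AND filter and four pass the NOT filter, as described above):

```python
import pandas as pd

# Stand-in for df_inner (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "age": [23, 44, 54, 32, 34, 32],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

# AND: age > 25 and city == beijing -> one row.
and_rows = df_inner.loc[(df_inner["age"] > 25) & (df_inner["city"] == "beijing"),
                        ["id", "city", "age", "price"]]
print(and_rows)

# OR: age > 25 or city == beijing, then sum price over the result (like sumifs).
or_rows = df_inner.loc[(df_inner["age"] > 25) | (df_inner["city"] == "beijing"),
                       ["id", "city", "age", "price"]]
print(or_rows["price"].sum())

# NOT: city != beijing, sorted by id; count the rows (like countifs).
not_rows = df_inner.loc[df_inner["city"] != "beijing",
                        ["id", "city", "age"]].sort_values(["id"])
print(not_rows["city"].count())  # 4
```
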
Another way to filter is the query function.
The code and the filtered result are below.
Appending the price field and the sum function sums the filtered price values, again the equivalent of Excel's sumifs.
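A sketch of query and the follow-up sum, on a stand-in for df_inner (values are assumptions):

```python
import pandas as pd

# Stand-in for df_inner (assumed values).
df_inner = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200.0, 3299.5, 2133.0, 5433.0, 3299.5, 4432.0]})

# query takes the filter as an expression string.
result = df_inner.query('city == ["beijing", "shanghai"]')
print(result)                 # 4 rows match
print(result["price"].sum())  # sum over the filtered rows, like sumifs
```
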