Pandas 笔记：数据的处理

基础用法
- 查看行列标签
  - 查看行索引
  - 查看列名
- 选择行或列
  - 选择多行
  - 选择多列
高级用法
注意事项

基础 Dataframe

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

	age	color	food	height	score	state
Jane	30	blue	Steak	165	4.6	NY
Nick	2	green	Lamb	70	8.3	TX
Aaron	12	red	Mango	120	9.0	FL
Penelope	4	white	Apple	80	3.3	AL
Dean	32	gray	Cheese	180	1.8	AK
Christina	33	black	Melon	172	9.5	TX
Cornelia	69	red	Beans	150	2.2	TX

以下例子均基于以上 dataframe 展开。为避免修改源 Dataframe，先进行以上 dataframe 的复制。

df1 = df.copy()

基础用法

查看行列标签

查看行索引

df1.index

查看列名

df1.columns

选择行或列

详见 Pandas 笔记：Dataframe 中的数据选择

注意，列表选择的是列，序列选择的是行！

选择多行

按行号取（不含结束点！）

df1[0: 2]

按行索引取（含结束点！）

df1['Jane': 'Nick']

	age	color	food	height	score	state
Jane	30	blue	Steak	5’5”	4.6	NY
Nick	2	green	Lamb	2’4”	8.3	TX

当然也可（建议使用）

df1.iloc[0: 2, :]

df1.loc['Jane': 'Nick', :]

选择多列

注意有两层中括号

df1[['color', 'food']]

	color	food
Jane	blue	Steak
Nick	green	Lamb
Aaron	red	Mango
Penelope	white	Apple
Dean	gray	Cheese
Christina	black	Melon
Cornelia	red	Beans

当然也可（建议使用）

df1.loc[:, ['color', 'food']]

高级用法

布尔选择行

单条件

df1.loc[:, df1.loc['age'] > 10]

	age	color	food	height	score	state
Jane	30	blue	Steak	165	4.6	NY
Aaron	12	red	Mango	120	9.0	FL
Dean	32	gray	Cheese	180	1.8	AK
Christina	33	black	Melon	172	9.5	TX
Cornelia	69	red	Beans	150	2.2	TX

条件：是否在某集合中

DataFrame.isin() 和 Series.within()。

集合可以为列表(list)，集合(set)，数据系列(Series)，以及数组(Numpy array)。

见 python - How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL - Stack Overflow

多条件

注意每个条件加括号

df1.loc[(df1['age'] > 10) & (df1['food'].isin({'Steak', 'Lamb', 'Mango'})), :]

或者

df1.query("age > 10 and food in ['Steak', 'Lamb', 'Mango']")

	age	color	food	height	score	state
Jane	30	blue	Steak	165	4.6	NY
Aaron	12	red	Mango	120	9.0	FL

布尔选择元素

按条件过滤元素

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})

	A	B	C
0	x	w	0
1	y	a	1
2	z	NaN	2
3	q	x	3

c1 = ['x', 'w', 'p']
df2[df2[['A', 'B']].isin(c1)]

	A	B	C
0	x	w	NaN
1	NaN	NaN	NaN
2	NaN	NaN	NaN
3	NaN	x	NaN

按任一列满足即可的条件选择行

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

	A	B	C
0	x	w	0
3	q	x	3

按所有列满足才可的条件选择行

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

	A	B	C
0	x	w	0

还有基于 Numpy 和基于 list comprehension 的更多筛选方法，见 python - How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL - Stack Overflow

批量更改列

字典映射法

不在字典中的映射为 NaN

dic = {'NY':'New York', 'TX':'Texas', 'FL':'Florida', 'AL':'Alabama'}
df1['state'].map(dic)
df1['state'] = df1['state'].map(dic)

	age	color	food	height	score	state
Jane	30	blue	Steak	165	4.6	New York
Nick	2	green	Lamb	70	8.3	Texas
Aaron	12	red	Mango	120	9.0	Florida
Penelope	4	white	Apple	80	3.3	Alabama
Dean	32	gray	Cheese	180	1.8	NaN
Christina	33	black	Melon	172	9.5	Texas
Cornelia	69	red	Beans	150	2.2	Texas

若要让不在字典中的保留原值

df1['state']=df1['state'].map(dic).fillna(df['state'])

或者

df1['state'].replace(dic, inplace=True)

（来源：Stackoverflow）

函数法

其他参数放在元组里用 args 传入

def cm_to_feet_inches(size, int_inches=False):
    feet = int(size // 30.48)
    inches = size / 2.54 - feet * 12
    if int_inches:
        return "%d\'%d\"" % (feet, round(inches))
    else:
        return "%d\'%.2f\"" % (feet, inches)


df1['height'] = df1['height'].apply(cm_to_feet_inches, args=(True,))

	age	color	food	height	score	state
Jane	30	blue	Steak	5’5”	4.6	NY
Nick	2	green	Lamb	2’4”	8.3	TX
Aaron	12	red	Mango	3’11”	9.0	FL
Penelope	4	white	Apple	2’7”	3.3	AL
Dean	32	gray	Cheese	5’11”	1.8	AK
Christina	33	black	Melon	5’8”	9.5	TX
Cornelia	69	red	Beans	4’11”	2.2	TX

也可使用 lambda 方法

df1['height'] = df1['height'].apply(lambda x : cm_to_feet_inches(x, True))

遍历行

数据量大时性能较低，不建议使用

for index, row in df1.iterrows():
    print(index, row['age'], row['color'], sep=', ')

Jane, 30, blue
Nick, 2, green
Aaron, 12, red
Penelope, 4, white
Dean, 32, gray
Christina, 33, black
Cornelia, 69, red

倒序遍历

for index, row in df1.iloc[::-1].iterrows():
    print(index, row['age'], row['color'], sep=', ')

Cornelia, 69, red
Christina, 33, black
Dean, 32, gray
Penelope, 4, white
Aaron, 12, red
Nick, 2, green
Jane, 30, blue

注意事项

df.index() 返回的不是行数（从0开始那个），而是 index 内容的列表