Pandas 笔记:数据的处理

Pandas Note: Data Manipulation in Pandas

Posted by J Leaves on February 20, 2020

基础 Dataframe

1
2
3
4
5
6
7
8
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
  age color food height score state
Jane 30 blue Steak 165 4.6 NY
Nick 2 green Lamb 70 8.3 TX
Aaron 12 red Mango 120 9.0 FL
Penelope 4 white Apple 80 3.3 AL
Dean 32 gray Cheese 180 1.8 AK
Christina 33 black Melon 172 9.5 TX
Cornelia 69 red Beans 150 2.2 TX

以下例子均基于以上 dataframe 展开。为避免修改源 Dataframe,先进行以上 dataframe 的复制。

1
df1 = df.copy()

基础用法

查看行列标签

查看行索引
1
df1.index
查看列名
1
df1.columns

选择行或列

详见 Pandas 笔记:Dataframe 中的数据选择

注意,列表选择的是列,序列选择的是行!

选择多行

按行号取(不含结束点!)

1
df1[0: 2]

按行索引取(含结束点!)

1
df1['Jane': 'Nick']
  age color food height score state
Jane 30 blue Steak 5’5” 4.6 NY
Nick 2 green Lamb 2’4” 8.3 TX

当然也可(建议使用)

1
df1.iloc[0: 2, :]
1
df1.loc['Jane': 'Nick', :]
选择多列

注意有两层中括号

1
df1[['color', 'food']]
  color food
Jane blue Steak
Nick green Lamb
Aaron red Mango
Penelope white Apple
Dean gray Cheese
Christina black Melon
Cornelia red Beans

当然也可(建议使用)

1
df1.loc[:, ['color', 'food']]

高级用法

布尔选择行

单条件
1
df1.loc[:, df1.loc['age'] > 10]
  age color food height score state
Jane 30 blue Steak 165 4.6 NY
Aaron 12 red Mango 120 9.0 FL
Dean 32 gray Cheese 180 1.8 AK
Christina 33 black Melon 172 9.5 TX
Cornelia 69 red Beans 150 2.2 TX
条件:是否在某集合中

DataFrame.isin()Series.within()

集合可以为列表(list),集合(set),数据系列(Series),以及数组(Numpy array)。

python - How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL - Stack Overflow

多条件

注意每个条件加括号

1
df1.loc[(df1['age'] > 10) & (df1['food'].isin({'Steak', 'Lamb', 'Mango'})), :]

或者

1
df1.query("age > 10 and food in ['Steak', 'Lamb', 'Mango']")
  age color food height score state
Jane 30 blue Steak 165 4.6 NY
Aaron 12 red Mango 120 9.0 FL

布尔选择元素

按条件过滤元素
1
2
df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
  A B C
0 x w 0
1 y a 1
2 z NaN 2
3 q x 3
1
2
c1 = ['x', 'w', 'p']
df2[df2[['A', 'B']].isin(c1)]
  A B C
0 x w NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN x NaN
按任一列满足即可的条件选择行
1
df2[df2[['A', 'B']].isin(c1).any(axis=1)]
  A B C
0 x w 0
3 q x 3
按所有列满足才可的条件选择行
1
df2[df2[['A', 'B']].isin(c1).all(axis=1)]
  A B C
0 x w 0

还有基于 Numpy 和基于 list comprehension 的更多筛选方法,见 python - How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL - Stack Overflow

批量更改列

字典映射法

不在字典中的映射为 NaN

1
2
3
dic = {'NY':'New York', 'TX':'Texas', 'FL':'Florida', 'AL':'Alabama'}
df1['state'].map(dic)
df1['state'] = df1['state'].map(dic)
  age color food height score state
Jane 30 blue Steak 165 4.6 New York
Nick 2 green Lamb 70 8.3 Texas
Aaron 12 red Mango 120 9.0 Florida
Penelope 4 white Apple 80 3.3 Alabama
Dean 32 gray Cheese 180 1.8 NaN
Christina 33 black Melon 172 9.5 Texas
Cornelia 69 red Beans 150 2.2 Texas

若要让不在字典中的保留原值

1
df1['state']=df1['state'].map(dic).fillna(df['state'])

或者

1
df1['state'].replace(dic, inplace=True)

(来源:Stackoverflow

函数法

其他参数放在元组里用 args 传入

1
2
3
4
5
6
7
8
9
10
def cm_to_feet_inches(size, int_inches=False):
    feet = int(size // 30.48)
    inches = size / 2.54 - feet * 12
    if int_inches:
        return "%d\'%d\"" % (feet, round(inches))
    else:
        return "%d\'%.2f\"" % (feet, inches)


df1['height'] = df1['height'].apply(cm_to_feet_inches, args=(True,))
  age color food height score state
Jane 30 blue Steak 5’5” 4.6 NY
Nick 2 green Lamb 2’4” 8.3 TX
Aaron 12 red Mango 3’11” 9.0 FL
Penelope 4 white Apple 2’7” 3.3 AL
Dean 32 gray Cheese 5’11” 1.8 AK
Christina 33 black Melon 5’8” 9.5 TX
Cornelia 69 red Beans 4’11” 2.2 TX

也可使用 lambda 方法

1
df1['height'] = df1['height'].apply(lambda x : cm_to_feet_inches(x, True))

遍历行

数据量大时性能较低,不建议使用

1
2
for index, row in df1.iterrows():
    print(index, row['age'], row['color'], sep=', ')
1
2
3
4
5
6
7
Jane, 30, blue
Nick, 2, green
Aaron, 12, red
Penelope, 4, white
Dean, 32, gray
Christina, 33, black
Cornelia, 69, red

倒序遍历

1
2
for index, row in df1.iloc[::-1].iterrows():
    print(index, row['age'], row['color'], sep=', ')
1
2
3
4
5
6
7
Cornelia, 69, red
Christina, 33, black
Dean, 32, gray
Penelope, 4, white
Aaron, 12, red
Nick, 2, green
Jane, 30, blue

注意事项

df.index() 返回的不是行数(从0开始那个),而是 index 内容的列表