| | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
| ---: | ---------: | -----: | -----: | ------: | --------: | -----: | ---: | ------: | -------: |
| 0 | 2018-01-01 | 3℃ | -6℃ | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 |
| 1 | 2018-01-02 | 2℃ | -5℃ | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 |
| 2 | 2018-01-03 | 2℃ | -5℃ | 多云 | 北风 | 1-2级 | 28 | 优 | 1 |
| 3 | 2018-01-04 | 0℃ | -8℃ | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 |
| 4 | 2018-01-05 | 3℃ | -6℃ | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 |

In [4]:

df.dtypes

Out[4]:

ymd          object
bWendu       object
yWendu       object
tianqi       object
fengxiang    object
fengli       object
aqi           int64
aqiInfo      object
aqiLevel      int64
dtype: object

1. Get the str attribute of a Series to use its string-processing methods

In [5]:

df["bWendu"].str

Out[5]:

<pandas.core.strings.StringMethods at 0x1af21871808>

In [6]:

# The string replace method
df["bWendu"].str.replace("℃", "")

Out[6]:

0       3
1       2
2       2
3       0
4       3
       ..
360    -5
361    -3
362    -3
363    -2
364    -2
Name: bWendu, Length: 365, dtype: object
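
After stripping the suffix, the dtype is still object (strings). A hedged follow-up, using the same df, converts the values to integers, just as the weather example in section 16 does later:

# Convert the cleaned strings to integers; the dtype becomes int32
df["bWendu"].str.replace("℃", "").astype("int32")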

In [7]:

# Check whether each value is numeric
df["bWendu"].str.isnumeric()

Out[7]:

0      False
1      False
2      False
3      False
4      False
       ...  
360    False
361    False
362    False
363    False
364    False
Name: bWendu, Length: 365, dtype: bool

In [8]:

df["aqi"].str.len()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-12cdcbdb6f81> in <module>
----> 1 df["aqi"].str.len()

d:\appdata\python37\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5173             or name in self._accessors
   5174         ):
-> 5175             return object.__getattribute__(self, name)
   5176         else:
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):

d:\appdata\python37\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    173             # we're accessing the attribute of the class, i.e., Dataset.geo
    174             return self._accessor
--> 175         accessor_obj = self._accessor(obj)
    176         # Replace the property with the accessor object. Inspired by:
    177         # http://www.pydanny.com/cached-property.html

d:\appdata\python37\lib\site-packages\pandas\core\strings.py in __init__(self, data)
   1915 
   1916     def __init__(self, data):
-> 1917         self._inferred_dtype = self._validate(data)
   1918         self._is_categorical = is_categorical_dtype(data)
   1919 

d:\appdata\python37\lib\site-packages\pandas\core\strings.py in _validate(data)
   1965 
   1966         if inferred_dtype not in allowed_types:
-> 1967             raise AttributeError("Can only use .str accessor with string " "values!")
   1968         return inferred_dtype
   1969 

AttributeError: Can only use .str accessor with string values!

2. Use str methods like startswith and contains to get a boolean Series for conditional filtering

In [9]:

condition = df["ymd"].str.startswith("2018-03")

In [10]:

condition

Out[10]:

0      False
1      False
2      False
3      False
4      False
       ...  
360    False
361    False
362    False
363    False
364    False
Name: ymd, Length: 365, dtype: bool

In [11]:

df[condition].head()

Out[11]:

| | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
|---|---|---|---|---|---|---|---|---|---|
| 59 | 2018-03-01 | 8℃ | -3℃ | 多云 | 西南风 | 1-2级 | 46 | 优 | 1 |
| 60 | 2018-03-02 | 9℃ | -1℃ | 晴~多云 | 北风 | 1-2级 | 95 | 良 | 2 |
| 61 | 2018-03-03 | 13℃ | 3℃ | 多云~阴 | 北风 | 1-2级 | 214 | 重度污染 | 5 |
| 62 | 2018-03-04 | 7℃ | -2℃ | 阴~多云 | 东南风 | 1-2级 | 144 | 轻度污染 | 3 |
| 63 | 2018-03-05 | 8℃ | -3℃ | 晴 | 南风 | 1-2级 | 94 | 良 | 2 |

3. Chained str operations

How do we extract a numeric month string like 201803?
1. First replace the date 2018-03-31 with the form 20180331
2. Then slice out the month string 201803

In [12]:

df["ymd"].str.replace("-", "")

Out[12]:

0      20180101
1      20180102
2      20180103
3      20180104
4      20180105
         ...   
360    20181227
361    20181228
362    20181229
363    20181230
364    20181231
Name: ymd, Length: 365, dtype: object

In [13]:

# Each str method call returns a new, plain Series
df["ymd"].str.replace("-", "").slice(0, 6)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-ae278fb12255> in <module>
      1 # Each str method call returns a new, plain Series
----> 2 df["ymd"].str.replace("-", "").slice(0, 6)

d:\appdata\python37\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'slice'

In [14]:

df["ymd"].str.replace("-", "").str.slice(0, 6)

Out[14]:

0      201801
1      201801
2      201801
3      201801
4      201801
        ...  
360    201812
361    201812
362    201812
363    201812
364    201812
Name: ymd, Length: 365, dtype: object

In [15]:

# Slicing syntax also works directly on .str
df["ymd"].str.replace("-", "").str[0:6]

Out[15]:

0      201801
1      201801
2      201801
3      201801
4      201801
        ...  
360    201812
361    201812
362    201812
363    201812
364    201812
Name: ymd, Length: 365, dtype: object

4. Processing with regular expressions

In [16]:

# Add a new column
def get_nianyueri(x):
    year,month,day = x["ymd"].split("-")
    return f"{year}年{month}月{day}日"
df["中文日期"] = df.apply(get_nianyueri, axis=1)

In [17]:

df["中文日期"]

Out[17]:

0      2018年01月01日
1      2018年01月02日
2      2018年01月03日
3      2018年01月04日
4      2018年01月05日
          ...     
360    2018年12月27日
361    2018年12月28日
362    2018年12月29日
363    2018年12月30日
364    2018年12月31日
Name: 中文日期, Length: 365, dtype: object

Question: how do we remove the three Chinese characters 年, 月, 日 from "2018年12月31日"?

In [18]:

# Method 1: chained replace calls
df["中文日期"].str.replace("年", "").str.replace("月","").str.replace("日", "")

Out[18]:

0      20180101
1      20180102
2      20180103
3      20180104
4      20180105
         ...   
360    20181227
361    20181228
362    20181229
363    20181230
364    20181231
Name: 中文日期, Length: 365, dtype: object

*Series.str.replace treats the pattern as a regular expression by default*

In [19]:

# Method 2: regex replacement
df["中文日期"].str.replace("[年月日]", "")

Out[19]:

0      20180101
1      20180102
2      20180103
3      20180104
4      20180105
         ...   
360    20181227
361    20181228
362    20181229
363    20181230
364    20181231
Name: 中文日期, Length: 365, dtype: object
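
One version caveat, not shown in the notebook above: the regex-by-default behavior applies to older pandas. Since pandas 2.0, str.replace() defaults to regex=False, so pattern matching must be enabled explicitly:

# pandas >= 2.0: opt in to regex matching explicitly
df["中文日期"].str.replace("[年月日]", "", regex=True)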

11. How to understand the axis parameter in Pandas?

  • axis=0 or "index":

    • for a single-row operation, it refers to a particular row
    • for an aggregation, it means across rows
  • axis=1 or "columns":

    • for a single-column operation, it refers to a particular column

    • for an aggregation, it means across columns

      *Whichever axis you specify is the axis that moves (as if iterated by a for loop); the other axis stays fixed*

In [1]:

import pandas as pd
import numpy as np

In [2]:

df = pd.DataFrame(
    np.arange(12).reshape(3,4),
    columns=['A', 'B', 'C', 'D']
)

In [3]:

df

Out[3]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |

1. drop on a single column deletes that column

In [4]:

# axis=1 means: delete a column
df.drop("A", axis=1)

Out[4]:

| | B | C | D |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 5 | 6 | 7 |
| 2 | 9 | 10 | 11 |

2. drop on a single row deletes that row

In [5]:

df

Out[5]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |

In [6]:

# axis=0 means: delete a row
df.drop(1, axis=0)

Out[6]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 2 | 8 | 9 | 10 | 11 |

3. mean aggregation along axis=0/index

Counterintuitive at first: the output is one result per column, not per row

In [7]:

df

Out[7]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |

In [8]:

# axis=0 or axis=index
df.mean(axis=0)

Out[8]:

A    4.0
B    5.0
C    6.0
D    7.0
dtype: float64

*Whichever axis you specify is the axis that moves (as if iterated by a for loop); the other axis stays fixed*

4. mean aggregation along axis=1/columns

Counterintuitive at first: the output is one result per row, not per column

In [9]:

df

Out[9]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |

In [10]:

# axis=1 or axis=columns
df.mean(axis=1)

Out[10]:

0    1.5
1    5.5
2    9.5
dtype: float64

*Whichever axis you specify is the axis that moves (as if iterated by a for loop); the other axis stays fixed*

5. One more example to deepen understanding

In [11]:

def get_sum_value(x):
    return x["A"] + x["B"] + x["C"] + x["D"]

df["sum_value"] = df.apply(get_sum_value, axis=1)

In [12]:

df

Out[12]:

| | A | B | C | D | sum_value |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 6 |
| 1 | 4 | 5 | 6 | 7 | 22 |
| 2 | 8 | 9 | 10 | 11 | 38 |

*Whichever axis you specify is the axis that moves (as if iterated by a for loop); the other axis stays fixed*

12. What is the Pandas index for?

Data stored in an ordinary column can be queried too, so what do you gain by using an index?

Summary of index benefits:

  1. more convenient data lookup;
  2. better lookup performance;
  3. automatic data alignment;
  4. more, and more powerful, data structures;

In [1]:

import pandas as pd

In [2]:

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")

In [3]:

df.head()

Out[3]:

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |

In [4]:

df.count()

Out[4]:

userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64

1. Query data via the index

In [5]:

# drop=False keeps the indexed column among the ordinary columns
df.set_index("userId", inplace=True, drop=False)

In [6]:

df.head()

Out[6]:

| userId (index) | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 1 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 1 | 1 | 6 | 4.0 | 964982224 |
| 1 | 1 | 47 | 5.0 | 964983815 |
| 1 | 1 | 50 | 5.0 | 964982931 |

In [7]:

df.index

Out[7]:

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userId', length=100836)

In [9]:

# Query using the index
df.loc[500].head(5)

Out[9]:

| userId (index) | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 500 | 500 | 1 | 4.0 | 1005527755 |
| 500 | 500 | 11 | 1.0 | 1005528017 |
| 500 | 500 | 39 | 1.0 | 1005527926 |
| 500 | 500 | 101 | 1.0 | 1005527980 |
| 500 | 500 | 104 | 4.0 | 1005528065 |

In [8]:

# Query using a column condition
df.loc[df["userId"] == 500].head()

Out[8]:

| userId (index) | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 500 | 500 | 1 | 4.0 | 1005527755 |
| 500 | 500 | 11 | 1.0 | 1005528017 |
| 500 | 500 | 39 | 1.0 | 1005527926 |
| 500 | 500 | 101 | 1.0 | 1005527980 |
| 500 | 500 | 104 | 4.0 | 1005528065 |

2. Using the index improves lookup performance

  • If the index is unique, Pandas uses a hash table: O(1) lookups (see the sketch below for this case);
  • If the index is not unique but is sorted, Pandas uses binary search: O(logN) lookups;
  • If the index is in completely random order, every lookup scans the whole table: O(N);
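
The two experiments below cover the random and sorted cases; for the unique case, a minimal sketch (assuming the same notebook session) might be:

# Hypothetical check of the unique-index case: reset to a RangeIndex,
# which is unique, so .loc can use a hash-based lookup
df_unique = df.reset_index(drop=True)
df_unique.index.is_unique   # True
%timeit df_unique.loc[500]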

Experiment 1: lookups with a completely random index order

In [10]:

# Shuffle the rows randomly
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

In [11]:

df_shuffle.head()

Out[11]:

| userId (index) | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 160 | 160 | 2340 | 1.0 | 985383314 |
| 129 | 129 | 1136 | 3.5 | 1167375403 |
| 167 | 167 | 44191 | 4.5 | 1154718915 |
| 536 | 536 | 276 | 3.0 | 832839990 |
| 67 | 67 | 5952 | 2.0 | 1501274082 |

In [12]:

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing

Out[12]:

False

In [13]:

df_shuffle.index.is_unique

Out[13]:

False

In [14]:

# Time the lookup of id == 500
%timeit df_shuffle.loc[500]
376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Experiment 2: lookups after sorting the index

In [15]:

df_sorted = df_shuffle.sort_index()

In [16]:

df_sorted.head()

Out[16]:

| userId (index) | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 1 | 1 | 2985 | 4.0 | 964983034 |
| 1 | 1 | 2617 | 2.0 | 964982588 |
| 1 | 1 | 3639 | 4.0 | 964982271 |
| 1 | 1 | 6 | 4.0 | 964982224 |
| 1 | 1 | 733 | 4.0 | 964982400 |

In [17]:

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

Out[17]:

True

In [18]:

df_sorted.index.is_unique

Out[18]:

False

In [19]:

%timeit df_sorted.loc[500]
203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

3. The index aligns data automatically

This applies to both Series and DataFrames

In [20]:

s1 = pd.Series([1,2,3], index=list("abc"))

In [21]:

s1

Out[21]:

a    1
b    2
c    3
dtype: int64

In [22]:

s2 = pd.Series([2,3,4], index=list("bcd"))

In [23]:

s2

Out[23]:

b    2
c    3
d    4
dtype: int64

In [24]:

s1+s2

Out[24]:

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64

4. The index enables more, and more powerful, data structures

*There are many powerful index types*

  • CategoricalIndex: an index over categorical data, for better performance;
  • MultiIndex: a multi-level index, e.g. for multi-key groupby results;
  • DatetimeIndex: a datetime index with powerful date/time methods;
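
As a small taste of the last item, a hedged sketch with made-up values:

import pandas as pd

# A Series indexed by dates gets a DatetimeIndex automatically
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-02-01"]))

# Partial string indexing: select everything in January 2018
s.loc["2018-01"]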

13. How Pandas merges DataFrames

Pandas merge is the equivalent of SQL join: it relates different tables by key into one table

merge syntax:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

  • left, right: the DataFrames (or named Series) to merge
  • how: the join type: 'left', 'right', 'outer', 'inner'
  • on: the join key; it must exist on both left and right
  • left_on: the key of the left df or series
  • right_on: the key of the right df or series
  • left_index, right_index: join on the index instead of an ordinary column
  • suffixes: a 2-tuple of suffixes; when non-key columns collide, the suffixes are appended automatically, default ('_x', '_y')

Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

Outline for this lesson:

  1. A join example on the movie datasets
  2. Understanding one-to-one, one-to-many, and many-to-many row alignment in merge
  3. Understanding the difference between left join, right join, inner join, and outer join
  4. What to do when non-key columns collide

1. A join example on the movie datasets

The movie ratings dataset

A well-known dataset for recommender-system research
Located in this repo at: ./datas/movielens-1m

It contains three files:

  1. users' movie ratings: ratings.dat
  2. user profiles: users.dat
  3. movie metadata: movies.dat

The three tables can be joined into one complete wide table

Official dataset page: https://grouplens.org/datasets/movielens/

In [1]:

import pandas as pd

In [2]:

df_ratings = pd.read_csv(
    "./datas/movielens-1m/ratings.dat", 
    sep="::",
    engine='python', 
    names="UserID::MovieID::Rating::Timestamp".split("::")
)

In [3]:

df_ratings.head()

Out[3]:

| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |

In [4]:

df_users = pd.read_csv(
    "./datas/movielens-1m/users.dat", 
    sep="::",
    engine='python', 
    names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)

In [5]:

df_users.head()

Out[5]:

| | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |

In [6]:

df_movies = pd.read_csv(
    "./datas/movielens-1m/movies.dat", 
    sep="::",
    engine='python', 
    names="MovieID::Title::Genres".split("::")
)

In [7]:

df_movies.head()

Out[7]:

| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |

In [8]:

df_ratings_users = pd.merge(
   df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)

In [9]:

df_ratings_users.head()

Out[9]:

| | UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 |
| 1 | 1 | 661 | 3 | 978302109 | F | 1 | 10 | 48067 |
| 2 | 1 | 914 | 3 | 978301968 | F | 1 | 10 | 48067 |
| 3 | 1 | 3408 | 4 | 978300275 | F | 1 | 10 | 48067 |
| 4 | 1 | 2355 | 5 | 978824291 | F | 1 | 10 | 48067 |

In [10]:

df_ratings_users_movies = pd.merge(
    df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
)

In [11]:

df_ratings_users_movies.head(10)

Out[11]:

| | UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | Zip-code | Title | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 2 | 1193 | 5 | 978298413 | M | 56 | 16 | 70072 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 2 | 12 | 1193 | 4 | 978220179 | M | 25 | 12 | 32793 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 3 | 15 | 1193 | 4 | 978199279 | M | 25 | 7 | 22903 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 4 | 17 | 1193 | 5 | 978158471 | M | 50 | 1 | 95350 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 5 | 18 | 1193 | 4 | 978156168 | F | 18 | 3 | 95825 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 6 | 19 | 1193 | 5 | 982730936 | M | 1 | 10 | 48073 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 7 | 24 | 1193 | 5 | 978136709 | F | 25 | 7 | 10023 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 8 | 28 | 1193 | 3 | 978125194 | F | 25 | 1 | 14607 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 9 | 33 | 1193 | 5 | 978557765 | M | 45 | 3 | 55421 | One Flew Over the Cuckoo's Nest (1975) | Drama |

2. Understanding row-count alignment in merge

Understand these relationships correctly (a validate= sketch follows the list):

  • one-to-one: the join key is unique on both sides
    • e.g. (sno, name) merge (sno, age)
    • result row count: 1*1
  • one-to-many: the key is unique on the left but not on the right
    • e.g. (sno, name) merge (sno, [Chinese score, math score, English score])
    • result row count: 1*N
  • many-to-many: the key is unique on neither side
    • e.g. (sno, [Chinese score, math score, English score]) merge (sno, [basketball, football, table tennis])
    • result row count: M*N
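
merge can also assert the relationship you expect via its validate parameter; a minimal sketch with made-up frames (pandas raises MergeError if the data violates the declared relationship):

import pandas as pd

students = pd.DataFrame({"sno": [11, 12], "name": ["name_a", "name_b"]})
grades = pd.DataFrame({"sno": [11, 11, 12], "grade": [88, 90, 66]})

# Declare the expected cardinality; pandas checks key uniqueness for us
pd.merge(students, grades, on="sno", validate="one_to_many")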

2.1 one-to-one merge

In [12]:

left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left

Out[12]:

| | sno | name |
|---|---|---|
| 0 | 11 | name_a |
| 1 | 12 | name_b |
| 2 | 13 | name_c |
| 3 | 14 | name_d |

In [13]:

right = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'age': ['21', '22', '23', '24']
                    })
right

Out[13]:

| | sno | age |
|---|---|---|
| 0 | 11 | 21 |
| 1 | 12 | 22 |
| 2 | 13 | 23 |
| 3 | 14 | 24 |

In [14]:

# one-to-one: the result has 4 rows
pd.merge(left, right, on='sno')

Out[14]:

| | sno | name | age |
|---|---|---|---|
| 0 | 11 | name_a | 21 |
| 1 | 12 | name_b | 22 |
| 2 | 13 | name_c | 23 |
| 3 | 14 | name_d | 24 |

2.2 one-to-many merge

Note: rows from the "one" side get copied to match

In [15]:

left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left

Out[15]:

| | sno | name |
|---|---|---|
| 0 | 11 | name_a |
| 1 | 12 | name_b |
| 2 | 13 | name_c |
| 3 | 14 | name_d |

In [16]:

right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right

Out[16]:

| | sno | grade |
|---|---|---|
| 0 | 11 | 语文88 |
| 1 | 11 | 数学90 |
| 2 | 11 | 英语75 |
| 3 | 12 | 语文66 |
| 4 | 12 | 数学55 |
| 5 | 13 | 英语29 |

In [17]:

# The row count follows the "many" side
pd.merge(left, right, on='sno')

Out[17]:

| | sno | name | grade |
|---|---|---|---|
| 0 | 11 | name_a | 语文88 |
| 1 | 11 | name_a | 数学90 |
| 2 | 11 | name_a | 英语75 |
| 3 | 12 | name_b | 语文66 |
| 4 | 12 | name_b | 数学55 |
| 5 | 13 | name_c | 英语29 |

2.3 many-to-many merge

Note: the result row count multiplies

In [18]:

left = pd.DataFrame({'sno': [11, 11, 12, 12,12],
                      '爱好': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
                    })
left

Out[18]:

| | sno | 爱好 |
|---|---|---|
| 0 | 11 | 篮球 |
| 1 | 11 | 羽毛球 |
| 2 | 12 | 乒乓球 |
| 3 | 12 | 篮球 |
| 4 | 12 | 足球 |

In [19]:

right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right

Out[19]:

| | sno | grade |
|---|---|---|
| 0 | 11 | 语文88 |
| 1 | 11 | 数学90 |
| 2 | 11 | 英语75 |
| 3 | 12 | 语文66 |
| 4 | 12 | 数学55 |
| 5 | 13 | 英语29 |

In [20]:

pd.merge(left, right, on='sno')

Out[20]:

| | sno | 爱好 | grade |
|---|---|---|---|
| 0 | 11 | 篮球 | 语文88 |
| 1 | 11 | 篮球 | 数学90 |
| 2 | 11 | 篮球 | 英语75 |
| 3 | 11 | 羽毛球 | 语文88 |
| 4 | 11 | 羽毛球 | 数学90 |
| 5 | 11 | 羽毛球 | 英语75 |
| 6 | 12 | 乒乓球 | 语文66 |
| 7 | 12 | 乒乓球 | 数学55 |
| 8 | 12 | 篮球 | 语文66 |
| 9 | 12 | 篮球 | 数学55 |
| 10 | 12 | 足球 | 语文66 |
| 11 | 12 | 足球 | 数学55 |

3. Understanding the difference between left join, right join, inner join, and outer join

In [21]:

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'C': ['C0', 'C1', 'C4', 'C5'],
                      'D': ['D0', 'D1', 'D4', 'D5']})

In [22]:

left

Out[22]:

| | key | A | B |
|---|---|---|---|
| 0 | K0 | A0 | B0 |
| 1 | K1 | A1 | B1 |
| 2 | K2 | A2 | B2 |
| 3 | K3 | A3 | B3 |

In [23]:

right

Out[23]:

| | key | C | D |
|---|---|---|---|
| 0 | K0 | C0 | D0 |
| 1 | K1 | C1 | D1 |
| 2 | K4 | C4 | D4 |
| 3 | K5 | C5 | D5 |

3.1 inner join (the default)

A key appears in the result only if it exists on both the left and the right

In [24]:

pd.merge(left, right, how='inner')

Out[24]:

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |

3.2 left join

Every left-side row appears in the result; unmatched right-side columns become Null

In [25]:

pd.merge(left, right, how='left')

Out[25]:

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | NaN | NaN |
| 3 | K3 | A3 | B3 | NaN | NaN |

3.3 right join

Every right-side row appears in the result; unmatched left-side columns become Null

In [26]:

pd.merge(left, right, how='right')

Out[26]:

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K4 | NaN | NaN | C4 | D4 |
| 3 | K5 | NaN | NaN | C5 | D5 |

3.4 outer join

Rows from both sides appear in the result; anything unmatched becomes Null

In [27]:

pd.merge(left, right, how='outer')

Out[27]:

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | NaN | NaN |
| 3 | K3 | A3 | B3 | NaN | NaN |
| 4 | K4 | NaN | NaN | C4 | D4 |
| 5 | K5 | NaN | NaN | C5 | D5 |

4. What to do when non-key columns collide

In [28]:

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'A': ['A10', 'A11', 'A12', 'A13'],
                      'D': ['D0', 'D1', 'D4', 'D5']})

In [29]:

left

Out[29]:

| | key | A | B |
|---|---|---|---|
| 0 | K0 | A0 | B0 |
| 1 | K1 | A1 | B1 |
| 2 | K2 | A2 | B2 |
| 3 | K3 | A3 | B3 |

In [30]:

right

Out[30]:

| | key | A | D |
|---|---|---|---|
| 0 | K0 | A10 | D0 |
| 1 | K1 | A11 | D1 |
| 2 | K4 | A12 | D4 |
| 3 | K5 | A13 | D5 |

In [31]:

pd.merge(left, right, on='key')

Out[31]:

| | key | A_x | B | A_y | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | A10 | D0 |
| 1 | K1 | A1 | B1 | A11 | D1 |

In [32]:

pd.merge(left, right, on='key', suffixes=('_left', '_right'))

Out[32]:

| | key | A_left | B | A_right | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | A10 | D0 |
| 1 | K1 | A1 | B1 | A11 | D1 |

14. Combining data with concat in Pandas

Use cases:

Batch-merging Excel files of the same format, adding rows to a DataFrame, adding columns to a DataFrame

concat in one sentence:

  • using some join method (inner/outer)
  • along some axis (axis=0/1)
  • combine multiple Pandas objects (DataFrame/Series) into one.

concat syntax: pandas.concat(objs, axis=0, join='outer', ignore_index=False)

  • objs: a list whose items are DataFrames and/or Series (they can be mixed)
  • axis: default 0, meaning row-wise concatenation; 1 means column-wise
  • join: how indexes along the other axis are aligned, default outer join; inner join is also possible
  • ignore_index: whether to discard the original indexes

append syntax: DataFrame.append(other, ignore_index=False)

append only concatenates row-wise; it is a shorthand for row-wise concat

  • other: a single dataframe, series, dict, or a list of them
  • ignore_index: whether to discard the original index
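
Version caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas, spell the row-wise merge with concat instead:

# pandas >= 2.0 replacement for df1.append(df2, ignore_index=True)
pd.concat([df1, df2], ignore_index=True)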

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [1]:

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

Part 1: combining data with pandas.concat

In [2]:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3'],
                    'E': ['E0', 'E1', 'E2', 'E3']
                   })
df1

Out[2]:

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | E0 |
| 1 | A1 | B1 | C1 | D1 | E1 |
| 2 | A2 | B2 | C2 | D2 | E2 |
| 3 | A3 | B3 | C3 | D3 | E3 |

In [3]:

df2 = pd.DataFrame({ 'A': ['A4', 'A5', 'A6', 'A7'],
                     'B': ['B4', 'B5', 'B6', 'B7'],
                     'C': ['C4', 'C5', 'C6', 'C7'],
                     'D': ['D4', 'D5', 'D6', 'D7'],
                     'F': ['F4', 'F5', 'F6', 'F7']
                   })
df2

Out[3]:

| | A | B | C | D | F |
|---|---|---|---|---|---|
| 0 | A4 | B4 | C4 | D4 | F4 |
| 1 | A5 | B5 | C5 | D5 | F5 |
| 2 | A6 | B6 | C6 | D6 | F6 |
| 3 | A7 | B7 | C7 | D7 | F7 |

*1. Default concat: axis=0, join=outer, ignore_index=False*

In [4]:

pd.concat([df1,df2])

Out[4]:

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | E0 | NaN |
| 1 | A1 | B1 | C1 | D1 | E1 | NaN |
| 2 | A2 | B2 | C2 | D2 | E2 | NaN |
| 3 | A3 | B3 | C3 | D3 | E3 | NaN |
| 0 | A4 | B4 | C4 | D4 | NaN | F4 |
| 1 | A5 | B5 | C5 | D5 | NaN | F5 |
| 2 | A6 | B6 | C6 | D6 | NaN | F6 |
| 3 | A7 | B7 | C7 | D7 | NaN | F7 |

*2. ignore_index=True discards the original indexes*

In [5]:

pd.concat([df1,df2], ignore_index=True)

Out[5]:

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | E0 | NaN |
| 1 | A1 | B1 | C1 | D1 | E1 | NaN |
| 2 | A2 | B2 | C2 | D2 | E2 | NaN |
| 3 | A3 | B3 | C3 | D3 | E3 | NaN |
| 4 | A4 | B4 | C4 | D4 | NaN | F4 |
| 5 | A5 | B5 | C5 | D5 | NaN | F5 |
| 6 | A6 | B6 | C6 | D6 | NaN | F6 |
| 7 | A7 | B7 | C7 | D7 | NaN | F7 |

*3. join=inner drops the columns that do not match*

In [6]:

pd.concat([df1,df2], ignore_index=True, join="inner")

Out[6]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |

*4. axis=1 amounts to adding new columns*

In [7]:

df1

Out[7]:

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | E0 |
| 1 | A1 | B1 | C1 | D1 | E1 |
| 2 | A2 | B2 | C2 | D2 | E2 |
| 3 | A3 | B3 | C3 | D3 | E3 |

*A: add one Series as a column*

In [8]:

s1 = pd.Series(list(range(4)), name="F")
pd.concat([df1,s1], axis=1)

Out[8]:

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | E0 | 0 |
| 1 | A1 | B1 | C1 | D1 | E1 | 1 |
| 2 | A2 | B2 | C2 | D2 | E2 | 2 |
| 3 | A3 | B3 | C3 | D3 | E3 | 3 |

*B: add multiple Series as columns*

In [9]:

s2 = df1.apply(lambda x:x["A"]+"_GG", axis=1)

In [10]:

s2

Out[10]:

0    A0_GG
1    A1_GG
2    A2_GG
3    A3_GG
dtype: object

In [11]:

s2.name="G"

In [12]:

pd.concat([df1,s1,s2], axis=1)

Out[12]:

| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | E0 | 0 | A0_GG |
| 1 | A1 | B1 | C1 | D1 | E1 | 1 | A1_GG |
| 2 | A2 | B2 | C2 | D2 | E2 | 2 | A2_GG |
| 3 | A3 | B3 | C3 | D3 | E3 | 3 | A3_GG |

In [13]:

# The list may contain only Series
pd.concat([s1,s2], axis=1)

Out[13]:

| | F | G |
|---|---|---|
| 0 | 0 | A0_GG |
| 1 | 1 | A1_GG |
| 2 | 2 | A2_GG |
| 3 | 3 | A3_GG |

In [14]:

# The list can mix the objects in any order
pd.concat([s1,df1,s2], axis=1)

Out[14]:

| | F | A | B | C | D | E | G |
|---|---|---|---|---|---|---|---|
| 0 | 0 | A0 | B0 | C0 | D0 | E0 | A0_GG |
| 1 | 1 | A1 | B1 | C1 | D1 | E1 | A1_GG |
| 2 | 2 | A2 | B2 | C2 | D2 | E2 | A2_GG |
| 3 | 3 | A3 | B3 | C3 | D3 | E3 | A3_GG |

Part 2: appending rows with DataFrame.append

In [15]:

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df1

Out[15]:

| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |

In [16]:

df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2

Out[16]:

| | A | B |
|---|---|---|
| 0 | 5 | 6 |
| 1 | 7 | 8 |

*1. Append one dataframe to another*

In [17]:

df1.append(df2)

Out[17]:

| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
| 0 | 5 | 6 |
| 1 | 7 | 8 |

*2. Discard the original index with ignore_index=True*

In [18]:

df1.append(df2, ignore_index=True)

Out[18]:

| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
| 2 | 5 | 6 |
| 3 | 7 | 8 |

*3. Rows can be appended to a DataFrame one at a time*

In [19]:

# An empty df
df = pd.DataFrame(columns=['A'])
df

Out[19]:

A

*A: the low-performance version*

In [20]:

for i in range(5):
    # note: this copies the whole frame on every iteration
    df = df.append({'A': i}, ignore_index=True)
df

Out[20]:

| | A |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |

*B: the high-performance version*

In [21]:

# The first argument is a list, which avoids repeated copying
pd.concat(
    [pd.DataFrame([i], columns=['A']) for i in range(5)],
    ignore_index=True
)

Out[21]:

| | A |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |

15. Batch-splitting and batch-merging Excel files with Pandas

Demonstrated here:

  1. Splitting one big Excel file into several equal parts
  2. Merging several small Excel files into one big one, tagging each row's source

In [1]:

work_dir="./course_datas/c15_excel_split_merge"
splits_dir=f"{work_dir}/splits"

import os
if not os.path.exists(splits_dir):
    os.mkdir(splits_dir)

0. Read the source Excel file into Pandas

In [2]:

import pandas as pd

In [3]:

df_source = pd.read_excel(f"{work_dir}/crazyant_blog_articles_source.xlsx")

In [4]:

df_source.head()

Out[4]:

| | id | title | tags |
|---|---|---|---|
| 0 | 2585 | Tensorflow怎样接收变长列表特征 | python,tensorflow,特征工程 |
| 1 | 2583 | Pandas实现数据的合并concat | pandas,python,数据分析 |
| 2 | 2574 | Pandas的Index索引有什么用途? | pandas,python,数据分析 |
| 3 | 2564 | 机器学习常用数据集大全 | python,机器学习 |
| 4 | 2561 | 一个数据科学家的修炼路径 | 数据分析 |

In [5]:

df_source.index

Out[5]:

RangeIndex(start=0, stop=258, step=1)

In [6]:

df_source.shape

Out[6]:

(258, 3)

In [7]:

total_row_count = df_source.shape[0]
total_row_count

Out[7]:

258

Part 1: splitting one big Excel file into equal parts

  1. Use df.iloc to split one big dataframe into several small ones
  2. Save each small one with dataframe.to_excel

1. Compute the row count of each output Excel file

In [9]:

# The big Excel file will be split among these people
user_names = ["xiao_shuai", "xiao_wang", "xiao_ming", "xiao_lei", "xiao_bo", "xiao_hong"]

In [10]:

# Number of rows per person
split_size = total_row_count // len(user_names)
if total_row_count % len(user_names) != 0:
    split_size += 1

split_size

Out[10]:

43
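
The `//` plus the remainder check above is just ceiling division; an equivalent one-liner (a hypothetical rewrite) using the standard library:

import math

# 258 rows over 6 people, rounded up -> 43
split_size = math.ceil(total_row_count / len(user_names))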

2. Split into multiple dataframes

In [12]:

df_subs = []
for idx, user_name in enumerate(user_names):
    # start index for iloc
    begin = idx*split_size
    # end index for iloc
    end = begin+split_size
    # split the df with an iloc range
    df_sub = df_source.iloc[begin:end]
    # collect each sub-df into the list
    df_subs.append((idx, user_name, df_sub))

3. Save each dataframe to its own Excel file

In [13]:

for idx, user_name, df_sub in df_subs:
    file_name = f"{splits_dir}/crazyant_blog_articles_{idx}_{user_name}.xlsx"
    df_sub.to_excel(file_name, index=False)

Part 2: merging several small Excel files into one big one

  1. Walk the folder to get the list of Excel files to merge
  2. Read each into a dataframe and add a column marking its source
  3. Batch-merge the dfs with pd.concat
  4. Write the merged dataframe out to Excel

1. Walk the folder to get the list of Excel file names

In [14]:

import os
excel_names = []
for excel_name in os.listdir(splits_dir):
    excel_names.append(excel_name)
excel_names

Out[14]:

['crazyant_blog_articles_0_xiao_shuai.xlsx',
 'crazyant_blog_articles_1_xiao_wang.xlsx',
 'crazyant_blog_articles_2_xiao_ming.xlsx',
 'crazyant_blog_articles_3_xiao_lei.xlsx',
 'crazyant_blog_articles_4_xiao_bo.xlsx',
 'crazyant_blog_articles_5_xiao_hong.xlsx']

2. Read each file into a dataframe

In [15]:

df_list = []

for excel_name in excel_names:
    # read each Excel file into a df
    excel_path = f"{splits_dir}/{excel_name}"
    df_split = pd.read_excel(excel_path)
    # derive the username from the file name
    username = excel_name.replace("crazyant_blog_articles_", "").replace(".xlsx", "")[2:]
    print(excel_name, username)
    # add one column to each df: the user name
    df_split["username"] = username

    df_list.append(df_split)
crazyant_blog_articles_0_xiao_shuai.xlsx xiao_shuai
crazyant_blog_articles_1_xiao_wang.xlsx xiao_wang
crazyant_blog_articles_2_xiao_ming.xlsx xiao_ming
crazyant_blog_articles_3_xiao_lei.xlsx xiao_lei
crazyant_blog_articles_4_xiao_bo.xlsx xiao_bo
crazyant_blog_articles_5_xiao_hong.xlsx xiao_hong

3. Merge with pd.concat

In [16]:

df_merged = pd.concat(df_list)

In [17]:

df_merged.shape

Out[17]:

(258, 4)

In [18]:

df_merged.head()

Out[18]:

| | id | title | tags | username |
|---|---|---|---|---|
| 0 | 2585 | Tensorflow怎样接收变长列表特征 | python,tensorflow,特征工程 | xiao_shuai |
| 1 | 2583 | Pandas实现数据的合并concat | pandas,python,数据分析 | xiao_shuai |
| 2 | 2574 | Pandas的Index索引有什么用途? | pandas,python,数据分析 | xiao_shuai |
| 3 | 2564 | 机器学习常用数据集大全 | python,机器学习 | xiao_shuai |
| 4 | 2561 | 一个数据科学家的修炼路径 | 数据分析 | xiao_shuai |

In [19]:

df_merged["username"].value_counts()

Out[19]:

xiao_hong     43
xiao_bo       43
xiao_shuai    43
xiao_lei      43
xiao_wang     43
xiao_ming     43
Name: username, dtype: int64

4. Write the merged dataframe out to Excel

In [20]:

df_merged.to_excel(f"{work_dir}/crazyant_blog_articles_merged.xlsx", index=False)

16. groupby statistics in Pandas

SQL analog:
select city,max(temperature) from city_weather group by city;

groupby: first split the data into groups, then apply aggregation (or transformation) functions to each group (a pandas rendering of the SQL above follows below)

This lesson covers:
Part 1: per-group statistics with aggregation functions
Part 2: iterating over groupby results to understand how it executes
Part 3: a grouped exploration of the weather data
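
For reference, a minimal pandas rendering of that SQL, with a made-up city_weather frame (the names are illustrative, not part of the course data):

import pandas as pd

city_weather = pd.DataFrame({
    "city": ["bj", "bj", "sh", "sh"],
    "temperature": [3, 7, 12, 9],
})

# select city, max(temperature) ... group by city
city_weather.groupby("city")["temperature"].max()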

In [1]:

import pandas as pd
import numpy as np
# This line makes matplotlib charts render inside the Jupyter notebook
%matplotlib inline

In [2]:

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

Out[2]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | 0.542903 | 0.788896 |
| 1 | bar | one | -0.375789 | -0.345869 |
| 2 | foo | two | -0.903407 | 0.428031 |
| 3 | bar | three | -1.564748 | 0.081163 |
| 4 | foo | two | -1.093602 | 0.837348 |
| 5 | bar | two | -0.202403 | 0.701301 |
| 6 | foo | one | -0.665189 | -1.505290 |
| 7 | foo | three | -0.498339 | 0.534438 |

Part 1: per-group statistics with aggregation functions

1. groupby on a single column, aggregating all data columns

In [3]:

df.groupby('A').sum()

Out[3]:

| A | C | D |
|---|---|---|
| bar | -2.142940 | 0.436595 |
| foo | -2.617633 | 1.083423 |

We can see that:

  1. the groupby column 'A' became the index of the result
  2. column B is not numeric, so the sum automatically skipped it

2. groupby on multiple columns, aggregating all data columns

In [4]:

df.groupby(['A','B']).mean()

Out[4]:

| A | B | C | D |
|---|---|---|---|
| bar | one | -0.375789 | -0.345869 |
| | three | -1.564748 | 0.081163 |
| | two | -0.202403 | 0.701301 |
| foo | one | -0.061143 | -0.358197 |
| | three | -0.498339 | 0.534438 |
| | two | -0.998504 | 0.632690 |

We can see that the ('A','B') pairs became a two-level index

In [5]:

df.groupby(['A','B'], as_index=False).mean()

Out[5]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | bar | one | -0.375789 | -0.345869 |
| 1 | bar | three | -1.564748 | 0.081163 |
| 2 | bar | two | -0.202403 | 0.701301 |
| 3 | foo | one | -0.061143 | -0.358197 |
| 4 | foo | three | -0.498339 | 0.534438 |
| 5 | foo | two | -0.998504 | 0.632690 |

3. Several statistics at once

In [6]:

df.groupby('A').agg([np.sum, np.mean, np.std])

Out[6]:

| | C | | | D | | |
|---|---|---|---|---|---|---|
| | sum | mean | std | sum | mean | std |
| A | | | | | | |
| bar | -2.142940 | -0.714313 | 0.741583 | 0.436595 | 0.145532 | 0.526544 |
| foo | -2.617633 | -0.523527 | 0.637822 | 1.083423 | 0.216685 | 0.977686 |

We can see that the columns became a multi-level index

4. Statistics for a single result column

In [7]:

# Method 1: pre-filter the column first — better performance
df.groupby('A')['C'].agg([np.sum, np.mean, np.std])

Out[7]:

| A | sum | mean | std |
|---|---|---|---|
| bar | -2.142940 | -0.714313 | 0.741583 |
| foo | -2.617633 | -0.523527 | 0.637822 |

In [8]:

# Method 2: aggregate all columns, then select one
df.groupby('A').agg([np.sum, np.mean, np.std])['C']

Out[8]:

| A | sum | mean | std |
|---|---|---|---|
| bar | -2.142940 | -0.714313 | 0.741583 |
| foo | -2.617633 | -0.523527 | 0.637822 |

5. Different aggregation functions for different columns

In [9]:

df.groupby('A').agg({"C":np.sum, "D":np.mean})

Out[9]:

| A | C | D |
|---|---|---|
| bar | -2.142940 | 0.145532 |
| foo | -2.617633 | 0.216685 |

Part 2: iterating over groupby results to understand how it executes

A for loop can iterate directly over each group

1. Iterating over a single-column grouping

In [10]:

g = df.groupby('A')
g

Out[10]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000123B250E548>

In [11]:

for name,group in g:
    print(name)
    print(group)
    print()
bar
     A      B         C         D
1  bar    one -0.375789 -0.345869
3  bar  three -1.564748  0.081163
5  bar    two -0.202403  0.701301

foo
     A      B         C         D
0  foo    one  0.542903  0.788896
2  foo    two -0.903407  0.428031
4  foo    two -1.093602  0.837348
6  foo    one -0.665189 -1.505290
7  foo  three -0.498339  0.534438

*A single group's data can be fetched directly*

In [12]:

g.get_group('bar')

Out[12]:

| | A | B | C | D |
|---|---|---|---|---|
| 1 | bar | one | -0.375789 | -0.345869 |
| 3 | bar | three | -1.564748 | 0.081163 |
| 5 | bar | two | -0.202403 | 0.701301 |
2. Iterating over a multi-column grouping

In [13]:

g = df.groupby(['A', 'B'])

In [14]:

for name,group in g:
    print(name)
    print(group)
    print()
('bar', 'one')
     A    B         C         D
1  bar  one -0.375789 -0.345869

('bar', 'three')
     A      B         C         D
3  bar  three -1.564748  0.081163

('bar', 'two')
     A    B         C         D
5  bar  two -0.202403  0.701301

('foo', 'one')
     A    B         C         D
0  foo  one  0.542903  0.788896
6  foo  one -0.665189 -1.505290

('foo', 'three')
     A      B         C         D
7  foo  three -0.498339  0.534438

('foo', 'two')
     A    B         C         D
2  foo  two -0.903407  0.428031
4  foo  two -1.093602  0.837348

As you can see, name is a 2-element tuple: the keys of the two grouping columns

In [15]:

g.get_group(('foo', 'one'))

Out[15]:

| | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | 0.542903 | 0.788896 |
| 6 | foo | one | -0.665189 | -1.505290 |

*You can select particular columns of the grouped object directly, producing grouped Series or sub-DataFrames*

In [16]:

g['C']

Out[16]:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000123C33F64C8>

In [17]:

for name, group in g['C']:
    print(name)
    print(group)
    print(type(group))
    print()
('bar', 'one')
1   -0.375789
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('bar', 'three')
3   -1.564748
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('bar', 'two')
5   -0.202403
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('foo', 'one')
0    0.542903
6   -0.665189
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('foo', 'three')
7   -0.498339
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

('foo', 'two')
2   -0.903407
4   -1.093602
Name: C, dtype: float64
<class 'pandas.core.series.Series'>

In fact, every aggregate statistic is computed on exactly these per-group dataframes and series;
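
A quick way to verify that (continuing the notebook session): aggregating the grouped column gives one value per group, the same as calling .mean() on each per-group Series printed above:

# mean of column C within each ('A', 'B') group
g['C'].mean()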

Part 3: a grouped exploration of the weather data

In [18]:

fpath = "./datas/beijing_tianqi/beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)
# Strip the ℃ suffix from the temperatures
df.loc[:, "bWendu"] = df["bWendu"].str.replace("℃", "").astype('int32')
df.loc[:, "yWendu"] = df["yWendu"].str.replace("℃", "").astype('int32')
df.head()

Out[18]:

| | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 |
| 1 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 |
| 2 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 |
| 3 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 |
| 4 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 |

In [19]:

# Add a new column: the month
df['month'] = df['ymd'].str[:7]
df.head()

Out[19]:

| | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel | month |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 | 2018-01 |
| 1 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 | 2018-01 |
| 2 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 | 2018-01 |
| 3 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 | 2018-01 |
| 4 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 | 2018-01 |

1. The maximum temperature of each month

In [20]:

data = df.groupby('month')['bWendu'].max()
data

Out[20]:

month
2018-01     7
2018-02    12
2018-03    27
2018-04    30
2018-05    35
2018-06    38
2018-07    37
2018-08    36
2018-09    31
2018-10    25
2018-11    18
2018-12    10
Name: bWendu, dtype: int32

In [21]:

type(data)

Out[21]:

pandas.core.series.Series

In [22]:

data.plot()

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x123c344b308>

2. Monthly maximum temperature, minimum temperature, and average AQI

In [23]:

df.head()

Out[23]:

| | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel | month |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 | 2018-01 |
| 1 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 | 2018-01 |
| 2 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 | 2018-01 |
| 3 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 | 2018-01 |
| 4 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 | 2018-01 |

In [24]:

group_data = df.groupby('month').agg({"bWendu":np.max, "yWendu":np.min, "aqi":np.mean})
group_data

Out[24]:

| month | bWendu | yWendu | aqi |
|---|---|---|---|
| 2018-01 | 7 | -12 | 60.677419 |
| 2018-02 | 12 | -10 | 78.857143 |
| 2018-03 | 27 | -4 | 130.322581 |
| 2018-04 | 30 | 1 | 102.866667 |
| 2018-05 | 35 | 10 | 99.064516 |
| 2018-06 | 38 | 17 | 82.300000 |
| 2018-07 | 37 | 22 | 72.677419 |
| 2018-08 | 36 | 20 | 59.516129 |
| 2018-09 | 31 | 11 | 50.433333 |
| 2018-10 | 25 | 1 | 67.096774 |
| 2018-11 | 18 | -4 | 105.100000 |
| 2018-12 | 10 | -12 | 77.354839 |

In [25]:

group_data.plot()

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x123c5502d48>

17. Hierarchical indexing with MultiIndex in Pandas

Why learn MultiIndex?

  • A hierarchical index has multiple index levels on one axis and can represent higher-dimensional data;
  • It makes data filtering more convenient, and faster when the index is sorted;
  • Multi-key groupby results come back with a hierarchical index, so you need to know how to use one
  • You generally do not create a MultiIndex yourself (it has constructors, but they are rarely needed)

Demo data: stock prices for four companies, Baidu, Alibaba, iQiyi, and JD (three trading days each in this excerpt)
Source: Investing.com (英为财经)
https://cn.investing.com/

Outline:
Part 1: MultiIndex on a Series
Part 2: how to filter a Series with a MultiIndex
Part 3: MultiIndex on a DataFrame
Part 4: how to filter a DataFrame with a MultiIndex

In [1]:

import pandas as pd
%matplotlib inline

In [2]:

stocks = pd.read_excel('./datas/stocks/互联网公司股票.xlsx')

In [3]:

stocks.shape

Out[3]:

(12, 8)

In [4]:

stocks.head(3)

Out[4]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |

In [5]:

stocks["公司"].unique()

Out[5]:

array(['BIDU', 'BABA', 'IQ', 'JD'], dtype=object)

In [6]:

stocks.index

Out[6]:

RangeIndex(start=0, stop=12, step=1)

In [7]:

stocks.groupby('公司')["收盘"].mean()

Out[7]:

公司
BABA    166.80
BIDU    102.98
IQ       15.90
JD       28.35
Name: 收盘, dtype: float64

Part 1: MultiIndex on a Series

In [8]:

ser = stocks.groupby(['公司', '日期'])['收盘'].mean()
ser

Out[8]:

公司    日期        
BABA  2019-10-01    165.15
      2019-10-02    165.77
      2019-10-03    169.48
BIDU  2019-10-01    102.00
      2019-10-02    102.62
      2019-10-03    104.32
IQ    2019-10-01     15.92
      2019-10-02     15.72
      2019-10-03     16.06
JD    2019-10-01     28.19
      2019-10-02     28.06
      2019-10-03     28.80
Name: 收盘, dtype: float64

In a multi-level index display, a blank cell means: same as the value above it

In [9]:

ser.index

Out[9]:

MultiIndex([('BABA', '2019-10-01'),
            ('BABA', '2019-10-02'),
            ('BABA', '2019-10-03'),
            ('BIDU', '2019-10-01'),
            ('BIDU', '2019-10-02'),
            ('BIDU', '2019-10-03'),
            (  'IQ', '2019-10-01'),
            (  'IQ', '2019-10-02'),
            (  'IQ', '2019-10-03'),
            (  'JD', '2019-10-01'),
            (  'JD', '2019-10-02'),
            (  'JD', '2019-10-03')],
           names=['公司', '日期'])

In [10]:

# unstack turns the second index level into columns
ser.unstack()

Out[10]:

| 公司 \ 日期 | 2019-10-01 | 2019-10-02 | 2019-10-03 |
|---|---|---|---|
| BABA | 165.15 | 165.77 | 169.48 |
| BIDU | 102.00 | 102.62 | 104.32 |
| IQ | 15.92 | 15.72 | 16.06 |
| JD | 28.19 | 28.06 | 28.80 |

In [11]:

ser

Out[11]:

公司    日期        
BABA  2019-10-01    165.15
      2019-10-02    165.77
      2019-10-03    169.48
BIDU  2019-10-01    102.00
      2019-10-02    102.62
      2019-10-03    104.32
IQ    2019-10-01     15.92
      2019-10-02     15.72
      2019-10-03     16.06
JD    2019-10-01     28.19
      2019-10-02     28.06
      2019-10-03     28.80
Name: 收盘, dtype: float64

In [12]:

ser.reset_index()

Out[12]:

| | 公司 | 日期 | 收盘 |
|---|---|---|---|
| 0 | BABA | 2019-10-01 | 165.15 |
| 1 | BABA | 2019-10-02 | 165.77 |
| 2 | BABA | 2019-10-03 | 169.48 |
| 3 | BIDU | 2019-10-01 | 102.00 |
| 4 | BIDU | 2019-10-02 | 102.62 |
| 5 | BIDU | 2019-10-03 | 104.32 |
| 6 | IQ | 2019-10-01 | 15.92 |
| 7 | IQ | 2019-10-02 | 15.72 |
| 8 | IQ | 2019-10-03 | 16.06 |
| 9 | JD | 2019-10-01 | 28.19 |
| 10 | JD | 2019-10-02 | 28.06 |
| 11 | JD | 2019-10-03 | 28.80 |

Part 2: how to filter a Series with a MultiIndex

In [13]:

ser

Out[13]:

公司    日期        
BABA  2019-10-01    165.15
      2019-10-02    165.77
      2019-10-03    169.48
BIDU  2019-10-01    102.00
      2019-10-02    102.62
      2019-10-03    104.32
IQ    2019-10-01     15.92
      2019-10-02     15.72
      2019-10-03     16.06
JD    2019-10-01     28.19
      2019-10-02     28.06
      2019-10-03     28.80
Name: 收盘, dtype: float64

In [14]:

ser.loc['BIDU']

Out[14]:

日期
2019-10-01    102.00
2019-10-02    102.62
2019-10-03    104.32
Name: 收盘, dtype: float64

In [15]:

# With a multi-level index, filter with a tuple
ser.loc[('BIDU', '2019-10-02')]

Out[15]:

102.62

In [16]:

ser.loc[:, '2019-10-02']

Out[16]:

公司
BABA    165.77
BIDU    102.62
IQ       15.72
JD       28.06
Name: 收盘, dtype: float64

Part 3: MultiIndex on a DataFrame

In [17]:

stocks.head()

Out[17]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| 4 | 2019-10-02 | BABA | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |

In [18]:

stocks.set_index(['公司', '日期'], inplace=True)
stocks

Out[18]:

| 公司 | 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|
| BIDU | 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| | 2019-10-02 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| | 2019-10-01 | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| BABA | 2019-10-03 | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| | 2019-10-02 | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |
| | 2019-10-01 | 165.15 | 168.01 | 168.23 | 163.64 | 14.19 | -0.01 |
| IQ | 2019-10-03 | 16.06 | 15.71 | 16.38 | 15.32 | 10.08 | 0.02 |
| | 2019-10-02 | 15.72 | 15.85 | 15.87 | 15.12 | 8.10 | -0.01 |
| | 2019-10-01 | 15.92 | 16.14 | 16.22 | 15.50 | 11.65 | -0.01 |
| JD | 2019-10-03 | 28.80 | 28.11 | 28.97 | 27.82 | 8.77 | 0.03 |
| | 2019-10-02 | 28.06 | 28.00 | 28.22 | 27.53 | 9.53 | 0.00 |
| | 2019-10-01 | 28.19 | 28.22 | 28.57 | 27.97 | 10.64 | 0.00 |

In [19]:

stocks.index

Out[19]:

MultiIndex([('BIDU', '2019-10-03'),
            ('BIDU', '2019-10-02'),
            ('BIDU', '2019-10-01'),
            ('BABA', '2019-10-03'),
            ('BABA', '2019-10-02'),
            ('BABA', '2019-10-01'),
            (  'IQ', '2019-10-03'),
            (  'IQ', '2019-10-02'),
            (  'IQ', '2019-10-01'),
            (  'JD', '2019-10-03'),
            (  'JD', '2019-10-02'),
            (  'JD', '2019-10-01')],
           names=['公司', '日期'])

In [20]:

stocks.sort_index(inplace=True)
stocks

Out[20]:

| 公司 | 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|
| BABA | 2019-10-01 | 165.15 | 168.01 | 168.23 | 163.64 | 14.19 | -0.01 |
| | 2019-10-02 | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |
| | 2019-10-03 | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| BIDU | 2019-10-01 | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| | 2019-10-02 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| | 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| IQ | 2019-10-01 | 15.92 | 16.14 | 16.22 | 15.50 | 11.65 | -0.01 |
| | 2019-10-02 | 15.72 | 15.85 | 15.87 | 15.12 | 8.10 | -0.01 |
| | 2019-10-03 | 16.06 | 15.71 | 16.38 | 15.32 | 10.08 | 0.02 |
| JD | 2019-10-01 | 28.19 | 28.22 | 28.57 | 27.97 | 10.64 | 0.00 |
| | 2019-10-02 | 28.06 | 28.00 | 28.22 | 27.53 | 9.53 | 0.00 |
| | 2019-10-03 | 28.80 | 28.11 | 28.97 | 27.82 | 8.77 | 0.03 |

Part 4: how to filter a DataFrame with a MultiIndex

【*Key point*】when selecting data:

  • a tuple (key1,key2) addresses multiple index levels: key1 is the first level, key2 the second, e.g. key1=JD, key2=2019-10-02
  • a list [key1,key2] means multiple KEYs on the same level: key1 and key2 are sibling values, e.g. key1=JD, key2=BIDU

In [21]:

stocks.loc['BIDU']

Out[21]:

| 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|
| 2019-10-01 | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| 2019-10-02 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |

In [22]:

stocks.loc[('BIDU', '2019-10-02'), :]

Out[22]:

收盘     102.62
开盘     100.85
高      103.24
低       99.50
交易量      2.69
涨跌幅      0.01
Name: (BIDU, 2019-10-02), dtype: float64

In [23]:

stocks.loc[('BIDU', '2019-10-02'), '开盘']

Out[23]:

100.85

In [24]:

stocks.loc[['BIDU', 'JD'], :]

Out[24]:

| 公司 | 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|
| BIDU | 2019-10-01 | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| | 2019-10-02 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| | 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| JD | 2019-10-01 | 28.19 | 28.22 | 28.57 | 27.97 | 10.64 | 0.00 |
| | 2019-10-02 | 28.06 | 28.00 | 28.22 | 27.53 | 9.53 | 0.00 |
| | 2019-10-03 | 28.80 | 28.11 | 28.97 | 27.82 | 8.77 | 0.03 |

In [25]:

stocks.loc[(['BIDU', 'JD'], '2019-10-03'), :]

Out[25]:

| 公司 | 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|
| BIDU | 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| JD | 2019-10-03 | 28.80 | 28.11 | 28.97 | 27.82 | 8.77 | 0.03 |

In [26]:

stocks.loc[(['BIDU', 'JD'], '2019-10-03'), '收盘']

Out[26]:

公司    日期        
BIDU  2019-10-03    104.32
JD    2019-10-03     28.80
Name: 收盘, dtype: float64

In [27]:

stocks.loc[('BIDU', ['2019-10-02', '2019-10-03']), '收盘']

Out[27]:

公司    日期        
BIDU  2019-10-02    102.62
      2019-10-03    104.32
Name: 收盘, dtype: float64

In [28]:

# slice(None) selects all values of that index level
stocks.loc[(slice(None), ['2019-10-02', '2019-10-03']), :]

Out[28]:

| 公司 | 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|
| BABA | 2019-10-02 | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |
| | 2019-10-03 | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| BIDU | 2019-10-02 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| | 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| IQ | 2019-10-02 | 15.72 | 15.85 | 15.87 | 15.12 | 8.10 | -0.01 |
| | 2019-10-03 | 16.06 | 15.71 | 16.38 | 15.32 | 10.08 | 0.02 |
| JD | 2019-10-02 | 28.06 | 28.00 | 28.22 | 27.53 | 9.53 | 0.00 |
| | 2019-10-03 | 28.80 | 28.11 | 28.97 | 27.82 | 8.77 | 0.03 |
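
The same selection can be written with pd.IndexSlice, a small helper that makes multi-level slices easier to read:

# pd.IndexSlice builds the (slice(None), [...]) tuple shown above
idx = pd.IndexSlice
stocks.loc[idx[:, ['2019-10-02', '2019-10-03']], :]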

In [29]:

stocks.reset_index()

Out[29]:

| | 公司 | 日期 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|---|
| 0 | BABA | 2019-10-01 | 165.15 | 168.01 | 168.23 | 163.64 | 14.19 | -0.01 |
| 1 | BABA | 2019-10-02 | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |
| 2 | BABA | 2019-10-03 | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| 3 | BIDU | 2019-10-01 | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| 4 | BIDU | 2019-10-02 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| 5 | BIDU | 2019-10-03 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| 6 | IQ | 2019-10-01 | 15.92 | 16.14 | 16.22 | 15.50 | 11.65 | -0.01 |
| 7 | IQ | 2019-10-02 | 15.72 | 15.85 | 15.87 | 15.12 | 8.10 | -0.01 |
| 8 | IQ | 2019-10-03 | 16.06 | 15.71 | 16.38 | 15.32 | 10.08 | 0.02 |
| 9 | JD | 2019-10-01 | 28.19 | 28.22 | 28.57 | 27.97 | 10.64 | 0.00 |
| 10 | JD | 2019-10-02 | 28.06 | 28.00 | 28.22 | 27.53 | 9.53 | 0.00 |
| 11 | JD | 2019-10-03 | 28.80 | 28.11 | 28.97 | 27.82 | 8.77 | 0.03 |

18. Pandas data-transformation functions: map, apply, applymap

Comparing the three transformation functions:

  1. map: Series only; maps each value to a new value;
  2. apply: on a Series, processes each value; on a DataFrame, processes the Series along a given axis;
  3. applymap: DataFrame only; processes every individual element of the DataFrame;

1. map: transforming the values of a Series

Example: map the stock ticker symbols to Chinese company names

Both Series.map(dict) and Series.map(function) work

In [1]:

import pandas as pd
stocks = pd.read_excel('./datas/stocks/互联网公司股票.xlsx')
stocks.head()

Out[1]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| 4 | 2019-10-02 | BABA | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |

In [2]:

stocks["公司"].unique()

Out[2]:

array(['BIDU', 'BABA', 'IQ', 'JD'], dtype=object)

In [3]:

# Ticker-to-Chinese-name mapping; note the keys are lowercase
dict_company_names = {
    "bidu": "百度",
    "baba": "阿里巴巴",
    "iq": "爱奇艺", 
    "jd": "京东"
}

Method 1: Series.map(dict)

In [4]:

stocks["公司中文1"] = stocks["公司"].str.lower().map(dict_company_names)

In [5]:

stocks.head()

Out[5]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 | 百度 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 | 百度 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 | 百度 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 | 阿里巴巴 |
| 4 | 2019-10-02 | BABA | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 | 阿里巴巴 |

Method 2: Series.map(function)

The function receives each element of the Series

In [6]:

stocks["公司中文2"] = stocks["公司"].map(lambda x : dict_company_names[x.lower()])

In [7]:

stocks.head()

Out[7]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文1 | 公司中文2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 | 百度 | 百度 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 | 百度 | 百度 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 | 百度 | 百度 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 | 阿里巴巴 | 阿里巴巴 |
| 4 | 2019-10-02 | BABA | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 | 阿里巴巴 | 阿里巴巴 |

2. apply: transforming a Series or DataFrame

  • Series.apply(function): the function receives each value
  • DataFrame.apply(function): the function receives a Series

Series.apply(function)

The function receives each value of the Series

In [8]:

stocks["公司中文3"] = stocks["公司"].apply(
    lambda x : dict_company_names[x.lower()])

In [9]:

stocks.head()

Out[9]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文1 | 公司中文2 | 公司中文3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 | 百度 | 百度 | 百度 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 | 百度 | 百度 | 百度 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 | 百度 | 百度 | 百度 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 |
| 4 | 2019-10-02 | BABA | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 |

DataFrame.apply(function)

The function receives the Series along the given axis

In [10]:

stocks["公司中文4"] = stocks.apply(
    lambda x : dict_company_names[x["公司"].lower()], 
    axis=1)

Notes on this code:
1. apply is called on the stocks DataFrame;
2. the x in lambda x is a Series; since axis=1, the Series keys are the column names, so x['公司'] works

In [11]:

stocks.head()

Out[11]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文1 | 公司中文2 | 公司中文3 | 公司中文4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 | 百度 | 百度 | 百度 | 百度 |
| 1 | 2019-10-02 | BIDU | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 | 百度 | 百度 | 百度 | 百度 |
| 2 | 2019-10-01 | BIDU | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 | 百度 | 百度 | 百度 | 百度 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 |
| 4 | 2019-10-02 | BABA | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 |

3. applymap: transforming every value of a DataFrame

In [12]:

sub_df = stocks[['收盘', '开盘', '高', '低', '交易量']]

In [13]:

sub_df.head()

Out[13]:

| | 收盘 | 开盘 | 高 | 低 | 交易量 |
|---|---|---|---|---|---|
| 0 | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 |
| 1 | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 |
| 2 | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 |
| 3 | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 |
| 4 | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 |

In [14]:

# Truncate these numbers to integers, applied to every element
sub_df.applymap(lambda x : int(x))

Out[14]:

| | 收盘 | 开盘 | 高 | 低 | 交易量 |
|---|---|---|---|---|---|
| 0 | 104 | 102 | 104 | 101 | 2 |
| 1 | 102 | 100 | 103 | 99 | 2 |
| 2 | 102 | 102 | 103 | 101 | 1 |
| 3 | 169 | 166 | 170 | 165 | 10 |
| 4 | 165 | 162 | 166 | 161 | 11 |
| 5 | 165 | 168 | 168 | 163 | 14 |
| 6 | 16 | 15 | 16 | 15 | 10 |
| 7 | 15 | 15 | 15 | 15 | 8 |
| 8 | 15 | 16 | 16 | 15 | 11 |
| 9 | 28 | 28 | 28 | 27 | 8 |
| 10 | 28 | 28 | 28 | 27 | 9 |
| 11 | 28 | 28 | 28 | 27 | 10 |
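
Version caveat: in pandas 2.1+, applymap was renamed; the old spelling still works but emits a deprecation warning. The equivalent on current pandas:

# pandas >= 2.1: DataFrame.map replaces DataFrame.applymap
sub_df.map(lambda x: int(x))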

In [15]:

# Overwrite these columns of the original df in place
stocks.loc[:, ['收盘', '开盘', '高', '低', '交易量']] = sub_df.applymap(lambda x : int(x))

In [16]:

stocks.head()

Out[16]:

| | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文1 | 公司中文2 | 公司中文3 | 公司中文4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-03 | BIDU | 104 | 102 | 104 | 101 | 2 | 0.02 | 百度 | 百度 | 百度 | 百度 |
| 1 | 2019-10-02 | BIDU | 102 | 100 | 103 | 99 | 2 | 0.01 | 百度 | 百度 | 百度 | 百度 |
| 2 | 2019-10-01 | BIDU | 102 | 102 | 103 | 101 | 1 | -0.01 | 百度 | 百度 | 百度 | 百度 |
| 3 | 2019-10-03 | BABA | 169 | 166 | 170 | 165 | 10 | 0.02 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 |
| 4 | 2019-10-02 | BABA | 165 | 162 | 166 | 161 | 11 | 0.00 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 | 阿里巴巴 |