|  | UserID | MovieID | Rating | Timestamp |
| ---: | -----: | ------: | -----: | --------: |
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |

In [3]:

# Group by UserID, then normalize one of the columns within each group
def ratings_norm(df):
    """
    @param df: the sub-DataFrame for one user's group
    """
    min_value = df["Rating"].min()
    max_value = df["Rating"].max()
    df["Rating_norm"] = df["Rating"].apply(
        lambda x: (x-min_value)/(max_value-min_value))
    return df

ratings = ratings.groupby("UserID").apply(ratings_norm)

In [4]:

ratings[ratings["UserID"]==1].head()

Out[4]:

UserIDMovieIDRatingTimestampRating_norm
01119359783007601.0
1166139783021090.0
2191439783019680.0
31340849783002750.5
41235559788242911.0

We can see that for the user with UserID==1, Rating==3 is his lowest score (he is an optimist), and it gets normalized to 0.
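As a side note, the same per-user min-max normalization can also be written with groupby plus transform, which keeps the original row order and needs no custom apply function. A minimal sketch, assuming the ratings DataFrame from above (the column name Rating_norm2 is only for illustration):

# Per-user min-max normalization via transform (sketch; assumes every user
# has given at least two distinct ratings, otherwise this divides by zero).
ratings["Rating_norm2"] = ratings.groupby("UserID")["Rating"].transform(
    lambda s: (s - s.min()) / (s.max() - s.min()))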

Example 2: how to get the top-N rows of each group?

Get the top N days with the highest temperature in each month of 2018 (the call below uses topn=1).

In [5]:

fpath = "./datas/beijing_tianqi/beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)
# Strip the ℃ suffix from the temperature columns
df.loc[:, "bWendu"] = df["bWendu"].str.replace("℃", "").astype('int32')
df.loc[:, "yWendu"] = df["yWendu"].str.replace("℃", "").astype('int32')
# Add a month column
df['month'] = df['ymd'].str[:7]
df.head()

Out[5]:

|  | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel | month |
| ---: | :--- | ---: | ---: | :--- | :--- | :--- | ---: | :--- | ---: | :--- |
| 0 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 | 2018-01 |
| 1 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 | 2018-01 |
| 2 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 | 2018-01 |
| 3 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 | 2018-01 |
| 4 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 | 2018-01 |

In [6]:

def getWenduTopN(df, topn):
    """
    Here df is the sub-DataFrame for one month's group
    """
    return df.sort_values(by="bWendu")[["ymd", "bWendu"]][-topn:]

df.groupby("month").apply(getWenduTopN, topn=1).head()

Out[6]:

| month |  | ymd | bWendu |
| :--- | ---: | :--- | ---: |
| 2018-01 | 18 | 2018-01-19 | 7 |
| 2018-02 | 56 | 2018-02-26 | 12 |
| 2018-03 | 85 | 2018-03-27 | 27 |
| 2018-04 | 118 | 2018-04-29 | 30 |
| 2018-05 | 150 | 2018-05-31 | 35 |

Note that the DataFrame returned by groupby's apply can be completely different from the original DataFrame.
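For comparison, the same per-group top-N can also be obtained without apply, by sorting once and then taking the head of each group. A minimal sketch, assuming the weather df with its month column from above (top1 is an illustrative name):

# Sort by temperature descending, then keep the first row of each month.
top1 = (df.sort_values("bWendu", ascending=False)
          .groupby("month")
          .head(1)[["month", "ymd", "bWendu"]]
          .sort_values("month"))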

20. Pivoting data with Pandas stack and pivot

  1. Aggregate the data into multi-dimensional metrics
  2. Use unstack for a two-dimensional pivot
  3. Simplify the pivot with pivot
  4. Syntax of stack, unstack and pivot

1. Aggregate the data into multi-dimensional metrics

A very common analytics scenario: specify several dimensions and compute the aggregated metrics.

Example: from the "movie ratings dataset", count how many times each score was given in each month: (month, score 1~5, count)

In [1]:

import pandas as pd
import numpy as np
%matplotlib inline

In [2]:

df = pd.read_csv(
    "./datas/movielens-1m/ratings.dat",
    header=None,
    names="UserID::MovieID::Rating::Timestamp".split("::"),
    sep="::",
    engine="python"
)

In [3]:

df.head()

Out[3]:

|  | UserID | MovieID | Rating | Timestamp |
| ---: | -----: | ------: | -----: | --------: |
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |

In [4]:

df["pdate"] = pd.to_datetime(df["Timestamp"], unit='s')

In [5]:

df.head()

Out[5]:

|  | UserID | MovieID | Rating | Timestamp | pdate |
| ---: | -----: | ------: | -----: | --------: | :--- |
| 0 | 1 | 1193 | 5 | 978300760 | 2000-12-31 22:12:40 |
| 1 | 1 | 661 | 3 | 978302109 | 2000-12-31 22:35:09 |
| 2 | 1 | 914 | 3 | 978301968 | 2000-12-31 22:32:48 |
| 3 | 1 | 3408 | 4 | 978300275 | 2000-12-31 22:04:35 |
| 4 | 1 | 2355 | 5 | 978824291 | 2001-01-06 23:38:11 |

In [6]:

df.dtypes

Out[6]:

UserID                int64
MovieID               int64
Rating                int64
Timestamp             int64
pdate        datetime64[ns]
dtype: object

In [19]:

# Compute the statistics: number of ratings per (month, Rating)
df_group = df.groupby([df["pdate"].dt.month, "Rating"])["UserID"].agg(pv=np.size)

In [20]:

df_group.head(20)

Out[20]:

| pdate | Rating | pv |
| ---: | ---: | ---: |
| 1 | 1 | 1127 |
|  | 2 | 2608 |
|  | 3 | 6442 |
|  | 4 | 8400 |
|  | 5 | 4495 |
| 2 | 1 | 629 |
|  | 2 | 1464 |
|  | 3 | 3297 |
|  | 4 | 4403 |
|  | 5 | 2335 |
| 3 | 1 | 466 |
|  | 2 | 1077 |
|  | 3 | 2523 |
|  | 4 | 3032 |
|  | 5 | 1439 |
| 4 | 1 | 1048 |
|  | 2 | 2247 |
|  | 3 | 5501 |
|  | 4 | 6748 |
|  | 5 | 3863 |

With the data in this format, there is no way to plot the monthly trend of counts for each rating.

The data has to be reshaped so that each rating value becomes its own column.

2. Use unstack for a two-dimensional pivot

Goal: plot and compare the monthly count trend of each rating.

In [21]:

df_stack = df_group.unstack()
df_stack

Out[21]:

| pdate | pv (Rating=1) | pv (Rating=2) | pv (Rating=3) | pv (Rating=4) | pv (Rating=5) |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 1127 | 2608 | 6442 | 8400 | 4495 |
| 2 | 629 | 1464 | 3297 | 4403 | 2335 |
| 3 | 466 | 1077 | 2523 | 3032 | 1439 |
| 4 | 1048 | 2247 | 5501 | 6748 | 3863 |
| 5 | 4557 | 7631 | 18481 | 25769 | 17840 |
| 6 | 3196 | 6500 | 15211 | 21838 | 14365 |
| 7 | 4891 | 9566 | 25421 | 34957 | 22169 |
| 8 | 10873 | 20597 | 50509 | 64198 | 42497 |
| 9 | 3107 | 5873 | 14702 | 19927 | 13182 |
| 10 | 2121 | 4785 | 12175 | 16095 | 10324 |
| 11 | 17701 | 32202 | 76069 | 102448 | 67041 |
| 12 | 6458 | 13007 | 30866 | 41156 | 26760 |

In [22]:

df_stack.plot()

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x1ba09b0ce48>

In [23]:

# unstack and stack are inverse operations
df_stack.stack().head(20)

Out[23]:

| pdate | Rating | pv |
| ---: | ---: | ---: |
| 1 | 1 | 1127 |
|  | 2 | 2608 |
|  | 3 | 6442 |
|  | 4 | 8400 |
|  | 5 | 4495 |
| 2 | 1 | 629 |
|  | 2 | 1464 |
|  | 3 | 3297 |
|  | 4 | 4403 |
|  | 5 | 2335 |
| 3 | 1 | 466 |
|  | 2 | 1077 |
|  | 3 | 2523 |
|  | 4 | 3032 |
|  | 5 | 1439 |
| 4 | 1 | 1048 |
|  | 2 | 2247 |
|  | 3 | 5501 |
|  | 4 | 6748 |
|  | 5 | 3863 |

3. Simplify the pivot with pivot

In [24]:

df_group.head(20)

Out[24]:

| pdate | Rating | pv |
| ---: | ---: | ---: |
| 1 | 1 | 1127 |
|  | 2 | 2608 |
|  | 3 | 6442 |
|  | 4 | 8400 |
|  | 5 | 4495 |
| 2 | 1 | 629 |
|  | 2 | 1464 |
|  | 3 | 3297 |
|  | 4 | 4403 |
|  | 5 | 2335 |
| 3 | 1 | 466 |
|  | 2 | 1077 |
|  | 3 | 2523 |
|  | 4 | 3032 |
|  | 5 | 1439 |
| 4 | 1 | 1048 |
|  | 2 | 2247 |
|  | 3 | 5501 |
|  | 4 | 6748 |
|  | 5 | 3863 |

In [25]:

df_reset = df_group.reset_index()
df_reset.head()

Out[25]:

|  | pdate | Rating | pv |
| ---: | ---: | ---: | ---: |
| 0 | 1 | 1 | 1127 |
| 1 | 1 | 2 | 2608 |
| 2 | 1 | 3 | 6442 |
| 3 | 1 | 4 | 8400 |
| 4 | 1 | 5 | 4495 |

In [26]:

df_pivot = df_reset.pivot("pdate", "Rating", "pv")

In [27]:

df_pivot.head()

Out[27]:

| pdate | Rating=1 | Rating=2 | Rating=3 | Rating=4 | Rating=5 |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 1127 | 2608 | 6442 | 8400 | 4495 |
| 2 | 629 | 1464 | 3297 | 4403 | 2335 |
| 3 | 466 | 1077 | 2523 | 3032 | 1439 |
| 4 | 1048 | 2247 | 5501 | 6748 | 3863 |
| 5 | 4557 | 7631 | 18481 | 25769 | 17840 |

In [28]:

df_pivot.plot()

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x1ba09db6d48>

*The pivot method is equivalent to calling set_index on the df to build a hierarchical index and then calling unstack.*
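Spelled out, that equivalence looks like this; a minimal sketch using df_reset from above (df_pivot2 is an illustrative name):

# set_index + unstack yields the same table as df_reset.pivot("pdate", "Rating", "pv").
df_pivot2 = df_reset.set_index(["pdate", "Rating"])["pv"].unstack()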

4. Syntax of stack, unstack and pivot

stack: DataFrame.stack(level=-1, dropna=True) moves a column level into the row index, like standing a book that was lying flat upright.

level=-1 means the innermost level of the MultiIndex; pass level=0, 1, 2, ... to target a specific level.

unstack: DataFrame.unstack(level=-1, fill_value=None) moves an index level into the columns, like laying an upright book flat.

pivot: DataFrame.pivot(index=None, columns=None, values=None) builds a two-dimensional pivot from the given index, columns and values.
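A tiny self-contained sketch of the stack/unstack round trip (all names here are illustrative):

import pandas as pd

# A small wide table: rows = month, columns = rating.
wide = pd.DataFrame({1: [10, 20], 2: [30, 40]},
                    index=pd.Index(["2018-01", "2018-02"], name="month"))
wide.columns.name = "rating"

long = wide.stack()          # columns -> innermost index level (a Series)
wide_again = long.unstack()  # innermost index level -> columns again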

21. How Pandas handles dates quickly and conveniently

What Pandas date handling is for: mapping many date formats such as 2018-01-01 and 1/1/2018 onto one uniform object, and providing powerful functionality on that object.

A few concepts:

  1. pd.to_datetime: a pandas function that converts strings, lists or Series into date form
  2. Timestamp: the pandas object that represents a single date
  3. DatetimeIndex: the pandas object that represents a list of dates

Where:

  • A DatetimeIndex is a list of Timestamps
  • pd.to_datetime applied to a single date string returns a Timestamp
  • pd.to_datetime applied to a list of date strings returns a DatetimeIndex
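A quick sketch of the three concepts above:

import pandas as pd

pd.to_datetime("2018-01-01")                # a single Timestamp('2018-01-01 00:00:00')
pd.to_datetime(["2018-01-01", "1/2/2018"])  # a DatetimeIndex with two entries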

Question: how do we compute the highest temperature per week, per month and per quarter?

1. Read the weather data into a DataFrame

In [1]:

import pandas as pd
%matplotlib inline

In [2]:

fpath = "./datas/beijing_tianqi/beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)
# Strip the ℃ suffix from the temperature columns
df.loc[:, "bWendu"] = df["bWendu"].str.replace("℃", "").astype('int32')
df.loc[:, "yWendu"] = df["yWendu"].str.replace("℃", "").astype('int32')
df.head()

Out[2]:

|  | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
| ---: | :--- | ---: | ---: | :--- | :--- | :--- | ---: | :--- | ---: |
| 0 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 |
| 1 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 |
| 2 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 |
| 3 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 |
| 4 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 |

2. Convert the date column into pandas dates

In [3]:

df.set_index(pd.to_datetime(df["ymd"]), inplace=True)

In [4]:

df.head()

Out[4]:

| ymd (index) | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
| :--- | :--- | ---: | ---: | :--- | :--- | :--- | ---: | :--- | ---: |
| 2018-01-01 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 |
| 2018-01-02 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 |
| 2018-01-03 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 |
| 2018-01-04 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 |
| 2018-01-05 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 |

In [5]:

df.index

Out[5]:

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
               '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
               '2018-12-30', '2018-12-31'],
              dtype='datetime64[ns]', name='ymd', length=365, freq=None)

In [6]:

# A DatetimeIndex is a list of Timestamps
df.index[0]

Out[6]:

Timestamp('2018-01-01 00:00:00')

3. Convenient queries on the DatetimeIndex

In [7]:

# Select a single fixed day
df.loc['2018-01-05']

Out[7]:

ymd          2018-01-05
bWendu                3
yWendu               -6
tianqi             多云~晴
fengxiang           西北风
fengli             1-2级
aqi                  50
aqiInfo               优
aqiLevel              1
Name: 2018-01-05 00:00:00, dtype: object

In [8]:

# A date range
df.loc['2018-01-05':'2018-01-10']

Out[8]:

ymdbWenduyWendutianqifengxiangfengliaqiaqiInfoaqiLevel
ymd
2018-01-052018-01-053-6多云~晴西北风1-2级501
2018-01-062018-01-062-5多云~阴西南风1-2级321
2018-01-072018-01-072-4阴~多云西南风1-2级592
2018-01-082018-01-082-6西北风4-5级501
2018-01-092018-01-091-8西北风3-4级341
2018-01-102018-01-10-2-10西北风1-2级261

In [10]:

# Filter by month prefix
df.loc['2018-03']

Out[10]:

ymdbWenduyWendutianqifengxiangfengliaqiaqiInfoaqiLevel
ymd
2018-03-012018-03-018-3多云西南风1-2级461
2018-03-022018-03-029-1晴~多云北风1-2级952
2018-03-032018-03-03133多云~阴北风1-2级214重度污染5
2018-03-042018-03-047-2阴~多云东南风1-2级144轻度污染3
2018-03-052018-03-058-3南风1-2级942
2018-03-062018-03-066-3多云~阴东南风3-4级672
2018-03-072018-03-076-2阴~多云北风1-2级652
2018-03-082018-03-088-4东北风1-2级622
2018-03-092018-03-0910-2多云西南风1-2级132轻度污染3
2018-03-102018-03-1014-2东南风1-2级171中度污染4
2018-03-112018-03-11110多云南风1-2级812
2018-03-122018-03-12153多云~晴南风1-2级174中度污染4
2018-03-132018-03-13175晴~多云南风1-2级287重度污染5
2018-03-142018-03-14156多云~阴东北风1-2级293重度污染5
2018-03-152018-03-1512-1多云~晴东北风3-4级702
2018-03-162018-03-1610-1多云南风1-2级582
2018-03-172018-03-1740小雨~阴南风1-2级812
2018-03-182018-03-18131多云~晴西南风1-2级134轻度污染3
2018-03-192018-03-19132多云东风1-2级107轻度污染3
2018-03-202018-03-2010-2多云南风1-2级411
2018-03-212018-03-21111多云西南风1-2级762
2018-03-222018-03-22174晴~多云西南风1-2级112轻度污染3
2018-03-232018-03-23185多云北风1-2级146轻度污染3
2018-03-242018-03-24225西南风1-2级119轻度污染3
2018-03-252018-03-25247南风1-2级782
2018-03-262018-03-26257多云西南风1-2级151中度污染4
2018-03-272018-03-272711南风1-2级243重度污染5
2018-03-282018-03-28259多云~晴东风1-2级387严重污染6
2018-03-292018-03-29197南风1-2级119轻度污染3
2018-03-302018-03-30188多云南风1-2级682
2018-03-312018-03-31239多云~晴南风1-2级125轻度污染3

In [11]:

# Filter by a range of month prefixes
df.loc["2018-07":"2018-09"].index

Out[11]:

DatetimeIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04',
               '2018-07-05', '2018-07-06', '2018-07-07', '2018-07-08',
               '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12',
               '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16',
               '2018-07-17', '2018-07-18', '2018-07-19', '2018-07-20',
               '2018-07-21', '2018-07-22', '2018-07-23', '2018-07-24',
               '2018-07-25', '2018-07-26', '2018-07-27', '2018-07-28',
               '2018-07-29', '2018-07-30', '2018-07-31', '2018-08-01',
               '2018-08-02', '2018-08-03', '2018-08-04', '2018-08-05',
               '2018-08-06', '2018-08-07', '2018-08-08', '2018-08-09',
               '2018-08-10', '2018-08-11', '2018-08-12', '2018-08-13',
               '2018-08-14', '2018-08-15', '2018-08-16', '2018-08-17',
               '2018-08-18', '2018-08-19', '2018-08-20', '2018-08-21',
               '2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25',
               '2018-08-26', '2018-08-27', '2018-08-28', '2018-08-29',
               '2018-08-30', '2018-08-31', '2018-09-01', '2018-09-02',
               '2018-09-03', '2018-09-04', '2018-09-05', '2018-09-06',
               '2018-09-07', '2018-09-08', '2018-09-09', '2018-09-10',
               '2018-09-11', '2018-09-12', '2018-09-13', '2018-09-14',
               '2018-09-15', '2018-09-16', '2018-09-17', '2018-09-18',
               '2018-09-19', '2018-09-20', '2018-09-21', '2018-09-22',
               '2018-09-23', '2018-09-24', '2018-09-25', '2018-09-26',
               '2018-09-27', '2018-09-28', '2018-09-29', '2018-09-30'],
              dtype='datetime64[ns]', name='ymd', freq=None)

In [12]:

# Filter by year prefix
df.loc["2018"].head()

Out[12]:

| ymd (index) | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
| :--- | :--- | ---: | ---: | :--- | :--- | :--- | ---: | :--- | ---: |
| 2018-01-01 | 2018-01-01 | 3 | -6 | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 |
| 2018-01-02 | 2018-01-02 | 2 | -5 | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 |
| 2018-01-03 | 2018-01-03 | 2 | -5 | 多云 | 北风 | 1-2级 | 28 | 优 | 1 |
| 2018-01-04 | 2018-01-04 | 0 | -8 | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 |
| 2018-01-05 | 2018-01-05 | 3 | -6 | 多云~晴 | 西北风 | 1-2级 | 50 | 优 | 1 |

4. Conveniently get the week, month and quarter

Timestamp and DatetimeIndex expose a large number of attributes for extracting date components:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
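The .week attribute used below works in the pandas version of these notes; in newer pandas (1.1+) DatetimeIndex.week is deprecated, and the ISO week number is obtained via isocalendar() instead. A minimal sketch:

# Equivalent week numbers in newer pandas versions.
week_numbers = df.index.isocalendar().week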

In [13]:

# Week numbers
df.index.week

Out[13]:

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  2,  2,  2,
            ...
            51, 51, 52, 52, 52, 52, 52, 52, 52,  1],
           dtype='int64', name='ymd', length=365)

In [14]:

# Month numbers
df.index.month

Out[14]:

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
           dtype='int64', name='ymd', length=365)

In [15]:

# Quarter numbers
df.index.quarter

Out[15]:

Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
           dtype='int64', name='ymd', length=365)

5. Highest temperature per week, per month and per quarter

Weekly statistics

In [16]:

df.groupby(df.index.week)["bWendu"].max().head()

Out[16]:

ymd
1    3
2    6
3    7
4   -1
5    4
Name: bWendu, dtype: int32

In [17]:

df.groupby(df.index.week)["bWendu"].max().plot()

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x23300b75b88>

Monthly statistics

In [18]:

df.groupby(df.index.month)["bWendu"].max()

Out[18]:

ymd
1      7
2     12
3     27
4     30
5     35
6     38
7     37
8     36
9     31
10    25
11    18
12    10
Name: bWendu, dtype: int32

In [19]:

df.groupby(df.index.month)["bWendu"].max().plot()

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x23302dac4c8>

Quarterly statistics

In [20]:

df.groupby(df.index.quarter)["bWendu"].max()

Out[20]:

ymd
1    27
2    38
3    37
4    25
Name: bWendu, dtype: int32

In [21]:

df.groupby(df.index.quarter)["bWendu"].max().plot()

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x23302e338c8>

22. How does Pandas handle missing dates in a date index?

Problem: data aggregated by date is missing some days, so the series is incomplete. How do we fill in the missing dates?

This can be done in two ways:
1. DataFrame.reindex: conform the DataFrame to a new index
2. DataFrame.resample: resample the time series, which can also fill in missing values

Question: how do we fill in a missing index?

In [1]:

import pandas as pd
%matplotlib inline

In [2]:

df = pd.DataFrame({
    "pdate": ["2019-12-01", "2019-12-02", "2019-12-04", "2019-12-05"],
    "pv": [100, 200, 400, 500],
    "uv": [10, 20, 40, 50],
})

df

Out[2]:

|  | pdate | pv | uv |
| ---: | :--- | ---: | ---: |
| 0 | 2019-12-01 | 100 | 10 |
| 1 | 2019-12-02 | 200 | 20 |
| 2 | 2019-12-04 | 400 | 40 |
| 3 | 2019-12-05 | 500 | 50 |

In [3]:

df.set_index("pdate").plot()

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a0d908bf48>

*Problem: the data for 2019-12-03 is missing here, so the series is incomplete. How do we fill it in?*

Method 1: use DataFrame.reindex

1. Turn the df's index into a date index

In [4]:

df_date = df.set_index("pdate")
df_date

Out[4]:

| pdate | pv | uv |
| :--- | ---: | ---: |
| 2019-12-01 | 100 | 10 |
| 2019-12-02 | 200 | 20 |
| 2019-12-04 | 400 | 40 |
| 2019-12-05 | 500 | 50 |

In [5]:

df_date.index

Out[5]:

Index(['2019-12-01', '2019-12-02', '2019-12-04', '2019-12-05'], dtype='object', name='pdate')

In [6]:

# Set the df's index to a DatetimeIndex
df_date = df_date.set_index(pd.to_datetime(df_date.index))
df_date

Out[6]:

| pdate | pv | uv |
| :--- | ---: | ---: |
| 2019-12-01 | 100 | 10 |
| 2019-12-02 | 200 | 20 |
| 2019-12-04 | 400 | 40 |
| 2019-12-05 | 500 | 50 |

In [7]:

df_date.index

Out[7]:

DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-04', '2019-12-05'], dtype='datetime64[ns]', name='pdate', freq=None)

2. Use DataFrame.reindex to fill in the missing index entries

In [8]:

# Generate the complete date range
pdates = pd.date_range(start="2019-12-01", end="2019-12-05")
pdates

Out[8]:

DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-03', '2019-12-04',
               '2019-12-05'],
              dtype='datetime64[ns]', freq='D')

In [9]:

df_date_new = df_date.reindex(pdates, fill_value=0)
df_date_new

Out[9]:

|  | pv | uv |
| :--- | ---: | ---: |
| 2019-12-01 | 100 | 10 |
| 2019-12-02 | 200 | 20 |
| 2019-12-03 | 0 | 0 |
| 2019-12-04 | 400 | 40 |
| 2019-12-05 | 500 | 50 |

In [10]:

df_date_new.plot()

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a0db1ab388>

Method 2: use DataFrame.resample

1. First turn the index into a date index

In [11]:

df

Out[11]:

|  | pdate | pv | uv |
| ---: | :--- | ---: | ---: |
| 0 | 2019-12-01 | 100 | 10 |
| 1 | 2019-12-02 | 200 | 20 |
| 2 | 2019-12-04 | 400 | 40 |
| 3 | 2019-12-05 | 500 | 50 |

In [12]:

df_new2 = df.set_index(pd.to_datetime(df["pdate"])).drop("pdate", axis=1)
df_new2

Out[12]:

| pdate | pv | uv |
| :--- | ---: | ---: |
| 2019-12-01 | 100 | 10 |
| 2019-12-02 | 200 | 20 |
| 2019-12-04 | 400 | 40 |
| 2019-12-05 | 500 | 50 |

In [13]:

df_new2.index

Out[13]:

DatetimeIndex(['2019-12-01', '2019-12-02', '2019-12-04', '2019-12-05'], dtype='datetime64[ns]', name='pdate', freq=None)

2. Resample the dataframe by day with resample

What resample means:
changing the time frequency of the data, e.g. turning daily data into monthly data, or hourly data into minute-level data.

resample syntax:
(DataFrame or Series).resample(arguments).(aggregate function)

resample rule (offset alias) reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
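Other fill strategies use the same syntax; a minimal sketch (assuming the df_new2 frame built above) that forward-fills the missing day instead of filling it with 0:

# Upsample to daily frequency and carry the previous day's values forward.
df_filled = df_new2.resample("D").ffill()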

In [14]:

# Resampling collapses each interval into a single value, so an aggregation such as mean must be specified
df_new2 = df_new2.resample("D").mean().fillna(0)
df_new2

Out[14]:

| pdate | pv | uv |
| :--- | ---: | ---: |
| 2019-12-01 | 100.0 | 10.0 |
| 2019-12-02 | 200.0 | 20.0 |
| 2019-12-03 | 0.0 | 0.0 |
| 2019-12-04 | 400.0 | 40.0 |
| 2019-12-05 | 500.0 | 50.0 |

In [15]:

# Resampling to a 2-day frequency
df_new2.resample("2D").mean()

Out[15]:

| pdate | pv | uv |
| :--- | ---: | ---: |
| 2019-12-01 | 150.0 | 15.0 |
| 2019-12-03 | 200.0 | 20.0 |
| 2019-12-05 | 500.0 | 50.0 |

23. How to do Excel's VLOOKUP in Pandas and put the new columns right after a given column?

Background:

  1. There are two Excel files that share one common column;
  2. Merge them on that column into one big Excel file (Excel's VLOOKUP), with the requirements:
    • only a few columns of the second file are needed, e.g. 2 columns picked out of 40
    • the new columns taken from the second file must go right after a specified column of the first file;
  3. Write the result to a new Excel file;

Step 1: read the two tables

In [1]:

import pandas as pd

In [2]:

# Student grades table
df_grade = pd.read_excel("./course_datas/c23_excel_vlookup/学生成绩表.xlsx") 
df_grade.head()

Out[2]:

|  | 班级 | 学号 | 语文成绩 | 数学成绩 | 英语成绩 |
| ---: | :--- | :--- | ---: | ---: | ---: |
| 0 | C01 | S001 | 99 | 84 | 88 |
| 1 | C01 | S002 | 66 | 95 | 77 |
| 2 | C01 | S003 | 68 | 68 | 61 |
| 3 | C01 | S004 | 63 | 66 | 82 |
| 4 | C01 | S005 | 72 | 95 | 94 |

In [3]:

# Student info table
df_sinfo = pd.read_excel("./course_datas/c23_excel_vlookup/学生信息表.xlsx") 
df_sinfo.head()

Out[3]:

|  | 学号 | 姓名 | 性别 | 年龄 | 籍贯 |
| ---: | :--- | :--- | :--- | ---: | :--- |
| 0 | S001 | 怠涵 |  | 23 | 山东 |
| 1 | S002 | 婉清 |  | 25 | 河南 |
| 2 | S003 | 溪榕 |  | 23 | 湖北 |
| 3 | S004 | 漠涓 |  | 19 | 陕西 |
| 4 | S005 | 祈博 |  | 24 | 山东 |

*Goal: take the 姓名 (name) and 性别 (sex) columns from the second table, the student info table, add them to the first table, the student grades table, and place them right after that table's 学号 (student ID) column.*

Step 2: join the two tables

i.e. Excel's vlookup functionality

In [4]:

# Keep only a few columns of the second table
df_sinfo = df_sinfo[["学号", "姓名", "性别"]]
df_sinfo.head()

Out[4]:

|  | 学号 | 姓名 | 性别 |
| ---: | :--- | :--- | :--- |
| 0 | S001 | 怠涵 |  |
| 1 | S002 | 婉清 |  |
| 2 | S003 | 溪榕 |  |
| 3 | S004 | 漠涓 |  |
| 4 | S005 | 祈博 |  |

In [5]:

df_merge = pd.merge(left=df_grade, right=df_sinfo, left_on="学号", right_on="学号")
df_merge.head()

Out[5]:

|  | 班级 | 学号 | 语文成绩 | 数学成绩 | 英语成绩 | 姓名 | 性别 |
| ---: | :--- | :--- | ---: | ---: | ---: | :--- | :--- |
| 0 | C01 | S001 | 99 | 84 | 88 | 怠涵 |  |
| 1 | C01 | S002 | 66 | 95 | 77 | 婉清 |  |
| 2 | C01 | S003 | 68 | 68 | 61 | 溪榕 |  |
| 3 | C01 | S004 | 63 | 66 | 82 | 漠涓 |  |
| 4 | C01 | S005 | 72 | 95 | 94 | 祈博 |  |

Step 3: adjust the column order

In [6]:

df_merge.columns

Out[6]:

Index(['班级', '学号', '语文成绩', '数学成绩', '英语成绩', '姓名', '性别'], dtype='object')

Question: how do we move the '姓名' and '性别' columns so they sit right after '学号'?

This is ordinary Python list manipulation.

In [7]:

# Turn the columns into a Python list
new_columns = df_merge.columns.to_list()
new_columns

Out[7]:

['班级', '学号', '语文成绩', '数学成绩', '英语成绩', '姓名', '性别']

In [8]:

# Insert in reverse order, so that "姓名" and "性别" end up right after "学号"
for name in ["姓名", "性别"][::-1]:
    new_columns.remove(name)
    new_columns.insert(new_columns.index("学号")+1, name)

In [9]:

new_columns

Out[9]:

['班级', '学号', '姓名', '性别', '语文成绩', '数学成绩', '英语成绩']

In [10]:

df_merge = df_merge.reindex(columns=new_columns)
df_merge.head()

Out[10]:

|  | 班级 | 学号 | 姓名 | 性别 | 语文成绩 | 数学成绩 | 英语成绩 |
| ---: | :--- | :--- | :--- | :--- | ---: | ---: | ---: |
| 0 | C01 | S001 | 怠涵 |  | 99 | 84 | 88 |
| 1 | C01 | S002 | 婉清 |  | 66 | 95 | 77 |
| 2 | C01 | S003 | 溪榕 |  | 68 | 68 | 61 |
| 3 | C01 | S004 | 漠涓 |  | 63 | 66 | 82 |
| 4 | C01 | S005 | 祈博 |  | 72 | 95 | 94 |
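An alternative not used in the course: starting again from the merge result in Out[5], the same reordering can be done in place with pop and insert instead of building a new column list:

# Move 姓名 and 性别 so that they follow 学号 (illustrative sketch).
for i, name in enumerate(["姓名", "性别"]):
    col = df_merge.pop(name)  # remove the column and keep its data
    df_merge.insert(df_merge.columns.get_loc("学号") + 1 + i, name, col)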

Step 4: write the final Excel file

In [11]:

df_merge.to_excel("./course_datas/c23_excel_vlookup/合并后的数据表.xlsx", index=False)

24. How to draw interactive line charts with Pandas and Pyecharts?

Background:

  • Pandas is Python's workhorse library for data analysis
  • Echarts is Baidu's excellent open-source charting library, and Pyecharts is its Python wrapper

1. Read the data

In [1]:

import pandas as pd

In [2]:

xlsx_path = "./datas/stocks/baidu_stocks.xlsx"
df = pd.read_excel(xlsx_path, index_col="datetime", parse_dates=True)
df.head()

Out[2]:

| datetime | code | open | close | high | low | vol | p_change |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| 2019-12-03 | BIDU | 115.199997 | 114.800003 | 116.019997 | 113.300003 | 3493249 | -2.25 |
| 2019-12-02 | BIDU | 118.389999 | 117.440002 | 119.764999 | 116.400002 | 2203313 | -0.92 |
| 2019-11-29 | BIDU | 118.300003 | 118.529999 | 118.690002 | 117.599998 | 1917004 | -0.82 |
| 2019-11-27 | BIDU | 119.180000 | 119.510002 | 119.839996 | 118.440002 | 2341070 | 0.77 |
| 2019-11-26 | BIDU | 120.010002 | 118.599998 | 120.440002 | 118.099998 | 3813176 | -1.43 |

In [3]:

df.index

Out[3]:

DatetimeIndex(['2019-12-03', '2019-12-02', '2019-11-29', '2019-11-27',
               '2019-11-26', '2019-11-25', '2019-11-22', '2019-11-21',
               '2019-11-20', '2019-11-19',
               ...
               '2019-01-15', '2019-01-14', '2019-01-11', '2019-01-10',
               '2019-01-09', '2019-01-08', '2019-01-07', '2019-01-04',
               '2019-01-03', '2019-01-02'],
              dtype='datetime64[ns]', name='datetime', length=227, freq=None)

In [4]:

df.sort_index(inplace=True)
df.head()

Out[4]:

| datetime | code | open | close | high | low | vol | p_change |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| 2019-01-02 | BIDU | 156.179993 | 162.250000 | 164.330002 | 155.490005 | 2996952 | NaN |
| 2019-01-03 | BIDU | 158.750000 | 154.710007 | 159.880005 | 153.779999 | 3879180 | -4.65 |
| 2019-01-04 | BIDU | 157.600006 | 160.949997 | 162.429993 | 157.250000 | 3847497 | 4.03 |
| 2019-01-07 | BIDU | 162.600006 | 162.600006 | 164.490005 | 158.509995 | 3266091 | 1.03 |
| 2019-01-08 | BIDU | 162.190002 | 163.399994 | 163.889999 | 158.160004 | 3253361 | 0.49 |

2. Draw a line chart with Pyecharts

In [5]:

# If not installed yet: pip install pyecharts
from pyecharts.charts import Line
from pyecharts import options as opts

In [6]:

# Line chart
line = Line()

# x axis
line.add_xaxis(df.index.to_list())

# One y-axis series per column
line.add_yaxis("开盘价", df["open"].round(2).to_list())
line.add_yaxis("收盘价", df["close"].round(2).to_list())

# Chart options
line.set_global_opts(
    title_opts=opts.TitleOpts(title="百度股票2019年"),
    tooltip_opts=opts.TooltipOpts(trigger="axis", axis_pointer_type="cross")
)

Out[6]:

<pyecharts.charts.basic_charts.line.Line at 0x201bd51d088>

In [7]:

# Render the chart in the notebook
line.render_notebook()

25. Titanic survival prediction with Pandas and scikit-learn

Goal of this example: predict survival on the Titanic

Steps:
1. Input data: read the training data with Pandas (historical data whose special property is that we already know whether each person survived)
2. Train a model with scikit-learn
3. Use the model: for a new person whose outcome is unknown, estimate their probability of survival

Step 1: read the training data

In [1]:

import pandas as pd

In [2]:

df_train = pd.read_csv("./datas/titanic/titanic_train.csv")
df_train.head()

Out[2]:

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| ---: | ---: | ---: | ---: | :--- | :--- | ---: | ---: | ---: | :--- | ---: | :--- | :--- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

*Here Survived==1 means the person survived and ==0 means they did not; the remaining columns describe the person and their cabin and ticket details.*

In [3]:

# We pick just two columns as the features used for prediction
feature_cols = ['Pclass', 'Parch']
X = df_train.loc[:, feature_cols]
X.head()

Out[3]:

|  | Pclass | Parch |
| ---: | ---: | ---: |
| 0 | 3 | 0 |
| 1 | 1 | 0 |
| 2 | 3 | 0 |
| 3 | 1 | 0 |
| 4 | 3 | 0 |

In [4]:

# Extract the survival column on its own as the prediction target
y = df_train.Survived
y.head()

Out[4]:

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Step 2: train the model

In [5]:

from sklearn.linear_model import LogisticRegression
# Create the model object
logreg = LogisticRegression()

# Train the model
logreg.fit(X, y)
//anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [6]:

logreg.score(X, y)

Out[6]:

0.6879910213243546

Step 3: use the model on unseen data

The core goal of machine learning is to use the model to predict unknown cases,

for example whether a stock will go up or down tomorrow, roughly what a newly listed second-hand flat will sell for, or which videos a user is most likely to watch when opening an app.
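The notebook scores the model on the data it was trained on; as a rough sketch (not part of the original notebook), a held-out test set gives a more honest estimate of how the model behaves on unseen passengers:

# Hold out 20% of the rows for testing (illustrative sketch).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logreg2 = LogisticRegression(solver="lbfgs")
logreg2.fit(X_train, y_train)
logreg2.score(X_test, y_test)   # accuracy on rows the model never saw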

In [7]:

# Look for a feature combination that does not occur in the historical data
X.drop_duplicates().sort_values(by=["Pclass", "Parch"])

Out[7]:

|  | Pclass | Parch |
| ---: | ---: | ---: |
| 1 | 1 | 0 |
| 54 | 1 | 1 |
| 27 | 1 | 2 |
| 438 | 1 | 4 |
| 9 | 2 | 0 |
| 98 | 2 | 1 |
| 43 | 2 | 2 |
| 437 | 2 | 3 |
| 0 | 3 | 0 |
| 7 | 3 | 1 |
| 8 | 3 | 2 |
| 86 | 3 | 3 |
| 167 | 3 | 4 |
| 13 | 3 | 5 |
| 678 | 3 | 6 |

In [8]:

# Predict the class (survived or not) for this feature combination
logreg.predict([[2, 4]])

Out[8]:

array([1])

In [9]:

logreg.predict_proba([[2, 4]])

Out[9]:

array([[0.35053893, 0.64946107]])

26. Processing and analysing raw website access logs with Pandas

Goal: a real-world exercise in data processing and analysis with Pandas

Example:
Data source: the access log of the wordpress blog http://www.crazyant.net/

Steps:
1. Read, clean and format the data
2. Compute the share of spider (crawler) traffic and plot a bar chart
3. Compute the share of each HTTP status code and plot a pie chart
4. Compute hourly and daily PV/UV trends and plot line charts

1. Read, clean and format the data

In [1]:

import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', -1)

from pyecharts import options as opts
from pyecharts.charts import Bar,Pie,Line

In [2]:

# Read the whole directory and concatenate every file into one dataframe
data_dir = "./datas/crazyant/blog_access_log"

df_list = []

import os
for fname in os.listdir(f"{data_dir}"):
    df_list.append(pd.read_csv(f"{data_dir}/{fname}", sep=" ", header=None, error_bad_lines=False))

df = pd.concat(df_list)
b'Skipping line 2245: expected 10 fields, saw 16\nSkipping line 2889: expected 10 fields, saw 14\nSkipping line 2890: expected 10 fields, saw 14\nSkipping line 2891: expected 10 fields, saw 13\nSkipping line 2892: expected 10 fields, saw 13\nSkipping line 2900: expected 10 fields, saw 11\nSkipping line 2902: expected 10 fields, saw 11\nSkipping line 3790: expected 10 fields, saw 14\nSkipping line 3791: expected 10 fields, saw 14\nSkipping line 3792: expected 10 fields, saw 13\nSkipping line 3793: expected 10 fields, saw 13\nSkipping line 3833: expected 10 fields, saw 11\nSkipping line 3835: expected 10 fields, saw 11\nSkipping line 9936: expected 10 fields, saw 16\n'
b'Skipping line 11748: expected 10 fields, saw 11\nSkipping line 11750: expected 10 fields, saw 11\n'

In [3]:

df.head()

Out[3]:

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ---: | :--- | :--- | :--- | :--- | :--- | :--- | ---: | ---: | :--- | :--- |
| 0 | 106.11.153.226 | - | - | [02/Dec/2019:22:40:18 | +0800] | GET /740.html?replytocom=1194 HTTP/1.0 | 200 | 13446 | - | YisouSpider |
| 1 | 42.156.254.60 | - | - | [02/Dec/2019:22:40:23 | +0800] | POST /wp-json/wordpress-popular-posts/v1/popular-posts HTTP/1.0 | 201 | 55 | http://www.crazyant.net/740.html?replytocom=1194 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 |
| 2 | 106.11.159.254 | - | - | [02/Dec/2019:22:40:27 | +0800] | GET /576.html HTTP/1.0 | 200 | 13461 | - | YisouSpider |
| 3 | 106.11.157.254 | - | - | [02/Dec/2019:22:40:28 | +0800] | GET /?lwfcdw=t9n2d3&oqzohc=m5e7j1&oubyvq=iab6a3&oudmbg=6osqd3 HTTP/1.0 | 200 | 10485 | - | YisouSpider |
| 4 | 42.156.137.109 | - | - | [02/Dec/2019:22:40:30 | +0800] | POST /wp-json/wordpress-popular-posts/v1/popular-posts HTTP/1.0 | 201 | 55 | http://www.crazyant.net/576.html | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 |

In [4]:

df = df[[0, 3, 6, 9]].copy()
df.head()

Out[4]:

|  | 0 | 3 | 6 | 9 |
| ---: | :--- | :--- | ---: | :--- |
| 0 | 106.11.153.226 | [02/Dec/2019:22:40:18 | 200 | YisouSpider |
| 1 | 42.156.254.60 | [02/Dec/2019:22:40:23 | 201 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 |
| 2 | 106.11.159.254 | [02/Dec/2019:22:40:27 | 200 | YisouSpider |
| 3 | 106.11.157.254 | [02/Dec/2019:22:40:28 | 200 | YisouSpider |
| 4 | 42.156.137.109 | [02/Dec/2019:22:40:30 | 201 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 |

In [5]:

df.columns = ["ip", "stime", "status", "client"]
df.head()

Out[5]:

|  | ip | stime | status | client |
| ---: | :--- | :--- | ---: | :--- |
| 0 | 106.11.153.226 | [02/Dec/2019:22:40:18 | 200 | YisouSpider |
| 1 | 42.156.254.60 | [02/Dec/2019:22:40:23 | 201 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 |
| 2 | 106.11.159.254 | [02/Dec/2019:22:40:27 | 200 | YisouSpider |
| 3 | 106.11.157.254 | [02/Dec/2019:22:40:28 | 200 | YisouSpider |
| 4 | 42.156.137.109 | [02/Dec/2019:22:40:30 | 201 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 |

In [6]:

df.dtypes

Out[6]:

ip        object
stime     object
status    int64 
client    object
dtype: object

2. Compute the share of spider traffic

In [7]:

df["is_spider"] = df["client"].str.lower().str.contains("spider")
df.head()

Out[7]:

|  | ip | stime | status | client | is_spider |
| ---: | :--- | :--- | ---: | :--- | :--- |
| 0 | 106.11.153.226 | [02/Dec/2019:22:40:18 | 200 | YisouSpider | True |
| 1 | 42.156.254.60 | [02/Dec/2019:22:40:23 | 201 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 | True |
| 2 | 106.11.159.254 | [02/Dec/2019:22:40:27 | 200 | YisouSpider | True |
| 3 | 106.11.157.254 | [02/Dec/2019:22:40:28 | 200 | YisouSpider | True |
| 4 | 42.156.137.109 | [02/Dec/2019:22:40:30 | 201 | Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36 | True |

In [8]:

df_spider = df["is_spider"].value_counts()
df_spider

Out[8]:

False    46641
True     3637 
Name: is_spider, dtype: int64

In [9]:

bar = (
        Bar()
        .add_xaxis([str(x) for x in df_spider.index])
        .add_yaxis("是否Spider", df_spider.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="爬虫访问量占比"))
)
bar.render_notebook()
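Steps 3 and 4 of the plan are not reproduced in these notes. As a rough sketch of step 4 (assuming the df built above, whose stime values look like "[02/Dec/2019:22:40:18"), hourly PV/UV could be computed with plain pandas like this:

# Parse the access time (strip the leading '[') and extract the hour.
df["time"] = pd.to_datetime(df["stime"].str.lstrip("["),
                            format="%d/%b/%Y:%H:%M:%S")

# PV = number of requests per hour, UV = number of distinct IPs per hour.
df_pvuv = df.groupby(df["time"].dt.hour).agg(
    pv=("ip", "size"),
    uv=("ip", "nunique"),
)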

27. How do you find the features that most influence the outcome with Pandas?

Use cases:

  • Feature selection in machine learning: removing useless features can improve model quality, reduce training time, and so on
  • In data analysis: finding the biggest drivers behind a swing in revenue

Worked example: in the Titanic disaster, which factors most influenced survival?

1. Import the relevant packages

In [1]:

import pandas as pd
import numpy as np

# SelectKBest: pick the K features that most influence the target
from sklearn.feature_selection import SelectKBest

# Chi-squared test, used as the scoring function for SelectKBest
from sklearn.feature_selection import chi2

2. Load the Titanic data

In [2]:

df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()

Out[2]:

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| ---: | ---: | ---: | ---: | :--- | :--- | ---: | ---: | ---: | :--- | ---: | :--- | :--- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

In [3]:

df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()

Out[3]:

|  | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| ---: | ---: | ---: | ---: | :--- | ---: | ---: | ---: | ---: | :--- |
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |

3. Clean and transform the data

3.1 Check for columns with missing values

In [4]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB

3.2 Fill the Age column with its median

In [5]:

df["Age"] = df["Age"].fillna(df["Age"].median())

In [6]:

df.head()

Out[6]:

|  | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| ---: | ---: | ---: | ---: | :--- | ---: | ---: | ---: | ---: | :--- |
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |

3.3 Convert the Sex column to numbers

In [7]:

# Sex values
df.Sex.unique()

Out[7]:

array(['male', 'female'], dtype=object)

In [8]:

df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1

In [9]:

df.head()

Out[9]:

|  | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | :--- |
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | S |

3.4 Fill missing values in Embarked and convert the strings to numbers

In [10]:

# Embarked
df.Embarked.unique()

Out[10]:

array(['S', 'C', 'Q', nan], dtype=object)

In [11]:

# Fill missing values
df["Embarked"] = df["Embarked"].fillna(0)

# Map the strings to numbers
df.loc[df["Embarked"] == "S", "Embarked"] = 1
df.loc[df["Embarked"] == "C", "Embarked"] = 2
df.loc[df["Embarked"] == "Q", "Embarked"] = 3

In [12]:

df.head()

Out[12]:

|  | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 1 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 1 |

4. Split the feature columns from the target column

In [13]:

y = df.pop("Survived")
X = df

In [14]:

X.head()

Out[14]:

|  | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 1 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 |
| 2 | 3 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 |
| 4 | 5 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 1 |

In [15]:

y.head()

Out[15]:

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

5. Use the chi-squared test to select the top-K features

In [16]:

# Select all the features, so that we can see the full importance ranking
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
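As a side note (not in the original notebook), with a smaller k the fitted selector can filter the feature matrix directly and report which columns were kept; a minimal sketch with k=4 (names are illustrative):

selector = SelectKBest(score_func=chi2, k=4).fit(X, y)
X_top4 = selector.transform(X)                  # ndarray containing only the 4 best features
top4_names = X.columns[selector.get_support()]  # labels of the selected columns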

6. Print the features in order of importance

In [17]:

df_scores = pd.DataFrame(fit.scores_)
df_scores

Out[17]:

|  | 0 |
| ---: | ---: |
| 0 | 3.312934 |
| 1 | 30.873699 |
| 2 | 170.348127 |
| 3 | 21.649163 |
| 4 | 2.581865 |
| 5 | 10.097499 |
| 6 | 4518.319091 |
| 7 | 2.771019 |

In [18]:

df_columns = pd.DataFrame(X.columns)
df_columns

Out[18]:

|  | 0 |
| ---: | :--- |
| 0 | PassengerId |
| 1 | Pclass |
| 2 | Sex |
| 3 | Age |
| 4 | SibSp |
| 5 | Parch |
| 6 | Fare |
| 7 | Embarked |

In [19]:

# Concatenate the two dataframes
df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
# Name the columns
df_feature_scores.columns = ['feature_name', 'Score']

# Inspect
df_feature_scores

Out[19]:

|  | feature_name | Score |
| ---: | :--- | ---: |
| 0 | PassengerId | 3.312934 |
| 1 | Pclass | 30.873699 |
| 2 | Sex | 170.348127 |
| 3 | Age | 21.649163 |
| 4 | SibSp | 2.581865 |
| 5 | Parch | 10.097499 |
| 6 | Fare | 4518.319091 |
| 7 | Embarked | 2.771019 |

In [20]:

df_feature_scores.sort_values(by="Score", ascending=False)

Out[20]:

|  | feature_name | Score |
| ---: | :--- | ---: |
| 6 | Fare | 4518.319091 |
| 2 | Sex | 170.348127 |
| 1 | Pclass | 30.873699 |
| 3 | Age | 21.649163 |
| 5 | Parch | 10.097499 |
| 0 | PassengerId | 3.312934 |
| 7 | Embarked | 2.771019 |
| 4 | SibSp | 2.581865 |