Putting Code to Use 03 - Analyzing Scraped Data with Pandas

kidult00 (41)in #cn • 7 years ago (edited)

Putting Code to Use 01 - An Introduction to the Scrapy Crawling Framework
Putting Code to Use 02 - Scraping Counselor Profiles from 简单心理 with Scrapy

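Next, let's try analyzing the counselors' prices.

How do we remove unwanted characters from a column?

In the price column, values are formatted like 600 元/次 (600 yuan per session). The Chinese characters get in the way of computing price statistics, so they need to be removed.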

Select the price column: df['price']
Strip the trailing 元/次 characters: str.rstrip()
Convert what remains to numbers: pd.to_numeric

In Pandas this can be written as:

# Assign the result back so the later statistics operate on numbers
df['price'] = pd.to_numeric(df['price'].str.rstrip('元/次'))

Result:

[Screenshot: the price column after conversion to numbers]
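
One caveat with str.rstrip(): its argument is a set of characters to strip, not a literal suffix. That is harmless here because what remains ends in a digit, but removing the exact substring is more explicit. A minimal sketch with made-up rows (the `sample` DataFrame, its values, and the non-numeric entry 面议 are assumptions for illustration, not the scraped data):

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped data
sample = pd.DataFrame({'price': ['600元/次', '1000元/次', '面议']})

# Remove the literal suffix rather than a set of characters
stripped = sample['price'].str.replace('元/次', '', regex=False)

# errors='coerce' turns values that are not numbers into NaN instead of raising
cleaned = pd.to_numeric(stripped, errors='coerce')

print(cleaned)
```

Rows that end up as NaN are skipped by mean(), max(), and min() by default, so one malformed price does not break the statistics.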

How do we compute price statistics?

Basic statistics such as the mean, maximum, and minimum are easy to get in Pandas with mean(), max(), and min():

print("Average price: {:.1f} yuan \nHighest price: {} yuan \nLowest price: {} yuan".format(df['price'].mean(), df['price'].max(), df['price'].min()))

Average price: 570.9 yuan
Highest price: 3000 yuan
Lowest price: 100 yuan

Pandas also provides the describe() function, which quickly returns summary statistics:

[Screenshot: describe() output]
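
As a small sketch of what describe() returns, here it is applied to a hypothetical price Series (the values are made up for illustration). The result is itself a Series, so individual statistics can be read by label:

```python
import pandas as pd

# Hypothetical prices, not the scraped data
prices = pd.Series([100, 300, 600, 3000], name='price')

# count, mean, std, min, quartiles, and max in one call
summary = prices.describe()
print(summary)

# Individual entries can be read by their label
print(summary['mean'], summary['50%'])
```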

Then pull out the profiles of the highest- and lowest-priced counselors:

df.loc[df['price'].idxmax()]
df.loc[df['price'].idxmin()]

[Screenshot: rows for the highest- and lowest-priced counselors]
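
Note that idxmax() and idxmin() return only the first matching index label, so if several counselors share the top price, only one row comes back. A sketch with hypothetical rows (the names and prices are made up) that shows both the single-row lookup and a boolean mask that keeps all ties:

```python
import pandas as pd

# Hypothetical rows, not the scraped data
sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'price': [300, 3000, 100, 3000],
})

# idxmax() gives the index label of the first maximum only
first_top = sample.loc[sample['price'].idxmax()]

# A boolean mask keeps every row tied for the maximum
all_top = sample[sample['price'] == sample['price'].max()]

print(first_top['name'])
print(all_top['name'].tolist())
```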

How do we count word frequencies in the counselor introductions?

Method 1: segment the text with jieba and count with Counter

import jieba
from collections import Counter

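# Export the counselor introduction column on its own
# (no index or header, so they are not counted as words)
df['info'].to_csv('info.txt', index=False, header=False)

# Read it back with an explicit encoding so Chinese text decodes correctly
with open('info.txt', 'r', encoding='utf-8') as f:
    text = f.read()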

wordlist = Counter()
words = jieba.cut(text)

for word in words:
    if len(word) > 1: 
        wordlist[word] += 1

def gen_cloud_word():
    # Return the 30 most frequent words, most common first
    return [word for word, cnt in wordlist.most_common(30)]

List the 30 most frequent words:

[Screenshot: the 30 most frequent words]
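
The counting step can be tried on its own without jieba. In this sketch a hand-written token list stands in for the output of jieba.cut (the tokens are made up for illustration); the length filter mirrors the one above and drops single-character tokens:

```python
from collections import Counter

# Hypothetical tokens standing in for the output of jieba.cut(text)
tokens = ['心理', '咨询', '的', '心理', '治疗', '咨询', '心理']

# Count only tokens longer than one character
counts = Counter(t for t in tokens if len(t) > 1)

# most_common(n) returns (word, count) pairs, highest count first
top_words = [word for word, cnt in counts.most_common(2)]
print(top_words)
```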

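Method 2: build a tag cloud directly with wordcloud

word_cloud is a Python library for generating tag clouds. It takes text as input and produces a tag cloud image, and the shape and colors of the image can be customized. It is small and handy. (https://github.com/amueller/word_cloud)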
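
On the shape customization: wordcloud takes a mask array, and according to its documentation pure white entries are masked out while all other entries may be drawn on. As a sketch, such a mask can also be built with NumPy instead of being loaded from an image (the name `circle_mask` and the sizes are illustrative assumptions):

```python
import numpy as np

size = 300
center, radius = size // 2, 130

# Open grids of row (y) and column (x) coordinates
y, x = np.ogrid[:size, :size]

# True for pixels outside the circle
outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2

# 255 (white) outside the circle is masked out; 0 inside is drawable
circle_mask = (255 * outside).astype(np.uint8)
print(circle_mask.shape)
```

An array like this could be passed as the mask argument of WordCloud in place of an image-derived array.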

Combined with matplotlib, a tag cloud of the high-frequency words takes only a few lines:

%matplotlib inline
%config InlineBackend.figure_format='retina'
from os import path
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np

d = path.dirname('info.txt')
# Font path (a font with Chinese glyphs is needed to render Chinese words)
font = r'/Users/kidult/Library/Fonts/MFKeSong_Noncommercial-Regular.TTF'

# Read the whole text
with open(path.join(d, 'info.txt'), encoding='utf-8') as f:
    text = f.read()

# Words to exclude
stopwords = {'Dr', 'of', 'to', 'and', 'The', 'in', 'zx', '至今', '中国', '同时', '当然'}

# Use an image both as the shape mask and as the color source
heart_coloring = np.array(Image.open(path.join(d, "heart.png")))

# Generate a word cloud image
wordcloud = WordCloud(max_words=80, background_color='white', mask=heart_coloring,
                      max_font_size=60, relative_scaling=0.4, font_path=font,stopwords=stopwords, random_state=42)
wordcloud.generate(text)

image_colors = ImageColorGenerator(heart_coloring)


plt.figure(figsize=(12,8))
plt.axis("off")
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");

The result: