Python Web Scraping in Practice (2): A Collection of Python Web Crawler Project Code

By admin in 4858.com on April 2, 2019

1. Multi-process crawlers

  When a crawler has to fetch a large amount of data and the data needs to be processed quickly, Python's multi-process or multi-threading machinery can be used. Multi-processing spreads the work over several CPUs in separate worker processes, while multi-threading runs several "sub-process"-like threads inside one process that work together (in CPython only one thread executes Python bytecode at any moment because of the GIL). Python has more than one module for either approach; here the multiprocessing module is used to build the multi-process crawler. During testing it turned out that, because the site has anti-crawler measures, the crawler starts reporting errors once the number of URLs and processes grows large.
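To make the difference concrete, here is a minimal sketch (my addition, not part of the original post): multiprocessing.Pool runs the workers in separate processes, while multiprocessing.dummy.Pool exposes the same map-style interface backed by threads. The fetch() helper is a made-up example worker.

import requests
from multiprocessing import Pool                      # process pool
from multiprocessing.dummy import Pool as ThreadPool  # thread pool with the same API

def fetch(url):
    # Hypothetical worker: download one page and report its size.
    return len(requests.get(url, timeout=10).text)

if __name__ == "__main__":
    urls = ["https://www.qiushibaike.com/text/page/%d" % i for i in range(1, 5)]

    process_pool = Pool(processes=2)   # separate processes help when parsing is CPU-heavy
    print(process_pool.map(fetch, urls))
    process_pool.close()
    process_pool.join()

    thread_pool = ThreadPool(4)        # threads are usually enough for I/O-bound downloading
    print(thread_pool.map(fetch, urls))
    thread_pool.close()
    thread_pool.join()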

Python multi-threaded crawlers and several ways of storing the scraped data (Python Web Scraping in Practice 2)


[1] - WeChat official account crawler. A crawler interface for WeChat official accounts built on Sogou WeChat Search; it can be extended into a general crawler based on Sogou search. The result is a list in which every item is a dict holding the details of one official account.

2. The code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 Fetch the HTML of the web page at the given url; returns None on failure
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke (duanzi) entries from the page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 laugh_counts = re.findall('.*?<i class="number">(\d+)</i>',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # Assumption: the joke text sits in a <span> inside the content div; the original pattern here captured nothing.
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 time.sleep(1)
 return duanzi_list

def normal_scapper(url_lists):
 '''
 Driver function: crawl the data with the plain single-process crawler
 '''
 begin_time = time.time()
 for url in url_lists:
  scrap_qiushi_info(url)
 end_time = time.time()
 print "普通爬虫一共耗费时长:%f" % (end_time - begin_time)

def muti_process_scapper(url_lists,process_num=2):
 '''
 Driver for the multi-process crawler: fetch the web data with the multiprocessing module
 '''
 begin_time = time.time()
 pool = Pool(processes=process_num)
 pool.map(scrap_qiushi_info,url_lists)
 end_time = time.time()
 print "%d个进程爬虫爬取所耗费时长为:%s" % (process_num,(end_time - begin_time))

def main():
 '''
 main(): program entry point; build the url list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 normal_scapper(url_lists)
 muti_process_scapper(url_lists,process_num=2)


if __name__ == "__main__":
 main()
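One caveat about the multi-process version above: each worker process gets its own copy of the global duanzi_list, so the list in the parent process stays empty; the scraped records should be collected from pool.map's return value instead. The sketch below is my addition rather than the original author's code; scrape_with_retry is a hypothetical wrapper that also retries with a random pause, since section 1 notes that the site's anti-crawler checks start rejecting requests when many URLs and processes are used.

import random
import time
from multiprocessing import Pool

def scrape_with_retry(url, retries=3):
    # Hypothetical wrapper around scrap_qiushi_info() from the listing above.
    for attempt in range(retries):
        try:
            return scrap_qiushi_info(url)
        except Exception as exc:
            print("attempt %d for %s failed: %s" % (attempt + 1, url, exc))
            time.sleep(random.uniform(1, 3))   # back off before retrying
    return []

def collect_with_pool(url_lists, process_num=2):
    pool = Pool(processes=process_num)
    results = pool.map(scrape_with_retry, url_lists)   # one list of records per page
    pool.close()
    pool.join()
    # Flatten the per-page results in the parent process.
    return [item for page in results for item in page]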


This is the 98th original article from 菜鸟学Python; it takes about three minutes to read.


 3. Storing the scraped data in MongoDB

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 Fetch the HTML of the web page at the given url; returns None on failure
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke (duanzi) entries from the page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 laugh_counts = re.findall('.*?<i class="number">(\d+)</i>',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # Assumption: the joke text sits in a <span> inside the content div; the original pattern here captured nothing.
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mongo(datas):
 '''
 @datas: the records to insert into MongoDB, each wrapped as a dict; iterate over them and insert one document at a time with insert_one()
 '''
 client = pymongo.MongoClient('localhost',27017)
 duanzi = client['duanzi_db']
 duanzi_info = duanzi['duanzi_info']
 for data in datas:
  duanzi_info.insert_one(data)

def query_data_from_mongo():
 '''
 Query the data stored in MongoDB
 '''
 client = pymongo.MongoClient('localhost',27017)['duanzi_db']['duanzi_info']
 for data in client.find():
  print data 
 print "一共查询到%d条数据" % (client.find().count())


def main():
 '''
 main(): program entry point; build the url list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mongo(duanzi_list)

if __name__ == "__main__":
 main()
 #query_data_from_mongo()
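When the page list grows, inserting one document per round trip becomes slow. pymongo's insert_many() can send the whole batch at once, and a unique index keeps repeated runs from storing the same joke twice. The sketch below is my own addition, not part of the original post; the choice of ("username", "content") as the deduplication key is an assumption.

import pymongo

def write_into_mongo_bulk(datas):
    client = pymongo.MongoClient('localhost', 27017)
    collection = client['duanzi_db']['duanzi_info']
    # Optional: a unique index so re-running the crawler does not create duplicates.
    collection.create_index(
        [('username', pymongo.ASCENDING), ('content', pymongo.ASCENDING)],
        unique=True)
    if datas:
        try:
            collection.insert_many(datas, ordered=False)   # keep going past duplicate-key errors
        except pymongo.errors.BulkWriteError as exc:
            print("skipped %d duplicate documents" % len(exc.details.get('writeErrors', [])))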


Earlier I wrote a post about the approach for crawling every Python book on the market, which also counts as a small hands-on project in our data-analysis lecture series. The code was not finished last time; I happened to have time over the weekend to complete it all and load the results into a database, so today I will walk through, step by step, how I scraped the data, cleaned it, and got around the anti-crawler measures, along with a few notes from the process.

[2] - Douban Books crawler. It can pull down every book under a Douban Books tag and store them in Excel ranked by rating, which makes filtering and collecting easy, for example picking out high-rated books with more than 1,000 ratings; books can also be stored in separate Excel sheets by topic. It uses a User-Agent header to disguise itself as a browser and adds random delays to imitate browser behaviour more closely and avoid being blocked.

 4. Inserting into a MySQL database

  To keep the scraped data permanently in the relational database MySQL, first create the database and table (the "1 warning" in the output below comes from the DEFAULT '' on the TEXT column, which this server version ignores):

1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Switch to the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment,username varchar(64) not null,level int default 0,laugh_count int default 0,comment_count int default 0,content text default '')engine=InnoDB charset='UTF8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table       | Create Table                                                                                                                                                                                                                                                                                            |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| qiushi_info | CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

 The code that writes into the MySQL database is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import time 
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
 '''
 Fetch the HTML of the web page at the given url; returns None on failure
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke (duanzi) entries from the page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 laugh_counts = re.findall('.*?<i class="number">(\d+)</i>',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # Assumption: the joke text sits in a <span> inside the content div; the original pattern here captured nothing.
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mysql(datas):
 '''
 @params: datas, write the scraped records into the MySQL database
 '''
 conn = pymysql.connect(host='localhost',port=3306,user='root',password='',db='qiushi',charset='utf8')
 cursor = conn.cursor(pymysql.cursors.DictCursor)
 try:
  for data in datas:
   data_list = (data['username'],int(data['level']),int(data['laugh_count']),int(data['comment_count']),data['content'])
   # Note: interpolating values with % is fragile; see the parameterized variant after this listing.
   sql = "INSERT INTO qiushi_info(username,level,laugh_count,comment_count,content) VALUES('%s',%s,%s,%s,'%s')" %(data_list)
   cursor.execute(sql)
  conn.commit()
 except Exception as e:
  print e
 finally:
  cursor.close()
  conn.close()


def main():
 '''
 main(): program entry point; build the url list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mysql(duanzi_list)

if __name__ == "__main__":
 main()
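Interpolating the values into the SQL string with % breaks as soon as a joke contains a quote character, and it invites SQL injection. pymysql escapes the values itself when they are passed separately from the statement. The variant below is my sketch, not the original author's code; it also batches the rows with executemany().

import pymysql

def write_into_mysql_safe(datas):
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='', db='qiushi', charset='utf8')
    sql = ("INSERT INTO qiushi_info(username,level,laugh_count,comment_count,content) "
           "VALUES (%s, %s, %s, %s, %s)")
    rows = [(d['username'], int(d['level']), int(d['laugh_count']),
             int(d['comment_count']), d['content']) for d in datas]
    try:
        with conn.cursor() as cursor:
            cursor.executemany(sql, rows)   # the driver quotes every value itself
        conn.commit()
    finally:
        conn.close()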


1) All the Python books on the market are on JD.com, Tmall and Douban, so I chose Douban for the crawl. 2) Analysing the structure of the site is actually fairly simple: there is a main page containing the links to all the Python books, 1,388 in total (of which 100-odd are actually duplicates), and the pager at the bottom of the page shows 93 pages in all.

[3] - Zhihu crawler. This project crawls Zhihu user profiles and the topology of their follow relationships; it uses the scrapy crawler framework and stores the data in mongodb.

 5. Writing the scraped data to a CSV file

  A CSV file is a comma-separated text format that can be read either as plain text or with Excel; it is a very common way of storing data. Here the scraped data is written into a CSV file.

The code that stores the data in a CSV file is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 Fetch the HTML of the web page at the given url; returns None on failure
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke (duanzi) entries from the page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 laugh_counts = re.findall('.*?<i class="number">(\d+)</i>',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # Assumption: the joke text sits in a <span> inside the content div; the original pattern here captured nothing.
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_csv(datas,filename):
 '''
 @datas: the records to write into the csv file, a list of dicts
 @params: filename, path of the target csv file
 '''
 with open(filename,'wb') as f:  # binary mode for the csv module under Python 2
  writer = csv.writer(f)
  writer.writerow(('username','level','laugh_count','comment_count','content'))
  for data in datas:
   writer.writerow((data['username'],data['level'],data['laugh_count'],data['comment_count'],data['content']))

def main():
 '''
 main(): program entry point; build the url list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_csv(duanzi_list,'/root/duanzi_info.csv')

if __name__ == "__main__":
 main()
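Under Python 3 the same job is usually done with csv.DictWriter, opening the file in text mode with an explicit encoding and newline='' as the csv documentation recommends. A minimal sketch of my own:

import csv

def write_into_csv_py3(datas, filename):
    # Python 3 variant: text mode, explicit encoding, newline='' for the csv module.
    fieldnames = ['username', 'level', 'laugh_count', 'comment_count', 'content']
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(datas)   # each record is already a dict with these keys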


3) These pages are static and the URLs follow a regular pattern, so it is easy to construct all of the URL addresses. 4) The crawler then takes, from every paginated list, all of the Python books and their URLs; for instance one of the list pages contains the book 笨办法学Python (Learn Python the Hard Way), and all we need to extract is the book title and the corresponding URL.

[4] - Bilibili user crawler. Fields scraped: user id, nickname, gender, avatar, level, experience points, number of followers, birthday, location, registration time, signature, and level/experience, etc. After scraping it generates a data report on Bilibili users.

 6. Writing the scraped data to a plain text file

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 Fetch the HTML of the web page at the given url; returns None on failure
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke (duanzi) entries from the page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 laugh_counts = re.findall('.*?<i class="number">(\d+)</i>',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # Assumption: the joke text sits in a <span> inside the content div; the original pattern here captured nothing.
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_files(datas,filename):
 '''
 Write the scraped records to a plain text file
 @params: datas, the records to write
 @filename: path of the target file
 '''
 print "Start writing to the file..."
 with open(filename,'w+') as f:
  f.write("username" + "\t" + "level" + "\t" + "laugh_count" + "\t" + "comment_count" + "\t" + "content" + "\n")
  for data in datas:
   f.write(data['username'] + "\t" + \
    data['level'] + "\t" + \
    data['laugh_count'] + "\t" + \
    data['comment_count'] + "\t" + \
    data['content'] + "\n" + "\n"
   )

def main():
 '''
 main(): program entry point; build the url list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_files(duanzi_list,'/root/duanzi.txt')

if __name__ == "__main__":
 main()
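An alternative to the tab-separated text file is one JSON document per line, which stays easy to reload later. The sketch below is my addition and targets Python 3, where open() accepts an encoding argument.

import json

def write_into_jsonl(datas, filename):
    # One JSON object per line, UTF-8 output regardless of the platform default.
    with open(filename, 'w', encoding='utf-8') as f:
        for data in datas:
            f.write(json.dumps(data, ensure_ascii=False) + '\n')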

 


1) Above we already extracted all the Python books and their urls from the 93 list pages, 93*15, roughly 1,300-odd books in total. First deduplicate them; then we can either keep them in memory in a dict or save them to a csv file, as sketched below (some of you may wonder why bother saving to a file when a dict would be convenient; hold that thought, the reason is revealed at the end).
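A dict keyed by URL is enough for the deduplication, keeping the first title seen for each link, and dumping it to CSV afterwards takes only a few lines. This sketch is mine, not the author's code; it assumes the list-page crawl produced (title, url) pairs and reuses the all_books_link.csv name mentioned later in the post.

import csv

def dedupe_and_save(book_pairs, filename='all_books_link.csv'):
    # book_pairs: iterable of (title, url) tuples collected from the 93 list pages.
    unique = {}
    for title, url in book_pairs:
        unique.setdefault(url, title)   # keep the first title seen for each url
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(('title', 'url'))
        for url, title in unique.items():
            writer.writerow((title, url))
    return unique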

[5] - Sina Weibo crawler. It mainly scrapes Sina Weibo users' profile information, their posts, and their followers and followees. The code logs in with Sina Weibo cookies, and several accounts can be used to get around Weibo's anti-scraping measures. Built mainly on the scrapy crawler framework.


 


2) We then analyse the characteristics of each book's page:

[6] - Distributed novel-download crawler. A distributed web crawler implemented with scrapy, redis, mongodb and graphite: the underlying storage is a mongodb cluster, the distribution is implemented with redis, the crawler status display uses graphite, and it mainly targets a single novel site.

As the previous post mentioned, the fields we need to parse are: author / publisher / translator / publication year / page count / list price / ISBN / rating / number of ratings.

[7] - CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to scrape the data; the results are stored under the /data directory, and the first line of each data file contains the field names.

A look at the page source shows that the key information sits in two div blocks.

[8] - Lianjia crawler. Scrapes Lianjia's historical second-hand housing transaction records for the Beijing area. It contains all the code from the Lianjia-crawler article, including the Lianjia simulated-login code.

3) This part of the data cleaning is fairly tedious, because not every book has reviews and a rating, and not every book has an author, page count or price, so the extraction has to handle missing fields carefully; some pages simply look different. The raw data collected along the way contains plenty of inconsistencies:

[9] - JD.com crawler. A scrapy-based crawler for the JD.com site; results are saved in csv format.

  • The books' publication dates come in every imaginable format: 'September 2007', 'October 22, 2007', '2017-9', '2017-8-25'

  • Some books' prices use inconsistent currencies (euros, yen, dollars and RMB), for example: CNY 49.00, 135, 19 €, JPY 4320, $ 176.00 (see the cleaning sketch below)
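A cleaning pass for those two fields might look like the sketch below. It is my own illustration, not the author's code, and the patterns are assumptions based only on the example formats listed above.

import re
from datetime import datetime

MONTHS = {name: num for num, name in enumerate(
    ['January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December'], start=1)}

def parse_pub_date(raw):
    # Handles 'September 2007', 'October 22, 2007', '2017-9', '2017-8-25'.
    raw = raw.strip()
    m = re.match(r'^(\d{4})-(\d{1,2})(?:-(\d{1,2}))?$', raw)
    if m:
        return datetime(int(m.group(1)), int(m.group(2)), int(m.group(3) or 1))
    m = re.match(r'^([A-Z][a-z]+)\s+(?:(\d{1,2}),\s*)?(\d{4})$', raw)
    if m and m.group(1) in MONTHS:
        return datetime(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2) or 1))
    return None

def parse_price(raw):
    # Normalises 'CNY 49.00', '19 €', 'JPY 4320', '$ 176.00' to (currency, amount).
    currency = 'CNY'   # assumed default when no marker is present, e.g. a bare '135'
    if '€' in raw:
        currency = 'EUR'
    elif 'JPY' in raw:
        currency = 'JPY'
    elif '$' in raw:
        currency = 'USD'
    amount = re.search(r'[\d.]+', raw)
    return currency, float(amount.group()) if amount else None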

[10] - QQ group crawler. Batch-scrapes QQ group information, including group name, group number, number of members, group owner, group description and so on, and finally produces an XLS(X) / CSV result file.

1) Some readers asked whether I used the scrapy framework or wrote everything by hand. This project was written by hand; scrapy is a really excellent framework, and if I were crawling hundreds of thousands of records I would certainly reach for that heavy weapon.

[11] - WooYun crawler. A crawler and search tool for WooYun's public vulnerabilities and knowledge base. The list of all public vulnerabilities and the text of each one are stored in mongodb, roughly 2 GB of content; crawling the whole site's text and images for offline querying takes about 10 GB of space and 2 hours (on a 10M telecom line); crawling the full knowledge base takes about 500 MB. The vulnerability search uses Flask as the web server and Bootstrap for the front end.

2) I crawled with multiple threads: throw all of the URLs into a queue, start a few threads that keep pulling URLs from the queue, and let them loop until every URL in the queue has been processed.
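A minimal version of that worker/queue pattern with the standard library is sketched below; it is my illustration rather than the author's code, and handle_book_url() is a placeholder for the real per-page parsing.

import threading
from queue import Queue   # Queue.Queue under Python 2

def worker(task_queue):
    while True:
        url = task_queue.get()
        if url is None:               # sentinel: no more work
            task_queue.task_done()
            break
        try:
            handle_book_url(url)      # placeholder: fetch and parse one book page
        finally:
            task_queue.task_done()

def crawl_all(urls, thread_num=5):
    task_queue = Queue()
    for url in urls:
        task_queue.put(url)
    threads = [threading.Thread(target=worker, args=(task_queue,)) for _ in range(thread_num)]
    for t in threads:
        t.start()
    task_queue.join()                 # wait until every url has been processed
    for _ in threads:
        task_queue.put(None)          # tell the workers to stop
    for t in threads:
        t.join()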

Added 2016-09-11:

3) For storing the data there were two ideas:

[12] - Qunar crawler. Web crawling with Selenium and proxy login: it crawls the Qunar site, uses selenium to simulate a browser login and to page through the results. Proxies can be kept in a file that the program reads and uses. Supports multi-process scraping.

  • One is to write the scraped data straight into an SQL database and, every time a new url arrives, query the database first: if it is already there, skip it; if not, crawl and process it.

  • The other is to save to a CSV file. Because several threads read and write at once, the writes must be protected, otherwise multiple threads writing to one file will cause trouble (see the lock sketch below). A CSV file can still be converted into a database later, and saving to CSV has one more benefit: it loads into pandas, which makes processing and analysis very convenient.
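The protection amounts to a single threading.Lock around the file access; a thread-safe CSV append might look like this sketch of mine (Python 3):

import csv
import threading

csv_lock = threading.Lock()

def append_row(filename, row):
    # Only one thread at a time may touch the file.
    with csv_lock:
        with open(filename, 'a', encoding='utf-8', newline='') as f:
            csv.writer(f).writerow(row)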

[13] - Flight-ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight-ticket crawler that currently integrates the two major domestic ticket sites (Qunar and Ctrip).

1) Large sites generally all have anti-crawler strategies; even though we are only crawling about a thousand books this time, we still run into the anti-crawler problem all the same.


2) There are many ways around anti-crawler strategies: sometimes adding delays (especially with multi-threaded crawling), sometimes using cookies, sometimes proxies; large-scale crawling definitely calls for a proxy pool. Here I used cookies plus delays, a rather crude approach.
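The cookie-plus-delay trick boils down to reusing one requests.Session, which carries the cookies, and sleeping a random interval before each request. In this sketch of mine the delay range is an arbitrary choice.

import random
import time
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) '
                                      'Gecko/20100101 Firefox/52.0'})

def polite_get(url, min_delay=1.0, max_delay=3.0):
    # The Session keeps whatever cookies the site sets; the random pause spaces
    # requests out so they look less like a burst from a script.
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, timeout=10)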

[14] - A content crawler for the NetEase (网易) client, based on requests, MySQLdb and torndb.

3) Resumable crawling: even though my data volume is not huge, on the order of a thousand records, I recommend adding resume support, because you never know what will go wrong while crawling. Even if you can crawl recursively, if the program dies after 800-odd records and nothing has been saved yet, the next run has to start from scratch, which is maddening (the sharp ones will have guessed that this is exactly why I left that hint in step 3 above).
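Resume support can be as simple as persisting the set of URLs that have already been handled and skipping them on the next run. The sketch below is my addition; done_urls.txt is an assumed bookkeeping file and handle_book_url() is again a placeholder for the real work.

import os

DONE_FILE = 'done_urls.txt'   # assumed bookkeeping file

def load_done():
    if not os.path.exists(DONE_FILE):
        return set()
    with open(DONE_FILE, encoding='utf-8') as f:
        return set(line.strip() for line in f if line.strip())

def mark_done(url):
    with open(DONE_FILE, 'a', encoding='utf-8') as f:
        f.write(url + '\n')

def crawl_resumable(urls):
    done = load_done()
    for url in urls:
        if url in done:
            continue                  # already crawled in a previous run
        handle_book_url(url)          # placeholder for the real per-page work
        mark_done(url)                # record progress immediately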


1) I have not fully optimised the overall code architecture yet; at the moment it is a handful of .py files, which I will further optimise and package later.

[15] - A collection of crawlers for Douban movies, books, groups, albums, products and more.


[17] - Baidu mp3 full-site crawler; uses redis to support resumable crawling.
[18] - A crawler for Taobao and Tmall that can scrape page information by search keyword or item id; the data is stored in …

  • spider_main: mainly crawls the links and titles of all the books on the 93 paginated list pages, with multi-threading
  • book_html_parser: mainly crawls the information of each individual book
  • url_manager: mainly manages all of the url links
  • db_manager: mainly handles database storage and queries
  • util: holds a few global variables
  • verify: a small program I use internally to test the code

[19] - A stock-data (Shanghai and Shenzhen) crawler and stock-picking strategy testing framework. Scrapes the quotation data of all stocks on the Shanghai and Shenzhen exchanges for a chosen date range. Supports defining stock-picking strategies with expressions. Supports multi-threaded processing. Saves data to JSON and CSV files.

2) Where the main crawl results are stored:

  • all_books_link.csv: mainly stores the urls and titles of the 1,200-odd books
  • python_books.csv: mainly stores the detailed information of each individual book

3) Libraries used. Crawling: requests and beautifulSoup. Data cleaning: a large number of regular expressions and the collections module, with datetime and calendar for the books' publication dates. Multi-threading: the threading module and queue.


Conclusion: that wraps up today's crawler instalment of the whole-web analysis of Python books. We have basically covered every technical point of this series once. Crawling is good fun, but becoming a real crawler expert still takes a lot more study; making a crawler that is fast, robust and able to get past anti-crawler systems is not a simple matter.
If you are interested, try writing one yourself. I will put the source code on github once the later data-analysis instalments are finished; if you run into any problems, feel free to leave a comment and discuss.

This project collects all kinds of open-source, real-world Python web crawler code and is updated over the long term; contributions are welcome.

