A large share of Python's use cases are network-based: very little computing happens in isolation today, and most real work is done over the network. Python is a natural fit here because it ships with a rich set of networking packages that can handle all kinds of network requests, which greatly reduces development complexity and speeds up application development. Today, taking a web crawler as the example, let's look at how the most common Python networking libraries are used in practice.
Python's networking packages fall roughly into two groups: those that issue network requests and handle the raw responses, and those that parse page content and extract information. The commonly used libraries are urllib, urllib2, urllib3, requests, BeautifulSoup, and lxml. urllib, urllib2, urllib3, and requests belong to the first group, while BeautifulSoup and lxml belong to the second. The urllib/urllib2 split is a Python-version matter (urllib2 existed in Python 2 and was folded into urllib.request in Python 3), urllib3 is a separate third-party package, and requests is built on top of urllib3, so it is essentially a friendlier wrapper over the lower-level urllib family. Knowing this helps a lot when developing with and debugging these libraries.
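To make the division of labor concrete, here is a minimal sketch that fetches the same page first with the standard-library urllib.request and then with requests; the address http://example.com is just a placeholder for illustration -
import urllib.request
import requests

url = "http://example.com"  # placeholder address, just for illustration

# lower-level standard library: you manage the response object and decoding yourself
with urllib.request.urlopen(url) as resp:
    rawHtml = resp.read().decode("utf-8")

# requests: one call, text decoding handled for you
betterHtml = requests.get(url).text

print(len(rawHtml), len(betterHtml))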
Enough talk, let's look at an example. This is a simple application that visits a web page and grabs the images on it; the basic implementation looks like this -
The basic idea -
import requests
from bs4 import BeautifulSoup
from lxml import etree
htmlRequest = requests.get(url)                               # url is the page address; this returns a Response object
htmlContent = BeautifulSoup(htmlRequest.text, 'html.parser')  # parse the HTML text of the response
htmlContent.select("img")                                     # grab every <img> tag on the page
The three steps above are the most direct network workflow, but to harvest accurate information reliably, a few lines like these are nowhere near enough. You need to wrap and harden them, making the code less conspicuous and more efficient, because when you crawl someone else's site you generally can't go in completely bare; if you do, they certainly won't let you crawl smoothly. Why? You know why...
Below is my complete code, with comments at the important points; if you have some Python basics, you should have no trouble following it ~~
Visit the given page, get its HTML content, and return the number of image links -
import requests, urllib, json, os, sys, time, random
from bs4 import BeautifulSoup
# Swap the headers and IP every time we move to a new page, so each visit looks like it comes from a real user. This is the basic wrapping around the requests call: open the browser's developer tools in the Network view, look at the request headers, and wrap the parameters one by one; usually a proxy and a User-Agent header are enough.
proxy_list = [{'HTTP': '175.42.128.96:9999'}, {'HTTP': '119.108.172.244:9000'}, {'HTTP': '122.226.57.70:8888'}, {'HTTP': '125.108.87.88:9000'}, {'HTTP': '122.234.92.111:9000'}, {'HTTP': '121.232.199.95:9000'}, {'HTTP': '175.43.57.17:9999'}, {'HTTP': '163.204.245.61:9999'}, {'HTTP': '123.55.114.251:9999'}, {'HTTP': '110.243.18.179:9999'}, {'HTTP': '125.108.107.118:9000'}, {'HTTP': '125.108.67.101:9000'}, {'HTTP': '123.101.207.19:9999'}, {'HTTP': '113.128.27.16:9999'}, {'HTTP': '125.108.85.59:9000'}, {'HTTP': '125.108.75.165:9000'}, {'HTTP': '123.160.1.227:9999'}, {'HTTP': '125.108.114.61:9000'}, {'HTTP': '171.35.168.185:9999'}, {'HTTP': '120.83.108.100:9999'}, {'HTTP': '123.163.116.39:9999'}, {'HTTP': '106.4.136.158:9000'}, {'HTTP': '120.83.100.162:9999'}, {'HTTP': '175.42.123.75:9999'}, {'HTTP': '113.124.92.168:9999'}, {'HTTP': '112.14.47.6:52024'}, {'HTTP': '125.108.73.169:9000'}, {'HTTP': '222.85.28.130:52590'}, {'HTTP': '123.149.141.196:9999'}, {'HTTP': '123.149.136.233:9999'}, {'HTTP': '123.57.84.116:8118'}, {'HTTP': '125.108.117.67:9000'}, {'HTTP': '175.43.56.16:9999'}, {'HTTP': '175.42.122.25:9999'}, {'HTTP': '175.42.123.160:9999'}, {'HTTP': '125.108.102.60:9000'}, {'HTTP': '125.110.98.55:9000'}, {'HTTP': '175.43.59.57:9999'}, {'HTTP': '123.55.98.218:9999'}, {'HTTP': '171.11.178.35:9999'}]
myHeaders = [
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.63"
]
def getResponse_and_imgTag(url, selectorFlag):
    # Pick a fresh proxy and User-Agent for every call so each page visit looks different.
    # requests matches the proxy scheme key in lowercase ('http'), so normalize the keys here.
    theProxy = {k.lower(): v for k, v in random.choice(proxy_list).items()}
    header = random.choice(myHeaders)
    time.sleep(0.1)
    html = requests.get(url, headers={"User-Agent": header, "Referer": url}, proxies=theProxy)  # fetch the page through the chosen proxy
    print("************* Check header -", header)
    print("************* Print the Referer - ", url)
    # if html.status_code == 200:
    #     print("The html source is - \n", html.text)
    if selectorFlag == "bs":
        htmlContent = BeautifulSoup(html.text, 'html.parser')  # parse html.text (not html itself) with BeautifulSoup
        imgTagQty = len(htmlContent.select("img"))              # count every <img> tag
    else:
        htmlContent = etree.HTML(html.text)                     # note: etree.HTML() returns an Element object
        imgTagQty = len(htmlContent.xpath('//table/tr/td/a/img/@src'))  # use an XPath selector to extract tags and attributes
    print("How many pictures are there in this page - ", imgTagQty)
    return htmlContent, imgTagQty
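Free proxies like the ones in proxy_list die quickly, so the requests.get call inside getResponse_and_imgTag can easily hang or fail. A minimal hardening sketch, assuming you are happy to retry with a different proxy and to fall back to a direct request after a few attempts (the retry count and timeout below are arbitrary choices, not part of the original code) -
def getWithRetry(url, headers, attempts=3, timeoutSec=5):
    # Try up to `attempts` different proxies; fall back to a direct request if they all fail.
    for _ in range(attempts):
        proxy = {k.lower(): v for k, v in random.choice(proxy_list).items()}
        try:
            return requests.get(url, headers=headers, proxies=proxy, timeout=timeoutSec)
        except requests.RequestException:
            continue
    return requests.get(url, headers=headers, timeout=timeoutSec)
Swapping this in for the plain requests.get call is optional; it just keeps one dead proxy from stalling the whole crawl.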
Fetch the pictures and save them to the given location -
def fetchPicture_and_saved(htmlContent, imgTagQty, pathRootPicture, savePictrueFolder, selectorFlag):
    for i in range(imgTagQty):
        # if "顾欣怡" in htmlContent.select('img')[i]['alt']:   # optional filter on the alt text
        if selectorFlag == "bs":
            getImgs = htmlContent.select('img')[i]['src']  # every picture is an <img> tag; its src attribute points to the image source
        else:
            getImgs = htmlContent.xpath('//table/tr/td/a/img/@src')[i]  # same idea with the XPath selector: take the i-th src value
        pictureName = getImgs.split("/")[-1]  # picture file name
        getSuffix = getImgs.split(".")[-1]    # file extension
        if getSuffix.lower() == "jpg":
            if "http://" in getImgs or "https://" in getImgs:
                picUrl = getImgs
            else:
                picUrl = pathRootPicture + getImgs  # build the full picture URL
            print(" >>>>>>>>>>>>>>>>> Check the picture URL -\n", picUrl)
            pictrueData = requests.get(picUrl).content
            if not os.path.exists(savePictrueFolder):
                os.makedirs(savePictrueFolder)
            savePictureAs = os.path.join(savePictrueFolder, pictureName)
            if os.path.exists(savePictureAs):
                print("This picture already exists!")
            else:
                with open(savePictureAs, 'wb') as f:
                    f.write(pictrueData)
                print("Saved successfully!")
Delete the thumbnails and keep only the full-size pictures -
def deleteSnapShot(savePictrueFolder):
    for root, dirs, files in os.walk(savePictrueFolder):
        for file in files:
            pathpic = os.path.join(root, file)
            picSize = os.path.getsize(pathpic) // 1000  # size in KB
            if picSize < 20:  # treat anything under ~20 KB as a thumbnail
                os.remove(pathpic)
if __name__ == "__main__":
    pathRootPicture = r'http://m.qwsgyideyrtys.com'
    selectorFlag = ""   # empty string -> use the lxml/XPath branch; set to "bs" for BeautifulSoup
    savePictrueFolder = r'C:\Users\Luck_Lello\Desktop\PythonDrilling\fetchPictruesFromWebsite\PictureArts_Oct09'
    for j in [278, 279, 280]:  # 167,
        for i in range(1, 16):
            # the first page of an album has no page suffix; later pages end with _2.html, _3.html, ...
            if i == 1:
                url = r'http://m.qwsgyideyrtys.com/dandan/20180713/' + str(j) + '.html'
            else:
                url = r'http://m.qwsgyideyrtys.com/dandan/20180713/' + str(j) + '_' + str(i) + '.html'
            soup, imgTagQty = getResponse_and_imgTag(url, selectorFlag)
            fetchPicture_and_saved(soup, imgTagQty, pathRootPicture, savePictrueFolder, selectorFlag)
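Note that deleteSnapShot is defined above but never called in the main block. If you want to prune the thumbnails once the crawl is finished, one extra line at the end of the main block (outside both loops) would do it -
    deleteSnapShot(savePictrueFolder)  # sweep the download folder and drop files under ~20 KB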
The amount of code is not huge; it simply wraps the basic idea with header and proxy handling to make the crawler less conspicuous. For extracting the page's images it shows both the BeautifulSoup approach and the etree approach, and both work on the response after it has been parsed as HTML. The main difference is that etree can use XPath to jump straight to the images you want, while BeautifulSoup can also extract element values and attributes but is harder to point at one specific image, so you usually end up iterating over everything, which costs extra time and is not ideal, though it does guarantee no picture is missed. Both methods are included here for reference and study.
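That said, BeautifulSoup's select() also accepts CSS selectors, so you can often target images almost as precisely as with XPath. A small sketch, assuming the same table/tr/td/a/img structure that the XPath above expects (the sampleHtml snippet is purely illustrative) -
from bs4 import BeautifulSoup

sampleHtml = "<table><tr><td><a href='#'><img src='/uploads/pic1.jpg'></a></td></tr></table>"  # illustrative snippet
soup = BeautifulSoup(sampleHtml, 'html.parser')
imgSources = [img.get('src') for img in soup.select('table tr td a img')]  # CSS descendant selector, similar to //table/tr/td/a/img/@src
print(imgSources)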