Web Scraping, Revisited


A large share of Python's use cases involve computer networks: people no longer work on isolated machines, and most work gets done over the network. Python is a natural fit for this kind of task, mainly because it comes with a rich set of networking packages that can handle all sorts of network requests, which greatly reduces the complexity of program development and raises development efficiency. Today, taking a web crawler as the example, let's look at where the common Python networking libraries fit in.

Python's networking packages and tool libraries fall roughly into two groups: one handles network requests and file transfer, the other parses page content and extracts information. The commonly used libraries are - urllib, urllib2, urllib3, requests, BeautifulSoup and lxml. urllib, urllib2, urllib3 and requests belong to the first group, while BeautifulSoup and lxml mostly belong to the second. The different urllib variants reflect Python's history: urllib and urllib2 are the Python 2 standard-library modules, which were merged into a single urllib package in Python 3, while urllib3 is a separate third-party library. requests is built on top of urllib3, so it is essentially an application-level extension of the urllib family. Knowing this is very helpful when we actually develop with and choose between these libraries.
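To make that layering concrete, here is a minimal sketch of the same GET request written with the standard-library urllib and with requests (the URL is just a placeholder I chose for illustration); it shows how much boilerplate requests takes care of for you:

import requests
from urllib.request import Request, urlopen

url = "https://example.com"  # placeholder URL, only for illustration

# Standard-library route: build a Request, open it, read and decode the bytes yourself
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urlopen(req, timeout=10) as resp:
    page_stdlib = resp.read().decode(resp.headers.get_content_charset() or "utf-8")

# requests route: one call, decoding and connection handling are done for you
page_requests = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text

print(len(page_stdlib), len(page_requests))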

Enough talk - let's look at the example. It is a simple application that visits a web page and grabs the images on it; the basic implementation looks like this -

The basic idea -
import requests
from bs4 import BeautifulSoup
from lxml import etree

htmlRequest = requests.get(url)  # returns a requests Response object
htmlContent = BeautifulSoup(htmlRequest.text, 'html.parser')  # parse the html text of the page

htmlContent.select("img")  # select all the img tags

The three steps above are the most direct request-handling flow, but to get at the exact information you want, those few lines are nowhere near enough. You have to wrap and polish them, making the code less conspicuous and more efficient, because when you crawl someone else's data you generally can't just demand it openly; if you do, they certainly won't let you crawl smoothly. Why? Do you even need to ask - you know why...
Below is my complete code, with comments at the important spots; with a bit of Python under your belt you will have no trouble following it ~~

Visit the specified page, get its html content and return the number of image links -
import requests, urllib, json, os, sys, time, random
from bs4 import BeautifulSoup
from lxml import etree

def getResponse_and_imgTag(url, selectorFlag):
    # Rotate the header and proxy IP every time we move to a new page, so each visit looks like
    # it comes from a real user. This is the basic wrapper around the requests call: open the
    # page in the browser's developer tools (Network tab), look at the request headers, and fill
    # in the basic fields one by one -- usually a proxy plus a User-Agent header is enough.
    # Note: requests matches proxy keys against the (lowercase) URL scheme, hence 'http' here.
    proxy_list = [
        {'http': '175.42.128.96:9999'}, {'http': '119.108.172.244:9000'}, {'http': '122.226.57.70:8888'},
        {'http': '125.108.87.88:9000'}, {'http': '122.234.92.111:9000'}, {'http': '121.232.199.95:9000'},
        {'http': '175.43.57.17:9999'}, {'http': '163.204.245.61:9999'}, {'http': '123.55.114.251:9999'},
        {'http': '110.243.18.179:9999'}, {'http': '125.108.107.118:9000'}, {'http': '125.108.67.101:9000'},
        {'http': '123.101.207.19:9999'}, {'http': '113.128.27.16:9999'}, {'http': '125.108.85.59:9000'},
        {'http': '125.108.75.165:9000'}, {'http': '123.160.1.227:9999'}, {'http': '125.108.114.61:9000'},
        {'http': '171.35.168.185:9999'}, {'http': '120.83.108.100:9999'}, {'http': '123.163.116.39:9999'},
        {'http': '106.4.136.158:9000'}, {'http': '120.83.100.162:9999'}, {'http': '175.42.123.75:9999'},
        {'http': '113.124.92.168:9999'}, {'http': '112.14.47.6:52024'}, {'http': '125.108.73.169:9000'},
        {'http': '222.85.28.130:52590'}, {'http': '123.149.141.196:9999'}, {'http': '123.149.136.233:9999'},
        {'http': '123.57.84.116:8118'}, {'http': '125.108.117.67:9000'}, {'http': '175.43.56.16:9999'},
        {'http': '175.42.122.25:9999'}, {'http': '175.42.123.160:9999'}, {'http': '125.108.102.60:9000'},
        {'http': '125.110.98.55:9000'}, {'http': '175.43.59.57:9999'}, {'http': '123.55.98.218:9999'},
        {'http': '171.11.178.35:9999'}
    ]
    theProxy = random.choice(proxy_list)

    myHeaders = [
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
        'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
        "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
        "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.63"
    ]
    header = random.choice(myHeaders)

    time.sleep(0.1)
    # Fetch the url. The headers could equally be written as just {'User-Agent': header};
    # the proxy is passed in separately via proxies=theProxy.
    html = requests.get(url, headers={"User-Agent": header, "Referer": url}, proxies=theProxy)
    print("************* Check header2 -", header)
    print("************* Print the Referer - ", url)
    # if html.status_code == 200:
    #     print("The html source is - \n", html.text)
    if selectorFlag == "bs":
        # Pass html.text here, not html: this is the BeautifulSoup way of extracting
        # tags and attribute values.
        htmlContent = BeautifulSoup(html.text, 'html.parser')
        imgTagQty = len(htmlContent.select("img"))
    else:
        # Note this form: etree.HTML() returns an Element object, i.e. htmlContent is an Element.
        htmlContent = etree.HTML(html.text)
        # Use an XPath selector to pull out the img tags and their src attribute values.
        imgTagQty = len(htmlContent.xpath('//table/tr/td/a/img/@src'))

    print("How many pictures are there in this page - ", imgTagQty)
    return htmlContent, imgTagQty
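One practical caveat with a free proxy list like the one above: at any given moment many of the addresses will be dead, and requests.get will then raise a connection error or hang. A small retry wrapper along the following lines keeps the crawl going by re-rolling the proxy and header on failure; this is my own sketch, not part of the original script, and it assumes the proxy and header lists are passed in as parameters:

import random
import requests

def getWithRetry(url, myHeaders, proxy_list, attempts=3):
    # Try up to `attempts` times, picking a fresh header/proxy pair each time.
    for attempt in range(attempts):
        header = random.choice(myHeaders)
        theProxy = random.choice(proxy_list)
        try:
            return requests.get(url,
                                headers={"User-Agent": header, "Referer": url},
                                proxies=theProxy,
                                timeout=10)
        except requests.exceptions.RequestException as err:
            print("Request failed via", theProxy, "-", err)
    raise RuntimeError("All %d attempts failed for %s" % (attempts, url))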
Grab the pictures and save them to the specified location -
def fetchPicture_and_saved(htmlContent, imgTagQty, pathRootPicture, savePictrueFolder, selectorFlag):
    for i in range(imgTagQty):
    # if "顾欣怡" in htmlContent.select('img')[i]['alt']:
        if selectorFlag == "bs":
            # Every picture sits in an img tag; the src attribute points at the image source.
            getImgs = htmlContent.select('img')[i]['src']  # value of src for the i-th image
            pictureName = getImgs.split("/")[-1]           # file name of the picture
            getSuffix = getImgs.split(".")[-1]
        else:
            # With the XPath selector the src values come back as a plain list of strings.
            getImgList = htmlContent.xpath('//table/tr/td/a/img/@src')
            getImgs = getImgList[i]
            pictureName = getImgs.split("/")[-1]
            getSuffix = getImgs.split(".")[-1]

        if getSuffix.lower() == "jpg":
            if "http://" in getImgs or "https://" in getImgs:
                picUrl = getImgs
            else:
                picUrl = pathRootPicture + getImgs  # build the full picture URL

            print(" >>>>>>>>>>>>>>>>> Check the picture URL -\n", picUrl)
            pictrueData = requests.get(picUrl).content

            if not os.path.exists(savePictrueFolder):
                os.makedirs(savePictrueFolder)

            savePictureAs = os.path.join(savePictrueFolder, pictureName)
            if os.path.exists(savePictureAs):
                print("This picture already exists!")
            else:
                with open(savePictureAs, 'wb') as f:
                    f.write(pictrueData)
                print("Saved successfully!!!")
Delete the thumbnails, keeping only the full-size pictures -
def deleteSnapShot(savePictrueFolder):
    for root, dirs, files in os.walk(savePictrueFolder):
        for file in files:
            pathpic = os.path.join(root, file)
            picSize = os.path.getsize(pathpic) // 1000  # file size in KB (roughly)
            if picSize < 20:  # anything under ~20 KB is treated as a thumbnail
                os.remove(pathpic)
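For reference, the same cleanup can also be written with pathlib, which reads a little more naturally than os.walk plus manual path joining; this is just an alternative sketch under the same ~20 KB threshold, not part of the original script:

from pathlib import Path

def deleteSnapShot_pathlib(savePictrueFolder):
    # Walk every file under the folder and drop anything smaller than ~20 KB.
    for pic in Path(savePictrueFolder).rglob("*"):
        if pic.is_file() and pic.stat().st_size < 20_000:
            pic.unlink()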
if __name__ == "__main__":
    pathRootPicture = r'http://m.qwsgyideyrtys.com'
    selectorFlag = ""
    savePictrueFolder = r'C:\Users\Luck_Lello\Desktop\PythonDrilling\fetchPictruesFromWebsite\PictureArts_Oct09'
    for j in [278, 279, 280]:  # 167,
        for i in range(1, 16):
            if i == 1:
                url = r'http://m.qwsgyideyrtys.com/dandan/20180713/' + str(j) + '.html'
            else:
                url = r'http://m.qwsgyideyrtys.com/dandan/20180713/' + str(j) + '_' + str(i) + '.html'
            soup, imgTagQty = getResponse_and_imgTag(url, selectorFlag)
            fetchPicture_and_saved(soup, imgTagQty, pathRootPicture, savePictrueFolder, selectorFlag)

The amount of code is not large; it just wraps the basic flow with a rotating request header and a proxy, which makes the crawler less conspicuous. For extracting the page images it shows both the BeautifulSoup parse and the lxml etree parse; both work on the page content after it has been parsed into an HTML tree. The main difference is that etree can use XPath expressions to jump straight to the specified images, while BeautifulSoup can also extract element values and attributes but, at least the way it is used here, has to iterate over every img tag to find the right ones, which adds time overhead and is not ideal; the upside is that it won't miss a single image. Both approaches are included for reference and study.
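For comparison, here are the two extraction styles side by side on the same fetched page. Note that BeautifulSoup's select() also accepts CSS selectors, so a selector roughly mirroring the XPath above can narrow the search without looping over every img on the page; the table > tr > td > a > img structure is taken from the original XPath, and the URL is one of the pages from the script above:

import requests
from bs4 import BeautifulSoup
from lxml import etree

url = "http://m.qwsgyideyrtys.com/dandan/20180713/278.html"  # one of the pages crawled above
pageText = requests.get(url, timeout=10).text

# lxml + XPath: returns the src attribute values directly as a list of strings
srcByXpath = etree.HTML(pageText).xpath('//table/tr/td/a/img/@src')

# BeautifulSoup + CSS selector: returns Tag objects, so pull src off each one
soup = BeautifulSoup(pageText, 'html.parser')
srcByCss = [img.get('src') for img in soup.select('table tr td a img')]

print(len(srcByXpath), len(srcByCss))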


Author: Alan_Yuan