Web Scraping: The urllib Library in Detail
Published: 2019-06-28


1. What is urllib?

2. Changes compared with Python 2

3. Usage

(1) urlopen

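urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# The first argument is the URL, the second is optional extra data (the request body), and the third sets a timeout. The remaining arguments are not needed for now.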
######### GET request #############
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

Output:
······ (the page's HTML source, omitted)
######### POST request #############
import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# http://httpbin.org/post is an endpoint for testing HTTP requests
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())

Output:
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.5"\n  }, \n  "json": null, \n  "origin": "221.208.253.76", \n  "url": "http://httpbin.org/post"\n}\n'
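The POST example relies on the fact that the data argument must be bytes, not str. A minimal sketch of just that encoding step, using the same form field as above (it runs offline and sends nothing):

```python
import urllib.parse

# Same form field as in the POST example above.
form = {"word": "hello"}

# urlencode() returns a str such as "word=hello" ...
encoded = urllib.parse.urlencode(form)

# ... but urlopen(data=...) expects bytes, so encode it explicitly.
body = encoded.encode("utf-8")

print(encoded)    # word=hello
print(body)       # b'word=hello'
print(len(body))  # 10
```

The body is 10 bytes long, which matches the "Content-Length": "10" that httpbin echoes back in the output above.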
import urllib.request

############### Setting a timeout ###############
# If no response arrives within the given number of seconds, an exception is raised.
response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(response.read())

Output:
b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.5"\n }, \n "origin": "221.208.253.76", \n "url": "http://httpbin.org/get"\n}\n'
import socket
import urllib.error
import urllib.request

############### Setting a timeout that is exceeded ###############
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("Time out")

Output:
Time out
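The timeout example above only catches URLError. In practice a timeout can also surface as a bare socket.timeout when it occurs while waiting for the response rather than while connecting. The following is a minimal sketch of a wrapper that handles both cases; fetch_or_none is a hypothetical helper name, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0):
    """Return the response body, or None if the request timed out.

    Hypothetical helper for illustration; not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not a timeout, so re-raise it
    except socket.timeout:
        # A timeout while waiting for the response can be raised directly.
        return None
```

A call such as fetch_or_none("http://httpbin.org/get", timeout=0.1) would then return None instead of raising when the server is too slow.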

(2) Response

Response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output:
<class 'http.client.HTTPResponse'>

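Status code and response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)               # status code
print(response.getheaders())         # all response headers (note the parentheses: getheaders is a method)
print(response.getheader('Server'))  # a single header

Output:
200
[...] (list of (name, value) header tuples, omitted)
nginx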
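The attributes used above can be gathered into one small function. This is only a sketch: summarize is a hypothetical helper name, and it assumes it is handed the HTTPResponse object returned by urlopen.

```python
import urllib.request


def summarize(response):
    """Collect status and header information from an open HTTPResponse.

    Hypothetical helper for illustration; not part of urllib.
    """
    return {
        "status": response.status,                # e.g. 200
        # getheaders() returns (name, value) tuples; dict() keeps only the
        # last value if a header name is repeated.
        "headers": dict(response.getheaders()),
        "server": response.getheader("Server"),   # None if the header is absent
        # response.headers is an email-style message object.
        "content_type": response.headers.get_content_type(),
    }


# Example use (needs network access):
# with urllib.request.urlopen("https://www.python.org") as response:
#     print(summarize(response))
```

Using urlopen in a with statement closes the connection once the block ends.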

(3) Request

import urllib.request

request = urllib.request.Request("https://python.org")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))

Output:
······ (the page's HTML source, omitted)
from urllib import request, parse

url = 'http://httpbin.org/post'

############ POST request ###############
headers = {
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)",
    "Host": 'httpbin.org'
}
# Named form_data rather than dict, to avoid shadowing the built-in dict type
form_data = {
    'name': "Germey"
}
data = bytes(parse.urlencode(form_data), encoding="utf-8")
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:
{"args": {}, "data": "", "files": {}, "form": {"name": "Germey"}, "headers": {"Accept-Encoding": "identity", "Connection": "close", "Content-Length": "11", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"}, "json": null, "origin": "221.208.253.76", "url": "http://httpbin.org/post"}

from urllib import request, parse

url = "http://httpbin.org/post"
# Named form_data rather than dict, to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, method="POST")
# Headers can also be added after the Request object has been created
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:
{"args": {}, "data": "", "files": {}, "form": {"name": "Germey"}, "headers": {"Accept-Encoding": "identity", "Connection": "close", "Content-Length": "11", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "Mozilla/4.0(compatible;MSIE5.5;Windows NT)"}, "json": null, "origin": "221.208.253.76", "url": "http://httpbin.org/post"}
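A Request object can be inspected before it is sent, which is a handy way to check what urlopen would transmit. The sketch below builds the same request as above and prints its parts without touching the network.

```python
from urllib import parse, request

form_data = {"name": "Germey"}
data = parse.urlencode(form_data).encode("utf-8")

req = request.Request("http://httpbin.org/post", data=data)
req.add_header("User-Agent", "Mozilla/4.0(compatible;MSIE5.5;Windows NT)")

print(req.get_method())              # POST (the default once data is supplied)
print(req.full_url)                  # http://httpbin.org/post
print(req.data)                      # b'name=Germey'
print(len(req.data))                 # 11

# add_header() stores names with only the first letter capitalized,
# so the lookup key is "User-agent".
print(req.get_header("User-agent"))  # Mozilla/4.0(compatible;MSIE5.5;Windows NT)

# Without data, the method defaults to GET.
print(request.Request("http://httpbin.org/get").get_method())  # GET
```

The 11-byte body corresponds to the "Content-Length": "11" seen in the httpbin output above.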

(4) Handler

Proxy

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',   # proxy for http
    'https': 'https://127.0.0.1:9743'  # proxy for https
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
print(response.read())

Because I do not have a proxy running, the result is a urllib.error.URLError.
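Since a missing or dead proxy raises URLError, it can help to wrap the opener in a function that handles the failure. This is a sketch only: open_via_proxy is a hypothetical helper name, and using the same proxy URL for both http and https is an assumption made for brevity.

```python
import urllib.request


def open_via_proxy(url, proxy_url, timeout=3.0):
    """Fetch url through a proxy; return None if the request fails.

    Hypothetical helper for illustration; not part of urllib.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except OSError as e:
        # URLError is a subclass of OSError; it carries the cause in .reason.
        print("Request failed:", getattr(e, "reason", e))
        return None
```

For example, open_via_proxy("http://www.baidu.com", "http://127.0.0.1:9743") returns None and prints the reason when nothing is listening on that port.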

Cookie

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()                   # container that will hold the cookies
handler = urllib.request.HTTPCookieProcessor(cookie)  # attach the cookie jar to a handler
opener = urllib.request.build_opener(handler)         # build the opener
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

Output:
BAIDUID=DDCB4C216AE8EE90C7D95E7AF8FA577F:FG=1
BIDUPSID=DDCB4C216AE8EE90C7D95E7AF8FA577F
H_PS_PSSID=1452_21078_26350_27111
PSTM=1536830732
BDSVRTM=0
BD_HOME=0
delPer=0
########### Saving cookies to a file ##########
import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

A cookie.txt file now appears in the project directory, with the following contents:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com TRUE   /  FALSE  3684314677 BAIDUID    CB67C520D33E28D7204C570EB7DFA28F:FG=1
.baidu.com TRUE   /  FALSE  3684314677 BIDUPSID   CB67C520D33E28D7204C570EB7DFA28F
.baidu.com TRUE   /  FALSE     H_PS_PSSID 1434_21113_26350_20930
.baidu.com TRUE   /  FALSE  3684314677 PSTM   1536831034
www.baidu.com  FALSE  /  FALSE     BDSVRTM    0
www.baidu.com  FALSE  /  FALSE     BD_HOME    0
www.baidu.com  FALSE  /  FALSE  2482910974 delPer 0
########### Another way to save cookies ##########
import http.cookiejar, urllib.request

filename = "cookies.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

This saves the same cookies as above, except that the file is written in LWP (libwww-perl) format rather than Mozilla format.
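A saved cookie file is mainly useful if it can be read back later. The following sketch shows a save and load round trip with MozillaCookieJar. To stay offline it builds one cookie by hand; make_cookie, the cookie name session_id, and the domain .example.com are all made up for this demonstration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper, for this demo only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=domain.startswith("."),
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save, using the same call as in the example above.
jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie("session_id", "abc123", ".example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Load into a fresh jar. ignore_discard=True is needed here as well,
# otherwise session cookies are skipped while loading.
restored = http.cookiejar.MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)

loaded = {c.name: c.value for c in restored}
print(loaded)  # {'session_id': 'abc123'}
```

The restored jar can then be passed to urllib.request.HTTPCookieProcessor exactly as in the examples above, so that later requests send the saved cookies.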

(5) Exception handling

from urllib import request, error

try:
    response = request.urlopen("http://cuiqingcai.com/index.htm")
except error.URLError as e:
    print(e.reason)

Output:
Not Found
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")

Output:
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 13 Sep 2018 11:08:18 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: ; rel="https://api.w.org/"
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.000000001)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print("TimeOut")

Output:
<class 'socket.timeout'>
TimeOut
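The order of the except clauses above matters because HTTPError is a subclass of URLError. The sketch below checks this offline by constructing both exceptions by hand; describe is a hypothetical helper name and the example.com URL is a placeholder.

```python
import email.message
import urllib.error

# HTTPError derives from URLError, so "except URLError" also catches it.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True


def describe(exc):
    """Summarize a urllib exception (hypothetical helper for illustration)."""
    # Test the more specific class first, as in the try/except above.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % exc.reason
    return "other error: %r" % (exc,)


# Exceptions built by hand so that the example needs no network access.
not_found = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", email.message.Message(), None
)
unreachable = urllib.error.URLError("connection refused")

print(describe(not_found))    # HTTP error 404: Not Found
print(describe(unreachable))  # URL error: connection refused
```

Only HTTPError carries code and headers; a plain URLError has just reason.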

(6) URL parsing

urlparse

urllib.parse.urlparse(urlstring, scheme="", allow_fragments=True)

from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment")
print(type(result), result)

Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i', fragment='comment')
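######## URL without a scheme ###########
from urllib.parse import urlparse

# Note: ",scheme=/https" is inside the URL string here, so it is parsed as part
# of the fragment rather than being passed as the scheme argument.
result = urlparse("www.baidu.com/index.html;user?id=5i#comment,scheme=/https")
print(result)

Output:
ParseResult(scheme='', netloc='', path='www.baidu.com/index.html', params='user', query='id=5i', fragment='comment,scheme=/https')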
######## Default scheme ###########
from urllib.parse import urlparse

# The URL already specifies http, so that is the scheme reported.
result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment,scheme=/https")
print(result)

Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i', fragment='comment,scheme=/https')
from urllib.parse import urlparse

# With allow_fragments=False the "#comment" part stays attached to the query.
result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment", allow_fragments=False)
print(result)

Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i#comment', fragment='')
from urllib.parse import urlparse

# With no query present, the "#comment" part stays attached to the path instead.
result = urlparse("http://www.baidu.com/index.htmlf#comment", allow_fragments=False)
print(result)

Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.htmlf#comment', params='', query='', fragment='')
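The scheme parameter in the urlparse signature only acts as a default. A short sketch of passing it as a keyword argument, once for a URL without a scheme and once for a URL that already has one:

```python
from urllib.parse import urlparse

# No scheme in the URL: the scheme argument is used as the default.
r1 = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(r1.scheme)        # https
print(repr(r1.netloc))  # '' (without a leading "//" the host is treated as part of the path)
print(r1.path)          # www.baidu.com/index.html

# The URL carries its own scheme: the scheme argument is ignored.
r2 = urlparse("http://www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(r2.scheme)        # http
print(r2.netloc)        # www.baidu.com

# ParseResult is a named tuple, so fields are also available by index.
print(r2[1] == r2.netloc)  # True
```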

urlunparse

from urllib.parse import urlunparse

# The six components, in order: scheme, netloc, path, params, query, fragment
data = ["http", "www.baidu.cogn", "index.html", "user", 'a=6', 'comment']
print(urlunparse(data))

Output:
http://www.baidu.cogn/index.html;user?a=6#comment
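urlunparse is the inverse of urlparse, so the two can be combined to change one component of a URL. A minimal sketch of that round trip (the URL is only an illustration):

```python
from urllib.parse import urlparse, urlunparse

url = "http://www.baidu.com/index.html;user?a=6#comment"
parts = urlparse(url)

# Parsing and then unparsing gives back the original string.
print(urlunparse(parts) == url)  # True

# ParseResult is a named tuple, so _replace() returns a modified copy.
secure = parts._replace(scheme="https")
print(urlunparse(secure))  # https://www.baidu.com/index.html;user?a=6#comment
print(secure.geturl())     # same result, via the convenience method
```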

urljoin (URL joining: components present in the second URL take precedence; anything it lacks is filled in from the first, base URL)

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/infex.php'))
print(urljoin('http://www.baidu.com', '?category=2#commen:'))
print(urljoin('www.baidu.com', '?category=2t#comment'))
print(urljoin('www.baidu.comi#comment', '?category=2'))

Output:
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/infex.php
http://www.baidu.com?category=2#commen:
www.baidu.com?category=2t#comment
www.baidu.comi?category=2
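In a crawler, urljoin is typically used to turn the relative links found on a page into absolute URLs. A small sketch of that use; the page URL and the link list are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and links, as a crawler might extract them from HTML.
page_url = "http://www.baidu.com/a/b/c.html"
links = ["d.html", "../d.html", "/d.html", "//cuiqingcai.com/FAQ.html"]

absolute = [urljoin(page_url, link) for link in links]
for link, result in zip(links, absolute):
    print(link, "->", result)

# d.html -> http://www.baidu.com/a/b/d.html             (same directory)
# ../d.html -> http://www.baidu.com/a/d.html            (one directory up)
# /d.html -> http://www.baidu.com/d.html                (from the site root)
# //cuiqingcai.com/FAQ.html -> http://cuiqingcai.com/FAQ.html  (new host, scheme from base)
```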

urlencode (converts a dictionary into GET request parameters)

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'agel': '22'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:
http://www.baidu.com?name=germey&agel=22
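urlencode also escapes characters that are not allowed in a query string, and parse_qs reverses the conversion. A short sketch with made-up parameter names:

```python
from urllib.parse import parse_qs, urlencode

# Spaces and non-ASCII text are escaped (UTF-8 by default).
query = urlencode({"q": "hello world", "lang": "中文"})
print(query)  # q=hello+world&lang=%E4%B8%AD%E6%96%87

# parse_qs() reverses the encoding; each value comes back as a list.
decoded = parse_qs(query)
print(decoded)  # {'q': ['hello world'], 'lang': ['中文']}

# doseq=True expands a list into repeated parameters.
tags = urlencode({"tag": ["a", "b"]}, doseq=True)
print(tags)  # tag=a&tag=b
```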

 

Reposted from: https://www.cnblogs.com/zhuifeng-mayi/p/9641285.html
