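Today our Putonghua proficiency test (普通话考试) scores were finally released (using Shandong as the example). With nothing to do this afternoon, I wrote some Python to scrape everyone's scores, given their names and ID card numbers. The approach is a bit brute-force; the implementation is as follows: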
import urllib.request
import urllib.parse
import re    # imported in the original script but never used
import time  # imported in the original script but never used


def get_html(txtName, txtIDCard):
    # POST the query form and return the result page as text.
    url = 'http://sd.cltt.org/Web/Login/PSCP01001.aspx'
    data = {
        'txtName': txtName,
        'txtIDCard': txtIDCard,
        'btnLogin': '查 询',
        '__VIEWSTATE': '',
        'txtStuID': '',
        'txtCertificateNO': '',
        'txtCardNO': ''
    }
    data = urllib.parse.urlencode(data).encode('utf-8')
    response = urllib.request.urlopen(url, data)
    html = response.read().decode('utf-8')
    return html


def get_result(html):
    # Each field is cut out by locating two labels on the page and trimming
    # fixed offsets that matched the markup when the script was written.
    name_start = html.find(r'姓名:')
    name_end = html.find(r'证件号:')
    name_html = html[name_start+190:name_end-186]
    id_start = html.find(r'证件号:')
    id_end = html.find(r'准考证号:')
    id_html = html[id_start+192:id_end-786]
    level_start = html.find(r'等级:')
    level_end = html.find(r'证书编号:')
    level_html = html[level_start+171:level_end-169]
    score_start = html.find(r'最终分:')
    score_end = html.find(r'等级:')
    score_html = html[score_start+173:score_end-280]
    bookid_start = html.find(r'证书编号:')
    bookid_end = html.find(r'省份:')
    bookid_html = html[bookid_start+174:bookid_end-280]
    k_start = html.find(r'准考证号:')
    k_end = html.find(r'出生日期:')
    k_html = html[k_start+173:k_end-171]
    # Print only when the name slice is short, presumably to skip failed lookups.
    if len(name_html) < 10:
        print("--------------------------------------------------------------------------------------------------------------")
        print("姓名(id):%s(%s) | 等级:%s(%s分) | 证书编号:%s | 准考证号:%s"%(name_html, id_html, level_html, score_html, bookid_html, k_html))


def main():
    names = ["张三","李四","王五","赵四","狗子","二虎"]
    ids = ["3xxxxxxxxxxxxxx2","3xxxxxxxxxxxxxx0","6xxxxxxxxxxxxxx0","3xxxxxxxxxxxxxx2","3xxxxxxxxxxxxxx1X","3xxxxxxxxxxxxxx7"]
    for i in range(len(names)):
        txtName = names[i]
        txtIDCard = ids[i]
        html = get_html(txtName, txtIDCard)
        get_result(html)


# The original listing never calls main(); this guard makes the script runnable.
if __name__ == '__main__':
    main()
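This code imitates a browser by submitting the query form as a POST request to the Putonghua test score lookup system, and it gets the relevant information back. The approach is rather blunt, but for now it retrieves the data I need.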
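To make the request step more concrete, here is a minimal sketch of how the form POST is put together, without sending anything over the network. The helper name `build_query_request` is my own and purely illustrative; the URL and field names are copied from the script above, and whether the server still accepts an empty `__VIEWSTATE` is an assumption.

```python
import urllib.parse
import urllib.request

URL = 'http://sd.cltt.org/Web/Login/PSCP01001.aspx'


def build_query_request(name, id_card):
    """Build (but do not send) the POST request for one score lookup."""
    form = {
        'txtName': name,
        'txtIDCard': id_card,
        'btnLogin': '查 询',
        '__VIEWSTATE': '',
        'txtStuID': '',
        'txtCertificateNO': '',
        'txtCardNO': '',
    }
    # urlencode percent-encodes the Chinese text as UTF-8; the body must be bytes.
    body = urllib.parse.urlencode(form).encode('utf-8')
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(URL, data=body)


req = build_query_request('张三', '3xxxxxxxxxxxxxx2')
print(req.get_method())
print(req.data[:60])
```

Passing the returned object to `urllib.request.urlopen(req, timeout=10)` would actually send it; the timeout value is only a suggestion.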
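After working out the HTML structure of the result page, the code extracts the following fields: name, ID card number, admission ticket number, final score, level, and certificate number.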
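The fixed offsets in `get_result` (for example `+190` and `-186`) only work while the page markup stays byte-for-byte the same. A less brittle variant, sketched below, takes everything between two labels and strips the tags. The helper `extract_between` and the sample markup are hypothetical; the real page layout is not shown in this post, so treat this as an idea rather than a drop-in replacement.

```python
import html as html_lib
import re


def extract_between(page, start_label, end_label):
    """Return the visible text between two labels, or None if either is missing."""
    start = page.find(start_label)
    if start == -1:
        return None
    start += len(start_label)
    end = page.find(end_label, start)
    if end == -1:
        return None
    fragment = page[start:end]
    fragment = re.sub(r'<[^>]+>', ' ', fragment)  # replace tags with spaces
    fragment = html_lib.unescape(fragment)        # decode entities such as &nbsp;
    return ' '.join(fragment.split()) or None     # collapse whitespace


# Made-up markup for illustration only.
sample = ('<td>姓名:</td><td><span> 张三 </span></td>'
          '<td>证件号:</td><td>3xxx2</td><td>准考证号:</td>')
print(extract_between(sample, '姓名:', '证件号:'))
print(extract_between(sample, '证件号:', '准考证号:'))
```

Returning `None` for a missing label also gives a cleaner "lookup failed" signal than checking the length of a slice.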
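The code uses the urllib library to handle the HTTP requests. Although re is imported, the HTML is actually parsed with str.find and fixed-offset slicing rather than regular expressions. The logic is easy to follow, but the script may run into performance problems when scraping at larger scale; in real use, add an interval between requests and an error-handling mechanism.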
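The suggestion about request intervals and error handling could look roughly like the sketch below. `fetch_all` and its parameters (`delay`, `retries`, `sleep`) are names I made up for illustration, and the default values are guesses, not anything the site documents. The fetch function is passed in, so `get_html` from the script above can be used unchanged.

```python
import time


def fetch_all(records, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(name, id_card) for each record, pausing between requests.

    Returns a list of (name, page_or_None); None marks a lookup that kept failing.
    """
    results = []
    for index, (name, id_card) in enumerate(records):
        if index:
            sleep(delay)  # space out consecutive lookups
        page = None
        for attempt in range(retries + 1):
            try:
                page = fetch(name, id_card)
                break
            except OSError as exc:
                # urllib.error.URLError, timeouts and connection errors
                # are all subclasses of OSError.
                print(f'{name}: attempt {attempt + 1} failed ({exc})')
                if attempt < retries:
                    sleep(delay * (attempt + 1))  # wait a bit longer before retrying
        results.append((name, page))
    return results
```

Inside `main()`, a call such as `fetch_all(zip(names, ids), get_html)` would replace the bare loop; each returned page can then be handed to `get_result` if it is not `None`.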
Reposted from: http://mcb.baihongyu.com/