
The Double Pinyin Input Method

I've spent the past while practicing the double pinyin (shuangpin) input method, so my typing was painfully slow at first, which is why I haven't written much here. (Yes, that must be the reason.) By now it feels fairly natural.
As for what double pinyin actually is, there's an article on that here. As for how to switch your input method, that's worth looking up yourself, since the steps can differ from device to device.
As for the different schemes, search results suggest that 小鹤双拼 (Flypy) and 自然码 (Ziranma) are the two most widely used. I went with 自然码, though honestly they're all much the same. There's also a very nice practice project here. It's not something you can master in a day or two, though; it takes sustained practice.
Once you're fluent, though, you'll start to feel the elegance of double pinyin. As a bonus, it guards your devices against intruders: if you leave your computer unlocked and step away, anyone hoping to post something from your social accounts probably doesn't know double pinyin and won't even manage to type.

Scraping Yandex Search Results

Yandex is a search engine out of Russia. Its image search can look things up by keyword or find similar images from an uploaded picture, and the results seem pretty good. This time let's try scraping the keyword-based search results.
As an example, we'll search for the word faruzan.
On the results page, open the browser's developer tools.

Generally speaking, there are two ways to scrape a site. One is to grab the HTML source and locate the target elements via XPath or CSS selectors; the other is to find the backend API directly and pull the data from there. Here we use the latter.
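For comparison, the first approach looks roughly like this. This is only a sketch: lxml is a third-party package, the XPath is a placeholder you'd replace after inspecting the real page, and Yandex may well serve a captcha to a bare request like this.

import urllib.request
from lxml import html

page = urllib.request.urlopen('https://yandex.com/images/search?text=faruzan').read()
tree = html.fromstring(page)
# the XPath below is a placeholder; find the real one with the element inspector
print(tree.xpath('//img/@src'))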
And so, watching the network panel, we notice this thing: https://yandex.com/images/search?……
The ellipsis is there because the URL is far too long.

To show good faith, we first need to put together a set of request headers.

headers = {
    'Accept-Encoding':'gzip, deflate, br, zstd',
    'Accept-Language':'zh-CN,zh;q=0.9',
    'Cookie':'i_eaabs=1; receive-cookie-deprecation=1; i=jGdn72ddAKMYgenX/nEOa2pQkc8sDCJTYOQyBmQDrRPr0zJxu88zLDWiqgcemLgRGNIWMhMWe6/tRG4CnZI+XtZp0Ks=; yandexuid=5707939341711265164; yashr=6175169221711265164; is_gdpr=0; yuidss=5707939341711265164; _ym_uid=1711358990497205648; font_loaded=YSv1; is_gdpr_b=CLj5IhDr8gEoAg==; gdpr=0; L=ZSdWZUt6QwB6f3l3AVlTWFR1QE4JXQQJNAYGHTABGjUkFw==.1711501310.15661.383778.d107972615ae981104963ffe7e5eaaa6; yandex_login=luviichann; yandex_gid=11514; my=YysBrPoA; gpb=yandex_gid.11514#ygo.21013%3A11514#ygu.0; bh=EkAiR29vZ2xlIENocm9tZSI7dj0iMTIzIiwgIk5vdDpBLUJyYW5kIjt2PSI4IiwgIkNocm9taXVtIjt2PSIxMjMiGgUieDg2IiIQIjEyMy4wLjYzMTIuMTA2IioCPzAyAiIiOgkiV2luZG93cyJCCCIxNC4wLjAiSgQiNjQiUlwiR29vZ2xlIENocm9tZSI7dj0iMTIzLjAuNjMxMi4xMDYiLCAiTm90OkEtQnJhbmQiO3Y9IjguMC4wLjAiLCAiQ2hyb21pdW0iO3Y9IjEyMy4wLjYzMTIuMTA2IloCPzA=; Session_id=3:1712302795.5.0.1711501310798:bC6PJA:11.1.2:1|1959455688.0.2.0:3.3:1711501310|11:10180069.998265.ii1jnIGJCBGOzrnmPtfSzeZEOuM; sessar=1.1188.CiAXrjSxgsZwEgqCVpiRqeRNykJXDnigCOgXV9oPgfkL4g.1KMjT0XcD1YRNSGCw6InDue_6riqt1tR27N4f5asYc0; sessionid2=3:1712302795.5.0.1711501310798:bC6PJA:11.1.2:1|1959455688.0.2.0:3.3:1711501310|11:10180069.998265.fakesign0000000000000000000; ymex=1714984978.oyu.5707939341711265164#2026718990.yrts.1711358990; bh=Ej4iR29vZ2xlIENocm9tZSI7dj0iMTIzIiwiTm90OkEtQnJhbmQiO3Y9IjgiLCJDaHJvbWl1bSI7dj0iMTIzIhoFIng4NiIiECIxMjMuMC42MzEyLjEwNiIqAj8wOgkiV2luZG93cyJCCCIxNC4wLjAiSgQiNjQiUlsiR29vZ2xlIENocm9tZSI7dj0iMTIzLjAuNjMxMi4xMDYiLCJOb3Q6QS1CcmFuZCI7dj0iOC4wLjAuMCIsIkNocm9taXVtIjt2PSIxMjMuMC42MzEyLjEwNiIi; _ym_d=1712403995; yabs-vdrf=A0; ys=wprid.1712404938048476-15457847157943776967-balancer-l7leveler-kubr-yp-vla-155-BAL; _yasc=4y2tIr1+OrUzxVk4wDX+Hfcyp8TT9TH+VHFYNJH7sMKjXCsI0fkL01hsJCAculpOU3eja06yUOAIbNpJB7Gs; _ym_isad=2; yabs-udb=Pc5oTNfXRW00; cycada=zs7KErd5i46P37SZt2Z3pOrMirwEt1ZJDo7ndolXwno=; yp=1715081628.csc.1#2027764942.pcs.0#4294967295.skin.s#1727269129.szm.1_25:1536x864:683x730#2026861310.udn.cDpMdXZpaSBDaGFubg%3D%3D#1743037729.ygo.21013%3A11514#1743037729.ygu.0#1712479378.yu.5707939341711265164',
    'Device-Memory':'8',
    'Downlink':'0.4',
    'Dpr':'1.25',
    'Ect':'3g',
    'Referer':'https://yandex.com/images/search?family=yes&text=faruzan',
    'Rtt':'450',
    'Sec-Ch-Ua':'"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'Sec-Ch-Ua-Arch':'"x86"',
    'Sec-Ch-Ua-Bitness':'"64"',
    'Sec-Ch-Ua-Full-Version':'"123.0.6312.106"',
    'Sec-Ch-Ua-Full-Version-List':'"Google Chrome";v="123.0.6312.106", "Not:A-Brand";v="8.0.0.0", "Chromium";v="123.0.6312.106"',
    'Sec-Ch-Ua-Mobile':'?0',
    'Sec-Ch-Ua-Model':'""',
    'Sec-Ch-Ua-Platform':'"Windows"',
    'Sec-Ch-Ua-Platform-Version':'"14.0.0"',
    'Sec-Ch-Ua-Wow64':'?0',
    'Sec-Fetch-Dest':'empty',
    'Sec-Fetch-Mode':'cors',
    'Sec-Fetch-Site':'same-origin',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    'Viewport-Width':'683',
    'X-Requested-With':'XMLHttpRequest',
}

From what I can observe, this API only returns a limited slice of the results per request, so we have to tailor the request for each page.

import urllib.request

def create_url(p):
    # The two captured URLs differ slightly: the p == 0 one carries a shorter
    # 'las' asset list, so it is kept verbatim as its own case.
    if p == 0:
        url = "https://yandex.com/images/search?tmpl_version=releases%2Ffrontend%2Fimages%2Fv1.1280.0%23274c929bac850df85275252016e13c3c22086c1a&format=json&request=%7B%22blocks%22%3A%5B%7B%22block%22%3A%22extra-content%22%2C%22params%22%3A%7B%7D%2C%22version%22%3A2%7D%2C%7B%22block%22%3A%7B%22block%22%3A%22i-react-ajax-adapter%3Aajax%22%7D%2C%22params%22%3A%7B%22type%22%3A%22ImagesApp%22%2C%22ajaxKey%22%3A%22serpList%2Ffetch%22%7D%2C%22version%22%3A2%7D%5D%2C%22metadata%22%3A%7B%22bundles%22%3A%7B%22lb%22%3A%22k%2BNw%7DkFub%5D%22%7D%2C%22assets%22%3A%7B%22las%22%3A%22justifier-height%3D1%3Bjustifier-setheight%3D1%3Bfitimages-height%3D1%3Bjustifier-fitincuts%3D1%3Breact-with-dom%3D1%3B231.0%3D1%3B239.0%3D1%3B1112b8.0%3D1%3B3af2a4.0%3D1%22%7D%2C%22extraContent%22%3A%7B%22names%22%3A%5B%22i-react-ajax-adapter%22%5D%7D%7D%7D&yu=5707939341711265164&family=yes&lr=11514&p=1&rpt=image&serpListType=horizontal&serpid=qIPprE9IqIMerOLJkwz0fQ&text=faruzan&uinfo=sw-1536-sh-864-ww-683-wh-730-pd-1.25-wp-16x9_1920x1080"
    else:
        url = "https://yandex.com/images/search?tmpl_version=releases%2Ffrontend%2Fimages%2Fv1.1280.0%23274c929bac850df85275252016e13c3c22086c1a&format=json&request=%7B%22blocks%22%3A%5B%7B%22block%22%3A%22extra-content%22%2C%22params%22%3A%7B%7D%2C%22version%22%3A2%7D%2C%7B%22block%22%3A%7B%22block%22%3A%22i-react-ajax-adapter%3Aajax%22%7D%2C%22params%22%3A%7B%22type%22%3A%22ImagesApp%22%2C%22ajaxKey%22%3A%22serpList%2Ffetch%22%7D%2C%22version%22%3A2%7D%5D%2C%22metadata%22%3A%7B%22bundles%22%3A%7B%22lb%22%3A%22k%2BNw%7DkFub%5D%22%7D%2C%22assets%22%3A%7B%22las%22%3A%22justifier-height%3D1%3Bjustifier-setheight%3D1%3Bfitimages-height%3D1%3Bjustifier-fitincuts%3D1%3Breact-with-dom%3D1%3B231.0%3D1%3B239.0%3D1%3B1112b8.0%3D1%3B3af2a4.0%3D1%3B151.0%3D1%3Bbde834.0%3D1%22%7D%2C%22extraContent%22%3A%7B%22names%22%3A%5B%22i-react-ajax-adapter%22%5D%7D%7D%7D&yu=5707939341711265164&family=yes&lr=11514&p=" + str(p) + "&rpt=image&serpListType=horizontal&serpid=qIPprE9IqIMerOLJkwz0fQ&text=faruzan&uinfo=sw-1536-sh-864-ww-683-wh-730-pd-1.25-wp-16x9_1920x1080"
    req = urllib.request.Request(url, headers=headers)
    return req

Here str(p) is just an integer being passed in; think of it as the page number.
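Incidentally, decoding the query string shows what this monster URL actually carries; the request parameter turns out to be a URL-encoded JSON blob. A quick sketch using the create_url above:

from urllib.parse import urlsplit, parse_qsl

# parse_qsl percent-decodes each query parameter of the generated URL
for key, value in parse_qsl(urlsplit(create_url(0).full_url).query):
    print(key, '=', value[:100])   # truncate the long ones for readability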

For the returned req, we can write another function to process it; the page number p is passed along too, so the output file can be named after it.

import gzip
import io

def get_content(req, p):
    res = urllib.request.urlopen(req)
    # read the raw response body
    data = res.read()

    # try to decompress it as gzip first
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
            content = f.read().decode('utf-8')
    except OSError:
        # not gzip-compressed; decode it directly
        content = data.decode('utf-8')

    # dump the JSON to a file named after the page number
    with open('./faruzan/fls_' + str(p) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)
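One caveat: the Accept-Encoding header we copied earlier also advertises br and zstd, which this gzip-or-plain fallback can't decode. If the decoding ever blows up, it may help to advertise gzip only before making any requests:

# assumption: dropping br/zstd keeps responses within what gzip.GzipFile can handle
headers['Accept-Encoding'] = 'gzip'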

Here we decode the data in the response and write it out to files instead of processing it on the spot; that way we can first take a look at the format of the data.
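To get a feel for the structure, pretty-printing one of the saved files helps. A minimal sketch, assuming page 0 has already been fetched:

import json

with open('./faruzan/fls_0.json', encoding='utf-8') as fp:
    obj = json.load(fp)
# re-serialize with indentation so the nesting is actually readable
print(json.dumps(obj, ensure_ascii=False, indent=2)[:2000])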


By reading through the JSON, we can find the image URLs. The process isn't hard, but it's easy to slip up, since the JSON is nested layer upon layer and awkward to eyeball. Once found, a little processing is all it takes to download the images.
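If you'd rather not walk the nesting by hand, the third-party jsonpath package can pull out every origUrl with one expression. Another sketch, again assuming page 0 has been saved:

import json
import jsonpath

with open('./faruzan/fls_0.json', encoding='utf-8') as fp:
    obj = json.load(fp)
# returns a list of every match, or False if nothing matched
urls = jsonpath.jsonpath(obj, '$..origUrl')
print(urls if urls else 'no matches')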
import json
import os

def get_info(p):
    with open('./faruzan/fls_' + str(p) + '.json', 'r', encoding='utf-8') as fp:
        obj = json.load(fp)
    img_list = obj["blocks"][1]["params"]["adapterData"]["serpList"]["items"]["entities"]
    path = "./faruzan/p-" + str(p)
    os.makedirs(path, exist_ok=True)   # don't crash if the folder already exists
    for k in img_list:
        url = img_list[k]["origUrl"]
        name = img_list[k]["alt"]
        download(url, name, path)


def download(url, name, path):
    # everything is saved with a .png extension here, whatever the real format is
    filename = path + '/' + clean_filename(name) + '.png'
    print(url, filename)
    req = urllib.request.Request(
        url,
        data=None,
        headers={
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
        }
    )
    # Record the links that fail to download (for whatever reason) instead of crashing.
    try:
        with urllib.request.urlopen(req) as res, open(filename, 'wb') as out:
            out.write(res.read())
    except Exception as e:
        with open('./faruzan/errfls.txt', 'a', encoding='utf-8') as fp:
            fp.write(url + ' Error ' + str(e) + '\n')


# Swap out characters that are not allowed in file names, and cap the length.
def clean_filename(filename):
    illegal_chars = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
    for char in illegal_chars:
        filename = filename.replace(char, '_')
    return filename[:249]

Launch!

if __name__ == "__main__":
    os.makedirs('./faruzan', exist_ok=True)   # make sure the output directory exists
    m = int(input())   # first page
    n = int(input())   # last page
    for p in range(m, n + 1):
        req = create_url(p)
        get_content(req, p)
        get_info(p)
        print(p)

Postscript

The request URL contains a parameter like serpid=qIPprE9IqIMerOLJkwz0fQ. My guess is that it's some kind of token: it changes once a day, and it also changes when you search for different things. But it doesn't seem to matter much; deliberately changing it, or deleting it outright, doesn't appear to affect the search results.
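If you want to verify that yourself, stripping a parameter out of a URL is easy with urllib.parse. A sketch:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url, name):
    # rebuild the URL with every query parameter except the named one
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k != name])
    return urlunsplit(parts._replace(query=query))

print(drop_param('https://yandex.com/images/search?text=faruzan&serpid=qIPprE9IqIMerOLJkwz0fQ', 'serpid'))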