A simple (non-rigorous) Tianyancha scraper, based on this article

I'm a college senior looking for an internship. I contacted a few companies; I had never written a crawler before and only knew the concepts. One company asked me to try writing one: use Python to extract all shareholders of a specified company from Tianyancha. I figured I'd give it a shot, and ended up with this little tool.

Analyzing the Tianyancha site with Firefox's network capture, I found that the real data only appears after the page's JavaScript runs. Some searching around suggested PhantomJS as one of the simplest ways to handle this.

So: Selenium + PhantomJS to fetch the rendered page source, and BeautifulSoup4 + lxml to parse it, all on Python 2.7.

Fetching the source with Selenium + PhantomJS

A quick Selenium tutorial ----- Python爬虫利器五之Selenium的用法
A quick PhantomJS tutorial ----- Python爬虫利器四之PhantomJS的用法

Adding a User-Agent to PhantomJS

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def driver_open():
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # Spoof a desktop Firefox user agent; PhantomJS's default UA is easy to block
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    return driver
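
To double-check that the user agent actually took effect, a quick sanity probe like this should work (my own addition; httpbin.org simply echoes back the request's User-Agent header):

driver = driver_open()
driver.get('http://httpbin.org/user-agent')
print driver.page_source  # should contain the Firefox UA string set above
driver.quit()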

Getting the page source

import time
from bs4 import BeautifulSoup

def get_content(driver, url):
    driver.get(url)
    # Wait 2 seconds for the dynamic content to render; tune this to the page's
    # load time. If it's too short, the scrape comes back incomplete and errors out.
    time.sleep(2)
    # Grab the rendered page source and parse it
    content = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    return soup

You can add a print content to quickly check whether the page content was fetched correctly.
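
The fixed time.sleep(2) is fragile: too short and the Angular content has not rendered yet; too long and every fetch wastes time. Here is a sketch of a variant using Selenium's explicit waits instead (my own variation; the ng-binding class it waits on is an assumption based on the selectors used later):

from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def get_content_waiting(driver, url, timeout=10):
    driver.get(url)
    # Block until at least one Angular-rendered element shows up, then parse
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'ng-binding'))
    )
    return BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')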

Parsing the page with BeautifulSoup4 + lxml

BeautifulSoup4 tutorial ----- Beautiful Soup 4.2.0 documentation

Next, parse the source and pull out the pieces we need.

Finding the specified company's page URL

First, the format of a Tianyancha search URL:
'http://www.tianyancha.com/search?key=' + key + '&checkFrom=searchBox'

Then parse the search results page with BeautifulSoup4.
The keyword is passed through urllib.quote() so a Chinese keyname doesn't end up mis-encoded in the URL.
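
For instance, urllib.quote() percent-escapes the UTF-8 bytes of a Chinese keyword before it goes into the query string:

import urllib
print urllib.quote('腾讯')  # -> %E8%85%BE%E8%AE%AF (assuming a UTF-8 source file)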

def search_url(keyname):
    # Look up the company page URL from a search keyword
    # (uses the module-level driver created in __main__)
    key = urllib.quote(keyname)
    search_url = 'http://www.tianyancha.com/search?key=' + key + '&checkFrom=searchBox'
    res_soup = get_content(driver, search_url)  # soup of the search results page
    ifname = res_soup.find_all(attrs={"ng-bind-html": "node.name | trustHtml"})
    name = ifname[0].text  # first hit's company name (kept around for debugging)
    ifcompany = res_soup.find_all(attrs={"ng-click": "$event.preventDefault();goToCompany(node.id);inClick=true;"})
    return ifcompany[0].get('href')  # link of the first search result

Finding the shareholder's page URL

With the company page URL in hand, PhantomJS fetches the company page source, which we then parse for the company's basic info and its full list of shareholders.

Then the user types in a shareholder's name, which we resolve to that shareholder's page URL.
Watch out for Python 2's encoding quirks here: names scraped from the page come back as unicode, while names typed into the shell arrive as byte strings, so the typed name has to be decoded before comparing.
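
A minimal illustration of the pitfall: a UTF-8 byte string never compares equal to the equivalent unicode object, so without the decode the name match silently fails:

scraped = u'马化腾'  # what BeautifulSoup hands back: unicode
typed = '\xe9\xa9\xac\xe5\x8c\x96\xe8\x85\xbe'  # what raw_input returns in a UTF-8 terminal
print typed == scraped                  # False (plus a UnicodeWarning): str vs unicode
print typed.decode('utf-8') == scraped  # True once decoded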

def get_company_info(company_url):
    # Print the company's basic info and resolve the chosen shareholder's URL
    soup = get_content(driver, company_url)  # soup of the company page
    company = soup.select('div.company_info_text > div.ng-binding')[0].get_text()
    print "----- Company name -----"
    print company
    tzfs = soup.find_all(attrs={"event-name": "company-detail-investment"})
    print "----- Shareholders -----"
    for i in range(len(tzfs)):
        tzf = ' '.join(tzfs[i].text.split())  # collapse newlines and extra spaces
        print tzf
    print "-------------"
    holder_name = raw_input('Enter a shareholder name:\n').decode('utf-8')
    for i in range(len(tzfs)):
        if [holder_name] == tzfs[i].text.split():
            break
    return 'www.tianyancha.com' + tzfs[i].get('href')
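
One fragility worth flagging: if the typed name matches no shareholder, the loop simply runs out with i left pointing at the last entry, and the wrong URL is returned. A slightly safer variant (my own sketch, using the same tags as above) would return None on a miss:

def find_holder_url(tzfs, holder_name):
    # tzfs: the shareholder tags collected in get_company_info
    for tag in tzfs:
        if [holder_name] == tag.text.split():
            return 'www.tianyancha.com' + tag.get('href')
    return None  # no shareholder with exactly that name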

Getting the shareholder's info

Finally we fetch the shareholder's information, and that's it.

def get_holder_info(human_url):
    # The scraped href has no scheme, so prepend http:// (this tripped me up for a long time)
    human_soup = get_content(driver, "http://" + human_url)
    print "------- Shareholder info -------"
    base_info = human_soup.find_all(attrs={"ng-if": "singleHumanBase"})
    print base_info[0].text
    print "----- Related people -----"
    rpersons = human_soup.find_all(attrs={"ng-if": "relatedHuman.name"})
    for i in range(len(rpersons)):
        rperson = ' '.join(rpersons[i].text.split())
        print rperson
    print "----- Related companies -----"
    rcompanys = human_soup.find_all(attrs={"ng-bind-html": "node.name | trustHtml"})
    some_info = human_soup.select('div.title')
    state_infos = human_soup.find_all(attrs={"ng-class": "initStatusColor(node.regStatus);"})
    rate_infos = human_soup.select('svg')
    base_infos = human_soup.select('div.search_base')
    for i in range(1, len(rcompanys) + 1):
        state_info = ' '.join(state_infos[i-1].text.split())
        rcompany = ' '.join(rcompanys[i-1].text.split())
        rate_info = rate_infos[i-1].text
        rate_u = u"评分"  # strip the literal "评分" (score) label from the scraped text
        base_info = ' '.join(base_infos[i-1].text.replace(rate_u, "").split())
        # each related company contributes three consecutive div.title entries
        some_info_1 = some_info[3*i - 3].text.replace(" ", "")
        some_info_2 = some_info[3*i - 2].text.replace(" ", "")
        some_info_3 = some_info[3*i - 1].text.replace(" ", "")
        print rcompany + '\n' + some_info_1 + ' ' + some_info_2 + ' ' + some_info_3 \
            + u' Status: ' + state_info + ' ' + rate_info + u' Location: ' + base_info[0:2]
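
The loop above keeps five parallel lists aligned by index, which breaks the moment one selector matches a different number of nodes than the others. When the counts do line up, zip() expresses the same idea more defensively, since it truncates to the shortest list (a sketch only, not verified against the live page; the some_info triples are left out for brevity):

for rcompany_tag, state_tag, rate_tag in zip(rcompanys, state_infos, rate_infos):
    rcompany = ' '.join(rcompany_tag.text.split())
    state_info = ' '.join(state_tag.text.split())
    print rcompany + u' Status: ' + state_info + ' ' + rate_tag.text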

Putting it all together ------ full code:

All told, this is just a single-page example, and more than a few lines could be written more cleanly; I'll improve it when I find the time.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
import time
import urllib

def driver_open():
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # Spoof a desktop Firefox user agent; PhantomJS's default UA is easy to block
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    return driver

def get_content(driver, url):
    driver.get(url)
    # Wait for the dynamic content to render; tune this to the page's load time
    time.sleep(2)
    # Grab the rendered page source and parse it
    content = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    return soup

def search_url(keyname):
    # Look up the company page URL from a search keyword
    key = urllib.quote(keyname)
    search_url = 'http://www.tianyancha.com/search?key=' + key + '&checkFrom=searchBox'
    res_soup = get_content(driver, search_url)  # soup of the search results page
    ifname = res_soup.find_all(attrs={"ng-bind-html": "node.name | trustHtml"})
    name = ifname[0].text  # first hit's company name (kept around for debugging)
    ifcompany = res_soup.find_all(attrs={"ng-click": "$event.preventDefault();goToCompany(node.id);inClick=true;"})
    return ifcompany[0].get('href')

def get_company_info(company_url):
    # Print the company's basic info and resolve the chosen shareholder's URL
    soup = get_content(driver, company_url)
    company = soup.select('div.company_info_text > div.ng-binding')[0].get_text()
    print "----- Company name -----"
    print company
    tzfs = soup.find_all(attrs={"event-name": "company-detail-investment"})
    print "----- Shareholders -----"
    for i in range(len(tzfs)):
        tzf = ' '.join(tzfs[i].text.split())  # collapse newlines and extra spaces
        print tzf
    print "-------------"
    holder_name = raw_input('Enter a shareholder name:\n').decode('utf-8')
    for i in range(len(tzfs)):
        if [holder_name] == tzfs[i].text.split():
            break
    return 'www.tianyancha.com' + tzfs[i].get('href')

def get_holder_info(human_url):
    # The scraped href has no scheme, so prepend http:// (this tripped me up for a long time)
    human_soup = get_content(driver, "http://" + human_url)
    print "------- Shareholder info -------"
    base_info = human_soup.find_all(attrs={"ng-if": "singleHumanBase"})
    print base_info[0].text
    print "----- Related people -----"
    rpersons = human_soup.find_all(attrs={"ng-if": "relatedHuman.name"})
    for i in range(len(rpersons)):
        rperson = ' '.join(rpersons[i].text.split())
        print rperson
    print "----- Related companies -----"
    rcompanys = human_soup.find_all(attrs={"ng-bind-html": "node.name | trustHtml"})
    some_info = human_soup.select('div.title')
    state_infos = human_soup.find_all(attrs={"ng-class": "initStatusColor(node.regStatus);"})
    rate_infos = human_soup.select('svg')
    base_infos = human_soup.select('div.search_base')
    for i in range(1, len(rcompanys) + 1):
        state_info = ' '.join(state_infos[i-1].text.split())
        rcompany = ' '.join(rcompanys[i-1].text.split())
        rate_info = rate_infos[i-1].text
        rate_u = u"评分"  # strip the literal "评分" (score) label from the scraped text
        base_info = ' '.join(base_infos[i-1].text.replace(rate_u, "").split())
        # each related company contributes three consecutive div.title entries
        some_info_1 = some_info[3*i - 3].text.replace(" ", "")
        some_info_2 = some_info[3*i - 2].text.replace(" ", "")
        some_info_3 = some_info[3*i - 1].text.replace(" ", "")
        print rcompany + '\n' + some_info_1 + ' ' + some_info_2 + ' ' + some_info_3 \
            + u' Status: ' + state_info + ' ' + rate_info + u' Location: ' + base_info[0:2]


if __name__ == '__main__':
    try:
        driver = driver_open()
    except Exception, e:
        print e
        raise SystemExit(1)  # no browser driver, nothing more to do
    company_name = raw_input("Enter a company name:\n")
    company_url = search_url(company_name)
    human_url = get_company_info(company_url)
    get_holder_info(human_url)
    driver.quit()  # quit() also shuts down the PhantomJS process; close() would not

Here's a run using Tencent (腾讯) as the example company:
[screenshot of the example output]