渗透测试岗位职能分析 Secer's Blog - 记录互联网安全历程与个人成长经历

0x00 目标

爬虫练习
渗透测试职位岗能需求分析
使用gevent模块，练习多线程
如果有可能，使用数据库，对岗能进行大数据分析

0x01 开始工作

智*招聘搜索关键字：渗透测试
地址：

http://sou.zhaop**.com/jobs/searchresult.ashx?jl=%E8%BE%93%E5%85%A5%E9%80%89%E6%8B%A9%E5%9F%8E%E5%B8%82&kw=%E6%B8%97%E9%80%8F%E6%B5%8B%E8%AF%95&sm=0&p=1

点击发现p参数控制页码，一共14页。

可以使用Queue，制造这些数据put进去：

def create():
    queue = Queue()
    for i in range(1,15):
        queue.put('''http://sou.zhaop**.com/jobs/searchresult.ashx?jl=\
%E8%BE%93%E5%85%A5%E9%80%89%E6%8B%A9%E5%9F%8E%E5%B8%82&kw=%E6%B8%97%E9%80%8F%E6%B5%8B%E8%AF%95&sm=0&p='''+str(i))
    return queue

接着，使用gevent，多线程的去把queue数据放入spider，让spider取url地址进行爬取和采集。
gevent部分：

def main():
    queue = create()
    gevent_pool = []
    thread_count = 5
    for i in range(thread_count):
        gevent_pool.append(gevent.spawn(spider,queue))
    gevent.joinall(gevent_pool)

spider部分:

def spider(queue):
    while not queue.empty():
        url = queue.get_nowait()
        try:
            r = requests.get(url)
            soup = bs(r.content.replace('<b>','').replace('</b>',''), 'lxml')
            jobs = soup.find_all(name='td', attrs={'class':'zwmc'})
            #job.div.a.string
            companys = soup.find_all(name='td', attrs={'class':'gsmc'})
            #company.a.string
            wages = soup.find_all(name='td', attrs={'class':'zwyx'})
            #wages.string
            locations = soup.find_all(name='td', attrs={'class':'gzdd'})
            #location.string
            for job,company,wage,location in zip(jobs,companys,wages,locations):
                # print job.div.a.string,company.a.string,wage.string,location.string
                j = job.div.a.string
                c = company.a.string
                w = wage.string
                l = location.string
                job_detail_url = job.div.a['href']
                job_detail_req = requests.get(job_detail_url)
                contents = re.findall(r'SWSStringCutStart -->(.*?)<!-- SWSStringCutEnd', job_detail_req.content, re.S)
                content = re.sub(r'<[^>]+>', '', contents[0]).replace(' ','').replace('\r\n','').replace(' ','')
                print j,c,w,l
                print content.decode('utf-8')
                sqlin(j,c,w,l,content)
        except Exception,e:
            print e
            pass

值得注意的是：

content = re.sub(r'<[^>]+>', '', contents[0])

这一行的作用是去掉所有的html标签（可以通用）

0x02获取数据

一共采集到795条信息，并存放入mysql数据库

0x03分析

本来想用python生成词云，但是网上有现成的Tools，而且效果很不错。
地址：https://timdream.org/wordcloud/
最终生成的关于就职地址的词云为：

可以明显的看到，北上广，杭州，深圳，南京等这些地区对岗位需求较高。
最终生成的关于岗能的词云为：

可以看出，岗位需要的能力主要包括：测试，经验，网络，系统，漏洞
对薪酬进行统计分析：

可以看出，薪酬在6k－9k的较多
在数据库中，统计需要python关键字的岗能需求：

一共是198项，占24.9％，其他数据分析的玩法，就靠大家自己去扩展了。欢迎来社区进行探讨。

渗透测试岗位职能分析

0x00 目标

0x01 开始工作

0x02获取数据

0x03分析

Hacking more