自建搜索引擎：一个专门服务于安全圈的工具 Secer's Blog - 记录互联网安全历程与个人成长经历

导语：前几周的时候爬下了嘶吼的全部数据，但是如果光是单纯拿到这些数据好像是没有什么用的。全部技术文章大概也就只有20M的大小，但是这20M的数据中干货还是相当多的。

闲话

前几周的时候爬下了嘶吼的全部数据，但是如果光是单纯拿到这些数据好像是没有什么用的。全部技术文章大概也就只有20M的大小，但是这20M的数据中干货还是相当多的。要想办法把这些数据利用起来，可以帮助大家进一步寻找需要的资料。于是，就有了这篇文章，基于Flask搭建搜索引擎。可以对es中的数据进行检索，并将这些文章展示出来。

这个搜索引擎的功能相对而言也是比较齐全，智能提示，容错功能,来源分类，该有的都有了。附一张搜索图：

环境安装

这一次将数据直接写入elasticsearch中，不再使用mysql保存数据，es现在是全文搜索的首选，维基百科，github，这些有名的组织都用的是elasticsearch，可见它的强大之处！

· 但是用之前需要安装java 8或java 8+以上的环境。

· 之后到github上去下载elasticsearch : git clone https://github.com/medcl/elasticsearch-rtf.git

PS: elasticsearch-rtf相比于elasticsearch就是安装了很多插件，而这些插件基本都是针对中文的处理而安装的，如果安装的是elasticsearch,还需要安装许多插件来支持对中文的支持，比如说分词器

之后cd bin目录下,linux执行linux的可执行文件，windows执行bat文件

· 安装flask

pip install flask

写入数据

想要搭建搜索引擎，在之前代码的基础上需要做一些修改，要将数据写入elasticsearch中，不在是Mysql

在爬虫的目录下新建一个models，需要models文件夹中新建一个名为elaticsearch_type_4hou.py的文件，内同如下代码，他是类似Mysql表结构的东西，专业名词叫做映射结构。其它专业等碰到再说。

PS：这里其实跟Django里面的模块声明很像。

from elasticsearch_dsl.analysis import CustomAnalyzer as _CustomAnalyzer
class CustomAnalyzer(_CustomAnalyzer):
    def get_analysis_definition(self):
        return {}
ik_analyzer = CustomAnalyzer("ik_max_word", filter=["lowercase"])#filter=["lowercase"]作用是大小写的转换
connections.create_connection(hosts=["localhost"])
class Article_4houType(DocType):
    suggest = Completion(analyzer=ik_analyzer)  #搜索建议
    image_local = Keyword()
    title = Text(analyzer="ik_max_word")
    url_id = Keyword()
    create_time = Date()
    url = Keyword()
    author = Keyword()
    tags = Text(analyzer="ik_max_word")
    watch_nums = Integer()
    comment_nums = Integer()
    praise_nums = Integer()
    content = Text(analyzer="ik_max_word")
    class Meta:
        index = "teachnical_4hou"
        doc_type = "A_4hou"

学过python web开发的看眼就知道上面这段代码的作用，跟Mysql的表结构类似，只是多了一些关键字，如:analazy="ik_max_word"。ik_max_word他是elasticsearch-rtf已经安装的插件，它的作用是对中文进行分词。举个例子，"Linux运维工程师",这样一串中文，它会将这串中文分为Linux,运维,工程师等等的，分的非常细，拿一条数据库中数据给大家看下:

suggest的内容是ik_max_word根据title,tags这两个字段的内容分割而成的，在上面的代码中，也可以看到具体设置，下面是分词过后的结果：

这里要说下，分割出来的词，过滤掉了长度小于2的字符串，就是单字，分出来一个单字是没有实际意义的，比如说，我告诉你了个“了”，你肯定不知道什么意思。。

有了分出来的词，在输入框中输入的时候让它基于Ajax来请求后台，就能实现自动补全的功能。还有一点，至于Suggest这个映射为什么不能向下面一样，Text(analyzer="ik_max_word")这样写，据说是个bug，模仿我的代码就没有问题了。完成上面代码之后，执行之后将会在es中生成一个名为teachnical_4hou的索引(索引就是就相当于mysql中的库)。

有了映射结构，就该考虑如何把数据写入es中了，跟之前写入mysql一样，在items.py中的ArticleSpider4hou这个类中添加一个save_to_es的方法。

es_4hou = connections.create_connection(Article_4houType._doc_type.using)
def save_to_es(self):
    article = Article_4houType()
    article.image_local = self[&quot;image_url&quot;]
    article.title = self[&quot;title&quot;]
    article.url_id = self[&quot;url_id&quot;]
    article.create_time = self[&quot;create_date&quot;]
    article.url = self[&quot;url&quot;]
    article.author = self[&quot;author&quot;]
    article.tags = self[&quot;tags&quot;]
    article.watch_nums = self[&quot;watch_num&quot;]
    article.comment_nums = self[&quot;comment_num&quot;]
    article.praise_nums = self[&quot;praise_nums&quot;]
    article.content = self[&quot;content&quot;]
    article.suggest = gen_suggests(es_4hou,Article_4houType._doc_type.index,((article.title,10),(article.tags,7)))
    article.save()
    return

这段代码也是比较容易理解，实例化Article_4houType这个对象之后，把获取到的值填充进去就ok了，这样就能将数据写入到es中。眼尖的肯定看到了Suggest,这一项调用了一个gen_suggests函数，这个方法究竟做了什么能把Tags,title分词之后放入es中这个先放一放，先看看piplines怎么写。直接在piplines.py中添加如下代码

#将数据写入到es中,
    def process_item(self,item,spider):
        #提升代码性能
        item.save_to_es()
        return item

现在来看之前的gen_suggests函数，这个代码也写在items中，是一个全局函数

def gen_suggests(es,index,info_tuple):
    #根据字符串生生搜索建议数据
    used_words = set() #供去重使用
    suggests = []
    for text,weight in info_tuple:
        if text:
            #调用analyze接口分析字符串
            words = es.indices.analyze(index=index,analyzer="ik_max_word",params={'filter':["lowercase"]},body=text)
            anylyzed_words = set([r["token"] for r in words["tokens"] if len(r["token"])>1])
            new_words = anylyzed_words - used_words
        else:
            new_words = set()
        if new_words:
            suggests.append({"input":list(new_words),"weight":weight})
    return suggests

第一个es为一个连接对象，第二个是index(索引名)，第三个是一个元组(因为要处理的字段不止一个,所以使用元组循环处理)，就拿当前这个例子来说，我们需要对两个字段进行分词，一个是title,一个是tags,而且，传进来的元组中需要带着weight (权重,不明白的百度下)。继续看代码,首先设置了一个Set ,它的作用是去重。举个例子，如果Title和Tags都出现了hacker这个词，谁先进来取谁，如果Title的权重为10，而Tags中又出现了hacker这个单词，直接过滤掉。肯定不能修改之前已经设置好的值。Suggest为返回列表，之后进入for循环，for循环的对象是之前传入的info_tuple，调用es的analyze来分析字符串，返回处理生成的词语列表，之后，使用列表生成式那到这些值。并把它添加到列表中(es的固定格式)。完成之后，数据就可以大量的写入，方便后面的测试。

这个过程起始并不是很复杂，都是elasticsearc的操作，都是死格式。

展示数据

接下来要做的事情就比较简单了，从elasticsearch中取数据，展示出来就好了，前端页面是我胡诌的。主要讨论后台功能的实现

实用Pycharm创建好项目之后，在项目根目录下新建一个名为moudels.py，直接将scrapy项目下的moudels.py复制过来就好，不需要任何的修改。

然后再创建一个Config.py的文件，看名字就知道里面写什么了，配置文件啊等等的，如果是连接Mysql数据库，那么在里面写的就是数据库地址啊，账号密码什么的，但是这里既然是elasticsearch，那些es的配置信息就ok了，代码如下

from elasticsearch import Elasticsearch
client = Elasticsearch(hosts=['127.0.0.1'])

开始实现主要逻辑，如果实用Pycharm创建的Flask项目，那直接写在跟你项目名一致，但是后面是py结尾的文件就可以了。写之前来看看前端Ajax是怎么写的，

每当输入框中的字符变化的时候他会将搜索框当前的内容传给后台，有两个参数

url: suggest_url + "?
s=" + searchText + "&s_type=" + $(".searchItem.current").attr('data-type'),

一个是s，他代表的是要搜索的字符串，s_type代表的是文章来源,当然，在这里传到后台的s_type一定是A4hou(这是我在html中设置的名字)

Suggest_url就是要请求的地址，实用url_for反转成视图函数

开始写后台代码，比Ajax简单

from flask import Flask,render_template,request
from moudels import Article_4houType
import json
app = Flask(__name__)
@app.route('/')
def search_index():
    return render_template("index.html")
@app.route('/suggest/')
def suggest():
    key_words = request.args.get('s','')
    type = request.args.get('s_type','')
    if "A4hou" == type:
        fuzzing = elasticsearch_search(type=Article_4houType)
        re_dates = fuzzing.return_fuzzing_search(key_words=key_words)
    return json.dumps(re_dates)

代码非常的简单，如果访问的是/根目录,返回index.html这个页面给他，之后如果在搜索框中填入数据，就会提交到后台处理，也就请求了Suggest这个函数。接收s和s_type这两个参数，之后到指定的es索引中取查找对应的值。

为了降低代码耦合性，我将各大功能全部分装在一个类中。这样，如果之后添加了别的网站的数据，这里的代码动起来就非常容易，只需要判断一下传过来的s_type，之后传入索引名称创建一个elasticsearch_search的实例类。

在项目文件夹下创建common.py的文件，写这个elasticsearch_search类

from moudels import Article_4houType,Article_anquankeType,Article_freebuf
from config import client
from datetime import datetime
import re
class elasticsearch_search(object):
    def __init__(self,type):
        self.s = type.search()
        if Article_4houType == type:
            self.index = "teachnical_4hou"
    def return_fuzzing_search(self,key_words):
        """
        模糊查询,分词匹配
        :param key_words:
        :return:re_dates
        """
        re_dates = []
        if key_words:
            s = self.s.suggest("my_suggest", key_words, completion={
                "field": "suggest",
                "fuzzy": {
                    "fuzziness": 2
                },
                "size": 10
            })
            suggestions = s.execute_suggest()
            for match in suggestions.my_suggest[0].options:
                source = match._source
                re_dates.append(source["title"])
        return re_dates

在init函数中，调用了Article_4houType.search()方法，返回了一个供查询的s对象，而且，判断了传入的Type类型，设置了索引名称。之后就是上面调用的return_fuzzing_search方法，将要查询的关键字传进来，返回一个查询到的列表。方法中的"fuzziness": 2是设置的是模糊匹配的属性。举个例子，如果你要查询Linux这个关键字，如果你输入了Linnx,他一样可以成功的模糊匹配到Linux这个关键字，但是，这个宽度如果大于2,就查不到了。（这里设置成2是相对而言比较准的）,这个方法返回的是一段json，前台接收到之后进行解析。看看效果:

最后是文章搜索详情页。代码如下

@app.route('/search/')
def search():
    #文章来源
    all_options = [["all","全部"],["A4hou","嘶吼"]] 
    key_words = request.args.get('q','')
    types = request.args.get('s_type','')
    page = request.args.get('p','1')
    try:
        page = int(page)
    except:
        page = 1
    if "A4hou" == types:
        search_obj = elasticsearch_search(type=Article_4houType)
        response, last_seconds = search_obj.get_date(key_words=key_words, page=page)
        total_nums = response["hits"]["total"]
        all_hits = search_obj.analyze_date(key_words, response)
    x = get_elasticsearch_data_count()
    alldate_nums = x.return_count()
    if (page%12) > 0:
        page_nums = int(total_nums/12)+1
    else:
        page_nums = int(total_nums/12)
    return render_template("result.html",
                           alldate_nums = alldate_nums,
                           page=page,
                           all_hits=all_hits,
                           key_words=key_words,
                           total_nums=total_nums,
                           page_nums=page_nums,
                           last_seconds=last_seconds,
                           type = types,
                           all_options = all_options
                           )

开头的列表是用来显示文章来源，如果之后有了别的网站的文章数据可以直接在当前列表中添加。其实用字典是更好的，但是我不明白字典为什么会乱序，导致每一次刷新列表的值都是随机的。all_options列表会直接返回给前端页面，中间并没有对它进行处理。接下来key_words,types,page这三个参数，跟之前相比就是多了一个page参数，page代表页码，如果这些文章全部显示在一页中。啧啧。

之后调用了get_date方法来获取文章数据，需要的两个参数是key_words和page，page的作用是限制查询条数。之后在elasticsearch_search类中添加一个新的方法get_date

def get_date(self,key_words,page):
    """
    从elasticsearch中获取数据
    :param key_words:
    :param page:
    :return: response
    :return : last_seconds
    """
    start_time = datetime.now()
    response = client.search(
        index = self.index,
        body = {
            "query":{
                "multi_match":{
                    "query":key_words,
                    "fields":["tags","title","content"]
                }
            },
            "from":(page-1)*12,
            "size":12,
            "highlight":{
                "pre_tags":['<span>'],
                "post_tags":['</span>'],
                "fields":{
                    "title":{},
                    "content":{},
                }
            }
        }
    )
    end_time = datetime.now()
    last_seconds = (end_time - start_time).total_seconds()
    return response,last_seconds

在body中的是查询的一个结构，虽然这个结构很复杂，但是我觉的比sql语句友好的多，

· query代表查询的关键字，

· query 中的fields是要在那几个字段中查询当前出现的值，

· from,size看一眼就知道，当然是开始和结束，因为每夜显示的条数是12，所以除了个12

· highlight高亮显示，pre_tags代表开始标签，post_tags代表结束标签

· highlight中的fields代表返回字段

highlight其实就是做了关键字标红处理，看一下演示

这个方法会把查询返回的response和查询所花的时间返回

total_nums = response["hits"]["total"]
all_hits = search_obj.analyze_date(key_words, response)

total_nums会取出匹配到数据的总条数

之后调用analyze_date方法来分析数据，因为response也是一大串，来看下

这里只取了3条，需要按照这个结构来解析返回的response，在elasticsearch_search类中添加analyze_date方法

def analyze_date(self,key_words,response):
    """
    返回分析后的数据列表集合
    :return:hit_list
    """
    hit_list = []
    for hit in response["hits"]["hits"]:
        hit_dict = {}
        hit_dict["origin"] = self.get_origin(hit)
        if "highlight" in hit:
            if "title" in hit["highlight"]:
                hit_dict["title"] = "".join(hit["highlight"]["title"])
            else:
                hit_dict["title"] = hit["_source"]["title"]
            if "content" in hit["highlight"]:
                hit_dict["content"] = self.filter_tags("".join(hit["highlight"]["content"]))
                hit_dict["content"] = hit_dict["content"][:500]
            else:
                hit_dict["content"] = self.filter_tags(hit["_source"]["content"])
                hit_dict["content"] = hit_dict["content"][:500]
        else:
            hit_dict["title"] = hit["_source"]["title"]
            hit_dict["content"] = self.filter_tags(hit["_source"]["content"][:500])
        hit_dict["create_date"] = hit["_source"]["create_time"]
        hit_dict["url"] = hit["_source"]["url"]
        hit_dict["score"] = hit["_score"]
        replace_text = '<span>' + key_words + "</span>"
        words = "(?i)"+key_words
        hit_dict["content"] = re.sub(words,replace_text,hit_dict["content"])
        hit_list.append(hit_dict)
    return hit_list

从上面的图中可以看到返回的数据其实是一个列表，对它进行for循环遍历就好。之后挨个去取值就ok了，不做过多的解释。

这个方法中调用了get_origin()方法,作用是获取文章来源地址，代码实现如下

def get_origin(self,hit):
    """
    获取文章来源
    :return:index_name 来源名称
    """
    if "_index" in hit:
        if "teachnical_4hou" == hit["_index"]:
            origin = "嘶吼"
    else:
        origin = "未知来源"
    return origin

还有filter_tags这个方法，作用是过滤html标签，这里就不给出，上面的代码有一点小问题就是返回的response中起始已经做了高亮处理，但是为了界面的美化，过滤了一遍html，之后在加上高亮标签。如果不这样做，你看到的文章内容将是乱七八糟一坨html代码。

在回到项目主文件往下看

x = get_elasticsearch_data_count()
alldate_nums = x.return_count()
if (page%12) > 0:
    page_nums = int(total_nums/12)+1
else:
    page_nums = int(total_nums/12)

又写了个名为get_elasticsearch_data_count()类，这个类作用是返回当前数据库中的全部数据条数，之后展示到页面中，

class get_elasticsearch_data_count(object):
    def __init__(self):
        self.index = []
        self.counts = []
        self.index.append("article_anquanke")
        self.index.append("teachnical_4hou")
        self.index.append("teachnical_freebuf")
        for index in self.index:
            count = self.__get_datecount(index)
            self.counts.append(count)
        self.counts.append(self.counts[0]+self.counts[1]+self.counts[2])
    def __get_datecount(self,index):
        response = client.count(index)
        return response["count"]
    def return_count(self):
        return self.counts

之后把这些数据全部返回到html页面中就ok了。flask跟Django填充数据很像，而且都非常简单，不做过多的介绍。

总结

代码只是实现了最简单的功能，很多问题都是没有处理的。还有，千万别问这个有什么用，除了好玩以外，是没有一点用。T_T 。如果还有问题的话发邮箱[email protected],源代码点这里，

数据来源的爬虫点这里

最后一张成品图

PS：千万不要盯着那个圈看，虽然挺好看，也挺好玩。但是会晕，伤眼睛!

源链接

Hacking more

...

#attack #hack