Recently I was investigating ways of implementing custom search in the CMS I am working on at the moment, and part of my conclusions was interesting enough that I thought it deserved its own minor blog post.
To get the content scraped from your domains/websites:
Scrapy – It provides a fairly elegant and quick-to-implement way to crawl, in Python, all the data you want to index in your search. I was thinking of using it to scrape everything from the www.ANYTHING.com domain, for example. Since the output you get is full of raw HTML and it's not that easy to boil it down to only the relevant data, I included the next addition to the mini-stack.
BeautifulSoup version 4 – A smaller Python lib that makes a great combination with Scrapy. It helps you turn that pesky HTML description of a web page into something manageable, like the full text of the page, the titles, etc…
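The cleaning step could look something like this. The `extract_page` helper and the exact fields it keeps are assumptions for illustration; the BeautifulSoup calls themselves are the library's standard API:

```python
from bs4 import BeautifulSoup


def extract_page(html):
    """Reduce raw scraped HTML to the fields worth indexing (hypothetical shape)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style noise before pulling out the visible text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    # Collapse whitespace so the stored text is one clean string.
    text = " ".join(soup.get_text(separator=" ").split())
    return {"title": title, "text": text}


html = (
    "<html><head><title>Example</title><script>var x = 1;</script></head>"
    "<body><p>Hello world</p></body></html>"
)
print(extract_page(html))  # → {'title': 'Example', 'text': 'Example Hello world'}
```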
After we get all the data and Soupify it, it's nice to put it in some kind of DB to make it easier to search through. And what's more appropriate than something new and flashy…
Elasticsearch – All the dumped and normalised data from the previous steps gets stored nicely in ES, making it easy to retrieve.
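The indexing step could be sketched like this. The document shape, the `pages` index name, and using the URL as the document id are all assumptions on my part; the `es.index` call is the official elasticsearch-py client's API:

```python
def to_es_doc(page):
    """Shape a cleaned, scraped page into the document we store in ES.

    The field names here are made up for this example.
    """
    return {
        "url": page["url"],
        "title": page["title"],
        "text": page["text"],
    }


def index_page(es, page):
    """Store one page in Elasticsearch.

    `es` is assumed to be an elasticsearch.Elasticsearch client, e.g.
    Elasticsearch("http://localhost:9200"); the "pages" index name is
    a placeholder. Using the URL as the id makes re-crawls overwrite
    old copies instead of duplicating them.
    """
    doc = to_es_doc(page)
    es.index(index="pages", id=doc["url"], document=doc)
```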
And now the only thing left on the frontend is to fetch the data and make the appropriate queries to match your search requirements, and voila: you have something kinda manageable for that pesky search that doesn't always work as it should.
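One way to build such a query, sketched under my own assumptions: `multi_match` is a standard Elasticsearch query type, but the `title`/`text` field names and the `^2` boost on the title (to nudge title hits up the relevance ranking) are choices I made up for this example:

```python
def build_search_query(terms):
    """Build a query body searching the hypothetical title/text fields."""
    return {
        "query": {
            "multi_match": {
                "query": terms,
                # Boost title matches over body-text matches (assumed fields).
                "fields": ["title^2", "text"],
            }
        }
    }


print(build_search_query("custom cms search"))
```

A body like this would then be handed to the client's search call against whatever index the pages were stored in.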