Web Crawling & Analytics Case Study – Database Vs Self Hosted Message Queuing Vs Cloud Message Queuing

image00

The Business Problem:

 To build a repository of used car prices and identify trends based on data available from used car dealers. The solution to the problem necessarily involved building large scale crawlers to crawl & parse thousands of used car dealer websites everyday.

1st Solution: Database Driven Solution to Crawling & Parsing

Our initial Infrastructure consisted of a crawling, parsing and database insertion web services all written in Python. When the crawling web service finishes with crawling a web site it pushes the output data to the database & the parsing web service picks it from there & after parsing the data, pushes the structured data into the database.

 

arch-with-db

 Problems with Database driven approach:

  • Bottlenecks: Writing the data into database and reading it back proved be a huge bottleneck and slowed down the entire process & left to high & low capacity issues in the crawling & parsing functions.

  • High Processing Cost: Due to the slow response time of many websites the parsing service would remain mostly idle which lead to a very high cost of servers & processing.

We tried to speed up the process by directly posting the data to the parsing service from crawling service but this resulted in loss of data when the parsing service was busy. Additionally, the approach presented a massive scaling challenge from read & write bottlenecks from the database.

2nd Solution: Self Hosted / Custom Deployment Using RabbitMQ

arch-with-rabbit-mq

To overcome the above mentioned problems and to achieve the ability to scale we moved to a new architecture using RabbitMQ. In the new architecture crawlers and parsers were Amazon EC2 micro instances. We used Fabric to push commands to the scripts running in the instances. The crawling instance would pull the used car dealer website from the website queue, crawl the relevant pages and push output the data to a crawled pages queue.The parsing instance would pull the data from the crawled pages queue, parse them and push data into parsed data queue and a data base insertion script would transfer that data into Postgres.

 This approach speeded up the crawling and parsing cycle. Scaling was just a matter of adding more instances created from specialized AMIs.

Problems with RabbitMQ Approach:

  • Setting up, deploying & maintaining this infrastructure across hundreds of servers was a nightmare for a small team

  • We suffered data losses every time there was a deployment & maintenance issues. Due to the tradeoff we were forced to make between speed and persistence of data in RabbitMQ, there was a chance we lost some valuable data if the server hosting RabbitMQ crashed.

3rd Solution: Cloud Messaging Deployment Using IronMQ & IronWorker

The concept of having multiple queues and multiple crawlers and parsers pushing and pulling data from them gave us a chance to scale the infrastructure massively. We were looking for solutions which could help us overcome the above problems using a similar architecture but without the headache of deployment & maintenance management.

The architecture, business logic & processing methods of using Iron.io & Ironworkers were similar to RabbitMQ but without the deployment & maintenance efforts. All our code is written in python and since Iron.io supports python we could set up the crawl & parsing workers and queues within 24 hours with minimal deployment & maintenance efforts. Reading and writing data into IronMQ is fast and all the messages in IronMQ are persistent and the chance of losing data is very less.


arch-with-iron-io

COMPARISON MATRIX BETWEEN DIFFERENT APPROACHES

Key Variables Database Driven Batch Processing Self Hosted – RabbitMQ Cloud Based – Iron MQ
Speed of processing a batch Slow Fast Fast
Data Loss from Server Crashes & Production Issues Low Risk Medium Risk Low Risk
Custom Programming for Queue Management High Effort Low Effort Low Effort
Set Up for Queue Management NA Medium Effort Low Effort
Deployment & Maintenance of Queues NA High Effort Low Effort

(Via DataScienceCentral.com)

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s