The Business Problem:
To build a repository of used car prices and identify trends based on data available from used car dealers. Solving the problem necessarily involved building large-scale crawlers to crawl and parse thousands of used car dealer websites every day.
1st Solution: Database Driven Solution to Crawling & Parsing
Our initial infrastructure consisted of crawling, parsing and database insertion web services, all written in Python. When the crawling web service finished crawling a website, it pushed the output data to the database. The parsing web service picked the data up from there and, after parsing it, pushed the structured data back into the database.
Problems with Database driven approach:
Bottlenecks: Writing the data into the database and reading it back proved to be a huge bottleneck. It slowed down the entire process and led to periods of excess and insufficient capacity in the crawling and parsing functions.
High Processing Cost: Due to the slow response time of many websites, the parsing service would remain mostly idle, which led to a very high cost of servers and processing.
We tried to speed up the process by posting the data directly from the crawling service to the parsing service, but this resulted in loss of data whenever the parsing service was busy. Additionally, the approach presented a massive scaling challenge because of read and write bottlenecks in the database.
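The data-loss failure mode of direct posting can be illustrated with a small sketch. This is not the original service code: `BusyParser`, its `capacity` and `post_directly` are hypothetical names, and a bounded in-memory queue stands in for a parsing service that rejects requests while it is busy.

```python
import queue


class BusyParser:
    """Hypothetical stand-in for a parsing service with limited capacity."""

    def __init__(self, capacity):
        # Pages the parser has accepted but not yet processed.
        self.inbox = queue.Queue(maxsize=capacity)

    def post(self, page):
        """Accept a page, or return False if the parser is busy."""
        try:
            self.inbox.put_nowait(page)
            return True
        except queue.Full:
            return False


def post_directly(parser, pages):
    """Post each crawled page once, with no buffer and no retry.

    Returns the pages that were rejected, i.e. lost.
    """
    return [page for page in pages if not parser.post(page)]


parser = BusyParser(capacity=2)
lost = post_directly(parser, ["page-1", "page-2", "page-3", "page-4"])
print(lost)
```

With nothing between the two services to hold pages while the parser catches up, every page posted during a busy period is simply dropped.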
2nd Solution: Self Hosted / Custom Deployment Using RabbitMQ
To overcome the above-mentioned problems and to gain the ability to scale, we moved to a new architecture using RabbitMQ. In the new architecture, crawlers and parsers ran on Amazon EC2 micro instances. We used Fabric to push commands to the scripts running on the instances. The crawling instance would pull a used car dealer website from the website queue, crawl the relevant pages and push the output data to a crawled pages queue. The parsing instance would pull the data from the crawled pages queue, parse it and push the result into a parsed data queue, and a database insertion script would transfer that data into Postgres.
This approach sped up the crawling and parsing cycle. Scaling was just a matter of adding more instances created from specialized AMIs.
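The three-queue flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: Python's in-process `queue.Queue` and threads stand in for RabbitMQ queues and EC2 instances, a plain list stands in for Postgres, and `crawl`, `parse` and `run_pipeline` are hypothetical placeholders.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def crawl(site):
    """Placeholder crawler: returns fake raw pages for a dealer site."""
    return [f"<html>{site} listing {i}</html>" for i in range(2)]


def parse(page):
    """Placeholder parser: turns a raw page into a structured record."""
    return {"source": page, "raw_length": len(page)}


def crawler_worker(website_q, crawled_q):
    # Pull a site from the website queue, push its pages to the crawled queue.
    while True:
        site = website_q.get()
        if site is STOP:
            break
        for page in crawl(site):
            crawled_q.put(page)


def parser_worker(crawled_q, parsed_q):
    # Pull a raw page from the crawled queue, push the record to the parsed queue.
    while True:
        page = crawled_q.get()
        if page is STOP:
            break
        parsed_q.put(parse(page))


def run_pipeline(sites, n_crawlers=2, n_parsers=2):
    website_q, crawled_q, parsed_q = queue.Queue(), queue.Queue(), queue.Queue()
    crawlers = [threading.Thread(target=crawler_worker, args=(website_q, crawled_q))
                for _ in range(n_crawlers)]
    parsers = [threading.Thread(target=parser_worker, args=(crawled_q, parsed_q))
               for _ in range(n_parsers)]
    for worker in crawlers + parsers:
        worker.start()

    for site in sites:
        website_q.put(site)
    for _ in crawlers:          # one stop signal per crawler
        website_q.put(STOP)
    for worker in crawlers:
        worker.join()
    for _ in parsers:           # crawlers are done, so parsers can be stopped
        crawled_q.put(STOP)
    for worker in parsers:
        worker.join()

    # Insertion step: drain the parsed queue into a list standing in for Postgres.
    rows = []
    while not parsed_q.empty():
        rows.append(parsed_q.get())
    return rows


records = run_pipeline(["dealer-a.example", "dealer-b.example"])
print(len(records))
```

Because each stage only talks to a queue, adding capacity means raising `n_crawlers` or `n_parsers`, which mirrors launching more instances from an AMI.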
Problems with RabbitMQ Approach:
Setting up, deploying and maintaining this infrastructure across hundreds of servers was a nightmare for a small team.
We suffered data losses every time there was a deployment or maintenance issue. Because of the tradeoff we were forced to make between speed and persistence of data in RabbitMQ, there was a chance of losing valuable data if the server hosting RabbitMQ crashed.
3rd Solution: Cloud Messaging Deployment Using IronMQ & IronWorker
The concept of having multiple queues and multiple crawlers and parsers pushing and pulling data from them gave us a chance to scale the infrastructure massively. We were looking for solutions which could help us overcome the above problems using a similar architecture but without the headache of deployment & maintenance management.
The architecture, business logic and processing methods with IronMQ and IronWorker were similar to those with RabbitMQ, but without the deployment and maintenance effort. All our code is written in Python, and since Iron.io supports Python, we could set up the crawling and parsing workers and queues within 24 hours with minimal deployment and maintenance effort. Reading and writing data in IronMQ is fast, all messages in IronMQ are persistent, and the chance of losing data is very low.
COMPARISON MATRIX BETWEEN DIFFERENT APPROACHES
| Key Variables | Database Driven Batch Processing | Self Hosted – RabbitMQ | Cloud Based – IronMQ |
|---|---|---|---|
| Speed of processing a batch | Slow | Fast | Fast |
| Data Loss from Server Crashes & Production Issues | Low Risk | Medium Risk | Low Risk |
| Custom Programming for Queue Management | High Effort | Low Effort | Low Effort |
| Set Up for Queue Management | NA | Medium Effort | Low Effort |
| Deployment & Maintenance of Queues | NA | High Effort | Low Effort |