The Cloud and Hyperix Search


A lot has been written about cloud computing in the last year, and each day seems to bring news of a new player in the cloud arena. So what does the cloud have to offer search engine companies like Hyperix? Well, that depends on how deep our pockets are. After all, we need a lot of bandwidth, processing power and data storage to run any real search engine. And as we don't have deep pockets, nor an angel investor or venture firm backing us, we've had to find creative solutions and innovate where possible.

Up to this point we've been focusing solely on the technology that will differentiate us from other vertical search platforms out there. We've set up our own small web crawling cluster, which we've used for some time to test different web crawlers, collect and parse data, and measure a variety of web crawler values. Those values determine how many CPU cycles and how much RAM, bandwidth, and storage are necessary to create the vertical search indexes we want. We've also been focusing on the quality of the data we're crawling, the algorithm which ranks the pages crawled, the parsing engines, and the results pages.

Having determined a baseline for what it costs to crawl on our own, we're now comparing that with web crawling on Amazon's Elastic Compute Cloud (Amazon EC2). Once we've compared the two, we'll decide which to use as we move forward with our production web crawls. We would prefer to use our own hardware, but the cost can be prohibitive. You would think that at some point it would make financial sense to run the crawls on your own hardware, but until we actually test a crawl on Amazon EC2 we won't know the true costs. And while we could just crunch numbers in Amazon's calculator, anyone who's ever done crawling knows that many variables determine how long a crawl will take, how much RAM it will use, and how many CPUs and nodes are required to crawl efficiently.
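To make the comparison concrete, here is a minimal sketch in Python of the kind of back-of-the-envelope arithmetic involved. Everything in it is an assumption for illustration: the `CrawlEstimate` fields, the two cost functions, and all of the rates are hypothetical placeholders, not our measured values and not Amazon's actual prices. The point is only that the answer swings on the measured crawl duration and node count, which is why a real test crawl matters more than a calculator.

```python
from dataclasses import dataclass


@dataclass
class CrawlEstimate:
    """Values measured (or guessed) for one crawl. All hypothetical."""
    nodes: int             # machines running the crawler
    hours: float           # wall-clock duration of the crawl
    gb_transferred: float  # data pulled down during the crawl


def cloud_cost(crawl, hourly_rate_per_node, price_per_gb):
    """Pay-as-you-go cost: instance hours plus bandwidth."""
    instance_cost = crawl.nodes * crawl.hours * hourly_rate_per_node
    return instance_cost + crawl.gb_transferred * price_per_gb


def owned_cost(crawl, hardware_price_per_node, lifetime_crawls,
               running_cost_per_node_hour):
    """Own hardware: purchase price spread over the crawls the hardware
    is expected to run, plus power and hosting while the crawl runs."""
    amortized = crawl.nodes * hardware_price_per_node / lifetime_crawls
    running = crawl.nodes * crawl.hours * running_cost_per_node_hour
    return amortized + running


# Placeholder numbers only; substitute values from a real test crawl.
crawl = CrawlEstimate(nodes=4, hours=50, gb_transferred=200)
cloud = cloud_cost(crawl, hourly_rate_per_node=0.10, price_per_gb=0.10)
owned = owned_cost(crawl, hardware_price_per_node=1000,
                   lifetime_crawls=100, running_cost_per_node_hour=0.05)
print(f"cloud: {cloud:.2f}  owned: {owned:.2f}")
```

The sketch leaves out bandwidth on the owned side and any storage charges, so it understates both totals. Doubling `hours` or `nodes` in the estimate shifts the two results by different amounts, which is exactly the sensitivity a test crawl is meant to pin down.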

Aside from web crawling, there's the search side of the equation. Some search engines use Amazon's web services not only to crawl for data, but also to serve up their searches. We've determined that Amazon's services, as offered, don't give us a cost-effective solution for our search needs. This primarily has to do with our search indexes. When users search our vertical search niches, they'll be querying our indexes, which are held in memory. We've come up with an innovative solution that a) dramatically reduces our memory costs, b) is faster than current index searching, and c) is cheaper for us to run on our own hardware. The innovation, which is theoretical at this point, will be tested for the first time later this year, but we are confident it will work.

So for us, at this time, cloud computing on Amazon EC2 may be useful for web crawling, but not for our search servers.
