Recently in Search Engine Technology Category


It's cool to be Cuil today. Cuil Inc. launched its new search alternative to Google today. Cuil, pronounced "cool," has received lots of press today, and it helps when it's in the right places. If it weren't for the fact that the principals have a history of adding value to existing search products like Google search, this rollout would hardly be noticed.

But the fact that they have a track record, worked at Google, and boast an index bigger than Google's is newsworthy. Cuil is led by Anna Patterson, a former engineer at Google. Along with her husband Tom Costello, a search expert in his own right, she aims to take on Google. No small feat.

But having a bigger index doesn't mean you're better. Only time will tell if they have what it takes to carve out a piece of the big search pie. They claim to be able to search across 120 billion web pages, compared to an estimated 40 billion for Google. Google does not officially reveal how many pages it indexes, but other sources suggest that it keeps an index of around 60 billion pages. Google also says that not all of the pages it crawls are indexed because many are duplicates. Working in this industry, I can concur that there is a lot of duplicate content out there.

For Cuil to take some market share away from Google it will take more than the boasting of a bigger index. Reality is, with enough hardware and money a startup can build an index that is big, even huge as Cuil has. The test of whether Cuil can succeed will be if the public and business users find more relevant search results through Cuil. Being as big or fast as Google is not enough. You have to be able to change people's search preferences. And that's not easy.

What is noteworthy is that Cuil says they've developed a faster, better way to index pages and, just as important, one that uses less hardware. Less hardware matters because the cost to index, store and serve up results can be prohibitive. The ongoing decline in the cost of hard drives, CPUs, etc. helps. However, even though RAM prices have come down, RAM is still one of the most expensive aspects of creating a searchable index.

In my initial tests of Cuil I was both pleased and disappointed with the results. Some common searches returned no results. I'll attribute that to first-day bugs. But I also found that sources like Wikipedia were heavily weighted, sometimes over the actual site that I was looking for.

It's public day 1 for Cuil and they have people's attention. Let's see if they can keep it and build some momentum. In the meantime I'll give them a try and report back with my thoughts in the near future.

I have a secret: for the last couple of months, as a side project, we've been crawling Twitter with the idea of creating a small niche vertical search of tweets. But the more I come across cool applications like Twitterholic, Tweetstats, Twubble, Tweet Scan, twemes, etc., the more I think we can do more with our data. So my question to anyone caring to answer is: if you had a rockin' application you'd like to see built for Twitter, what would it be?

You never know, we might just build it.

Yahoo Search Blog
In his latest entry on the Yahoo Search Blog, Vish Makhijani discusses "Yahoo! Search An Open Approach to Search". This post builds on last week's announcement of the largest Hadoop production application, and I love it. It's innovative, especially for content producers. They (we) finally get a say in the output of Yahoo's search results like never before. Whether you're a content producer or a searcher, you can sign up for more information here.

"Because the platform is open it gives all Web site owners -- big or small -- an opportunity to present more useful information on the Yahoo! Search page as compared to what is presented on other search engines. Site owners will be able to provide all types of additional information about their site directly to Yahoo! Search. So instead of a simple title, abstract and URL, for the first time users will see rich results that incorporate the massive amount of data buried in websites -- ratings and reviews, images, deep links, and all kinds of other useful data -- directly on the Yahoo! Search results page."

Hadoop
Some exciting news today from Eric Baldeschwieler, Senior Director, Grid Computing, on the Yahoo Developer Network: Yahoo! Launches World's Largest Hadoop Production Application. I'll note that my company, Hyperix, is using Hadoop for our vertical search platform.

Here are some of the stats:

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes




Cloud Computing Thoughts


ReadWriteWeb has a good article on cloud computing today.

The article, Reaching for the Sky Through The Compute Clouds, was written with the Amazon Web Services outage last Friday fresh in our minds. I'm a big proponent of cloud computing as it's the only way, in my opinion, to truly scale large data-driven applications such as search, which is what I'm working on.

"So is it really true - is cloud computing a bad idea? Of course not. It is a wonderful, powerful idea. In this post, we explore the ideas behind cloud computing and argue that it will be an integral part of our future."

"Do Clouds Really Work?

You bet! The best example is Google. The king of the web is reigning with a farm of hundreds of thousands, if not millions of boxes. To race along with the web, Google constantly increases the size of its cloud, incorporating new web sites, and expanding its index.

Of course, Google isn't the only one operating in a cloud. All major web players including Amazon, eBay, Yahoo! and Facebook are running some sort of massive computing cloud."

O'Reilly Money:Tech Conference
Next week I'm headed to New York for the O'Reilly Money:Tech Conference, billed as "Where Web 2.0 Meets Wall Street." The conference speaker list is impressive. I see this conference as a potential opportunity to extend Hyperix Search into the vertical finance segment.

Over at Mashable they're reporting today that:

"Reuters will publicly discuss their new initiative around a social network for traders; LinkedIn will introduce a new way for members to create connections and networks with experts by industry; Stormwatch will come out of stealth mode to unveil a project in the financial service industry related to tracking sentiment indicators in real time; Eventvestor, a new ad-driven service for aggregating and tracking financial events, will launch; Motley Fool will present evidence that the collective analysis of the community, and their own analysis, is wisdom not to be dismissed."

And if you want to go, Mashable has a discount coupon to save you 20% off the registration.

Is MapReduce a step backwards?


Greg Linden has an excellent post about MapReduce and how some database gurus view it as "a giant step backwards". Personally I don't buy it, and we're implementing something similar for Hyperix. Here's an excerpt of what Greg had to say.

"The comments on the post are enjoyable and useful. Many rightfully point out that it might not be fair to compare a system like MapReduce to a full database. DeWitt and Stonebraker do partially address this, though, by not just limiting their criticism to GFS, but also going after BigTable.

The most compelling part of the post for me is their argument that some algorithms require random access to data, something that is not well supported by GFS, and it is not always easy or efficient to restructure those algorithms primarily to do sequential scans."

A lot of notable writers are touting 2008 as the year Vertical Search Engines really hit the mainstream. I'm a believer; otherwise I wouldn't be working on a vertical search platform.

Here are some of the posts from the last couple of days:

From AltSearchEngines - 75 per cent of online publishers see vertical search as way to reclaim online community from Google

"Nearly three quarters of online publishers see the benefit of developing vertical search engines as a way to claw back online communities from Google, a study published last month has claimed."

From John Battelle's Blog - Blekko
"The web is big. Really, really big. It's literally billions and billions of pages. It's Carl Sagan big. And it's doubling in size every year or two.

So the idea that what you can see in positions 1-3 above the fold on Google are the sum of what the web has to say about every possible query is crazy.

And yet they have 85%+ market share, and little effective competition. At the same time there is such a fabulous business in search. It's the highest monetization service on the web, by far."


As some of you may know, one of my other projects is Project Phoenix, which is an effort of my newly created company, Hyperix Search, Inc. We're focusing on a vertical search platform. One important aspect of any serious search platform is the need for a distributed file system and its map and reduce operations.

MapReduce is a programming model and an associated implementation for processing and generating large data sets, and was originally created by Google. For some time now other companies and projects have been trying to replicate and improve on the process that Google created. This video takes you through Microsoft's version of MapReduce, called Dryad.
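For readers who haven't seen the programming model before, here is a minimal single-machine sketch in Python of the classic word count example. It only illustrates the idea of a map step, a grouping step and a reduce step. It is not Google's, Hadoop's or Dryad's implementation, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key (a real framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a dict of {doc_id: text}."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy input; a real job would read from a distributed file system.
    docs = {"page1": "search is big", "page2": "vertical search is niche"}
    print(word_count(docs))
```

In a real cluster the map and reduce calls run in parallel on many machines and the grouping step moves data across the network, but the shape of the program stays the same.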


Recently Read/WriteWeb started a new feature on alternative search engines. The editor is Charles Knight and his latest Top 100 list for August is out.

"Some of them did not even exist a year ago. One of my goals is to show my readers the “latest and the greatest” search engine innovations. The motto for the blog [ASE], after all, is “the most wonderful search engines you’ve never seen,” and my favorite comment of all is, “Wow! I didn’t even know that most of these existed!”

Once I have our first vertical search engine in production it will be fun to watch how it fares on the top 100 list.