First Use of PowerMouse

I was finally accepted into the PowerSet PowerLabs beta program to give their natural language search technology a whirl. As far as I can tell, they’re not in “beta” or even “alpha” for any specific product line right now. Instead, they’ve released some demoware showcasing some of their capabilities.

Among the demos they’ve put up are structured queries for content related to business, arts, and quotes. Each of these allows the user to select one of a dozen or so canned query formats into which free-form nouns and/or verbs can be placed. The input for each of these canned queries is undoubtedly wrapped in some magic prior to execution against the specified source. By restricting user input and directing the query toward engines tuned for the specific need, they can narrow down the degrees of freedom.

For example, one of the business searches you can run is in the form “Who acquired INSERT:COMPANY”. Entering a company name will bring up a list of results highlighting the assumed relationships as found in articles on Wikipedia (their only current source for information).

Their goal is to home in on the intent of the user, and provide results that are better than standard keyword searching. To allow users to judge the quality of the results, most of the demos show a side-by-side comparison: PowerSet’s results next to what comes back when the same inputs are run as plain keywords.

The demo that provides the most flexible input from the user is their PowerMouse application. Using it, the user is able to build an undirected (i.e., not forced to their “business, arts, or quotes” categories) query in the format “subject-verb-subject”. In fact, you can leave one (or two) of the fields blank to see what it finds. One of their canned examples is “zombies - eat - BLANK”. It is gratifying, then, to see Wikipedia articles returned that include “zombies - eat - brain” (along with eating “body part, boy, chick, debbie, flashback, franklin, galactus, granddaughter, hawkeye, head, man, meat, member, neighbor, people, richards, schoolchild, shell, study, sullivan, team, vet, yoshi”).
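For anyone who wants to picture the fill-in-the-blank part, here is a minimal sketch in Python. It is only my guess at the general shape of such a query and says nothing about how PowerSet actually stores or matches anything. The `TRIPLES` list and the `match` function are made up for illustration, seeded with a few of the results quoted above.

```python
# Hypothetical store of extracted triples (a few taken from the results above).
TRIPLES = [
    ("zombies", "eat", "brain"),
    ("zombies", "eat", "meat"),
    ("zombies", "eat", "people"),
    ("sunita", "set", "world record"),
]


def match(pattern, triples=TRIPLES):
    """Return every triple that agrees with the pattern.

    A None in the pattern plays the role of a BLANK field and matches anything.
    """
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# "zombies - eat - BLANK"
for triple in match(("zombies", "eat", None)):
    print(" - ".join(triple))
```

Leaving two fields blank is just a pattern with two `None` entries. The hard part, of course, is producing the triples from free text in the first place, which this sketch skips entirely.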

To try it out myself, I wanted to see what it would pick up about my friend Sunita Williams running the Boston Marathon while aboard the International Space Station. I already knew that there was a note about this on her Wikipedia bio page, so I figured it’d be a slow ball for PowerSet to knock out of the park.

I entered the query: “Sunita - ran - BLANK”

The result set included the following:

  1. Sunita - dump - maya
  2. Sunita - tie - maya
  3. Sunita - survive - ordeal
  4. Sunita - seek - reside
  5. Sunita - develop - feeling
  6. Sunita - get - marry
  7. Sunita - go - look
  8. Sunita - set - world record

I’m not sure what synset graph they’re using for “ran” in this context, but the first seven results were clear misses. The last entry, though, made me think PowerMouse hit upon what I expected as it was able to find Sunita’s bio page (as opposed to the unrelated “Sunita Parekh”, a TV soap opera character). Unfortunately, however, expanding the results showed that the blurb it keyed off was actually about her record-breaking space walk (no mention of the marathon).

To make sure I was remembering her bio page correctly, I checked and here’s the paragraph mentioning the marathon:

On April 16, 2007, she ran the first marathon by an astronaut in orbit.[6] Williams finished the Boston Marathon in four hours and 24 minutes.[7][8] The other crew members reportedly cheered her on and gave her oranges during the race. Williams’ sister, Dina Pandya, and fellow astronaut Karen L. Nyberg ran the marathon on Earth, and Williams received updates on their progress from Mission Control.

I’m surprised it didn’t clue into the part of the sentence that reads “she ran the first marathon”. That seems to be about as clear a match for the query as could reasonably be expected.

It’s too bad that the author of the blurb led the paragraph with a pronoun rather than Sunita (or Williams). It’s possible that the pronoun resolution needed to connect “she” back to Williams failed to make the association. Possibly compounding the problem is that the nearest previous noun was “Joan Higginbotham”.

I wonder, then, if the query would have picked it up had the sentence read “Williams ran the first marathon” (or more specifically, “Sunita Williams ran the first marathon”). Since it’d be better encyclopedia formatting to lead the paragraph with her last name, and updating the page wouldn’t hurt, I have half a mind to edit it and try the query again.
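To show why a leading pronoun can trip things up, here is a toy sketch of the crudest possible approach: swap each “she” for the most recently mentioned name. I have no idea whether PowerSet does anything like this; the `resolve` function and `NAMES` list are invented for illustration, and the first sentence of the sample text is a made-up stand-in for whatever actually precedes the marathon paragraph on the bio page.

```python
import re

# Names the toy resolver knows about (an assumption for this example).
NAMES = ["Joan Higginbotham", "Sunita Williams"]

# Match either a known name or the pronoun "she"/"She".
TOKEN = re.compile("|".join(map(re.escape, NAMES)) + r"|\b[Ss]he\b")


def resolve(text):
    """Replace each 'she' with the nearest previously mentioned name."""
    last_name = None

    def swap(found):
        nonlocal last_name
        word = found.group(0)
        if word in NAMES:
            last_name = word  # remember the latest name seen
            return word
        return last_name or word  # leave the pronoun alone if no name yet

    return TOKEN.sub(swap, text)


# The first sentence is a made-up stand-in; the second is from the bio page.
text = (
    "Sunita Williams trained alongside Joan Higginbotham. "
    "On April 16, 2007, she ran the first marathon by an astronaut in orbit."
)
print(resolve(text))
```

With the nearest-name rule, the marathon gets credited to Joan Higginbotham, so a “Sunita - ran - BLANK” query would never see it. Leading the sentence with “Williams” sidesteps the whole problem.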

The shining star in the experience, though, was the user interface of their support site. Very nice use of in-situ form editing and feature flow. Not a great knowledge repository, but fun to play with. My hope is they’re engaging with a good group of demo testers during their shakeout cruise. I look forward to watching as the training wheels come off to see how it works in the wild.

According to their press, ZoomInfo is taking the path toward semantic search by utilizing their patented technologies to pre-scrub crawled data. This approach, rather than relying on adding linguistic magic at query time, allows them the flexibility to massage the crawled data into searchable indexes. In this way, it then looks like the information is retrieved by a user’s more typical keyword searches.
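As a rough sketch of what “do the smart part at crawl time, then serve plain keyword searches” could look like, consider the toy below. This is not ZoomInfo’s method: the pages are made up, and the single regular expression `RULE` is a stand-in for whatever their patented extraction really does.

```python
import re
from collections import defaultdict

# Made-up crawled pages, for illustration only.
PAGES = {
    "http://example.com/about": "Jane Example is the CTO of Example Widgets.",
    "http://example.org/team": "John Sample is the CEO of Sample Labs.",
}

# Toy extraction rule standing in for the real linguistic analysis.
RULE = re.compile(
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) is the (?P<title>[A-Z]+) of "
    r"(?P<company>[A-Z][\w ]+)\."
)


def build_index(pages):
    """Extract structured records at crawl time and index them by keyword."""
    index = defaultdict(list)
    for url, text in pages.items():
        for found in RULE.finditer(text):
            fields = found.groupdict()
            record = {**fields, "url": url}
            # Index each distinct word once so a record is not listed twice.
            words = {w for value in fields.values() for w in value.lower().split()}
            for word in words:
                index[word].append(record)
    return index


def search(index, keyword):
    """Query time is nothing more than a keyword lookup."""
    return index.get(keyword.lower(), [])


index = build_index(PAGES)
print(search(index, "CTO"))
```

The trade-off is easy to see even at this scale: the query side stays dumb and fast, but any mistake made by the extraction rule is baked into the index until the page is crawled again.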

When searching for themselves using their engine, they say:

ZoomInfo is the best destination for information about people and companies. Our product is a summarization search engine that finds, understands, extracts and summarizes information about people and companies on the Web.

And for a deeper dive, here’re a couple notes from their technology page (annotated):

ZoomInfo employs Artificial Intelligence Algorithms to analyze Website pages and to create a human like understanding of their content. With these algorithms, ZoomInfo analyzes the type of Website and the content of the Website based on how it’s constructed. ZoomInfo is able to deduce that a specific paragraph is a company description or that a specific address contains the location of a company’s headquarters to extract the most accurate and relevant information.

ZoomInfo’s semantic search engine continually crawls the Web and reads business information. Using proprietary Natural Language Extraction technology, ZoomInfo analyzes sentences to understand their meaning and to extract relevant information about companies, and people, such as the industry a company is in and its products or services, or the company a person works for and his/her job title.

That certainly sounds kewl. But what about the reality? Check out this recent ZDNet post (annotated):

A search by company for IBM turns up some basic information, and lists Ramon Demper as the company CEO and CTO. As far as I know Sam Palmisano is the IBM CEO and Demper left IBM in 1993. A search for ZDNet in both basic and powersearch (requires registration) and by company and people turned up outdated and grossly incorrect information. Similarly, a search on CNET turned up a lot of erroneous information.

And if you consider using it as a consumer to find a “security software” company:

Searching for security software companies in California with $50 million or less in revenue and fewer than 100 employees turned up Network Associates, which merged with McAfee in 2002, as the first entry.

I’m assuming they’re still working out the kinks in their system. The problem I see, though, is they appear to be relying too heavily on their smart software without a human in the loop. If they’re hoping to court the business community with subscription services, I’d think they’d need to significantly increase their accuracy rate.

While it’s currently hip to be wrong (which opens the door to social-networking-style corrections), that doesn’t seem to be what they’re doing.

I realize I’m showing up (fashionably) late to the semantic web party, but the timing feels ripe. As I mentioned in an earlier post about what I call a “Semantic Servant”, I’ve been thinking a lot about how to (easily) cross-connect online systems. Despite the zealot debates between the Web 2.0 / 3.0 / Semantic Web crowds, there’s a lot to be gained from cooperative growth.

For example, I found this post about “Pinging the Semantic Web” by Harry Chen. In it he mentions there’s a lot to be learned from the blog pinging services:

As the Semantic Web grows, we also need similar services. Ping.SemanticWeb.Org is an experimental service for notifying search engines (or semantic web bots) about changes made in semantic web documents. The present service accepts pings from semantic web documents that describe SIOC, FOAF and DOAP.

He goes on to give some rationale behind his belief in this type of system. My personal favorite is his second point:

Second, a wide adoption of ping services can help to speed up the convergence of standard ontologies. In the blogosphere, we have seen the convergence of few RSS standards, which I believe is due to the wide adoption of ping services, as well as RSS readers and blog publishing software. If Semantic Web ping services are widely used, I believe it’s only nature for SWD publishers to adopt few standard ontologies that are supported by the ping services, and not to create the owner ontologies.

As much as I hate to admit it, the semi-formalization of RSS did for online content sharing what HTML did for Internet content publishing in general. What I mean by that is sometimes it takes an example of technology deployed in a useful context to propel it into mainstream adoption. There’s no reason why we need RSS to share content (we could simply use straight XML, or even straight HTML), but it certainly makes it easier — especially if everyone adopts it.

Now, all we need to do is come up with “an example technology deployed in a useful context.” Piece of cake.

Semantic Servant

This may not be a totally revolutionary idea, but it’s something I’d love to see implemented. The end state of the proposed application would be to deploy what I call a “Semantic Servant” that provides guidance for searching and indexing. I’m terming it a “servant” rather than a “server” for the basic reason that I see it as a “helper tool” to existing servers rather than serving up content itself.

Without getting into it too deeply, the concept is that the Semantic Servant (via a new “Semantic Servant Index Protocol”) would reply on a specified port to provide a machine-readable summary of the content available from another server. For example, if a web site is available at “http://www.contentsite.com”, the servant would reply at the same host via something like “ssip://www.contentsite.com”. The result would be an XML packet including rules for leveraging the content on the sister site.
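To make that a little less hand-wavy, here is a sketch of what an indexer might do with such a packet. Everything in it is hypothetical: no such protocol exists, and the element names, attribute names, and the `read_packet` function are invented purely to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Entirely hypothetical packet: the element and attribute names are invented.
PACKET = """\
<servant site="http://www.contentsite.com">
  <ontology format="RDF" href="http://www.contentsite.com/site.rdf"/>
  <section path="/blog/" type="weblog" feed="/blog/feed/"/>
  <section path="/people/" type="profiles"/>
</servant>
"""


def read_packet(xml_text):
    """Turn a servant packet into a simple summary an indexer could act on."""
    root = ET.fromstring(xml_text)
    return {
        "site": root.get("site"),
        "sections": {
            section.get("path"): section.get("type")
            for section in root.findall("section")
        },
    }


summary = read_packet(PACKET)
print(summary["site"])
for path, kind in summary["sections"].items():
    print(path, "holds", kind)
```

The point is that a spider could learn what lives where before fetching a single page, instead of guessing from the markup.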

Keep in mind that this is a totally half-baked idea. My goal in this concept would be to empower a website developer with a tool that would, with a few minor configuration clicks, tell spiders/bots/indexers/etc. more about the associated site. In order for this to work, the servant application would have to be incredibly lightweight and easy to use out-of-the-box. Assuming the servant defaults to a standard configuration (OWL, RDF, etc.), the administrator could select from some pre-canned configurations and let it go.

The more time the administrator spends customizing the configuration, of course, the more fine-tuned it could be to the content of the specific site. In this way, though, indexers visiting the site would (a) have more information about the content of the site than is currently (easily) available, and (b) be more forgiving of changes to the site.

This is, of course, assuming that producers of web content want their information to be aggregated more freely. If a site producer wants to force all of its users to its front gate, this isn’t the solution for them. As I think we’re moving to an “All Content Everywhere” model, though, whereby there are multiple ways to experience the same content, I see something like this as an eventual must-have.

… then again, I’m a dreamer.

If we’re all moving toward a more connected set of tools for communication with hopes of a better Web 3.0, how’re we gonna’ get there? Getting everyone to agree on a single standard seems like a pipedream, but what can we do in the meantime? From what I can tell, it seems relatively easy to chat up the concept of .

I bumped into this post from Tom Johnson which seemed to sum it up well:

The idea of microformats and the semantic web sound cool. And I’m looking forward to the day when microformats are widely adopted. But if microformats are so useful, why hasn’t Google come out with a microformats search yet? Why aren’t microformats being baked into the core structure of WordPress and other blogging platforms?

Not many people are using the structured blogging plugins, and those that do use it mainly to autoformat their posts. I even heard in a recent interview with Matt Mullenweg, the WordPress lead, that there are no current plans to develop structured blogging microformats into the WordPress code.

Oddly enough, Jason Kolb made a similar comment in a recent post:

The only technology that would really be necessary to make this work is to embed microformats in site text itself. I’m really not sure why this hasn’t taken off yet, it seems like a no-brainer to me. What I’m talking about, and I’ve actually posted some working examples of this before, is to surround chunks of text from a weblog post or text published to a public site with microformat markup so that it can be extracted as meaningful data.

It seems like a simple enough first step toward the semantic web thing. Like these two cats, I’m relatively surprised microformatting hasn’t been embraced, but I do believe the value chain still seems to be missing a couple links. There probably need to be a couple of successes (like a popular microformat tagging/retrieval tool) before the masses jump on board.
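For a feel of what “surround chunks of text with microformat markup so that it can be extracted as meaningful data” means in practice, here is a small sketch using the hCard class names `vcard`, `fn`, and `org`. The sample post, the person, and the `extract_hcard` helper are all made up, and the parser is deliberately naive (it assumes each field holds plain text with no nested tags).

```python
from html.parser import HTMLParser

# Made-up blog post snippet with hCard class names wrapped around the text.
POST = """
<p>Had coffee with
  <span class="vcard">
    <span class="fn">Jane Example</span> of
    <span class="org">Example Widgets</span>
  </span> yesterday.</p>
"""


class HCardParser(HTMLParser):
    """Collect the text of hCard 'fn' and 'org' fields from a page."""

    FIELDS = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._field = None  # the field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        hits = self.FIELDS.intersection(classes)
        if hits:
            self._field = hits.pop()

    def handle_data(self, data):
        if self._field:
            self.card[self._field] = self.card.get(self._field, "") + data

    def handle_endtag(self, tag):
        # Naive: any closing tag ends the current field.
        self._field = None


def extract_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    return parser.card


print(extract_hcard(POST))
```

The post still reads normally in a browser, while a tool that knows the class names can pull out who was mentioned and where they work.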

For my part in this digital village, I’m going to actively explore more microformatting opportunities. More if it develops.