Thursday, February 21, 2008

Specifying patterns for your Custom Search Engine



Creating a basic Custom Search Engine (CSE) is very easy. You enter a list of sites, select a few basic preferences, and you are done, right? But in fact there's more to Custom Search: consider it a very powerful way of building your own search engine on top of Google search. You can exclude sites, add labels for drill-down, and even change the ranking of results for your search engine. In this blog post, we look at the basic element of Custom Search: URL patterns.

URL patterns specify the part of the web you want to search or exclude from your search. Custom Search is based on approximation algorithms that use these patterns to give you your customized results.

Consider the "I Love Veggies" search engine that we created. Here's how the "I Love Veggies" search engine made use of patterns effectively:

  • Be very specific. Use the longest possible pattern for specifying a site. For example, in the "I Love Veggies" search engine, we wanted to search all of www.goveg.com, so we added "www.goveg.com/*" as a pattern. But we wanted to search only the vegetarian part of the "allrecipes.com" site. So instead of adding all of "allrecipes.com/*" we added the more specific "allrecipes.com/Recipes/Everyday-Cooking/Vegetarian/*".
  • Specify multiple pages in a site with a "*" at the end of the pattern. If you specify just "www.goveg.com", Custom Search will search just the single page http://www.goveg.com. You need to remember this only if you write your XML file of annotations directly. If you are using the Control Panel, it automatically adds the "/*" at the end for you, unless you indicate otherwise.
  • Sometimes, you might have a few hosts on a domain with the same path that you want to search. In our example, we wanted to search "mideastfood.about.com/od/vegetarianrecipes/*" and "indianfood.about.com/od/vegetarianrecipes/*". In such a case, it is better to specify these patterns individually instead of using a very general "*.about.com/od/vegetarianrecipes/*", because the more specific the patterns, the better the approximation.
  • In the hostname, you can use the * only at the very beginning of the pattern, and it can only stand for a full token (a complete dot-separated part of the hostname). For example, "*.about.com/*" is a valid pattern and so is "*.food.about.com/*". However, "*ood.about.com/*" is not valid, nor is "food.*.about.com/*".
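
To make the rules above concrete, here is a minimal Python sketch of how one might check a pattern before adding it. It is only an illustration, not how Custom Search works internally: the function names (split_pattern, has_valid_wildcard_host, literally_covers) are our own hypothetical helpers, the page URLs in the usage lines are made up, and the matcher reads patterns literally, whereas Custom Search itself uses approximation. The matcher also assumes an exact hostname with no "*" in it.

```python
def split_pattern(text: str) -> tuple[str, str]:
    """Split a pattern or URL into (host, path), ignoring any scheme."""
    if "://" in text:
        text = text.split("://", 1)[1]
    host, _, path = text.partition("/")
    return host.lower(), path


def has_valid_wildcard_host(pattern: str) -> bool:
    """Check the hostname rule: "*" may only be the whole first token."""
    host, _ = split_pattern(pattern)
    tokens = host.split(".")
    for index, token in enumerate(tokens):
        if "*" in token and not (index == 0 and token == "*"):
            return False
    return True


def literally_covers(pattern: str, url: str) -> bool:
    """Literal reading of a pattern whose hostname contains no wildcard.

    Assumption for illustration only: a trailing "*" means "any URL that
    starts with the text before it", and no trailing "*" means "exactly
    this one page".
    """
    pattern_host, pattern_path = split_pattern(pattern)
    url_host, url_path = split_pattern(url)
    if pattern_host != url_host:
        return False
    if pattern_path.endswith("*"):
        # Trailing "*": prefix match on everything before the star.
        return url_path.startswith(pattern_path[:-1])
    # No trailing "*": only the single page (a trailing slash is ignored).
    return url_path.rstrip("/") == pattern_path.rstrip("/")


# The four hostname examples from the list above.
for candidate in ("*.about.com/*", "*.food.about.com/*",
                  "*ood.about.com/*", "food.*.about.com/*"):
    print(candidate, has_valid_wildcard_host(candidate))

# Hypothetical page URL, used only to show the effect of the trailing "/*".
page = "http://www.goveg.com/some/page"
print(literally_covers("www.goveg.com/*", page))  # pattern ends in "/*"
print(literally_covers("www.goveg.com", page))    # single-page pattern
```

The first two hostname examples pass the check and the last two fail it, matching the rule in the last bullet. With the literal reading, "www.goveg.com/*" covers the sample page while plain "www.goveg.com" covers only the home page itself.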

Keep reading this blog for more tips and tricks as we develop our "I Love Veggies" search engine. If you have specific questions or feature requests you can visit our Help Center or ask a question on the Discussion group.
