Update 2011-10-12: IndexTank will be no more.

Google certainly has Web search down to a very fine art, but it finds only what its machines have crawled or been told to get via a sitemap. Submitting a new site map is an indication to Google that something on my site has probably changed, but it isn't a "crawl now!" instruction; it can take anything from minutes to days until the changed site map is reflected in Google's search engine.

I was searching (pun intended) for a method to have the content of my site indexed on-demand, without having to deploy my own search engine, and up came IndexTank, a cloud service built atop Amazon SimpleDB which calls itself "hosted search you control".


IndexTank basically gives me a REST interface with which to upload JSON documents I can later search through. IndexTank offer a free plan with up to five indexes, 100,000 documents, and unlimited daily queries. (Paid plans allow for more documents.)

When you sign up, you get two API URLs: the private one is for data manipulation and queries, and the public one is for the auto-complete API (which also supports JSONP). IndexTank's API allows for creating and deleting indexes and for adding, deleting, and searching documents, and there is a separate API for auto-complete.

There's a rather good hello world tutorial which gives a solid overview of the capabilities provided by the API. Documents can contain geolocation data, which can be used to tailor search results to a user's location.

Getting started

Registering for the free IndexTank account could not be simpler: an e-mail address suffices; IndexTank generates a (changeable) password and sends you that. I connected to the dashboard and created my first index. (I was a bit wasteful with creating/deleting indexes and had to contact support for help, but they reacted very quickly and solved my problem in the middle of their night.)

The dashboard shows me the indexes I currently have, their size, etc. and I can use it to test searches against whatever documents I have in the index.

First experiments

Documents I submit to IndexTank are JSON objects with an application-specific docid, which could be a part number, a URL, or almost any bit of data unique to your application, and fields which contain structured data.

Let me show you a small example. I create the following JSON in a file called i1.json:

{
   "docid" : "mykey1",
   "fields" : {
      "text" : "IndexTank lets me submit documents and search them",
      "title" : "Searching in the cloud",
      "tags" : "Search,Cloud"
   }
}

I then submit this document to my index (called t22) at IndexTank using an HTTP PUT request. (By the way, a PUT request creates new documents and overwrites existing documents.) MyAPIURL is my private URL as discussed above.

curl -X PUT http://MyAPIURL/v1/indexes/t22/docs -d @i1.json
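The same PUT can be issued from a script. Here is a minimal sketch in Python, using only the standard library. The helper name `build_put_request` is my own invention for illustration, and `MyAPIURL` is the same placeholder as in the curl command. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_put_request(api_url, index, docid, fields):
    """Build (but do not send) a PUT request that adds one document to an index."""
    body = json.dumps({"docid": docid, "fields": fields}).encode("utf-8")
    return urllib.request.Request(
        f"{api_url}/v1/indexes/{index}/docs",
        data=body,
        method="PUT",
    )


req = build_put_request(
    "http://MyAPIURL",  # placeholder for the private API URL
    "t22",
    "mykey1",
    {
        "text": "IndexTank lets me submit documents and search them",
        "title": "Searching in the cloud",
        "tags": "Search,Cloud",
    },
)
# Passing req to urllib.request.urlopen() would actually transmit the document.
```

Keeping construction and transmission separate makes it easy to inspect what would be uploaded before anything leaves the machine.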

And I can immediately search for it. When retrieving documents, IndexTank can be instructed to return one or more particular fields from your document, and it can return snippets of one of those fields. You'll typically display these snippets of text in your interface so that a user sees the context in which the search term(s) appear. (Somewhat like what Google does on their results page.)

curl 'http://MyAPIURL/v1/indexes/t22/search?q=documents'
{
   "search_time" : "0.002",
   "matches" : 1,
   "facets" : {},
   "results" : [
      {
         "query_relevance_score" : -498,
         "docid" : "mykey1"
      }
   ]
}

Most of the JSON document returned is quite clear. results is an array with, well, the results of the search. In my program I'd use docid to retrieve the original document from my database, file system or wherever. matches contains the number of results. facets allow grouping of search results, and I have no idea how to use query_relevance_score.
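To illustrate how a program might consume such a response, here is a small Python sketch. It builds the search URL and pulls the docids out of a response body. The function names are hypothetical, `MyAPIURL` is still a placeholder, and the sample body simply repeats the fields seen in the search output, so no network access is involved.

```python
import json
from urllib.parse import urlencode


def search_url(api_url, index, query):
    """Return the URL that searches one index for a query string."""
    query_string = urlencode({"q": query})
    return f"{api_url}/v1/indexes/{index}/search?{query_string}"


def matching_docids(response_text):
    """Return the docids from a search response body, in the order returned."""
    data = json.loads(response_text)
    return [result["docid"] for result in data["results"]]


# Sample body with the same fields as the search output shown in this post.
sample = """
{
   "search_time" : "0.002",
   "matches" : 1,
   "facets" : {},
   "results" : [
      { "query_relevance_score" : -498, "docid" : "mykey1" }
   ]
}
"""

url = search_url("http://MyAPIURL", "t22", "documents")
docids = matching_docids(sample)
print(url)
print(docids)
```

Each docid would then be used to look up the original document in whatever store the application uses.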

To get at the other fields I originally submitted in the PUT request (title, tags), I append a list of fields I want to the URL:

curl 'http://MyAPIURL/v1/indexes/t22/search?q=documents&fetch=title,tags'
{
   "search_time" : "0.004",
   "matches" : 1,
   "facets" : {},
   "results" : [
      {
         "query_relevance_score" : -662,
         "docid" : "mykey1",
         "title" : "Searching in the cloud",
         "tags" : "Search,Cloud"
      }
   ]
}

Finally, to clean up the experiment, let me delete the document:

curl -X DELETE 'http://MyAPIURL/v1/indexes/t22/docs' -d '{"docid":"mykey1"}'

(If you do a lot of command-line talking to a REST service, I recommend you use Resty.)

So, there are two parts to using IndexTank: collecting and submitting documents to the tank, followed by the actual searching.

It goes without saying, which is why I'll say it anyway, that the data I submit is "in a cloud". Since I have no control whatsoever over that data, I should avoid uploading (i.e. PUT) any sensitive data. In fact, IndexTank's privacy policy explicitly says:

we cannot ensure or warrant the security of any information you transmit to Flaptor or guarantee that your information on the Service may not be accessed, disclosed, altered, or destroyed by breach of any of our physical, technical, or managerial safeguards.

I've told you. :-)

As far as using IndexTank for Web search of my site I see no problem, as the data is public anyway, and other search engines have that already.

Implementing site search

I use Jekyll to generate this site, so I thought it would be a good idea to use IndexTank for site search. Basically that means the following steps:

  • Collect the documents I want indexed
  • Submit those documents to IndexTank
  • Implement a search function for the site

The easiest approach would be to have Jekyll generate the site, and then submit the HTML pages with all the HTML markup they contain. In order to reduce the amount of data I have to transfer (i.e. upload) to IndexTank and index relevant content only, I'm going to prepare documents as follows:

  1. Read the Markdown source files which contain individual posts.
  2. Convert those to HTML
  3. Strip the HTML markup.

That should leave me with the plain text of the post only, without links, image references, etc. I'm a bit old-fashioned in that I don't like wasting resources (network bandwidth in this case) so I'll also ensure I upload a document to IndexTank only if it's been modified since the last upload. As a docid I'll use the path to the HTML page, sans the host name (i.e. something like /2011/06/14/deliciously-static-from-wordpress-to-jekyll/).
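As an illustration of the docid scheme, here is a Python sketch that maps a Jekyll post file name to the path of the generated page. It assumes the usual Jekyll file naming (YYYY-MM-DD-title.markdown) and a /year/month/day/title/ permalink layout. My real indexer takes the date from the YAML front-matter instead, and `docid_for_post` is a made-up name.

```python
import re

# Jekyll post file names look like 2011-06-14-some-title.markdown
POST_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.(?:markdown|md)$")


def docid_for_post(filename):
    """Map a Jekyll post file name to the path of the generated page."""
    match = POST_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a Jekyll post file name: {filename}")
    year, month, day, slug = match.groups()
    return f"/{year}/{month}/{day}/{slug}/"


print(docid_for_post("2011-06-14-deliciously-static-from-wordpress-to-jekyll.markdown"))
# prints /2011/06/14/deliciously-static-from-wordpress-to-jekyll/
```

Because the docid is also the page's path, a search result can be turned into a link by prepending the host name.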

I can search an index on IndexTank using my private API URL, or I can use my public API URL for auto-complete searches. As I found out a bit later, I can also allow anonymous searches by enabling the public search API in the Manage portion of the dashboard. That is important for, say, performing a search from JavaScript in a Web page; without anonymous search I'd have to divulge my private API URL in the JavaScript.

Submitting documents to IndexTank

IndexTank have client libraries for Python, Ruby, PHP, and Java, but not for Perl. I found a small bit of code that uses Net::HTTP::Spore for accessing IndexTank, but I decided to use Perl's LWP, instead of installing yet another Perl module.

My program searches for *.markdown files in the _posts and pages directories. For each file, it

  1. determines a docid based on the path name and the date in the YAML front-matter of the post
  2. retrieves the post's tags
  3. grabs the Markdown from the posting
  4. converts the Markdown to HTML
  5. strips the HTML tags from the result
  6. and finally creates the JSON document
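The last two steps of the list can be sketched in Python with the standard library alone. This is not my Perl indexer, just an illustration: it takes HTML that has already been converted from Markdown, and the names `TextExtractor`, `strip_html` and `make_document` are hypothetical. The field names (text, title, tags) are the ones used in the earlier example document.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so words on either side of a tag do not run together,
    # then collapse the resulting runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


def make_document(docid, title, tags, html):
    """Build the JSON document for one post."""
    fields = {"text": strip_html(html), "title": title, "tags": ",".join(tags)}
    return json.dumps({"docid": docid, "fields": fields}, ensure_ascii=False)


doc = make_document(
    "/2006/08/02/sucuk/",
    "Sucuk",
    ["Food"],
    '<p>Sold in <a href="/x">Turkish</a> grocery stores.</p>',
)
print(doc)
```

Joining text chunks with a space can leave a stray space before punctuation that follows an inline tag, which is harmless for indexing purposes.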

The final JSON document is UTF-8-encoded (see The Perl UTF-8 and utf8 Encoding Mess) and submitted to IndexTank for indexing. I record a time stamp (UNIX epoch time) for each docid thus submitted, and that enables me to skip uploading unchanged documents at the next run of the indexer.
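The time stamp bookkeeping could look roughly like the following Python sketch. It is an assumption-laden illustration: it stores the file's modification time per docid in a JSON state file (rather than the time of submission), and all function names are hypothetical.

```python
import json
import os
import tempfile


def load_state(state_file):
    """Read the docid-to-time-stamp map, or start empty if there is none yet."""
    try:
        with open(state_file, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_state(state_file, state):
    """Write the docid-to-time-stamp map back to disk."""
    with open(state_file, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


def needs_upload(path, docid, state):
    """True if the file changed after the time stamp recorded for docid."""
    return os.path.getmtime(path) > state.get(docid, 0)


def mark_uploaded(path, docid, state):
    """Remember the file's modification time as the last uploaded version."""
    state[docid] = os.path.getmtime(path)


# Small demonstration on a throw-away file.
with tempfile.TemporaryDirectory() as tmp:
    post = os.path.join(tmp, "post.markdown")
    with open(post, "w", encoding="utf-8") as fh:
        fh.write("Hello")
    state = {}
    first = needs_upload(post, "/demo/", state)   # never uploaded before
    mark_uploaded(post, "/demo/", state)
    second = needs_upload(post, "/demo/", state)  # unchanged since upload
    mtime = os.path.getmtime(post)
    os.utime(post, (mtime, mtime + 10))           # simulate a later edit
    third = needs_upload(post, "/demo/", state)

print(first, second, third)
```

A document is marked as uploaded only after a successful submission, so a failed upload is retried on the next run.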

The initial upload of 1800+ documents took roughly 10 minutes. This sounds like a relatively long time (and maybe it is), but I assume it's due to the inefficient tools I created as well as the relatively congested ADSL line I'm using. Nevertheless, it's quite adequate for me.

Querying IndexTank

After uploading all my pages, I try an anonymous search on my API URL. The query (q=) is "turkish", and I'm asking IndexTank to fetch the "title" and "tags" fields, and to return a snippet from the "text" field.

curl -qs 'http://dh42u.api.indextank.com/v1/indexes/t22/search?q=turkish&fetch=title,tags&snippet=text'
{
   "search_time" : "0.005",
   "matches" : 3,
   "facets" : {},
   "results" : [
      {
         "query_relevance_score" : -1964,
         "docid" : "/2006/08/02/sucuk/",
         "title" : "Sucuk",
         "snippet_text" : "Germany it\n is sold in <b>Turkish</b> grocery stores. Cut the sucuk into slices (approximately\n 4 mm thick) and fry without additional fat in a pan, ensuring it doesn't get\n too",
         "tags" : "Food"
      },
      {
         "query_relevance_score" : -2005,
         "docid" : "/2006/03/01/turkish-style-pizza/",
         "title" : "Turkish Style Pizza",
         "snippet_text" : "I hadn't\n tasted before: Culinaria <b>Turkish</b> Style . It tastes quite different to\n other pizzas I've eaten; a little exotic shall we say. I won't be eating this\n on a regular",
         "tags" : "Food"
      },
      {
         "query_relevance_score" : -2013,
         "docid" : "/2006/02/14/durum-yok/",
         "title" : "D&uuml;r&uuml;m Yok!",
         "snippet_text" : "of a bank in\n the <b>Turkish</b> capital Ankara. It was a very interesting experience and I was\n able to pick up a couple dozen words of the <b>Turkish</b> language. My favorite word\n was (and still is): yok . It means &quot;nothing&quot; or &quot;not available&quot; and I\n understand it is added after the noun",
         "tags" : "Food"
      }
   ]
}

Note how the returned snippet has HTML tags inserted by IndexTank; they mark the position of the query terms in the snippet.
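When such a snippet is placed on a page, everything except the markers should be treated as text. Here is a hedged Python sketch of one way to do that: escape the whole snippet, then restore only the b markers as em tags. It assumes the snippet is plain text apart from the markers; if the snippet already contains character entities, this would escape them a second time. `render_snippet` is a hypothetical name.

```python
import html


def render_snippet(snippet):
    """Escape a snippet for HTML output, turning the <b> markers into <em>."""
    escaped = html.escape(snippet, quote=False)
    # After escaping, the markers appear as &lt;b&gt; and &lt;/b&gt;.
    return escaped.replace("&lt;b&gt;", "<em>").replace("&lt;/b&gt;", "</em>")


rendered = render_snippet("sold in <b>Turkish</b> grocery stores & <i>more</i>")
print(rendered)
# prints sold in <em>Turkish</em> grocery stores &amp; &lt;i&gt;more&lt;/i&gt;
```

Only the highlight markers survive as markup; any other tag in the snippet is shown literally instead of being interpreted by the browser.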

Jekyll integration

I run my indexer in the Makefile I use to build the site, just before pushing Jekyll's _site/ directory to its final destination. This allows me to ensure an index is available on IndexTank seconds before the page is actually visible on the Web. I beat Google to it! ;-)


Pascal Widdershoven implemented something similar using a Jekyll Ruby plug-in to do the indexing. If I understand his Ruby code correctly, he submits pages to IndexTank at each Jekyll run, which is rather "expensive"; I prefer my method.

User interface

I haven't yet decided exactly on the user interface, although I have a first draft working. It uses a background helper which does the actual search and returns an HTML page which is overlaid into a div on the page.

As some of you know, I'm not a "Web 2.0 developer", so there's a lot (ALOT! :-) of know-how missing here... I suppose I could use one of the anonymous queries to get a JSONP result directly from the Web browser, and somehow fiddle it onto the page nicely, perhaps with some indextank-jquery...

A first try

Nothing terribly special here, I assume, but this is my first try using a bit of Mustache, a templating engine for all sorts of languages. (For an excellent discussion on JSON vs JSONP, see Cross-domain AJAX with JSONP.)

<!DOCTYPE html>
<script type="text/javascript" src="/inc/jquery-min.js"></script>
<script type="text/javascript" src="mustache.js"></script>
<style type="text/css">
body { font-family:Verdana,sans-serif; }
#output { margin-left: 2em; width: 90%; }
li { padding: 10px; }
.s_snippet { font-size: 90%; font-style:italic; }
.s_date { font-size: 80%; }
.s_tags { font-size: 80%; color: #e1e; }
</style>

<script type="text/javascript">
    $(document).ready(function() {

        // Mustache template: one list item per search result
        var template = '<ul>{{#results}}<li>\
            <a href="http://jpmens.net{{docid}}">{{title}}</a>\
            <span class="s_snippet">{{snippet_text}}</span>\
            <span class="s_date">{{date}}</span>\
            <span class="s_tags">{{tags}}</span>\
            </li>{{/results}}</ul>';

        // Run the search when the link is clicked
        $('#clickme').click(function() {
            var url = 'http://dh42u.api.indextank.com/v1/indexes/t22/search';
            var params = {
                q: 'yok', // search term
                fetch: 'title,tags,date',
                snippet: 'text'
            };

            $.ajax({
                url: url,
                data: params,
                dataType: 'jsonp',
                success: function(data) {
                    var markup = Mustache.to_html(template, data);
                    // Mustache escapes the <b> markers in the snippet; turn them into <em>
                    markup = markup.replace(/&lt;b&gt;/gi, '<em>');
                    markup = markup.replace(/&lt;\/b&gt;/gi, '</em>');
                    $('#output').html(markup);
                }
            });
            return false;
        });
    });
</script>
<a href="#" id="clickme">click</a>
<div id='output'></div>

So far, so good: I'll be using (and recommending) IndexTank regularly now, because it really delivers on its promise; this is good stuff.


IndexTank, Search, Cloud, JSON, Jekyll, REST, and curl :: 15 Jun 2011 :: e-mail

