Update 2011-10-12: IndexTank will be no more.
Google certainly has Web search down to a very fine art, but it finds only what its machines have crawled or been told to get via a sitemap. Submitting a new sitemap is an indication to Google that something on my site has probably changed, but it isn’t a “crawl now!” instruction; it can take anything from minutes to days until the changed sitemap is reflected in Google’s search engine.
I was searching (pun intended) for a method to have the content of my site indexed on demand, without having to deploy my own search engine, and up came IndexTank, a cloud service built atop Amazon SimpleDB which calls itself “hosted search you control”.
IndexTank basically gives me a REST interface with which to upload JSON documents I can later search through. IndexTank offer a free plan with up to five indexes, 100,000 documents, and unlimited daily queries. (Paid plans allow for more documents.)
When you sign up, you get two API URLs: use the private one for data manipulation and queries, and the public one for the auto-complete API (which also supports JSONP). IndexTank’s API allows for creating and deleting indexes, and for adding, deleting, and searching documents; there is also a special API for auto-complete.
There’s a rather good hello world tutorial which gives an overview of the capabilities provided by the API. Documents can contain geolocation data, which can be used to provide search results customized to a user’s location.
Registering for the free IndexTank account could not be simpler: an e-mail address suffices; IndexTank generates a (changeable) password and sends it to you. I connected to the dashboard and created my first index. (I was a bit wasteful with creating/deleting indexes and had to contact support for help, but they reacted very quickly and solved my problem in the middle of their night.)
The dashboard shows me the indexes I currently have, their size, etc. and I can use it to test searches against whatever documents I have in the index.
Documents I submit to IndexTank are JSON objects with an application-specific docid, which could be a part number, a URL, or almost any bit of data unique to your application, and fields which contain structured data.
Let me show you a small example. I create the following JSON in a file called
I then submit this document to my index (called t22) at IndexTank using a PUT request. (By the way, a PUT request creates new documents and overwrites existing documents.)
MyAPIURL is my private URL as discussed above.
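For readers who prefer to see the shape of such a request in code, here is a minimal Python sketch. It is only an illustration under assumptions: the endpoint layout (`/v1/indexes/<index>/docs`), the `docid`/`fields` body shape, the placeholder API URL, the sample field values, and the helper name `build_put_request` are mine, not taken from this post. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_put_request(api_url, index, docid, fields):
    """Build (but do not send) a PUT request that adds or overwrites one document."""
    # Assumed endpoint layout; verify it against the service's API documentation.
    url = f"{api_url.rstrip('/')}/v1/indexes/{index}/docs"
    body = json.dumps({"docid": docid, "fields": fields}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values, for illustration only.
req = build_put_request(
    "http://example.invalid",  # stands in for the private API URL
    "t22",
    "/2010/10/example.html",
    {"title": "An example post", "tags": "search example", "text": "Hello, tank!"},
)
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen()` would actually perform the upload; because PUT overwrites, sending the same docid twice simply replaces the earlier document.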
And I can immediately search for it. When retrieving documents, IndexTank can be instructed to return one or more particular fields from your document, and it can return snippets of one of those fields. You’ll typically display these snippets of text in your interface so that a user sees the context in which the search term(s) appear. (Somewhat like what Google does on their results page.)
Most of the JSON document returned is quite clear. results is an array with, well, the results of the search. In my program I’d use docid to retrieve the original document from my database, file system or wherever.
the number of
results. facets allow grouping of search results, and I
have no idea on how to use
To get at the other fields I originally submitted in the
PUT request (title,
tags), I append a list of fields I want to the URL:
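In Python terms, building such a search URL might look like the sketch below. The parameter names `fetch` and `snippet`, the `/v1/indexes/<index>/search` path, the placeholder host, and the helper name `build_search_url` are assumptions on my part; only `q=` appears in this post.

```python
from urllib.parse import urlencode


def build_search_url(api_url, index, query, fetch=(), snippet=()):
    """Return a search URL, optionally asking for stored fields and snippets."""
    params = {"q": query}
    if fetch:
        # Assumed parameter: comma-separated list of fields to return.
        params["fetch"] = ",".join(fetch)
    if snippet:
        # Assumed parameter: comma-separated list of fields to take snippets from.
        params["snippet"] = ",".join(snippet)
    # safe="," keeps the commas readable instead of encoding them as %2C.
    query_string = urlencode(params, safe=",")
    return f"{api_url.rstrip('/')}/v1/indexes/{index}/search?{query_string}"


# Hypothetical host and query, for illustration only.
url = build_search_url("http://example.invalid", "t22", "hello", fetch=["title", "tags"])
print(url)
```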
Finally, to clean up the experiment, let me delete the document:
(If you do a lot of command-line talking to a REST service, I recommend you use Resty.)
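For completeness, a delete request could be sketched in Python as follows. As before, this is a guess at the shape rather than a record of what I ran: the `docid` query parameter, the endpoint path, the placeholder host, and the helper name `build_delete_request` are assumptions, and the request is built without being sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_delete_request(api_url, index, docid):
    """Build (but do not send) a DELETE request for a single document."""
    # Assumed convention: the docid travels as a URL-encoded query parameter.
    query_string = urlencode({"docid": docid})
    url = f"{api_url.rstrip('/')}/v1/indexes/{index}/docs?{query_string}"
    return urllib.request.Request(url, method="DELETE")


# Hypothetical values, for illustration only.
del_req = build_delete_request("http://example.invalid", "t22", "/2010/10/example.html")
print(del_req.get_method(), del_req.full_url)
```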
So, there are two parts to using IndexTank: collecting and submitting documents to the tank, followed by the actual searching.
It goes without saying, which is why I’ll say it anyway, that the data I submit
is “in a cloud”. Since I have no control whatsoever over that data, I should
avoid uploading (i.e.
PUT) any sensitive data. In fact, IndexTank’s privacy
policy explicitly says:
we cannot ensure or warrant the security of any information you transmit to Flaptor or guarantee that your information on the Service may not be accessed, disclosed, altered, or destroyed by breach of any of our physical, technical, or managerial safeguards.
I’ve told you. :-)
As far as using IndexTank for Web search of my site I see no problem, as the data is public anyway, and other search engines have that already.
Implementing site search
- Collect the documents I want indexed
- Submit those documents to IndexTank
- Implement a search function for the site
The easiest approach would be to have Jekyll generate the site and then submit the generated HTML pages, markup and all. To reduce the amount of data I have to upload to IndexTank, and to index relevant content only, I’m going to prepare documents as follows:
- Read the Markdown source files which contain individual posts.
- Convert those to HTML.
- Strip the HTML markup.
That should leave me with the plain text of the post only, without links, image
references, etc. I’m a bit old-fashioned in that I don’t like wasting resources
(network bandwidth in this case) so I’ll also ensure I upload a document to
IndexTank only if it’s been modified since the last upload. As a docid I’ll
use the path to the HTML page, sans the host name (i.e. something like
Submitting documents to IndexTank
IndexTank have client libraries for Python, Ruby, PHP, and Java, but not
for Perl. I found a small bit of code that uses
accessing IndexTank, but I decided to use Perl’s LWP, instead of installing
yet another Perl module.
My program searches for
*.markdown files in the
directories. For each file, it
- determines a docid based on the path name and the date in the YAML front-matter of the post
- retrieves the post’s tags
- grabs the Markdown from the posting
- converts the Markdown to HTML
- strips the HTML tags from the result
- and finally creates the JSON document
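The steps above can be sketched in a few lines of Python. My real indexer is written in Perl, so treat this purely as an illustration under assumptions: the docid layout (`/YYYY/MM/DD/slug.html`), the naive "key: value" front-matter parser (real YAML needs a proper parser), the field names, and the `to_html` callable standing in for a Markdown converter are all hypothetical.

```python
import json
from html.parser import HTMLParser
from pathlib import PurePosixPath


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def split_front_matter(source):
    """Split a post into (metadata dict, body); handles simple 'key: value' lines only."""
    parts = source.split("---", 2)
    if not source.startswith("---") or len(parts) < 3:
        return {}, source
    meta = {}
    for line in parts[1].splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, parts[2]


def build_document(path, source, to_html):
    """Turn one post into the JSON document to submit for indexing."""
    meta, body = split_front_matter(source)
    year, month, day = meta["date"].split()[0].split("-")
    slug = PurePosixPath(path).stem
    fields = {
        "title": meta.get("title", ""),
        "tags": meta.get("tags", ""),
        "text": strip_html(to_html(body)),
    }
    docid = f"/{year}/{month}/{day}/{slug}.html"  # assumed docid layout
    return json.dumps({"docid": docid, "fields": fields}, ensure_ascii=False)


def fake_to_html(markdown_text):
    """Placeholder for a real Markdown converter; it only wraps the text in <p>."""
    return f"<p>{markdown_text.strip()}</p>"


post = """---
title: Hello tank
date: 2010-10-12
tags: search indextank
---
Hello <em>tank</em>!
"""
print(build_document("posts/hello-tank.markdown", post, fake_to_html))
```

Passing the converter in as a function keeps the sketch free of third-party modules; any Markdown library that returns an HTML string could be dropped in instead of `fake_to_html`.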
The final JSON document is UTF-8-encoded (see The Perl UTF-8 and utf8 Encoding Mess) and submitted to IndexTank for indexing. I record a time stamp (UNIX epoch time) for each docid thus submitted, and that enables me to skip uploading unchanged documents at the next run of the indexer.
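The time-stamp bookkeeping can be illustrated with a small Python sketch. The details are assumptions for the sake of the example: I compare the source file's modification time against the recorded upload time, and keep the records in a JSON file; the helper names and file names are hypothetical.

```python
import json
import os
import tempfile
import time


def load_state(state_file):
    """Load the docid -> upload time (UNIX epoch) mapping, or start empty."""
    try:
        with open(state_file, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def needs_upload(state, docid, source_path):
    """True if the document was never uploaded or its source changed since."""
    last_upload = state.get(docid)
    return last_upload is None or os.path.getmtime(source_path) > last_upload


def record_upload(state, docid, state_file):
    """Remember when docid was uploaded and persist the mapping."""
    state[docid] = time.time()
    with open(state_file, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


with tempfile.TemporaryDirectory() as tmp:
    post_path = os.path.join(tmp, "hello.markdown")
    state_file = os.path.join(tmp, "uploaded.json")
    with open(post_path, "w", encoding="utf-8") as fh:
        fh.write("Hello, tank!")
    an_hour_ago = time.time() - 3600
    os.utime(post_path, (an_hour_ago, an_hour_ago))  # pretend the post is an hour old

    state = load_state(state_file)
    first_run = needs_upload(state, "/hello.html", post_path)  # never uploaded yet
    record_upload(state, "/hello.html", state_file)
    second_run = needs_upload(load_state(state_file), "/hello.html", post_path)
    print(first_run, second_run)
```

On the second run the post is skipped because its modification time is older than the recorded upload; editing the file would make it eligible again.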
The initial upload of 1800+ documents took roughly 10 minutes. This sounds like a relatively long time (and maybe it is), but I assume it’s due to the inefficient tools I created as well as the relatively congested ADSL line I’m using. Nevertheless, quite adequate for me.
After uploading all my pages, I try an anonymous search on my API URL. The
query (q=) is “turkish”, and I’m asking IndexTank to fetch the “title” and
“tags” fields, and to return a snippet from the “text” field.
Note how the returned snippet has HTML tags inserted by IndexTank; they mark the position of the query terms in the snippet.
I run my indexer in the Makefile I use to build the site, just before pushing
the _site/ directory to its final destination. This allows me to ensure
an index is available on IndexTank seconds before the page is actually
visible on the Web. I beat Google to it! ;-)
Pascal Widdershoven implemented something similar using a Jekyll Ruby plug-in to do the indexing. If I understand his Ruby code correctly, he submits pages to IndexTank at each Jekyll run, which is rather “expensive”; I prefer my method.
I haven’t yet decided exactly on the user interface, although I have a first
draft working. It uses a background helper which does the actual search and
returns an HTML page which is overlaid into a div on the page.
As some of you know, I’m not a “Web 2.0 developer”, so there’s a lot (ALOT! :-) of know-how missing here… I suppose I could use one of the anonymous queries to get a JSONP result directly from the Web browser, and somehow fiddle it onto the page nicely, perhaps with some indextank-jquery…
A first try
Nothing terribly special here, I assume, but this is my first try using a bit of Mustache, a templating engine for all sorts of languages. (For an excellent discussion on JSON vs JSONP, see Cross-domain AJAX with JSONP.)
So far, so good: I’ll be using (and recommending) IndexTank regularly now, because it really delivers on its promise; this is good stuff.