Update 2011-10-12: IndexTank will be no more.
Google certainly has Web search down to a very fine art, but it finds only what its machines have crawled or been told to get via a sitemap. Submitting a new sitemap is an indication to Google that something on my site has probably changed, but it isn’t a “crawl now!” instruction; it can take anything from minutes to days until the changed sitemap is reflected in Google’s search engine.
I was searching (pun intended) for a method to have the content of my site indexed on-demand, without having to deploy my own search engine, and up came IndexTank, a cloud-service built atop Amazon SimpleDB which calls itself “hosted search you control”.
IndexTank basically gives me a REST interface with which to upload JSON documents I can later search through. IndexTank offers a free plan with up to five indexes, 100,000 documents, and unlimited daily queries. (Paid plans allow for more documents.)
When you sign up, you get two API URLs: use the private one for data manipulation and queries, and the public one for the auto-complete API (which also supports JSONP). IndexTank’s API lets you create and delete indexes and add, delete, and search documents, and there is a special API for auto-complete.
There’s a rather good hello world tutorial which gives an overview of the capabilities provided by the API. Documents can contain geolocation data which can provide search results customized to a user’s location.
Getting started
Registering for the free IndexTank account could not be simpler: an e-mail address suffices; IndexTank generates a (changeable) password and sends you that. I connected to the dashboard and created my first index. (I was a bit wasteful with creating/deleting indexes and had to contact support for help, but they reacted very quickly and solved my problem in the middle of their night.)
The dashboard shows me the indexes I currently have, their size, etc. and I can use it to test searches against whatever documents I have in the index.
First experiments
Documents I submit to IndexTank are JSON objects with an application-specific docid, which could be a part number, a URL, or almost any bit of data unique to your application, and fields which contain structured data.
Let me show you a small example. I create the following JSON in a file called i1.json:
{
    "docid" : "mykey1",
    "fields" : {
        "text" : "IndexTank lets me submit documents and search them",
        "title" : "Searching in the cloud",
        "tags" : "Search,Cloud"
    }
}
I then submit this document to my index (called t22) at IndexTank using an HTTP PUT request. (By the way, a PUT request creates new documents and overwrites existing documents.) MyAPIURL is my private URL as discussed above.
curl -X PUT http://MyAPIURL/v1/indexes/t22/docs -d @i1.json
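For a script that would rather not shell out to curl, the same PUT can be sketched in Python using only the standard library. This is a minimal sketch and not IndexTank's own client library: build_put_request is a hypothetical helper, MyAPIURL is still a placeholder for the private URL, and the request is only built here, never sent.

```python
import json
import urllib.request


def build_put_request(api_url, index, docid, fields):
    """Build (but do not send) a PUT request that adds or overwrites one document."""
    body = json.dumps({"docid": docid, "fields": fields}).encode("utf-8")
    url = f"{api_url}/v1/indexes/{index}/docs"
    return urllib.request.Request(url, data=body, method="PUT")


req = build_put_request(
    "http://MyAPIURL",  # placeholder for the private API URL
    "t22",
    "mykey1",
    {
        "text": "IndexTank lets me submit documents and search them",
        "title": "Searching in the cloud",
        "tags": "Search,Cloud",
    },
)
# Sending it would be: urllib.request.urlopen(req)
```

Keeping construction separate from sending makes it easy to inspect the URL and body before anything goes over the wire.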
And I can immediately search for it. When retrieving documents, IndexTank can be instructed to return one or more particular fields from your document, and it can return snippets of one of those fields. You’ll typically display these snippets of text in your interface so that a user sees the context in which the search term(s) appear. (Somewhat like what Google does on their results page.)
curl 'http://MyAPIURL/v1/indexes/t22/search?q=documents'
{
    "search_time" : "0.002",
    "matches" : 1,
    "facets" : {},
    "results" : [
        {
            "query_relevance_score" : -498,
            "docid" : "mykey1"
        }
    ]
}
Most of the JSON document returned is quite clear. results is an array with, well, the results of the search. In my program I’d use docid to retrieve the original document from my database, file system or wherever. matches contains the number of results. facets allow grouping of search results, and I have no idea how to use query_relevance_score.
To get at the other fields I originally submitted in the PUT request (title, tags), I append a list of fields I want to the URL:
curl 'http://MyAPIURL/v1/indexes/t22/search?q=documents&fetch=title,tags'
{
    "search_time" : "0.004",
    "matches" : 1,
    "facets" : {},
    "results" : [
        {
            "query_relevance_score" : -662,
            "docid" : "mykey1",
            "title" : "Searching in the cloud",
            "tags" : "Search,Cloud"
        }
    ]
}
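A program would normally assemble the search URL and pick the docids out of the response itself. Here is a minimal Python sketch of that, using only the q, fetch and snippet parameters that appear in this post; build_search_url and docids are hypothetical helper names, and the sample response is an abbreviated copy of the one above.

```python
import json
from urllib.parse import urlencode


def build_search_url(api_url, index, query, fetch=None, snippet=None):
    """Assemble a search URL with the q, fetch and snippet parameters."""
    params = {"q": query}
    if fetch:
        params["fetch"] = ",".join(fetch)
    if snippet:
        params["snippet"] = snippet
    # safe="," keeps the comma-separated field list readable, as in the curl call
    return f"{api_url}/v1/indexes/{index}/search?{urlencode(params, safe=',')}"


def docids(response_text):
    """Return the docid of every result in a search response."""
    return [hit["docid"] for hit in json.loads(response_text)["results"]]


url = build_search_url("http://MyAPIURL", "t22", "documents", fetch=["title", "tags"])

# Abbreviated copy of the response shown above
sample = '{"matches": 1, "results": [{"docid": "mykey1", "title": "Searching in the cloud"}]}'
found = docids(sample)
```

The docids are what I would then use to look up the original documents on my side.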
Finally, to clean up the experiment, let me delete the document:
curl -X DELETE 'http://MyAPIURL/v1/indexes/t22/docs' -d '{"docid":"mykey1"}'
(If you do a lot of command-line talking to a REST service, I recommend you use Resty.)
So, there are two parts to using IndexTank: collecting and submitting documents to the tank, followed by the actual searching.
It goes without saying, which is why I’ll say it anyway, that the data I submit is “in a cloud”. Since I have no control whatsoever over that data, I should avoid uploading (i.e. PUT) any sensitive data. In fact, IndexTank’s privacy policy explicitly says:
we cannot ensure or warrant the security of any information you transmit to Flaptor or guarantee that your information on the Service may not be accessed, disclosed, altered, or destroyed by breach of any of our physical, technical, or managerial safeguards.
I’ve told you. :-)
As far as using IndexTank for Web search of my site goes, I see no problem, as the data is public anyway, and other search engines have that already.
Implementing site search
I use Jekyll to generate this site, so I thought it would be a good idea to use IndexTank for site search. Basically that means the following steps:
- Collect the documents I want indexed
- Submit those documents to IndexTank
- Implement a search function for the site
The easiest approach would be to have Jekyll generate the site, and then submit the HTML pages with all the HTML markup they contain. To reduce the amount of data I have to upload to IndexTank and to index relevant content only, I’m going to prepare documents as follows:
- Read the Markdown source files which contain individual posts.
- Convert those to HTML.
- Strip the HTML markup.
That should leave me with the plain text of the post only, without links, image references, etc. I’m a bit old-fashioned in that I don’t like wasting resources (network bandwidth in this case) so I’ll also ensure I upload a document to IndexTank only if it’s been modified since the last upload. As a docid I’ll use the path to the HTML page, sans the host name (i.e. something like /2011/06/14/deliciously-static-from-wordpress-to-jekyll/).
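The markup-stripping step can be illustrated with Python's standard html.parser module. This is a sketch only, not the code the indexer described in this post uses: it assumes the Markdown has already been converted to HTML by some other tool, and TextExtractor and strip_markup are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


html = '<p>IndexTank is <a href="http://example.com/">hosted</a> search.</p><img src="x.png">'
plain = strip_markup(html)
```

Links and image references vanish because only text nodes are collected. The sketch assumes block elements are separated by whitespace in the HTML; if they are not, words on either side of a tag boundary run together.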
I can search an index on IndexTank using my private API URL, or I can use my public API URL for auto-complete searches. As I found out a bit later, I can also allow anonymous searches if I enable the public search API in the Manage portion of the dashboard. That is important for, say, performing a search from JavaScript in a Web page; without anonymous search I’d have to divulge my private API URL in the JavaScript.
Submitting documents to IndexTank
IndexTank have client libraries for Python, Ruby, PHP, and Java, but not for Perl. I found a small bit of code that uses Net::HTTP::Spore for accessing IndexTank, but I decided to use Perl’s LWP, instead of installing yet another Perl module.
My program searches for *.markdown files in the _posts and pages directories. For each file, it
- determines a docid based on the path name and the date in the YAML front-matter of the post
- retrieves the post’s tags
- grabs the Markdown from the posting
- converts the Markdown to HTML
- strips the HTML tags from the result
- and finally creates the JSON document
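The first and last of these steps, deriving the docid and assembling the JSON document, might look roughly like this in Python. It is a sketch under assumptions, not a port of the Perl program: it covers posts only, it assumes the usual Jekyll file name pattern YYYY-MM-DD-title.markdown, it takes the front-matter date as a ready-made YYYY-MM-DD string, and the file name in the usage example is hypothetical.

```python
import json
import re
from pathlib import PurePosixPath


def make_docid(path, date):
    """Derive a docid like /2011/06/14/some-title/ from a post path and its date.

    `date` is the YYYY-MM-DD value assumed to come from the YAML front matter.
    """
    # Drop the date prefix of the file name, if present, to get the title slug.
    slug = re.sub(r"^\d{4}-\d{2}-\d{2}-", "", PurePosixPath(path).stem)
    year, month, day = date.split("-")
    return f"/{year}/{month}/{day}/{slug}/"


def make_document(path, date, title, tags, text):
    """Assemble the JSON document, UTF-8 encoded and ready for upload."""
    doc = {
        "docid": make_docid(path, date),
        "fields": {"title": title, "tags": ",".join(tags), "text": text},
    }
    return json.dumps(doc, ensure_ascii=False).encode("utf-8")


payload = make_document(
    "_posts/2006-02-14-durum-yok.markdown",  # hypothetical file name
    "2006-02-14",
    "Dürüm Yok!",
    ["Food"],
    "plain text of the post",
)
```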
The final JSON document is UTF-8-encoded (see The Perl UTF-8 and utf8 Encoding Mess) and submitted to IndexTank for indexing. I record a time stamp (UNIX epoch time) for each docid thus submitted, and that enables me to skip uploading unchanged documents at the next run of the indexer.
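The time-stamp bookkeeping is simple enough to sketch in a few lines of Python. The state file format (a JSON map from docid to epoch seconds) and all function names here are assumptions made for illustration; the post does not say how the indexer actually stores its time stamps.

```python
import json
import os


def load_state(state_file):
    """Load the docid -> epoch-seconds map, or start empty on the first run."""
    if os.path.exists(state_file):
        with open(state_file, encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def save_state(state_file, state):
    """Persist the map so the next run can skip unchanged documents."""
    with open(state_file, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


def needs_upload(docid, source_mtime, state):
    """True if the document was never uploaded or changed since the last upload."""
    return source_mtime > state.get(docid, 0)


def mark_uploaded(docid, now, state):
    """Record the upload time (epoch seconds) for a docid."""
    state[docid] = now


state = {}
first = needs_upload("/2006/08/02/sucuk/", 1000, state)   # never uploaded
mark_uploaded("/2006/08/02/sucuk/", 2000, state)
second = needs_upload("/2006/08/02/sucuk/", 1000, state)  # unchanged since upload
third = needs_upload("/2006/08/02/sucuk/", 3000, state)   # edited afterwards
```

In a real run, source_mtime would come from os.path.getmtime() on the Markdown file and now from time.time().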
The initial upload of 1800+ documents took roughly 10 minutes. This sounds like a relatively long time (and maybe it is), but I assume it’s due to the inefficient tools I created as well as the relatively congested ADSL line I’m using. Nevertheless, quite adequate for me.
Querying IndexTank
After uploading all my pages, I try an anonymous search on my API URL. The query (q=) is “turkish”, and I’m asking IndexTank to fetch the “title” and “tags” fields, and to return a snippet from the “text” field.
curl -qs 'http://dh42u.api.indextank.com/v1/indexes/t22/search?q=turkish&fetch=title,tags&snippet=text'
{
    "search_time" : "0.005",
    "matches" : 3,
    "facets" : {},
    "results" : [
        {
            "query_relevance_score" : -1964,
            "docid" : "/2006/08/02/sucuk/",
            "title" : "Sucuk",
            "snippet_text" : "Germany it\n is sold in <b>Turkish</b> grocery stores. Cut the sucuk into slices (approximately\n 4 mm thick) and fry without additional fat in a pan, ensuring it doesn't get\n too",
            "tags" : "Food"
        },
        {
            "query_relevance_score" : -2005,
            "docid" : "/2006/03/01/turkish-style-pizza/",
            "title" : "Turkish Style Pizza",
            "snippet_text" : "I hadn't\n tasted before: Culinaria <b>Turkish</b> Style . It tastes quite different to\n other pizzas I've eaten; a little exotic shall we say. I won't be eating this\n on a regular",
            "tags" : "Food"
        },
        {
            "query_relevance_score" : -2013,
            "docid" : "/2006/02/14/durum-yok/",
            "title" : "Dürüm Yok!",
            "snippet_text" : "of a bank in\n the <b>Turkish</b> capital Ankara. It was a very interesting experience and I was\n able to pick up a couple dozen words of the <b>Turkish</b> language. My favorite word\n was (and still is): yok . It means \"nothing\" or \"not available\" and I\n understand it is added after the noun",
            "tags" : "Food"
        }
    ]
}
Note how the returned snippet has HTML tags inserted by IndexTank; they mark the position of the query terms in the snippet.
Jekyll integration
I run my indexer in the Makefile I use to build the site, just before pushing Jekyll’s _site/ directory to its final destination. This allows me to ensure an index is available on IndexTank seconds before the page is actually visible on the Web. I beat Google to it! ;-)
Alternative
Pascal Widdershoven implemented something similar using a Jekyll Ruby plug-in to do the indexing. If I understand his Ruby code correctly, he submits pages to IndexTank at each Jekyll run, which is rather “expensive”; I prefer my method.
User interface
I haven’t yet decided exactly on the user interface, although I have a first draft working. It uses a background helper which does the actual search and returns an HTML page which is overlaid into a div on the page.
As some of you know, I’m not a “Web 2.0 developer”, so there’s a lot (ALOT! :-) of know-how missing here… I suppose I could use one of the anonymous queries to get a JSONP result directly from the Web browser, and somehow fiddle it onto the page nicely, perhaps with some indextank-jquery…
A first try
Nothing terribly special here, I assume, but this is my first try using a bit of Mustache, a templating engine for all sorts of languages. (For an excellent discussion on JSON vs JSONP, see Cross-domain AJAX with JSONP.)
<!DOCTYPE html>
<html>
<head>
<script type="text/javascript" src="/inc/jquery-min.js"></script>
<script type="text/javascript" src="mustache.js"></script>
<style type="text/css">
    body { font-family:Verdana,sans-serif; }
    #output { margin-left: 2em; width: 90%; }
    li { padding: 10px; }
    .s_snippet { font-size: 90%; font-style:italic; }
    .s_date { font-size: 80%; }
    .s_tags { font-size: 80%; color: #e1e; }
</style>
<script type="text/javascript">
$(document).ready(function() {
    // Mustache template: one list item per search result
    var template = '<ul>{{#results}}<li>\
<a href="http://jpmens.net{{docid}}">{{title}}</a>\
<span class="s_snippet">{{snippet_text}}</span>\
<span class="s_date">{{date}}</span>\
<span class="s_tags">{{tags}}</span>\
</li>{{/results}}</ul>';

    $('#clickme').click(function(){
        var url = 'http://dh42u.api.indextank.com/v1/indexes/t22/search';
        var params = {
            q: 'yok', // search term
            fetch: 'title,tags,date',
            snippet: 'text'
        };

        // JSONP lets the browser query IndexTank across domains
        $.ajax({
            url: url,
            data: params,
            dataType: 'jsonp',
            success: function(data) {
                var markup = Mustache.to_html(template, data);
                // Turn IndexTank's <b> highlighting into <em>
                markup = markup.replace(/\<b\>/gi, '<em>');
                markup = markup.replace(/<\/b>/gi, '</em>');
                $('#output').html(markup);
            }
        });
        return false;
    });
});
</script>
</head>
<body>
<a href="#" id="clickme">click</a>
<div id='output'></div>
</body>
</html>
So far, so good: I’ll be using (and recommending) IndexTank regularly now, because it really delivers on its promise; this is good stuff.