Recovering a 6.5GB Elasticsearch Index from Amazon S3

Update: This was run in 2013 on Elasticsearch 0.9 (I think). Elasticsearch has changed considerably in the last few years and is now on version 1.7.1 - you can download the most recent version from the official Elasticsearch site.

Summary: A 6.5GB Elasticsearch index takes roughly 10:00 minutes to recover to a fresh Amazon EC2 M1.Small instance that was preconfigured to start its own cluster and pull the index from S3 upon launch.  The index was actually searchable ( although incomplete ) at around 8:30 minutes.  I did not expect the index to recover this quickly.  I'm constantly impressed by Elasticsearch.

System Configuration

I'm running an Elasticsearch index on a single Amazon EC2 instance.  It's configured to run with five shards and one replica ( default ).  This means it will work as a single node on a single machine or as a cluster with many nodes on many machines.  Elasticsearch is configured to back up to Amazon S3 using the Elasticsearch AWS Cloud Plugin.

I've created an Amazon Machine Image ( AMI ) that allows me to boot a fresh instance that will automatically launch Elasticsearch and join a running cluster.  If no cluster is found, it will pull the index from Amazon S3 onto its 50GB ephemeral drive and into memory, creating a cluster and becoming the master without me touching the instance.
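For reference, the kind of configuration that makes this work lived in elasticsearch.yml via the AWS Cloud Plugin on the 0.x series.  This is a rough sketch, not my exact config - the bucket name and credentials are placeholders, and the s3 gateway was later deprecated in favor of snapshot/restore:

```yaml
# elasticsearch.yml sketch for the 0.x AWS Cloud Plugin ( placeholders, not my actual values )
cloud:
    aws:
        access_key: YOUR_AWS_ACCESS_KEY
        secret_key: YOUR_AWS_SECRET_KEY

# discover and join other nodes via EC2 instead of multicast
discovery:
    type: ec2

# persist the index to an S3 bucket so a fresh node can recover it
gateway:
    type: s3
    s3:
        bucket: your-es-backup-bucket
```

With this in place, a new instance that finds no running cluster elects itself master and streams the index down from the bucket on its own.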

Index Configuration

I've generated a large Elasticsearch index containing real data used for a client's project.   The index contains two document types, Company and File.  The Company documents contain profile data about a company: name, url, description, addresses, keywords, etc.  Each Company document can include anywhere from a few to dozens of nested objects ( addresses, industries served, industry certifications, etc. ), which let us perform very specific searches that would not be possible if these were simple objects.  The Company documents are parents to the File documents.  File documents contain a few fields like title, description, keywords, etc., but mostly consist of an attached document in a specific format: doc, pdf, html, etc.
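To make the parent/child and nested setup concrete, a mapping along these lines would produce that structure.  The field names here are illustrative, not the client's actual schema, and the attachment type assumes the mapper-attachments plugin is installed:

```json
{
    "company" : {
        "properties" : {
            "name"        : { "type" : "string" },
            "url"         : { "type" : "string", "index" : "not_analyzed" },
            "description" : { "type" : "string" },
            "addresses" : {
                "type" : "nested",
                "properties" : {
                    "city"  : { "type" : "string" },
                    "state" : { "type" : "string" }
                }
            }
        }
    },
    "file" : {
        "_parent" : { "type" : "company" },
        "properties" : {
            "title"      : { "type" : "string" },
            "attachment" : { "type" : "attachment" }
        }
    }
}
```

Each nested object is indexed as its own hidden document, which is why the total document count below is so much larger than the visible Company and File counts.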

Index Stats

Stats are pulled from the Elasticsearch Head Plugin ( a web-based interface for looking at your index ).

  • There are 1,475,892 File documents and 454,252 Company documents.
  • The total document count, including nested objects in the Company type, is 4,954,814.
  • The total index size is 6.5GB.

Launching a New Cluster

To launch a new cluster, I stopped Elasticsearch on the system that I built the index with so that my new EC2 instance would not try to join that cluster. The new instance started its own cluster and pulled the index from S3.  Below are notes about the times it took to recover the index when launching our AMI on an M1.Small EC2 instance with a 50GB ephemeral drive ( M1.Small is the smallest instance size that Elasticsearch will run on. Micro instances don't have enough memory available and Elasticsearch fails during launch ).

  • 0:00 minutes - Launched new AMI from Amazon EC2 Dashboard
  • ~2:30 - Elasticsearch was available via its API ( curl -XGET 'http://AMAZON_HOST:9200/' )
  • 3:45 - Index size reported by Head Plugin: 1.1gb
  • 4:35 - Index size reported by Head Plugin: 1.2gb
  • 5:35 - Index size reported by Head Plugin: 1.7gb
  • 7:00 - EC2 Instance available via SSH
  • 7:36 - Disk usage reported by df: 6.2GB
  • ~8:30 - The index was searchable even though it was not fully loaded yet.  According to the Head Plugin, some shards were not fully distributed until this point.  This timing may vary between trials, and the index may become searchable ( though incomplete ) sooner.  ( API Request: curl -XGET 'http://AMAZON_HOST:9200/companies/company/_search?q=_all:adhesive&pretty=true&size=3' )
  • 8:50 - Index size reported by Head Plugin: 5.8gb
  • 9:10 - Disk usage reported by df: 8.1GB ( index almost fully loaded to disk )
  • ~10:00 - Index size reported by Head Plugin: 6.5gb ( index fully recovered )


  • The results returned by the index at 8:30 minutes, when it was first available, were different from the results returned when it was fully loaded at ~10:00 minutes.  They were still correct ( objectively ), but not all Company documents were available yet. ( API Request: curl -XGET 'http://AMAZON_HOST:9200/companies/company/_search?q=_all:adhesive&pretty=true&size=3' )
  • The first few searches on the index are very slow, but it warms up quickly.  I'm not exactly sure what is happening here.  I have not dug deep enough yet.  As far as I know, this index is too large to load entirely into memory on an M1.Small EC2 Instance, but search performance seems to be comparable to larger instances that have enough memory to store the entire index.  I have not quantified this.
  • The index is available and searchable in about 30 seconds when stopping and starting Elasticsearch on a machine where the index is already loaded to disk ( An M1.Small EC2 Instance ).

Simple MySQL-Based Sphinx Caching

This is based on some sample caching code I found on the Sphinx forum and works with Sphinx V2.02 and the Sphinx PHP API. Here are the updates I made:

I changed the MySQL table to use a VARCHAR for the request column and added an index to it.  The original BLOB had scaling issues almost immediately.  I also added a log to watch the cache at work.  The log file needs to be writable by Apache ( or your web server ) and be located in the sphinx/var/log/ directory.  The code currently lacks any cleanup of old or unused cache entries.
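If you need that cleanup, one simple approach is a timestamp column plus a periodic delete.  This is an untested sketch, not part of the code below - it assumes you add a created_at column to the sphinx_cache table and run the delete from cron:

```sql
-- Hypothetical cleanup sketch: first add a timestamp to each cache entry
ALTER TABLE `sphinx_cache`
  ADD COLUMN `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;

-- Then run periodically ( e.g. from cron ) to expire entries older than a day
DELETE FROM `sphinx_cache`
 WHERE `created_at` < NOW() - INTERVAL 1 DAY;
```

The right expiry window depends on how often your Sphinx index is rebuilt - stale cache entries will keep serving old results until they expire.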

My Sphinx setup is using a sharded index performing 20-40 queries per second across ~50M documents at peak times.  This simple MySQL-based caching is being used as a temporary fix to some scaling issues we've encountered on a site called Industry Cortex, a research tool for the manufacturing industry.  It's working pretty well for the time being, and could be a very effective and simple caching solution on the right systems with the right needs.

How to install this:

Replace the entire RunQueries() method in the Sphinx PHP API.  Update the username and password in the MySQLi connection line below.  Create the MySQL database ( mine is called `sphinx_cache` ) and create the `sphinx_cache` table ( same name as the database ) with the SQL code below.  Then create your log file and give your web server write permissions on it.  You should now have a working Sphinx cache.

Sphinx API updates:

function RunQueries ()
{
	if ( empty($this->_reqs) )
	{
		$this->_error = "no queries defined, issue AddQuery() first";
		return false;
	}

	// mbstring workaround
	$this->_MBPush ();

	if (!( $fp = $this->_Connect() ))
	{
		$this->_MBPop ();
		return false;
	}

	// send query, get response
	$nreqs = count( $this->_reqs );
	$req = join ( "", $this->_reqs );
	$len = 8 + strlen( $req );
	$req = pack ( "nnNNN", SEARCHD_COMMAND_SEARCH, VER_COMMAND_SEARCH, $len, 0, $nreqs ) . $req; // add header


	$mysqli = new mysqli( 'localhost', 'username', 'password', 'sphinx_cache' );
	$cache_log = '/data/sphinx/var/log/cache.log';

	// check for query in mysql cache
	$cache_query_start = microtime( true );
	$results = $mysqli->query( "SELECT response FROM sphinx_cache WHERE request='" . $mysqli->real_escape_string( $req ) . "'" );
	$cache_query_end = microtime( true );
	$cache_query_time = $cache_query_end - $cache_query_start;

	if( $results->num_rows > 0 ) { // there is a cached result

		// use cached result
		$row = $results->fetch_row( );
		$response = $row[0];

		// write cache hit to log
		// note: date format is 'H:i:s' ( hours:minutes:seconds ); 'h:m:s' would log the month in place of minutes
		file_put_contents( $cache_log, "  cached query served, $cache_query_time, " . date('Y-m-d H:i:s') . ', ' . $_SERVER['PHP_SELF'] . ', ' . $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND );

	} else { // no cached result

		// query sphinx
		$sphinx_query_start = microtime( true );
		if ( !( $this->_Send ( $fp, $req, $len+8 ) ) ||
			 !( $response = $this->_GetResponse ( $fp, VER_COMMAND_SEARCH ) ) )
		{
			$this->_MBPop ( );
			return false;
		}
		$sphinx_query_end = microtime( true );
		$sphinx_query_time = $sphinx_query_end - $sphinx_query_start;

		// store new result into mysql cache
		$mysqli->query( "INSERT INTO sphinx_cache (request, response) VALUES ('" . $mysqli->real_escape_string( $req ) . "','" . $mysqli->real_escape_string( $response ) . "')");

		// write cache miss to log
		file_put_contents( $cache_log, "* sphinx query served, $sphinx_query_time, " . date('Y-m-d H:i:s') . ', ' . $_SERVER['PHP_SELF'] . ', ' . $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND );

	} // end cache miss

	// === END CACHE EDITS ===

	// query sent ok; we can reset reqs now
	$this->_reqs = array ();

	// parse and return response
	return $this->_ParseSearchResponse ( $response, $nreqs );
}


SQL code to generate the caching database

My database is called sphinx_cache ... the same as the table.

CREATE TABLE `sphinx_cache` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `request` varchar(5000) NOT NULL DEFAULT '',
  `response` text NOT NULL,
  PRIMARY KEY (`id`),
  KEY `request` (`request`(333))
);