Google Calendar Tide Chart for Surfers ... and other people that care about tides

I wrote a PHP script to generate a Tide Chart Calendar CSV for import into Google Calendar ( and other calendar apps ). To generate a tide chart for your location, download the desired chart data from NOAA, paste it into the Tide Chart Calendar Generator, execute the PHP script, and then import the resulting CSV into your calendar app.
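
The CSV Google Calendar imports is just a header row plus one line per event. Here's a rough sketch of the conversion, under some assumptions: the input is a tab-separated NOAA annual predictions export with date, day, time, height, and high/low columns, and the file names are placeholders rather than what the actual generator uses.

<?php
// Rough sketch: convert NOAA tide predictions into a Google Calendar CSV.
// Assumes a tab-separated export with columns: date, day, time, height, high/low.
$in  = fopen( 'noaa_tide_predictions.txt', 'r' );   // placeholder input file
$out = fopen( 'tide_calendar.csv', 'w' );

// header row Google Calendar expects on import
fputcsv( $out, array( 'Subject', 'Start Date', 'Start Time', 'End Date', 'End Time', 'All Day Event', 'Description' ) );

while ( ( $line = fgets( $in ) ) !== false ) {
    $cols = preg_split( '/\t+/', trim( $line ) );
    if ( count( $cols ) < 5 ) continue;                     // skip headers and blank lines
    list( $date, $day, $time, $height, $highlow ) = $cols;
    if ( strtotime( $date ) === false ) continue;           // skip anything that isn't a prediction row

    $subject  = ( $highlow == 'H' ? 'High' : 'Low' ) . " Tide ( {$height} ft )";
    $day_fmt  = date( 'm/d/Y', strtotime( $date ) );
    $time_fmt = date( 'h:i A', strtotime( "{$date} {$time}" ) );

    fputcsv( $out, array( $subject, $day_fmt, $time_fmt, $day_fmt, $time_fmt, 'False', "Predicted {$highlow} tide, {$height} ft" ) );
}

fclose( $in );
fclose( $out );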

A list of free email provider domains / URLs

Here's a list of free email provider domains / URLs as a GitHub Gist.  I compiled it from a bunch of other lists I found around the web.  I use it in some projects to help filter out non-business emails, but it probably has a lot of other uses.
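
For example, a quick way to use it in PHP ( the file name is just wherever you save the gist ):

<?php
// Flag free / non-business email addresses using the gist ( one domain per line ).
$free_domains = array_flip( array_map( 'trim', file( 'free_email_provider_domains.txt' ) ) );

function is_free_email( $email, $free_domains ) {
    $domain = strtolower( substr( strrchr( $email, '@' ), 1 ) );
    return isset( $free_domains[ $domain ] );
}

var_dump( is_free_email( 'someone@gmail.com', $free_domains ) );           // bool(true)
var_dump( is_free_email( 'someone@example-widgets.com', $free_domains ) ); // bool(false), assuming it's not in the list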

Google Alerts API Client for Creating Alerts in Bulk

Google Alerts does not have an official API, but sometimes you just need to create a lot of alerts.  I did, so I pulled some code together from Stack Overflow and built my own PHP class for creating Google Alerts in bulk.

One caveat is that I didn't build any way to remove alerts in bulk ( although you could definitely build one on top of this base class ), so be careful when creating alerts in your primary account.  It's easy to flood it with alerts and then have no way to delete them.  I actually created some new accounts just for bulk alerts.

Also, this class currently logs in for every alert it makes, so it's not super fast.  Some cleanup could definitely be done.
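
Usage ends up looking roughly like this ( the class and method names here are placeholders, so check the repo for the real ones ):

<?php
// Hypothetical usage sketch -- names are placeholders, see the actual class for its API.
require 'GoogleAlerts.php';

// use a throwaway account; it's easy to flood an account with alerts you can't bulk-delete
$alerts = new GoogleAlerts( 'bulk-alerts-account@gmail.com', 'password' );

$terms = array( '"acme widgets"', '"acme widgets" reviews', 'acme widgets recall' );

foreach ( $terms as $term ) {
    $alerts->create( $term );   // the class logs in for every alert, so this is slow
    sleep( 2 );                 // be gentle -- there is no official API
}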

PHP API Client for the Data Science Toolkit

The Data Science Toolkit ( DST ) is a fantastic open RESTful API resource for all sorts of data processing use cases. Check out the developer documentation to see everything that's available. Some of the more surprising resources include multiple geocoding endpoints, two of which are drop-in replacements for Google's and Yahoo's geocoding APIs. Big projects will probably be rate limited, however.
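
To get a feel for it, the geocoding endpoints can be hit with a plain GET ( street2coordinates is the endpoint as I remember it from the docs; check there for the exact response shape ):

<?php
// Geocode a single address against the open DSTK API -- no client required.
$address  = urlencode( '1600 Pennsylvania Ave, Washington, DC' );
$response = file_get_contents( "http://www.datasciencetoolkit.org/street2coordinates/{$address}" );
$result   = json_decode( $response, true );

print_r( $result );   // keyed by the address, with latitude / longitude in the value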

Your Own DST Server

The really cool thing about the DST is that the project maintains an open and freely available Amazon Machine Image of the entire system. I ended up using this internally on a project where we needed to do a lot of geocoding really fast. Just boot up the AMI and start making requests. It operates exactly like the open API they maintain at datasciencetoolkit.org, just at your own URL.

A DST PHP API Client

In the process, I built a simple PHP API client for the Data Science Toolkit. Only the geocoding endpoints are currently defined in my client, but it should be a good start for other projects. By default, the client uses the open API at datasciencetoolkit.org, but you can set a custom URL if you are running your own AMI of the DST.  Everything you need to know is in the README.
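
Usage is roughly like this ( the method names below are placeholders, not necessarily what the client exposes; the README is authoritative ):

<?php
// Hypothetical usage sketch of the DST client.
require 'DataScienceToolkit.php';

$dstk = new DataScienceToolkit();                 // defaults to the open API at datasciencetoolkit.org
// $dstk->setUrl( 'http://YOUR_DSTK_AMI_HOST' );  // or point it at your own AMI

$result = $dstk->street2coordinates( '1600 Pennsylvania Ave, Washington, DC' );
print_r( $result );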

Recovering a 6.5GB Elasticsearch Index from Amazon S3

Update: This was run in 2013 on Elasticsearch 0.9 (I think). Elasticsearch has changed considerably in the last few years and is now on version 1.7.1 - download the most recent version of Elasticsearch here.

Summary: A 6.5GB Elasticsearch index takes roughly 10 minutes to recover to a fresh Amazon EC2 M1.Small Instance that was preconfigured to start its own cluster and pull the index from S3 upon launch.  The index was actually searchable ( although incomplete ) at around 8:30.  I did not expect the index to recover this quickly.  I'm constantly impressed by Elasticsearch.

System Configuration

I'm running an Elasticsearch index on a single Amazon EC2 instance.  It's configured to run with five shards and one replica ( default ).  This means it will work as a single node on a single machine or as a cluster with many nodes on many machines.  Elasticsearch is configured to back up to Amazon S3 using the Elasticsearch AWS Cloud Plugin.

I've created an Amazon Machine Image ( AMI ) that allows me to boot a fresh instance that will automatically launch Elasticsearch and join a running cluster.  If no cluster is found, it will pull the index from Amazon S3 onto its 50GB ephemeral drive and into memory, creating a cluster and becoming the master without me touching the instance.
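
The relevant elasticsearch.yml settings looked roughly like this for that era of the AWS Cloud Plugin ( setting names have changed in later releases, so treat these as an approximation and check the plugin docs for your version ):

# EC2 discovery: a fresh node joins any running cluster it can find
cluster.name: my-cluster
discovery.type: ec2
cloud.aws.access_key: YOUR_ACCESS_KEY
cloud.aws.secret_key: YOUR_SECRET_KEY

# S3 gateway: if no cluster is found, recover the index from this bucket
gateway.type: s3
gateway.s3.bucket: my-elasticsearch-backups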

Index Configuration

I've generated a large Elasticsearch index containing real data used for a client's project.  The index contains two document types, Company and File.  The Company documents contain profile data about a company: name, url, description, addresses, keywords, etc.  Each Company document can include anywhere from a few to dozens of nested objects ( addresses, industries served, industry certifications, etc. ), which let us perform very specific searches that wouldn't be possible if these were simple objects.  The Company documents are parents to the File documents.  File documents contain a few fields like title, description, and keywords, but mostly consist of an attached document in a specific format: doc, pdf, html, etc.
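
As a rough sketch of how that's set up in the mappings ( field names here are illustrative, not the real schema; the attachment type comes from the mapper-attachments plugin ):

curl -XPUT 'http://AMAZON_HOST:9200/companies/company/_mapping' -d '{
  "company" : {
    "properties" : {
      "name"      : { "type" : "string" },
      "addresses" : {
        "type" : "nested",
        "properties" : {
          "city"  : { "type" : "string" },
          "state" : { "type" : "string" }
        }
      }
    }
  }
}'

curl -XPUT 'http://AMAZON_HOST:9200/companies/file/_mapping' -d '{
  "file" : {
    "_parent"    : { "type" : "company" },
    "properties" : {
      "title"      : { "type" : "string" },
      "attachment" : { "type" : "attachment" }
    }
  }
}'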

Index Stats

Stats are pulled from the Elasticsearch Head Plugin ( a web-based interface for looking at your index ).

  • There are 1,475,892 File documents and 454,252 Company documents.
  • Total documents in the index, including nested objects in the Company type, are 4,954,814.
  • The total index size is 6.5GB.

Launching a New Cluster

To launch a new cluster, I stopped Elasticsearch on the system that I built the index with so that my new EC2 instance would not try to join that cluster. The new instance started its own cluster and pulled the index from S3.  Below are notes on the time it took to recover the index when launching our AMI on an M1.Small EC2 Instance with a 50GB ephemeral drive ( M1.Small is the smallest instance size that Elasticsearch will run on; Micro instances don't have enough memory available and Elasticsearch fails during launch ).

  • 0:00 minutes - Launched new AMI from Amazon EC2 Dashboard
  • ~2:30 - Elasticsearch was available via its API ( curl -XGET 'http://AMAZON_HOST:9200/' )
  • 3:45 - Index size reported by Head Plugin: 1.1gb
  • 4:35 - Index size reported by Head Plugin: 1.2gb
  • 5:35 - Index size reported by Head Plugin: 1.7gb
  • 7:00 - EC2 Instance available via SSH
  • 7:36 - Disk usage reported by DF: 6.2gb
  • ~8:30 - The index is searchable even though it was not fully loaded yet.  Some shards were not distributed properly according to Head until this point.  This may vary between trials, with the index becoming searchable sooner, even though incomplete.  ( API Request: curl -XGET 'http://AMAZON_HOST:9200/companies/company/_search?q=_all:adhesive&pretty=true&size=3' )
  • 8:50 - Index size reported by Head Plugin: 5.8gb
  • 9:10 - Disk usage reported by DF: 8.1gb ( index almost fully loaded to disk )
  • ~10:00 - Index size reported by Head Plugin: 6.5gb ( index fully recovered )

Notes

  • The results returned by the index at 8:30, when it was first available, were different from the results returned when it was fully loaded at ~10:00.  They were still correct ( objectively ), but not all Company documents were available yet. ( API Request: curl -XGET 'http://AMAZON_HOST:9200/companies/company/_search?q=_all:adhesive&pretty=true&size=3' )
  • The first few searches on the index are very slow, but it warms up quickly.  I'm not exactly sure what is happening here.  I have not dug deep enough yet.  As far as I know, this index is too large to load entirely into memory on an M1.Small EC2 Instance, but search performance seems to be comparable to larger instances that have enough memory to store the entire index.  I have not quantified this.
  • The index is available and searchable in about 30 seconds when stopping and starting Elasticsearch on a machine where the index is already loaded to disk ( an M1.Small EC2 Instance ).

Simple MySQL-Based Sphinx Caching

This is based on some sample caching code I found on the Sphinx forum and works with Sphinx V2.02 and the Sphinx PHP API. Updates I made:

  • I changed the MySQL schema to use a varchar for the request and added an index to it ( the blob had scaling issues almost immediately ).
  • I added a log to watch the cache at work.  The log file needs to be writable by Apache ( or your web server ) and be located in the sphinx/var/log/ directory.

It currently lacks any cleanup of old or unused cache entries ( a simple pruning query is sketched after the SQL below ).

My Sphinx setup is using a sharded index performing 20-40 queries per second across ~50M documents at peak times.  This simple MySQL-based caching is being used as a temporary fix to some scaling issues we've encountered on a site called Industry Cortex, a research tool for the manufacturing industry.  It's working pretty well for the time being, and could be a very effective and simple caching solution on the right systems with the right needs.

How to install this:

  • Replace the entire RunQueries() method in the Sphinx PHP API with the version below.
  • Update the username and password in the new mysqli( ... ) call so it can connect to MySQL via MySQLi.
  • Create the MySQL database ( mine is called `sphinx_cache` ) and the `sphinx_cache` table ( same name as the database ) with the SQL code below.
  • Create your log file and give your web server write permission on it.

You should now have a working Sphinx cache.

Sphinx API updates:

function RunQueries ()
{

	if ( empty($this->_reqs) )
	{
		$this->_error = "no queries defined, issue AddQuery() first";
		return false;
	}

	// mbstring workaround
	$this->_MBPush ();

	if (!( $fp = $this->_Connect() ))
	{
		$this->_MBPop ();
		return false;
	}

	// send query, get response
	$nreqs = count( $this->_reqs );
	$req = join ( "", $this->_reqs );
	$len = 8 + strlen( $req );
	$req = pack ( "nnNNN", SEARCHD_COMMAND_SEARCH, VER_COMMAND_SEARCH, $len, 0, $nreqs ) . $req; // add header

	// === BEGIN CACHE EDITS ===

	$mysqli = new mysqli( 'localhost', 'username', 'password', 'sphinx_cache' );
	$cache_log = '/data/sphinx/var/log/cache.log';

	// check for query in mysql cache
	$cache_query_start = microtime( true );
	$results = $mysqli->query( "SELECT response FROM sphinx_cache WHERE request='" . $mysqli->real_escape_string( $req ) . "'" );
	$cache_query_end = microtime( true );
	$cache_query_time = $cache_query_end - $cache_query_start;

	if( $results->num_rows > 0 ) { // there is a cached result

		// use cached result
		$row = $results->fetch_row( );
		$response = $row[0];

		// write cache hit to log
		file_put_contents( $cache_log, "  cached query served, $cache_query_time, " . date('Y-m-d h:m:s') . ', ' . $_SERVER['PHP_SELF'] . ', ' . $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND );

	} else { // no cached result

		// query sphinx
		$sphinx_query_start = microtime( true );
		if ( !( $this->_Send ( $fp, $req, $len+8 ) ) ||
			 !( $response = $this->_GetResponse ( $fp, VER_COMMAND_SEARCH ) ) )
		{
			$this->_MBPop ( );
			return false;
		}
		$sphinx_query_end = microtime( true );
		$sphinx_query_time = $sphinx_query_end - $sphinx_query_start;

		// store new result into mysql cache
		$mysqli->query( "INSERT INTO sphinx_cache (request, response) VALUES ('" . $mysqli->real_escape_string( $req ) . "','" . $mysqli->real_escape_string( $response ) . "')");

		// write cache miss to log
		file_put_contents( $cache_log, "* sphinx query served, $sphinx_query_time, " . date('Y-m-d h:m:s') . ', ' . $_SERVER['PHP_SELF'] . ', ' . $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND );

	}

	// === END CACHE EDITS ===

	// query sent ok; we can reset reqs now
	$this->_reqs = array ();

	// parse and return response
	return $this->_ParseSearchResponse ( $response, $nreqs );

}

SQL code to generate the caching table

My database is called sphinx_cache ... the same as the table.

CREATE TABLE `sphinx_cache` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `request` varchar(5000) NOT NULL DEFAULT '',
  `response` text NOT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `request` (`request`(333))
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
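
The table keeps a timestamp on every row, so if you want the cleanup the code currently lacks, a cron'd DELETE is probably the simplest option ( the one-day cutoff is arbitrary ):

DELETE FROM `sphinx_cache` WHERE `timestamp` < NOW() - INTERVAL 1 DAY;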