Version: 0.3
REST service for Mozilla Metrics. This service currently uses Hazelcast as a distributed in-memory map with short TTLs. It also provides implementations of the Hazelcast MapStore interface that persist the map data to various data sinks.
This code is built against the following versions. You may get mixed results if you deviate from them.
- Hadoop 0.20.2+
- HBase 0.90+
- Hazelcast 1.9.3
- Elastic Search 0.16.2
To make a jar you can do:
mvn package
The jar file is then located under the target directory.
To run Bagheera on another machine, you will probably want to use the dist assembly. You need to deploy it to your deployment target, which I'll call BAGHEERA_HOME. Build the assembly like so:
mvn assembly:assembly
The zip file now under the target directory should be deployed to BAGHEERA_HOME on the remote server.
To run Bagheera you can use bin/bagheera or copy the init.d script by the same name from bin/init.d to /etc/init.d. The init script assumes an installation of Bagheera at /usr/lib/bagheera, but this can be modified by changing the BAGHEERA_HOME variable near the top of that script. Here is an example of using the regular bagheera script:
bin/bagheera 8080 conf/hazelcast.xml.example
If you start up multiple instances, Hazelcast will auto-discover the other instances, assuming your network and hazelcast.xml are set up to do so.
Bagheera takes POST data on /submit/mymapname/unique-id. Depending on how mymapname is configured in the Hazelcast configuration file, the data may be written to different data sinks. That is explained further below.
Here's a quick rundown of HTTP return codes that Bagheera could send back (this isn't comprehensive but rather the most common ones):
- 204 No Content - returned if everything was submitted successfully
- 406 Not Acceptable - returned if the POST failed validation in some manner
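As a rough illustration of the submit endpoint and the two return codes listed above, here is a minimal client sketch in Python. The function names are hypothetical, the JSON payload is an assumption (the README does not specify a body format), and port 8080 is simply the one used in the bin/bagheera example.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def build_submit_url(base_url, map_name, unique_id):
    """Build the /submit/<map name>/<unique id> URL described above."""
    return "{}/submit/{}/{}".format(
        base_url.rstrip("/"), quote(map_name, safe=""), quote(unique_id, safe="")
    )


def describe_status(status):
    """Translate the two most common return codes into a short message."""
    if status == 204:
        return "submitted successfully"
    if status == 406:
        return "POST failed validation"
    return "unexpected status {}".format(status)


def submit(base_url, map_name, unique_id, payload):
    """POST raw bytes to Bagheera and return the HTTP status code."""
    request = urllib.request.Request(
        build_submit_url(base_url, map_name, unique_id),
        data=payload,
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still useful.
        return err.code
```

With an instance running locally, something like `describe_status(submit("http://localhost:8080", "mymapname", "unique-id", b'{"key": "value"}'))` would report the outcome of a single submission.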
Suppose you've created a table called 'mytable' in HBase like so:
create 'mytable', {NAME => 'data', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
All you need to do is add a section like this to the hazelcast.xml configuration file:
<map name="mytable">
<time-to-live-seconds>20</time-to-live-seconds>
<backup-count>1</backup-count>
<eviction-policy>NONE</eviction-policy>
<max-size>0</max-size>
<eviction-percentage>25</eviction-percentage>
<merge-policy>hz.ADD_NEW_ENTRY</merge-policy>
<!-- HBaseMapStore -->
<map-store enabled="true">
<class-name>com.mozilla.bagheera.hazelcast.persistence.HBaseMapStore</class-name>
<write-delay-seconds>5</write-delay-seconds>
<property name="hazelcast.hbase.pool.size">20</property>
<property name="hazelcast.hbase.table">mytable</property>
<property name="hazelcast.hbase.column.family">data</property>
<property name="hazelcast.hbase.column.qualifier">json</property>
</map-store>
</map>
Notice you can tweak the HBase connection pool size, table and column names as needed for different maps.
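When several maps are configured, it can help to list what each one persists to. The sketch below is not part of Bagheera; it is a small Python helper (hypothetical name map_store_settings) that reads a configuration fragment and reports each enabled map-store with its properties. It assumes the map sections sit inside a plain hazelcast root element without an XML namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the HBaseMapStore section, wrapped in an assumed root element.
SAMPLE = """<hazelcast>
  <map name="mytable">
    <time-to-live-seconds>20</time-to-live-seconds>
    <map-store enabled="true">
      <class-name>com.mozilla.bagheera.hazelcast.persistence.HBaseMapStore</class-name>
      <write-delay-seconds>5</write-delay-seconds>
      <property name="hazelcast.hbase.pool.size">20</property>
      <property name="hazelcast.hbase.table">mytable</property>
      <property name="hazelcast.hbase.column.family">data</property>
      <property name="hazelcast.hbase.column.qualifier">json</property>
    </map-store>
  </map>
</hazelcast>"""


def map_store_settings(xml_text):
    """Return {map name: {"class": ..., "properties": {...}}} for enabled stores."""
    root = ET.fromstring(xml_text)
    settings = {}
    for map_el in root.iter("map"):
        store = map_el.find("map-store")
        if store is None or store.get("enabled") != "true":
            continue  # map has no persistence configured
        settings[map_el.get("name")] = {
            "class": store.findtext("class-name"),
            "properties": {p.get("name"): p.text for p in store.iter("property")},
        }
    return settings
```

Calling `map_store_settings(SAMPLE)["mytable"]["properties"]` returns the pool size, table, column family, and qualifier as strings, which makes it easy to compare maps side by side.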
If you want to configure a Hazelcast Map to persist data to HDFS you can use the HdfsMapStore. It will write a SequenceFile with Text key/value pairs. Currently it will always use block compression. In the future we may add support for more compression codecs or alternative file formats. This MapStore will roll over and write a new file every day or when hazelcast.hdfs.max.filesize is reached. It will write files to the directory hazelcast.hdfs.basedir/hazelcast.hdfs.dateformat/UUID. Please note that hazelcast.hdfs.max.filesize is only checked against a bytes-written counter and not the actual file size in HDFS. Actual file sizes will probably be much smaller than this number due to block compression. Here is an example section using this MapStore from the hazelcast.xml configuration:
<map name="mymapname">
<time-to-live-seconds>20</time-to-live-seconds>
<backup-count>1</backup-count>
<eviction-policy>NONE</eviction-policy>
<max-size>0</max-size>
<eviction-percentage>25</eviction-percentage>
<merge-policy>hz.ADD_NEW_ENTRY</merge-policy>
<!-- HdfsMapStore -->
<map-store enabled="true">
<class-name>com.mozilla.bagheera.hazelcast.persistence.HdfsMapStore</class-name>
<write-delay-seconds>5</write-delay-seconds>
<property name="hazelcast.hdfs.basedir">/bagheera</property>
<property name="hazelcast.hdfs.dateformat">yyyy-MM-dd</property>
<property name="hazelcast.hdfs.max.filesize">1073741824</property>
</map-store>
</map>
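To make the directory layout and rollover rule for the HdfsMapStore concrete, here is a small Python sketch. It is not the HdfsMapStore code: the names output_path and RolloverTracker are hypothetical, only the yyyy, MM, and dd date tokens from the example are translated, and treating the limit as "bytes written greater than or equal to the maximum" is an assumption.

```python
from datetime import date

# Only the Java date tokens used in the example config are handled here.
_JAVA_TO_STRFTIME = (("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"))


def output_path(basedir, java_dateformat, day, file_id):
    """Compose basedir/<formatted date>/<file id>, mirroring the layout above."""
    fmt = java_dateformat
    for java_token, py_token in _JAVA_TO_STRFTIME:
        fmt = fmt.replace(java_token, py_token)
    return "{}/{}/{}".format(basedir.rstrip("/"), day.strftime(fmt), file_id)


class RolloverTracker:
    """Track bytes written (not the compressed size on disk) for one file."""

    def __init__(self, max_filesize, day):
        self.max_filesize = max_filesize
        self.day = day
        self.bytes_written = 0

    def record_write(self, num_bytes):
        self.bytes_written += num_bytes

    def should_roll(self, today):
        # Roll on a new day, or once the counter reaches the limit (assumed >=).
        return today != self.day or self.bytes_written >= self.max_filesize
```

For example, `output_path("/bagheera", "yyyy-MM-dd", date(2011, 6, 1), file_id)` with a freshly generated UUID string as file_id gives a path under /bagheera/2011-06-01. Because the tracker counts uncompressed bytes, a file that triggers rollover at 1073741824 bytes written will usually occupy less space in HDFS.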
The ElasticSearchIndexQueueStore is our first MapStore that takes advantage of Hazelcast's distributed queues. Hazelcast added persistence for distributed queues in version 1.9.3. The idea behind this store is that if you already have data being inserted into HBase, you can post a row ID via REST to a queue. Once the ID is received and the MapStore persistence is triggered, the store takes the value from an HBase column and sends it to ElasticSearch for indexing. Here is an example section using this MapStore from the hazelcast.xml configuration:
<map name="mymapname">
<time-to-live-seconds>20</time-to-live-seconds>
<backup-count>1</backup-count>
<eviction-policy>NONE</eviction-policy>
<max-size>0</max-size>
<eviction-percentage>25</eviction-percentage>
<merge-policy>hz.ADD_NEW_ENTRY</merge-policy>
<!-- ElasticSearchIndexQueueStore -->
<map-store enabled="true">
<class-name>com.mozilla.bagheera.hazelcast.persistence.ElasticSearchIndexQueueStore</class-name>
<write-delay-seconds>5</write-delay-seconds>
<property name="hazelcast.elasticsearch.index">socorro</property>
<property name="hazelcast.elasticsearch.type.name">crash_reports</property>
<property name="hazelcast.hbase.pool.size">20</property>
<property name="hazelcast.hbase.table">crash_reports</property>
<property name="hazelcast.hbase.column.family">processed_data</property>
<property name="hazelcast.hbase.column.qualifier">json</property>
</map-store>
</map>
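The flow behind the ElasticSearchIndexQueueStore (queued row ID, HBase column lookup, ElasticSearch indexing) can be sketched with in-memory stand-ins. This is only an illustration of the idea in Python, not the store's implementation: index_queued_ids is a hypothetical name, the two callables stand in for the HBase and ElasticSearch clients, and collecting IDs with no stored value is an assumption about how missing rows might be handled.

```python
def index_queued_ids(row_ids, fetch_column, index_document):
    """Look up each queued row ID and pass its column value to the indexer.

    fetch_column(row_id) stands in for an HBase get on the configured
    column family and qualifier; index_document(row_id, value) stands in
    for an ElasticSearch index request. Returns the IDs that had no value.
    """
    missing = []
    for row_id in row_ids:
        value = fetch_column(row_id)
        if value is None:
            missing.append(row_id)  # assumed handling: report, do not index
            continue
        index_document(row_id, value)
    return missing


# Stand-ins: a dict plays the HBase table, another collects indexed documents.
hbase_rows = {"row-1": '{"signature": "example"}'}
indexed = {}

missing_ids = index_queued_ids(
    ["row-1", "row-2"],
    fetch_column=hbase_rows.get,
    index_document=indexed.__setitem__,
)
```

After this runs, indexed holds the document for row-1 and missing_ids contains row-2, which had no value in the stand-in table.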
To read more on Hazelcast configuration in general check out their documentation.
All aspects of this software written in Java are distributed under Apache Software License 2.0. See LICENSE file for full license text.
All aspects of this software written in Python are distributed under the Mozilla Public License MPL/LGPL/GPL tri-license.
- Xavier Stevens (@xstevens)
- Daniel Einspanjer (@deinspanjer)
- Anurag Phadke (@anuragphadke)