Wednesday, March 18, 2015

Using Elasticsearch and Moqui for Analytics

Introduction to Elasticsearch in Moqui


Elasticsearch has been used in Moqui for years for search purposes, finding things like tasks, products, and wiki pages. It is a great tool for finding data quickly from a wide variety of sources indexed using Apache Lucene for flexible full-text searching.

The Data Document feature in Moqui makes it easy to define a JSON document (or nested Map/List structure) derived from relational database records. The Moqui Data Feed feature provides a real-time link between Data Documents and tools that use them, including the service built into Moqui that receives documents from a feed to index them for searching and other purposes. This can also be used to send data to other systems, trigger emails, and so on.
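As a rough illustration (the entity and field names here are made up, not taken from an actual Mantle Data Document definition), a document derived from an order and its items might look something like this as a nested Groovy Map:
    // hypothetical document combining an order header with its order items
    exampleDocument = [
        orderId: '100001', statusId: 'OrderApproved', placedDate: '2015-03-18',
        vendorPartyId: 'ORG_ZIZI_RETAIL', customerPartyId: 'CustDemo',
        items: [
            [orderItemSeqId: '01', productId: 'DEMO_1_1', quantity: 5.0, unitAmount: 16.99],
            [orderItemSeqId: '02', productId: 'DEMO_3_1', quantity: 1.0, unitAmount: 7.77]
        ]
    ]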

Elasticsearch is much more than just a full-text search tool. The faceted searches are helpful when searching structured data, but indexing by different fields within a document can be used for so much more. To get an idea of the scope and usefulness of Elasticsearch look at the home page for it:

https://www.elastic.co/products/elasticsearch

Analytics Capabilities and Easy Use in Moqui


The analytics capabilities of ES combine its distributed, big-data scale search and faceted filtering with a wide variety of tools for bucketing and aggregating data. For a list of the aggregation functions available, and a summary of the aggregation feature in general, see:

http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

It has basic numeric functions for "metrics" like min, max, sum, and avg plus more advanced ones like percentiles, histograms, geographic areas, extracting significant terms, and even scripts for custom metrics. These can be calculated over sets of documents split into "buckets" distinguished by terms, ranges of values, and more. Filters can be applied to the documents included at the top level and in any bucket.
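As a minimal sketch of how these pieces fit together (using made-up field names, in the same Groovy Map syntax as the search source example later in this post), a request that filters documents, buckets them by a term, and calculates a couple of metrics per bucket might look like this:
    // filter to available assets, bucket by facility, then sum and average a quantity per bucket
    sketchSourceMap = [
        query: [filtered: [filter: [term: [statusId: 'AssetAvailable']]]],
        aggregations: [
            facilities: [
                terms: [field: 'facilityId'],
                aggregations: [
                    quantitySum: [sum: [field: 'quantityOnHand']],
                    quantityAvg: [avg: [field: 'quantityOnHand']]
                ]
            ]
        ]
    ]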

For a more general introduction to aggregations, see the part covering them in the book "Elasticsearch: The Definitive Guide", which is available online here:

http://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations.html

For analytics in Moqui this provides an ideal tool for persistence of analytic data derived from operational data. The analytic data can be easily searched with results bucketed and metrics calculated as needed. Feeding data to this analytic data store is easy in Moqui by defining Data Documents that cover entity fields you want to report on, plus a real-time Data Feed to send the documents to Elasticsearch for indexing as records in the operational database are updated.

Real World Examples from Mantle Business Artifacts


For some real world examples of Data Documents useful for analytics see the MantleDocumentInventoryData.xml file in Mantle Business Artifacts:

https://github.com/moqui/mantle/blob/master/mantle-udm/data/MantleDocumentInventoryData.xml

This file has documents for inventory data, both current and projected, based on inventory asset records, sales order items, production estimates, and production run consume/produce estimates and actuals. Over time this will be extended to include shipments, purchase orders, and other inventory-relevant data structures. Most of these documents include date/time fields that can be used to limit projections to a certain period in the future, and statusId fields that can be used to exclude data not relevant for future projections. Documents from the past can be used for reporting on the history of inventory and statistics about incoming and outgoing items.

A big part of the efficiency in reporting comes from pulling together data from various parts of the operational database into a single document that is persisted and indexed in the analytic data store (in this case Elasticsearch). That is what these Data Document definitions do. They pull together the needed fields from various tables into a single JSON document that can be used for efficient filtered and bucketed searches with metrics calculated on the fly.

Note (if you want to run this stuff) that to actually create and index documents from these Data Document definitions you must uncomment the DataFeed defined at the bottom of this file, and do so before the data load (i.e. 'gradle load').

The service in Moqui to do analytics queries is org.moqui.impl.EntityServices.search#CountBySource, which is designed to take a "source" parameter with the details of the search to run in Elasticsearch. Here is an example of a source Map in Groovy syntax:
    // filter to the given products and to order parts that are still open
    andList = [[terms: [productId: productIdList]],
        [not: [terms: [partStatusId: ['OrderCompleted', 'OrderRejected', 'OrderCancelled']]]]]
    if (facilityId) andList.add([term: [facilityId: facilityId]])
    // include items with no requiredByDate or a requiredByDate on or before the estimate thru date
    if (estThruDateStr) andList.add([or: [[missing: [field: 'requiredByDate']], [range: [requiredByDate: [lte: estThruDateStr]]]]])
    searchSourceMap = [
        size: maxResults, query: [filtered: [filter: [and: andList]]],
        aggregations: [
            // bucket matching documents by product, then sum the quantity fields within each bucket
            products: [
                terms: [field: 'productId', size: maxResults],
                aggregations: [
                    orderQuantitySum: [sum: [field: 'orderQuantity']],
                    quantityReservedSum: [sum: [field: 'quantityReserved']],
                    quantityNotAvailableSum: [sum: [field: 'quantityNotAvailable']],
                    quantityNotIssuedSum: [sum: [field: 'quantityNotIssued']]
                ]
            ]
        ]
    ]
This is from the get#InventoryProjectedInfo service in the InventoryReportServices.xml file in Mantle:

https://github.com/moqui/mantle/blob/master/mantle-usl/service/mantle/product/InventoryReportServices.xml

This source Map in Groovy syntax follows the same structure as the JSON documents sent to Elasticsearch through its REST API. It is generally easier to use than the Java API and more closely matches the examples in the ES documentation.
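If you want to experiment with this outside of Moqui, the Map can be turned into the equivalent JSON request body using Groovy's built-in JSON support and posted to the _search endpoint (the index name here just reuses the one from the curl example further down):
    import groovy.json.JsonOutput
    // render the search source Map as the JSON body the Elasticsearch REST API expects
    String searchJson = JsonOutput.prettyPrint(JsonOutput.toJson(searchSourceMap))
    // e.g.: curl -XPOST 'http://localhost:9200/default__mantle_inventory/_search' -d "${searchJson}"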

This example is meant to get sums of various values for order items that are associated with order parts that are still open, and that may or may not be reserved. This is useful to see upcoming orders that will affect inventory at some point in the future.

Pitfalls and Quirks of Elasticsearch for Analytics


One potential issue with Elasticsearch for analytic use is the default result size of 10. When using aggregations with buckets, the general "size" parameter passed to the Java API or as a REST request parameter does NOT apply to the buckets. If you want more than 10 buckets in an aggregation you must specify a larger size in the bucket's "terms" Map. In the example above the terms Map is [field: 'productId', size: maxResults], using the maxResults field to allow for a higher value (it defaults to 1000 for the get#InventoryProjectedInfo service).
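In other words the top-level size only controls how many matching documents (hits) come back; if you only care about the aggregation results you can even set it to 0 and control the number of buckets separately in the terms Map, something like this (a sketch, not taken from the Mantle code):
    sketchSourceMap = [
        size: 0,    // no document hits needed, only aggregation results
        query: [filtered: [filter: [terms: [productId: productIdList]]]],
        aggregations: [products: [
            terms: [field: 'productId', size: 1000],    // allow up to 1000 buckets instead of the default 10
            aggregations: [orderQuantitySum: [sum: [field: 'orderQuantity']]]
        ]]
    ]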

Another big pitfall is the data type of document fields. If you try to do a sum, avg, etc. on a string type field you will get an error; you need a number type field instead. Here is a reference for the core data types available in Elasticsearch:

http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html

The type of each field is specified using a "mapping" for a document type. The most recent Moqui Framework code automatically checks whether an index exists and, if not, creates it and puts all the document type mappings. These mappings are generated based on the Data Document definition and the entity field type of each field included in it. If you want to see what the mappings look like you can get them from Elasticsearch, after the index has been used at least once (to index a document or to search), using a curl command like this:

curl -XGET 'http://localhost:9200/default__mantle_inventory/_mapping?pretty=true'

Note that there is a separate set of indexes for each tenant in Moqui; the tenant ID is prepended to the index name, separated by a double underscore. The entire extended index name is lower-cased because Elasticsearch only supports lower-case index names. This is why there is a "default__" prefix on the index name for the DEFAULT tenant in Moqui.

In addition to the numeric field types, another important thing Moqui does in the default document mappings is to set all ID fields (entity field types id and id-long) to index: not_analyzed. This tells Elasticsearch to index the literal ID value instead of analyzing it into search terms, which would also lower-case the value (that is how ES handles case-insensitive searches). Without this, fields like productId, partStatusId, facilityId, etc. would be lower-cased and possibly split into multiple terms if they contain certain characters, which seriously messes up both term/terms filters and the ID values that come back as bucket keys.
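To give a rough idea of what these generated mappings contain (this is a hand-written sketch, not copied from a Moqui-generated mapping; use the curl command above to see the real thing), a fragment covering a few fields might look like this, shown as a Groovy Map for consistency with the other examples in this post:
    // hypothetical mapping fragment: ID fields as not_analyzed strings, quantities and dates typed accordingly
    mappingSketch = [properties: [
        productId: [type: 'string', index: 'not_analyzed'],
        facilityId: [type: 'string', index: 'not_analyzed'],
        quantityOnHandTotal: [type: 'double'],
        requiredByDate: [type: 'date']
    ]]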

There are also some Data Document and Feed quirks in Moqui to be aware of. The main one is that for the real-time push feed (triggered by changes to entity records through the Entity Facade) there must be a reverse-relationship defined for each forward relationship in the field path in a data document field. For example consider this field path in the MantleInventoryOrderItem Data Document:

mantle.order.OrderPart:Vendor#mantle.party.PartyRole:roleTypeId

This uses the relationship "Vendor#mantle.party.PartyRole" from OrderPart to PartyRole. For the real-time push we need the reverse relationship, i.e. one from PartyRole back to OrderPart which can be seen in the "Vendor#mantle.order.OrderPart" relationship on the PartyRole entity.

Conclusion


For more complex analytics, especially analytics that involve data from many places in an operational database, with a variety of constraints, and with a variety of metrics applied, it is common to have a separate data store that is structured for analytics and for the particular ways the data will be used.

Star schemas in relational databases are a nice approach to this, but are still limited by the flat relational structure. With hierarchical JSON documents you can include "dimensions" in "fact" documents for more efficient querying/searching while still remaining flexible. You can still have separate documents that are "joined" in a search, but because documents in the database are only loosely typed and structured, and because they are hierarchical, you are not limited to this.

Relational SQL databases are also generally more difficult to scale for the large amounts of data involved when indexing a large number of fields across very large numbers of records.

These are some of the reasons that NoSQL and more generally non-relational databases are becoming more popular for analytics. When not so concerned with data consistency and referential integrity, and more generally the ACID properties, sharding and otherwise distributing data across various physical machines is MUCH easier. This is great for reporting in real time on large data sets as the data can be processed on each server and then the aggregated results combined from the various servers to get the final results.

Elasticsearch is a nice tool for this. It features much of the scalability and flexibility of document databases like MongoDB but is WAY easier to deploy in Java applications and usable in so many ways. ES can be run embedded in a Java application, as it is by default in Moqui Framework, or it can be distributed across hundreds of servers coordinating in a cluster with automatic discovery, sharding, and so on. Elasticsearch also does a pretty good job of caching frequent searches, and has all sorts of tools and configuration options for tuning searches, aggregations, and caching.

Elasticsearch also goes beyond what most document databases provide in terms of the variety of aggregations it can run on the document data. It provides, in one package, what is otherwise only available through separate analytics tools built on top of the various popular document databases.

There is also an ad-hoc and prepared reporting tool for Elasticsearch called Kibana. It is already pretty impressive and is still being improved significantly. Unfortunately, it isn't all that easy to embed (for deployment) in a Java application but it is pretty easy to run on its own. For details see:

https://www.elastic.co/products/kibana

The Data Document and Data Feed features in Moqui Framework were designed for deriving data from the relational database to generate JSON documents. Initially the intent was to use them for searching and triggering things like sending email, but the general concept applies very well as an easy way to configure what will be pushed from the operational database to an analytic data store.

In short this combination of tools, already available in Moqui Framework (currently unreleased but available in the GitHub repositories), provides an excellent foundation for flexible, high-scale reporting and analytics. It is easy to transform and feed data from the operational database to the analytic data store, and a wide variety of operations can be performed with high efficiency on this data.

How cool is this? With the combination of Moqui Framework and Elasticsearch you can put together complex reports that operate on huge volumes of data with very little effort.