Friday, June 24, 2016

Simple Indexing Guide for WSO2 Data Analytics Server

Interactive search functionality in WSO2 Data Analytics Server is powered by Apache Lucene[1],  a powerful, high performing, full text search engine!.

Lucene index data were kept in a database in the first version of the Data Analytics Server[2] (3.0.0). But from DAS 3.0.1 onwards, Lucene indexes are maintained in the local filesystem.
Data Analytics Server(DAS) has a seperate server profile which enables it to perform as a dedicated indexing node. 

When a DAS node is started with indexing (disableIndexing=false), there are a quite a few things going on. But first let’s get the meta-information out of the way.

Configuration files locations:

Local shard config: <DAS-HOME>/repository/conf/analytics/local-shard-allocation-config.conf
Analytics config: <DAS-HOME>/repository/conf/analytics/analytics-config.xml

What is a Shard?

All the indexing data are partitioned across in partitions called shards ( default is 6). The partitioned data can be observed by browsing in <DAS-HOME>/repository/data/index_data/
Each and every record belongs to only one shard.

Digging in ...


In standalone mode, DAS behaves as a powerful indexer, but it truly shines as a powerhouse when it’s clustered.

A cluster of DAS indexer nodes behaves as a one giant and powerful distributed search engine. Search queries are distributed across all the nodes resulting in lightning fast result retrieval. Not only that, since indexes are distributed among the nodes, indexing data in the cluster is unbelievably fast. 

In the clustering mode, the aforementioned shards are distributed across the indexing nodes ( nodes in which indexing is enabled). For example for a cluster including 3 indexing nodes with 6 shards, each indexing node would be assigned two shards each (unless replication is enabled, more on that later).

The shard allocation is DYNAMIC!. If a new node starts up as an indexing node, the shard allocations would change to allocate some of the shards to the newly spawned indexing node. This is controlled using a global shard allocation mechanism.

Replication factor:
Replication factor can be configured in the analytics config file as indexReplicationFactor, which would decide how many replicas of each record ( or shard ) would be kept. Default is 1.

Manual Shard configuration:

There are 3 modes when configuring the local shard allocations of a node. They are,
  2. INIT

NORMAL means, the data for that particular shard would already be residing in that node.
For example, upon starting up a single indexing node, the shard configuration would look like follows,

This means, all the shards (0 through 5) are indexed successfully in that indexer node.

INIT allows you to tell the indexing node to index that particular shard.
If you restart the server after adding a shard as INIT,  that shard would be re-indexed in that node.
Ex: if the shard allocations are


and we add the line 4, INIT and restart the server,


This would index the data for the 4th Shard for that indexing node and you can see that it will be returned to the NORMAL state after that once indexing the 4th shard is done.

RESTORE provides you the opportunity to copy the indexed data from a different node/backup to another node and use that indexing data. This avoids re-indexing and reuses the already available indexing data. After successful restoring, that node is able to support queries on that corresponding shard as well.

Ex: for the same shard allocation above, if we copy the indexed data for the 5th Shard and add the line,
and restart the server, the node would then allocate the 5th shard to that node ( which would be used to search and index)

After restoring, that node would also index the incoming data for that shard as well.


How do I remove a node as an indexing node?

If you want to remove a node as an indexing node, then you have to restart that particular node with indexing disabled ( disableIndexing=true) and it will trigger the global shard allocations to be re-allocated, removing that node from the indexing cluster.

Then you have the option of restarting the other indexing nodes which will automatically distribute the shards among the available nodes again. 

Or you can manually assign the shards. Then you need to make sure to index the data for the shards that node took in other indexing node as you require.
You can view the shard list for that particular node in the local indexing shard config (refer above).

Then you can either use INIT or RESTORE methods to index those shards in other nodes (refer to the Manual Shard configuration section).

You must restart all the indexing servers for it to get the indexing updates after the indexing node has left.

Mistakenly started up an indexing node?

If you have mistakenly started up an indexing server, it will change the global configurations and has to be undone manually.
If the replication factor is equal to or greater than 1, you would still be able to query and get all the data even this node is down.

First of all, to remove a node as an indexing node, you need to delete the indexing data (optional).
Indexing data resides at: <DAS-HOME>/repository/data/indexing_data/

Second if you want to use the node in another server profile (more info on profiles: [3]), restart the server with the required profile.

Then follow the steps in the above question on "How do I remove a node as an indexing node?"

How do I know all my data are indexed?

The simplest way to do this would be to use the Data Explorer in DAS, refer to [4] for more information.

There, for a selected table, run the query “*:*” and it should return all the results with the total count.

For more information or queries please do drop by the mailing lists[4].

No comments:

Post a Comment