METRON-1801 Allow Customization of Elasticsearch Document ID #1218

nickwallen · 2018-10-02T21:50:47Z

Currently, the Metron GUID is always used as the Elasticsearch document ID. As documented in METRON-1677, using a randomized UUID like Java's UUID.randomUUID() can negatively impact Elasticsearch performance. This change allows a user to customize the identifier that is used by Elasticsearch when indexing documents.

A property es.document.id was added that sets the message field that is used to define the document ID when a message is indexed by Elasticsearch.

To allow Elasticsearch to define its own document id, this property should be set to a blank or empty string. The client will not set the document ID and Elasticsearch will define its own.
In most cases allowing Elasticsearch to define the document ID is the most performant option. This is the default behavior.
Metron versions 0.6.0 and earlier defined the document ID using the Metron GUID, which is a randomized UUID generated by Java's UUID.randomUUID(). Using a randomized UUID can negatively impact Elasticsearch indexing performance. To maintain backwards compatibility with legacy versions of Metron, use the following setting.
```
es.document.id = guid
```
To use a custom document ID, create an enrichment that defines a new message field, for example one called my_document_id. Then set es.document.id to the name of that field, as follows. The document ID will then be the value of the message field my_document_id.
```
es.document.id = my_document_id
```
If a message does not contain the es.document.id field, a warning is issued and no document ID is set by the client.
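
The three cases above (blank property, field present, field missing) can be sketched roughly as follows. This is only an illustration of the described behavior, not the actual ElasticsearchWriter code; the class name DocumentIdResolver and its resolve method are hypothetical, and the warning is written to stderr here instead of going through a real logger.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper that mirrors the es.document.id rules described above. */
final class DocumentIdResolver {

  private DocumentIdResolver() {}

  /**
   * @param documentIdField the value of es.document.id; null or blank means
   *                        "let Elasticsearch generate the document ID"
   * @param message         the fields of the message being indexed
   * @return the document ID the client should send, or empty if none
   */
  static Optional<String> resolve(String documentIdField, Map<String, Object> message) {
    // Blank or missing property: the client sets no ID (the default).
    if (documentIdField == null || documentIdField.trim().isEmpty()) {
      return Optional.empty();
    }
    String field = documentIdField.trim();
    Object value = message.get(field);
    if (value == null) {
      // Field configured but absent from this message: warn and send no ID.
      System.err.println("WARN: message has no field '" + field + "'; no document ID set");
      return Optional.empty();
    }
    return Optional.of(String.valueOf(value));
  }
}
```

With a message containing guid and my_document_id, resolve("guid", message) yields the Metron GUID (the legacy behavior), resolve("my_document_id", message) yields the custom value, and resolve("", message) yields an empty result so that Elasticsearch assigns its own ID.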

Changes

The ElasticsearchWriter was updated to allow the document ID to be configurable.
A 'search by GUID' in the REST layer was implicitly using the document ID, whereas it should be using the Metron GUID.
Search results should use the Metron GUID as the ID returned to the UI. All IDs visible to the user should always be the Metron GUID, not the document ID.
The MPack was updated to allow the user to define the es.document.id on the Metron > Config > Index Settings tab.
The default behavior was changed to allow Elasticsearch to set the document ID. This is the most performant option in most cases. I updated the Upgrading.md doc to describe how to revert to the legacy behavior.
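
As a rough sketch of the second and third changes: the ID handed back to the UI is read from the guid field of the document source and never from the Elasticsearch document ID. The class and method names below are hypothetical and are not the ones used in this PR, and treating a missing GUID as an error is an assumption made for the example.

```java
import java.util.Map;

/** Hypothetical illustration: user-visible IDs come from the Metron GUID. */
final class SearchResultIds {

  /** Name of the message field holding the Metron GUID. */
  static final String GUID_FIELD = "guid";

  private SearchResultIds() {}

  /**
   * @param documentId the Elasticsearch document ID (_id); used only in the error message
   * @param source     the document source returned by the search
   * @return the Metron GUID to expose to the UI
   */
  static String resultId(String documentId, Map<String, Object> source) {
    Object guid = source == null ? null : source.get(GUID_FIELD);
    if (guid == null) {
      // Assumption for this sketch: a document without a GUID is an error.
      throw new IllegalStateException(
          "Document " + documentId + " has no '" + GUID_FIELD + "' field");
    }
    return guid.toString();
  }
}
```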

Testing

Spin up a development environment. You may need to stop the PCAP and/or Profiler topology to free up slots to allow indexing to occur.
```
cd metron-deployment/development/centos6
vagrant up
```
Open the Alerts UI and ensure that alerts are visible. Notice that the ID listed in the table has not changed. This will always display the Metron GUID, no matter what ID is used for the document.
Click on a GUID in the table to search for a single alert.
Create a meta-alert and ensure that alerts tied to the meta-alert are still discoverable by GUID.
Open Ambari and go to Metron > Configs > Index Settings. Notice that the default setting of es.document.id is blank, thus allowing Elasticsearch to define its own document ID.
Open Kibana and verify that Elasticsearch has indeed generated its own document IDs. You will notice an _id field which has been generated by Elasticsearch. This will be different from the UUID generated by Metron and stored as part of the document as guid.
Stop the indexing topologies using Ambari.
Log in to the VM.
```
vagrant ssh
sudo su -
```

Delete the existing indices in Elasticsearch.

curl -XDELETE http://node1:9200/bro*
curl -XDELETE http://node1:9200/snort*

Open Ambari, and go to Metron > Configs > Index Settings. Edit the "Elasticsearch Document ID Source Field" and set it to 'guid'. This will restore the legacy behavior where the document ID is set to the Metron GUID.
Restart the Indexing Topology in Ambari.
Open the Alerts UI and ensure that alerts are visible. Notice that the ID listed in the table has not changed. This will always display the Metron GUID, no matter what ID is used for the document.
Click on a GUID in the table to search for a single alert.
Create a meta-alert and ensure that alerts tied to the meta-alert are still discoverable by GUID.
Open Kibana and verify that the document ID matches the Metron GUID.
I would also advise running the UI e2e test suite with this change.

Pull Request Checklist

Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

alinazemian · 2018-10-08T05:55:15Z

Thanks, Nick. So if es.document.id is not provided, as the default, the doc ID won't be sent to ES when indexing, right? I guess it would also be nice to provide some guidance on how the document ID should be defined (in the case of a custom ID). Otherwise, users may create some serious issues with indexing and search throughput.

nickwallen · 2018-10-08T13:54:54Z

@MRaliagha: So if es.document.id is not provided, as the default, the doc ID won't be sent to ES when indexing, right?

Yes, exactly.

@MRaliagha: I guess it would also be nice to provide some guidance on how the document ID should be defined (in the case of a custom ID). Otherwise, users may create some serious issues with indexing and search throughput.

I am just providing the capability for advanced users to define their own doc ID, primarily based on your feedback in METRON-1677. (It also provides a nice way to support backwards compatibility, which is the main reason that I took this approach.)

If you have any advice to offer, feel free to offer it and we can include it in the docs. Other than that, I am not sure what I can do besides add a big, bold warning to the docs that says create your own doc ID at your own risk.

nickwallen · 2018-10-08T22:19:21Z

@MRaliagha I updated the README to (hopefully) better explain your options in using es.document.id. I sensed by your question that what I had originally was not very clear.

nickwallen · 2018-10-08T22:25:20Z

...est/java/org/apache/metron/elasticsearch/integration/ElasticsearchSearchIntegrationTest.java

+      Map<String, Object> source = results.get(i).getSource();
+      Assert.assertNotNull(source);
+      Assert.assertNotNull(source.get(Constants.Fields.SRC_ADDR.getName()));
+      Assert.assertNotNull(source.get(Constants.GUID));


Elasticsearch must now always return the GUID to populate the UI. We cannot rely on the document ID being the same as the Metron GUID.

nickwallen · 2018-10-08T22:25:48Z

.../metron-solr/src/test/java/org/apache/metron/solr/integration/SolrSearchIntegrationTest.java

+    for (int i = 0; i < 10; ++i) {
+      Map<String, Object> source = results.get(i).getSource();
+      Assert.assertNotNull(source);
+      Assert.assertNotNull(source.get(Constants.Fields.SRC_ADDR.getName()));


Unlike Elasticsearch, Solr does not return the Metron GUID.

alinazemian · 2018-10-09T03:39:14Z

@nickwallen In the case of event logs, where retrieval segmentation is mostly based on timestamp, it is recommended to use the timestamp as a prefix of the ID. For example, something like timestamp+hash(original_string).
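
A minimal sketch of the timestamp+hash(original_string) idea, under several assumptions that are not part of this PR: the class name TimePrefixedId is hypothetical, the timestamp is epoch milliseconds zero-padded to 13 digits so that IDs sort in time order, the hash is SHA-256, and only the first 8 bytes of the hash are kept to keep the ID short (which trades some collision resistance for size).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that builds a time-prefixed document ID. */
final class TimePrefixedId {

  private TimePrefixedId() {}

  /**
   * @param timestampMillis the message timestamp in epoch milliseconds (non-negative)
   * @param originalString  the raw message text
   * @return an ID of the form <13-digit timestamp>-<16 hex characters>
   */
  static String create(long timestampMillis, String originalString) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(originalString.getBytes(StandardCharsets.UTF_8));

      // Zero-pad the timestamp so lexicographic order matches time order.
      StringBuilder id = new StringBuilder(String.format("%013d", timestampMillis));
      id.append('-');

      // Append the first 8 bytes of the hash as lowercase hex.
      for (int i = 0; i < 8; i++) {
        id.append(String.format("%02x", hash[i] & 0xff));
      }
      return id.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException("SHA-256 is not available", e);
    }
  }
}
```

Note that two messages with the same timestamp and the same original_string produce the same ID under this scheme, so the second would overwrite the first when indexed. Whether that is acceptable depends on the data source.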

merrimanr · 2018-10-11T17:39:43Z

@MRaliagha, that's a good suggestion. I believe we can functionally achieve that by creating a custom ID field in the format you suggest (with a Stellar field transform) and setting that field to be the ES ID with the Ambari property exposed in this PR. Do you feel it's worth documenting as an optimization?

I spun this up in full dev and ran through all the testing instructions. Everything worked as advertised. I think there are just a couple open questions but this is pretty close in my opinion.

merrimanr · 2018-10-11T17:44:46Z

metron-platform/metron-elasticsearch/pom.xml

            <type>test-jar</type>
            <scope>test</scope>
        </dependency>
-        <dependency>


Why were these dependencies removed? Just curious.

They cause the tests to fail when executed in an IDE like IntelliJ. I don't understand exactly why, but @justinleet pointed me in this direction.

Also, everything runs just fine without them, so they are unnecessary. The fewer dependencies, the better.

I have also noticed this and I end up commenting these out every time I run a test in my IDE. Thanks for investigating and removing them.

merrimanr · 2018-10-11T18:06:55Z

Looks good to me. +1

alinazemian · 2018-10-11T23:47:31Z

@MRaliagha, that's a good suggestion. I believe we can functionally achieve that by creating a custom ID field in the format you suggest (with a Stellar field transform) and setting that field to be the ES ID with the Ambari property exposed in this PR. Do you feel it's worth documenting as an optimization?

Yes, I think it is worth documenting, as people can easily create serious issues with Lucene-based indexers by messing with the ID. It can give users an understanding of where it is safe to play with the ID and what the recommendations are. I will see if I can find any articles to share as part of the manual.

nickwallen · 2018-10-24T12:42:40Z

This change was reverted here. A new pull request will be opened with the functionality. See also this mailing list thread.

nickwallen added 11 commits October 1, 2018 18:39

Can change source field used for document ID. Unable to /findOne in Alerts UI

23a96cc

Cannot assume that ES doc ID == Metron GUID

2d0f478

Removed unnecessary dependencies

d66c839

Search results need to use Metron GUID as ID, not the doc ID

36c5921

Small rename

4ebb800

Removed unncessary part of error msg

13d698d

Elasticsearch always returns a GUID, while Solr does not for now. Needed to fix-up the integration tests due to this.

5711c89

Added Mpack support

1766cb3

Added simple gitignore

6fa5b5b

Added documentation for es.document.id

dbb83ad

Changed default behavior to use Elasticsearch generated doc ID, which is the most performant option

af36ebf

nickwallen added 2 commits October 8, 2018 18:16

Improve description of es.document.id

6b79429

Better define default

5a793ce

Removed the ElasticsearchWriterConfig class to maintain consistency

f3fe193

No need to fix in this PR. Others have already fixed this elsewhere in master

d98a81d

asfgit closed this in 90c5e1d Oct 11, 2018

nickwallen mentioned this pull request Oct 11, 2018

METRON-1823 Refactor Elasticsearch Configuration Settings #1235

Closed

9 tasks

mmiklavc mentioned this pull request Oct 19, 2018

METRON-1834: Migrate Elasticsearch from TransportClient to new Java REST API #1242

Closed

10 tasks

asfgit pushed a commit that referenced this pull request Oct 23, 2018

Revert "METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes #1218" This reverts commit 90c5e1d.

0e037ed

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes apache#1218

73b1405

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

Revert "METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes apache#1218" This reverts commit 90c5e1d.

2b09e93

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes apache#1218

ba35f07

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

Revert "METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes apache#1218" This reverts commit 90c5e1d.

2137834

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes apache#1218

33bbbb1

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

Revert "METRON-1801 Allow Customization of Elasticsearch Document ID (nickwallen) closes apache#1218" This reverts commit 90c5e1d.

df86c54

nickwallen mentioned this pull request Nov 14, 2018

METRON-1871 Cannot Run Elasticsearch Integration Tests in IDE #1262

Closed

9 tasks
