METRON-1801 Allow Customization of Elasticsearch Document ID #1218
…ded to fix-up the integration tests due to this.
… is the most performant option
Thanks, Nick. So if es.document.id is not provided, as the default, the doc ID won't be sent to ES when indexing, right? I guess it would also be nice to provide some guidance on how the document ID should be defined (in the case of a custom ID). Otherwise, users may create some serious issues with indexing and search throughput.
Yes, exactly.
I am just providing the capability for advanced users to define their own doc ID, primarily based on your feedback in METRON-1677. (It also provides a nice way to support backwards compatibility, which is the main reason that I took this approach.) If you have any advice to offer, feel free to offer it and we can include it in the docs. Other than that, I am not sure what I can do besides add a big, bold warning to the docs that says create your own doc ID at your own risk.
@MRaliagha I updated the README to (hopefully) better explain your options in using |
Map<String, Object> source = results.get(i).getSource();
Assert.assertNotNull(source);
Assert.assertNotNull(source.get(Constants.Fields.SRC_ADDR.getName()));
Assert.assertNotNull(source.get(Constants.GUID));
Elasticsearch must now always return the GUID to populate the UI. We cannot rely on the document ID being the same as the Metron GUID.
for (int i = 0; i < 10; ++i) {
Map<String, Object> source = results.get(i).getSource();
Assert.assertNotNull(source);
Assert.assertNotNull(source.get(Constants.Fields.SRC_ADDR.getName()));
Solr does not return the Metron GUID, unlike Elasticsearch.
...icsearch/src/main/java/org/apache/metron/elasticsearch/writer/ElasticsearchWriterConfig.java
@nickwallen in the case of event logs, and given that retrieval segmentation would be mostly based on timestamp, it is recommended to use the timestamp as a prefix of the ID. For example, something like timestamp+hash(original_string).
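For illustration, the timestamp+hash(original_string) scheme suggested above could be sketched as follows. This is only a sketch under stated assumptions: the class name, the choice of SHA-256, the zero-padded epoch-millisecond prefix, and the "-" separator are all hypothetical and are not part of Metron or this pull request.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; not part of Metron. */
final class TimestampPrefixedId {

    /**
     * Builds an ID of the form "<13-digit epoch millis>-<SHA-256 hex of originalString>".
     * The zero padding is an assumption made so that IDs sort by time as plain strings.
     */
    static String of(long timestampMillis, String originalString) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(originalString.getBytes(StandardCharsets.UTF_8));
            StringBuilder id = new StringBuilder(String.format("%013d-", timestampMillis));
            for (byte b : hash) {
                // Render each byte as two lowercase hex digits.
                id.append(String.format("%02x", b));
            }
            return id.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(of(1537300000000L, "example raw log line"));
    }
}
```

Note that the same timestamp and original string always produce the same ID, so indexing such a message twice would overwrite the earlier document rather than create a second one.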
@MRaliagha, that's a good suggestion. I believe we can functionally achieve that by creating a custom id field in the format you suggest (with a Stellar field transform) and setting that field to be the ES id with the Ambari property exposed in this PR. Do you feel it's worth documenting as an optimization? I spun this up in full dev and ran through all the testing instructions. Everything worked as advertised. I think there are just a couple of open questions, but this is pretty close in my opinion.
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
Why were these dependencies removed? Just curious.
They cause the tests to fail when executed in an IDE like IntelliJ. I don't understand exactly why, but @justinleet pointed me in this direction.
Also, everything runs just fine without them, so they are unnecessary. The fewer dependencies, the better.
I have also noticed this and I end up commenting these out every time I run a test in my IDE. Thanks for investigating and removing them.
Looks good to me. +1
Yes, I think it is worth documenting, as people can easily create serious issues with Lucene-based indexers by messing with the ID. It can give users an understanding of where it is safe to play with the ID and what the recommendations are. I will see if I can find any articles to share as part of the manual.
This change was reverted here. A new pull request will be opened with the functionality. See also this mailing list thread.
…(nickwallen) closes apache#1218" This reverts commit 90c5e1d.
Currently, the Metron GUID is always used as the Elasticsearch document ID. As documented in METRON-1677, using a randomized UUID like Java's UUID.randomUUID() can negatively impact Elasticsearch performance. This change allows a user to customize the identifier that is used by Elasticsearch when indexing documents.
A property es.document.id was added that sets the message field that is used to define the document ID when a message is indexed by Elasticsearch.
To allow Elasticsearch to define its own document ID, this property should be set to a blank or empty string. The client will not set the document ID and Elasticsearch will define its own.
In most cases allowing Elasticsearch to define the document ID is the most performant option. This is the default behavior.
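The behavior of es.document.id described in this pull request (blank property, field present, field missing) can be sketched roughly as follows. This is a minimal illustration, not the actual ElasticsearchWriter code; the class name, method name, and log message are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.logging.Logger;

/** Hypothetical illustration; the real ElasticsearchWriter may differ. */
final class DocumentIdResolver {
    private static final Logger LOG = Logger.getLogger(DocumentIdResolver.class.getName());

    /**
     * @param documentIdField value of es.document.id (may be null or blank)
     * @param message         the message about to be indexed
     * @return the ID the client should set, or empty to let Elasticsearch generate one
     */
    static Optional<String> resolve(String documentIdField, Map<String, Object> message) {
        // Blank or empty property: the client sets no ID at all.
        if (documentIdField == null || documentIdField.trim().isEmpty()) {
            return Optional.empty();
        }
        Object value = message.get(documentIdField);
        if (value == null) {
            // The configured field is missing from the message: warn and set no ID.
            LOG.warning("Message has no field '" + documentIdField + "'; no document ID set");
            return Optional.empty();
        }
        return Optional.of(String.valueOf(value));
    }

    public static void main(String[] args) {
        Map<String, Object> message = new HashMap<>();
        message.put("guid", "example-guid");
        System.out.println(resolve("", message));               // blank property
        System.out.println(resolve("guid", message));           // legacy behavior
        System.out.println(resolve("my_document_id", message)); // field missing
    }
}
```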
Metron versions 0.6.0 and earlier defined the document ID using the Metron GUID, which is a randomized UUID using Java's UUID.randomUUID(). Using a randomized UUID can negatively impact Elasticsearch indexing performance. To maintain backwards compatibility with legacy versions of Metron, set es.document.id to guid.
To use a custom document ID, create an enrichment that defines a new message field; for example, one called my_document_id. Then set es.document.id to my_document_id. This will set the document ID to the value of the message field my_document_id.
If a message does not contain the field named by es.document.id, a warning is issued and no document ID is set by the client.
Changes
The ElasticsearchWriter was updated to allow the document ID to be configurable.
A 'search by GUID' in the REST layer was implicitly using the document ID, whereas it should be using the Metron GUID.
Search results should use the Metron GUID as the ID returned to the UI. All IDs visible to the user should always be the Metron GUID, not the document ID.
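To make the distinction between the two lookups concrete, here is a small sketch that builds the two Elasticsearch query bodies as JSON strings. It assumes the guid field is mapped so that an exact-match term query works (for example, as a keyword); the class and method names are hypothetical, and this is not necessarily how the REST layer implements the search.

```java
/** Hypothetical helper that builds Elasticsearch query DSL bodies; not Metron code. */
final class GuidQueries {

    /** Term query on the guid field stored in the document source. */
    static String byMetronGuid(String guid) {
        return "{\"query\":{\"term\":{\"guid\":\"" + escape(guid) + "\"}}}";
    }

    /** Ids query on the document _id; matches the GUID search only when _id equals the GUID. */
    static String byDocumentId(String documentId) {
        return "{\"query\":{\"ids\":{\"values\":[\"" + escape(documentId) + "\"]}}}";
    }

    /** Minimal JSON string escaping for backslashes and double quotes. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(byMetronGuid("example-guid"));
        System.out.println(byDocumentId("example-doc-id"));
    }
}
```

Once Elasticsearch generates its own _id, only the first form finds a document by its Metron GUID.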
The MPack was updated to allow the user to define the es.document.id on the Metron > Configs > Index Settings tab.
The default behavior was changed to allow Elasticsearch to set the document ID. This is the most performant option in most cases. I updated the Upgrading.md doc to describe how to revert to the legacy behavior.
Testing
Spin up a development environment. You may need to stop the PCAP and/or Profiler topology to free up slots to allow indexing to occur.
Open the Alerts UI and ensure that alerts are visible. Notice that the ID listed in the table has not changed. This will always display the Metron GUID, no matter what ID is used for the document.
Click on a GUID in the table to search for a single alert.
Create a meta-alert and ensure that alerts tied to the meta-alert are still discoverable by GUID.
Open Ambari and go to Metron > Configs > Index Settings. Notice that the default setting of es.document.id is blank, thus allowing Elasticsearch to define its own document ID.
Open Kibana and verify that Elasticsearch is indeed generating its own document IDs. You will notice an _id field which has been generated by Elasticsearch. This will be different from the UUID generated by Metron and stored as part of the document as guid.
Stop the indexing topologies using Ambari.
Login to the VM.
Delete the existing indices in Elasticsearch.
Open Ambari and go to Metron > Configs > Index Settings. Edit the "Elasticsearch Document ID Source Field" and set it to 'guid'. This will restore the legacy behavior where the document ID is set to the Metron GUID.
Restart the Indexing Topology in Ambari.
Open the Alerts UI and ensure that alerts are visible. Notice that the ID listed in the table has not changed. This will always display the Metron GUID, no matter what ID is used for the document.
Click on a GUID in the table to search for a single alert.
Create a meta-alert and ensure that alerts tied to the meta-alert are still discoverable by GUID.
Open Kibana and verify that the document ID matches the Metron GUID.
I would also advise running the UI e2e test suite with this change.
Pull Request Checklist