This repository was archived by the owner on Aug 20, 2025. It is now read-only.

@nickwallen
Contributor

@nickwallen nickwallen commented Oct 2, 2018

Currently, the Metron GUID is always used as the Elasticsearch document ID. As documented in METRON-1677, using a randomized UUID like Java's UUID.randomUUID() can negatively impact Elasticsearch performance. This change allows a user to customize the identifier that is used by Elasticsearch when indexing documents.

A new property, es.document.id, names the message field whose value is used as the document ID when a message is indexed by Elasticsearch.

  • To allow Elasticsearch to define its own document ID, this property should be set to a blank or empty string. The client will not set the document ID and Elasticsearch will define its own.

  • In most cases allowing Elasticsearch to define the document ID is the most performant option. This is the default behavior.

  • Metron versions 0.6.0 and earlier defined the document ID using the Metron GUID, which is a randomized UUID generated by Java's UUID.randomUUID(). Using a randomized UUID can negatively impact Elasticsearch indexing performance. To maintain backwards compatibility with legacy versions of Metron, use the following setting.

    es.document.id = guid
    
  • To use a custom document ID, create an enrichment that defines a new message field; for example, one called my_document_id. Then use this field to set the document ID as follows. This will set the document ID to the value of the message field my_document_id.

    es.document.id = my_document_id
    
  • If a message does not contain the field named by es.document.id, a warning is issued and no document ID is set by the client.
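
The rules above can be sketched in a few lines of Java. This is only an illustration of the described behavior, not the actual ElasticsearchWriter code: the class name DocumentIdResolver and the method resolve are hypothetical, and System.err stands in for whatever logger the writer really uses.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper illustrating the es.document.id rules; not Metron's code. */
final class DocumentIdResolver {

  private DocumentIdResolver() {}

  /**
   * @param message         the message being indexed, as field name to value
   * @param documentIdField the configured value of es.document.id (may be null or blank)
   * @return the document ID the client should send, or empty if Elasticsearch
   *         should generate its own
   */
  static Optional<String> resolve(Map<String, Object> message, String documentIdField) {
    // Blank or unset property: the client sets no ID (the default behavior).
    if (documentIdField == null || documentIdField.trim().isEmpty()) {
      return Optional.empty();
    }
    Object value = message.get(documentIdField);
    if (value == null) {
      // The configured field is missing from this message: warn and set no ID.
      System.err.println("WARN: message has no field '" + documentIdField
          + "'; no document ID set");
      return Optional.empty();
    }
    return Optional.of(String.valueOf(value));
  }
}
```

With es.document.id = guid, the call resolve(message, "guid") would yield the Metron GUID, which corresponds to the legacy behavior.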

Changes

  • The ElasticsearchWriter was updated to allow the document ID to be configurable.

  • A 'search by GUID' in the REST layer was implicitly using the document ID, whereas it should be using the Metron GUID.

  • Search results should use the Metron GUID as the ID returned to the UI. All IDs visible to the user should always be the Metron GUID, not the document ID.

  • The MPack was updated to allow the user to define the es.document.id on the Metron > Configs > Index Settings tab.

  • The default behavior was changed to allow Elasticsearch to set the document ID. This is the most performant option in most cases. I updated the Upgrading.md doc to describe how to revert to the legacy behavior.
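
The distinction between the document ID and the Metron GUID can be illustrated with a small in-memory sketch. It assumes an index modeled as a map from document ID to document source, with the GUID stored in the source under the field guid; the class GuidLookup is hypothetical and does not use the Elasticsearch client or the Metron REST code.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory illustration; not the Metron REST implementation. */
final class GuidLookup {

  private GuidLookup() {}

  /**
   * Finds a document by the Metron GUID stored in its source.
   * The map key (the document ID) is deliberately ignored, because it may have
   * been generated by Elasticsearch and need not equal the GUID.
   */
  static Optional<Map<String, Object>> findByGuid(
      Map<String, Map<String, Object>> index, String guid) {
    return index.values().stream()
        .filter(source -> guid.equals(source.get("guid")))
        .findFirst();
  }
}
```

A lookup keyed on the document ID would only find the alert when es.document.id = guid; matching on the guid field works regardless of how the document ID was chosen.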

Testing

  1. Spin up a development environment. You may need to stop the PCAP and/or Profiler topology to free up slots to allow indexing to occur.

    cd metron-deployment/development/centos6
    vagrant up
    
  2. Open the Alerts UI and ensure that alerts are visible. Notice that the ID listed in the table has not changed. This will always display the Metron GUID, no matter what ID is used for the document.

    screen shot 2018-10-02 at 5 43 54 pm

  3. Click on a GUID in the table to search for a single alert.

    screen shot 2018-10-02 at 12 42 54 pm

  4. Create a meta-alert and ensure that alerts tied to the meta-alert are still discoverable by GUID.

    screen shot 2018-10-02 at 5 45 18 pm

  5. Open Ambari and go to Metron > Configs > Index Settings. Notice that the default setting of es.document.id is blank, thus allowing Elasticsearch to define its own document ID.

    screen shot 2018-10-03 at 9 13 18 am

  6. Open Kibana and verify that Elasticsearch is indeed generating its own document IDs. You will notice an _id field which has been generated by Elasticsearch. This will be different from the UUID generated by Metron and stored as part of the document as guid.

    screen shot 2018-10-02 at 4 47 52 pm

  7. Stop the indexing topologies using Ambari.

  8. Log in to the VM.

    vagrant ssh
    sudo su -
    
  9. Delete the existing indices in Elasticsearch.

    curl -XDELETE http://node1:9200/bro*
    curl -XDELETE http://node1:9200/snort*
    
  10. Open Ambari and go to Metron > Configs > Index Settings. Edit the "Elasticsearch Document ID Source Field" and set it to 'guid'. This will restore the legacy behavior where the document ID is set to the Metron GUID.

  11. Restart the Indexing Topology in Ambari.

  12. Open the Alerts UI and ensure that alerts are visible. Notice that the ID listed in the table has not changed. This will always display the Metron GUID, no matter what ID is used for the document.

  13. Click on a GUID in the table to search for a single alert.

    screen shot 2018-10-02 at 12 42 54 pm

  14. Create a meta-alert and ensure that alerts tied to the meta-alert are still discoverable by GUID.

    screen shot 2018-10-02 at 5 45 18 pm

  15. Open Kibana and verify that the document ID matches the Metron GUID.

    screen shot 2018-10-03 at 5 58 49 pm

  16. I would also advise running the UI e2e test suite with this change.

Pull Request Checklist

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
  • Have you written or updated unit tests and/or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

@alinazemian

Thanks, Nick. So if es.document.id is not provided, as the default, the doc ID won't be sent to ES when indexing, right? I guess it would also be nice to provide some guidance on how the document ID should be defined (in the case of a custom ID). Otherwise, users may create some serious issues with indexing and search throughput.

@nickwallen
Contributor Author

@MRaliagha: So if es.document.id is not provided, as the default, the doc ID won't be sent to ES when indexing, right?

Yes, exactly.

@MRaliagha: I guess it would also be nice to provide some guidance on how the document ID should be defined (in the case of a custom ID). Otherwise, users may create some serious issues with indexing and search throughput.

I am just providing the capability for advanced users to define their own doc ID, primarily based on your feedback in METRON-1677. (It also provides a nice way to support backwards compatibility, which is the main reason that I took this approach.)

If you have any advice to offer, feel free to offer it and we can include it in the docs. Other than that, I am not sure what I can do besides add a big, bold warning to the docs that says create your own doc ID at your own risk.

@nickwallen
Contributor Author

@MRaliagha I updated the README to (hopefully) better explain your options in using es.document.id. I sensed by your question that what I had originally was not very clear.

    Map<String, Object> source = results.get(i).getSource();
    Assert.assertNotNull(source);
    Assert.assertNotNull(source.get(Constants.Fields.SRC_ADDR.getName()));
    Assert.assertNotNull(source.get(Constants.GUID));
Contributor Author

Elasticsearch must now always return the GUID to populate the UI. We cannot rely on the document ID being the same as the Metron GUID.

    for (int i = 0; i < 10; ++i) {
      Map<String, Object> source = results.get(i).getSource();
      Assert.assertNotNull(source);
      Assert.assertNotNull(source.get(Constants.Fields.SRC_ADDR.getName()));
Contributor Author

Unlike Elasticsearch, Solr does not return the Metron GUID.

@alinazemian

@nickwallen, for event logs, where retrieval is mostly segmented by timestamp, it is recommended to use the timestamp as a prefix of the ID. For example, something like timestamp+hash(original_string).
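
A minimal Java sketch of an ID in the timestamp+hash(original_string) shape follows. Everything beyond that shape is an assumption made for illustration: the class name TimePrefixedId, the use of SHA-256, truncating the hash to 8 bytes, the zero-padded 13-digit millisecond timestamp, and the '-' separator are not taken from this thread or from Metron.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; the hash choice and ID layout are assumptions. */
final class TimePrefixedId {

  private TimePrefixedId() {}

  /** Builds "<zero-padded epoch millis>-<first 8 bytes of SHA-256 as hex>". */
  static String of(long timestampMillis, String originalString) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(originalString.getBytes(StandardCharsets.UTF_8));
      // Zero-padding keeps lexicographic order aligned with time order.
      StringBuilder id = new StringBuilder(String.format("%013d", timestampMillis));
      id.append('-');
      for (int i = 0; i < 8; i++) {
        id.append(String.format("%02x", hash[i] & 0xff));
      }
      return id.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform.
      throw new IllegalStateException("SHA-256 unavailable", e);
    }
  }
}
```

The same message and timestamp always produce the same ID, so re-indexing a message would overwrite the existing document rather than create a duplicate. Whether a truncated hash is unique enough depends on message volume per millisecond, which this sketch does not address.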

@merrimanr
Contributor

@MRaliagha, that's a good suggestion. I believe we can functionally achieve that by creating a custom ID field in the format you suggest (with a Stellar field transform) and setting that field to be the ES ID with the Ambari property exposed in this PR. Do you feel it's worth documenting as an optimization?

I spun this up in full dev and ran through all the testing instructions. Everything worked as advertised. I think there are just a couple open questions but this is pretty close in my opinion.

    <type>test-jar</type>
    <scope>test</scope>
    </dependency>
    <dependency>
Contributor

Why were these dependencies removed? Just curious.

Contributor Author

They cause the tests to fail when executed in an IDE like IntelliJ. I don't understand exactly why, but @justinleet pointed me in this direction.

Also, everything runs just fine without them, so they are unnecessary. The fewer dependencies, the better.

Contributor

I have also noticed this and I end up commenting these out every time I run a test in my IDE. Thanks for investigating and removing them.

@merrimanr
Contributor

Looks good to me. +1

@alinazemian

@MRaliagha, that's a good suggestion. I believe we can functionally achieve that by creating a custom ID field in the format you suggest (with a Stellar field transform) and setting that field to be the ES ID with the Ambari property exposed in this PR. Do you feel it's worth documenting as an optimization?

Yes, I think it is worth documenting, as people can easily create serious issues with Lucene-based indexers by messing with the ID. It can give users an understanding of where it is safe to play with the ID and what the recommendations are. I will see if I can find any articles to share as part of the manual.

asfgit pushed a commit that referenced this pull request Oct 23, 2018
@nickwallen
Contributor Author

This change was reverted here. A new pull request will be opened with the functionality. See also this mailing list thread.

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018