-
Notifications
You must be signed in to change notification settings - Fork 505
METRON-2073: Create in-memory use case for enrichment with map type and flatfile summarizer #1399
Conversation
|
Would it be worth changing the caching to use caffeine, as we have in other locations, given the performance gains we've seen for larger caches with caffeine? |
|
On naming: since this is very much based on, and similar to OBJECT_GET, how about ENRICHMENT_OBJECT_GET? |
|
The other thing worth thinking about is some guard-rails around cache size. If we load a large file, it seems like there is a good chance of blowing up the enrichment topology somewhere hard to debug with a OOM. That's probably something that needs pushing down to OBJECT_GET too to be honest. |
|
The latest commits should address your feedback @simonellistonball. Have you given any thought to being able to configure different cache settings for different paths/objects? |
mmiklavc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@merrimanr This is a nice addition to our Stellar features! I really like the mem/size check addition.
I have a number of comments around test setup that I think are worth working through along with some requests for configuration documentation. One additional ask would be to add some integration tests. See the following for examples:
- https://github.com/apache/metron/blob/master/metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/outlier/MedianAbsoluteDeviationTest.java
- https://github.com/apache/metron/blob/master/metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/approximation/HyperLogLogPlusFunctionsIntegrationTest.java
|
|
||
| @Stellar(namespace="ENRICHMENT" | ||
| ,name="OBJECT_GET" | ||
| ,description="Retrieve and deserialize a serialized object from HDFS and stores it in the ObjectCache, " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add references to the global config options like we do for OBJECT_GET?
Line 55 in 14efe83
| "\"" + ObjectGet.OBJECT_CACHE_SIZE_KEY + "\" (default " + ObjectGet.OBJECT_CACHE_SIZE_DEFAULT + ")," + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in latest commit.
| } | ||
| public void setup() throws Exception { | ||
| objectGet = new ObjectGet(); | ||
| objectCache = mock(ObjectCache.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to just use the real ObjectCache here? I get that it might arguably be an integration test at that point, but it's also a pretty trivial dependency. I've been thinking about this a while now, and I think we have quite a few instances where we're doing dependency injection and mocking where just using the real objects would probably be 1) simpler 2) clearer and 3) a more accurate, less error-prone test. This test is simple enough, though there are many other where the test setup for the mocks is anything but obvious. For example, this is from a recent PR of mine - https://github.com/apache/metron/pull/1409/files#diff-ab0ebf385d42a0fe97232a3e1131936bR338. More often than not, I find that spies (parserRunner -> https://github.com/apache/metron/pull/1409/files#diff-ab0ebf385d42a0fe97232a3e1131936bR222) are a symptom/code smell that warrants some further thought on what a class is doing and whether we should be pushing some of its responsibility to other collaborating classes.
See some expanded thoughts on mocking from Uncle Bob Martin's here - https://blog.cleancoder.com/uncle-bob/2014/05/10/WhenToMock.html
Per another of my comments, it looks like the main reason to mock here is the underlying Loader. I'd consider injecting a simple faux Loader if you don't want to add that complexity. I think there's value in testing the true integration between the Stellar function ENRICHMENT_OBJECT_GET wrapper and the underlying ObjectCache implementation versus just mocking the functionality of ObjectCache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You make some good points on the tradeoffs of using mocks vs using real classes. In this case though, I don't agree that using the real ObjectCache class would be simpler.
Using a mock object for the cache is pretty simple here. The mock is set up to return values from method calls and verify methods were called correctly. This is done with only a couple lines of code.
If we were to use the real ObjectCache, now we have to:
- Create an arbitrary cache configuration
- Create a test Loader and inject it instead of the default Loader
- Setup and instantiate the cache
This is more complex in my opinion. Plus any change to the ObjectCache setup requirements would also require an update to this test. I've experienced pain in the past where a change to a low level class required significant changes to several different tests because the tests were relying on the real implementation rather than being isolated to the class it's testing.
Again, I take your point that we should be cautious about over-using mocks. But I don't think it applies here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that you never actually test the integration between those classes. It's only ever a mock interaction from the tests you've provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am planning on adding the integration tests you requested. Would that cover it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense, thanks @merrimanr.
Can you make sure we're setting sensible defaults for the cache as well? These two items surprise me that they would matter for this test:
- Create an arbitrary cache configuration
- Setup and instantiate the cache
It also wasn't clear to me that I'd have to do those things, even after spending a lot of time going over the code. I would have thought that I can just do new ObjectCache(simpleLoader), where simpleLoader is a basic stunt double reading from a string/bytearray or something, and have config defaults that just work. The unit test for ObjectCache should be testing all the fine-grained details, which are less important here. Likewise, the integration tests can just verify that the config is being picked up and passed through - no need to test all possible combos from the integration test standpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we're setting defaults in the ObjectCacheConfig constructor, all of this is much simpler. There is no need to populate cache settings as the defaults will work in most cases. Now the Create an arbitrary cache configuration step can be done by just creating a new ObjectCacheConfig object. You would still have to initialize the cache by calling ObjectCache.initialize() and provide a custom test Loader though.
| } | ||
|
|
||
| @Test | ||
| public void testMultithreaded() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably have some sort of timeout that will kill this if there is a multithreading issue. Maybe use an ExecutorService here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me look into this further. This test was preexisting and I'm not sure what the original intention was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessarily a blocker.
|
|
||
| protected LoadingCache<String, Object> cache; | ||
| private static ReadWriteLock lock = new ReentrantReadWriteLock(); | ||
| protected LoadingCache<String, Object> getCache() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we exposing the underlying cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for testing purposes. I like your other suggestion of providing specific methods for things we need to test so this will go away when I make those changes.
| @Test | ||
| public void test() throws Exception { | ||
| String filename = "target/ogt/test.ser"; | ||
| Assert.assertTrue(cache.getCache() == null || !cache.getCache().asMap().containsKey(filename)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize some of this was probably around before, but these are great opportunities to make incremental improvements to the code base with a minimal impact/risk. I think a better option than exposing class internals - see Indecent Exposure - would be to offer methods that provide the functionality you're adding in the test. e.g.
ObjectCache cache = new ObjectCache();
Assert.assertTrue(cache.isEmpty());
Assert.assertTrue(cache.size() == 0);
Assert.assertFalse(cache.hasKey(filename));
It's both clearer and doesn't require you to expose the internals of your class under test. I get the compulsion to do this - you might be thinking "well, how do I know it's null?" I couldn't find the original text from Kent Beck's book on TDD online that I'm thinking of, but the red, green, refactor cycle really covers this pretty well - part of the refactor phase is to swap out a static implementation, e.g. "return false" and put the real thing there. The test should still pass when you refactor. Not that you're doing TDD here and I'm not suggesting that you do, rather my point is there's precedent for not having to expose class internals for tests. Here's some additional background on it - https://blog.cleancoder.com/uncle-bob/2014/12/17/TheCyclesOfTDD.html.
Also regarding class internals exposure - https://martinfowler.com/bliki/TellDontAsk.html
Tell-Don't-Ask is a principle that helps people remember that object-orientation is about bundling data with the functions that operate on that data. It reminds us that rather than asking an object for data and acting on that data, we should instead tell an object what to do.
This approach also helps avoid "train wrecks" and violating the Law of Demeter as seen in !cache.getCache().asMap().containsKey(filename)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, this was preexisting. Happy to make your suggested changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with the latest commit.
| Object value; | ||
| try { | ||
| Map cachedMap = (Map) objectCache.get(path); | ||
| LOG.debug("Looking up value from object at path '{}' using indicator {}", path, indicator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for using the {} pattern, thumbs up.
| public void initialize(Context context) { | ||
| Map<String, Object> config = (Map<String, Object>) context.getCapability(Context.Capabilities.GLOBAL_CONFIG, false) | ||
| .orElse(new HashMap<>()); | ||
| ObjectCacheConfig objectCacheConfig = ObjectCacheConfig.fromGlobalConfig(config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have both this fromGlobalConfig(config) method and then all the same config setup duplicated again below? e.g. OBJECT_CACHE_EXPIRATION_KEY. I think what you put in fromGlobalConfig(config) is pretty clear, well-encapsulated, and versatile. Am I missing something here?
Can you also add README details around the new global config options and an example of how they're set (showing where/how they're nested in the global config)? e.g.
#global config
{
...
"cache.option.1" : "val1",
"cache.option.2" : false,
"cache.option.3" : 879,
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config setup below overrides the properties that come from fromGlobalConfig(config). I decided to keep the keys that same, only difference is the EnrichmentObjectGet settings are nested inside the enrichment.object.get.settings property. As I was testing I found it easier to remember the settings if they were consistent across both. Also, we want to allow a mix of default global settings and specific EnrichmentObjectGet settings. That's why it looks kind of strange with all the if statements. Can you think of a cleaner way to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config setup below overrides the properties that come from fromGlobalConfig(config).
I must be missing something - you're passing config in order to get fromGlobalConfig to create an objectCacheConfig. But then you're again using config.get(someKey) to override what you've already extracted from config in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is how the process around cache settings works:
- Create a new
ObjectCacheConfigobject. This config will be a mix of defaults and top-level object get global config settings, depending on what is set in the global config. - Retrieve the nested enrichment object get settings using the
enrichment.object.get.settingskey. - For each setting, using the setting in the nested object from the previous step if it is defined. Otherwise fall back to the setting from step 1.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah geeze, I see what's happening now. I had to go back over this a few times to grok what's going on. Ok, so I wasn't aware of or clear on what was previously in global config from the original OBJECT_GET work versus what was added in this new feature. Link provided for context. I get wanting to keep OBJECT_GET's original config around in global config for existing functionality. I'm a little foggy on why the in-memory function wouldn't just have its own config completely independent of the original - ie just cut the cable on the config overrides, even though you're sharing similar infrastructure for the object cache. The 2 different Stellar functions can instantiate the underlying cache object with their own config however they want. I'm all about code reuse, and I like what you've done with the cache refactoring. I also like providing users options with sensible defaults, as we've done many other places in the application. But in this case I'm unclear what the added value is for this extra bit of extra complexity with the config inheritance - can we just let the config for the 2 different functions work independently and get rid of the override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that makes sense, it will make it simpler. I will separate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest commit makes ENRICHMENT_OBJECT_GET and OBJECT_GET configs separate. Should be simpler and easier to read now.
| try(BufferedOutputStream bos = new BufferedOutputStream(fs.create(new Path(filename), true))) { | ||
| IOUtils.write(SerDeUtils.toBytes(data), bos); | ||
| } | ||
| cache.initialize(ObjectCacheConfig.fromGlobalConfig(new HashMap<>())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this actually doing? Why not just new ObjectCacheConfig()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initializing the cache requires some ObjectCacheConfig properties to be set, otherwise you get NPEs. I just did this for convenience. I'm happy to change it to new ObjectCacheConfig() and explicitly set the cache properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like you might mean to have the no-arg default constructor set to private ObjectCacheConfig() {} because using it causes unrecoverable errors. Alternatively, you could also do away with fromGlobalConfig and simply create
public ObjectCacheConfig(Map<String, Object> globalConfig) {
...
}
but I don't think we should have both. I personally prefer the latter approach and leveraging constructors for their intended purpose. Let's let the API dictate the use rather than relying on javadoc or, worse yet, NPEs only found at runtime.
Side note - what specifically causes NPEs? Can we address that as well, while we're at it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. The no arg constructor doesn't really serve any purpose. I will change fromGlobalConfig to a constructor (should have done this in the first place).
The NPE comes from Caffeine code. As long as we're not passing in null values to the cache builder it's not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other recent comments on this - can we setup some sensible defaults for the config? I think the pattern we've attempted to follow is to maximize configurability of the system by exposing as many options as possible, but also making easy to get up and running with some defaults that they can configure PRN. I'd look at the cache used in ParallelEnricher to see what has been used for defaults in other caches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe moving from fromGlobalConfig to a constructor will solve this since that method sets sensible defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with the latest commit.
|
|
||
| @Test | ||
| public void test() throws Exception { | ||
| String filename = "target/ogt/test.ser"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach leaves open a potential for dirty data. Here are a few options that eliminate this issue altogether are:
- Our TestUtils functions. You'll need to tweak them for writing byte arrays, but there is temp dir functionality there as well - https://github.com/apache/metron/blob/master/metron-platform/metron-integration-test/src/main/java/org/apache/metron/integration/utils/TestUtils.java
- JUnit's temp directory/file feature
metron/metron-platform/metron-common/src/test/java/org/apache/metron/common/utils/HDFSUtilsTest.java
Line 35 in 9cee51e
public TemporaryFolder tempDir = new TemporaryFolder(); - Add a cleanup hook in
@Beforethat wipes a pre-determined temporary directory, if it exists, e.g.target/ogt/. One benefit of this approach is that if a test fails, the test data is still in the temp dir and can be readily inspected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was also preexisting. You suggestions make sense and I will add them in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with the latest commit.
| .removalListener((path, value, removalCause) -> { | ||
| LOG.debug("Object retrieved from path '{}' was removed with cause {}", path, removalCause); | ||
| }) | ||
| .build(new Loader(new Configuration(), config)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's one place to provide an opportunity for dependency injection, I think the Loader implementation is it.
|
I believe I've addressed the feedback so far with the latest commit. Let me know how it looks now and what I'm missing. |
|
+1, nice work @merrimanr |
Contributor Comments
This PR adds a Stellar function for performing in-memory enrichments. Similar to
ENRICHMENT_GET, which performs lookups in HBase, this function performs lookups from an in-memory map that is loaded from HDFS. Thenosql_table,enrichment_type, andcolumn_familyparameters inENRICHMENT_GETare similar to theENRICHMENT_IN_MEMORY_GETpathparameter in that they uniquely identify the data source. Theindicatorparameter is the same (the key used to lookup an enrichment). The new function is very similar to theOBJECT_GETfunction (they use the same cache abstraction) except that it performs the lookup, offers configuration specifically for in-memory enrichments, and potentially more. The in-memory objects are lazy-loaded, meaning they are loaded the first time anENRICHMENT_IN_MEMORY_GETis executed.The Jira states the first pass should be a 1-time load but it wasn't difficult to expose cache expiration settings. This allows cached enrichments to be refreshed at an interval and will support streaming enrichments.
Cache settings are applied in the following order:
OBJECT_GETfunction)ENRICHMENT_IN_MEMORY_GETThese settings are passed into the Guava
CacheBuilderwhen the function is initialized.Changes Included
ObjectCacheabstraction was created from the code in the StellarOBJECT_GETfunction. This abstraction is shared by both theOBJECT_GETandENRICHMENT_IN_MEMORY_GETfunctions. The current settings used inOBJECT_GETshould be backwards compatible.ENRICHMENT_IN_MEMORY_GETcan be configured in the global config with thein.memory.enrichment.settingskey.Testing Instructions
This has been tested in full dev with the Stellar CLI.
enrichments.csvfile with enrichment data:enrichments_updated.csvfile with updated enrichment data:summarizer.jsonfile that will serialize this data as a map in HDFS:flatfile_summarizer.shscript to create serialized files:Note the
object.cache.expiration.minutessetting is maintained for backwards compatibility. You can also set the expiration to 1 minute with these settings:enrichments_updated.ser:ENRICHMENT_IN_MEMORY_GET. Put the previousenrichments.serback in HDFS and change the cache settings to 15 seconds:Next Steps
This is intended to be a first pass and there are still some outstanding items. I am planning on adding javadocs and a section in our READMEs describing how to do in-memory enrichments.
Are there other features people would like to see added here? I think it could be useful to set cache settings for each object in HDFS (currently they are global to all objects and can only be changed on initialization). This would make things more complex and require some design work.
Does anyone have suggestions for more appropriate function and setting names?
What else would you change about this?
Pull Request Checklist
Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
For all changes:
For code changes:
Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
For documentation related changes:
Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via
site-book/target/site/index.html:Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.