-
Notifications
You must be signed in to change notification settings - Fork 505
Conversation
|
This is interesting. I would think we would want some tooling like https://github.com/YelpArchive/pyleus around building a jar that contains all the python dependencies for a script for people to do non-trivial things in python. I am not sure how you would do the same for javascript or groovy. |
| @@ -0,0 +1,5 @@ | |||
| <?xml version="1.0" encoding="UTF-8" standalone="no"?> | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should go in the .gitignore
|
What is the workflow or steps around deploying the scripts? |
cestella
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting contribution; thanks so much for that. We have definitely talked about this for some time. We really need to think about where scripts should live and how parsers can uptake new scripts on-the-fly.
|
|
||
| @Override | ||
| public void init() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably don't need that
| } | ||
| //Should this be sent to the interface as a default method? | ||
| public InputStream openInputStream(String streamName) throws IOException { | ||
| FileSystem fs = FileSystem.get(new Configuration()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is following the pattern of what's currently in Grok. That is being moved to take the grok statements from zookeeper because otherwise you cannot update the parser without bouncing the topology. Should we follow suit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be consistent with this for now. If we change them to live in zookeeper we have to change all the parsers. We can't have some live in HDFS and others live in zookeeper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we currently have active work to move the only parser that refers to reference files to zookeeper. We put ourselves in an inconsistent state as soon as that work comes in if we have this pull from HDFS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think until that work has been merged in the only consistent state is pulling from HDFS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the desire for consistency; however, I would potentially be concerned about the user experience of putting scripts (especially ones with external library dependencies) into zookeeper. We would need to make this as clean as possible.
|
|
||
| @Override | ||
| public boolean validate(JSONObject message) { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need that TODO
|
|
||
| @Override | ||
| public String getTimestampField() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
|
||
| @Override | ||
| public String getDateFormat() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
|
||
| @Override | ||
| public String getParseFunction() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
|
||
| @Override | ||
| public String getLanguage() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
|
||
| @Override | ||
| public List<String> getTimeFields() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
|
||
| @Override | ||
| public String getDateFormat() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
|
||
| @Override | ||
| public String getTimestampField() { | ||
| // TODO Auto-generated method stub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
|
I think the tooling around creating the script parser is very important ( packaging, dependencies ), although that could be tracked in another jira as a follow on. But it may be impactful to this. |
|
There has been a lot of thought and modification around updating external content in #308. It seems to me that some of this should be reused or presumed here. We have a similar issue, the need to update an external reference/execution file from zookeeper. |
|
It may be, of course, that this shouldn't be pulled from zookeeper and should be kept a file. I'm not sure how I feel either way. If we keep it a file, I am unclear how we manage to reparse the script (and collateral) upon change without bouncing the topology. |
|
I don't think the script change is the same as configuration/rule change necessarily.
|
|
@ottobackwards Yeah, that's sensible. As far as the features, I think it's ok for this to support the JVM-supported scripting languages with all their limitations (IMO). |
|
yeah, as a Minimum Viable, but those limitations need to be spelled out, per language. Actually all the points above may have per language implications |
|
@rkarthik29 Could you add a section here describing the parser's config and limitations/abilities? |
|
i am coming in late to this party, but i think the core issue , how would the bolt know that the script has changed so it can reload the parser. I was thinking if the previous bolt could add aversion tag to the raw message.. The parser bolt compares this version or checksum to the last one it had, if different reruns init and stores this value into the bolt. Else just drops the version an processes the raw message. A simple checksum may be. Since the scripts can be really evolved, we could give user more leverage if we allowed them to keep in HDFS or a filesystem. even a rest api. Please let me know if i am way off. |
|
I'm increasingly becoming of the opinion that the code as it stands is probably the right path for the the moment. When we are through the changes from making the grok parser pull from Zookeeper, we can revise if necessary. The main issue, as I see it, is as @rkarthik29 stated: how do you know that the scripts changed. And there are now two parsers that need non-trivially sized external files to function properly, so it may make some sense for there to be more concerted thought to how to abstract that properly. |
|
@rkarthik29 One of the advantages of keeping the configuration (scripts in this case) in Zookeeper is that you get notified when something changes. You don't need to track versions. Unfortunately, until #308 gets merged in, the ParserBolt does not notify the parser class that the config has changed and only provides the configuration on startup (hence the need to restart the topology). I think it makes sense to keep it in HDFS for this PR and accept the current restriction of having to restart the topology for it to take effect. Until we merge #308 or agree on the another approach, I wouldn't worry about script changes being updated on the fly. There has already been a lot of discussion and work put into this so it doesn't make sense to start over. We could even update #308 to include this parser once we get closer to merging that. |
|
I will try to verify this on a production deployment with production data on Friday. I'll let you know how it goes |
|
By the way sounds like the Travis build fails the RAT check. [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.12:check (default) on project metron-integration-test: Too many files with unapproved license: 3 See RAT report in: /home/travis/build/apache/incubator-metron/metron-platform/metron-integration-test/target/rat.txt -> [Help 1] |
|
The pull request is due to some errors with executing test cases. I am not sure if they are related to the parser, if someone guides me i can look at it. |
# Conflicts: # metron-platform/metron-parsers/pom.xml
|
what happens to the PR now? |
|
How do scripts with dependencies work? - I think there are still questions in the comments here that should be answered |
|
I think that should be a separate lira. This feature does a very simple thing, it allows a user to provide a function in the format JsonObject parse(rawMessage) in a language of his choice. The concept is that he can now use standard library functions to parse the rawMessage into JSon, which metron can then use. dependencies for external libraries i think will need to be tested. This could be the strategy...
I think it is a little more complicated and i think a separate would be needed to track it. |
|
this is exactly what we would need to do for python... |
|
So that is similar to the pyleus stuff. Although pyleus actually produces jar topologies as a unit. The question is, with only standard library support for any of these languages how useful is this? Does that need to be accounted for now or can it be another jira as you mention. I'm not advocating either way, but if we do say it will be another jira, and there will be further work here, we should explicitly state that and create the jiras to put that plan in place. |
|
And the documentation for the feature need to account for the limitations etc |
|
yes, i agree, right now i don't think dependencies in a project would require the code for this feature to change. I think it is just documenting on how to include such dependency. These languages are pretty good in their support for string parsing and extracting data, right out of the box. They support xml, json , again right out of the box. With that being said, just the standard scripts should be pretty useful. |
|
on a separate note, One feature that could really be a jira that causes code change, could be the one where we can have callbacks from these scripts, especially JS. Which means my script can call an external rest Api over AJAX and once it gets a response, will call back a function in the bolt with the JSONObject, asynchronous. This can give user tremendous amount of extendibility. May be a way to use a centralized rule engine , for their alerts. I know JS supports it, so should groovy. Not sure about python. Will the bolt support such a thing? |
kylerichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pulling this together. I can see some real value coming out of this parsing option. Some follow-on work will be needed to support more advanced use cases (e.g. inclusion of additional libraries).
Just a few minor comments. Overall a great start. Thanks for the contribution!
| the specific language governing permissions and limitations under the License. | ||
| --> | ||
|
|
||
| <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be a local backup of the POM file. Can it be removed in favor of the git history?
|
|
||
| def MetronMessage(){ | ||
| def message=[:]; | ||
| message["source"]="userlog"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the other languages you have passed in a name variable and used it for the source, can we do that here to be consistent?
|
|
||
| def MetronMessage(name): | ||
| message={} | ||
| message["source"]="userlog" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update this to set the source to the passed in name variable?
| } | ||
| //Should this be sent to the interface as a default method? | ||
| public InputStream openInputStream(String streamName) throws IOException { | ||
| FileSystem fs = FileSystem.get(new Configuration()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the desire for consistency; however, I would potentially be concerned about the user experience of putting scripts (especially ones with external library dependencies) into zookeeper. We would need to make this as clean as possible.
| + originalMessage + " and the parsed message was: " + message + " . Check the function at: " | ||
| + this.scriptPath); | ||
|
|
||
| message.put("original_string", originalMessage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To move toward consistency, can this "original_string" field be updated to use the new Constants.Fields.ORIGINAL.getName()?
|
@ottobackwards , Python's standard libraries are useful by themselves. Since first tasks with Metron seem often related to ingesting a new source, Python lowers the barrier to entry over Java or Grok parsers. |
|
i agree with @randerzander. We can file a follow-on jira to get library support for Python and other languages in another follow-on Jira to this |
|
@james-sirota I will fix the conflicts once we are ready to merge. |
|
I am not sure what the status of this pr is. I think that we may want to re-think this a little in light of METRON-777 |
|
What is the status of this effort? |
|
Re-reading this, I think it is dead, unless there is an update triggered by this. I think I have the resolution to the update issue. Post If this PR is dead, I may take this up |
|
@rkarthik29 Please close this PR unless you believe it is still needed. This PR will be closed per the Metron Development Guidelines on Inactive PRs, unless you provide some reasoning as to why it is stil needed. Thanks! |
|
-1 There has been no response from contributor. This PR will be closed according to the Metron Development Guidelines on Inactive PRs. |
addd support for js,python and groovy parser bolt