Skip to content
This repository was archived by the owner on Aug 20, 2025. It is now read-only.

Conversation

@rkarthik29
Copy link

addd support for js,python and groovy parser bolt

@ottobackwards
Copy link
Contributor

This is interesting. I would think we would want some tooling like https://github.com/YelpArchive/pyleus around building a jar that contains all the python dependencies for a script for people to do non-trivial things in python.

I am not sure how you would do the same for javascript or groovy.

@@ -0,0 +1,5 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should go in the .gitignore

@ottobackwards
Copy link
Contributor

What is the workflow or steps around deploying the scripts?

Copy link
Member

@cestella cestella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting contribution; thanks so much for that. We have definitely talked about this for some time. We really need to think about where scripts should live and how parsers can uptake new scripts on-the-fly.


@Override
public void init() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably don't need that

}
//Should this be sent to the interface as a default method?
public InputStream openInputStream(String streamName) throws IOException {
FileSystem fs = FileSystem.get(new Configuration());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is following the pattern of what's currently in Grok. That is being moved to take the grok statements from zookeeper because otherwise you cannot update the parser without bouncing the topology. Should we follow suit here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be consistent with this for now. If we change them to live in zookeeper we have to change all the parsers. We can't have some live in HDFS and others live in zookeeper

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we currently have active work to move the only parser that refers to reference files to zookeeper. We put ourselves in an inconsistent state as soon as that work comes in if we have this pull from HDFS.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think until that work has been merged in the only consistent state is pulling from HDFS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the desire for consistency; however, I would potentially be concerned about the user experience of putting scripts (especially ones with external library dependencies) into zookeeper. We would need to make this as clean as possible.


@Override
public boolean validate(JSONObject message) {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need that TODO


@Override
public String getTimestampField() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@Override
public String getDateFormat() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@Override
public String getParseFunction() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@Override
public String getLanguage() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@Override
public List<String> getTimeFields() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@Override
public String getDateFormat() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@Override
public String getTimestampField() {
// TODO Auto-generated method stub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

@ottobackwards
Copy link
Contributor

ottobackwards commented Nov 2, 2016

I think the tooling around creating the script parser is very important ( packaging, dependencies ), although that could be tracked in another jira as a follow on. But it may be impactful to this.

@cestella
Copy link
Member

cestella commented Nov 2, 2016

There has been a lot of thought and modification around updating external content in #308. It seems to me that some of this should be reused or presumed here. We have a similar issue, the need to update an external reference/execution file from zookeeper.

@cestella
Copy link
Member

cestella commented Nov 2, 2016

It may be, of course, that this shouldn't be pulled from zookeeper and should be kept a file. I'm not sure how I feel either way. If we keep it a file, I am unclear how we manage to reparse the script (and collateral) upon change without bouncing the topology.

@ottobackwards
Copy link
Contributor

I don't think the script change is the same as configuration/rule change necessarily.
Maybe what is missing are the requirements?

  • does the script have to be editable and trigger reload the same as a configuration
  • does the script have to include dependencies such that it doesn't introduce problems maintaining environments across the cluster
  • does script X have to support all 'features' ( jython limitations for cpython modules )
    Etc etc

@cestella
Copy link
Member

cestella commented Nov 2, 2016

@ottobackwards Yeah, that's sensible. As far as the features, I think it's ok for this to support the JVM-supported scripting languages with all their limitations (IMO).

@ottobackwards
Copy link
Contributor

yeah, as a Minimum Viable, but those limitations need to be spelled out, per language. Actually all the points above may have per language implications

@cestella
Copy link
Member

cestella commented Nov 2, 2016

@rkarthik29 Could you add a section here describing the parser's config and limitations/abilities?

@rkarthik29
Copy link
Author

i am coming in late to this party, but i think the core issue , how would the bolt know that the script has changed so it can reload the parser. I was thinking if the previous bolt could add aversion tag to the raw message.. The parser bolt compares this version or checksum to the last one it had, if different reruns init and stores this value into the bolt. Else just drops the version an processes the raw message. A simple checksum may be. Since the scripts can be really evolved, we could give user more leverage if we allowed them to keep in HDFS or a filesystem. even a rest api. Please let me know if i am way off.

@cestella
Copy link
Member

cestella commented Nov 2, 2016

I'm increasingly becoming of the opinion that the code as it stands is probably the right path for the the moment. When we are through the changes from making the grok parser pull from Zookeeper, we can revise if necessary.

The main issue, as I see it, is as @rkarthik29 stated: how do you know that the scripts changed. And there are now two parsers that need non-trivially sized external files to function properly, so it may make some sense for there to be more concerted thought to how to abstract that properly.

@merrimanr
Copy link
Contributor

@rkarthik29 One of the advantages of keeping the configuration (scripts in this case) in Zookeeper is that you get notified when something changes. You don't need to track versions. Unfortunately, until #308 gets merged in, the ParserBolt does not notify the parser class that the config has changed and only provides the configuration on startup (hence the need to restart the topology).

I think it makes sense to keep it in HDFS for this PR and accept the current restriction of having to restart the topology for it to take effect. Until we merge #308 or agree on the another approach, I wouldn't worry about script changes being updated on the fly. There has already been a lot of discussion and work put into this so it doesn't make sense to start over. We could even update #308 to include this parser once we get closer to merging that.

@james-sirota
Copy link

I will try to verify this on a production deployment with production data on Friday. I'll let you know how it goes

@james-sirota
Copy link

By the way sounds like the Travis build fails the RAT check. [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.12:check (default) on project metron-integration-test: Too many files with unapproved license: 3 See RAT report in: /home/travis/build/apache/incubator-metron/metron-platform/metron-integration-test/target/rat.txt -> [Help 1]

@rkarthik29
Copy link
Author

The pull request is due to some errors with executing test cases. I am not sure if they are related to the parser, if someone guides me i can look at it.

Kathik Narayanan added 3 commits November 7, 2016 13:12
@rkarthik29
Copy link
Author

what happens to the PR now?

@ottobackwards
Copy link
Contributor

How do scripts with dependencies work? - I think there are still questions in the comments here that should be answered

@rkarthik29
Copy link
Author

rkarthik29 commented Nov 9, 2016

I think that should be a separate lira. This feature does a very simple thing, it allows a user to provide a function in the format JsonObject parse(rawMessage) in a language of his choice. The concept is that he can now use standard library functions to parse the rawMessage into JSon, which metron can then use.

dependencies for external libraries i think will need to be tested. This could be the strategy...

  1. For JS script, you could just include the min.js link or just put your script in a http server and refer it in the script.
  2. For groovy script, could be as simple as dropping a jar in the proper location, i guess the bolt class path.
  3. for python, i think we should be able to put the modules in a modules folder and i think python script can refer to it from there, we will have to look at that packaging.

I think it is a little more complicated and i think a separate would be needed to track it.

@rkarthik29
Copy link
Author

this is exactly what we would need to do for python...
http://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html

@ottobackwards
Copy link
Contributor

So that is similar to the pyleus stuff. Although pyleus actually produces jar topologies as a unit. The question is, with only standard library support for any of these languages how useful is this? Does that need to be accounted for now or can it be another jira as you mention. I'm not advocating either way, but if we do say it will be another jira, and there will be further work here, we should explicitly state that and create the jiras to put that plan in place.

@ottobackwards
Copy link
Contributor

And the documentation for the feature need to account for the limitations etc

@rkarthik29
Copy link
Author

yes, i agree, right now i don't think dependencies in a project would require the code for this feature to change. I think it is just documenting on how to include such dependency.

These languages are pretty good in their support for string parsing and extracting data, right out of the box. They support xml, json , again right out of the box. With that being said, just the standard scripts should be pretty useful.

@rkarthik29
Copy link
Author

on a separate note, One feature that could really be a jira that causes code change, could be the one where we can have callbacks from these scripts, especially JS. Which means my script can call an external rest Api over AJAX and once it gets a response, will call back a function in the bolt with the JSONObject, asynchronous. This can give user tremendous amount of extendibility. May be a way to use a centralized rule engine , for their alerts. I know JS supports it, so should groovy. Not sure about python. Will the bolt support such a thing?

Copy link
Contributor

@kylerichardson kylerichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pulling this together. I can see some real value coming out of this parsing option. Some follow-on work will be needed to support more advanced use cases (e.g. inclusion of additional libraries).

Just a few minor comments. Overall a great start. Thanks for the contribution!

the specific language governing permissions and limitations under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be a local backup of the POM file. Can it be removed in favor of the git history?


def MetronMessage(){
def message=[:];
message["source"]="userlog";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the other languages you have passed in a name variable and used it for the source, can we do that here to be consistent?


def MetronMessage(name):
message={}
message["source"]="userlog"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this to set the source to the passed in name variable?

}
//Should this be sent to the interface as a default method?
public InputStream openInputStream(String streamName) throws IOException {
FileSystem fs = FileSystem.get(new Configuration());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the desire for consistency; however, I would potentially be concerned about the user experience of putting scripts (especially ones with external library dependencies) into zookeeper. We would need to make this as clean as possible.

+ originalMessage + " and the parsed message was: " + message + " . Check the function at: "
+ this.scriptPath);

message.put("original_string", originalMessage);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To move toward consistency, can this "original_string" field be updated to use the new Constants.Fields.ORIGINAL.getName()?

@randerzander
Copy link

@ottobackwards , Python's standard libraries are useful by themselves.

Since first tasks with Metron seem often related to ingesting a new source, Python lowers the barrier to entry over Java or Grok parsers.

@james-sirota
Copy link

i agree with @randerzander. We can file a follow-on jira to get library support for Python and other languages in another follow-on Jira to this

@rkarthik29
Copy link
Author

@james-sirota I will fix the conflicts once we are ready to merge.

@ottobackwards
Copy link
Contributor

I am not sure what the status of this pr is. I think that we may want to re-think this a little in light of METRON-777

@ottobackwards
Copy link
Contributor

What is the status of this effort?

@ottobackwards
Copy link
Contributor

Re-reading this, I think it is dead, unless there is an update triggered by this.

I think I have the resolution to the update issue.

Post 777 and parser extensions, extensions can / will be deployed through the ui.
A script parser will deploy scripts, as we do grok rules to a directory, we can have the deployment also make an entry in zookeeper with the checksum of the script or some such. That can be the item that is in zookeeper and monitored for change. If we deploy a new version of the script by installing it in the UI, it will save it to hdfs, and the change the entry in zookeeper, triggering reload.

If this PR is dead, I may take this up

@nickwallen
Copy link
Contributor

@rkarthik29 Please close this PR unless you believe it is still needed. This PR will be closed per the Metron Development Guidelines on Inactive PRs, unless you provide some reasoning as to why it is stil needed. Thanks!

@nickwallen
Copy link
Contributor

-1

There has been no response from contributor. This PR will be closed according to the Metron Development Guidelines on Inactive PRs.

@Humbedooh Humbedooh closed this May 14, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants