This repository was archived by the owner on Aug 20, 2025. It is now read-only.

Conversation

@mmiklavc
Contributor


https://issues.apache.org/jira/browse/METRON-1641

This enables Pcap Jobs to be run asynchronously. The PcapJob class itself is now a Statusable implementation that allows clients to poll for current JobStatus. This implementation exposes the new functionality on the job class but keeps the existing PcapCli functionality intact and unchanged. The tests for this will be in a comment below, taken from #256.

This validation should check that the current pcap functionality does not break. Follow-on PRs will leverage the new asynchronous capabilities.

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution, we ask that you follow these guidelines and double-check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:

    mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and or integration tests to verify your changes?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@mmiklavc
Contributor Author

Testing

Get PCAP data into Metron:

  1. Install and setup pycapa - look for "Install pycapa" here https://cwiki.apache.org/confluence/display/METRON/Metron+0.4.1+with+HDP+2.5+bare-metal+install+on+Centos+7+with+MariaDB+for+Metron+REST
  2. (if using singlenode vagrant) Kill the enrichment and sensor topologies via for i in bro enrichment yaf snort;do storm kill $i;done
  3. Start the pcap topology via $METRON_HOME/bin/start_pcap_topology.sh
  4. Start the pycapa packet capture producer on eth1 via /usr/bin/pycapa --producer --topic pcap -i eth1 -k node1:6667
  5. Watch the topology in the Storm UI and, once the number of packets ingested exceeds 3k, kill the packet capture utility started above.
  6. Ensure that at least 3 files exist on HDFS by running hadoop fs -ls /apps/metron/pcap
  7. Choose a file (denoted by $FILE) and dump a few of the contents using the pcap_inspector utility via $METRON_HOME/bin/pcap_inspector.sh -i $FILE -n 5
  8. Choose one of the lines and note the protocol.
  9. Note that when you run the commands below, the resulting files will be placed in the directory from which you kicked off the job.

Fixed filter

  1. Run a fixed filter query by executing the following command with the values noted above (match your start_time format to the date format provided - default is to use millis since epoch)
  2. $METRON_HOME/bin/pcap_query.sh fixed -st <start_time> -df "yyyyMMdd" -p <protocol_num> -rpf 500
  3. Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  4. Copy the files to your local machine and verify you can open them in Wireshark. I chose a middle file and the last file. The middle file should have 500 records (per the records_per_file option), and the last one will likely have a number of records <= 500.
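The file-count arithmetic implied by the -rpf option can be sketched as below. This is a hypothetical helper, not Metron code; it just illustrates why every output file except possibly the last holds exactly rpf records.

```java
// Illustrative sketch of the -rpf (records_per_file) partitioning arithmetic.
// Hypothetical class and method names, not the actual Metron implementation.
public class RpfMath {
    // Total number of output files: ceiling division of records by rpf.
    static int fileCount(int totalRecords, int rpf) {
        return (totalRecords + rpf - 1) / rpf;
    }

    // Records in the final file: the remainder, or a full rpf if it divides evenly.
    static int lastFileRecords(int totalRecords, int rpf) {
        int rem = totalRecords % rpf;
        return rem == 0 ? rpf : rem;
    }

    public static void main(String[] args) {
        // e.g. 3,200 matched packets at -rpf 500 -> 7 files, the last with 200 records
        System.out.println(fileCount(3200, 500) + " files, last has "
                + lastFileRecords(3200, 500) + " records");
    }
}
```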

Query filter

  1. Run a Stellar query filter query by executing a command similar to the following, with the values noted above (match your start_time format to the date format provided - default is to use millis since epoch)
  2. $METRON_HOME/bin/pcap_query.sh query -st "20160617" -df "yyyyMMdd" -query "protocol == '6'" -rpf 500
  3. Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  4. Copy the files to your local machine and verify you can open them in Wireshark. I chose a middle file and the last file. The middle file should have 500 records (per the records_per_file option), and the last one will likely have a number of records <= 500.

@cestella
Member

I like this! I especially like the statusable abstraction here. Good job; I'm +1 after the full-dev testing checkbox is checked and the small comment I had.

<module>metron-enrichment</module>
<module>metron-solr</module>
<module>metron-parsers</module>
<module>metron-job</module>
Member

can we adjust the indentation here?

Contributor Author

Heh, the pom.xml has tabs instead of spaces. Rather than reformat everything in the file, I just changed that line to use tabs.

@merrimanr
Contributor

Looks good! One thing I'm trying to wrap my head around is how we get status if we only have a job id or unique identifier for a job. JobStatus doesn't have an id, so I'm assuming resultPath is the unique identifier here.

As far as I can tell, an instance of org.apache.hadoop.mapreduce.Job is kept in memory and is responsible for reporting status. I can think of a couple of scenarios where this might be problematic.
One is if I ran a query from the CLI but then wanted to get status from REST. How would that work? That's probably not a likely use case, so maybe not an issue there. What happens if I submit a query through REST and REST is restarted while jobs are running? Do we lose job status information?

@cestella
Member

I think for a first cut, it's ok to have the restrictions that:

  • The REST API controls only the jobs it creates. Otherwise, we would need more refactoring in the CLI to drop the output in the same HDFS directory rather than it being user-specifiable and output locally. Ultimately, while they use the same mechanism, the UX is different between the two approaches (e.g. the CLI entirely cleans up after itself and outputs to the local directory, whereas the REST approach stores the results in HDFS until manual cleanup).
  • If a job is running while the REST API dies, we should consider that job a runaway: it needs to be killed by the admin or left to complete without the result being published. One thing we might consider doing is giving the job name a prefix of METRON_REST_PCAP so that, upon REST start, it can kill existing jobs. I think for THIS PR, we should just have REST pcap jobs carry that prefix and leave the actual killing to a follow-on PR.
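The naming convention proposed above could look something like the following sketch. The class, method names, and name format are hypothetical, not the actual Metron API; the point is that a well-known prefix lets a later cleanup pass identify REST-submitted pcap jobs.

```java
// Hypothetical sketch of the METRON_REST_PCAP job-naming convention.
public class PcapJobNames {
    static final String REST_PREFIX = "METRON_REST_PCAP";

    // Build a job name like METRON_REST_PCAP_<submit-time-millis> (assumed format).
    static String restJobName(long submitTimeMillis) {
        return REST_PREFIX + "_" + submitTimeMillis;
    }

    // A restart-time sweep could kill any running job whose name matches this check.
    static boolean isRestPcapJob(String jobName) {
        return jobName != null && jobName.startsWith(REST_PREFIX);
    }

    public static void main(String[] args) {
        String name = restJobName(1531324800000L);
        System.out.println(name + " is REST pcap job: " + isRestPcapJob(name));
    }
}
```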

@cestella
Member

One thing I didn't see: can we make sure we pass along the YARN queue to the job?

@cestella
Member

One more comment about restartability: I think we could potentially support this with this architecture in the future. You can recover the Job object from MR via the JobClient:

Configuration conf = new Configuration();
JobClient jobClient = new JobClient(new JobConf(conf)); // deprecation WARN
JobID jobID = JobID.forName("job_201107011451_0001");   // deprecation WARN
RunningJob runningJob = jobClient.getJob(jobID);

We could look for jobs which are completed but not in the HDFS structure and recover them on REST start. I would suggest doing that as a follow-on though.

@merrimanr
Contributor

Perfect. That addresses my concern. Doing this in a follow-on is fine since it's not necessary when using the CLI.

@merrimanr
Contributor

I spun this up in full dev, ran a fixed query, and got this error:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://./, expected: hdfs://node1:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:666)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
	at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1181)
	at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1177)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1195)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:1169)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1925)
	at org.apache.metron.common.utils.HDFSUtils.write(HDFSUtils.java:71)
	at org.apache.metron.pcap.writer.ResultsWriter.write(ResultsWriter.java:38)
	at org.apache.metron.pcap.mr.PcapJob.writeResults(PcapJob.java:270)
	at org.apache.metron.pcap.query.PcapCli.run(PcapCli.java:155)
	at org.apache.metron.pcap.query.PcapCli.main(PcapCli.java:52)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

Looks like the MR jobs succeeded but partitioning the files to the local FS did not work.

@mmiklavc
Contributor Author

Good catch on the FS, @merrimanr - I'm also finding that via manual testing. I believe I have a workaround that degrades nicely to the configuration default and also allows you to pass the scheme in the path.

FileSystem fs = FileSystem.newInstance(outPath.toUri(), config);
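A plain-JDK sketch of why this workaround helps: the CLI's local output path carries a "file" scheme, while the configured default filesystem is hdfs://node1:8020, so resolving the filesystem from the path's own URI (as FileSystem.newInstance(outPath.toUri(), config) does) selects the local FS for local paths and falls back to the default only when no scheme is given. The helper below is illustrative, not Metron code.

```java
import java.net.URI;

// Illustrative scheme resolution: mimic Hadoop's behavior of using the path's
// URI scheme when present and falling back to the configured default otherwise.
public class SchemeDemo {
    static String schemeOrDefault(String path, String defaultScheme) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : defaultScheme;
    }

    public static void main(String[] args) {
        System.out.println(schemeOrDefault("file://./", "hdfs"));             // file
        System.out.println(schemeOrDefault("hdfs://node1:8020/out", "hdfs")); // hdfs
        System.out.println(schemeOrDefault("/apps/metron/pcap", "hdfs"));     // hdfs
    }
}
```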

@mmiklavc
Contributor Author

Also, @merrimanr, agreed about the points you made regarding restartability, etc. in the long run. In the short term, @cestella has done a rather good job of summarizing my thoughts on a v1 pass at this feature set.

I will:

  • Add the Job ID to the returned status (I was going to add that when I do the job manager follow-on to this PR, but I'll just do it here)
  • Add the ability to pass a job name, i.e. job.setJobName(name) - I'll handle the actual job naming in the pcapservice and pass that as a parameter. I think that's a natural place for that logic.
  • Add the ability to pass the queue name

How's that sound to you both?

@mmiklavc
Contributor Author

Heads up: the Hadoop Configuration class is where you set the queue name, IIRC. We already pass that in as an arg, so this would simply need to be provided via the calling job manager class: config.set("mapreduce.job.queuename", "somequeue");

@merrimanr
Contributor

That sounds good to me.

@merrimanr
Contributor

I tested this and it's working for me in full dev. I think it's good enough to go into the feature branch. +1

@merrimanr
Contributor

Can we merge this? Any other items you would like addressed, @cestella?

@cestella
Member

+1, lgtm

@cestella
Member

@mmiklavc can you merge and close this PR?

asfgit pushed a commit that referenced this pull request Jul 11, 2018
@mmiklavc
Contributor Author

Committed to the feature branch.
