METRON-743: Sort the files when reading results from Pcap #467
|
What are the performance penalties of doing this? Part of the discussion in HADOOP-12009 that led to the clarification of this behavior in the spec explained that they don't do this themselves because of the performance penalty. Could you get OOM errors or other issues running large queries that would have worked before?
|
There are two things here I think
I think we can address the test first, and may have to discuss the 2. |
|
The performance penalties are minimal. The number of files equals the number of reducers, which does not scale with the data and is user-specifiable. Also, we are only sorting the file handles here, not the contents, so OOM errors are very unlikely. The contents are sorted by virtue of MapReduce and the files are named in an ordered way by virtue of our custom partitioner; this just ensures that the files are processed in order.
I'm not treating this as just a test problem. This is a problem of our assumptions not being correct. It could be a problem for the real pcap system, not just the test, if people are using a non-HDFS implementation. For HDFS, it's probably not an issue (honestly, I'm not even sure of that in all cases, and there is no guarantee that the behavior won't change, since it's not mandated), but I'd rather own our assumptions than depend on FileSystem operations that do not necessarily conform to them.
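To illustrate why sorting the handles is cheap, here is a minimal sketch that orders a listing by file name only. It assumes the result files follow Hadoop's default zero-padded reducer naming (`part-r-00000`, `part-r-00001`, ...), so lexicographic order matches partition order; the names Metron's partitioner actually produces may differ. The `Handle` class and `sortedNames` method are hypothetical stand-ins, not Metron or Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortByFileName {
    // Hypothetical stand-in for a file handle: it carries metadata only,
    // never the file contents, so sorting a list of these is cheap.
    static final class Handle {
        final String name;
        final long sizeBytes;

        Handle(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns the file names in lexicographic order. With zero-padded
    // partition numbers this is also partition order (an assumption here).
    static List<String> sortedNames(List<Handle> handles) {
        List<Handle> copy = new ArrayList<>(handles);
        copy.sort(Comparator.comparing((Handle h) -> h.name));
        List<String> names = new ArrayList<>();
        for (Handle h : copy) {
            names.add(h.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // A listing returned in arbitrary order, one entry per reducer.
        List<Handle> listing = Arrays.asList(
            new Handle("part-r-00002", 1L << 30),
            new Handle("part-r-00000", 1L << 30),
            new Handle("part-r-00001", 1L << 30));
        System.out.println(sortedNames(listing));
    }
}
```

The cost depends on the number of reducers, not on the size of the data each file holds.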
|
Thank you for clarifying, that makes sense. Sorry to confuse the issue |
|
Ok, I ran this up and tested it and I got the results I expect, but I'd like some independent confirmation by @kylerichardson |
|
+1. Passes unit and integration tests; ran through @cestella's test script successfully. Thanks for your patience and for fixing!
|
Also +1 on this. Ran the script above. Just a note, the Vagrant quick-dev install for pcap doesn't work via |
|
+1, and I'm now seeing this in my personal Travis
|
@mmiklavc I ran into the same issue with installing pycapa on quick-dev. My solution was to tweak the playbook to run the pycapa role as part of the sensor-stubs tag. |
The FileSystem.listFiles() call does not return the files in sorted order, which we assume for all FileSystem implementations. We should sort this to be certain.
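The idea can be sketched without a Hadoop dependency by using `java.nio.file` as a stand-in: `Files.list` likewise makes no promise about the order of the entries it returns, so the caller sorts before processing. This is an analogous sketch under that substitution, not the change made in this PR; `SortedListing`, `listSorted`, and the `part-r-*` file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

class SortedListing {
    // Lists the regular files in dir and returns them sorted by file name,
    // rather than relying on whatever order the listing call happens to use.
    static List<Path> listSorted(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile).forEach(files::add);
        }
        files.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Create result files in scrambled order in a temporary directory.
        Path dir = Files.createTempDirectory("pcap-results");
        for (String name : new String[] {"part-r-00002", "part-r-00000", "part-r-00001"}) {
            Files.createFile(dir.resolve(name));
        }
        // Process the files in name order regardless of listing order.
        for (Path p : listSorted(dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Sorting at the call site keeps the ordering assumption in our own code instead of in a particular filesystem implementation.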
Testing
Get PCAP data into Metron: install and set up pycapa. The instructions below reference/mirror those in #93.
$ cd /opt/pycapa/pycapa && pip install -r requirements.txt && python setup.py install
for i in bro enrichment yaf snort; do storm kill $i; done
$METRON_HOME/bin/start_pcap_topology.sh
/usr/bin/pycapa --producer --topic pcap -i eth1 -k node1:6667
hadoop fs -ls /apps/metron/pcap
Choose a file (denoted by $FILE) and dump a few of the contents using the pcap_inspector utility via
$METRON_HOME/bin/pcap_inspector.sh -i $FILE -n 5
Choose one of the lines and note the protocol.
Note that when you run the commands below, the resulting file will be placed in the execution directory where you kicked off the job from.
$METRON_HOME/bin/pcap_query.sh query -st "20160617" -df "yyyyMMdd" -query "protocol == 6" -rpf 500
For all changes:
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci be set up for your personal repository so that your branches are built there before you submit a pull request.