METRON-743: Sort the files when reading results from Pcap #467

cestella · 2017-02-26T04:56:11Z

The FileSystem.listFiles() call is not guaranteed to return files in sorted order, although we assume sorted order for all FileSystem implementations. We should sort the results ourselves to be certain.
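A minimal sketch of the idea, not the code from this PR: sort the listed result files by name before processing them, instead of relying on the listing order. To stay self-contained it sorts plain path strings rather than Hadoop file objects. The class name ResultFileOrdering, its methods, and the part-r-0000N file names are hypothetical, and the lexicographic comparison assumes the names are zero-padded so that string order matches numeric order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not the code from this PR.
final class ResultFileOrdering {

    // Returns a new list sorted by file name (the last path component).
    // The input list is left untouched.
    static List<String> sortByFileName(List<String> paths) {
        List<String> sorted = new ArrayList<>(paths);
        sorted.sort(Comparator.comparing(ResultFileOrdering::fileName));
        return sorted;
    }

    // Extracts the part of the path after the final '/'.
    static String fileName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    public static void main(String[] args) {
        // Simulates a listing that comes back in arbitrary order.
        List<String> listed = Arrays.asList(
                "/tmp/out/part-r-00002",
                "/tmp/out/part-r-00000",
                "/tmp/out/part-r-00001");
        System.out.println(sortByFileName(listed));
    }
}
```

Only the file names are compared, so the cost depends on the number of result files, not on the amount of data inside them.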

Testing

Get PCAP data into Metron: Install and setup pycapa - the instructions below reference/mirror those in #93

Install the pycapa library & utility $ cd /opt/pycapa/pycapa && pip install -r requirements.txt && python setup.py install
(if using singlenode vagrant) Kill the enrichment and sensor topologies via for i in bro enrichment yaf snort;do storm kill $i;done
Start the pcap topology via $METRON_HOME/bin/start_pcap_topology.sh
Start the pycapa packet capture producer on eth1 via /usr/bin/pycapa --producer --topic pcap -i eth1 -k node1:6667
Watch the topology in the Storm UI and, once more than 3k packets have been ingested, kill the pycapa producer started in the previous step.
Ensure that at least 3 files exist on HDFS by running hadoop fs -ls /apps/metron/pcap
Choose a file (denoted by $FILE) and dump a few of the contents using the pcap_inspector utility via $METRON_HOME/bin/pcap_inspector.sh -i $FILE -n 5
Choose one of the lines and note the protocol.
Note that when you run the commands below, the resulting files will be placed in the directory from which you launched the job.
Run a Stellar filter query by executing a command similar to the following, with the values noted above (match your start_time format to the date format provided; the default is millis since epoch)
$METRON_HOME/bin/pcap_query.sh query -st "20160617" -df "yyyyMMdd" -query "protocol == 6" -rpf 500
Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
Copy the files to your local machine and verify you can open them in Wireshark. I chose a middle file and the last file. The middle file should have 500 records (per the records_per_file option), and the last one will likely have a number of records <= 500.
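The expected record counts in the last step can be worked out ahead of time. The sketch below assumes the matching records are written out sequentially in chunks of records_per_file, so every file is full except possibly the last. The class RecordsPerFileEstimate and the total of 3200 records are hypothetical values chosen for illustration; only the 500 comes from the -rpf option used above.

```java
// Hypothetical helper for illustration only; not part of Metron.
final class RecordsPerFileEstimate {

    // Number of output files needed for totalRecords at recordsPerFile each.
    static long fileCount(long totalRecords, long recordsPerFile) {
        check(totalRecords, recordsPerFile);
        // Ceiling division: a partial final chunk still needs its own file.
        return (totalRecords + recordsPerFile - 1) / recordsPerFile;
    }

    // Number of records in the last file (0 when there are no records at all).
    static long lastFileSize(long totalRecords, long recordsPerFile) {
        check(totalRecords, recordsPerFile);
        if (totalRecords == 0) {
            return 0;
        }
        long remainder = totalRecords % recordsPerFile;
        return remainder == 0 ? recordsPerFile : remainder;
    }

    private static void check(long totalRecords, long recordsPerFile) {
        if (totalRecords < 0 || recordsPerFile <= 0) {
            throw new IllegalArgumentException(
                    "totalRecords must be >= 0 and recordsPerFile must be > 0");
        }
    }

    public static void main(String[] args) {
        long total = 3200;   // assumed number of matching packets
        long perFile = 500;  // value passed via -rpf
        System.out.println("files: " + fileCount(total, perFile));
        System.out.println("records in last file: " + lastFileSize(total, perFile));
    }
}
```

With these assumed numbers, six files would hold 500 records each and a seventh would hold the remaining 200.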

For all changes:

Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks has been executed in the root incubating-metron folder via:

mvn -q clean integration-test install && build_utils/verify_licenses.sh

Have you written or updated unit tests and/or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and then verify the changes via site-book/target/site/index.html.

cd site-book
bin/generate-md.sh
mvn site:site

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

ottobackwards · 2017-02-26T05:13:22Z

What are the performance penalties of doing this?

Part of the discussion in HADOOP-12009 that led to the clarification of this behavior in the spec was about why they don't sort the results themselves: the performance penalty.

Could you get OOM errors or other issues running large queries that would have worked before?

ottobackwards · 2017-02-26T05:15:19Z

There are two things here, I think:

The test assumptions and the failure on some platforms (if this is indeed the problem)
What this means for the real use of the pcap query system

I think we can address the test first, and may have to discuss the second.

cestella · 2017-02-26T06:26:00Z

The performance penalties are minimal. The number of files equals the number of reducers, which does not scale with the data and is user-specifiable. Also, we are just sorting the file handles here, not the contents, so OOM errors are very unlikely. The contents are sorted by virtue of MapReduce, and the files are named in an ordered way by virtue of our custom partitioner; this change just ensures that the files are processed in order.

I'm not treating this as just a test problem. This is a problem of our assumptions not being correct. This could be a problem for the real pcap system, not just the test, if people are using a non-HDFS implementation. For HDFS, it's probably not an issue (honestly, I'm not even sure of that in all cases, and there is no guarantee the behavior won't change, since it's not mandated), but I'd rather own our assumptions than depend on FileSystem operations that do not necessarily conform to them.

ottobackwards · 2017-02-26T13:06:33Z

Thank you for clarifying, that makes sense. Sorry to confuse the issue

cestella · 2017-02-27T15:21:52Z

OK, I spun this up, tested it, and got the results I expect, but I'd like some independent confirmation from @kylerichardson.

kylerichardson · 2017-02-28T03:06:10Z

+1 passes unit and integration tests, ran through @cestella's test script successfully

Thanks for your patience and for fixing!

mmiklavc · 2017-02-28T19:43:16Z

Also +1 on this. Ran the script above.

Just a note, the Vagrant quick-dev install for pcap doesn't work via vagrant --ansible-tags="pycapa" provision. Not sure what others have done here, but I ended up installing manually using the following procedure.

# set env vars
export PYCAPA_HOME=/opt/pycapa
export PYTHON27_HOME=/opt/rh/python27/root

# Install these packages via yum (RHEL, CentOS)
yum -y install epel-release centos-release-scl 
yum -y install "@Development tools" python27 python27-scldevel python27-python-virtualenv libpcap-devel libselinux-python

# Setup directories
mkdir -p $PYCAPA_HOME && chmod 755 $PYCAPA_HOME

# Create virtualenv (inside $PYCAPA_HOME, which is where it is activated from below)
export LD_LIBRARY_PATH="/opt/rh/python27/root/usr/lib64"
${PYTHON27_HOME}/usr/bin/virtualenv ${PYCAPA_HOME}/pycapa-venv

# Copy pycapa
# copy incubator-metron/metron-sensors/pycapa from the Metron source tree into $PYCAPA_HOME on the node you would like to install pycapa on.

# Build it
cd ${PYCAPA_HOME}/pycapa
# activate the virtualenv
source ${PYCAPA_HOME}/pycapa-venv/bin/activate
pip install -r requirements.txt
python setup.py install

# Run it
cd ${PYCAPA_HOME}/pycapa-venv/bin
pycapa --producer --topic pcap -i eth1 -k node1:6667

ottobackwards · 2017-03-01T15:31:15Z

+1, and i'm now seeing this in my personal travis

kylerichardson · 2017-03-01T18:15:10Z

@mmiklavc I ran into the same issue with installing pycapa on quick-dev. My solution was to tweak the playbook to run the pycapa role as part of the sensor-stubs tag.

METRON-743: Sort the files when reading results from Pcap

22ad59d

asfgit closed this in e416a7d Mar 1, 2017
