This repository was archived by the owner on Aug 20, 2025. It is now read-only.

Conversation

@cestella
Member

@cestella cestella commented Feb 26, 2017

The FileSystem.listFiles() call does not return files in sorted order, yet we assume sorted order for all FileSystem implementations. We should sort explicitly to be certain.
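
A minimal Python sketch of the idea (the actual fix lives in the Java query code; the file names below are illustrative):

```python
# FileSystem.listFiles() gives no ordering guarantee, so the file handles
# (paths) are sorted explicitly before processing. Only the handles are
# sorted, never the file contents, so memory use is proportional to the
# number of files, not the size of the data.
listing = [
    "pcap-data-20160617160549737+0000.pcap",
    "pcap-data-20160617160549735+0000.pcap",
    "pcap-data-20160617160549736+0000.pcap",
]
# Lexicographic order matches chronological order here because the file
# names embed a fixed-width timestamp.
ordered = sorted(listing)
print(ordered[0])  # -> pcap-data-20160617160549735+0000.pcap
```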

Testing

Get PCAP data into Metron: Install and set up pycapa; the instructions below mirror those in #93

  • Install the pycapa library & utility $ cd /opt/pycapa/pycapa && pip install -r requirements.txt && python setup.py install
  • (if using singlenode vagrant) Kill the enrichment and sensor topologies via for i in bro enrichment yaf snort;do storm kill $i;done
  • Start the pcap topology via $METRON_HOME/bin/start_pcap_topology.sh
  • Start the pycapa packet capture producer on eth1 via /usr/bin/pycapa --producer --topic pcap -i eth1 -k node1:6667
  • Watch the topology in the Storm UI and, when the number of packets ingested exceeds 3k, kill the packet capture utility started above.
  • Ensure that at least 3 files exist on HDFS by running hadoop fs -ls /apps/metron/pcap
  • Choose a file (denoted by $FILE) and dump a few of the contents using the pcap_inspector utility via $METRON_HOME/bin/pcap_inspector.sh -i $FILE -n 5
  • Choose one of the lines and note the protocol.
  • Note that when you run the commands below, the resulting file will be placed in the directory from which you kicked off the job.
  • Run a Stellar filter query by executing a command similar to the following, with the values noted above (match your start_time format to the date format provided; the default is millis since epoch):
    $METRON_HOME/bin/pcap_query.sh query -st "20160617" -df "yyyyMMdd" -query "protocol == 6" -rpf 500
  • Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. pcap-data-20160617160549737+0000.pcap
  • Copy the files to your local machine and verify you can open them in Wireshark. I chose a middle file and the last file. The middle file should have 500 records (per the records_per_file option), and the last one will likely have a number of records <= 500.
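
As an aside, the relationship between the -df pattern and the default millis-since-epoch start time can be sketched in Python (Java's yyyyMMdd pattern corresponds to Python's %Y%m%d; UTC is assumed here for illustration):

```python
from datetime import datetime, timezone

# Parse "20160617" (Java pattern yyyyMMdd, i.e. %Y%m%d in Python) and
# convert it to milliseconds since the epoch, the query's default
# start-time representation when no -df is given.
st = datetime.strptime("20160617", "%Y%m%d").replace(tzinfo=timezone.utc)
millis = int(st.timestamp() * 1000)
print(millis)  # -> 1466121600000
```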

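For a quick sanity check of record counts without opening Wireshark, a classic (non-ng) pcap file can be walked with the Python standard library alone. This is a hedged sketch: it assumes the classic libpcap file layout (24-byte global header, 16-byte per-record headers) and builds a tiny synthetic capture rather than reading a real query result.

```python
import io
import struct

def count_pcap_records(data: bytes) -> int:
    """Count packet records in a classic (non-ng) pcap byte stream."""
    buf = io.BytesIO(data)
    magic = buf.read(24)[:4]                 # global header is 24 bytes
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"                         # little-endian capture
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"                         # big-endian capture
    else:
        raise ValueError("not a classic pcap file")
    count = 0
    while True:
        rec = buf.read(16)                   # per-record header
        if len(rec) < 16:
            break
        (incl_len,) = struct.unpack(endian + "I", rec[8:12])
        buf.read(incl_len)                   # skip the packet payload
        count += 1
    return count

# Build a tiny synthetic little-endian pcap with 3 empty packets.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
record = struct.pack("<IIII", 0, 0, 0, 0)    # ts_sec, ts_usec, incl, orig
print(count_pcap_records(header + record * 3))  # -> 3
```

Pointed at one of the pcap-data-*.pcap output files instead of the synthetic bytes, the same walk would confirm the 500-records-per-file expectation.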
For all changes:

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at the Metron JIRA.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks has been executed in the root incubating-metron folder via:
mvn -q clean integration-test install && build_utils/verify_licenses.sh 
  • Have you written or updated unit tests and or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, run the following commands and verify the changes via site-book/target/site/index.html:
cd site-book
bin/generate-md.sh
mvn site:site

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci be set up for your personal repository so that your branches are built there before submitting a pull request.

@ottobackwards
Contributor

What are the performance penalties of doing this?

Part of the discussion in HADOOP-12009 that led to the clarification of this behavior in the spec was about why they don't do this themselves, due to the performance penalty.

Could you get OOM errors or other issues running large queries that would have worked before?

@ottobackwards
Contributor

There are two things here, I think:

  1. The test assumptions and the failure on some platforms (if this is indeed the problem)
  2. What this means for the real use of the pcap query system

I think we can address the test first; we may have to discuss the second.

@cestella
Member Author

The performance penalties are minimal. The number of files will equal the number of reducers, which does not scale with the data and is user-specifiable. Also, we are just sorting the file handles here, not the contents, so OOM errors are very unlikely. The contents are sorted by virtue of MapReduce, and the files are named in an ordered way by virtue of our custom partitioner; this just ensures that the files are processed in order.

I'm not treating this as just a test problem. This is a problem of our assumptions not being correct. This could be a problem for the real pcap system, not just the test, if people are using a non-HDFS implementation. For HDFS, it's probably not an issue (I'm not even sure of that in all cases, honestly, and since the ordering isn't mandated, there is no guarantee the behavior won't change), but I'd rather own our assumptions than depend on FileSystem operations that do not necessarily conform to them.

@ottobackwards
Contributor

Thank you for clarifying; that makes sense. Sorry to confuse the issue.

@cestella
Member Author

OK, I ran this up and tested it, and I got the results I expected, but I'd like some independent confirmation from @kylerichardson

@kylerichardson
Contributor

+1. Passes unit and integration tests; ran through @cestella's test script successfully.

Thanks for your patience and for fixing!

@mmiklavc
Contributor

Also +1 on this. Ran the script above.

Just a note, the Vagrant quick-dev install for pcap doesn't work via vagrant --ansible-tags="pycapa" provision. Not sure what others have done here, but I ended up installing manually using the following procedure.

# set env vars
export PYCAPA_HOME=/opt/pycapa
export PYTHON27_HOME=/opt/rh/python27/root

# Install these packages via yum (RHEL, CentOS)
yum -y install epel-release centos-release-scl 
yum -y install "@Development tools" python27 python27-scldevel python27-python-virtualenv libpcap-devel libselinux-python

# Setup directories
mkdir $PYCAPA_HOME && chmod 755 $PYCAPA_HOME

# Create the virtualenv in $PYCAPA_HOME, where it is referenced below
export LD_LIBRARY_PATH="/opt/rh/python27/root/usr/lib64"
cd ${PYCAPA_HOME}
${PYTHON27_HOME}/usr/bin/virtualenv pycapa-venv

# Copy pycapa
# copy incubator-metron/metron-sensors/pycapa from the Metron source tree into $PYCAPA_HOME on the node you would like to install pycapa on.

# Build it
cd ${PYCAPA_HOME}/pycapa
# activate the virtualenv
source ${PYCAPA_HOME}/pycapa-venv/bin/activate
pip install -r requirements.txt
python setup.py install

# Run it
cd ${PYCAPA_HOME}/pycapa-venv/bin
pycapa --producer --topic pcap -i eth1 -k node1:6667

@ottobackwards
Contributor

+1, and I'm now seeing this in my personal Travis.

@asfgit asfgit closed this in e416a7d Mar 1, 2017
@kylerichardson
Contributor

kylerichardson commented Mar 1, 2017

@mmiklavc I ran into the same issue with installing pycapa on quick-dev. My solution was to tweak the playbook to run the pycapa role as part of the sensor-stubs tag.

