This repository was archived by the owner on Aug 20, 2025. It is now read-only.

Conversation

@mmiklavc
Contributor

Note: @cestella and I picked up Justin's work while he's out, but attribution for this PR should go to @justinleet.

Original testing plan here

| Test Area | Sub area | Script or description |
| --- | --- | --- |
| Smoke test | | Ensure data flows through the topologies to the index from a base full-dev install |
| HBase Enrichment | Streaming Enrichment | #127 (comment) |
| | Batch Enrichment | #131 (comment) |
| Stellar Enrichment | | Add ENRICHMENT_GET calls in both the enrichment topology as well as a field transformation in the parser |
| MaaS | | Ensure that in addition to this script, we also remove the model and check that data errors out if no model exists. #210 (comment) |
| Pcap | Ingest | #256 (comment) |
| Profiler | | We will have to create one from https://github.com/apache/incubator-metron/tree/master/metron-analytics/metron-profiler, but at the least we should ensure that we can pull data that the profiler puts in HBase from the enrichment topology. |
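
As an illustration of the Stellar Enrichment test item above (the field names and enrichment type here are hypothetical, chosen only for the sketch), a parser field transformation invoking ENRICHMENT_GET might look like:

```json
{
  "fieldTransformations": [
    {
      "transformation": "STELLAR",
      "output": ["whois_info"],
      "config": {
        "whois_info": "ENRICHMENT_GET('whois', ip_src_addr, 'enrichment', 't')"
      }
    }
  ]
}
```

The same ENRICHMENT_GET call can also be exercised from the enrichment topology's Stellar config, which covers both halves of the test item.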

Big thanks to @ottobackwards for assisting with testing and verification.

@cestella
Member

A couple of points of context around what changed here:

  • Kafka shifted some of its integration-testing infrastructure, so the in-memory Kafka component we use for integration testing had to change.
  • The Ansible blueprint was changed to use the HDP 2.5 stack.
  • There was a bug in the PCap backend: our use of MultiScheme assumed that the ByteBuffers passed in were exactly filled with our data, as opposed to wrapping a slice of an existing larger buffer.
  • The HBase dependencies brought in various Hadoop 2.5.1 dependencies. This was a long-standing issue, but when the classpath shuffled around, it bit us all.
  • Apache Curator 2.10.0 is now a dependency of the storm-kafka component. The service discovery piece of this version seems to have a bug that prevents it from working with Model as a Service. We are keeping the dependency at 2.7.1 for the time being, but should look into it further in a follow-on JIRA.
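
To make the MultiScheme bullet concrete, here is a minimal standalone sketch (not Metron's actual code) of the pitfall: `ByteBuffer.array()` returns the entire backing array, so a buffer that wraps a slice of a larger buffer must be read via its position/limit instead:

```java
import java.nio.ByteBuffer;

public class ByteBufferSlice {
    // Extract exactly the bytes this buffer views, honoring position/limit.
    // Calling buffer.array() directly would return the whole backing array,
    // including bytes outside the slice.
    public static byte[] toBytes(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.duplicate().get(out); // duplicate() so the caller's position is untouched
        return out;
    }

    public static void main(String[] args) {
        byte[] backing = "HEADER-payload-FOOTER".getBytes();
        // A buffer viewing only the "payload" slice of a larger buffer
        ByteBuffer slice = ByteBuffer.wrap(backing, 7, 7);
        System.out.println(new String(toBytes(slice)));  // payload
        System.out.println(slice.array().length);        // 21 -- the whole backing array
    }
}
```

Code that assumed "the ByteBuffer is filled with our data" would effectively do `new String(slice.array())` and pick up the header and footer bytes too.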

One thing to note is that when this PR is merged, we will need to regenerate the image used in Vagrant for quick-dev.

@mmiklavc
Contributor Author

The Dyn DNS attacks seem to be affecting the Travis website. Will check on this later.

@dlyle65535
Contributor

I ran this up on EC2 and found a couple of issues:

  1. Flume has the wrong java_home, so the flume-agent process will not start. This is an ongoing issue each time we change the stack version. I recommend we modify the deployment to either:
  • Use Ambari to install Flume, or
  • Remove Flume entirely and just pipe alert.csv to kafka-console-producer
  2. The /tmp directory on the enrichment host was filled with files of the form md5.jar. They filled the entire 50 GB volume, and as a result the topologies couldn't start. I can dig into this a bit more next week.

@mmiklavc
Contributor Author

What should the path be for EC2? In the Flume Ansible scripts I see the following: `java_home: /usr/jdk64/jdk1.8.0_60`
I've created a separate Jira to track the Flume migration to Ambari: https://issues.apache.org/jira/browse/METRON-511

@cestella
Member

Can we close and reopen this so Travis runs again, @mmiklavc?

@cestella
Member

It appears that the first issue (snort) is being addressed as part of METRON-514. Any insight on what's going on with the /tmp/ directory issue?

@mmiklavc
Contributor Author

There also appears to be a version mismatch between the storm-kafka client and server versions due to HDP: HDP 2.5 pulls in some commits from a later version of Storm. http://stackoverflow.com/questions/39932441/storm-ui-throwing-offset-lags-for-kafka-not-supported-for-older-versions-pleas

This doesn't appear to affect whether the topologies work, but it does make it look like there's a problem in the UI. Should we hold off on this PR until this is resolved? One suggestion is to leverage Maven profiles to enable building against different repos. There would still need to be an Apache version that is built and deployed with EC2, full-dev, and quick-dev, and all of those environments are based on HDP bits, meaning the lag error would show up there regardless of what we do with profiles. We might also try modifying the classpath used when submitting the topology to use the local storm-kafka bits. Community thoughts welcome.

(screenshot: topology-warning)
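
As a rough illustration of the profile idea (the property name, version string, and repository URL below are assumptions for the sketch, not this PR's actual settings), a Maven profile that swaps the Storm version and adds a vendor repository might look like:

```xml
<!-- Hypothetical sketch: property name, version, and repo URL are illustrative only -->
<profile>
  <id>HDP-2.5.0.0</id>
  <properties>
    <!-- Override the Storm version used for compilation -->
    <global_storm_version>1.0.1.2.5.0.0-1245</global_storm_version>
  </properties>
  <repositories>
    <repository>
      <id>hortonworks</id>
      <url>http://repo.hortonworks.com/content/repositories/releases</url>
    </repository>
  </repositories>
</profile>
```

A build would then activate it with `mvn clean install -P<profile-id>`, while the default profile continues to build against the Apache bits.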

@mmiklavc
Contributor Author

@cestella the /tmp issue did not reappear after a redeploy

@cestella
Member

So long as the warning does not affect functionality (any storm committer around want to comment on this assertion?), I would vote that it is not a blocker for this PR. I would suggest a follow-on PR to introduce profiles to support the HDP repos if we really don't like this error.

If the /tmp/ issue is not reappearing, is the warning issue the last concern before getting this in?

@dlyle65535
Contributor

The /tmp issue didn't recur.

I'd like to see if we can get this out without the error displaying in the UI. I was confused by it, and I suspect others would be as well.

@dlyle65535
Contributor

@cestella - I do. It's something we'll need to do anyway, may as well. I can't think of anything easier that'd do it.

@mmiklavc
Contributor Author

I've added a profile and am currently testing this out.

@mmiklavc
Contributor Author

mmiklavc commented Oct 26, 2016

I'm now seeing a unit test failure when swapping out Apache Storm 1.0.1 for the HDP repo version. Tests pass in IntelliJ, not on the CLI. Investigating.

Results :
Failed tests:
  SpoutConfigTest.testEmptyConfigApplication:107 expected:<10000> but was:<0>
  SpoutConfigTest.testIncompleteConfigApplication:90 expected:<10000> but was:<0>

EDIT:
Looks like these tests are verifying some default values which have changed in KafkaConfig.java. I'm probably going to remove these assertions because it doesn't seem like we want to be testing Kafka's defaults.

@dlyle65535
Contributor

It looks like those are just testing defaults that we don't actually set. Do I have that right?

@mmiklavc
Contributor Author

Modifying build versions for Storm removes the Storm Kafka lag error from the UI.

(screenshot: storm-kafka-lag-ok)

@mmiklavc
Contributor Author

Ok, looks like Travis is failing due to a license check. PMC members, do we need to run this for all profiles, or just the default?

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=192m; support was removed in 8.0
Traceback (most recent call last):
  File "build_utils/verify_license.py", line 41, in <module>
    raise ValueError("Unable to find " + component + " in acceptable list of components: " + sys.argv[1])
ValueError: Unable to find com.101tec:zkclient:jar:0.8:compile in acceptable list of components: ./dependencies_with_url.csv

The command "mvn -q integration-test install && build_utils/verify_licenses.sh" exited with 1.

@mmiklavc
Contributor Author

I just added a commit that should address the issues with the licenses. I've modified verify_license.py to print a list of all offending licenses at once rather than printing them one by one. Also, the script will now check licenses for the default profile as well as the HDP-2.5.0.0 profile.

@mmiklavc
Contributor Author

Before we accept this, I want to point out that I've changed the dependencies_with_url.csv file and that it's probably worth a look.

@dlyle65535
Contributor

dlyle65535 commented Oct 28, 2016

This ran well on EC2: deployment was good, expected data flow was good, and Kafka offset tracking worked as expected.

I'm +1, but there are some things to do prior to pulling this in, or Quick Dev and the Docker containers will break:

  • Stage a new Quick Dev image on Atlas
  • Create new Docker containers with HDP 2.5

I'm all set, +1, great job all!

@ottobackwards
Contributor

Are there instructions for doing either of those things?

@dlyle65535
Contributor

Yes.

The Packer stuff is part of Metron and those instructions are in a README.

The Docker stuff is something I maintain as a courtesy to the community based on docker-ambari. My fork with the latest jdk8 stuff is here. That's what I intend to update to use HDP 2.5.

@dlyle65535
Contributor

@mmiklavc - I had to make a small tweak to the Quick Dev Vagrantfile for the new image. It's backward compatible, fwiw. I just added ambari-slave to the default tags.

Do you want that as a PR against your branch or a separate Jira/PR pair?

@mmiklavc
Contributor Author

@dlyle65535 I don't have a strong opinion on it. I'm giving attribution to @justinleet on this PR, since he laid the foundation of pretty much everything here. If you file separately we can give you attribution for the vagrant change.

@mmiklavc
Contributor Author

A note for the community: the /tmp file problem did recur for us. As it turns out, the default timeout for starting up topologies in Monit was set too low. Normally, Storm cleans up after itself whether a topology submission succeeds or fails, but because of Monit's timeout setting, the submission process was being killed prior to completion. As a result, the temporary jar files were left behind in /tmp, and Monit continued to retry every minute or two, quickly filling the disk with the ~70MB uber jars.
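
For context, Monit's start timeout is configured per check. The sketch below shows the relevant knob; the process name, pidfile, and script paths are hypothetical, not Metron's actual Monit configuration:

```
# Hypothetical Monit check; names and paths are illustrative only
check process enrichment_topology with pidfile /var/run/enrichment.pid
  # Monit's default start/stop timeout is 30 seconds; submitting a ~70MB
  # uber jar to Storm can take longer, so give it more headroom.
  start program = "/usr/local/bin/start_enrichment.sh" with timeout 60 seconds
  stop program  = "/usr/local/bin/stop_enrichment.sh"
```

If the timeout fires mid-submission, Monit kills the process and retries, which is exactly the leave-jars-and-retry loop described above.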

@dlyle65535
Contributor

@mmiklavc PR sent. Thanks!

Make sure ambari-agent has started prior to starting services.
@ottobackwards
Contributor

I think this Monit timeout issue is part of the problem on low-resource machines, and is also related to the 'zombie' Storm threads being left behind.

@dlyle65535
Contributor

@ottobackwards - concur. If startup exceeds the start/stop timeout (default 30 seconds), Monit will terminate the start/stop process and try again. So 60 seconds worked on my larger machines and in my Quick Dev testing, but may not be correct for everybody. Maybe monitor and adjust if necessary?

@cestella
Member

I checked the licensing changes @mmiklavc and they look sensible to me.

@cestella
Member

@dlyle65535 gave it a provisional +1 (pending docker images), but I want to pile on with a +1 (non-binding since I have some commits in here). Great job @mmiklavc seeing this to completion. Very non-trivial, so kudos.

@asfgit asfgit closed this in e317050 Oct 31, 2016
@DimaKovalyov

Hello,

I am not sure if this is a good place to jump in, but I have installed Metron with HDP 2.5 using this great article:
https://community.hortonworks.com/articles/60805/deploying-a-fresh-metron-cluster-using-ambari-serv.html

After fixing a few issues here and there, I was able to get it running. However, all my Storm topologies are showing:
Topology spouts lag error:
kafkaSpout KAFKA Unable to get offset lags for kafka. Reason: java.lang.NullPointerException at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.getOffsetLags(KafkaOffsetLagUtil.java:269) at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.main(KafkaOffsetLagUtil.java:127)
(screenshot: storm_kafka_error)

Given that Michael Miklavcic said "Modifying build versions for Storm removes the Storm Kafka lag error from the UI," I ran my Maven build like this:
mvn clean install -DskipTests -PHDP2.5.0.0

Is there anything else I need to do in order to get Storm working with HDP 2.5?
Please advise.
Thank you.

p.s. I've used the latest Metron code-base from the Apache incubator.

@james-sirota

Once you have data in your kafka queue this should go away.

@JonZeolla
Member

Would it make sense to put an instantiation or genesis message on the topic to avoid this in the future? Are there other ways to suppress this message on initial startup?

Jon


@DimaKovalyov

DimaKovalyov commented Nov 8, 2016

Thank you James,

Once you have data in your kafka queue this should go away.

That is true! Once I create a topic and stream data through it the error is gone.

My data is now going to enrichment, and all of the bolts and spouts are hitting this error:
java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at org.apache.zookeeper.ClientCnxn.start(ClientCnxn.java:417) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:450) at ... java.lang.Thread.run(Thread.java:745)

And supervisor crashes also after 5-10 minutes with:

2016-11-08 14:25:56.125 o.a.s.event [ERROR] Error when processing event
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at java.lang.UNIXProcess.initStreams(UNIXProcess.java:289)
        at java.lang.UNIXProcess.lambda$new$2(UNIXProcess.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:258)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        at java.lang.Runtime.exec(Runtime.java:620)
        at org.apache.storm.shade.org.apache.commons.exec.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:58)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.launch(DefaultExecutor.java:254)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:319)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:160)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:147)
        at org.apache.storm.util$exec_command_BANG_.invoke(util.clj:402)
        at org.apache.storm.util$send_signal_to_process.invoke(util.clj:429)
        at org.apache.storm.util$kill_process_with_sig_term.invoke(util.clj:454)
        at org.apache.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:290)
        at org.apache.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:435)
        at clojure.core$partial$fn__4527.invoke(core.clj:2492)
        at org.apache.storm.event$event_manager$fn__7248.invoke(event.clj:40)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

This happens even though I have more than 30 GB of RAM available. Do I need to tune Storm for better memory usage?
Please advise.

- Dima
