This repository was archived by the owner on Aug 20, 2025. It is now read-only.

Conversation

@mmiklavc
Contributor

Note: @cestella and I picked up Justin's work while he's out, but attribution for this PR should go to @justinleet.

Original testing plan here

| Test Area | Sub area | Script or description |
| --- | --- | --- |
| Smoke test | | Ensure data flows through the topologies to the index from a base full-dev install |
| HBase Enrichment | Streaming Enrichment | #127 (comment) |
| | Batch Enrichment | #131 (comment) |
| Stellar Enrichment | | Add ENRICHMENT_GET calls in both the enrichment topology as well as a field transformation in the parser |
| MaaS | | Ensure that in addition to this script, we also remove the model and check that data errors out if no model exists. #210 (comment) |
| Pcap | Ingest | #256 (comment) |
| Profiler | | We will have to create one from https://github.com/apache/incubator-metron/tree/master/metron-analytics/metron-profiler, but at the least we should ensure that we can pull data that the profiler puts in HBase from the enrichment topology. |
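
As an illustration of the Stellar Enrichment test item above (the field names and enrichment type here are hypothetical, chosen only for the sketch), a parser field transformation invoking ENRICHMENT_GET might look like:

```json
{
  "fieldTransformations": [
    {
      "transformation": "STELLAR",
      "output": ["whois_info"],
      "config": {
        "whois_info": "ENRICHMENT_GET('whois', ip_src_addr, 'enrichment', 't')"
      }
    }
  ]
}
```

The same ENRICHMENT_GET call can also be exercised from the enrichment topology's Stellar config, which covers both halves of the test item.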

Big thanks to @ottobackwards for assisting with testing and verification.

@cestella
Member

A couple of points of context around what changed here:

  • Kafka shifted some of its integration-testing infrastructure, so the in-memory Kafka component we use for integration testing had to change.
  • The Ansible blueprint was changed to use the HDP 2.5 stack.
  • There was a bug in the PCap backend: our use of MultiScheme assumed that the ByteBuffers passed in were exactly filled with our data, as opposed to wrapping a slice of an existing larger buffer.
  • The HBase dependencies brought in various Hadoop 2.5.1 dependencies. This was a long-standing issue, but when the classpath shuffled around, it bit us all.
  • Apache Curator 2.10.0 is now a dependency of the storm-kafka component. The service discovery piece of this version seems to have a bug that prevents it from working with Model as a Service. We are keeping the dependency at 2.7.1 for the time being, but should look into it further in a follow-on JIRA.
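
To make the MultiScheme bullet concrete, here is a minimal standalone sketch (not Metron's actual code) of the pitfall: `ByteBuffer.array()` returns the entire backing array, so a buffer that wraps a slice of a larger buffer must be read via its position/limit instead:

```java
import java.nio.ByteBuffer;

public class ByteBufferSlice {
    // Extract exactly the bytes this buffer views, honoring position/limit.
    // Calling buffer.array() directly would return the whole backing array,
    // including bytes outside the slice.
    public static byte[] toBytes(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.duplicate().get(out); // duplicate() so the caller's position is untouched
        return out;
    }

    public static void main(String[] args) {
        byte[] backing = "HEADER-payload-FOOTER".getBytes();
        // A buffer viewing only the "payload" slice of a larger buffer
        ByteBuffer slice = ByteBuffer.wrap(backing, 7, 7);
        System.out.println(new String(toBytes(slice)));  // payload
        System.out.println(slice.array().length);        // 21 -- the whole backing array
    }
}
```

Code that assumed "the ByteBuffer is filled with our data" would effectively do `new String(slice.array())` and pick up the header and footer bytes too.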

One thing to note is that when this PR is merged, we will need to regenerate the image used in Vagrant for quick-dev.

@mmiklavc
Contributor Author

The Dyn DNS attacks seem to be affecting the Travis website. Will check on this later.

@dlyle65535
Contributor

I ran this up on EC2 and found a couple of issues:

  1. Flume has the wrong java_home, so the flume-agent process will not start. This is an ongoing issue each time we change the stack version. I recommend we modify the deployment to either:
  • Use Ambari to install Flume, or
  • Remove Flume entirely and just pipe alert.csv to kafka-console-producer
  2. The /tmp directory on the enrichment host was filled with files of the form md5.jar. They filled the entire 50 GB volume, and as a result the topologies couldn't start. I can dig into this a bit more next week.

@mmiklavc
Contributor Author

What should the path be for EC2? In the Flume Ansible scripts I see the following: `java_home: /usr/jdk64/jdk1.8.0_60`
I've created a separate Jira to track the Flume migration to Ambari: https://issues.apache.org/jira/browse/METRON-511

@cestella
Member

Can we close and reopen this so Travis runs again, @mmiklavc?

@cestella
Member

It appears that the first issue (snort) is being addressed as part of METRON-514. Any insight on what's going on with the /tmp/ directory issue?

@mmiklavc
Contributor Author

There also appears to be a version mismatch between the storm-kafka client and server versions due to HDP: HDP 2.5 pulls in some commits from a later version of Storm. http://stackoverflow.com/questions/39932441/storm-ui-throwing-offset-lags-for-kafka-not-supported-for-older-versions-pleas

This doesn't appear to affect whether the topologies work, but it does make it look like there's a problem in the UI. Should we hold off on this PR until this is resolved? One suggestion is to leverage Maven profiles to enable building against different repos. There would still need to be an Apache version that is built and deployed with EC2, full-dev, and quick-dev, and all of those environments are based on HDP bits, meaning the lag error would show up there regardless of what we do with profiles. We might also try modifying the classpath used when submitting the topology to use the local storm-kafka bits. Community thoughts welcome.

(screenshot: topology-warning)
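
As a rough illustration of the profile idea (the property name, version string, and repository URL below are assumptions for the sketch, not this PR's actual settings), a Maven profile that swaps the Storm version and adds a vendor repository might look like:

```xml
<!-- Hypothetical sketch: property name, version, and repo URL are illustrative only -->
<profile>
  <id>HDP-2.5.0.0</id>
  <properties>
    <!-- Override the Storm version used for compilation -->
    <global_storm_version>1.0.1.2.5.0.0-1245</global_storm_version>
  </properties>
  <repositories>
    <repository>
      <id>hortonworks</id>
      <url>http://repo.hortonworks.com/content/repositories/releases</url>
    </repository>
  </repositories>
</profile>
```

A build would then activate it with `mvn clean install -P<profile-id>`, while the default profile continues to build against the Apache bits.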

@mmiklavc
Contributor Author

@cestella the /tmp issue did not reappear after a redeploy

@cestella
Member

So long as the warning does not affect functionality (any storm committer around want to comment on this assertion?), I would vote that it is not a blocker for this PR. I would suggest a follow-on PR to introduce profiles to support the HDP repos if we really don't like this error.

If the /tmp/ issue is not reappearing, is the warning issue the last concern before getting this in?

@dlyle65535
Contributor

The /tmp issue didn't recur.

I'd like to see if we can get this out without the error displaying in the UI. I was confused by it, and I suspect others would be as well.

@dlyle65535
Contributor

@cestella - I do. It's something we'll need to do anyway, may as well. I can't think of anything easier that'd do it.

@mmiklavc
Contributor Author

I've added a profile and am currently testing this out.

@mmiklavc
Contributor Author

mmiklavc commented Oct 26, 2016

I'm now seeing a unit test failure when swapping out Apache Storm 1.0.1 for the HDP repo version. Tests pass in IntelliJ, not on the CLI. Investigating.

Results :
Failed tests:
  SpoutConfigTest.testEmptyConfigApplication:107 expected:<10000> but was:<0>
  SpoutConfigTest.testIncompleteConfigApplication:90 expected:<10000> but was:<0>

EDIT:
Looks like these tests are verifying some default values which have changed in KafkaConfig.java. I'm probably going to remove these assertions because it doesn't seem like we want to be testing Kafka's defaults.

@dlyle65535
Contributor

It looks like those are just testing defaults that we don't actually set. Do I have that right?

@mmiklavc
Contributor Author

Modifying build versions for Storm removes the Storm Kafka lag error from the UI.

(screenshot: storm-kafka-lag-ok)

@mmiklavc
Contributor Author

Ok, looks like Travis is failing due to a license check. PMC members, do we need to run this for all profiles, or just the default?

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=192m; support was removed in 8.0
Traceback (most recent call last):
  File "build_utils/verify_license.py", line 41, in <module>
    raise ValueError("Unable to find " + component + " in acceptable list of components: " + sys.argv[1])
ValueError: Unable to find com.101tec:zkclient:jar:0.8:compile in acceptable list of components: ./dependencies_with_url.csv

The command "mvn -q integration-test install && build_utils/verify_licenses.sh" exited with 1.

@mmiklavc
Contributor Author

I just added a commit that should address the issues with the licenses. I've modified verify_license.py to print a list of all offending licenses at once rather than printing them one by one. Also, the script will now check licenses for the default profile as well as the HDP-2.5.0.0 profile.

@mmiklavc
Contributor Author

Before we accept this, I want to point out that I've changed the dependencies_with_url.csv file and that it's probably worth a look.

@dlyle65535
Contributor

dlyle65535 commented Oct 28, 2016

This ran well on EC2: deployment was good, expected data flow was good, and Kafka offset tracking worked as expected.

I'm +1, but there are some things to do prior to pulling this in, or Quick Dev and the Docker containers will break:

  • Stage a new Quick Dev image on Atlas
  • Create new Docker containers with HDP 2.5

I'm all set, +1, great job all!

@ottobackwards
Contributor

Are there instructions for doing either of those things?

@dlyle65535
Contributor

Yes.

The Packer stuff is part of Metron and those instructions are in a README.

The Docker stuff is something I maintain as a courtesy to the community based on docker-ambari. My fork with the latest jdk8 stuff is here. That's what I intend to update to use HDP 2.5.

@dlyle65535
Contributor

@mmiklavc - I had to make a small tweak to the Quick Dev Vagrantfile for the new image. It's backward compatible, fwiw. I just added ambari-slave to the default tags.

Do you want that as a PR against your branch or a separate Jira/PR pair?

@mmiklavc
Contributor Author

@dlyle65535 I don't have a strong opinion on it. I'm giving attribution to @justinleet on this PR, since he laid the foundation of pretty much everything here. If you file separately we can give you attribution for the vagrant change.

@mmiklavc
Contributor Author

A note for the community: the /tmp file problem did recur for us. As it turns out, the default timeout for starting up topologies in Monit was set too low. Normally, Storm cleans up after itself whether a topology submission succeeds or fails, but because of Monit's timeout setting, the submission process was being killed prior to completion. As a result, the temporary jar files were left behind in /tmp, and Monit continued to retry every minute or two, quickly filling the disk with the ~70MB uber jars.
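
For context, Monit's start timeout is configured per check. The sketch below shows the relevant knob; the process name, pidfile, and script paths are hypothetical, not Metron's actual Monit configuration:

```
# Hypothetical Monit check; names and paths are illustrative only
check process enrichment_topology with pidfile /var/run/enrichment.pid
  # Monit's default start/stop timeout is 30 seconds; submitting a ~70MB
  # uber jar to Storm can take longer, so give it more headroom.
  start program = "/usr/local/bin/start_enrichment.sh" with timeout 60 seconds
  stop program  = "/usr/local/bin/stop_enrichment.sh"
```

If the timeout fires mid-submission, Monit kills the process and retries, which is exactly the leave-jars-and-retry loop described above.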

@dlyle65535
Contributor

@mmiklavc PR sent. Thanks!

Make sure ambari-agent has started prior to starting services.
@ottobackwards
Contributor

I think this Monit timeout issue is part of the problem on low-resource machines, and is also related to the 'zombie' Storm threads being left behind.

@dlyle65535
Contributor

@ottobackwards - concur. If startup exceeds the start/stop timeout (default 30 seconds), Monit will terminate the start/stop process and try again. So 60 seconds worked on my larger machines and in my Quick Dev testing, but may not be correct for everybody. Maybe monitor and adjust if necessary?

@cestella
Member

I checked the licensing changes @mmiklavc and they look sensible to me.

@cestella
Member

@dlyle65535 gave it a provisional +1 (pending docker images), but I want to pile on with a +1 (non-binding since I have some commits in here). Great job @mmiklavc seeing this to completion. Very non-trivial, so kudos.

@asfgit asfgit closed this in e317050 Oct 31, 2016
@DimaKovalyov

Hello,

I am not sure if this is a good place to jump in, but I have installed Metron with HDP 2.5 using this great article:
https://community.hortonworks.com/articles/60805/deploying-a-fresh-metron-cluster-using-ambari-serv.html

After fixing a few issues here and there, I was able to get it running. However, all my Storm topologies are showing:
Topology spouts lag error:
kafkaSpout KAFKA Unable to get offset lags for kafka. Reason: java.lang.NullPointerException at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.getOffsetLags(KafkaOffsetLagUtil.java:269) at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.main(KafkaOffsetLagUtil.java:127)
(screenshot: storm_kafka_error)

Given that Michael Miklavcic said "Modifying build versions for Storm removes the Storm Kafka lag error from the UI," I ran my Maven build like this:
mvn clean install -DskipTests -PHDP2.5.0.0

Is there anything else I need to do in order to get Storm working with HDP 2.5?
Please advise.
Thank you.

p.s. I've used the latest Metron code-base from the Apache incubator.

@james-sirota

Once you have data in your kafka queue this should go away.

@JonZeolla
Member

Would it make sense to put an instantiation or genesis message on the topic to avoid this in the future? Are there other ways to suppress this message on initial startup?

Jon


@DimaKovalyov

DimaKovalyov commented Nov 8, 2016

Thank you James,

Once you have data in your kafka queue this should go away.

That is true! Once I create a topic and stream data through it the error is gone.

My data is now going to enrichment, and all of the bolts and spouts are hitting this error:
java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at org.apache.zookeeper.ClientCnxn.start(ClientCnxn.java:417) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:450) at ... java.lang.Thread.run(Thread.java:745)

And supervisor crashes also after 5-10 minutes with:

2016-11-08 14:25:56.125 o.a.s.event [ERROR] Error when processing event
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at java.lang.UNIXProcess.initStreams(UNIXProcess.java:289)
        at java.lang.UNIXProcess.lambda$new$2(UNIXProcess.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:258)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        at java.lang.Runtime.exec(Runtime.java:620)
        at org.apache.storm.shade.org.apache.commons.exec.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:58)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.launch(DefaultExecutor.java:254)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:319)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:160)
        at org.apache.storm.shade.org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:147)
        at org.apache.storm.util$exec_command_BANG_.invoke(util.clj:402)
        at org.apache.storm.util$send_signal_to_process.invoke(util.clj:429)
        at org.apache.storm.util$kill_process_with_sig_term.invoke(util.clj:454)
        at org.apache.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:290)
        at org.apache.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:435)
        at clojure.core$partial$fn__4527.invoke(core.clj:2492)
        at org.apache.storm.event$event_manager$fn__7248.invoke(event.clj:40)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

This happens even though I have more than 30 GB of RAM available. Do I need to tune Storm for better memory usage?
Please advise.

- Dima
