Skip to content

Test & Fix for deadlock in Cascading-Local #47

Closed
cchepelov wants to merge 60 commits intocwensel:wip-3.0from
cchepelov:local_fork_hashjoin_deadlock
Closed

Test & Fix for deadlock in Cascading-Local #47
cchepelov wants to merge 60 commits intocwensel:wip-3.0from
cchepelov:local_fork_hashjoin_deadlock

Conversation

@cchepelov
Copy link
Copy Markdown
Contributor

When running scalding's test suite, a test freezes and dies (SkewJoinPipeTest)

This PR provides a simplified test case that exhibits the same blocking behaviour + provides a fix

The fix works by splitting all edges downstream from a Fork into a separate thread, on the (fair) assumption that downstream consumers are already thread-hardened anyway.
This affects only cascading-local, other backends keep using the same Fork mechanisms as they used to before.

cwensel and others added 30 commits June 25, 2015 09:48
…rks to fully leverage declared field type information to reduce serialized data and perform bitwise equality comparisons. See c.t.h.TupleSerializationProps to disable bitwise comparisons.
…rms through the c.f.FlowRuntimeProps "cascading.flow.runtime.splits.combine" property. If enabled, will induce c.t.h.Hfs to enable combined files support on the MapReduce platforms.
…n, and a method #applyFields() to update field names.
…nd serialization dependencies and updated cascading-hadoop2-mr1 and cascading-hadoop2-tez to depend on cascading-hadoop2-io.
…operations based on Janino and it isolate the Janino dependency. A dependency to cascading-expression must be added to projects that depend on the isolated classes.
…ors to store additional metadata on c.f.FlowStep instances.
…splicing back into a c.p.HashJoin could create an invalid plan.
@cchepelov
Copy link
Copy Markdown
Contributor Author

Back on

The build is still running, but I already see test failures in Tez:

  • JoinFieldedPipesPlatformTest.testForkCoGroupThenHashJoinCoGroupAgain
  • JoinFieldedPipesPlatformTest.testForkThenJoin

JoinFieldedPipesPlatformTest.testForkThenJoin returns zero items on tez, whereas it returns 5 items on -local, -hadoop, -hadoop-mr1. @cwensel , I'm afraid (unless I failed something) I've found a tez vs. rest of the world discrepancy.

JoinFieldedPipesPlatformTest.testForkCoGroupThenHashJoinCoGroupAgain fails clusterside, because of a missed copyFromLocal() of one input file (strange that it now passes). ==> pushing a quick commit for that.

(update: now down to JoinFieldedPipesPlatformTest.testForkThenJoin which fails only under tez. Identical behaviour between java7u91 and java8u72)

@cchepelov cchepelov force-pushed the local_fork_hashjoin_deadlock branch from 0c7ec68 to ba5fde3 Compare January 11, 2016 17:46
@cchepelov cchepelov force-pushed the local_fork_hashjoin_deadlock branch from 55463d6 to 43bb327 Compare January 12, 2016 12:54
@cchepelov
Copy link
Copy Markdown
Contributor Author

Patch updated and smashed back into 1 commit.
One test @Ignore'd as it breaks on Tez (really a separate issue I think).

@fs111
Copy link
Copy Markdown
Collaborator

fs111 commented Jan 13, 2016

JoinFieldedPipesPlatformTest.testForkCoGroupThenHashJoinCoGroupAgain failed again on Tez (linux and java 7). Are you sure, you send me the latest code?

@cchepelov
Copy link
Copy Markdown
Contributor Author

What? Double checking:
now at 43bb327. Using gradle 2.10 from gradle.com. Using Java7
LC_ALL=en_US.UTF-8 gradle clean build install now running

Will advise once it's complete (cc @simondumas)

Update:

BUILD SUCCESSFUL

Total time: 3 hrs 36 mins 7.098 secs
Stopped 0 compiler daemon(s).

toolchain manifest:


Gradle 2.10

Build time:   2015-12-21 21:15:04 UTC
Build number: none
Revision:     276bdcded730f53aa8c11b479986aafa58e124a6

Groovy:       2.4.4
Ant:          Apache Ant(TM) version 1.9.3 compiled on December 23 2013
JVM:          1.7.0_91 (Oracle Corporation 24.91-b01)
OS:           Linux 4.3.0-1-amd64 amd64```

@fs111 were you able to recover a stack trace or something from the failed test on your side?

cchepelov pushed a commit to cchepelov/scalding that referenced this pull request Jan 13, 2016
@cchepelov
Copy link
Copy Markdown
Contributor Author

Same on the Oracle 7u80 jdk:

ORACLE java7u80 cchepelov@TP12:~/workspace/3rd-party/cascading$ git describe
 3.0-wip-74-159-g43bb327
ORACLE java7u80 cchepelov@TP12:~/workspace/3rd-party/cascading$ gradle -v

------------------------------------------------------------
Gradle 2.10
------------------------------------------------------------

Build time:   2015-12-21 21:15:04 UTC
Build number: none
Revision:     276bdcded730f53aa8c11b479986aafa58e124a6

Groovy:       2.4.4
Ant:          Apache Ant(TM) version 1.9.3 compiled on December 23 2013
JVM:          1.7.0_80 (Oracle Corporation 24.80-b11)
OS:           Linux 4.3.0-1-amd64 amd64```

LC_ALL=en_US.UTF-8 gradle clean :cascading-hadoop2-tez:platformTest --tests=cascading.JoinFieldedPipesPlatformTest.testForkCoGroupThenHashJoinCoGroupAgain --stacktrace --info

(snip)

    2016-01-14 10:01:37,611 INFO  flow.Flow (BaseFlow.java:logInfo(1433)) - [cogrouping]  completed in: 00:13.426, using cpu time: 00:10.410
    2016-01-14 10:01:37,640 WARN  server.AuthenticationFilter (AuthenticationFilter.java:doFilter(497)) - AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
    2016-01-14 10:01:37,708 INFO  cascading.PlatformTestCase (PlatformTestCase.java:tearDown(189)) - copying to local /home/cchepelov/workspace/3rd-party/cascading/cascading-hadoop2-tez/build/test/output/hadoop2-tez/joinfieldedpipesplatform/testForkCoGroupThenHashJoinCoGroupAgain/join
Gradle Test Executor 40 finished executing tests.
Finished generating test XML results (0.076 secs) into: /home/cchepelov/workspace/3rd-party/cascading/cascading-hadoop2-tez/build/test-results
Generating HTML test report...
Finished generating test html results (0.07 secs) into: /home/cchepelov/workspace/3rd-party/cascading/cascading-hadoop2-tez/build/reports/tests
:cascading-hadoop2-tez:platformTest (Thread[main,5,main]) completed. Took 56.981 secs.

BUILD SUCCESSFUL

Total time: 1 mins 22.045 secs

I guess at this stage, I'm stuck…

@cwensel
Copy link
Copy Markdown
Owner

cwensel commented Jan 14, 2016

I’ve fixed the known underlying issues in wip-3.1.

as for this patch, is there anyway we can not add new test functions and use existing ones like Identity or ExpressionFunction. would be nice to keep this class from growing too large.

with those changes we an re-test and make the commit/push/publish.

@cchepelov cchepelov force-pushed the local_fork_hashjoin_deadlock branch from 43bb327 to 9624012 Compare January 15, 2016 12:49
@cchepelov
Copy link
Copy Markdown
Contributor Author

  • rebased on wip-3.1
  • removed custom functions in JoinFieldedPipesPlatformTest, replaced with ExpressionFunctions

cchepelov pushed a commit to cchepelov/scalding that referenced this pull request Jan 15, 2016
@cwensel
Copy link
Copy Markdown
Owner

cwensel commented Jan 15, 2016

I think some cleanup needs to happen here. the patch doesn’t apply, might have to do with the fact its 1.6M and attempts to update every file in my wip-3.1 branch.

update: let me just apply the singular commit..

@cchepelov
Copy link
Copy Markdown
Contributor Author

Le 15/01/2016 19:21, Chris K Wensel a écrit :

I think some cleanup needs to happen here. the patch doesn’t apply,
might have to do with the fact its 1.6M and attempts to update every
file in my wip-3.1 branch.

What? I must have made something really bad with git/github. Let me try
to fix this

@cchepelov cchepelov force-pushed the local_fork_hashjoin_deadlock branch from 9624012 to ea14fb2 Compare January 15, 2016 18:33
…g. Scalding's SkewedJoin)

    (fix expected output)

    cascading-local: use ParallelFork to avoid deadlocks in Fork-HashJoin scenarios

    java7 compatibility

    (to be fair, workStealingThreadPool worked by chance, as there are more cores on my machine than the maximum amount of inbound pipes in the joins present downstream from a fork in the test suite)

    (forgotten test input file needs to go to the platform)

    Disabling the cascading.JoinFieldedPipesPlatformTest.testForkThenJoin test, which fails under Tez (for an unrelated cause)

    remove spurious janino import

    removed custom functions, replaced by ExpressionFunctions

Author: Cyrille Chépélov (TP12) <cch@transparencyrights.com>
Date:   Thu Jan 7 10:44:04 2016 +0100
@cchepelov cchepelov force-pushed the local_fork_hashjoin_deadlock branch from ea14fb2 to e2ba161 Compare January 15, 2016 18:34
@cchepelov
Copy link
Copy Markdown
Contributor Author

Turns out I started this on the wrong branch (wip-3.0). While I could rebase my branch off wip-3.1, I can't rebase the PR itself, I have to open a new one.

@cchepelov
Copy link
Copy Markdown
Contributor Author

The action continues here: #48

@cchepelov cchepelov closed this Jan 15, 2016
cchepelov pushed a commit to cchepelov/scalding that referenced this pull request Jan 20, 2016
cchepelov pushed a commit to cchepelov/scalding that referenced this pull request Jan 21, 2016
cchepelov pushed a commit to cchepelov/scalding that referenced this pull request Jan 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants