Broadcast and DOP Issue in Flink #32

Open
carabolic opened this issue May 30, 2016 · 14 comments

carabolic commented May 30, 2016

It seems that there is an issue with how Flink handles broadcast DataSets.

Problem

Let's assume we have a Flink cluster with N = 20 nodes and T = 2 tasks per node, hence DOP = 20 * 2 = 40. If we now have a job that reads inputSize = 5 MB of data into a single DataSet and subsequently broadcasts this DataSet to the mappers (with max DOP), the data gets broadcast to every mapper in isolation, which means broadcastSize = DOP * inputSize = 40 * 5 MB = 200 MB needs to be transferred over the network.
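For reference, the pattern in question looks roughly like the following (Scala API) sketch. The DataSet names and sizes are made up, but withBroadcastSet and getBroadcastVariable are the standard calls, and every parallel instance of the consuming mapper materializes its own copy of the broadcast set:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val smallSet: DataSet[Long] = env.generateSequence(1, 1000)    // the small (~5 MB) side
    val largeSet: DataSet[Long] = env.generateSequence(1, 1000000) // the mapper input

    largeSet
      .map(new RichMapFunction[Long, Long] {
        private var sum = 0L
        override def open(parameters: Configuration): Unit = {
          // every parallel task instance fetches its own copy of "smallSet"
          sum = getRuntimeContext.getBroadcastVariable[Long]("smallSet").asScala.sum
        }
        override def map(value: Long): Long = value + sum
      })
      .withBroadcastSet(smallSet, "smallSet")
      .output(new DiscardingOutputFormat[Long])

    env.execute("broadcast sketch")
  }
}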

In our case it becomes obvious when running the LinRegDS.dml script on flink_hybrid. The second Flink job involves MapmmFLInstruction, which broadcasts the smaller matrix to all the mappers. For a DOP of 250 this results in about 10 GB of broadcast data.

Solution

Since all tasks per node run in the same JVM, it would be better to broadcast to the TaskManagers only, which then pass a simple reference to the tasks they are responsible for. For the example above this reduces the size of the broadcast to broadcastSize = N * inputSize = 20 * 5 MB = 100 MB.

For the LinRegDS.dml use case this fix would reduce the size of the broadcast by a factor of 16. Hence only 10 GB / 16 = 0.625 GB of data would need to be broadcast.

Workaround

For now this can be mitigated by setting the DOP very low for jobs that include a broadcast.
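Continuing the sketch above, the workaround would look roughly like this (the value 4 is arbitrary, and MyBroadcastMapper stands for the broadcast-reading RichMapFunction from that sketch):

// cap the DOP globally ...
env.setParallelism(4)

// ... or only for the operator that consumes the broadcast set
largeSet
  .map(new MyBroadcastMapper())            // RichMapFunction as in the sketch above
  .withBroadcastSet(smallSet, "smallSet")
  .setParallelism(4)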

Follow Up

I will investigate a bit more to see whether this is a known issue in Flink and whether there are already ways to work around the problem, maybe even opening a PR against Flink.

@carabolic carabolic self-assigned this May 30, 2016
@aalexandrov

@carabolic This can also explain some of the results observed by @bodegfoh with respect to the logreg benchmark of Spark vs. Flink


akunft commented May 31, 2016

It would be nice if we could have a simple example showing this and open a JIRA issue in Flink.
@carabolic Could you do that?

@aalexandrov

@carabolic Maybe you can bootstrap a new Peel bundle, modify the wordcount job into something that better highlights the issue, execute it on one of the clusters, and provide the chart as part of the JIRA.

The IBM machines have 48 cores on each node, which should make the effect quite visible.


carabolic commented May 31, 2016

I'm preparing a bundle right now. The idea is to generate a vector: DataSet[Long] with |vector| = (s * 1024 * 1024) / 8 elements, where s is the desired size in MB that is to be broadcast. This DataSet vector is then broadcast to each mapper. The number of mappers can be controlled by a dop parameter.
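Roughly like this (not the final bundle code; env is the Flink ExecutionEnvironment, s and dop are the two parameters mentioned above, and the broadcast wiring via withBroadcastSet stays as in the sketch in the issue description):

// ~s MB worth of Longs (8 bytes each); generateSequence is the stock Flink generator
val vector: DataSet[Long] = env.generateSequence(1, s.toLong * 1024 * 1024 / 8)

// one dummy element per mapper, so that exactly dop task instances are started,
// each of which will pull the full broadcast vector
val mappers: DataSet[Long] = env.generateSequence(1, dop).setParallelism(dop)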

@aalexandrov

Sounds good.


aalexandrov commented May 31, 2016

To make sure that N mappers are started, you can create the source via fromParallelCollection with a NumberSequenceIterator of N elements and subsequently set the DOP to N:

import org.apache.flink.api.scala._
import org.apache.flink.util.NumberSequenceIterator

environment
  .fromParallelCollection(new NumberSequenceIterator(1, N))
  .setParallelism(N)

@carabolic

I've uploaded the initial version of the bundle to GitHub, and our assumption seems to be true. Here are my initial results for a 10 MB broadcast DataSet, obtained manually from the Flink web frontend:

#TaskManagers   #Tasks (total)   Size of broadcast (actual)   Size of broadcast (expected)
16              20               200 MB                       160 MB
25              400              4000 MB                      250 MB

And the results from running ./peel.sh query:runtimes --connection h2 broadcast.dev:

suite           name                      min (ms)   max (ms)   median (ms)
broadcast.dev   broadcast.scale.up.10     5309       8212       22993
broadcast.dev   broadcast.scale.up.20     7591       13909      26392
broadcast.dev   broadcast.scale.up.30     9976       14223      33992
broadcast.dev   broadcast.scale.up.400    104152     123440     343497

So the issue seems to be real. It also seems to have a devastating impact on the runtime of the jobs.


akunft commented May 31, 2016

Thanks for the work.

It would be nice to have a run with a fixed number of TaskManagers (25) and an increasing number of slots per TaskManager (1, 2, 4, 8, 16). I think this would reflect quite nicely the overhead of broadcasting the vector once per slot instead of once per TaskManager.

@aalexandrov

Can we rerun with Peel rc4?

@aalexandrov

I fixed some data bugs and added an event extractor for Dataflows. We should be able to generate some network utilization plots per TaskManager if you add dstat as a system dependency for your experiment.

@aalexandrov

Some preliminary results from a run on cloud-11:

--------------------------------------------------------------------------------------------------------------------------------------------
name                               median time (ms)                   run                                run id                             
--------------------------------------------------------------------------------------------------------------------------------------------
broadcast.01                       10799                              1                                  1506351146                         
broadcast.02                       22574                              1                                  1506352107                         
broadcast.04                       31579                              2                                  1506354030                         
broadcast.08                       59960                              2                                  1506357874                         
broadcast.16                       119608                             3                                  1506385744                         
--------------------------------------------------------------------------------------------------------------------------------------------


akunft commented Jun 6, 2016

I think these results show the problem and we can open a JIRA issue.
What do you think?

@aalexandrov

Alright, the WIP benchmark can be pointed to as well.

@FelixNeutatz

I have a first prototype running to solve the broadcast issue: https://github.com/FelixNeutatz/incubator-flink/commits/experimentWithBroadcast
