Broadcast and DOP Issue in Flink #32
Comments
@carabolic This can also explain some of the results observed by @bodegfoh with respect to the logreg benchmark of Spark vs. Flink.
It would be nice if we could have a simple example showing this and open a JIRA issue in Flink.
@carabolic Maybe you can bootstrap a new Peel bundle, modify the wordcount job into something that better highlights the issue, execute it on one of the clusters, and provide the chart as part of the JIRA issue. The IBM machines have 48 cores on each node, which should make the effect quite visible.
I'm preparing a bundle right now. The idea is to generate a
Sounds good.
To make sure that environment
    .fromParallelCollection(new NumberSequenceIterator(1, N))
    .setParallelism(N)
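For illustration, a minimal sketch of the kind of job such a bundle could run (not the actual bundle code; the class name, the element counts, and the "numbers" broadcast variable name are made up): generate a number sequence in parallel and hand it to a map operator as a broadcast set, so the broadcast cost can be observed while varying the DOP.

```java
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.NumberSequenceIterator;

public class BroadcastDopExample {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        final long n = 1_000_000L; // number of elements in the broadcast set (illustrative)

        // Generate the broadcast data in parallel, as in the snippet above.
        DataSet<Long> toBroadcast = env
                .fromParallelCollection(new NumberSequenceIterator(1L, n), Long.class);

        // Some larger data set that is processed with the full DOP.
        DataSet<Long> input = env.generateSequence(1L, 100L * n);

        long count = input
                .map(new RichMapFunction<Long, Long>() {
                    private long broadcastElements;

                    @Override
                    public void open(Configuration parameters) {
                        // Every parallel map task materializes the broadcast set here,
                        // which is where the per-task (rather than per-TaskManager)
                        // network transfer shows up.
                        List<Long> bc = getRuntimeContext().getBroadcastVariable("numbers");
                        broadcastElements = bc.size();
                    }

                    @Override
                    public Long map(Long value) {
                        return value + broadcastElements;
                    }
                })
                .withBroadcastSet(toBroadcast, "numbers")
                .count(); // triggers execution

        System.out.println("Processed " + count + " records.");
    }
}
```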
I've uploaded the initial version of the bundle to GitHub, and our assumption seems to be true. Here are my initial results for a 10mb broadcast DataSet, obtained manually from the Flink web frontend:
And the results from running
So the issue seems to be real. It also seems to have a devastating impact on the runtime of the jobs.
Thanks for the work. It would be nice to have a run with a stable number of task managers (25) and an increasing number of slots per task manager (1, 2, 4, 8, 16). I think this would reflect quite nicely the overhead of broadcasting the vector to each slot instead of once per task manager.
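As a rough estimate, assuming the 10mb broadcast DataSet from above and 25 task managers: with per-slot broadcasting the transferred volume grows with the slot count, roughly 25 * 1 * 10mb = 250mb, 25 * 2 * 10mb = 500mb, up to 25 * 16 * 10mb = 4000mb at 16 slots, whereas per-task-manager broadcasting would stay at 25 * 10mb = 250mb regardless of the slot count.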
Can we rerun with Peel rc4?
I fixed some data bugs and added an event extractor for Dataflows, so we should be able to generate some network utilization plots per TaskManager if you add dstat as a system dependency for your experiment.
Some preliminary results from a run on cloud-11:
I think these results show the problem, and we can open a JIRA issue.
Alright, the WIP benchmark can be pointed to as well.
I have a first prototype running to solve the broadcast issue: https://github.com/FelixNeutatz/incubator-flink/commits/experimentWithBroadcast
It seems that there is an issue with how Flink handles broadcast DataSets.
Problem
Let's assume we have a Flink cluster with `N = 20` nodes and `T = 2` tasks per node, hence `DOP = 20 * 2 = 40`. If we now have a job that reads `inputSize = 5mb` of data into a single dataset and subsequently broadcasts this dataset to the mappers (running with the maximum `DOP`), the data gets broadcast to every mapper in isolation, which means `broadcastSize = DOP * inputSize = 40 * 5mb = 200mb` needs to be transferred over the network.
In our case this becomes obvious when running the `LinRegDS.dml` script on `flink_hybrid`. The second Flink job involves `MapmmFLInstruction`, which broadcasts the smaller matrix to all the mappers. For a DOP of 250 this results in about 10GB of broadcasted data.
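As a rough sanity check on these numbers: a broadcast volume of about 10GB at a DOP of 250 implies that the smaller matrix is roughly `10GB / 250 ≈ 40mb`, so every additional parallel task adds another ~40mb of network traffic for this single broadcast.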
Solution
Since all tasks on a node run in the same JVM, it would be better to broadcast to the TaskManagers only, which then pass a simple reference to the tasks they are responsible for. For the example above this reduces the size of the broadcast to `broadcastSize = N * inputSize = 20 * 5mb = 100mb`.
For the `LinReg.dml` use case this fix will reduce the size of the broadcast by a factor of 16, hence only `10GB / 16 = 0.625GB` of data needs to be broadcast.
Workaround
For now this can be mitigated by setting the DOP really low for jobs that include a broadcast.
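A minimal sketch of what that workaround could look like in the DataSet API (the names `largeInput`, `smallMatrix`, and `MultiplyWithBroadcastMatrix` are hypothetical and not taken from the actual SystemML code):

```java
// Fragment only: keep the parallelism of the broadcast-consuming operator low,
// e.g. one task per node, so the broadcast set is shipped far fewer times.
largeInput
    .map(new MultiplyWithBroadcastMatrix())          // a RichMapFunction that reads the broadcast variable
    .withBroadcastSet(smallMatrix, "smallMatrix")    // the small matrix as a broadcast DataSet
    .setParallelism(20);                             // e.g. number of nodes, instead of the full DOP of 250
```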
Follow Up
I will investigate a little more to see whether this is a known issue in Flink and whether there are already ways to work around the problem, and maybe even open a PR against Flink.