env.fromCollection is used when the matrix is read from HDFS as a single block, which is then distributed as multiple blocks.
The alternative used in Spark is not yet implemented in the FlinkExecutionContext: there, the matrix is read directly from HDFS as a JavaPairRDD<MatrixIndices, MatrixBlock>. Spark only falls back to env.fromCollection-style distribution if the memory budget is too small to hold two blocks (the read block and its partitioned version).
To get a robust version, we could always use the hadoopFile method and avoid akka altogether. The issue might still remain at other points, though, especially at collect() calls.
Another option might be to read the single block as a DataSet with one element and split it up into multiple blocks in a flatMap to increase parallelism.
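The splitting step of that option can be sketched in plain Java, without a Flink dependency. The Tile class below is a hypothetical stand-in for the (MatrixIndices, MatrixBlock) pair; inside a Flink flatMap, each tile would be emitted via the Collector so that downstream operators can process the blocks in parallel:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the splitting a flatMap would perform: one large row-major
 *  matrix is cut into blockSize x blockSize tiles (Flink API omitted). */
public class BlockSplit {

    /** Hypothetical stand-in for a (MatrixIndices, MatrixBlock) pair. */
    static final class Tile {
        final int blockRow, blockCol;
        final double[][] values;
        Tile(int br, int bc, double[][] v) { blockRow = br; blockCol = bc; values = v; }
    }

    static List<Tile> split(double[][] m, int blockSize) {
        int rows = m.length, cols = m[0].length;
        List<Tile> out = new ArrayList<>();
        for (int br = 0; br * blockSize < rows; br++) {
            for (int bc = 0; bc * blockSize < cols; bc++) {
                // Boundary tiles may be smaller than blockSize x blockSize.
                int h = Math.min(blockSize, rows - br * blockSize);
                int w = Math.min(blockSize, cols - bc * blockSize);
                double[][] tile = new double[h][w];
                for (int i = 0; i < h; i++)
                    for (int j = 0; j < w; j++)
                        tile[i][j] = m[br * blockSize + i][bc * blockSize + j];
                out.add(new Tile(br, bc, tile));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A 5x7 matrix with 3x3 blocks yields 2 block rows * 3 block cols = 6 tiles.
        List<Tile> tiles = split(new double[5][7], 3);
        System.out.println(tiles.size()); // prints 6
    }
}
```

In the real implementation the single-element DataSet would carry the read block, and the flatMap would emit one record per tile; the loop structure is the same.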
We finally ran the linear regression example without fromCollection(), reading only from HDFS. The fromCollection() path no longer triggers the akka.framesize exception, but now a subsequent collect() call does.
So we have two choices: either we set the framesize dynamically, or we only read from and write to HDFS.
The program will halt if collect() or env.fromCollection() calls send messages that exceed the akka.framesize (default 10 MB).
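To get a feel for how quickly the 10 MB default is exceeded, a rough back-of-the-envelope estimate (my own approximation, ignoring serialization overhead) counts 8 bytes per double cell of a dense block:

```java
public class FramesizePayload {
    // Rough payload estimate for a dense block shipped via collect():
    // 8 bytes per double cell, serialization overhead ignored.
    static long denseBytes(long rows, long cols) {
        return rows * cols * 8L;
    }

    public static void main(String[] args) {
        long limit = 10L * 1024 * 1024; // 10 MB default akka.framesize
        System.out.println(denseBytes(1144, 1144) < limit); // prints true
        System.out.println(denseBytes(1145, 1145) < limit); // prints false
    }
}
```

So a dense square matrix of roughly 1145 x 1145 doubles is already too large for a single message under the default setting.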
The problem is that we cannot know in advance how large the communication payload will be for a given DML script. The current workaround is to increase this parameter (up to 512 MB), which is based on trial and error and not generic.
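For reference, the trial-and-error workaround amounts to a flink-conf.yaml entry along these lines (the 512 MB value is the upper bound mentioned above, not a recommendation; the value takes a byte-size suffix):

```yaml
# flink-conf.yaml -- raise the maximum Akka message size
# (default is 10485760b, i.e. 10 MB)
akka.framesize: 536870912b
```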
There is an open PR with a corresponding Jira issue that proposes a fix using the blob manager for large client communication.
For now, we will try to rewrite the env.fromCollection() call that we suspect causes the error.