Priority Queue Serialization Error #3
Hi, I'm getting:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 946.0 in stage 16.0 (TID 1971) had a not serializable result: scala.collection.mutable.PriorityQueue$$anon$3
```
Apologies for the delay in responding. Could you please provide a minimal snippet of code that can reproduce the problem?

It occurred when I tried to use your ModelTest.scala code.

I'm getting the same issue using the Scala 2.10 version. @LaurelStan are you using Scala 2.10? I think it may be related to https://issues.scala-lang.org/browse/SI-7568
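For anyone trying to understand the failure mode in SI-7568: the root cause there is anonymous inner classes (created internally for the queue's Ordering) that do not implement Serializable. A rough standalone Java analogy (my own sketch, not code from scanns) shows the same shape of failure when a `java.util.PriorityQueue` holds an anonymous Comparator:

```java
import java.io.*;
import java.util.*;

public class AnonComparatorDemo {
    // Try to Java-serialize an object, as Spark does with task results.
    static boolean serializes(Object o) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Natural ordering: no comparator object is captured, so the queue serializes.
        PriorityQueue<Integer> plain = new PriorityQueue<>(Arrays.asList(3, 1, 2));

        // Anonymous comparator: the anonymous class does not implement Serializable,
        // so serializing the whole queue fails -- the same shape of failure SI-7568
        // describes for the anonymous classes inside scala.collection.mutable.PriorityQueue.
        PriorityQueue<Integer> withAnon = new PriorityQueue<>(new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return b.compareTo(a);
            }
        });
        withAnon.addAll(Arrays.asList(3, 1, 2));

        System.out.println("natural ordering: " + serializes(plain));
        System.out.println("anonymous comparator: " + serializes(withAnon));
    }
}
```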
@joerenner I use this library on Scala 2.11 (Spark 2.3.0) and still get this problem.
Hi @namitk, I use your library in a Java environment. Here is sample code:

```java
w2v.printSchema();

RDD<Tuple2<Object, Vector>> map = w2v.javaRDD()
        .map(row -> new Tuple2<>(row.getAs("index"),
                (org.apache.spark.ml.linalg.Vector) row.getAs("vectors")))
        .rdd();
System.out.println("toRDD:" + map.count());

LSHNearestNeighborSearchModel<CosineSignRandomProjectionModel> model =
        new CosineSignRandomProjectionNNS("signrp")
                .setNumHashes(300)
                .setSignatureLength(15)
                .setJoinParallelism(5000)
                .setBucketLimit(1000)
                .setShouldSampleBuckets(true)
                .setNumOutputPartitions(100)
                .createModel(vectorSize);

// get 100 nearest neighbors for each item in items from within itself
// RDD[(Long, Long, Double)]
RDD<Tuple3<Object, Object, Object>> selfAllNearestNeighbors =
        model.getSelfAllNearestNeighbors(map, 6);
selfAllNearestNeighbors.toJavaRDD().foreach(tu ->
        System.out.println(tu._1() + " " + tu._2() + " " + tu._3()));
```

The schema prints, `toRDD:3` is logged, and then the executor fails with:

```
[2018-06-13 09:53:33,873][ERROR] Executor : Exception in task 2325.0 in stage 16.0 (TID 3350)
```
The Scala ticket that @joerenner pointed to indicates this issue was fixed in Scala 2.11.0-M7. That said, I came across BoundedPriorityQueue in the Spark code base, which I hadn't used due to it being private to org.apache.spark. However, I noticed that they use the Java priority queue, probably to avoid this same issue. I'll look into whether I can do something similar. At least on my side, I only get this issue while running Spark in local mode; the library seems to run fine for me in cluster mode. I don't fully understand why, so I'll try to dig further.
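As a quick sanity check on this workaround direction: `java.util.PriorityQueue` implements `Serializable` and survives a serialization round-trip of the kind Spark performs on task results. A minimal standalone sketch (my own, not library code):

```java
import java.io.*;
import java.util.PriorityQueue;

public class JavaQueueRoundTrip {
    public static void main(String[] args) throws Exception {
        // java.util.PriorityQueue implements Serializable, unlike the
        // anonymous-class-laden scala.collection.mutable.PriorityQueue
        // on pre-2.11.0-M7 Scala.
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        pq.add(3);
        pq.add(1);
        pq.add(2);

        // Round-trip through Java serialization.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(pq);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            PriorityQueue<Integer> copy = (PriorityQueue<Integer>) ois.readObject();
            // Heap order is preserved: poll() still returns the smallest element.
            System.out.println("restored head: " + copy.poll());
        }
    }
}
```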
Hi @namitk, is there any update on this issue?
Hi @nicola007b, I tried the same thing as you and swapped out the Scala PriorityQueue for java.util.PriorityQueue, but I then ran into the same serialization error. Could this new issue be caused by the mutable.Map or mutable.ArrayBuffer in LSHNearestNeighborSearchModel.scala, considering that the Scala PriorityQueue is in fact a mutable.PriorityQueue and part of the same mutable collections library?
@oscaroboto thanks for following up on this. I believe the issue is in
Also got this issue. After replacing all the Iterators in the NearestNeighborIterator class (except the outer Iterator of the return type) with Seq/List, the exception is gone.
@Itfly would you mind posting exactly what you changed please? |
@Itfly I used IndexedSeq instead of Seq/List. Seq is a special case of Iterable and gave the same error. List worked too, but I thought IndexedSeq might give better performance.
no need to modify TopNQueue |
@oscaroboto @Itfly would be great if you could open a PR with the fix |
@nicola007b doing some tests and will PR soon |
@MarkTickner I changed the Iterator members and added an index to support the original bucket iterator's traversal. I'm guessing it's because Iterator does not support serialization, so none of the members in NearestNeighborIterator should be Iterator objects, including the return object of the overridden method.
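To illustrate the point about Iterator members (a standalone sketch of my own, with made-up class names, not the actual scanns classes): an object holding a plain `Iterator` field fails Java serialization, while the same data materialized into an `ArrayList` succeeds:

```java
import java.io.*;
import java.util.*;

public class IteratorFieldDemo {
    // Holder with a plain Iterator field: JDK iterator implementations are
    // not Serializable, so serializing this object fails.
    static class WithIterator implements Serializable {
        Iterator<Integer> it = Arrays.asList(1, 2, 3).iterator();
    }

    // Holder with the data materialized into an ArrayList, which is Serializable.
    static class WithList implements Serializable {
        ArrayList<Integer> data = new ArrayList<>(Arrays.asList(1, 2, 3));
    }

    // Try to Java-serialize an object, reporting success or failure.
    static boolean serializes(Object o) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("iterator field: " + serializes(new WithIterator()));
        System.out.println("list field: " + serializes(new WithList()));
    }
}
```

The same reasoning applies in Scala: materializing each Iterator member into a serializable collection (Seq, List, or IndexedSeq, as discussed above) before the object is shipped avoids the error.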
@oscaroboto thanks for your advice. I'm a scala/spark beginner. |
@namitk @oscaroboto @Itfly did you solve this issue?
@zhihuawon Here's my modification, hope this can help you too.
Then use
@Itfly thanks a lot |
Sorry for the delay everyone. I finally got a chance to make the PR. |
Any updates? It's not working for me.
I am also facing the same issue. |
I am also facing the same issue. Has the issue been fixed?
If this is a dead project, could you mark it as such instead of wasting people's time? |
This appears to be a dead project 😔 |
Hey folks, I'm really sorry I haven't been on top of things and didn't notice @oscaroboto PR #6. I switched jobs in Oct 2018 and it seems this isn't being maintained by anyone at LinkedIn anymore. After leaving, I am no longer working in an environment where I use or have access to a spark cluster so haven't been keeping up with the project. I have commented on the PR however and will be happy to review if one of you can help fix this issue. Apologies :( |
Hi @namitk! Thank you for your reply! I just made a PR.
@namitk For me, I got this error using scanns on databricks, with scala version 2.11. Then I made all the changes based on the pull request by @oscaroboto, and scanns works now. |
Fix issue described here: LinkedInAttic#3. Reference: https://github.com/LinkedInAttic/scanns/pull/7/files
In ModelTest.scala it is documented that:
"// This is unable to be run currently due to some PriorityQueue serialization issues"
This is the error I am running into. Are there any workarounds for this? I am currently unable to print any results without a solution to this problem.