This repository has been archived by the owner on Jan 29, 2022. It is now read-only.

BSONLoader has trouble with highly nested Mongo documents #113

Open
wants to merge 2 commits into
base: master

gminerbo commented Aug 7, 2014

I encountered problems using the BSONLoader to read highly nested Mongo documents.

For example,

myPigRelation = LOAD '/data/mydata.bson' USING com.mongodb.hadoop.pig.BSONLoader(
'id',
'id:chararray,
 var1:chararray,
 var2:tuple(
   mySub1:bag{tuple(id:int, pos:int, app:int, fav:chararray)},
   mySub2:bag{tuple(id:int, sp:int, pos:int, app:int, fav:chararray)},
   mySub3:bag{tuple(id:int,sp:int, pos:int, app:int, fav:chararray)},
   mySub4:bag{tuple(id:chararray, pos:int, app:int, fav:chararray)},
   mySub5:bag{tuple(id:int, sport:int, pos:int, app:int, fav:chararray)},
   mySub6:bag{tuple(id:chararray, value:chararray, app:int)},
   mySub7:bag{tuple(id:int, type:chararray, text:chararray, app:int,
                    metadata:bag{tuple(name:chararray, value:chararray, pos:int)}
   )},
   mySub8:bag{tuple(id:int, app:int, pos:int)} 
 ),
 var3:tuple(avatars:bag{tuple(url:chararray, type:int, approved:chararray)})'
);

The changes contained herein resolved the problem for me. The use of BasicDBObject and BasicDBList rather than BasicBSONObject and BasicBSONList seems like a mistake to me, yet my confidence is low, as I would have expected this to have been discovered by now.
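
A minimal, self-contained Java sketch of the suspected cause follows. BsonObjectStandIn and DbObjectStandIn are hypothetical stand-ins, not the driver's classes; the only thing they assume is that the DB type extends the BSON type, as BasicDBObject extends BasicBSONObject in the Java driver. Under that assumption, a cast to the subclass fails whenever a decoder hands back the parent type, while a cast to the parent accepts both.

```java
import java.util.LinkedHashMap;

// Hypothetical stand-ins for the driver classes. Only the inheritance
// relationship (the DB type extends the BSON type) mirrors the real driver.
class BsonObjectStandIn extends LinkedHashMap<String, Object> { }

class DbObjectStandIn extends BsonObjectStandIn { }

final class CastDemo {
    // Narrow cast: succeeds only if the decoded value is the subclass.
    static String readWithSubclassCast(Object decoded) {
        DbObjectStandIn doc = (DbObjectStandIn) decoded;
        return String.valueOf(doc.get("id"));
    }

    // Cast to the parent type: accepts parent and subclass instances alike.
    static String readWithParentCast(Object decoded) {
        BsonObjectStandIn doc = (BsonObjectStandIn) decoded;
        return String.valueOf(doc.get("id"));
    }

    public static void main(String[] args) {
        // Models a decoder that returns the parent type.
        BsonObjectStandIn plain = new BsonObjectStandIn();
        plain.put("id", "abc");

        System.out.println(readWithParentCast(plain));
        try {
            readWithSubclassCast(plain);
        } catch (ClassCastException e) {
            System.out.println("narrow cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Whether the loader actually receives the parent type in every nested position is not confirmed in this thread; the sketch only shows that casting to the parent type is the safer choice if it does.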

Adding the LoadMetadata interface is just a convenience, but it was required due to the other processing I needed to do in my Pig script.

I can try to create a test case (bson file) for this, but it will take some time, as I cannot give away my data.

-Geoff

@evanchooly
Contributor

Thanks for the patch. You're probably right that working up the class hierarchy is the correct fix. My biggest concern with this patch is the lack of an associated test. Could you try to attach a unit test for this? Testing is less than ideal for the Pig code right now, but I'm trying to fix that, too.

@gminerbo
Author

I added a simple unit test.

llvtt added the pig label Mar 13, 2015

llvtt commented Apr 28, 2015

Hi @gminerbo
I realize that this PR is pretty old at this point, but I'm interested in getting to the bottom of what's causing this problem. Could you share the stack trace you get when trying to read in these heavily-nested documents in Pig? Please also share or link logs from Pig from the session. Thanks!

@gminerbo
Author

gminerbo commented May 1, 2015

Hello,

Unfortunately, I've changed jobs and lost access to this data. It would take considerable effort at this point to produce a repro case. My apologies for the inconvenience.
-Geoff


dggrj commented Sep 25, 2015

I can be of assistance here. If the BSONLoader doesn't implement the LoadMetadata interface (which the MongoLoader does!), then you get a pretty generic "Projected field does not exist" error when trying to operate on the data:

Pig Stack Trace

ERROR 1025:
<line 2, column 25> Invalid field projection. Projected field [createdate] does not exist.

org.apache.pig.impl.plan.PlanValidationException: ERROR 1025:
<line 2, column 25> Invalid field projection. Projected field [createdate] does not exist.
    at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)
    at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:215)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:80)
    at org.apache.pig.newplan.logical.relational.LOFilter.accept(LOFilter.java:79)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1716)
    at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1649)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:541)
    at org.apache.pig.Main.main(Main.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Perhaps there's a way to sideload or reload the schema, but since it has already been loaded once, that shouldn't be necessary!
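
As a rough illustration of why a missing schema can surface as this error, the following Java sketch models alias-to-column resolution. It is a simplified, hypothetical model and not Pig's implementation: the class AliasResolutionSketch, the method findColumn, and the use of a null list to represent a loader that reports no schema to the planner are all assumptions made for illustration. The field names are taken from this thread.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of alias resolution; this is not Pig's code.
final class AliasResolutionSketch {
    // A null schema models a loader that reports no schema to the planner.
    static int findColumn(List<String> schema, String alias) {
        int idx = (schema == null) ? -1 : schema.indexOf(alias);
        if (idx < 0) {
            throw new IllegalStateException(
                "Projected field [" + alias + "] does not exist.");
        }
        return idx;
    }

    public static void main(String[] args) {
        // With a schema available, the alias resolves to a column index.
        List<String> declared = Arrays.asList("id", "createdate");
        System.out.println(findColumn(declared, "createdate"));

        // Without one, the same alias cannot be resolved at all.
        try {
            findColumn(null, "createdate");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the data itself is never consulted: the failure happens while resolving names, which is consistent with the trace above failing during query validation rather than while reading records.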


dggrj commented Sep 25, 2015

Additionally, there's the ClassCastException I mentioned elsewhere in the source code, which occurs when using BasicDB[List/Object] instead of BasicBSON[List/Object]. I don't have a stack trace for that right now, but there seems to be nothing special or different about our BSON dumps.


llvtt commented Oct 6, 2015

@dggrj Thanks for the additional info here. Do you have a Pig script that can reproduce the issue? It would also be useful to have the stack trace of the ClassCastException to pinpoint the location of the problem. Thanks!


dggrj commented Oct 6, 2015

I'm sorry to say that I don't think I can revert things to the state where I got the CCE in order to capture a stack trace right now. I can tell you that I definitely saw it at line 136/137, inside the TUPLE conversion, and I suspect that if it hadn't hit there first, it would have hit inside the BAG conversion as well.

I will provide a variant of the schema of our BSON data that we're loading and the script and point out where different parts errored as best as I can.
