
add GLEM model, TAGDataset and example of GLEM #9662

Open · wants to merge 8 commits into base: master
Conversation

@ECMGit (Contributor) commented Sep 15, 2024

reopened #9591

Feature summary:

  • Add GLEM as a GNN & LLM co-training model to PyG
  • Adapt GLEM's LM to AutoModelForSequenceClassification from transformers
  • LoRA support
  • LM/LLM support
  • ogbn-products/ogbn-arxiv testing finished
  • TAGDataset can be used as a wrapper class for any node classification dataset in PyG, adding an LM tokenizer and associated raw text
  • External predictions as pseudo labels supported
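The wrapper idea behind TAGDataset can be sketched as follows. This is a minimal illustration only, not the actual PyG API: the class name, constructor, and tokenizer here are hypothetical.

```python
# Minimal sketch of the text-attributed-graph wrapper idea (hypothetical API,
# NOT the actual TAGDataset): pair each node of a node-classification dataset
# with its raw text and a tokenized form, so an LM and a GNN can be trained
# over the same node set.
class TextAttributedWrapper:
    def __init__(self, num_nodes, raw_texts, tokenize):
        # One text string per node is required.
        assert len(raw_texts) == num_nodes
        self.raw_texts = raw_texts
        # Pre-tokenize once so the LM side can batch token ids directly.
        self.token_ids = [tokenize(text) for text in raw_texts]

    def __getitem__(self, idx):
        # The LM consumes the tokens; the GNN consumes the node index/graph.
        return idx, self.token_ids[idx]
```

For example, wrapping a two-node toy dataset with a whitespace "tokenizer" (`str.split`) yields `wrapper[0] == (0, ['paper', 'about', 'gnns'])` for the text `'paper about gnns'`.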


codecov bot commented Sep 15, 2024

Codecov Report

Attention: Patch coverage is 11.93182% with 155 lines in your changes missing coverage. Please review.

Project coverage is 86.92%. Comparing base (ba3b906) to head (a22742c).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
torch_geometric/nn/models/glem.py 11.42% 155 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9662      +/-   ##
==========================================
- Coverage   87.54%   86.92%   -0.62%     
==========================================
  Files         482      483       +1     
  Lines       31414    31585     +171     
==========================================
- Hits        27501    27455      -46     
- Misses       3913     4130     +217     


@puririshi98 puririshi98 self-requested a review September 16, 2024 15:27
@puririshi98 (Contributor) left a comment


LGTM, just get CI green.

@puririshi98 puririshi98 marked this pull request as ready for review September 24, 2024 19:28
@puririshi98 (Contributor) commented
@rusty1s @akihironitta ready for your reviews

@akihironitta (Member) left a comment


Could we have type annotations all over the PR? Also, I'd suggest splitting this PR into smaller ones.

Comment on lines +28 to +30
# Add the parent directory to sys.path
parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
sys.path.append(parent_dir)

Why is this necessary?

Comment on lines +7 to +10

## Run GLEM for getting SOTA result on ogbn-products dataset

`python glem.py`

Suggested change
## Run GLEM for getting SOTA result on ogbn-products dataset
`python glem.py`

Comment on lines +73 to +76
ext_pred_path = download_google_url(
id='15sO2m7BeW7C1Upmdw3Cx1JS__6nxTAzY',
folder='/work/users/junhaos/glem_data/ogbn_products/ext_preds',
filename='giant_sagn_scr.pt', log=True)

Let's use a relative path for other people to use.
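One way to follow this suggestion can be sketched as below. The `data/ogbn_products/ext_preds` layout is an assumption for illustration, not part of the PR.

```python
import os.path as osp

# Sketch: derive the download folder from a dataset root supplied by the
# caller, instead of hard-coding a machine-specific absolute path.
# (The 'ogbn_products/ext_preds' layout is assumed for illustration.)
def ext_pred_folder(root='data'):
    return osp.join(root, 'ogbn_products', 'ext_preds')
```

The returned path stays relative to whatever root the user passes, so other people can run the example without editing the script.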

pretrain_augmented = True

seed_everything(42)
from ogb.nodeproppred import PygNodePropPredDataset

nit: Let's move the import statement at the start of the file.

examples/llm/glem.py (comment resolved)
Comment on lines +368 to +377
if em_phase == 'gnn':
gnn_test_acc = max(gnn_test_acc, final_test_acc)
model.gnn = model.gnn.to('cpu', non_blocking=True)
em_phase = 'lm'
else:
lm_test_acc = max(lm_test_acc, final_test_acc)
model.lm = model.lm.to('cpu', non_blocking=True)
em_phase = 'gnn'
torch.cuda.empty_cache()
print(f'Best GNN acc: {gnn_test_acc}, LM acc: {lm_test_acc}')

This is the same comment as #9467 (comment), but we shouldn't pick the best metric evaluated on the test set at the end of every EM step.
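A sketch of the alternative the reviewer is pointing at, under the assumption that each EM step logs a validation and a test accuracy: select the checkpoint per phase by validation accuracy, and report that checkpoint's test accuracy once, rather than taking the max over test accuracies.

```python
# Sketch (not the PR's code): pick the best EM step per phase by VALIDATION
# accuracy, then report that step's test accuracy -- instead of max-ing over
# test accuracies, which leaks test-set information into model selection.
def select_by_validation(history):
    """history: iterable of (phase, val_acc, test_acc) tuples, one per EM step."""
    best = {}
    for phase, val_acc, test_acc in history:
        if phase not in best or val_acc > best[phase][0]:
            best[phase] = (val_acc, test_acc)
    # Return the test accuracy associated with the best-validation step.
    return {phase: test_acc for phase, (_, test_acc) in best.items()}
```

Note how the reported test accuracy can be lower than the maximum test accuracy seen during training; that is the point of the fix.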

@akihironitta (Member) left a comment

I haven't had a look outside the example script yet, but this addition is exciting! 🚀

3 participants