Fine-Grained Domain Adaptation for Large Language Models Using Knowledge Injection

Quick Brief:

This repository is an end-to-end Python proof of concept for fine-grained domain adaptation of a large language model, demonstrated with a simple knowledge-injection example. We will:

  • Load a base pretrained model (e.g., GPT-2) from Hugging Face.
  • Use a toy domain-specific dataset (small mocked examples) for demonstration.
  • Incorporate a small knowledge dictionary to inject domain-specific terms/definitions directly into the input (an illustrative example follows this list).
  • Fine-tune the model using Hugging Face’s Trainer on a text classification task.
  • Evaluate and provide ways to measure domain adaptation improvements.
  • Note: This is a demonstrative prototype. For a real-world application, replace the toy dataset with a larger, curated domain corpus. Also, adapt or extend the knowledge-injection logic to suit your domain’s complexity.
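
For illustration of the injection step (this sentence and its definition are made up for the example, not entries from the repository's dataset), an input such as "Patient shows signs of myocardial infarction." would be rewritten before tokenization as:

Patient shows signs of myocardial infarction. [KNOWLEDGE] myocardial infarction: death of heart muscle caused by a blocked coronary artery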

Objective:

Adapt a pre-trained large language model (e.g., GPT-2, BERT, or any Transformer-based model) to a specific specialized domain (e.g., clinical medicine, quantum physics).

Innovation:

Incorporate external domain knowledge into the model by injecting relevant terminologies or short definitions as additional context during training.

Key Techniques:

  • Utilize Hugging Face Transformers to load a base model.
  • Use a domain-specific corpus to further pre-train or fine-tune the model (a continued pre-training sketch follows this list).
  • Implement a knowledge-injection mechanism that augments or re-ranks tokens based on domain vocabularies or concept definitions.
  • Evaluate improvements on specialized tasks like domain-specific text classification or masked language modeling accuracy.
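
The continued pre-training option mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: domain_texts is a two-sentence placeholder corpus, and the training arguments are arbitrary small values chosen only to keep the example runnable.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder domain corpus; a real run would use a large collection of domain documents.
domain_texts = [
    "Myocardial infarction is commonly diagnosed with an ECG and troponin levels.",
    "Dyspnea on exertion can indicate underlying cardiac or pulmonary disease.",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=64)

corpus = Dataset.from_dict({"text": domain_texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

model = AutoModelForCausalLM.from_pretrained("gpt2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-pretrain",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=corpus,
    # mlm=False makes the collator build labels for a causal language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()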

Outcome:

Demonstrates advanced domain adaptation, bridging general-purpose LLM capabilities with specialized domain knowledge.

Installation:

python -m venv env   # create the virtual environment first if it does not already exist
source env/bin/activate
pip install -r requirements.txt
python train.py

Explanation of the Key Steps:

  1. Mock Dataset: We create a small dataset containing medical-like sentences (label=1) and unrelated general sentences (label=0).
  2. Knowledge Dictionary: Contains a few key medical terms (myocardial infarction, ECG, dyspnea, etc.) and short definitions.
  3. Knowledge Injection: A function inject_knowledge(...) appends domain definitions whenever it encounters a known keyword in the text. We insert them with the delimiter "[KNOWLEDGE]" so the model receives direct domain context (illustrative sketches of this and the remaining steps follow this list).
  4. Tokenizer & Preprocessing: We apply GPT-2's tokenizer with max_length=64 to keep the example small. Because GPT-2 defines no padding token, we set the pad_token to eos_token.
  5. Model: We load GPT2ForSequenceClassification with num_labels=2 for a simple binary classification.
  6. Trainer: We use Hugging Face’s Trainer API with a small learning rate and a few epochs (num_train_epochs=3).
  7. Evaluation: We measure accuracy using a built-in Hugging Face metric.
  8. Inference: Shows how to apply knowledge injection at inference time and get predictions from the fine-tuned model.
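
The sketches below illustrate the steps above; they are minimal reconstructions under stated assumptions, not the repository's exact code. First, steps 1-3: the sentences, labels, and dictionary entries are invented for illustration, and only the inject_knowledge name and the [KNOWLEDGE] delimiter come from the description above.

# Step 1: toy dataset of medical-like sentences (label=1) and general sentences (label=0).
train_texts = [
    "The patient was diagnosed with myocardial infarction after an abnormal ECG.",
    "Severe dyspnea was noted on admission.",
    "The weather was sunny and the park was crowded.",
    "She enjoys painting landscapes on weekends.",
]
train_labels = [1, 1, 0, 0]

# Step 2: small knowledge dictionary mapping domain terms to short definitions.
KNOWLEDGE_DICT = {
    "myocardial infarction": "death of heart muscle caused by a blocked coronary artery",
    "ecg": "electrocardiogram, a recording of the heart's electrical activity",
    "dyspnea": "difficult or labored breathing",
}

# Step 3: append definitions for every known term that appears in the text.
def inject_knowledge(text: str, knowledge: dict = KNOWLEDGE_DICT) -> str:
    injected = text
    for term, definition in knowledge.items():
        if term in text.lower():
            injected += f" [KNOWLEDGE] {term}: {definition}"
    return injected

augmented_texts = [inject_knowledge(t) for t in train_texts]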
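
Steps 4-7, continuing from the previous sketch (it reuses augmented_texts and train_labels). The learning rate and batch size are illustrative guesses, and accuracy is computed directly with NumPy rather than through a Hub metric to keep the sketch self-contained.

import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    GPT2ForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Step 4: GPT-2 tokenizer with max_length=64; GPT-2 has no pad token, so reuse eos.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

dataset = Dataset.from_dict({"text": augmented_texts, "label": train_labels})
dataset = dataset.map(preprocess, batched=True)

# Step 5: GPT-2 with a binary classification head.
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # required for padded batches

# Step 7: accuracy over the predicted labels.
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

# Step 6: Trainer with a small learning rate and a few epochs.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           num_train_epochs=3,
                           learning_rate=5e-5,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    eval_dataset=dataset,  # toy setup: evaluate on the training data
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())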
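
Step 8, again continuing from the sketches above (it reuses inject_knowledge, tokenizer, and the fine-tuned model); the two sample sentences are illustrative.

import torch

def predict(text: str) -> int:
    # Apply the same knowledge injection at inference time, then classify.
    injected = inject_knowledge(text)
    inputs = tokenizer(injected, return_tensors="pt", truncation=True, max_length=64)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

print(predict("The ECG confirmed a myocardial infarction."))  # expected label: 1
print(predict("We went hiking near the lake yesterday."))     # expected label: 0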

Possible Extensions:

  • Larger Domain Corpus: Integrate a substantial domain corpus (e.g., a collection of medical articles) for more realistic adaptation.
  • Refined Knowledge Integration: Instead of simple text concatenation, consider more sophisticated approaches like gating networks or attention-based injection of external knowledge.
  • Advanced Evaluation: Go beyond classification by evaluating perplexity for domain-specific language modeling (a brief sketch follows this list) or measuring F1 scores for domain-specific tasks (e.g., NER).
  • Use a More Recent Foundation Model: Instead of GPT-2, one could use GPT-Neo, GPT-J, LLaMA, or another modern large-scale model (depending on hardware resources).
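
As a concrete form of the perplexity idea above, the sketch below scores one sentence with a plain causal GPT-2; the sample sentence is illustrative, and a real comparison would contrast the base and domain-adapted models on a held-out domain corpus.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm_model = AutoModelForCausalLM.from_pretrained("gpt2")
lm_model.eval()

def perplexity(text: str) -> float:
    # Perplexity = exp(mean negative log-likelihood of the tokens).
    enc = lm_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = lm_model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("The patient presented with acute dyspnea and an abnormal ECG."))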
