Thesis Score: 80/100
This study investigates the performance of quantized versions of the Llama 2 (7B) and Llama 3 (8B) models on simple addition and subtraction word problems. This comparison is significant because we aim to understand which structural features of a word problem make it challenging. Four features are explored: information relevance, which includes direct and misleading information; problem length, measured in word count; number size, ranging from single-digit to multi-digit numbers; and number of steps, i.e., the number of steps needed to solve the problem. Our evaluation addresses two outputs: the final answer and the extraction of the numbers relevant to the mathematical operation. These outputs are generated through four types of prompts: basic prompts for direct answers, basic prompts with examples, extraction prompts for identifying relevant numbers, and extraction prompts with examples. We run a series of experiments and save parsed results for evaluation. Results are compared to better understand the effect of increasingly complex word problems. The dataset (Hosseini et al., 2014) provided a foundation for the investigated features. Overall, this research contributes to the understanding of how language models extract mathematical information from text, providing insights into their difficulties that can inform further research on improving language models.
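To make the evaluation setup concrete, the sketch below illustrates how the four prompt types and the answer parsing could be implemented. This is a minimal illustration, not the study's actual code: the function names (`build_prompt`, `parse_final_answer`), the prompt templates, and the few-shot example are all hypothetical.

```python
import re

# Hypothetical few-shot example prepended for the "with examples" variants.
FEW_SHOT_EXAMPLE = (
    "Problem: Tom has 3 apples and buys 4 more. How many apples does he have?\n"
    "Answer: 7\n\n"
)

def build_prompt(problem: str, extraction: bool = False, with_example: bool = False) -> str:
    """Build one of the four prompt types: basic, basic with example,
    extraction, or extraction with example."""
    if extraction:
        task = "List the numbers needed to solve this problem, then give the final answer.\n"
    else:
        task = "Solve this problem and give only the final answer.\n"
    example = FEW_SHOT_EXAMPLE if with_example else ""
    return f"{example}{task}Problem: {problem}\nAnswer:"

def parse_final_answer(model_output: str):
    """Parse the last integer in the model's output as the final answer."""
    numbers = re.findall(r"-?\d+", model_output)
    return int(numbers[-1]) if numbers else None

# Usage: generate all four prompt variants for one word problem.
problem = "Sara had 10 pencils. She gave 3 to Dan. How many pencils does Sara have now?"
for extraction in (False, True):
    for with_example in (False, True):
        print(build_prompt(problem, extraction, with_example), end="\n\n")
```

Parsing the last number in the output is one simple convention for extracting a final answer; the study's actual parsing rules may differ.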
Our research has demonstrated the challenges that smaller LLMs like Llama 2 and Llama 3 face when solving simple word problems, particularly in extracting addition and subtraction steps. While Llama 3 showed improvements over Llama 2 on direct prompts, both models struggled with intermediate reasoning and contextual understanding. Future work should focus on testing more standard datasets and on refining and standardising prompts to improve performance and parsing. Further investigation into model architectures and tokenization techniques will also be essential for enhancing the capabilities of small LLMs. For example, it is possible that the byte pair encoding (BPE) tokenizer used by both models struggles to effectively capture sentence-level relationships in word problems. Sentence tokenization, rather than BPE, might better capture the context and improve performance on word problems.
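The contrast between the two tokenization granularities can be seen in a small sketch. This assumes the Hugging Face `transformers` and `nltk` libraries; GPT-2's tokenizer stands in for Llama's here only because both are BPE-based and GPT-2's is freely downloadable, not because the models share a vocabulary.

```python
import nltk
from transformers import AutoTokenizer

# Punkt sentence-splitting models; the package name differs across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

problem = ("Sara had 10 pencils. She gave 3 to Dan. "
           "How many pencils does Sara have now?")

# BPE splits the text into subword units with no notion of sentence boundaries.
bpe = AutoTokenizer.from_pretrained("gpt2")
print(bpe.tokenize(problem))  # a flat list of subword pieces

# Sentence tokenization keeps each statement of the problem intact,
# which may preserve the relationships the models appear to miss.
print(nltk.sent_tokenize(problem))
# ['Sara had 10 pencils.', 'She gave 3 to Dan.',
#  'How many pencils does Sara have now?']
```

Whether sentence-level units would actually help would need to be tested, since current Llama models consume subword tokens; the sketch only illustrates the structural information that BPE discards.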