Thesis Score: 80/100
This study investigates the performance of quantized versions of the Llama 2 (7B) and Llama 3 (8B) models on simple addition and subtraction word problems. This comparison is significant because we aim to understand which structural features of a word problem make it challenging. Four features are explored: information relevance, which includes direct and misleading information; problem length, measured in word count; number size, ranging from single-digit to multi-digit numbers; and number of steps, i.e., the number of steps needed to solve the problem. Our evaluation addresses two outputs: the final answer and the extraction of the numbers relevant to the mathematical operation. These outputs are generated through four types of prompts: basic prompts for direct answers, basic prompts with examples, extraction prompts for identifying relevant numbers, and extraction prompts with examples. We run a series of experiments and save parsed results for evaluation. Results are compared to better understand the effect of increasingly complex word problems. The dataset (Hosseini et al., 2014) provided a foundation for the investigated features. Overall, this research contributes to the understanding of how language models extract mathematical information from text, providing insights into their difficulties that can inform further research on improving language models.
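To make the evaluation setup concrete, the sketch below illustrates how the four prompt types and the answer parsing could be implemented. This is a minimal illustration, not the study's actual code: the function names (`build_prompt`, `parse_final_answer`), the prompt templates, and the few-shot example are all hypothetical.

```python
import re

# Hypothetical few-shot example prepended for the "with examples" variants.
FEW_SHOT_EXAMPLE = (
    "Problem: Tom has 3 apples and buys 4 more. How many apples does he have?\n"
    "Answer: 7\n\n"
)

def build_prompt(problem: str, extraction: bool = False, with_example: bool = False) -> str:
    """Build one of the four prompt types: basic, basic with example,
    extraction, or extraction with example."""
    if extraction:
        task = "List the numbers needed to solve this problem, then give the final answer.\n"
    else:
        task = "Solve this problem and give only the final answer.\n"
    example = FEW_SHOT_EXAMPLE if with_example else ""
    return f"{example}{task}Problem: {problem}\nAnswer:"

def parse_final_answer(model_output: str):
    """Parse the last integer in the model's output as the final answer."""
    numbers = re.findall(r"-?\d+", model_output)
    return int(numbers[-1]) if numbers else None

# Usage: generate all four prompt variants for one word problem.
problem = "Sara had 10 pencils. She gave 3 to Dan. How many pencils does Sara have now?"
for extraction in (False, True):
    for with_example in (False, True):
        print(build_prompt(problem, extraction, with_example), end="\n\n")
```

Parsing the last number in the output is one simple convention for extracting a final answer; the study's actual parsing rules may differ.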
Our research has demonstrated the challenges that smaller LLMs like Llama 2 and Llama 3 face when solving simple word problems, particularly in extracting addition and subtraction steps. While Llama 3 showed improvements over Llama 2 on direct prompts, both models struggled with intermediate reasoning and contextual understanding. Future work should focus on testing more standard datasets and on refining and standardising prompts to improve performance and parsing. Further investigation into model architectures and tokenization techniques will also be essential for enhancing the capabilities of small LLMs. For example, it is possible that the byte pair encoding (BPE) tokenizer used by both models struggles to effectively capture sentence-level relationships in word problems. Sentence tokenization, rather than BPE, might better capture the context and improve performance on word problems.
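The contrast between the two tokenization granularities can be seen in a small sketch. This assumes the Hugging Face `transformers` and `nltk` libraries; GPT-2's tokenizer stands in for Llama's here only because both are BPE-based and GPT-2's is freely downloadable, not because the models share a vocabulary.

```python
import nltk
from transformers import AutoTokenizer

# Punkt sentence-splitting models; the package name differs across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

problem = ("Sara had 10 pencils. She gave 3 to Dan. "
           "How many pencils does Sara have now?")

# BPE splits the text into subword units with no notion of sentence boundaries.
bpe = AutoTokenizer.from_pretrained("gpt2")
print(bpe.tokenize(problem))  # a flat list of subword pieces

# Sentence tokenization keeps each statement of the problem intact,
# which may preserve the relationships the models appear to miss.
print(nltk.sent_tokenize(problem))
# ['Sara had 10 pencils.', 'She gave 3 to Dan.',
#  'How many pencils does Sara have now?']
```

Whether sentence-level units would actually help would need to be tested, since current Llama models consume subword tokens; the sketch only illustrates the structural information that BPE discards.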