[Repost][Translation] The Battle of Small Language Models: Stable LM, Tiny LLama, Mini CPM, and QWEN 1.5 #57
Original article: https://medium.com/@zaiinn440/best-slm-stable-lm-tiny-llama-mini-cpm-and-qwen-1-5-91134cfddbc3
The following is a translation:
The battle of the SLMs (image generated by the original author)
Introduction
Small language models (SLMs) have recently become a hot topic, with new models released almost daily that aim to match large language models (LLMs) in performance; in compute and memory cost, SLMs already hold the advantage. They were initially seen as scaled-down versions of LLMs, but that has changed: SLMs keep improving, and their results are now, to some degree, comparable to those of LLMs. The obvious question follows: which SLM is best? To answer it, I compared four small language models (Stable LM, Tiny LLama, Mini CPM, and QWEN 1.5) on a series of benchmarks covering different natural language processing tasks: emotional intelligence, code generation, text summarization, and narrative writing. The evaluation found that one model performed well across every task, another performed poorly, and the remaining two were roughly on par with each other, producing similar responses.
This blog post is a collaboration between me and my colleague Syed Hasan.
Advantages of SLMs
Before comparing these SLMs, we need to understand the advantages SLMs hold over LLMs, chiefly their lower compute and memory costs.
Test Conditions
To ensure fairness and consistency, several preconditions were met before the comparative analysis of the small language models was carried out.
Following these conditions may allow the SLMs to produce unbiased responses; nothing is absolutely perfect, which is why we can only say "may".
Comparison
We will now compare how the four SLMs (Stable LM, Tiny LLama, Mini CPM, and QWEN 1.5) respond to a range of prompts, with an assessment and reasoning for each.
The evaluation criteria cover emotional intelligence, code generation, text summarization, and narrative writing.
Emotional Intelligence Evaluation
We used three prompts for the emotional intelligence evaluation:
Prompt 1: Examine the emotion and sentiment expressed in the following movie review excerpt: “The acting was superb, but the plot was predictable and lackluster.” Determine if the overall impression conveyed by the statement leans more towards being positive, negative, or neutral.
Prompt 2: Describe two scenarios where understanding customer emotions could significantly contribute to improving business outcomes. Suggest a potential solution involving emotion detection technology for each situation.
Prompt 3: Based on the weather conditions described below, predict the likely mood of the speaker: “A heavy blanket of clouds smothered the sky, casting an eerie gray pallor over the once vibrant cityscape. Raindrops pattered against windows with rhythmic monotony, creating a somber symphony that echoed the residents’ melancholic spirits.”
Selected screenshots of the responses are available in the original post.
Narrative Composition / Story Writing
We also evaluated the models on a dedicated narrative-writing prompt and ranked the responses by the quality of the storyline and the level of detail in each response.
Prompt: In a sleepy town where nothing ever happens, ordinary citizens start developing extraordinary powers overnight — an elderly woman gains telekinesis, a schoolboy acquires super strength, and a timid girl suddenly becomes invisible. As everyone grapples with their newfound abilities, tensions rise, fueling fear and prejudice among neighbors. Write a poignant story exploring themes of acceptance, change, and community in this magical setting.
Selected screenshots of the responses are available in the original post.
Code Generation
For code generation, we evaluated the models on two programming-related prompts.
Prompt 1: Develop a lightweight microservice written in Go or Rust that resizes incoming JPG images to specified dimensions using OpenCV or any alternative computer vision library. Optimize the solution for minimal latency and memory footprint.
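For reference, below is a minimal sketch of the kind of service Prompt 1 asks for, written in Go with the standard library plus golang.org/x/image/draw instead of OpenCV (the prompt allows any alternative library). It is not taken from the article or from any model's response, and the /resize route and the w/h query parameters are assumptions made for illustration.

```go
// Minimal image-resizing microservice (illustrative sketch, not the article's
// reference solution). POST a JPG body to /resize?w=320&h=240 and the service
// returns the resized JPG.
package main

import (
	"image"
	"image/jpeg"
	"log"
	"net/http"
	"strconv"

	"golang.org/x/image/draw" // scaler used here instead of OpenCV
)

func resizeHandler(w http.ResponseWriter, r *http.Request) {
	// Target dimensions come from query parameters (assumed names: w and h).
	width, errW := strconv.Atoi(r.URL.Query().Get("w"))
	height, errH := strconv.Atoi(r.URL.Query().Get("h"))
	if errW != nil || errH != nil || width <= 0 || height <= 0 {
		http.Error(w, "invalid w/h parameters", http.StatusBadRequest)
		return
	}

	// Decode the incoming JPG directly from the request body.
	src, err := jpeg.Decode(r.Body)
	if err != nil {
		http.Error(w, "could not decode JPG", http.StatusBadRequest)
		return
	}

	// ApproxBiLinear trades a little quality for low latency and memory use.
	dst := image.NewRGBA(image.Rect(0, 0, width, height))
	draw.ApproxBiLinear.Scale(dst, dst.Bounds(), src, src.Bounds(), draw.Over, nil)

	w.Header().Set("Content-Type", "image/jpeg")
	if err := jpeg.Encode(w, dst, &jpeg.Options{Quality: 85}); err != nil {
		log.Printf("encode failed: %v", err)
	}
}

func main() {
	http.HandleFunc("/resize", resizeHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```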
Prompt 2: Given a database schema consisting of two tables: “Orders” (OrderID int PRIMARY KEY, CustomerName varchar(50)) and “OrderDetails” (DetailID int PRIMARY KEY, OrderID int, ProductName varchar(50), Quantity int, UnitPrice decimal(18,2)), write an SQL query to retrieve the total revenue for each customer who has placed orders. Format the output as follows: CustomerName, TotalRevenue, where TotalRevenue represents the sum of all products’ prices multiplied by quantities ordered by that customer. Display customers with zero sales too. Sort the final result set alphabetically by customer name.
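For reference, one query that satisfies Prompt 2 might look like the sketch below, written against the schema given in the prompt; it is not taken from the article or from any model's output. The LEFT JOIN combined with COALESCE keeps customers whose orders have no matching detail rows, i.e. customers with zero sales.

```sql
-- Total revenue per customer, including customers with zero sales (sketch).
SELECT o.CustomerName,
       COALESCE(SUM(d.Quantity * d.UnitPrice), 0) AS TotalRevenue
FROM Orders AS o
LEFT JOIN OrderDetails AS d
       ON d.OrderID = o.OrderID
GROUP BY o.CustomerName
ORDER BY o.CustomerName;
```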
Selected screenshots of the responses are available in the original post.
Text Summarization
For the text summarization task, we selected a web article of roughly 4,500 words for evaluation; the article is an ethical assessment of implantable brain chips.
Selected screenshots of the responses are available in the original post.
Conclusion
After comparing and benchmarking Stable LM-2, Tiny LLama, Mini CPM, and QWEN 1.5, we found that Stable LM-2 performed best across all tasks; its results on the emotional intelligence evaluation, the coding exercises, text summarization, and story writing clearly demonstrate its competitiveness.
At the other end of the evaluation, Tiny LLama lagged noticeably behind its competitors and failed to beat the other models on almost every task. Despite occasional flashes of quality, it was judged the least effective model.
As for Mini CPM and QWEN 1.5, the tests show the two performing very similarly on most tasks. Neither surpasses Stable LM-2, but each shows its own strengths in certain areas, so users can pick whichever of the two better fits their application requirements and available resources. Overall, while they fall short of Stable LM-2, each offers value worth exploring in particular domains.
For every prompt and the full, detailed responses, please see the Analysis Report.