Commit 1f72d05

docs: add readme and evaluation metrics for chart recommendation dataset

1 parent: 96b7ba3

13 files changed: +193 −58 lines

README.md (+3)

@@ -133,6 +133,9 @@ $ pnpm dev
 $ pnpm build
 ```
 
+## 🤖 Chart Recommendation Dataset
+The chart recommendation dataset is designed to evaluate or fine-tune large language models on their ability to recommend chart types based on given data. The dataset currently encompasses 16 types of charts, with 1-3 different data scenarios per chart type, and more than 15 chart data instances for each scenario. The dataset is continuously updated, and we welcome contributions of chart data collected from your own use cases. For more detailed information about the dataset, please visit [evaluations/recommend](https://github.com/antvis/GPT-Vis/tree/main/evaluations/recommend/README.md).
+
 ## License
 
 [MIT](./LICENSE)

README.zh-CN.md (+3)

@@ -120,6 +120,9 @@ set_gpt_vis(content)
 
 Learn more 👉 [streamlit-gpt-vis](https://github.com/antvis/GPT-Vis/bindings/streamlit-gpt-vis)
 
+## 🤖 Chart Recommendation Dataset
+The chart recommendation dataset is used to evaluate or fine-tune large language models on the "given data, recommend a chart type" task. The dataset currently covers 16 chart types, with 1-3 different data scenarios per chart type and 15+ chart data entries per scenario. The data is continuously updated, and contributions of chart data collected from your own use cases are welcome. For details on the dataset, see [evaluations/recommend](https://github.com/antvis/GPT-Vis/tree/main/evaluations/recommend/README.md).
+
 ## 💻 Local Development
 
 ```bash

evaluations/datastes/recommend/README.en.md (+3 −3)

@@ -25,17 +25,17 @@ In each data entry, source represents the user input. source.data contains the o
       "y": ["Population"] // Field for y-axis
     }
   }
-]
+]s
 }
 ```
 
 ### Model Fine-Tuning Dataset
 The gpt_vis_train.jsonl file is a fine-tuning training dataset generated from the above original chart data. The generation strategy is as follows: randomly select half of the cases for each chart type (the remaining data is used for evaluation). Since the number of original data entries varies for each chart type, to avoid imbalanced chart quantities affecting recommendation results, some chart data entries are repeated a certain number of times to ensure there are 60 entries for each chart type in the training set.
 
 ### Evaluation Result File
-The evalResult.json file contains the results of our model evaluation after fine-tuning. In this file, every source entry is the original input, target is the expected output, and generation is the model's output. Comparing these entries allows the evaluation of recommendation accuracy.
+The `metrics.json` file contains the results of our model evaluation after fine-tuning. In this file, every source entry is the original input, target is the expected output, and generation is the model's output. Comparing these entries allows the evaluation of recommendation accuracy.
 
 ## Model's Performance on Chart Recommendation Task
-Using the above datasets, we achieved a chart type accuracy of 85% and an encode accuracy of 70% with fine-tuning based on the `qwen2.5-14b-instruct`.
+Using the above datasets, we achieved a chart type accuracy of 89% and an encode accuracy of 82% with fine-tuning based on the `qwen2.5-14b-instruct`.
 
 It is important to note that the model recommendations can satisfy the requirement of "providing data and returning chart and configuration" in most scenarios. However, the model's output is not entirely controlled, which may result in invalid output or charts that cannot be successfully rendered. We recommend combining these with the recommendation modules in [@antv/ava](https://ava.antv.antgroup.com/api/advice/advisor). In scenarios where the model performance is suboptimal or where traditional rules fulfill the recommendation requirements, rule-based recommendation pipelines can be used as a fallback.
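The train/eval split and oversampling strategy described in the fine-tuning section above can be sketched as follows. This is a hypothetical illustration, not the repository's actual generation script; the names `oversampleTo` and `splitAndBalance` are invented for the example, and the shuffle step a real pipeline would use before sampling is omitted for brevity:

```javascript
// Hypothetical sketch of the training-set generation strategy: take half of
// each chart type's cases for training (the rest is held out for evaluation),
// then cycle through the training cases until each type has exactly
// `targetCount` entries, so chart types are balanced in the training set.
const oversampleTo = (cases, targetCount) => {
  const out = [];
  for (let i = 0; i < targetCount; i++) {
    out.push(cases[i % cases.length]); // repeat originals round-robin
  }
  return out;
};

const splitAndBalance = (casesByType, targetCount = 60) => {
  const train = {};
  const evalSet = {};
  for (const [type, cases] of Object.entries(casesByType)) {
    const half = Math.ceil(cases.length / 2);
    train[type] = oversampleTo(cases.slice(0, half), targetCount);
    evalSet[type] = cases.slice(half);
  }
  return { train, evalSet };
};

// e.g. 30 original line-chart cases -> 15 sampled for training, repeated to 60
const demo = splitAndBalance({ line: Array.from({ length: 30 }, (_, i) => ({ id: i })) });
console.log(demo.train.line.length); // 60
console.log(demo.evalSet.line.length); // 15
```

Repeating entries rather than truncating keeps every sampled case in the training set while still equalizing per-type counts.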

evaluations/datastes/recommend/README.md (+2 −2)

@@ -34,9 +34,9 @@
 The `gpt_vis_train.jsonl` file is the fine-tuning training dataset we generated from the original chart data above. The generation strategy is as follows: randomly sample half of the cases for each chart type (the rest is used for evaluation); since the number of original entries differs per chart type, some chart data is repeated a number of times so that each chart type has 60 entries in the training set, avoiding imbalance that would skew the recommendation results.
 
 ### Evaluation Result File
-The `evalResult.json` file contains the results of evaluating the model after fine-tuning. In each entry, `source` is the original input, `target` is the expected output, and `generation` is the model output; comparing them allows the recommendation accuracy to be evaluated.
+The `metrics.json` file contains the results of evaluating the model after fine-tuning. In each entry, `source` is the original input, `target` is the expected output, `generation` is the model output, `correctness` indicates whether the chart type was recommended correctly, and `encodeScore` is the score for the recommended chart configuration. See `eval/eval-recommend.js` for how the metrics are computed.
 
 ## Notes on the Model's Chart Recommendation Performance
-Using the datasets above, fine-tuning based on `qwen2.5-14b-instruct` achieves 85% chart type accuracy and 70% `encode` accuracy.
+Using the datasets above, fine-tuning based on `qwen2.5-14b-instruct` achieves 89% chart type accuracy and 82% `encode` accuracy.
 
 Note that the model's recommendations satisfy the "given data, return chart and configuration" requirement in most scenarios, but the model output is not fully controllable: results may be invalid, or the produced charts may fail to render. We recommend combining this with the recommendation modules in [@antv/ava](https://ava.antv.antgroup.com/api/advice/advisor); in scenarios where the model underperforms or traditional rules already meet the recommendation need, the rule-based recommendation pipeline can serve as a fallback.

evaluations/datastes/recommend/evalResult.json → evaluations/datastes/recommend/eval.json (renamed, +41 −32)

@@ -646,7 +646,7 @@
   },
   "target": [
     {
-      "type": "mutiple",
+      "type": "multiple",
       "encode": {
         "x": [
           "有效期至"

@@ -3046,38 +3046,38 @@
       "Year": 1916,
       "Deaths": 300
     }
-    ],
-    "target": [
-      {
-        "type": "heatmap",
-        "encode": {
-          "x": [
-            "Entity"
-          ],
-          "y": [
-            "Year"
-          ],
-          "size": [
-            "Deaths"
-          ]
-        }
-      },
-      {
-        "type": "scatter",
-        "encode": {
-          "x": [
-            "Entity"
-          ],
-          "y": [
-            "Year"
-          ],
-          "size": [
-            "Deaths"
-          ]
-        }
-      }
     ]
   },
+  "target": [
+    {
+      "type": "heatmap",
+      "encode": {
+        "x": [
+          "Entity"
+        ],
+        "y": [
+          "Year"
+        ],
+        "size": [
+          "Deaths"
+        ]
+      }
+    },
+    {
+      "type": "scatter",
+      "encode": {
+        "x": [
+          "Entity"
+        ],
+        "y": [
+          "Year"
+        ],
+        "size": [
+          "Deaths"
+        ]
+      }
+    }
+  ],
   "generation": [
     {
       "type": "line",

@@ -9912,6 +9912,15 @@
     }
   ]
 },
+"target": [
+  {
+    "type": "scatter",
+    "encode": {
+      "x": ["排队时间"],
+      "y": ["满意度"]
+    }
+  }
+],
 "generation": [
   {
     "type": "scatter",

@@ -10220,4 +10229,4 @@
     }
   ]
 }
-]
+]

evaluations/datastes/recommend/heatmap/01_two_dim_one_measure.json (+18 −18)

@@ -126,26 +126,26 @@
       "Year": 1916,
       "Deaths": 300
     }
-    ],
-    "target": [
-      {
-        "type": "heatmap",
-        "encode": {
-          "x": ["Entity"],
-          "y": ["Year"],
-          "size": ["Deaths"]
-        }
-      },
-      {
-        "type": "scatter",
-        "encode": {
-          "x": ["Entity"],
-          "y": ["Year"],
-          "size": ["Deaths"]
-        }
-      }
     ]
   },
+  "target": [
+    {
+      "type": "heatmap",
+      "encode": {
+        "x": ["Entity"],
+        "y": ["Year"],
+        "size": ["Deaths"]
+      }
+    },
+    {
+      "type": "scatter",
+      "encode": {
+        "x": ["Entity"],
+        "y": ["Year"],
+        "size": ["Deaths"]
+      }
+    }
+  ]
 },
 {
   "source": {

evaluations/datastes/recommend/metrics.json (+1)

Large diffs are not rendered by default.

evaluations/datastes/recommend/multiple-axes/01_base.json (+1 −1)

@@ -375,7 +375,7 @@
   },
   "target": [
     {
-      "type": "mutiple",
+      "type": "multiple",
      "encode": {
        "x": [
          "有效期至"

evaluations/datastes/recommend/scatter/01_two_measure_correlate.json (+1 −1)

@@ -18,7 +18,7 @@
     {"排队时间":15, "满意度": 1}
   ]
 },
-"targe": [
+"target": [
   {
     "type": "scatter",
     "encode": {

evaluations/package.json (+1)

@@ -4,6 +4,7 @@
   "scripts": {
     "eval:data": "node ./scripts/eval/eval-data.js",
     "eval:metrics": "node ./scripts/eval/eval-metrics.js",
+    "eval:chart-recommend": "node ./scripts/eval/eval-recommend.js",
     "prompt": "node ./scripts/prompt/generate-prompts.js"
   },
   "devDependencies": {
evaluations/scripts/eval/eval-recommend.js (new file, +70)

import { evaluateChartEncodes } from '../helpers/evaluate-chart-encode.js';
import { readDataset, writeDataset } from '../helpers/read-dataset.js';
import _ from 'lodash';

export const evaluateChartRecommend = async () => {
  const evalDatasetPath = `datastes/recommend/eval.json`;
  const testDataset = await readDataset(evalDatasetPath);
  console.log('datasets count: ', testDataset.length);
  console.log('Beginning eval datasets...');
  const misMap = new Map();
  const scoredData = testDataset
    .map((data) => {
      const target = data.target?.[0] ?? data.target;
      const gen = data.generation?.[0] ?? data.generation;
      if (!gen || !target) {
        // Missing data entry
        misMap.set('missing', (misMap.get('missing') ?? 0) + 1);
        return;
      }
      const chartTypeScore = gen.type === target.type ? 1 : 0;
      const encodeScore = evaluateChartEncodes(gen.encode, target.encode);
      if (!chartTypeScore) {
        // Track which chart types get misclassified as what
        const key = `${target.type}_to_${gen.type}`;
        misMap.set(key, (misMap.get(key) ?? 0) + 1);
      }
      return {
        ...data,
        correctness: chartTypeScore,
        encodeScore,
      };
    })
    .filter((data) => data);
  // Save the per-entry evaluation result
  await writeDataset(`datastes/recommend/metrics.json`, scoredData);
  // Output aggregate metrics
  let chartTypeScore = 0;
  let encodeScore = 0;
  let chartTypeScoreMap = {};
  scoredData.forEach((data) => {
    chartTypeScore += data.correctness;
    encodeScore += data.encodeScore;
    const target = data.target?.[0] ?? data.target;
    const chartType = target?.type;
    chartTypeScoreMap[chartType] = {
      chartTypeScore: (chartTypeScoreMap[chartType]?.chartTypeScore ?? 0) + data.correctness,
      encodeScore: (chartTypeScoreMap[chartType]?.encodeScore ?? 0) + data.encodeScore,
      count: (chartTypeScoreMap[chartType]?.count ?? 0) + 1,
    };
  });
  chartTypeScore /= testDataset.length;
  encodeScore /= testDataset.length;
  chartTypeScoreMap = _.mapValues(chartTypeScoreMap, (score) => ({
    chartTypeScore: score.chartTypeScore / score.count,
    encodeScore: score.encodeScore / score.count,
    count: score.count,
  }));
  console.log('scoredData.length', scoredData.length, 'datasets count: ', testDataset.length);
  console.log('chart type recommend accuracy:', chartTypeScore);
  console.log('chart encode score:', encodeScore);
  console.log('misclassified:', misMap);
  console.log('chartTypeScoreMap', chartTypeScoreMap);
};

evaluateChartRecommend().catch((error) => {
  console.error('Error evaluating chart recommendation:', error);
});
evaluations/scripts/helpers/evaluate-chart-encode.js (new file, +49)

import _ from 'lodash';

const jaccardSimilarity = (a, b) => {
  const base = _.union(a, b).length;
  if (base === 0) return 0;
  return _.intersection(a, b).length / base;
};

// Drop empty strings and empty arrays from an encode object
const removeEmpty = (v) => {
  const res = {};
  _.each(v, (value, key) => {
    if (typeof value === 'string') {
      if (value) {
        res[key] = value;
      }
    } else if (Array.isArray(value)) {
      if (value.length) {
        res[key] = value;
      }
    }
  });
  return res;
};

/** Score the similarity between the recommended chart encode and the expected chart encode */
export const evaluateChartEncodes = (gen, ref) => {
  try {
    const v = [];
    const a = removeEmpty(gen);
    const b = removeEmpty(ref);
    const allKeys = _.union(Object.keys(a), Object.keys(b));
    _.each(allKeys, (key) => {
      const ao = a[key];
      const bo = b[key];
      if (!ao && !bo) {
        return;
      }
      if (Array.isArray(ao) && Array.isArray(bo)) {
        v.push(jaccardSimilarity(ao, bo));
      } else {
        v.push(ao === bo ? 1 : 0);
      }
    });
    // With no comparable keys, fall back to strict equality (1 or 0)
    return v.length ? _.sum(v) / v.length : gen === ref ? 1 : 0;
  } catch (e) {
    console.log('error', e);
    return 0;
  }
};
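As a worked example of the encode scoring above, the snippet below is a standalone sketch that mirrors `evaluateChartEncodes` without lodash (it is not the module itself, and it omits the empty-value filtering for brevity): each encode channel is compared with Jaccard similarity, and the per-channel scores are averaged over the union of channel keys.

```javascript
// Minimal standalone mirror of the encode scoring: per-key Jaccard
// similarity for array channels, averaged over the union of keys.
const jaccard = (a, b) => {
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 0;
  const inter = new Set(a.filter((v) => b.includes(v)));
  return inter.size / union.size;
};

const encodeScore = (gen, ref) => {
  const keys = new Set([...Object.keys(gen), ...Object.keys(ref)]);
  const scores = [...keys].map((k) =>
    Array.isArray(gen[k]) && Array.isArray(ref[k]) ? jaccard(gen[k], ref[k]) : 0
  );
  return scores.length ? scores.reduce((s, x) => s + x, 0) / scores.length : 0;
};

// Generation matches on x and y but uses `size` where `color` was expected:
const gen = { x: ['Entity'], y: ['Year'], size: ['Deaths'] };
const ref = { x: ['Entity'], y: ['Year'], color: ['Deaths'] };
console.log(encodeScore(gen, ref)); // 0.5  (x -> 1, y -> 1, size -> 0, color -> 0)
```

Averaging over the union of keys penalizes both missing and spurious channels, so a generation that adds an extra channel scores lower even when every expected channel matches.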

evaluations/scripts/helpers/read-dataset.js (−1)

@@ -6,7 +6,6 @@ export const readDataset = async (filePath) => {
   const absolutefilePath = resolve(__dirProject, filePath);
 
   const content = await readFile(absolutefilePath, 'utf-8');
-
   const data = JSON.parse(content);
 
   return data;

0 commit comments