The standard analyzer is designed for English-style tokenization by default, while jieba is a Chinese word segmenter.

You can use the run_analyzer() interface to test how an analyzer splits your text:

    analyzer_params = {"tokenizer": "standard","filter": ["lowercase"] }
    res = client.run_analyzer(
        texts=["采用官方教程提供的分词器"],
        analyzer_params=analyzer_params,
    )
    print("Standard tokenizer:", res)

    analyzer_params = {"tokenizer": "jieba"}
    res = client.run_analyzer(
        texts=["采用官方教程提供的分词器"],
        analyzer_params=analyzer_params,
    )
    print("Jieba tokenizer:", res)

Output:

Standard tokenizer: [['采用官方教程提供的分词器']]
Jieba tokenizer: [['采用', '官方', '教程', '提供', '的', '分词', '分词器']]

So with the standard analyzer the whole Chinese sentence is kept as a single token, and a keyword match can never succeed.
Only jieba segments Chinese sentences correctly.
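
If your collection stores Chinese text, you can apply the jieba analyzer directly on the VARCHAR field when defining the schema. Below is a minimal sketch, assuming pymilvus 2.5+ with analyzer support; the collection name, field names, vector dimension, and connection URI are placeholders:

    from pymilvus import MilvusClient, DataType

    client = MilvusClient(uri="http://localhost:19530")  # placeholder URI

    schema = MilvusClient.create_schema(auto_id=True)
    schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)

    # Enable the jieba analyzer on the text field so that text match /
    # full-text search segments Chinese words instead of whole sentences
    schema.add_field(
        field_name="text",
        datatype=DataType.VARCHAR,
        max_length=2000,
        enable_analyzer=True,
        enable_match=True,
        analyzer_params={"tokenizer": "jieba"},
    )

    # Milvus collections also require a vector field
    schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=768)

    client.create_collection(collection_name="demo_zh", schema=schema)

With a schema like this, a filter such as TEXT_MATCH(text, '分词器') should match documents containing that word, which is exactly the case that fails when the field uses the standard analyzer.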
