
Add charset conversion feature. #222

Open: gnuhpc wants to merge 1 commit into main

Conversation

@gnuhpc commented Aug 10, 2017

Our scenario is collecting logs encoded in GBK (a Chinese encoding) rather than UTF-8.

We use the Flume taildir source plugin to do the collection. After Avro serialization, the events are sent to Kafka, and then we use logstash-kafka-input to fetch them and send them to Elasticsearch. The data flow is:

Logs (GBK) -> Flume (Avro serialization) -> Kafka -> logstash-kafka-input (Avro codec for deserialization) -> Elasticsearch

The input config is:

input {
        kafka {
                bootstrap_servers => "DPFTMP06:9092"
                max_partition_fetch_bytes => "3145728"
                topics =>["emu-topic"]
                group_id => "logstashc"
                auto_offset_reset => "earliest"
                consumer_threads => 8
                key_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
                value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
                charset => "GBK"
                charset_field => ["message"]
                codec => avro {
                        schema_uri => "/logger/logstash/configM/test.avsc"
                }
        }
}

If the logs were UTF-8, this would not cause any trouble. Unfortunately, the logs are GBK, so Kibana shows a garbled message field whenever it contains Chinese.

With this PR, we can convert the message to the correct encoding after the Avro codec runs in this input plugin.
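
To make the failure mode concrete, here is a minimal standalone Java sketch (not part of this PR; the class and string names are ours, and it assumes the JRE ships the GBK charset, as full JDKs do). It shows why GBK bytes read back as UTF-8 turn into mojibake, and how decoding with the declared charset recovers the text, which is the same idea this PR applies to each configured charset_field:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkDemo {
    public static void main(String[] args) {
        // Raw bytes as they would arrive from Kafka: "中文" encoded in GBK.
        byte[] gbkBytes = "中文".getBytes(Charset.forName("GBK"));

        // Decoding with the wrong charset (UTF-8) yields mojibake; this is
        // what ends up in Elasticsearch and Kibana without a conversion step.
        String garbled = new String(gbkBytes, StandardCharsets.UTF_8);

        // Decoding with the declared charset recovers the original text.
        String correct = new String(gbkBytes, Charset.forName("GBK"));

        System.out.println("garbled: " + garbled);
        System.out.println("correct: " + correct);
    }
}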

Please check, thank you!

@cgyim commented Aug 24, 2017

@gnuhpc I recently ran into the same trouble you mentioned, but with GB2312 encoding. My data flow is also Flume -> Kafka -> Logstash, with the Flume taildir source. Could you share your Flume configuration?

@gnuhpc (Author) commented Sep 27, 2017

@cgyim1992 That has nothing to do with this issue. Please add me on WeChat: gnuhpc. Thank you!

@imweijh commented Dec 8, 2017

Specify the GBK charset in the Flume config, not in logstash-input-kafka.
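
For reference, a minimal sketch of that approach (agent, channel, and path names are placeholders; it uses the spooling-directory source, since as far as I know the stock TAILDIR source exposes no charset option, and the sink is omitted):

a1.sources = r1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
# Placeholder directory; inputCharset tells the line deserializer to read files as GBK text.
a1.sources.r1.spoolDir = /logger/app/logs
a1.sources.r1.inputCharset = GBK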

@gnuhpc (Author) commented Jan 4, 2018

@imweijh can you tell me how to configure the GBK charset in the Flume config when using Avro serialization? Thank you!
