
Error_class=URI::InvalidURIError error=“bad URI(is not URI) ..” #4646

Open
raulgupto opened this issue Sep 25, 2024 · 15 comments
Labels
waiting-for-user Similar to "moreinfo", but especially need feedback from user

Comments

@raulgupto

Describe the bug

I’m getting this error continuously when using the @http plugin.
I’ve not been able to find the root cause, but I’ve noticed it appears, coincidentally, when my external endpoint is down for restarts.
I have buffering enabled, which writes to my local disk, and I do not drop any log chunks, i.e. I have retry_forever set to true. But when the service is back up, this one chunk goes into periodic retries forever because the dynamic tag in the http endpoint is not resolved during retries.

so the whole error is like this:
Error_class=URI::InvalidURIError error="bad URI(is not URI) \"https://myexternalendpoint.com/v0/${tag}\""

fluentd version: 1.16.5

To Reproduce

Use the http plugin with an endpoint containing ${tag}, with retry_forever set to true.

Expected behavior

The buffer chunk should be sent; Fluentd should not complain about an invalid URI.

Your Environment

- Fluentd version: 1.16.5
- Package version:
- Operating system: Red Hat Enterprise Linux Server 7.9 (Maipo)
- Kernel version: 2024 x86_64 GNU/Linux

Your Configuration

<match **>
  @type http
  endpoint http://externalxx.com/v0/${tag}
  content_type application/json
  <format>
    @type json
  </format>
  json_array true
  <buffer tag>
    @type file
    path /local/data/fluentd
    flush_interval 12s
    flush_thread_count 1
    overflow_action block
    chunk_limit_size 4MB
    retry_type periodic
    retry_wait 60s
    total_limit_size 6GB
    retry_forever true
  </buffer>
</match>

Your Error Log

Error_class=URI::InvalidURIError error="bad URI(is not URI) \"https://myexternalendpoint.com/v0/${tag}\""

Additional context

No response

@daipom
Contributor

daipom commented Sep 26, 2024

@raulgupto Thanks for your report.
However, I can't reproduce this.

The placeholder is replaced when retrying if the chunk has tag key info.
If you set tag to the chunk keys, ${tag} should be replaced when retrying.

<match test>
  @type http 
  endpoint http://localhost:9880/${tag}
  <format>
    @type json
  </format>
  <buffer tag>
    @type file
    path ...
    flush_mode immediate
  </buffer>
</match>

@daipom added the waiting-for-user label Sep 26, 2024
@raulgupto
Author

Can you try killing the fluentd process? I can’t figure out the exact scenario to reproduce this issue. What I’ve noticed is that, in a normal scenario, we have two buffer files for a chunk of messages, but in this case only one is present most of the time.

@daipom
Contributor

daipom commented Sep 27, 2024

Can you try killing the fluentd process?

I have tried.
When Fluentd restarts, Fluentd loads the existing chunks and sends them correctly.

But in this case, I’ve noticed that only one is present most of the time.

This should be the cause.
I can reproduce this issue as follows.

  1. Make some buffer files.
  2. Stop Fluentd with some buffer files remaining.
  3. Delete some .meta buffer files manually.
  4. Restart Fluentd.
  5. This error happens.

The file buffer (buf_file) needs a .meta file to process the placeholders.
If it is removed, Fluentd can't process the placeholders.
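
For reference, a healthy file buffer stores each chunk as a pair of files: the chunk data plus a companion .meta file holding the tag and other placeholder information. The names below are only illustrative of the default naming (the exact prefix depends on the configured path):

  /local/data/fluentd/buffer.b<chunk_id>.log        # chunk payload
  /local/data/fluentd/buffer.b<chunk_id>.log.meta   # metadata, including the tag used to resolve ${tag}

This matches the report above that only one file is present for the problem chunk: without the .meta file, ${tag} cannot be resolved.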

@daipom
Contributor

daipom commented Sep 27, 2024

If the .meta file is removed accidentally, the tag information is lost.
So, it is very difficult for Fluentd to recover such data.

@raulgupto
Author

I understand that without a destination you don’t know where to send it. But since retry_forever is true, fluentd keeps retrying this chunk. What I’ve noticed is that, instead of just waiting for this one chunk to be flushed, the Fluentd process does not go down but consumes the whole buffer space and remains stuck forever. There is a workaround of manually clearing that buffer, but that requires manual intervention to delete the buffer in a production environment, which is not sustainable.
Either we should drop the chunk that is corrupted, i.e. the one without an endpoint address, or we should flush it to the currently configured address. The latter seems wrong, because ${tag} and fields like it are supposed to be resolved dynamically; also, if someone had changed the config to a new http address, a chunk meant for the old address would go to the new one.
I’d go with dropping the ill-configured buffers.

@raulgupto
Author

Another approach is to find a way to prevent this problem from appearing in the first place. I’ve seen it appear frequently: around 3-5 unique hosts out of 160 are facing this on a monthly basis. Is there any existing config change that would fix this issue?

@daipom
Contributor

daipom commented Sep 27, 2024

To address the root cause, please investigate why some buffer files are disappearing.
Is it a bug in Fluentd or an external factor?

If this may be a bug in Fluentd, we need to find out how to reproduce this phenomenon to fix the bug.
(I can reproduce the error by manually removing some buffer files. On the other hand, some buffer files must have been lost for some reason in your environment. We need to find out the cause.)

But since retry_forever is true, fluentd keeps retrying this chunk. What I’ve noticed is that, instead of just waiting for this one chunk to be flushed, the Fluentd process does not go down but consumes the whole buffer space and remains stuck forever.

Some errors are considered non-retriable, and Fluentd gives up retrying.

As for the error in this issue, Fluentd keeps retrying; it is considered retriable in the current implementation.
So, if retry_forever is used, Fluentd retries flushing the chunk forever.

The issue may be improved if this is changed so that the error is treated as non-retriable.

There is a workaround of manually clearing that buffer, but that requires manual intervention to delete the buffer in a production environment, which is not sustainable.

You can stop using retry_forever and add a secondary.
This allows Fluentd to automatically save unexpected data to a file or another location without manual intervention.
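
As a rough sketch of that suggestion, based on the configuration posted in this issue (the retry limit and dump directory below are placeholders, not recommendations):

  <match **>
    @type http
    endpoint http://externalxx.com/v0/${tag}
    content_type application/json
    <format>
      @type json
    </format>
    <buffer tag>
      @type file
      path /local/data/fluentd
      retry_type periodic
      retry_wait 60s
      retry_max_times 30        # bound retries per chunk instead of retry_forever
    </buffer>
    <secondary>
      @type secondary_file
      directory /local/data/fluentd_failed   # hypothetical dump directory
    </secondary>
  </match>

With something like this, a chunk that keeps failing is eventually handed to the secondary output instead of occupying buffer space forever.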

Either we should drop the chunk that is corrupted, i.e. the one without an endpoint address, or we should flush it to the currently configured address.

Certainly, we should improve how the buffer is handled on this point.
If there is no corresponding .meta buffer file, it would be better for Fluentd to drop or back up the chunk.

@raulgupto
Author

raulgupto commented Sep 27, 2024

I’ll definitely add secondary_file. One question:
If I use retry_timeout / retry_max_times, how will my retries work in this case?

  1. If one buffer chunk has exhausted the retry parameter, Fluentd stops sending all buffer chunks.
    or
  2. If one buffer chunk has exhausted the retry parameter, that one chunk won’t be retried, but the others will still be retried the same number of times.

I don’t want to stop after n tries or n duration; I want to keep retrying, assuming my endpoint will come back after recovering from failures / releases.
Edit: I tried secondary_file. It doesn’t resolve ${tag}. I have <match **> as my match condition.
I wanted to separate out in the dump which log chunks have failed so that I could manually send them to the endpoint.

@daipom
Contributor

daipom commented Oct 2, 2024

@raulgupto Sorry for my late response.

If I use retry_timeout / retry_max_times, how will my retries work in this case?

1. If one buffer chunk has exhausted the retry parameter, Fluentd stops sending all buffer chunks.
   or

2. If one buffer chunk has exhausted the retry parameter, that one chunk won’t be retried, but the others will still be retried the same number of times.

2 is correct.
Fluentd handles retries for each chunk.

@daipom
Contributor

daipom commented Oct 2, 2024

Edit: I tried secondary_file. It doesn’t resolve ${tag}. I have <match **> as my match condition.
I wanted to separate out in the dump which log chunks have failed so that I could manually send them to the endpoint.

Chunks that cannot resolve placeholders due to missing metafiles fail to be transferred.
The secondary_file handles such chunks, so it can't resolve ${tag} either.
If the metafile is lost, the tag information cannot be recovered.
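
If the goal is just to tell the failed chunks apart in the dump, one option worth trying is to key the dump file name on ${chunk_id}, which comes from the chunk itself rather than from the lost metadata. A sketch, with a placeholder directory:

  <secondary>
    @type secondary_file
    directory /local/data/fluentd_failed   # hypothetical dump directory
    basename dump.${chunk_id}              # chunk id is available even when the .meta file (and tag) is lost
  </secondary>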

@raulgupto
Author

Thank you for the secondary_file workaround. It will help to manually recover and send logs in case of failures. It would, however, be great if we could have retries or another solution that helps recover buffers when the .meta file is lost.

@daipom
Contributor

daipom commented Oct 4, 2024

If the .meta file is removed accidentally, the tag information is lost.
So, it is very difficult for Fluentd to recover such data.

So, it would be better to avoid the disappearance of buffer files.

Do you have any idea as to why the buffer file disappears?

Are multiple Fluentd processes running at the same time?

@raulgupto
Author

I’ve added graceful kill commands to stop the running process, and around 10 seconds of sleep before restarts.
However, we have a process monitor that checks whether fluentd is running; if not, it restarts it. So even during host maintenance or clean restarts, I don’t think there will be process duplication. But there is a chance of process kills and restarts, which ideally should not leave metadata half-written. Is there any flag I can use to prevent metadata corruption during restarts?

@daipom
Contributor

daipom commented Oct 28, 2024

Sorry for my late response.

So even during host maintenance or clean restarts, I don’t think there will be process duplication.

I see...

Is there any flag I can use to prevent metadata corruption during restarts?

No.
It's a very unusual case that some buffer files disappear.
It is highly likely that the factor is external to Fluentd, and without identifying it, it is difficult to consider specific measures.

We need a way to reproduce the phenomenon.

@pecastro

Hi all.

We've been experiencing the same problem with our fluentd 1.16.5.

Though we haven't been able to reproduce it, we can offer some clues regarding when/how we started seeing it.

Whilst running it under EKS we applied a VPA component to it, which meant automatic adjustment of the CPU and memory limits versus the hardcoded limits we had before.
In some situations (lower load periods) the VPA would dynamically lower the memory and CPU limits to much lower values than we had ever run fluentd with.
Subsequently, when under load, our fluentd would sometimes restart after OOMing and we'd start seeing those errors appear in the logs.

Our current working hypothesis is that when fluentd hit one of these conditions (OOM), it failed to write the .meta file for one of the received log chunks, thus leaving the chunk in an unprocessable state.
