Use single ElasticSearch index to store dependencies #2143

Closed
frittentheke opened this issue Mar 26, 2020 · 12 comments · Fixed by #2144

Comments

@frittentheke
Contributor

frittentheke commented Mar 26, 2020

Requirement - what kind of business use case are you trying to solve?

Using ElasticSearch as storage, and using it as efficiently as possible.

Problem - what in Jaeger blocks you from solving the requirement?

Currently the dependencies (System Architecture in the UI) are created "per day" and stored in a dedicated ElasticSearch index per day (see: https://github.com/jaegertracing/spark-dependencies/blob/master/jaeger-spark-dependencies-elasticsearch/src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java#L203).

The number of indices (actually the number of shards, but they are closely related) one uses to store data in ElasticSearch should be kept low, as they are not "free" (see https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster).

So especially when looking at the Jaeger span and service indices - which Jaeger learned to use the rollover API for, in order to keep the number of shards low - creating a new index for each day of dependencies and then putting only a single document into that index seems a little excessive.

Proposal - what do you suggest to solve the problem or improve the existing situation?

A coordinated switch in Jaeger as well as in the aforementioned external (Spark) job that creates the dependencies, so that they are simply stored within a single index with a field marking which day they belong to.
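
To make the idea concrete, a minimal sketch of the write side (assuming the olivere/elastic Go client; the index name and document fields here are illustrative, not the exact Jaeger/Spark implementation):

```go
package main

import (
	"context"
	"time"

	"github.com/olivere/elastic/v7"
)

// dependencyLink mirrors the shape of a single service dependency
// (parent service calls child service N times); illustrative only.
type dependencyLink struct {
	Parent    string `json:"parent"`
	Child     string `json:"child"`
	CallCount uint64 `json:"callCount"`
}

func writeDependencies(ctx context.Context, client *elastic.Client, ts time.Time, links []dependencyLink) error {
	// Today: one index per day holding a single document, e.g. jaeger-dependencies-2020-03-26.
	// dailyIndex := "jaeger-dependencies-" + ts.UTC().Format("2006-01-02")

	// Proposed: one shared index; the day is carried in the document itself.
	doc := map[string]interface{}{
		"timestamp":    ts,    // marks which day these links belong to
		"dependencies": links, // the aggregated service-to-service links
	}
	_, err := client.Index().
		Index("jaeger-dependencies").
		BodyJson(doc).
		Do(ctx)
	return err
}
```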

As for housekeeping: It's one doc per day ... so even if one never deletes any documents, that index would not explode in size. But if required / intended, this could be done in the Spark job as well,
as in "keep for x days" and then delete docs with a timestamp older than that cutoff.

Any open questions to address

@pavolloffay
Member

pavolloffay commented Mar 26, 2020

Related to jaegertracing/spark-dependencies#68

@pavolloffay
Member

@frittentheke would you like to submit a PR?

@frittentheke
Contributor Author

frittentheke commented Mar 27, 2020

@pavolloffay Sure. What should that PR contain then?

The change to the es storage module (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go) so that it reads from and writes to that single index, I suppose. The query already uses the timestamp field (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go#L111), so that would not even need changing.
That would then be fully transparent to the UI, right?
Certainly I could also throw together a little PR for the Spark job again (jaegertracing/spark-dependencies#86) to keep compatibility.
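
For illustration, the read path could then look roughly like this (a sketch assuming the olivere/elastic Go client and a document carrying a "timestamp" plus a "dependencies" array; not the exact storage.go code):

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/olivere/elastic/v7"
)

type timeDependencies struct {
	Timestamp    time.Time        `json:"timestamp"`
	Dependencies []dependencyLink `json:"dependencies"`
}

type dependencyLink struct {
	Parent    string `json:"parent"`
	Child     string `json:"child"`
	CallCount uint64 `json:"callCount"`
}

// getDependencies reads all dependency documents within the lookback window
// from the single shared index; the timestamp range query replaces the list
// of daily index names used today.
func getDependencies(ctx context.Context, client *elastic.Client, endTs time.Time, lookback time.Duration) ([]dependencyLink, error) {
	res, err := client.Search("jaeger-dependencies").
		Query(elastic.NewRangeQuery("timestamp").Gte(endTs.Add(-lookback)).Lte(endTs)).
		Size(1000). // one document per day, so a small page size is plenty
		Do(ctx)
	if err != nil {
		return nil, err
	}
	var links []dependencyLink
	for _, hit := range res.Hits.Hits {
		var doc timeDependencies
		if err := json.Unmarshal(hit.Source, &doc); err != nil {
			return nil, err
		}
		links = append(links, doc.Dependencies...)
	}
	return links, nil
}
```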

Maybe a topic for a separate issue, but if I may ask: What are your plans going forward regarding producing those dependencies?
Currently the Spark job uses JavaEsSpark.esJsonRDD, which has no optimizations (DataFrames and their pushdown - https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-pushdown - would). So apart from the plain query I added in PR jaegertracing/spark-dependencies#86, all docs are fetched and instantiated into full Span objects, even though not all fields of the spans are required for the dependency extraction. This causes many gigabytes of data to be transferred and a massive memory footprint, as well as churn on the JVM running the job.

Also, the write to the dependency storage is not done via the API but directly to ElasticSearch - hence the need to "fix" both ends of the equation.

While all of Jaeger is Golang, running Java code and on top of that using the Spark framework seems a bit overly complex - at least where ElasticSearch is concerned. See my comments regarding using the ES terms API (jaegertracing/spark-dependencies#68 (comment)) to keep all of the heavy lifting within the ElasticSearch cluster, with only minuscule amounts of data having to be transferred.
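
As a rough illustration of that direction (not a full dependency computation - it only shows how an aggregation keeps the counting inside the cluster and returns tiny results; assumes the olivere/elastic Go client and Jaeger's process.serviceName span field):

```go
package main

import (
	"context"

	"github.com/olivere/elastic/v7"
)

// serviceCallCounts asks ElasticSearch for per-service span counts via a terms
// aggregation; only the buckets travel over the wire, not the span documents.
func serviceCallCounts(ctx context.Context, client *elastic.Client) (map[string]int64, error) {
	agg := elastic.NewTermsAggregation().Field("process.serviceName").Size(1000)
	res, err := client.Search("jaeger-span-*").
		Size(0). // no hits needed, only aggregation results
		Aggregation("services", agg).
		Do(ctx)
	if err != nil {
		return nil, err
	}
	counts := map[string]int64{}
	if buckets, found := res.Aggregations.Terms("services"); found {
		for _, b := range buckets.Buckets {
			counts[b.Key.(string)] = b.DocCount
		}
	}
	return counts, nil
}
```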

But even keeping the current approach, using plain Golang and an ElasticSearch client to iterate over the data would at least keep the Jaeger components similar.
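
A sketch of what such a plain-Go iteration could look like (assuming the olivere/elastic Go client and Jaeger's span document layout; the actual dependency aggregation logic is left out):

```go
package main

import (
	"context"
	"encoding/json"
	"io"

	"github.com/olivere/elastic/v7"
)

// iterateSpans scrolls over the span indices and hands each span's service
// name and span ID to a caller-provided callback, unmarshalling only the
// fields needed for dependency extraction.
func iterateSpans(ctx context.Context, client *elastic.Client, handle func(serviceName, spanID string)) error {
	scroll := client.Scroll("jaeger-span-*").Size(1000)
	for {
		res, err := scroll.Do(ctx)
		if err == io.EOF {
			return nil // all pages consumed
		}
		if err != nil {
			return err
		}
		for _, hit := range res.Hits.Hits {
			var s struct {
				SpanID  string `json:"spanID"`
				Process struct {
					ServiceName string `json:"serviceName"`
				} `json:"process"`
			}
			if err := json.Unmarshal(hit.Source, &s); err != nil {
				return err
			}
			handle(s.Process.ServiceName, s.SpanID)
		}
	}
}
```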

@pavolloffay
Member

The UI does not have to be changed. We just need to change the writer (the writer is not used, though) and the reader. The dependency storage impl should use the same index names as the span storage impl;
IIRC those are jaeger-span-read and jaeger-span-write.

The index cleaner and rollover scripts will also have to be changed to support rollover.
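
A sketch of what that index-name handling might look like on the dependency-store side (the alias names jaeger-dependencies-read / jaeger-dependencies-write are assumptions by analogy with the span aliases mentioned above, not confirmed names):

```go
package main

import "time"

// dependencyIndexNames returns the indices the dependency store reads from and
// writes to. With aliases/rollover enabled, both point at stable alias names;
// otherwise the legacy per-day index name is used. Purely illustrative.
func dependencyIndexNames(prefix string, useAliases bool, date time.Time) (read, write string) {
	if useAliases {
		return prefix + "jaeger-dependencies-read", prefix + "jaeger-dependencies-write"
	}
	daily := prefix + "jaeger-dependencies-" + date.UTC().Format("2006-01-02")
	return daily, daily
}
```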

Maybe a topic for a separate issue, but if I may ask: What are your plans going forward regarding producing those dependencies?

Any improvements to the ES queries from the spark-dependencies job are welcome. Please create a separate issue.

But even keeping the current approach, using plain Golang and an ElasticSearch client to iterate over the data would at least keep the Jaeger components similar.

There are no plans to rewrite the current jobs in Golang. The data aggregation jobs are memory heavy, and in prod systems with a lot of data they might require running a Spark/Flink cluster. The plan was to provide more aggregation jobs, hence frameworks like Spark are useful.

@frittentheke
Contributor Author

The UI does not have to be changed. We just need to change the writer (the writer is not used, though) and the reader. The dependency storage impl should use the same index names as the span storage impl;
IIRC those are jaeger-span-read and jaeger-span-write.

The index cleaner and rollover scripts will also have to be changed to support rollover.

I was actually not suggesting / implying the use of rollover for storing dependencies, but just a single index. There are so few documents holding dependencies (currently it's one per day) that it makes no sense to roll over.

But thinking about it: Using rollover in conjunction with ILM (ElasticSearch Index Lifecycle Management) might make sense just for the much easier housekeeping. Then no external job would be required to delete old indices / data; ElasticSearch would simply roll over and expire indices to your liking, fully transparent to the application. We run this setup for the spans / services with great success.
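
For illustration, the kind of ILM policy meant here can be applied with a single request (the policy name, rollover age and retention below are made-up values, and the endpoint assumes a local ElasticSearch):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// An illustrative ILM policy: roll the write index over after 30 days and
// delete indices 90 days after rollover; ElasticSearch then handles all
// housekeeping without an external cleaner job.
const ilmPolicy = `{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "30d" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}`

func main() {
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_ilm/policy/jaeger-dependencies-policy",
		bytes.NewBufferString(ilmPolicy))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("ILM policy response:", resp.Status)
}
```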

Maybe a topic for a separate issue, but if I may ask: What are your plans going forward regarding producing those dependencies?

Any improvements to the ES queries from the spark-dependencies job are welcome. Please create a separate issue.

See jaegertracing/spark-dependencies#88

@frittentheke
Contributor Author

frittentheke commented Mar 30, 2020

@pavolloffay I just pushed a PR: #2144
If you happen to like that one - I added the write alias to the Spark job in my PR jaegertracing/spark-dependencies#86 as well; see: jaegertracing/spark-dependencies@ec4c28a

@pavolloffay
Member

Slightly off-topic question: is ES ILM free to use? It's marked as an x-pack feature, which is a paid extension: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html

@pavolloffay
Member

I was actually not suggesting / implying the use of rollover for storing dependencies, but just a single index. There are so few documents holding dependencies (currently it's one per day) that it makes no sense to roll over.

I am not sure how feasible that would be, given that the index can last for year(s) and there is no way to remove old documents from it.

@frittentheke
Contributor Author

frittentheke commented Mar 30, 2020

Slightly off-topic question: is ES ILM free to use? It's marked as an x-pack feature, which is a paid extension: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html

Yes @pavolloffay, it is included in the free tier (no cost) ... see https://www.elastic.co/subscriptions
But with its smart rules on when to do a rollover and when to shrink or delete indices, it really is great not having to run external jobs (like the curator). Even Jaeger currently "has to" provide the housekeeping for the ElasticSearch storage, even though I believe the curator (https://github.com/elastic/curator) with a bit of config could be a good replacement and free you from maintaining esCleaner.py and esRollover.py (https://github.com/jaegertracing/jaeger/tree/master/plugin/storage/es) altogether.

@pavolloffay
Member

The scripts esCleaner.py and esRollover.py use curator under the hood, but instead of the curator's configuration files we use the programmatic API. We could not use just the config files because we needed to perform more actions than were possible with them.

@AhHa45

AhHa45 commented Sep 7, 2021

Any news?

@frittentheke
Contributor Author

@AhHa45 yes. I refactored my change to add ES alias / rollover support to Jaeger - check out: #2144

albertteoh added a commit that referenced this issue Feb 3, 2022
…esolves #2143) (#2144)

* Add support for ES index aliases / rollover to the dependency store

 * Give DependencyStore a params struct like the SpanStore to carry its configuration parameters
 * Adapt and extend the tests accordingly

Signed-off-by: Christian Rohmann <[email protected]>

* Extend es-rollover and es-index-cleaner to support rolling dependencies indices

Signed-off-by: Christian Rohmann <[email protected]>

Co-authored-by: Christian Rohmann <[email protected]>
Co-authored-by: Albert <[email protected]>