Zipkin 3.0.6
Zipkin 3.0.6 updates to Armeria 1.27.1, fixes ES_HTTP_LOGGING and a glitch in Eureka registration.
- Armeria 1.27.1 helped us remove code around Eureka, which is now upstream, as well as bring the server runtime to the latest Netty.
- ES_HTTP_LOGGING broke when we updated to SLF4J 2. @reta resolved this by bringing us back to the more compatible 1.7, plus a config adjustment.
- Those using spring-cloud-sleuth were unable to discover zipkin even when it set env like EUREKA_HOSTNAME.
Eureka and spring-cloud
Skip this part unless you want to take a walk with us down troubleshooting lane!
Eureka is a service registry originally started at Netflix. Zipkin can register itself in Eureka, allowing traced services to discover its listen address and health state. So, this is an alternative to normal DNS. We added support for this in Zipkin 2.27 and have been polishing that since.
Before, we were testing Eureka integration with Armeria. Armeria doesn't use the Netflix/Eureka codebase at all, as it implements the Eureka API directly. This is great for Armeria users, as the Netflix/Eureka codebase uses a lot of antique dependencies, some not updated in 8 years. However, it isn't a good test for zipkin for the same reason.
Most users who use Eureka use Spring Boot 2, and most of those who use zipkin use spring-cloud-sleuth (which uses brave internally). To gain confidence that registration works in practice, we decided to update our sleuth example to use Eureka. The idea was to set a pseudo hostname in the zipkin endpoint, which would be replaced dynamically by a real endpoint in the "zipkin" application in Eureka. Then, we're all good.
But, we weren't all good. This didn't work at all, as our example used a reactive WebFlux configuration. For some reason, when a sleuth-instrumented application is using reactive, you cannot use Eureka to discover zipkin. So, we backported our sleuth example to a version that can use Eureka. Ironically, we had to go back to WebMvc which was the original canonical zipkin example! However, despite webmvc5-sleuth using the right parts, the pseudo zipkin hostname wasn't replaced.
On close inspection, the first thing we noticed was something documented, but not entirely intuitive. Documentation says to use the "service ID" as the pseudo-hostname in the zipkin URL, which would be replaced with the real hostname and port. In the case of Eureka, it seems intuitive to use the service to find instances of it, specifically the Eureka application (EUREKA_APP_NAME of all zipkin instances). However, the "service ID" is not that, and it isn't even the instanceId in Eureka. Oddly, the "service ID" maps to the vipAddress field in Eureka, which is actually an instance's hostname! So, the strange thing is that the pseudo-hostname is actually the real hostname!
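To make the field mix-up concrete, here is a minimal sketch of a lookup that matches a "service ID" against vipAddress rather than the application name. The record is a simplified, made-up Eureka instance (real records carry more fields), and find_by_service_id is a hypothetical helper, not sleuth's actual code.

```python
# Simplified, made-up Eureka instance record. Real records have more fields,
# and the values here are illustrative only.
instances = [
    {
        "app": "ZIPKIN",                 # Eureka application name
        "instanceId": "zipkin-server:zipkin:9411",
        "hostName": "zipkin-server",
        "vipAddress": "zipkin-server",   # what the "service ID" is matched against
        "port": 9411,
    },
]


def find_by_service_id(instances, service_id):
    """Return instances whose vipAddress equals service_id.

    Hypothetical helper: exact string comparison is an assumption made
    for illustration.
    """
    return [i for i in instances if i["vipAddress"] == service_id]


# Looking up by application name finds nothing, while the hostname does.
by_app = find_by_service_id(instances, "ZIPKIN")
by_vip = find_by_service_id(instances, "zipkin-server")
```

Under these assumptions, the intuitive choice (the application name) returns no instances, which matches the behavior described above.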
Fine, so we put the vipAddress zipkin registered into Eureka into the hostname field as a quasi hostname, but still it didn't work. Stepping through a debugger, we found that if there is a port in the hostname (e.g. zipkin-server:9411) the configuration code assumes it is not something to look up, but rather something already resolved. This led to a realization that the vipAddress having a port encoded was actually a config default bug, but a simple one to work around. In 3.0.6, when someone sets EUREKA_HOSTNAME, we also set vipAddress explicitly to avoid the default that accidentally adds a port.
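The interaction between the port check and the old vipAddress default can be sketched as follows. This is an illustration of the behavior described above, not the actual spring-cloud code: resolve_base_url and both registries are hypothetical, and the registry is reduced to a dict from vipAddress to (host, port).

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_base_url(base_url, registry):
    """Replace the host of base_url with an instance found in registry.

    Hypothetical sketch: registry maps vipAddress -> (host, port).
    A URL that already has a port is assumed to be resolved and is
    returned unchanged, mirroring what we saw in the debugger.
    """
    parts = urlsplit(base_url)
    if parts.port is not None:
        return base_url  # treated as already resolved, no lookup
    instance = registry.get(parts.hostname)
    if instance is None:
        return base_url  # nothing registered under that vipAddress
    host, port = instance
    return urlunsplit(
        (parts.scheme, f"{host}:{port}", parts.path, parts.query, parts.fragment)
    )


# Old default: the vipAddress itself carried a port, so no port-less
# hostname could ever match it.
old_registry = {"zipkin-server:9411": ("zipkin-server", 9411)}
# 3.0.6: vipAddress is set explicitly, without a port.
new_registry = {"zipkin-server": ("zipkin-server", 9411)}

unresolved = resolve_base_url("http://zipkin-server/", old_registry)
resolved = resolve_base_url("http://zipkin-server/", new_registry)
```

With the old registry the URL comes back untouched either way: with a port it is never looked up, and without one it matches nothing. With the new registry the port-less hostname picks up the registered port.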
Voilà! Finally, we're all good: sleuth replaces vipAddress with that same address and also a port, and it could have only gotten that from Eureka info. While it feels like a lot of work to accomplish little, people will still get the other benefits of Eureka (specifically spring-cloud-netflix use of it), including health checking and discovery of other endpoints besides the one you knew about and stuffed into the zipkin URL. While not as ideal as specifying the app name, this approach isn't completely unique to spring. Other technologies sometimes ask for "well known addresses" in order to find the rest of a cluster.
Through comments and issue links in the webmvc5-sleuth example, we containerized this hard-earned experience, to save future maintainers the work of figuring it all out again. In other words, they don't have to read these release notes and can just use the working binary.
The moral of the story is: integration test things twice or three times if you can, as some behaviors are not necessarily intuitive. If you have more integrations, all the strange things will present themselves. While painful to get through all of the troubleshooting, it is definitely better to have the project bear this weight than to rely on end users to figure it out!
Follow-up
Immediately after this release, spring-cloud-sleuth released 3.1.11, which fixed WebFlux discovery with Eureka. Hence, we updated our webflux5-sleuth example to use Eureka too, while still keeping the webmvc one. All our Eureka-compatible examples are now integration tested against a real Eureka server on change, to prevent unnoticed regressions in the future.
Full Changelog: https://github.com/openzipkin/zipkin/compare/3.0.5..3.0.6