This project reads trace telemetry (segment documents) from the AWS X-Ray REST API, converts it, and forwards it to an OpenTelemetry OTLP-compatible endpoint.
It enables an observability solution to analyze trace telemetry captured directly via e.g. OpenTelemetry together with telemetry from X-Ray-instrumented AWS services. Especially for fully managed (serverless) services such as Amazon API Gateway, which ONLY support tracing via X-Ray, this integration provides much better insights and end-to-end visibility.
Trace exported into Dynatrace
As AWS X-Ray uses its own proprietary trace context, a transaction that passes through multiple tracing systems, such as X-Ray and OpenTelemetry (using W3C Trace Context), will generate separate traces. To follow such a transaction you need to correlate the traces by capturing the trace context of the incoming, foreign tracing system. This concept is also called span linking.
AWS services with X-Ray enabled include X-Ray trace IDs in their log events. You can either look up the logs by the original AWS X-Ray trace context, which is included as span attributes (aws.xray.trace.id and aws.xray.segment.id), or you can transform the X-Ray trace context in log events into the W3C trace context, matching the conversion applied by XRay2OTLP.
Whereas the span-id is derived from the segment-id without any further modification, the trace-id is converted using the logic SUBSTR(REPLACE_STRING(traceId, "-", ""), 1).
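For illustration, the same transformation applied to a made-up X-Ray trace ID (removing the dashes, then dropping the leading version digit) yields the 32-character W3C trace ID:
# Example X-Ray trace ID in the format 1-<8 hex chars epoch>-<24 hex chars random>
echo "1-5759e988-bd862e3fe1be46a994272793" | tr -d '-' | cut -c2-
# -> 5759e988bd862e3fe1be46a994272793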
An example of such a log processing rule using Dynatrace is
PARSE(content, "JSON{STRING:traceId}(flat=true)")
| PARSE(content, "JSON{STRING:segmentId}(flat=true)")
| FIELDS_ADD(dt.trace_id:SUBSTR(REPLACE_STRING(traceId, "-", ""), 1))
| FIELDS_REMOVE(traceId)
| FIELDS_RENAME(dt.span_id: segmentId)
for structured logs or
PARSE(content, "DATA 'XRAY TraceId:' SPACE? STRING:TraceId DATA 'SegmentId:' SPACE? STRING:SegmentId")
| FIELDS_ADD(TraceId,SegmentId)
| FIELDS_ADD(dt.trace_id:SUBSTR(REPLACE_STRING(TraceId, "-", ""), 1))
| FIELDS_REMOVE(TraceId)
| FIELDS_RENAME(dt.span_id: SegmentId)
for unstructured logs.
XRayConnector implements the workflow that polls the AWS X-Ray REST API, transforms the data, and forwards it to an OpenTelemetry OTLP-compatible endpoint. The transformation semantics for converting AWS X-Ray segment documents to OTLP are implemented in the XRay2OTLP library.
XRayConnector provides a REST API to manage the workflow.
The supported OpenTelemetry protocol is OTLP/HTTP in JSON format.
The workflow is implemented using Durable Functions, which abstract away the complexity of managing a fault-tolerant and reliable polling mechanism: behind the scenes the framework manages state, checkpoints, and automatic restarts.
Durable Functions are powered by the Durable Task Framework (DTFx), which supports an extensible set of backend persistence stores. For this project the DurableTask.SqlServer extension is used to provide a cross-platform deployment using Kubernetes.
For more details about the architecture, scaling and operations using the SQLServer extension on K8s read here
The AWS X-Ray REST API is subject to throttling when the rate limit of 5 requests per second is reached. This limits the total number of traces that can be pulled. The number of requests required for a given number of traces can be estimated with the following formula:
(NumberOfTraces * (1 + 20*RoundUp(AvgNumberofServicesPerTrace/10)))/100
This number is helpful for understanding the request limits when optimizing the polling interval to balance telemetry latency against the maximum number of traces/requests that can be exported from X-Ray. For more details see section Monitoring.
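As a quick sanity check, the formula can be evaluated for an assumed load of 1000 traces with an average of 15 services per trace (both numbers are purely illustrative):
# Assumed values for illustration only
TRACES=1000
AVG_SERVICES_PER_TRACE=15
# RoundUp(AvgNumberofServicesPerTrace/10) via integer arithmetic
ROUNDUP=$(( (AVG_SERVICES_PER_TRACE + 9) / 10 ))
# (NumberOfTraces * (1 + 20*RoundUp(AvgNumberofServicesPerTrace/10))) / 100
echo $(( TRACES * (1 + 20 * ROUNDUP) / 100 ))   # -> 410
At the 5 requests/second limit, roughly 410 requests correspond to about 82 seconds of pure API budget per polling run.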
For reading from the AWS X-Ray REST API, create an AWS access key with a policy that includes at least the following actions: xray:BatchGetTraces and xray:GetTraceSummaries.
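A minimal sketch of such a policy, assuming it is created via the AWS CLI (the policy name xray-connector-read is just an example):
# Minimal read-only policy containing the two actions named above (policy name is illustrative)
cat > xray-connector-read.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["xray:BatchGetTraces", "xray:GetTraceSummaries"],
      "Resource": "*"
    }
  ]
}
EOF
aws iam create-policy --policy-name xray-connector-read --policy-document file://xray-connector-read.json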
The default configuration uses a polling interval of 5 minutes to retrieve recent traces.
The XRayConnector pod is configured to use up to 5 workers, which should be sufficient to run the workflow in most scenarios, but it is recommended to test under load conditions. You should consider scaling out workers if your database tables dt.NewEvents or dt.NewTasks start queuing up unprocessed events.
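A quick way to check the queue depth of those tables is a sketch like the following, reusing the $mssqlPod and $mssqlPwd variables from deployment Step 4 below:
# Count unprocessed events/tasks in the DurableDB queue tables
kubectl exec -n xrayconnector-mssql $mssqlPod -- /opt/mssql-tools18/bin/sqlcmd -C -S . -U sa -P $mssqlPwd -d DurableDB -Q "SELECT (SELECT COUNT(*) FROM dt.NewEvents) AS NewEvents, (SELECT COUNT(*) FROM dt.NewTasks) AS NewTasks"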
The database is deployed using a StatefulSet with 8 GiB storage. As DTFx is based on the event-sourcing pattern, the database can grow very fast. An automatic purge of the history is implemented as a cronjob in xrayconnector.yml. The cronjob calls the /api/PurgeHistory endpoint every 6 minutes and keeps at least 30 minutes of historic events.
The xrayconnector.yml also includes a cronjob that automatically calls /api/WorkflowWatchdog to check the status of the workflow every 3 minutes.
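For debugging, the same maintenance endpoints that the cronjobs call can also be invoked manually; a rough sketch using curl from inside the cluster (host name, function key, and the 30-minute retention value are placeholders/examples):
# Purge workflow history older than 30 minutes (what the purge cronjob effectively does)
curl -X POST "http://xrayconnector/api/PurgeHistory?code=<YOUR-FUNCTION-HOST-MASTER-KEY>" -H "content-type: text/plain" -d "30"
# Run the workflow watchdog once (what the watchdog cronjob does every 3 minutes)
curl -X POST "http://xrayconnector/api/WorkflowWatchdog?code=<YOUR-FUNCTION-HOST-MASTER-KEY>"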
Step 1) KEDA v2 is a prerequisite. Make sure KEDA is up and running.
For more details on how to install KEDA, see
Step 2) Build the XRayConnector container and push it to your target repository
# Replace '<YOUR-REPOSITORY>' with your target container registry
docker build -t xrayconnector:latest -f ./xrayconnector/Dockerfile .
docker tag xrayconnector:latest <YOUR-REPOSITORY>/xrayconnector:latest
docker push <YOUR-REPOSITORY>/xrayconnector:latest
Step 3) Configure database mssql-statefulset-secrets.yml
Replace PLACEHOLDER with your password of choice to access the database.
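As an optional sanity check (assuming the file was edited in place), make sure no placeholder is left:
# Should print nothing once the placeholder has been replaced
grep -n "PLACEHOLDER" ./mssql-statefulset-secrets.yml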
Step 4) Deploy mssql server and create the database
kubectl create namespace xrayconnector-mssql
kubectl apply -f ./mssql-statefulset-secrets.yml -n xrayconnector-mssql
kubectl apply -f ./mssql-statefulset.yml -n xrayconnector-mssql
# Once pod is ready...
# ..get the name of the pod running SQL Server
$mssqlPod = kubectl get pods -n xrayconnector-mssql -o jsonpath='{.items[0].metadata.name}'
# Use sqlcmd to create a database named "DurableDB".
# Replace 'PLACEHOLDER' with the password you used earlier
$mssqlPwd = "PLACEHOLDER"
kubectl exec -n xrayconnector-mssql $mssqlPod -- /opt/mssql-tools18/bin/sqlcmd -C -S . -U sa -P $mssqlPwd -Q "CREATE DATABASE [DurableDB] COLLATE Latin1_General_100_BIN2_UTF8"
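To verify the database was created, a minimal check (reusing the variables from above) is to list the databases on the instance:
# "DurableDB" should appear in the output
kubectl exec -n xrayconnector-mssql $mssqlPod -- /opt/mssql-tools18/bin/sqlcmd -C -S . -U sa -P $mssqlPwd -Q "SELECT name FROM sys.databases"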
Step 5) Configure the polling & forwarding of X-Ray data in connector-config.yml
Replace the placeholders with proper values for AWS secrets, OTLP endpoints, etc.
...
# - - - Database provider - - -
# Connection string, replace the <YOUR-DATABASE-PASSWORD> with your actual password as configured in mssql-statefulset-secrets.yml
SQLDB_Connection: "Server=mssqlinst.mssql.svc.cluster.local;Database=DurableDB;User ID=sa;Password=<YOUR-DATABASE-PASSWORD>;Persist Security Info=False;TrustServerCertificate=True;Encrypt=True;"
# - - - AWS IAM identifiers to access X-Ray API - - -
# Role-based access, for using temporary credentials (recommended, optional)
AWS_RoleArn: "<YOUR-ROLE-ARN>"
# https://docs.aws.amazon.com/general/latest/gr/xray.html#xray_region
# us-east-1, ap-southeast-2, etc.
AWS_RegionEndpoint: "<YOUR-AWS-REGION>"
# Basic IAM credentials
AWS_IdentityKey: "<YOUR-AWS-IDENTITY-KEY>"
AWS_SecretKey: "<YOUR-AWS-SECRET-KEY>"
# - - - Workflow configuration - - -
# Polling interval/window for retrieving trace telemetry from the X-Ray API. Default is 180 (3 min)
PollingIntervalSeconds: "300"
# When polling is restarted, the maximum timespan to catch up before the timeframe gets reset. Too large a window can cause a polling jam. Default is 900 (15 min).
DefaultMaximumReplayHistorySeconds: "900"
# If set to True the workflow is automatically started (or re-started in case it was terminated or failed) when the api/WorkflowWatchdog is called. Default is "False".
AutoStart: "True"
# If set to True (recommended), the internal processing of trace details is compressed. This improves workflow performance and reduces I/O load on the database, but slightly increases CPU usage. Default is "False".
EnableJsonPayloadCompression: "True"
# - - - Target OTLP configuration - - -
# Target OTLP endpoint for sending telemetry. For Dynatrace this may look like this: "https://<YOUR-TENANT-ID>.live.dynatrace.com/api/v2/otlp/"
OTLP_ENDPOINT: "<YOUR-OTLP-TARGET-ENDPOINT>"
# Optional: OTLP header authorization (only used for traces). For Dynatrace provide an API Token with OTLP Trace Ingest permissions in the following format "Api-Token <YOUR-DYNATRACE-API-TOKEN>"
OTLP_HEADER_AUTHORIZATION: "<YOUR-OPTIONAL-OTLP-HEADER-AUTHORIZATION>"
# - - - Workflow telemetry - - -
# Enable workflow metrics sent via OTLP. Default is "False".
EnableMetrics: "True"
# Metrics exporter configuration. See also: https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/
# Metrics OTLP protocol. Possible values: "grpc", "http/protobuf". Default is "grpc".
# OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: "<PROTOCOL>"
# OTLP endpoint for sending metrics. If not set, OTLP_ENDPOINT will be used.
# OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: "<YOUR-OTLP-TARGET-ENDPOINT>"
# OTLP metrics headers for e.g. authorization
# OTEL_EXPORTER_OTLP_METRICS_HEADERS: "<YOUR-OPTIONAL-OTLP-HEADERS>"
# - - - TESTING ONLY - - -
# Uses a mocked XRay API Client that simulates API responses
# SimulatorMode: "XRayApi"
# Number of total traces returned for a TraceSummaries call
# SIM_TraceSummariesResponseCount: "100"
# Maximum number of traces returned per TraceSummaries request (to force paging)
# SIM_TraceSummariesPageSize: "25"
# Each simulated trace contains 5 segments. Configure whether segments should be returned in a single batch ("None"), always batched ("Always"), or randomly (~25%) batched ("Random"). If batching is enabled, 2 batches are returned.
# SIM_BatchTraceSegments: "Random"
Step 6) Configure the function keys and registry in xrayconnector.yml
- (Recommended) Replace all function keys (host.master, host.function.default, ..), which protect your functions, with new base64-encoded keys.
  - Generate a new key with e.g. OpenSSL: openssl rand -base64 32
  - Base64-encode the returned key: echo -n '<THE NEW KEY>' | base64
- (Recommended) Replace the host.master key used in the xrayconnector-watchdog cronjob (http://xrayconnector/api/WorkflowWatchdog?code=<REPLACE-WITH-THE-NEW-KEY>) with the newly created key.
- Replace <YOUR-REPOSITORY> with the container registry hosting your image.
Step 7) Deploy config and XRayConnector
kubectl create namespace xrayconnector
kubectl apply -f .\connector-config.yml -n xrayconnector
kubectl apply -f .\xrayconnector.yml -n xrayconnector
Checking deployment status...
kubectl get pods -n xrayconnector
kubectl rollout status deployment xrayconnector -n xrayconnector
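If the pods do not become ready, inspecting the logs is usually the fastest way to spot configuration problems (e.g. a wrong database password or OTLP endpoint); a minimal sketch:
# Tail the connector logs
kubectl logs deployment/xrayconnector -n xrayconnector --tail=100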
See test.http, which provides API requests to be run in Visual Studio Code (VSCode) via the REST Client extension.
If autostart is disabled, you need to manually trigger the main workflow timer.
POST https://xxxx/api/TriggerPeriodicAPIPoller?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
Manually stop the workflow. Does not disable autostart!
POST https://xxxx/api/TerminatePeriodicAPIPoller?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
Checks the status of the workflow. If autostart is enabled, enforces a start of the workflow.
POST https://xxxx/api/WorkflowWatchdog?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
Purges the workflow history for completed, failed or terminated instances. Optionally provide a timespan in minutes to only delete history older than X minutes.
POST https://xxxx/api/PurgeHistory?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
content-type: text/plain
360
A simple endpoint to see if the API is up & running
GET https://xxxx/api/TestPing?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
Sends a sample trace into X-Ray. This feature requires additional actions granted in your AWS IAM policy: xray:PutTelemetryRecords and xray:PutTraceSegments
POST https://xxxx/api/TestGenerateSampleTrace?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
Sends a sample trace to the configured OTLP endpoint to validate connection settings.
POST https://xxxx/api/TestSendSampleTrace?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
XRayConnector provides several metrics to monitor its execution:
| Metric name | Metric type | Unit | Additional dimensions | Description |
|---|---|---|---|---|
| workflow.polling_interval | Gauge | Milliseconds | workflow_instance, label | Time spent during the polling interval |
| api_calls | Counter | Count | api, account, operation, paged, replay | Number of api requests |
| api_response_objects | Counter | Count | api, account, objectname | Number of objects returned |
| process_memory_usage.total | Gauge | Bytes | | Process total memory |
| process_memory_usage.gc_heap | Gauge | Bytes | | Process managed memory |
Dynatrace Sample Dashboard Configuration
The following snapshot shows a test run using the X-Ray API simulator mode. It demonstrates an initially poor utilization of the X-Ray API combined with an unnecessarily high latency of the exported data, and how it looks after a configuration optimization is applied.
In the first phase, a large polling interval is used and the majority of the time during a run is spent waiting, which can be seen in the "Workflow execution time breakdown" (1), as well as in the request numbers dropping to zero (2).
After a configuration change, reducing the polling interval so that only a minimal wait remains in each interval, we see a constant utilization of the X-Ray API and processing of traces during the whole period (3). Note: The rise in total API calls and traces is caused by the simulator returning a fixed number of traces on each call, independent of the polling interval.
The sample also shows a seamless rolling update: the new pod runs in parallel and shows up as a second instance (4) before the first pod is terminated.
This is an open source project, and we gladly accept new contributions and contributors.
Licensed under Apache 2.0 license. See LICENSE for details.

