
AWS X-Ray Exporter for OpenTelemetry

This project reads trace telemetry (segment documents) from the AWS X-Ray REST API, converts it, and forwards it to an OpenTelemetry OTLP-compatible endpoint.

It enables an observability solution to analyze trace telemetry captured directly via e.g. OpenTelemetry together with telemetry from X-Ray-instrumented AWS services. Especially for fully managed (serverless) services such as Amazon API Gateway, which ONLY support tracing via X-Ray, integrating the X-Ray data gives much better insights and end-to-end visibility.

Original Trace in X-Ray

Trace exported into Dynatrace

Trace correlation

As AWS X-Ray uses its own proprietary trace context, a transaction that passes through multiple tracing systems such as X-Ray and OpenTelemetry (using W3C Trace Context) will generate separate traces. To follow such a transaction you need to correlate the traces by capturing the trace context of the incoming, foreign tracing system. This concept is also called span linking.

Logs in context of traces

AWS services with X-Ray enabled include X-Ray trace IDs in their log events. You can either look up the logs by the original AWS X-Ray trace context, which is included as span attributes (aws.xray.trace.id and aws.xray.segment.id), or you can transform the X-Ray trace context in log events into the W3C trace context as used by the conversion in XRay2OTLP.

Whereas the span-id is derived from the segment-id without any further modification, the trace-id is converted using the logic SUBSTR(REPLACE_STRING(traceId, "-", ""), 1).
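For example, a (made-up) X-Ray trace ID 1-5759e988-bd862e3fe1be46a994272793 first has its dashes removed (15759e988bd862e3fe1be46a994272793) and then the leading version digit stripped, yielding the W3C trace ID 5759e988bd862e3fe1be46a994272793.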

An example of such a log processing rule using Dynatrace is

PARSE(content, "JSON{STRING:traceId}(flat=true)")
| PARSE(content, "JSON{STRING:segmentId}(flat=true)")
| FIELDS_ADD(dt.trace_id:SUBSTR(REPLACE_STRING(traceId, "-", ""), 1))
| FIELDS_REMOVE(traceId)
| FIELDS_RENAME(dt.span_id: segmentId)

for structured logs or

PARSE(content, "DATA 'XRAY TraceId:' SPACE? STRING:TraceId DATA 'SegmentId:' SPACE? STRING:SegmentId")
| FIELDS_ADD(TraceId,SegmentId)
| FIELDS_ADD(dt.trace_id:SUBSTR(REPLACE_STRING(TraceId, "-", ""), 1))
| FIELDS_REMOVE(TraceId)
| FIELDS_RENAME(dt.span_id: SegmentId)

for unstructured logs.

How does it work?

XRayConnector implements the workflow that polls the AWS X-Ray REST API, transforms the data, and forwards it to an OpenTelemetry OTLP-compatible endpoint. The data-transformation semantics for converting AWS X-Ray segment documents to OTLP are implemented in the XRay2OTLP library.

XRayConnector provides a REST API to manage the workflow.

The supported OpenTelemetry protocol is the OTLP/HTTP JSON format.
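For illustration only, a hand-written OTLP/HTTP JSON export to such an endpoint could look roughly like the following sketch; the /v1/traces path suffix, the Api-Token header, and all IDs and timestamps are placeholder assumptions, not output of this project:

# Hypothetical example: send a single span in OTLP/HTTP JSON format to the configured endpoint
curl -X POST "<YOUR-OTLP-TARGET-ENDPOINT>/v1/traces" \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Token <YOUR-DYNATRACE-API-TOKEN>" \
  -d '{
    "resourceSpans": [{
      "resource": { "attributes": [{ "key": "service.name", "value": { "stringValue": "otlp-json-example" } }] },
      "scopeSpans": [{
        "scope": { "name": "manual-test" },
        "spans": [{
          "traceId": "5759e988bd862e3fe1be46a994272793",
          "spanId": "53995c3f42cd8ad8",
          "name": "example-span",
          "kind": 2,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano": "1700000001000000000"
        }]
      }]
    }]
  }'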

Scalability & Portability

The workflow is implemented using Durable Functions, which abstracts away the complexity of managing a fault-tolerant and reliable polling mechanism: behind the scenes the framework manages state, checkpoints, and automatic restarts.

Durable Functions are powered by the Durable Task Framework (DTFx), which supports an extensible set of backend persistence stores. For this project the DurableTask.SqlServer extension is used to provide a cross-platform deployment using Kubernetes.

For more details about the architecture, scaling, and operations using the SqlServer extension on K8s, see the DurableTask.SqlServer documentation.

AWS X-Ray API-Limits

The AWS X-Ray REST API is subject to throttling when the rate limit of 5 requests per second is reached. This limits the total number of traces that can be pulled. The number of requests required for a given number of traces can be estimated with the following formula:

(NumberOfTraces * (1 + 20*RoundUp(AvgNumberofServicesPerTrace/10)))/100

This number helps in understanding the request limits when optimizing the polling interval, balancing telemetry latency against the maximum number of traces/requests that can be exported from X-Ray. For more details see the section Monitoring.
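For example, assuming 1,000 traces with an average of 15 services per trace (a made-up workload):

(1000 * (1 + 20*RoundUp(15/10)))/100 = (1000 * 41)/100 = 410 requests

At the rate limit of 5 requests per second, these 410 requests alone take at least roughly 82 seconds of API time per polling window.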

Getting Started

Pre-Requisites

For reading from the AWS X-Ray REST API, create an AWS access key with a policy that includes at least the following actions: xray:BatchGetTraces and xray:GetTraceSummaries.
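For example, such a read-only policy could be created with the AWS CLI as in the following sketch; the policy name is a made-up example, and you may want to tighten the resource scope:

# Create a minimal read-only policy for the X-Ray API (hypothetical policy name)
aws iam create-policy \
  --policy-name xray-exporter-readonly \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["xray:BatchGetTraces", "xray:GetTraceSummaries"],
      "Resource": "*"
    }]
  }'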

Deploy to K8s

Default configuration

The default configuration uses a polling interval of 5 minutes to retrieve recent traces.

The XRayConnector pod is configured to use up to 5 workers, which should be sufficient to run the workflow in most scenarios, but it is recommended to test under load conditions. You should consider scaling out workers if your database tables dt.NewEvents or dt.NewTasks start queuing up unprocessed events.
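Once the database is deployed (see Step 4 below), a quick way to check for queued-up events is to query the row counts of these tables; this is only a sketch reusing the pod name and password variables from Step 4:

# Count unprocessed events/tasks in the DTFx queue tables
kubectl exec -n xrayconnector-mssql $mssqlPod -- /opt/mssql-tools18/bin/sqlcmd -C -S . -U sa -P $mssqlPwd -d DurableDB -Q "SELECT COUNT(*) AS NewEvents FROM dt.NewEvents; SELECT COUNT(*) AS NewTasks FROM dt.NewTasks"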

The database is deployed using a stateful set with 8 GiB of storage. As DTFx is based on the event-sourcing pattern, the database can grow very fast. An automatic purge of the history is implemented as a cronjob in xrayconnector.yml. The cronjob calls /api/PurgeHistory every 6 minutes and keeps at least 30 minutes of historic events.

The xrayconnector.yml also includes a cronjob that automatically calls /api/WorkflowWatchdog every 3 minutes to check the status of the workflow.
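Once everything is deployed (Step 7 below), you can verify that both cronjobs are scheduled and firing, for example with:

# List the purge and watchdog cronjobs and their recent job runs
kubectl get cronjobs -n xrayconnector
kubectl get jobs -n xrayconnector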

Step-by-Step Guide

Step 1) KEDA v2 is a pre-requisite. Make sure KEDA is up and running.

For more details on how to install KEDA, see the KEDA documentation.
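One common way to install KEDA is via Helm (a sketch, assuming Helm is available; see the KEDA documentation for other options):

# Install KEDA into its own namespace using the official Helm chart
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace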

Step 2) Build the XRayConnector container and push it to your target repository

# Replace '<YOUR-REPOSITORY>' with your target container registry
docker build -t xrayconnector:latest -f ./xrayconnector/Dockerfile .
docker tag xrayconnector:latest <YOUR-REPOSITORY>/xrayconnector:latest
docker push <YOUR-REPOSITORY>/xrayconnector:latest

Step 3) Configure the database secrets in mssql-statefulset-secrets.yml

Replace PLACEHOLDER with your password of choice to access the database.

Step 4) Deploy the MSSQL server and create the database

kubectl create namespace xrayconnector-mssql
kubectl apply -f ./mssql-statefulset-secrets.yml -n xrayconnector-mssql
kubectl apply -f ./mssql-statefulset.yml -n xrayconnector-mssql

# Once pod is ready...
# ..get the name of the pod running SQL Server
$mssqlPod = kubectl get pods -n xrayconnector-mssql -o jsonpath='{.items[0].metadata.name}'

# Use sqlcmd to create a database named "DurableDB".
# Replace 'PLACEHOLDER' with the password you used earlier
$mssqlPwd = "PLACEHOLDER"
kubectl exec -n xrayconnector-mssql $mssqlPod -- /opt/mssql-tools18/bin/sqlcmd -C -S . -U sa -P $mssqlPwd -Q "CREATE DATABASE [DurableDB] COLLATE Latin1_General_100_BIN2_UTF8"

Step 5) Configure the polling & forwarding of X-Ray data in connector-config.yml

Replace the placeholders with proper values for the AWS credentials, OTLP endpoint, etc.

...
  # - - - Database provider - - - 
  # Connection string, replace the <YOUR-DATABASE-PASSWORD> with your actual password as configured in mssql-statefulset-secrets.yml
  SQLDB_Connection: "Server=mssqlinst.mssql.svc.cluster.local;Database=DurableDB;User ID=sa;Password=<YOUR-DATABASE-PASSWORD>;Persist Security Info=False;TrustServerCertificate=True;Encrypt=True;"

  # - - - AWS IAM identifiers to access X-Ray API - - - 
  # Role-based access, for using temporary credentials (recommended, optional)
  AWS_RoleArn: "<YOUR-ROLE-ARN>"
  # https://docs.aws.amazon.com/general/latest/gr/xray.html#xray_region
  # us-east-1, ap-southeast-2, etc.
  AWS_RegionEndpoint: "<YOUR-AWS-REGION>"
  # Basic IAM credentials
  AWS_IdentityKey: "<YOUR-AWS-IDENTITY-KEY>"
  AWS_SecretKey: "<YOUR-AWS-SECRET-KEY>"
  
  # - - - Workflow configuration - - - 
  # Polling interval/window for retrieving trace telemetry from the X-Ray API. Default is 180 (3 min).
  PollingIntervalSeconds: "300"  
  # When polling is restarted, the maximum timespan to catch up on before the timeframe gets reset. Too large a window can cause a polling jam. Default is 900 (15 min).
  DefaultMaximumReplayHistorySeconds: "900"
  # If set to "True", the workflow is automatically started (or restarted in case it was terminated or failed) when api/WorkflowWatchdog is called. Default is "False".
  AutoStart: "True"
  # If set to "True" (recommended), JSON payload compression is enabled and the internal processing of trace details is compressed. This improves workflow performance and reduces I/O load on the database, but slightly increases CPU usage. Default is "False".
  EnableJsonPayloadCompression: "True"
  
  # - - - Target OTLP configuration - - -
  # Target OTLP endpoint for sending telemetry. For Dynatrace this may look like this: "https://<YOUR-TENANT-ID>.live.dynatrace.com/api/v2/otlp/" 
  OTLP_ENDPOINT: "<YOUR-OTLP-TARGET-ENDPOINT>"
  # Optional: OTLP header authorization (only used for traces). For Dynatrace provide an API Token with OTLP Trace Ingest permissions in the following format "Api-Token <YOUR-DYNATRACE-API-TOKEN>"
  OTLP_HEADER_AUTHORIZATION: "<YOUR-OPTIONAL-OTLP-HEADER-AUTHORIZATION>"
  
  # - - - Workflow telemetry - - -
  # Enable workflow metrics sent via OTLP. Default is "False".
  EnableMetrics: "True" 
  # Metrics exporter configuration. See also: https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/
  # Metrics OTLP protocol. Possible values: "grpc", "http/protobuf". Default is "grpc". 
  # OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: "<PROTOCOL>" 
  # OTLP endpoint for sending metrics. If not set, OTLP_ENDPOINT will be used.
  # OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: "<YOUR-OTLP-TARGET-ENDPOINT>"
  # OTLP metrics headers for e.g. authorization
  # OTEL_EXPORTER_OTLP_METRICS_HEADERS: "<YOUR-OPTIONAL-OTLP-HEADERS>" 

  # - - - TESTING ONLY - - -
  # Uses a mocked XRay API Client that simulates API responses
  # SimulatorMode: "XRayApi" 
  # Number of total traces returned for a TraceSummaries call
  # SIM_TraceSummariesResponseCount: "100"
  # Maximum number of traces per TraceSummaries request (to force paging)
  # SIM_TraceSummariesPageSize: "25"
  # Each simulated trace contains 5 segments. Configure whether segments are returned in a single batch ("None"), always in batches ("Always"), or randomly (~25%) in batches ("Random"). If batching is enabled, 2 batches are returned.
  # SIM_BatchTraceSegments: "Random" 
  

Step 6) Configure the function keys and container registry in xrayconnector.yml

  • (Recommended) Replace all function keys (host.master, host.function.default, ...), which protect your functions, with new, base64-encoded keys:
    • Generate a new key, e.g. with OpenSSL: openssl rand -base64 32
    • Base64-encode the returned key: echo -n '<THE NEW KEY>' | base64
  • (Recommended) Replace the host.master key used in the xrayconnector-watchdog cronjob URL http://xrayconnector/api/WorkflowWatchdog?code=<REPLACE-WITH-THE-NEW-KEY> with the newly created key.
  • Replace <YOUR-REPOSITORY> with the container registry, hosting your image

Step 7) Deploy config and XRayConnector

kubectl create namespace xrayconnector
kubectl apply -f .\connector-config.yml -n xrayconnector
kubectl apply -f .\xrayconnector.yml -n xrayconnector

Checking deployment status...

kubectl get pods -n xrayconnector
kubectl rollout status deployment xrayconnector -n xrayconnector

REST API

See test.http, which provides API requests that can be run in Visual Studio Code (VS Code) via the REST Client extension.
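If you prefer the command line, the same requests can also be issued with e.g. curl; a minimal example against the ping endpoint (host and key are placeholders):

# Check that the connector API is reachable
curl "https://<YOUR-XRAYCONNECTOR-HOST>/api/TestPing?code=<YOUR-FUNCTION-HOST-MASTER-KEY>"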

Manually start the workflow

If autostart is disabled, you need to manually trigger the main workflow timer.

POST https://xxxx/api/TriggerPeriodicAPIPoller?code=<YOUR-FUNCTION-HOST-MASTER-KEY>

Terminate the workflow

Manually stop the workflow. Does not disable autostart!

POST https://xxxx/api/TerminatePeriodicAPIPoller?code=<YOUR-FUNCTION-HOST-MASTER-KEY>

Check status of the workflow

Checks the status of the workflow. If autostart is enabled, this also enforces a start of the workflow.

POST https://xxxx/api/WorkflowWatchdog?code=<YOUR-FUNCTION-HOST-MASTER-KEY>

Purge workflow history

Purges the workflow history of completed, failed, or terminated instances. Optionally provide a timespan in minutes in the request body to only delete history older than X minutes.

POST https://xxxx/api/PurgeHistory?code=<YOUR-FUNCTION-HOST-MASTER-KEY>
content-type: text/plain

360

Test API

A simple endpoint to check whether the API is up and running.

GET https://xxxx/api/TestPing?code=<YOUR-FUNCTION-HOST-MASTER-KEY>

Ingest a sample trace into X-Ray for testing

Sends a sample trace into X-Ray. This feature requires additional actions granted in your AWS IAM policy: xray:PutTelemetryRecords and xray:PutTraceSegments

POST https://xxxx/api/TestGenerateSampleTrace?code=<YOUR-FUNCTION-HOST-MASTER-KEY>

Send a sample trace to the backend

Sends a sample trace to the configured OTLP endpoint to validate connection settings.

POST https://xxxx/api/TestSendSampleTrace?code=<YOUR-FUNCTION-HOST-MASTER-KEY>

Monitoring

XRayConnector provides several metrics to monitor its execution:

| Metric name | Metric type | Unit | Additional dimensions | Description |
| --- | --- | --- | --- | --- |
| workflow.polling_interval | Gauge | Milliseconds | workflow_instance, label | Time spent during the polling interval |
| api_calls | Counter | Count | api, account, operation, paged, replay | Number of API requests |
| api_response_objects | Counter | Count | api, account, objectname | Number of objects returned |
| process_memory_usage.total | Gauge | Bytes | | Process total memory |
| process_memory_usage.gc_heap | Gauge | Bytes | | Process managed memory |

Dynatrace Sample Dashboard Configuration

How metrics help optimize configuration parameters

The following snapshot shows a test run using the X-Ray API simulator mode. It demonstrates initially poor utilization of the X-Ray API combined with unnecessarily high latency of the exported data, and how this improves after configuration optimization is applied.

In the first phase, a large polling interval is used and the majority of the time in a run is spent waiting, which can be seen in the "Workflow execution time breakdown" (1) as well as in the request numbers dropping to zero (2).

After a configuration change that reduces the polling interval so that only a minimal wait remains in each interval, we see constant utilization of the X-Ray API and processing of traces throughout the whole period (3). Note: The rise in total API calls and traces is caused by the simulator returning a fixed number of traces per call, independent of the polling interval.

The sample also shows a seamless rolling update: the new pod runs in parallel as a second instance (4) before the first pod is terminated.

Contribute

This is an open source project, and we gladly accept new contributions and contributors.

License

Licensed under Apache 2.0 license. See LICENSE for details.
