RFC: Tracking at The Guardian #4474
Replies: 5 comments 10 replies
-
<3 <3 <3 this makes me so happy. I would love to see this happen. The biggest lesson I've learnt from years of working with (and building a few of) the different systems you've listed above is that tracking systems should focus on making it as easy as possible to set up new tracking. Needing to add a new Enum to a thrift model and then rebuild a library and import it into 3 codebases just to track a new component, for example, is prohibitively time-consuming. It's so easy to patch things at the transformation step in BigQuery these days (we've recently migrated data transformation from Scala/EMR to SQL using dbt).
-
Thank you for putting this together and making it so clear! Does it help to distinguish between different categories of 'tracking'? For example:
I feel like some of the current difficulty is the mixing of these concerns and therefore requirements. Should the different categories map to different solutions for example:
Perhaps this is not a useful distinction given everything ends up in the data lake (errors aside), but it might help to untangle responsibilities. It would be good to understand what Ophan deems to be their responsibility in this context. Is Ophan the de-facto place for any user-centric tracking, and hence would any attempt to divert user tracking away from Ophan be considered 'bad'? I agree that an all-encompassing solution (wherever it sits) should be as generic and loosely typed as possible in order to reduce friction and shift any transformation to the data lake. The commercial logger that @marsavar and @mxdvl developed was intended to have a very generic API for other teams to use, so that does feel like a natural starting point. PS: this is a minor subjective point, but the term 'tracking' makes me think of user-centric tracking, e.g. GA/ads style. An overarching term like 'logging' sits better for me, but that's just imo!
-
1. Data Pipeline

I’m going to talk specifically about client-side / JS tracking here, but this can easily be extended to capture other cases. Focussing on what @arelra defines as

I think that we already have useful abstractions around how to capture data for Dotcom and Commercial.Dev. The only missing thing is consent, to my knowledge. We should have a single abstraction for capturing all of the JS events, so that we can send a singular payload to a singular endpoint. This helper should be available from libs – here’s a WIP proposal.

I can see the flow of data as follows:

```mermaid
sequenceDiagram
    participant Client as JavaScript (client-side)
    participant Lambda as AWS Lambda / Fastly Compute@Edge
    participant Others as Other consumers
    participant Data as BigQuery (data warehouse)
    participant Dashboards as Data & Insight Dashboards
    loop JS Events
        Client->>Client: build Payload
    end
    Note right of Client: if Payload is empty, do nothing
    Client->>Lambda: send Payload as JSON on unload
    Lambda->>Others: (maybe Ophan?)
    Others->>Data: Process data
    Lambda->>Data: Convert Payload keys to rows
    Dashboards-->>Data: Create “clean” SQL query views
```
In this diagram, I think we could leverage Compute@Edge with log streaming.

```ts
type Team = "dotcom" | "commercial"; // etc.

type Data = {
    properties: Record<string, string>,
    metrics: Record<string, number>,
    // TODO: include consent ?
}

type Payload = Record<Team, Data>
```

Next steps
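As a rough illustration of the client-side half of this flow, here is a minimal sketch of building a payload and sending it on page hide. The endpoint path, the `addToPayload` helper, and the choice of `visibilitychange` plus `navigator.sendBeacon` are all assumptions for the sketch, not an agreed API:

```typescript
// All names here are illustrative, not an agreed API.
type Team = "dotcom" | "commercial"; // etc.

type Data = {
  properties: Record<string, string>;
  metrics: Record<string, number>;
};

// Partial: a given page may only carry data for some teams.
type Payload = Partial<Record<Team, Data>>;

const payload: Payload = {};

/** Merge one team's data into the shared payload. */
const addToPayload = (team: Team, data: Data): void => {
  payload[team] = data;
};

/** On page hide, send the payload once; if it is empty, do nothing. */
const sendOnHide = (endpoint: string): void => {
  document.addEventListener("visibilitychange", () => {
    if (document.visibilityState !== "hidden") return;
    if (Object.keys(payload).length === 0) return;
    navigator.sendBeacon(endpoint, JSON.stringify(payload));
  });
};
```

`sendBeacon` is used here because it is designed to survive page unload, unlike a plain `fetch`; whether the endpoint is a Lambda or Compute@Edge is invisible to this code.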
-
2. JavaScript Consumer Interface

Here’s a proposal that would cover all current use cases. It registers a list of “collectors” that will collect the data to send. These collectors will only run when the page unloads. The initialisation method returns a method that needs to be called with a boolean to schedule or unschedule collection.

This pattern allows consumers to schedule – or remove from the schedule – any data collector, synchronously or asynchronously.

```ts
// module.js
type Data = {
    label?: never;
    properties: Record<string, string>;
    metrics: Record<string, number>;
};

/**
 * Initialise data collection scheduler
 * @param {string} label The label to identify the data
 * @param {() => Data} collector The function to collect data at page unload
 * @returns {(collect: boolean) => void} Call this function to schedule collection at page unload
 */
export const init =
    (label: string, collector: () => Data) =>
    (collect: boolean) => { /* … */ };
```

```ts
// consumer.js
const collector = () => ({
    properties: { /* … */ },
    metrics: { /* … */ },
});

const enableScroll = init("scroll", collector);

// check consent:
getConsent().then((consent) => enableScroll(consent));
// or, equivalently, point-free:
getConsent().then(enableScroll);

// sample 10% of page views:
if (Math.random() < 10 / 100) enableScroll(true);

// track specific AB tests:
enableScroll(guardian.config.switches.abTrackScroll);
```
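To make the scheduler concrete, here is one way the body of `init` could be filled in: a map of scheduled collectors, flushed into a single payload at unload. This is a sketch of one possible implementation, not the proposal itself; the `flush` helper and the `pagehide` wiring in the comment are assumptions:

```typescript
type Data = {
  properties: Record<string, string>;
  metrics: Record<string, number>;
};

// Collectors scheduled to run at page unload, keyed by label.
const collectors = new Map<string, () => Data>();

const init =
  (label: string, collector: () => Data) =>
  (collect: boolean): void => {
    // true schedules the collector; false removes it from the schedule.
    if (collect) collectors.set(label, collector);
    else collectors.delete(label);
  };

/** Run every scheduled collector and assemble the payload to send. */
const flush = (): Record<string, Data> => {
  const payload: Record<string, Data> = {};
  collectors.forEach((collect, label) => {
    payload[label] = collect();
  });
  return payload;
};

// In the browser, the flush would be wired to page unload, e.g.:
//   addEventListener("pagehide", () => {
//     const payload = flush();
//     if (Object.keys(payload).length > 0) {
//       navigator.sendBeacon("/tracking", JSON.stringify(payload));
//     }
//   });
```

Because collectors only run at flush time, a consumer can toggle scheduling as often as it likes (after a consent check, a sample roll, an AB test switch) without any data being captured early.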
-
3. HTML Attributes Contract
-
Can we improve how we implement tracking at the Guardian?
Tracking at the Guardian is fragmented and generates poorly structured, hard-to-use data. Improving it would speed up delivery for all streams and enhance our ability to use metrics.
I've worked in Newsletters, Dotcom and Identity and in each of these teams we struggled to implement tracking. I spent a considerable percentage of my time trying to write code to capture metrics and I have spoken and paired with others who have had similar problems.
In Dotcom we ended up using a custom solution. The same happened in Commercial, possibly Apps and also in Acquisitions. Newsletters are currently writing their own solution because of the same problems. There are probably others.
What is Tracking?
Tracking in the context of this document is where we collect metrics about how our products perform or are used, to help us make them better. Examples include:
What are the issues?
My opinionated ideas on what we should do instead
We should list and understand our use cases for how tracking is done now, and then use that knowledge to create a centralised abstraction that better serves our existing and future requirements.
The abstraction should be feature rich but have a simple API. Ideally, it would be modular, with a capture script posting data to a pipeline and then a separate library deciding where it should be stored.
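As one sketch of what "feature rich but simple" and "modular" could mean, the public surface might be a single loosely typed `track` function, with the capture side knowing nothing about storage and a separate, swappable "sink" deciding where each team's events go. All names here are illustrative, not a proposed API:

```typescript
// Capture side: one deliberately generic, loosely typed entry point.
type TrackingEvent = {
  team: string; // e.g. "newsletters", "identity"
  name: string; // e.g. "signup-form-view"
  data: Record<string, string | number>;
};

// Storage side: a separate, swappable "sink" decides where events go.
type Sink = (event: TrackingEvent) => void;

const sinks = new Map<string, Sink>();

const registerSink = (team: string, sink: Sink): void => {
  sinks.set(team, sink);
};

const track = (event: TrackingEvent): void => {
  // Events for teams without a registered sink are dropped here;
  // a real pipeline would route them to a default destination instead.
  sinks.get(event.team)?.(event);
};
```

Keeping the event shape loosely typed pushes schema decisions to the transformation step in the data lake, which is where they are cheapest to change.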
Why?
What about consent
Yes, this very much needs to be considered. But having a centralised solution actually helps ensure that we are respecting our readers' privacy preferences. Continuing with a fragmented approach increases the risk that each new solution being built fails to implement our legal requirements correctly.
What are the ways data can be captured right now?
These are the ways that I am currently aware of. I believe there are others and that teams are actively building new ones right now.
Ophan data attributes
If you add a data-component= attribute to an element, the tracker-js script will find it and record the fact that it appeared on the page.
If you add a data-link-name= attribute, the tracker-js script will add event listeners to the element and record click events on it.
There may be race conditions between when these elements are rendered and when the script looks for them.
Ophan component events
There is a function exposed on the `window` object that allows custom events to be sent to Ophan. These are increasingly being used to work around limitations of the attributes approach above. They require JavaScript to be created and run on the page, which runs directly counter to how platforms are trying to remove JavaScript from the page.
In Dotcom alone, there are at least three different abstractions built around this function, making it hard to know the 'right' way to use it.
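One shared wrapper, instead of three per-codebase abstractions, could look roughly like the sketch below. The lookup path for the window-exposed function and the event shape are assumptions here, since the exact global API is not specified in this document:

```typescript
type OphanRecord = (payload: unknown) => void;

/**
 * Resolve the window-exposed record function, if the tracker is on the page.
 * The `guardian.ophan.record` path is an assumption for this sketch.
 */
const resolveRecord = (): OphanRecord | undefined =>
  (globalThis as { guardian?: { ophan?: { record?: OphanRecord } } })
    .guardian?.ophan?.record;

/** Send a component event; returns false if the tracker is absent. */
const recordComponentEvent = (
  event: { component: string; action: string },
  record: OphanRecord | undefined = resolveRecord(),
): boolean => {
  if (!record) return false; // tracker absent, e.g. consent withheld
  record({ componentEvent: event });
  return true;
};
```

A single wrapper like this gives one obvious 'right' way to call the function, and one place to add consent checks or queueing for events fired before the tracker script loads.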
Custom pipelines
Commercial have their own solution for capturing metrics
Dotcom piggybacked commercial's pipeline for build stats
Acquisitions added something similar
etc. We are actively encouraging teams to propagate this pattern which leads to more fragmentation.
What's wrong with the way we do things now? It works, doesn't it?
Beyond the confusion it adds, the time we're demanding from teams to achieve seemingly simple tasks is very high. This is then compounded by the cost of maintaining all these different solutions.
Further, each time we allow another metrics pipeline onto the page, we add more JavaScript, more listeners and more HTTP traffic; all of these have performance implications.