RFC: Tracking at The Guardian #4474
Replies: 5 comments 10 replies
-
<3 <3 <3 this makes me so happy. I would love to see this happen. The biggest lesson I've learnt from years of working with (and building a few of) the different systems you've listed above is that tracking systems should focus on making it as easy as possible to set up new tracking. Needing to add a new Enum to a thrift model and then rebuild a library and import it into 3 codebases just to track a new component, for example, is prohibitively time-consuming. It's so easy to patch things at the transformation step in BigQuery these days (we've recently migrated data transformation from Scala/EMR to SQL using dbt).
-
Thank you for putting this together and making it so clear! Does it help to distinguish between different categories of 'tracking'? For example:
I feel like some of the current difficulty is the mixing of these concerns and therefore requirements. Should the different categories map to different solutions for example:
Perhaps this is not a useful distinction given everything ends up in the data lake (errors aside), but it might help to untangle responsibilities. It would be good to understand what Ophan deems to be their responsibility in this context. Is Ophan the de-facto place for any user-centric tracking, and hence would any attempt to divert user tracking away from Ophan be considered 'bad'? I agree that an all-encompassing solution (wherever it sits) should be as generic and loosely typed as possible in order to reduce friction and shift any transformation to the data lake. The commercial logger that @marsavar and @mxdvl developed was intended to have a very generic API for other teams to use, so that does feel like a natural starting point. PS: this is a minor subjective point, but the term 'tracking' makes me think of user-centric tracking, e.g. GA/ads style. An overarching term like 'logging' sits better for me, but that's just imo!
-
1. Data Pipeline

I’m going to talk specifically about client-side / JS tracking here, but this can easily be extended to capture other cases. Focussing on what @arelra defines as

I think that we already have useful abstractions around how to capture data for Dotcom and Commercial.Dev. The only missing thing is consent, to my knowledge. We should have a single abstraction for capturing all of the JS events, so that we can send a singular payload to a singular endpoint. This helper should be available from libs – here’s a WIP proposal.

I can see the flow of data as follows:

```mermaid
sequenceDiagram
    participant Client as JavaScript (client-side)
    participant Lambda as AWS Lambda / Fastly Compute@Edge
    participant Others as Other consumers
    participant Data as BigQuery (data warehouse)
    participant Dashboards as Data & Insight Dashboards
    loop JS Events
        Client->>Client: build Payload
    end
    Note right of Client: if Payload is empty, do nothing
    Client->>Lambda: send Payload as JSON on unload
    Lambda->>Others: (maybe Ophan?)
    Others->>Data: Process data
    Lambda->>Data: Convert Payload keys to rows
    Dashboards-->>Data: Create “clean” SQL query views
```
In this diagram, I think we could leverage Compute@Edge with log streaming.

```ts
type Team = "dotcom" | "commercial"; // etc.

type Data = {
    properties: Record<string, string>,
    metrics: Record<string, number>,
    // TODO: include consent ?
}

type Payload = Record<Team, Data>
```

Next steps
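As a rough illustration of the client-side half of this flow, here is a minimal sketch of building a payload and sending it on page hide. The endpoint path, the `addToPayload` helper, and the choice of `visibilitychange` plus `navigator.sendBeacon` are all assumptions for the sketch, not an agreed API:

```typescript
// All names here are illustrative, not an agreed API.
type Team = "dotcom" | "commercial"; // etc.

type Data = {
  properties: Record<string, string>;
  metrics: Record<string, number>;
};

// Partial: a given page may only carry data for some teams.
type Payload = Partial<Record<Team, Data>>;

const payload: Payload = {};

/** Merge one team's data into the shared payload. */
const addToPayload = (team: Team, data: Data): void => {
  payload[team] = data;
};

/** On page hide, send the payload once; if it is empty, do nothing. */
const sendOnHide = (endpoint: string): void => {
  document.addEventListener("visibilitychange", () => {
    if (document.visibilityState !== "hidden") return;
    if (Object.keys(payload).length === 0) return;
    navigator.sendBeacon(endpoint, JSON.stringify(payload));
  });
};
```

`sendBeacon` is used here because it is designed to survive page unload, unlike a plain `fetch`; whether the endpoint is a Lambda or Compute@Edge is invisible to this code.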
-
2. JavaScript Consumer Interface

Here’s a proposal that would cover all current use cases. It registers a list of “collectors” that will collect the data to send. These collectors will only run when the page unloads. The initialisation method returns a method that needs to be called with a boolean to schedule or unschedule collection.

This pattern allows consumers to schedule – or remove from the schedule – any data collector, synchronously or asynchronously.

```ts
// module.js
type Data = {
    label?: never;
    properties: Record<string, string>;
    metrics: Record<string, number>;
};

/**
 * Initialise data collection scheduler
 * @param {string} label The label to identify the data
 * @param {() => Data} collector The function to collect data at page unload
 * @returns {(collect: boolean) => void} Call this function to schedule collection at page unload
 */
export const init =
    (label: string, collector: () => Data) =>
    (collect: boolean) => { /* … */ };
```

```ts
// consumer.js
const collector = () => ({
    properties: { /* … */ },
    metrics: { /* … */ },
});

const enableScroll = init("scroll", collector);

// check consent:
getConsent().then((consent) => enableScroll(consent));
// or, equivalently, point-free:
getConsent().then(enableScroll);

// sample 10% of page views:
if (Math.random() < 10 / 100) enableScroll(true);

// track specific AB tests:
enableScroll(guardian.config.switches.abTrackScroll);
```
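To make the scheduler concrete, here is one way the body of `init` could be filled in: a map of scheduled collectors, flushed into a single payload at unload. This is a sketch of one possible implementation, not the proposal itself; the `flush` helper and the `pagehide` wiring in the comment are assumptions:

```typescript
type Data = {
  properties: Record<string, string>;
  metrics: Record<string, number>;
};

// Collectors scheduled to run at page unload, keyed by label.
const collectors = new Map<string, () => Data>();

const init =
  (label: string, collector: () => Data) =>
  (collect: boolean): void => {
    // true schedules the collector; false removes it from the schedule.
    if (collect) collectors.set(label, collector);
    else collectors.delete(label);
  };

/** Run every scheduled collector and assemble the payload to send. */
const flush = (): Record<string, Data> => {
  const payload: Record<string, Data> = {};
  collectors.forEach((collect, label) => {
    payload[label] = collect();
  });
  return payload;
};

// In the browser, the flush would be wired to page unload, e.g.:
//   addEventListener("pagehide", () => {
//     const payload = flush();
//     if (Object.keys(payload).length > 0) {
//       navigator.sendBeacon("/tracking", JSON.stringify(payload));
//     }
//   });
```

Because collectors only run at flush time, a consumer can toggle scheduling as often as it likes (after a consent check, a sample roll, an AB test switch) without any data being captured early.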
-
3. HTML Attributes Contract
-
Can we improve how we implement tracking at the Guardian?
Tracking at the Guardian is fragmented and generates poorly structured, hard-to-use data. Improving it would speed up delivery for all streams and enhance our ability to use metrics.
I've worked in Newsletters, Dotcom and Identity and in each of these teams we struggled to implement tracking. I spent a considerable percentage of my time trying to write code to capture metrics and I have spoken and paired with others who have had similar problems.
In Dotcom we ended up using a custom solution. The same happened in Commercial, possibly Apps and also in Acquisitions. Newsletters are currently writing their own solution because of the same problems. There are probably others.
What is Tracking?
Tracking in the context of this document is where we collect metrics about how our products perform or are used, to help us make them better. Examples include:
What are the issues?
My opinionated ideas on what we should do instead
We should list and understand our use cases for how tracking is done now, and then use that knowledge to create a centralised abstraction that better serves our existing and future requirements.
The abstraction should be feature rich but have a simple API. Ideally, it would be modular, with a capture script posting data to a pipeline and then a separate library deciding where it should be stored.
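As one sketch of what "feature rich but simple" and "modular" could mean, the public surface might be a single loosely typed `track` function, with the capture side knowing nothing about storage and a separate, swappable "sink" deciding where each team's events go. All names here are illustrative, not a proposed API:

```typescript
// Capture side: one deliberately generic, loosely typed entry point.
type TrackingEvent = {
  team: string; // e.g. "newsletters", "identity"
  name: string; // e.g. "signup-form-view"
  data: Record<string, string | number>;
};

// Storage side: a separate, swappable "sink" decides where events go.
type Sink = (event: TrackingEvent) => void;

const sinks = new Map<string, Sink>();

const registerSink = (team: string, sink: Sink): void => {
  sinks.set(team, sink);
};

const track = (event: TrackingEvent): void => {
  // Events for teams without a registered sink are dropped here;
  // a real pipeline would route them to a default destination instead.
  sinks.get(event.team)?.(event);
};
```

Keeping the event shape loosely typed pushes schema decisions to the transformation step in the data lake, which is where they are cheapest to change.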
Why?
What about consent
Yes, this very much needs to be considered. But having a centralised solution actually helps ensure that we are respecting our readers' privacy preferences. Continuing with a fragmented approach increases the risk that each new solution being built fails to implement our legal requirements correctly.
What are the ways data can be captured right now?
These are the ways that I am currently aware of. I believe there are others and that teams are actively building new ones right now.
Ophan data attributes
If you add a data-component= attribute to an element, the tracker-js script will find it and record the fact that it appeared on the page.
If you add a data-link-name= attribute, the tracker-js script will add event listeners to the element and record click events on it.
There may be race conditions between when these elements are rendered and when the script looks for them.
Ophan component events
There is a function exposed on the `window` object that allows custom events to be sent to Ophan. These are increasingly being used to work around limitations of the attributes approach above. They require JavaScript to be created and run on the page, which runs directly counter to how platforms are trying to remove JavaScript from the page.
In Dotcom alone, there are at least three different abstractions built around this function, making it hard to know the 'right' way to use it.
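One shared wrapper, instead of three per-codebase abstractions, could look roughly like the sketch below. The lookup path for the window-exposed function and the event shape are assumptions here, since the exact global API is not specified in this document:

```typescript
type OphanRecord = (payload: unknown) => void;

/**
 * Resolve the window-exposed record function, if the tracker is on the page.
 * The `guardian.ophan.record` path is an assumption for this sketch.
 */
const resolveRecord = (): OphanRecord | undefined =>
  (globalThis as { guardian?: { ophan?: { record?: OphanRecord } } })
    .guardian?.ophan?.record;

/** Send a component event; returns false if the tracker is absent. */
const recordComponentEvent = (
  event: { component: string; action: string },
  record: OphanRecord | undefined = resolveRecord(),
): boolean => {
  if (!record) return false; // tracker absent, e.g. consent withheld
  record({ componentEvent: event });
  return true;
};
```

A single wrapper like this gives one obvious 'right' way to call the function, and one place to add consent checks or queueing for events fired before the tracker script loads.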
Custom pipelines
Commercial have their own solution for capturing metrics
Dotcom piggybacked commercial's pipeline for build stats
Acquisitions added something similar
etc. We are actively encouraging teams to propagate this pattern which leads to more fragmentation.
What's wrong with the way we do things now? It works, doesn't it?
Beyond the confusion it adds, the time we're demanding from teams to achieve seemingly simple tasks is very high. This is then compounded by the cost of maintaining all these different solutions.
Further, each time we allow another metrics pipeline onto the page, we add more JavaScript, more listeners and more HTTP traffic; all of these have performance implications.