Discovery basics: Data Sharing #36

oneiros · 2024-11-14T13:52:16Z

This is the first step of defining discovery providers (see https://www.fediscovery.org).

The first specification revolves around "data sharing". This is the foundation of any search and discovery related functionality. FASP need to be able to learn of new (and also existing) content and then fetch it to be able to index it.

"data sharing" as a title is still just a preliminary proposal. We used to call this "content ingestion", but the way this should work, is that instances only share URLs with FASP. A FASP is then responsible for how to act on this information. And while we included some hints on how to fetch the data properly, the core of the specification is the interaction between FASP and fediverse software. And that is not about "ingestion". Also, since we also deal with user account data, "content" might not be a perfect term here. So we arrived at "data sharing". I would be happy if anyone had a better idea, but for now I think this should work.

Note that I post this now to get early feedback, but I will also start working on an implementation. And I might still make some adjustments when I learn that some things do not work out in practice.

Most important changes:

* No longer allow to subscribe to individual lifecycle events
* Historical content gets delivered asynchronously using the same endpoint as events
* Reword some things to better accommodate for these changes
* Add a section on the actual retrieval of objects

Open questions:

* When fulfilling backfill requests how does the fediverse server signal that no more content is available?
* Can we get away with the FASP having no AP inbox/outbox?

... and specify a `moreObjectsAvailable` key to signal when no more objects are available for a backfill request.

We decided to move to separate specifications that can be evolved independently. Also adjusts to latest requirements from the `general` specification.

ThisIsMissEm · 2024-11-14T14:00:09Z

I haven't reviewed the full proposal yet, but based on your description above, it may make sense to share not just URLs, but also types, hashtags and account URLs, as those are the most common things to aggregate by.

ThisIsMissEm · 2024-11-14T19:28:32Z

discovery/data_sharing/v0.1/data_sharing.md

+fediverse servers MUST share not only local but also remote content with
+the FASP.


This may be contentious, especially around the lack of Data Processing Agreements being in place here. i.e., could be a potential legal nightmare.

Why do you think that? IANAL but since we only exchange URLs, I do not think we are dealing with PII here. At least not directly.

And when it comes to actually fetching the content there are several layers to protect user data (i.e. only public data being fetchable in the first place, checking the consent flags and being a blockable actor).

We are still in a way, since the FASP is requesting the content at those URLs. I think we'd want a way for a FASP, when installed, to say "Here's the Data Processing Agreement governing our processing data from Fediverse Server ABC" and then those could automatically be appended to the Privacy Policy.

Could be through either supplied markdown or a link to a document.

We already mandate that every Specification MUST include this. It is just that I am not sure if this even applies here 😕

I made an issue out of this #36.

ThisIsMissEm · 2024-11-14T19:29:17Z

discovery/data_sharing/v0.1/data_sharing.md

+actually allowed to share. It MUST NOT share any content that is not
+public. In the case of account and post data it MUST make sure to only


what about unlisted?

Is that generally considered to be public? I do not think so. Do you think we need to state this here specifically?

I think it'd be wise to be exactly clear as to what content is being shared & why.

ThisIsMissEm · 2024-11-14T19:31:02Z

discovery/data_sharing/v0.1/data_sharing.md

+The FASP in turn MUST also ensure that creators have opted in to
+discovery before storing and indexing content.


What happens if this value changes? Should we send events to say "hey, this actor is now discoverable" or "that actor is now no longer discoverable" ?

This is a really good question and I do not have an answer for that yet.

In case of accounts from servers that use the FASP this should not be a problem, since the FASP would subscribe to updates of them and could act accordingly.

But remote accounts and, maybe even more importantly, indexed statuses are different in that regard.

I suspect we will need to mandate that the actor is to be refetched periodically, but maybe someone has a better idea?

I added a paragraph at the end describing the need to periodically refetch the content. As an interval I chose one week, as I think this is the interval after which Mastodon considers an object to be stale.

I don't think refetching alone is enough. The Fediverse server knows through Update activities when discoverability preferences change. As such, it could send a notification to the FASP of this change.

Ditto, we know when content is edited (through Update activities) and deleted (through Delete activities), so we can also inform FASPs of that.

Yes, and this is all part of the spec already, but as always the server talking to the FASP might not know about these things if they happen on a third-party server. It should, but there are no guarantees.

Hence we will need both: Servers announcing changes to FASP and FASP rechecking content regularly.

ThisIsMissEm · 2024-11-14T19:32:48Z

discovery/data_sharing/v0.1/data_sharing.md

+* `category`: One of `content`, `account`. This is the category of
+  objects the FASP is interested in.


Is this objectType? i.e., Actor, Post, Media, etc?

It was at some point, but then we decided against it (hence the change from objectType to category). At this point we do not expect FASPs to handle different types of content very differently, so it does not seem to make much sense having separate subscriptions for them.

This is also more flexible as the fediverse software can decide what it considers "content" worth indexing and the FASP can decide if and how to handle different types.

What happens if I want to subscribe to both content and account? two separate subscriptions? ditto for subscriptionType?

I think we may also want to add in link and media here separately?

> What happens if I want to subscribe to both content and account? two separate subscriptions? ditto for subscriptionType?

Yes, separate subscriptions.

> I think we may also want to add in link and media here separately?

Maybe later, but this will not be in our first version.

ThisIsMissEm · 2024-11-14T19:35:30Z

discovery/data_sharing/v0.1/data_sharing.md

+The fediverse server MUST validate the request. If it is invalid it MUST
+return an HTTP status code `422` (Unprocessable Content).


Should it return different errors for different things? e.g., 400 if the payload isn't valid JSON, 422 if the schema isn't correct, and 401 if the HTTP signatures don't validate?

The latter is already part of the "general" spec. Do we need the distinction between 400 and 422? I have no idea 😕

I think we need a link to the general spec as to what "validate the request" means. I do think there's a difference between an authentication failure and a bad request body.

There's possibly also 429, or 409 (conflict) for when a FASP already has a subscription that matches this subscription.
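To make the distinction discussed in this thread concrete, here is a minimal Python sketch of how a fediverse server might pick a status code for a subscription request. It only illustrates the proposals above and is not what the spec mandates: the helper name, its arguments, the `201` success code and the rule for detecting a duplicate subscription are all assumptions. Only the `category` values `content` and `account` are taken from the spec text.

```python
import json

# Category values as quoted from the spec diff in this review.
VALID_CATEGORIES = {"content", "account"}


def status_for_subscription_request(raw_body, signature_ok, existing_categories):
    """Hypothetical helper: choose an HTTP status for a subscription request.

    raw_body            -- request body as bytes or str
    signature_ok        -- result of the signature check from the "general" spec
    existing_categories -- categories this FASP is already subscribed to
                           (an assumed, simplified notion of "matching subscription")
    """
    if not signature_ok:
        return 401  # authentication failure

    try:
        payload = json.loads(raw_body)
    except (ValueError, TypeError):
        return 400  # body could not be parsed at all

    category = payload.get("category") if isinstance(payload, dict) else None
    if not isinstance(category, str) or category not in VALID_CATEGORIES:
        return 422  # parsed fine, but does not match the expected schema

    if category in existing_categories:
        return 409  # conflict with an existing subscription

    return 201  # assumed success code, not taken from the spec
```

Whether `400` and `422` really need to be distinguished is exactly the open question here; the sketch just shows that the checks fall out naturally in this order.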

ThisIsMissEm · 2024-11-14T19:36:32Z

discovery/data_sharing/v0.1/data_sharing.md

+```http
+POST /data_sharing/v0/event_subscriptions
+```


Requests to this should also include an endpoint relative to the FASP such that you know where to send the content (much like with Webhooks). Additionally, care should be taken that multiple domains may be used for a FASP, e.g., in IFTAS CCS, the webapp runs at https://ccs.iftas.cloud but the endpoint for receiving webhooks is https://webhooks.ccs.iftas.cloud/. We did this because the Web App's UI doesn't really need that much scaling, but we need to be able to scale up and down the webhook service based on throughput/requests.

I opted for all endpoints to be specified here statically. I really would like these documents to be as easy and straightforward to implement as possible. And I think the fewer decisions an implementer has to make, the better.

This may of course change in the future when we have some real-world experience, but to get to that point, I think it is important to help people implement this.

Similarly while I agree that this will probably have to scale in some way, I am not yet sure how exactly. And if it needs to be scaled to several load-balanced instances, I see no problem with all of them serving the occasional web app request as well.

Again, this is subject to change if practical experience informs that change.

Yeah, what I'm trying to provide here is real-world experience from having built a complex application that relies on data-sharing to work (IFTAS CCS). This is a real practical concern. For example, IFTAS CCS is composed of 4-6 individual services, all working together to do different parts of the process. We definitely don't want the complexity of Kafka or event queueing, and the wild scaling requirements that come with it, to be embedded in the web app. The web app also needs to be able to support multipart requests and similar content-type parsing, whereas the webhooks service is explicitly locked down on that front to prevent potential abuse of it.

Making this work more like webhook subscriptions gives developers a much easier way to implement their code & services how they would like, without being bound to how fediverse software thinks they should implement their software.

ThisIsMissEm · 2024-11-14T19:43:41Z

discovery/data_sharing/v0.1/data_sharing.md

+* `source` lets the FASP know in reply to which of its request this
+  announcement is sent. It MUST include an object with either a
+  `subscriptionId` or a `backfillRequestId`.


Would once again recommend:

source: { type: "subscription" | "backfill_request", id: "1234567" }

I think it would be more consistent this way:

"source": { "subscription": { "id": "132542" } }

I just pushed a change to that effect but am still open to alternatives.
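For readers comparing the two shapes, here is a small Python sketch of how a FASP might read the nested form of `source`. It is only an illustration under assumptions: the function name is made up, and while the `subscription` key appears in the example above, the `backfillRequest` key is a guess at the counterpart for backfill requests and may be named differently in the spec.

```python
def parse_source(source):
    """Return (kind, id) for a nested `source` object.

    Accepts e.g. {"subscription": {"id": "132542"}}.
    "backfillRequest" is an ASSUMED key name for the backfill case.
    Raises ValueError if the object does not name exactly one origin.
    """
    known_kinds = ("subscription", "backfillRequest")
    present = [kind for kind in known_kinds if kind in source]
    if len(present) != 1:
        raise ValueError("source must contain exactly one of %r" % (known_kinds,))

    kind = present[0]
    inner = source[kind]
    if not isinstance(inner, dict) or "id" not in inner:
        raise ValueError("source[%r] must be an object with an 'id'" % kind)

    return kind, inner["id"]
```

With the flat `type`/`id` proposal the same check would reduce to validating one enum value, which is the trade-off being discussed here.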

ThisIsMissEm · 2024-11-14T19:46:48Z

discovery/data_sharing/v0.1/data_sharing.md

+These requests MUST be signed. To achieve a maximum of compatibility
+with existing fediverse software, FASP MUST support request signing with
+both "HTTP Message Signatures" as defined by
+[RFC-9421](https://tools.ietf.org/html/rfc9421.html) as well as the
+earlier draft version "HTTP Signatures" as defined by
+[cavage-12](https://datatracker.ietf.org/doc/html/draft-cavage-http-signatures).


This will require an Actor to exist for the FASP and for the FASP to support webfinger.

Yes, as stated below, though I should probably mention webfinger.

Having re-checked I wonder: Do we really need webfinger here? The keyId used for signatures already points to the actor JSON, so in theory no webfinger lookup should be necessary.

I don't think so. I think we could just say that you need to request back the actor at that URL following the process defined in: https://swicg.github.io/activitypub-http-signature/#how-to-verify-a-signature

i.e., I don't think we necessarily need to repeat here the information that is already specified in that spec.

I tried to implement this to work against a Mastodon server and yes, a whole lot more is needed than I had initially hoped, including webfinger. I updated the document accordingly.

ThisIsMissEm

Left a whole bunch of comments, apologies for the few that weren't part of the review.. habit of clicking the wrong button.

ThisIsMissEm · 2024-11-14T19:47:40Z

discovery/data_sharing/v0.1/data_sharing.md

+To find out which version a given fediverse server supports, FASP should
+implement "double-knocking": They should first attempt a request using
+"HTTP Message Signatures" and if the fediverse server replies with an
+HTTP status code of `401` or `403` make a second attempt with the older
+draft version, "HTTP Signatures".


Why not just always require sending signatures? If the server doesn't need them, it'll discard them. That'll reduce request loads by half.

We do require that, we just propose to try the final version before falling back to the draft one.

They use the same HTTP header and thus cannot both be used at the same time. I included a link to the SWICG document about this and this is basically their recommendation I just followed here.

I'd probably move the link to the SWICG report up, and perhaps we can do some clarifications to that report, e.g., "How to send requests with signatures" which is kind of mentioned, but saying "follow this process, but do double knocking" (i.e., try latest spec, if that fails, try cavage-12). This is kinda awkwardly placed in that document.
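As an illustration of the "double-knocking" flow described in the quoted spec text, here is a minimal Python sketch. The actual signing and HTTP request are left to a caller-supplied function; `send_signed`, its two arguments and the scheme labels `"rfc9421"` and `"cavage-12"` are hypothetical names chosen for this example, not part of the spec.

```python
def double_knock(send_signed, url):
    """Try RFC 9421 first, fall back to the cavage-12 draft on 401/403.

    send_signed(url, scheme) is a hypothetical callable that signs and
    sends the request with the given scheme and returns the HTTP status.
    Returns (status, scheme_used) so callers can remember what worked.
    """
    first_status = send_signed(url, "rfc9421")
    if first_status not in (401, 403):
        return first_status, "rfc9421"

    # The server rejected the final-version signature: second knock
    # with the older draft version.
    return send_signed(url, "cavage-12"), "cavage-12"
```

A FASP could cache the returned scheme per server to avoid the extra request on later fetches, though that optimisation is not something the thread has settled on.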

ThisIsMissEm · 2024-11-14T19:54:48Z

discovery/data_sharing/v0.1/data_sharing.md

+Example call:
+
+```http
+POST /data_sharing/v0/announcements


I'm not in favour of this endpoint being hardcoded and relative to the FASP homepage, since this will greatly impact the ability to scale the FASP and work with different HTTP routing systems. I think it'd be better to have the FASP pre-register the domains for endpoints during registration, and then use an endpointUrl when making a subscription request, which must be on one of the registered domains and can have an arbitrary path.

e.g., maybe I want to split processing of backfilling from realtime data, such that I can throttle differently, or maybe I want to give each customer their own endpoint, such that I can more easily apply rate limits, billing, priority, etc.

I think I answered this above. Just to reiterate: I agree these are valid concerns but I would prefer to tackle those in a later version when we have some practical experience.
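For reference, the check proposed in this thread (an `endpointUrl` that must live on one of the pre-registered domains) could look roughly like the Python sketch below. This is not part of the current draft; the function name, the exact-host matching and the HTTPS-only rule are assumptions made for the example.

```python
from urllib.parse import urlsplit


def is_allowed_endpoint(endpoint_url, registered_hosts):
    """Return True if endpoint_url is HTTPS and on a pre-registered host.

    registered_hosts is the (hypothetical) set of lowercase host names
    the FASP declared during registration. Any path is accepted.
    """
    parts = urlsplit(endpoint_url)
    if parts.scheme != "https":
        return False  # assumed rule: only HTTPS endpoints
    # urlsplit lowercases the host name; it is None if no host is present.
    return parts.hostname is not None and parts.hostname in registered_hosts
```

Using the IFTAS CCS example from above, `is_allowed_endpoint("https://webhooks.ccs.iftas.cloud/announcements", {"ccs.iftas.cloud", "webhooks.ccs.iftas.cloud"})` would accept the webhook host while rejecting any domain that was not registered.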

Co-authored-by: Emelia Smith

...describing the need to periodically recheck objects.

ThisIsMissEm · 2024-11-18T21:29:05Z

README.md

@@ -34,7 +34,7 @@ To learn more about the initial plans for search and discovery please visit the

 Specifications:

-* Coming soon 
+* [Data Sharing](discovery/data_sharing/v0.1/data_sharing.md) 


Should debug be in here as well?

It is in there above this section.

ThisIsMissEm · 2024-11-18T21:30:23Z

discovery/README.md

+# Fediscovery: Fediverse Discovery Providers
+
+This directory contains the basic specifications for discovery
+providers, FASPs that facilitate better search and discovery for the
+fediverse.


Just thinking a bit more, I think the generic "subscribe to all posts / accounts" could be lifted up to a general data sharing, whereas trends remain in discovery?

Since things like spam, content scanning, etc, would all need posts / accounts as well.

We discussed this internally on more than one occasion 🙂

My take is this: The mechanism described here is currently tailored for discovery. That is why we allow sharing remote content, let the FASP fetch the actual content, make doubly sure indexable and discoverable are set and so on.

Other use cases will have to work differently, I think.

Will we need a more general approach to data sharing at some point? Most probably yes.

Will discovery fit into that and / or be a special case of that? I am not sure.

So I argued (and still argue) that this is only for discovery for now and we resolve this once someone actually defines one of the other use cases in more detail.

ThisIsMissEm · 2024-11-18T21:48:33Z

discovery/README.md

+* [Trends](trends/v0.1/)
+
+  This describes the `trends` capability for providers that help
+  discovering content that is currently trending.


Is there a "write" API for a FASP to say "Hey these things are trending according to me" ?

Defining the trends APIs will be one of our next steps. So far we are thinking more about the server querying the FASP and not the FASP announcing trends. But nothing is set in stone yet.

ThisIsMissEm · 2024-11-18T21:49:15Z

discovery/README.md

+* [Data Sharing](data_sharing/v0.1/)
+
+  All discovery providers need a way to request content from fediverse
+  servers. This specification describes both an API to subscribe to new
+  content as well as an API to retrieve existing content.


As mentioned above, I think half of this needs to be moved up to general to support other use-cases like spam, harassment, labeling, content detection, etc.

ThisIsMissEm · 2024-11-18T22:13:05Z

discovery/data_sharing/v0.1/data_sharing.md

+  "objectUris": [
+    "https://fediverse-server.example.com/@example/2342",
+    "https://fediverse-server.example.com/@other/8726"
+  ]


Would it be worth adding in an actorUris here? Such that I can easily check which actors I need to fetch for this batch of data, before fetching the actual posts themselves?

ThisIsMissEm · 2024-11-18T22:22:37Z

discovery/data_sharing/v0.1/data_sharing.md

+Once indexed or persisted in any way, FASP MUST periodically re-check
+both content and account data. At least once every week FASP MUST
+revalidate that the content / account is still publicly available and
+still allowed to be indexed. If the content has been changed these
+changes MUST be applied in the FASPs data storage as well. If FASP have
+been notified of changes through their subscriptions they MAY suspend
+the periodical check for this object for the next week.


This feels like it's going to potentially put excessive load on downstream servers, so instead would be best handled by forwarding Update/Delete activities.

Yes, I worry about this as well. But is this really worse than for example "regular" search engine crawlers?
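To gauge the load this rule would cause, it may help to see how small the scheduling logic is. The following Python sketch follows the quoted paragraph: an object is due for a re-check one week after it was last verified, and a change notification received through a subscription counts as a verification. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# "At least once every week" from the quoted spec text.
RECHECK_INTERVAL = timedelta(weeks=1)


def recheck_due(last_checked, last_notified, now):
    """Return True if a stored object should be revalidated now.

    last_checked  -- datetime of the FASP's last successful fetch
    last_notified -- datetime of the last change notification received
                     via a subscription, or None if there was none
    now           -- current datetime
    """
    last_verified = last_checked
    if last_notified is not None and last_notified > last_checked:
        # A notification lets the FASP skip the periodic check for a week.
        last_verified = last_notified
    return now - last_verified >= RECHECK_INTERVAL
```

Under this reading, objects on servers that reliably forward Update and Delete activities would rarely be refetched at all; the weekly fetch only hits objects nobody has announced changes for.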

ThisIsMissEm · 2024-11-18T22:25:52Z

discovery/data_sharing/v0.1/data_sharing.md

+In order to subscribe to new content, FASPs can subscribe to events
+by making a `POST` call to the `/data_sharing/v0/event_subscriptions`
+endpoint on the fediverse server.


We should also add a GET /data_sharing/v0/event_subscriptions to retrieve back all of the current subscriptions that a FASP has to a server.

Likewise, there should probably be some sort of notification if the Fediverse server disconnects the FASP.

oneiros added 5 commits November 13, 2024 14:04

WIP initial draft of discovery fasps.

492716c

Remove references to AS vocabulary...

f3f950b

... and specify a `moreObjectsAvailable` key to signal when no more objects are available for a backfill request.

Change structure

360b605

We decided to move to separate specifications that can be evolved independently. Also adjusts to latest requirements from the `general` specification.

Mention discoverable flag.

d99652a
