Discovery basics: Data Sharing #36
Most important changes:

* Subscribing to individual lifecycle events is no longer allowed
* Historical content gets delivered asynchronously using the same endpoint as events
* Reword some things to better accommodate these changes
* Add a section on the actual retrieval of objects

Open questions:

* When fulfilling backfill requests, how does the fediverse server signal that no more content is available?
* Can we get away with the FASP having no AP inbox/outbox?
... and specify a `moreObjectsAvailable` key to signal when no more objects are available for a backfill request.
We decided to move to separate specifications that can be evolved independently. Also adjusts to the latest requirements from the `general` specification.
I haven't reviewed the full proposal yet, but based on your description above, it may make sense to share not just URLs, but also types, hashtags and account URLs, as those are the most common things to aggregate by.
fediverse servers MUST share not only local but also remote content with
the FASP.
This may be contentious, especially around the lack of Data Processing Agreements being in place here. i.e., could be a potential legal nightmare.
Why do you think that? IANAL but since we only exchange URLs, I do not think we are dealing with PII here. At least not directly.
And when it comes to actually fetching the content there are several layers to protect user data (i.e. only public data being fetchable in the first place, checking the consent flags and being a blockable actor).
We are still in a way, since the FASP is requesting the content at those URLs. I think we'd want a way for a FASP, when installed, to say "Here's the Data Processing Agreement governing our processing data from Fediverse Server ABC" and then those could automatically be appended to the Privacy Policy.
Could be through either supplied markdown or a link to a document.
We already mandate that every Specification MUST include this. It is just that I am not sure if this even applies here 😕
I made an issue out of this #36.
actually allowed to share. It MUST NOT share any content that is not
public. In the case of account and post data it MUST make sure to only
what about unlisted?
Is that generally considered to be public? I do not think so. Do you think we need to state this here specifically?
I think it'd be wise to be exactly clear as to what content is being shared & why.
The FASP in turn MUST also ensure that creators have opted in to
discovery before storing and indexing content.
What happens if this value changes? Should we send events to say "hey, this actor is now discoverable" or "that actor is now no longer discoverable" ?
This is a really good question and I do not have an answer for that yet.
In case of accounts from servers that use the FASP this should not be a problem, since the FASP would subscribe to updates of them and could act accordingly.
But remote accounts and, maybe even more importantly, indexed statuses are different in that regard.
I suspect we will need to mandate that the actor is to be refetched periodically, but maybe someone has a better idea?
I added a paragraph at the end describing the need to periodically refetch the content. As an interval I chose one week, as I think this is the interval after which Mastodon considers an object to be stale.
I don't think refetching alone is enough. The Fediverse server knows through `Update` activities when discoverability preferences change. As such, it could send a notification to the FASP of this change.
Ditto, we know when content is edited (through `Update` activities) and deleted (through `Delete` activities), so we can also inform FASPs of that.
Yes, and this is all part of the spec already, but as always the server talking to the FASP might not know about these things if they happen on a third-party server. It should, but there are no guarantees.
Hence we will need both: Servers announcing changes to FASP and FASP rechecking content regularly.
* `category`: One of `content`, `account`. This is the category of
  objects the FASP is interested in.
Is this `objectType`? I.e., Actor, Post, Media, etc.?
It was at some point, but then we decided against it (hence the change from `objectType` to `category`). At this point we do not expect FASPs to handle different types of content very differently, so it does not seem to make much sense having separate subscriptions for them.
This is also more flexible as the fediverse software can decide what it considers "content" worth indexing and the FASP can decide if and how to handle different types.
What happens if I want to subscribe to both content and account? Two separate subscriptions? Ditto for `subscriptionType`?
I think we may also want to add in `link` and `media` here separately?
> What happens if I want to subscribe to both content and account? Two separate subscriptions? Ditto for `subscriptionType`?

Yes, separate subscriptions.

> I think we may also want to add in `link` and `media` here separately?

Maybe later, but this will not be in our first version.
The fediverse server MUST validate the request. If it is invalid it MUST
return an HTTP status code `422` (Unprocessable Content).
Should it return different errors for different things? e.g., 400 if the payload isn't just, 422 if the schema isn't correct, and 401 if the http signatures don't validate?
The latter is already part of the "general" spec. Do we need the distinction between 400 and 422? I have no idea 😕
I think we need a link to the general spec as to what "validate the request" means. I do think there's a difference between an authentication failure and a bad request body.
There's possibly also 429, or 409 (conflict) for when a FASP already has a subscription that matches this subscription.
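To make the status-code discussion concrete, here is a minimal Python sketch of how a fediverse server might validate a subscription request body. Only the `422` response and the `category` values `content` and `account` come from the proposal. The helper name `validate_subscription` and the use of `400` for an unparseable body are assumptions for illustration (the latter is only a suggestion from this thread), and signature checks are left to the "general" spec.

```python
import json
from typing import Optional

ALLOWED_CATEGORIES = {"content", "account"}  # values named in the proposal


def validate_subscription(raw_body: str) -> Optional[int]:
    """Return an HTTP error status, or None if the body looks valid.

    Hypothetical helper: only the 422 case is mandated by the proposal.
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        # Assumption: 400 for a body that is not JSON at all.
        return 400
    if not isinstance(payload, dict):
        return 422
    category = payload.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return 422
    return None
```

Whether `400` and `422` need to be distinguished at all is still an open question in this thread; collapsing both branches to `422` would match the current spec text.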
```http
POST /data_sharing/v0/event_subscriptions
```
Requests to this should also include an endpoint relative to the FASP such that you know where to send the content (much like with Webhooks). Additionally, care should be taken that multiple domains may be used for a FASP, e.g., in IFTAS CCS, the webapp runs at https://ccs.iftas.cloud but the endpoint for receiving webhooks is https://webhooks.ccs.iftas.cloud/. We did this because the Web App's UI doesn't really need that much scaling, but we need to be able to scale up and down the webhook service based on throughput/requests.
I opted for all endpoints to be specified here statically. I really would like these documents to be as easy and straightforward to implement as possible. And I think the fewer decisions an implementer has to make, the better.
This may of course change in the future when we have some real-world experience, but to get to that point, I think it is important to help people implement this.
Similarly while I agree that this will probably have to scale in some way, I am not yet sure how exactly. And if it needs to be scaled to several load-balanced instances, I see no problem with all of them serving the occasional web app request as well.
Again, this is subject to change if practical experience informs that change.
Yeah, what I'm trying to provide here is real-world experience from having built a complex application that relies on data sharing to work (IFTAS CCS). This is a real practical concern. For example, IFTAS CCS is composed of 4-6 individual services, all working together to do different parts of the process. We definitely don't want the complexity of Kafka or event queueing, and the wild scaling requirements that come with it, embedded in the web app. The web app also needs to be able to support multipart requests and similar content-type parsing, whereas the webhooks service is explicitly locked down on that front to prevent potential abuse of it.
Making this work more like webhook subscriptions gives developers a much easier way to implement their code & services how they would like, without being bound to how fediverse software thinks they should implement their software.
* `source` lets the FASP know in reply to which of its request this
  announcement is sent. It MUST include an object with either a
  `subscriptionId` or a `backfillRequestId`.
Would once again recommend:

    source: {
      type: "subscription" | "backfill_request"
      id: "1234567"
    }
I think it would be more consistent this way:

    "source": {
      "subscription": {
        "id": "132542"
      }
    }

I just pushed a change to that effect but am still open to alternatives.
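As a rough illustration of the nested form, a FASP could resolve the `source` object of an announcement like this. The sketch is in Python and assumes that the backfill variant uses a `backfillRequest` key mirroring `subscription`; that key name, like the helper `resolve_source`, is a guess and not confirmed by this thread.

```python
from typing import Tuple

# "subscription" appears in the thread; "backfillRequest" is an assumed name.
SOURCE_KINDS = ("subscription", "backfillRequest")


def resolve_source(source: dict) -> Tuple[str, str]:
    """Return (kind, id) for an announcement's source object (hypothetical helper)."""
    present = [kind for kind in SOURCE_KINDS if kind in source]
    if len(present) != 1:
        raise ValueError("source must reference exactly one request")
    kind = present[0]
    inner = source[kind]
    if not isinstance(inner, dict) or not isinstance(inner.get("id"), str):
        raise ValueError("source." + kind + " must contain a string id")
    return kind, inner["id"]
```

With the flat `type`/`id` alternative the same lookup would be a single dictionary access, which is the trade-off being discussed here.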
These requests MUST be signed. To achieve a maximum of compatibility
with existing fediverse software, FASP MUST support request signing with
both "HTTP Message Signatures" as defined by
[RFC-9421](https://tools.ietf.org/html/rfc9421.html) as well as the
earlier draft version "HTTP Signatures" as defined by
[cavage-12](https://datatracker.ietf.org/doc/html/draft-cavage-http-signatures).
This will require an Actor to exist for the FASP and for the FASP to support webfinger.
Yes, as stated below, though I should probably mention webfinger.
Having re-checked I wonder: Do we really need webfinger here? The keyId used for signatures already points to the actor JSON, so in theory no webfinger lookup should be necessary.
I don't think so. I think we could just say that you need to request back the actor at that URL following the process defined in: https://swicg.github.io/activitypub-http-signature/#how-to-verify-a-signature
I.e., I don't think we necessarily need to repeat here the information that is already specified in that spec.
I tried to implement this to work against a Mastodon server and yes, a whole lot more is needed than I had initially hoped, including webfinger. I updated the document accordingly.
Left a whole bunch of comments, apologies for the few that weren't part of the review. Habit of clicking the wrong button.
To find out which version a given fediverse server supports, FASP should
implement "double-knocking": They should first attempt a request using
"HTTP Message Signatures" and if the fediverse server replies with an
HTTP status code of `401` or `403` make a second attempt with the older
draft version, "HTTP Signatures".
Why not just always require sending signatures? If the server doesn't need them, it'll discard them. That'll reduce request loads by half.
We do require that, we just propose to try the final version before falling back to the draft one.
They use the same HTTP header and thus cannot both be used at the same time. I included a link to the SWICG document about this and this is basically their recommendation I just followed here.
I'd probably move the link to the SWICG report up, and perhaps we can do some clarifications to that report, e.g., "How to send requests with signatures" which is kind of mentioned, but saying "follow this process, but do double knocking" (i.e., try latest spec, if that fails, try cavage-12). This is kinda awkwardly placed in that document.
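The double-knocking flow quoted above can be sketched in a few lines of Python. This is only an outline under stated assumptions: the actual signing and HTTP transport are passed in as callables (`send`, `sign_rfc9421` and `sign_cavage` are hypothetical names), so the block shows the control flow only and not a real signature implementation.

```python
from typing import Callable, Dict, Tuple

Headers = Dict[str, str]


def double_knock(
    send: Callable[[Headers], int],
    sign_rfc9421: Callable[[Headers], Headers],
    sign_cavage: Callable[[Headers], Headers],
    headers: Headers,
) -> Tuple[int, str]:
    """Try RFC 9421 first, then fall back to the cavage draft on 401/403.

    Returns the final status code and the name of the scheme used last.
    """
    # First knock: the final standard, "HTTP Message Signatures".
    status = send(sign_rfc9421(dict(headers)))
    if status not in (401, 403):
        return status, "rfc9421"
    # Second knock: the older draft, "HTTP Signatures".
    return send(sign_cavage(dict(headers))), "cavage-12"
```

A FASP would probably want to remember which scheme worked for a given server so that later requests skip the failed first attempt, though the quoted text does not require this.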
Example call:

```http
POST /data_sharing/v0/announcements
```
I'm not in favour of this endpoint being hardcoded and relative to the FASP homepage, since this will greatly impact the ability to scale the FASP and work with different HTTP routing systems. I think it'd be better to have the FASP pre-register the domains for endpoints during registration, and then use an `endpointUrl` when making a subscription request, which must be on one of the registered domains and can have an arbitrary path.
e.g., maybe I want to split processing of backfilling from realtime data, such that I can throttle differently, or maybe I want to give each customer their own endpoint, such that I can more easily apply rate limits, billing, priority, etc.
I think I answered this above. Just to reiterate: I agree these are valid concerns but I would prefer to tackle those in a later version when we have some practical experience.
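For reference, the reviewer's alternative (pre-registered domains plus a per-subscription `endpointUrl`) would need a check roughly like the following on the fediverse server. This Python sketch illustrates the proposal from this thread only; it is not part of the current draft, and the helper name `endpoint_allowed` and the HTTPS requirement are assumptions.

```python
from typing import Set
from urllib.parse import urlsplit


def endpoint_allowed(endpoint_url: str, registered_domains: Set[str]) -> bool:
    """Accept an endpointUrl only if its host was registered beforehand.

    Hypothetical helper; requiring https is an assumption of this sketch.
    """
    parts = urlsplit(endpoint_url)
    # hostname is lower-cased by urlsplit and is None for malformed URLs.
    return parts.scheme == "https" and parts.hostname in registered_domains
```

The path is deliberately not inspected, which is what would let a FASP route backfill and real-time traffic to different endpoints.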
Co-authored-by: Emelia Smith <[email protected]>
...describing the need to periodically recheck objects.
@@ -34,7 +34,7 @@ To learn more about the initial plans for search and discovery please visit the

Specifications:

* Coming soon
* [Data Sharing](discovery/data_sharing/v0.1/data_sharing.md)
Should debug be in here as well?
It is in there above this section.
# Fediscovery: Fediverse Discovery Providers

This directory contains the basic specifications for discovery
providers, FASPs that facilitate better search and discovery for the
fediverse.
Just thinking a bit more, I think the generic "subscribe to all posts / accounts" could be lifted up to a general data sharing spec, whereas trends remain in discovery?
Since things like spam, content scanning, etc, would all need posts / accounts as well.
We discussed this internally on more than one occasion 🙂
My take is this: The mechanism described here is currently tailored for discovery. That is why we allow sharing remote content, let the FASP fetch the actual content, make doubly sure `indexable` and `discoverable` are set and so on.
Other use cases will have to work differently, I think.
Will we need a more general approach to data sharing at some point? Most probably yes.
Will discovery fit into that and / or be a special case of that? I am not sure.
So I argued (and still argue) that this is only for discovery for now and we resolve this once someone actually defines one of the other use cases in more detail.
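The "doubly sure" check mentioned above could look roughly like this on the FASP side. The Python sketch assumes that the fetched actor JSON carries boolean `discoverable` and `indexable` properties and that both must be explicitly `true` before anything is stored; the exact rule per object type is defined by the specification, not by this snippet, and `may_index` is a hypothetical name.

```python
def may_index(actor: dict) -> bool:
    """Return True only if the actor has explicitly opted in.

    Assumption: missing or non-boolean flags count as "not opted in".
    """
    return actor.get("discoverable") is True and actor.get("indexable") is True
```

Treating absent flags as a refusal keeps the default on the side of not indexing.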
* [Trends](trends/v0.1/)

  This describes the `trends` capability for providers that help
  discovering content that is currently trending.
Is there a "write" API for a FASP to say "Hey these things are trending according to me" ?
Defining the trends APIs will be one of our next steps. So far we are thinking more about the server querying the FASP and not the FASP announcing trends. But nothing is set in stone yet.
* [Data Sharing](data_sharing/v0.1/)

  All discovery providers need a way to request content from fediverse
  servers. This specification describes both an API to subscribe to new
  content as well as an API to retrieve existing content.
As mentioned above, I think half of this needs to be moved up to general to support other use-cases like spam, harassment, labeling, content detection, etc.
"objectUris": [
  "https://fediverse-server.example.com/@example/2342",
  "https://fediverse-server.example.com/@other/8726"
]
Would it be worth adding in an `actorUris` here? Such that I can easily check which actors I need to fetch for this batch of data, before fetching the actual posts themselves?
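As a small illustration of how a FASP might consume such a payload, the Python sketch below pulls out the `objectUris` and reads the `moreObjectsAvailable` key mentioned in the commit notes of this PR. It assumes that `moreObjectsAvailable` is a boolean and that a missing key means "nothing more to expect"; both are guesses, and `read_announcement` is a hypothetical helper.

```python
from typing import List, Tuple


def read_announcement(payload: dict) -> Tuple[List[str], bool]:
    """Return the de-duplicated object URIs and whether more are available."""
    uris = [u for u in payload.get("objectUris", []) if isinstance(u, str)]
    # dict.fromkeys keeps the first occurrence of each URI, in order.
    unique = list(dict.fromkeys(uris))
    # Assumption: only an explicit boolean true signals further objects.
    more_available = payload.get("moreObjectsAvailable") is True
    return unique, more_available
```

An `actorUris` list, if it were added as suggested, could be read the same way before any post is fetched.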
Once indexed or persisted in any way, FASP MUST periodically re-check
both content and account data. At least once every week FASP MUST
revalidate that the content / account is still publicly available and
still allowed to be indexed. If the content has been changed these
changes MUST be applied in the FASPs data storage as well. If FASP have
been notified of changes through their subscriptions they MAY suspend
the periodical check for this object for the next week.
This feels like it's going to potentially put excessive load on downstream servers, so instead would be best handled by forwarding Update/Delete activities.
Yes, I worry about this as well. But is this really worse than for example "regular" search engine crawlers?
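To gauge the load this rule implies, it helps to see how little state it needs per object. The following Python sketch decides whether an object is due for revalidation, based on the one-week interval from the quoted paragraph; the function and parameter names are made up for illustration, and the two timestamps are assumed to be stored by the FASP alongside each indexed object.

```python
from datetime import datetime, timedelta
from typing import Optional

RECHECK_INTERVAL = timedelta(weeks=1)  # "at least once every week"


def needs_recheck(
    last_fetched: datetime,
    last_notified: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if the object should be refetched and revalidated.

    A change notification received through a subscription postpones the
    periodic check, as the quoted paragraph allows (MAY).
    """
    latest = last_fetched
    if last_notified is not None and last_notified > latest:
        latest = last_notified
    return now - latest >= RECHECK_INTERVAL
```

Spreading the resulting refetches evenly over the week, rather than running them in one batch, would be one way to soften the load concern raised here.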
In order to subscribe to new content, FASPs can subscribe to events
by making a `POST` call to the `/data_sharing/v0/event_subscriptions`
endpoint on the fediverse server.
We should also add a `GET /data_sharing/v0/event_subscriptions` to retrieve back all of the current subscriptions that a FASP has to a server.
Likewise, there should probably be some sort of notification if the Fediverse server disconnects the FASP.
This is the first step of defining discovery providers (see https://www.fediscovery.org).
The first specification revolves around "data sharing". This is the foundation of any search and discovery related functionality. FASP need to be able to learn of new (and also existing) content and then fetch it to be able to index it.
"data sharing" as a title is still just a preliminary proposal. We used to call this "content ingestion", but the way this should work, is that instances only share URLs with FASP. A FASP is then responsible for how to act on this information. And while we included some hints on how to fetch the data properly, the core of the specification is the interaction between FASP and fediverse software. And that is not about "ingestion". Also, since we also deal with user account data, "content" might not be a perfect term here. So we arrived at "data sharing". I would be happy if anyone had a better idea, but for now I think this should work.
Note that I post this now to get early feedback, but I will also start working on an implementation. And I might still make some adjustments when I learn that some things do not work out in practice.