Discovery basics: Data Sharing #36

Open · wants to merge 12 commits into main

Conversation

@oneiros oneiros commented Nov 14, 2024

This is the first step of defining discovery providers (see https://www.fediscovery.org).

The first specification revolves around "data sharing". This is the foundation of any search and discovery related functionality. FASP need to be able to learn of new (and also existing) content and then fetch it to be able to index it.

"data sharing" as a title is still just a preliminary proposal. We used to call this "content ingestion", but the way this should work, is that instances only share URLs with FASP. A FASP is then responsible for how to act on this information. And while we included some hints on how to fetch the data properly, the core of the specification is the interaction between FASP and fediverse software. And that is not about "ingestion". Also, since we also deal with user account data, "content" might not be a perfect term here. So we arrived at "data sharing". I would be happy if anyone had a better idea, but for now I think this should work.

Note that I post this now to get early feedback, but I will also start working on an implementation. And I might still make some adjustments when I learn that some things do not work out in practice.

Most important changes:

* No longer allow subscribing to individual lifecycle events
* Historical content gets delivered asynchronously using the
  same endpoint as events
* Reword some things to better accommodate these changes
* Add a section on the actual retrieval of objects

Open questions:

* When fulfilling backfill requests, how does the fediverse
  server signal that no more content is available?
* Can we get away with the FASP having no AP inbox/outbox?
... and specify a `moreObjectsAvailable` key to signal when no more
objects are available for a backfill request.
We decided to move to separate specifications that can be
evolved independently.

Also adjusts to latest requirements from the `general` specification.
@ThisIsMissEm

I haven't reviewed the full proposal yet, but based on your description above, it may make sense to share not just URLs, but also types, hashtags and account URLs, as those are the most common things to aggregate by.
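For illustration only, an announcement item carrying that extra metadata might look something like this (none of these field names are part of the current draft; the URI follows the examples used later in the spec):

```json
{
  "uri": "https://fediverse-server.example.com/@example/2342",
  "type": "Note",
  "hashtags": ["fediscovery"],
  "attributedTo": "https://fediverse-server.example.com/@example"
}
```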

Comment on lines +16 to +17
fediverse servers MUST share not only local but also remote content with
the FASP.


This may be contentious, especially around the lack of Data Processing Agreements being in place here. i.e., could be a potential legal nightmare.

Collaborator Author

Why do you think that? IANAL but since we only exchange URLs, I do not think we are dealing with PII here. At least not directly.

And when it comes to actually fetching the content there are several layers to protect user data (i.e. only public data being fetchable in the first place, checking the consent flags and being a blockable actor).


We still are, in a way, since the FASP is requesting the content at those URLs. I think we'd want a way for a FASP, when installed, to say "Here's the Data Processing Agreement governing our processing of data from Fediverse Server ABC" and then those could automatically be appended to the Privacy Policy.

Could be through either supplied markdown or a link to a document.

Collaborator Author

We already mandate that every Specification MUST include this. It is just that I am not sure if this even applies here 😕

Collaborator Author

I made an issue out of this #36.

Comment on lines +29 to +30
actually allowed to share. It MUST NOT share any content that is not
public. In the case of account and post data it MUST make sure to only


what about unlisted?

Collaborator Author

Is that generally considered to be public? I do not think so. Do you think we need to state this here specifically?


I think it'd be wise to be exactly clear as to what content is being shared & why.

Comment on lines +33 to +34
The FASP in turn MUST also ensure that creators have opted in to
discovery before storing and indexing content.


What happens if this value changes? Should we send events to say "hey, this actor is now discoverable" or "that actor is now no longer discoverable" ?

Collaborator Author

This is a really good question and I do not have an answer for that yet.

In the case of accounts from servers that use the FASP this should not be a problem, since the FASP would subscribe to updates of them and could act accordingly.

But remote accounts and, maybe even more importantly, indexed statuses are different in that regard.

I suspect we will need to mandate that the actor is to be refetched periodically, but maybe someone has a better idea?

Collaborator Author

I added a paragraph at the end describing the need to periodically refetch the content. As an interval I chose one week, as I think this is the interval after which Mastodon considers an object to be stale.


I don't think refetching alone is enough. The Fediverse server knows through Update activities when discoverability preferences change. As such, it could send a notification to the FASP of this change.


Ditto, we know when content is edited (through Update activities) and deleted (through Delete activities), so we can also inform FASPs of that.

Collaborator Author

Yes, and this is all part of the spec already, but as always the server talking to the FASP might not know about these things if they happen on a third-party server. It should, but there are no guarantees.

Hence we will need both: Servers announcing changes to FASP and FASP rechecking content regularly.
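To make the interplay concrete, a FASP could track per object when it was last re-validated and when the last subscription event for it arrived, and only schedule a re-check when both are older than the re-check interval. A minimal sketch (none of these names come from the spec):

```typescript
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

interface IndexedObject {
  uri: string;
  lastCheckedAt: number; // last successful re-validation, ms since epoch
  lastEventAt: number;   // last lifecycle event received via a subscription
}

// Objects touched by a subscription event within the last week may skip the
// periodic check; everything else must be re-validated against the origin.
function needsRecheck(obj: IndexedObject, now: number = Date.now()): boolean {
  const freshViaEvent = now - obj.lastEventAt < ONE_WEEK_MS;
  const freshViaCheck = now - obj.lastCheckedAt < ONE_WEEK_MS;
  return !(freshViaEvent || freshViaCheck);
}
```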

Comment on lines +62 to +63
* `category`: One of `content`, `account`. This is the category of
objects the FASP is interested in.


Is this objectType? i.e., Actor, Post, Media, etc?

Collaborator Author

It was at some point, but then we decided against it (hence the change from objectType to category). At this point we do not expect FASPs to handle different types of content very differently, so it does not seem to make much sense having separate subscriptions for them.

This is also more flexible as the fediverse software can decide what it considers "content" worth indexing and the FASP can decide if and how to handle different types.


What happens if I want to subscribe to both content and account? two separate subscriptions? ditto for subscriptionType?

I think we may also want to add in link and media here separately?

Collaborator Author

> What happens if I want to subscribe to both content and account? two separate subscriptions? ditto for subscriptionType?

Yes, separate subscriptions.

> I think we may also want to add in link and media here separately?

Maybe later, but this will not be in our first version.
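For illustration, subscribing to both would then mean two requests along these lines (the `subscriptionType` value is a placeholder; only the `category` values `content` and `account` are taken from the draft):

```http
POST /data_sharing/v0/event_subscriptions
Content-Type: application/json

{"category": "content", "subscriptionType": "lifecycle"}

POST /data_sharing/v0/event_subscriptions
Content-Type: application/json

{"category": "account", "subscriptionType": "lifecycle"}
```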

Comment on lines +95 to +96
The fediverse server MUST validate the request. If it is invalid it MUST
return an HTTP status code `422` (Unprocessable Content).


Should it return different errors for different things? e.g., 400 if the payload isn't valid JSON, 422 if the schema isn't correct, and 401 if the HTTP signatures don't validate?

Collaborator Author

The latter is already part of the "general" spec. Do we need the distinction between 400 and 422? I have no idea 😕


I think we need a link to the general spec as to what "validate the request" means. I do think there's a difference between an authentication failure and a bad request body.


There's possibly also 429, or 409 (conflict) for when a FASP already has a subscription that matches this subscription.
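Pulling the suggestions in this thread together, one possible (not yet specified) split would be:

* `401` if the HTTP signature is missing or does not validate (covered by the `general` spec)
* `400` if the request body is not parseable JSON
* `422` if the body parses but does not match the schema
* `409` if an equivalent subscription already exists
* `429` if the FASP is being rate limited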

Comment on lines +53 to +55
```http
POST /data_sharing/v0/event_subscriptions
```
@ThisIsMissEm ThisIsMissEm Nov 14, 2024

Requests to this should also include an endpoint relative to the FASP such that you know where to send the content (much like with Webhooks). Additionally, care should be taken that multiple domains may be used for a FASP, e.g., in IFTAS CCS, the webapp runs at https://ccs.iftas.cloud but the endpoint for receiving webhooks is https://webhooks.ccs.iftas.cloud/. We did this because the Web App's UI doesn't really need that much scaling, but we need to be able to scale up and down the webhook service based on throughput/requests.

Collaborator Author

I opted for all endpoints to be specified here statically. I really would like these documents to be as easy and straightforward to implement as possible. And I think the fewer decisions an implementer has to make, the better.

This may of course change in the future when we have some real-world experience, but to get to that point, I think it is important to help get people to implement this.

Similarly while I agree that this will probably have to scale in some way, I am not yet sure how exactly. And if it needs to be scaled to several load-balanced instances, I see no problem with all of them serving the occasional web app request as well.

Again, this is subject to change if practical experience informs that change.


Yeah, what I'm trying to provide here is real-world experience from having built a complex application that relies on data sharing to work (IFTAS CCS). This is a real practical concern. For example, IFTAS CCS is composed of 4-6 individual services, all working together to do different parts of the process. We definitely don't want the complexity of Kafka or event queueing, and the wild scaling requirements that has, to need to be embedded in the web app. The web app also needs to be able to support multipart requests and similar content-type parsing, whereas the webhooks service is explicitly locked down on that front to prevent potential abuse of it.

Making this work more like webhook subscriptions gives developers a much easier way to implement their code & services how they would like, without being bound to how fediverse software thinks they should implement their software.
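As a concrete illustration of the webhook-style alternative being argued for here (purely hypothetical, not part of the draft; the `endpointUrl` field name follows the later comment about pre-registered endpoint domains):

```http
POST /data_sharing/v0/event_subscriptions
Content-Type: application/json

{
  "category": "content",
  "endpointUrl": "https://webhooks.ccs.iftas.cloud/fasp/announcements"
}
```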

Comment on lines 181 to 183
* `source` lets the FASP know in reply to which of its request this
announcement is sent. It MUST include an object with either a
`subscriptionId` or a `backfillRequestId`.


Would once again recommend:

```
source: {
  type: "subscription" | "backfill_request"
  id: "1234567"
}
```

Collaborator Author

I think it would be more consistent this way:

```
"source": {
  "subscription": {
    "id": "132542"
  }
}
```

I just pushed a change to that effect but am still open to alternatives.
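Combined with the `objectUris` array used elsewhere in the draft, a full announcement payload would then look roughly like this (a sketch assembled from the fields quoted in this PR, not the exact schema):

```json
{
  "source": {
    "subscription": {
      "id": "132542"
    }
  },
  "objectUris": [
    "https://fediverse-server.example.com/@example/2342",
    "https://fediverse-server.example.com/@other/8726"
  ]
}
```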

Comment on lines +238 to +243
These requests MUST be signed. To achieve a maximum of compatibility
with existing fediverse software, FASP MUST support request signing with
both "HTTP Message Signatures" as defined by
[RFC-9421](https://tools.ietf.org/html/rfc9421.html) as well as the
earlier draft version "HTTP Signatures" as defined by
[cavage-12](https://datatracker.ietf.org/doc/html/draft-cavage-http-signatures).


This will require an Actor to exist for the FASP and for the FASP to support webfinger.

Collaborator Author

Yes, as stated below, though I should probably mention webfinger.

Collaborator Author

Having re-checked I wonder: Do we really need webfinger here? The keyId used for signatures already points to the actor JSON, so in theory no webfinger lookup should be necessary.


I don't think so. I think we could just say that you need to request back the actor at that URL following the process defined in: https://swicg.github.io/activitypub-http-signature/#how-to-verify-a-signature
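A rough sketch of the key-resolution step of that flow (assuming a Mastodon-style actor document; the field names are ActivityPub conventions, not something this spec defines):

```typescript
// Dereference the keyId from the Signature header to obtain the signer's
// public key, as described in the SWICG signature report linked above.
async function fetchPublicKeyPem(keyId: string): Promise<string> {
  // keyId is usually a URL such as "https://fasp.example.com/actor#main-key"
  // that resolves to the actor document containing the key.
  const response = await fetch(keyId, {
    headers: {
      Accept: 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"',
    },
  });
  if (!response.ok) {
    throw new Error(`Failed to fetch actor for keyId ${keyId}: ${response.status}`);
  }
  const actor = await response.json();
  // Mastodon-style actors expose the key under publicKey.publicKeyPem.
  return actor.publicKey.publicKeyPem;
}
```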


i.e., I don't think we necessarily need to re-specify information that is already specified in that spec here.

Collaborator Author

I tried to implement this to work against a mastodon server and yes a whole lot more is needed than I had initially hoped, including webfinger. I updated the document accordingly.

@ThisIsMissEm ThisIsMissEm left a comment

Left a whole bunch of comments, apologies for the few that weren't part of the review... habit of clicking the wrong button.

Comment on lines +245 to +249
To find out which version a given fediverse server supports, FASP should
implement "double-knocking": They should first attempt a request using
"HTTP Message Signatures" and if the fediverse server replies with an
HTTP status code of `401` or `403` make a second attempt with the older
draft version, "HTTP Signatures".


Why not just always require sending signatures? If the server doesn't need them, it'll discard them. That'll reduce request loads by half.

Collaborator Author

We do require that, we just propose to try the final version before falling back to the draft one.

They use the same HTTP header and thus cannot both be used at the same time. I included a link to the SWICG document about this, and what I followed here is basically their recommendation.
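As a sketch of the fallback logic (the `Signer` functions producing the signature headers for each scheme are assumed and not shown; nothing here is prescribed by the spec beyond the status codes):

```typescript
// Hypothetical sketch of the "double-knocking" strategy described in the draft.
// A Signer produces the signature-related headers for one signing scheme.
type Signer = (method: string, url: string, body: string) => Record<string, string>;

async function signedPost(
  url: string,
  body: unknown,
  signRfc9421: Signer, // "HTTP Message Signatures" (RFC 9421)
  signCavage: Signer   // older draft "HTTP Signatures" (cavage-12)
): Promise<Response> {
  const payload = JSON.stringify(body);
  const post = (sign: Signer) =>
    fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json", ...sign("POST", url, payload) },
      body: payload,
    });

  // First knock: the final RFC 9421 scheme.
  let response = await post(signRfc9421);
  // If the signature is rejected, knock again with the older draft scheme.
  if (response.status === 401 || response.status === 403) {
    response = await post(signCavage);
  }
  return response;
}
```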


I'd probably move the link to the SWICG report up, and perhaps we can do some clarifications to that report, e.g., "How to send requests with signatures" which is kind of mentioned, but saying "follow this process, but do double knocking" (i.e., try latest spec, if that fails, try cavage-12). This is kinda awkwardly placed in that document.

Example call:

```http
POST /data_sharing/v0/announcements
```


I'm not in favour of this endpoint being hardcoded as relative to the FASP homepage, since this will greatly impact the ability to scale the FASP and work with different HTTP routing systems. I think it'd be better to have the FASP pre-register the domains for endpoints during registration, and then use an endpointUrl when making a subscription request, which must be on one of the registered domains and can have an arbitrary path.

e.g., maybe I want to split processing of backfilling from realtime data, such that I can throttle differently, or maybe I want to give each customer their own endpoint, such that I can more easily apply rate limits, billing, priority, etc.

Collaborator Author

I think I answered this above. Just to reiterate: I agree these are valid concerns but I would prefer to tackle those in a later version when we have some practical experience.

oneiros and others added 4 commits November 18, 2024 10:56
...describing the need to periodically recheck objects.
```diff
@@ -34,7 +34,7 @@ To learn more about the initial plans for search and discovery please visit the
 
 Specifications:
 
-* Coming soon
+* [Data Sharing](discovery/data_sharing/v0.1/data_sharing.md)
```


Should debug be in here as well?

Collaborator Author

It is in there above this section.

Comment on lines +1 to +5
# Fediscovery: Fediverse Discovery Providers

This directory contains the basic specifications for discovery
providers, FASPs that facilitate better search and discovery for the
fediverse.


Just thinking a bit more, I think the generic "subscribe to all posts / accounts" could be lifted up to a general data sharing specification, whereas trends remain in discovery?

Since things like spam, content scanning, etc, would all need posts / accounts as well.

Collaborator Author

We discussed this internally on more than one occasion 🙂

My take is this: The mechanism described here is currently tailored for discovery. That is why we allow sharing remote content, let the FASP fetch the actual content, make doubly sure indexable and discoverable are set and so on.

Other use cases will have to work differently, I think.

Will we need a more general approach to data sharing at some point? Most probably yes.

Will discovery fit into that and / or be a special case of that? I am not sure.

So I argued (and still argue) that this is only for discovery for now and we resolve this once someone actually defines one of the other use cases in more detail.

Comment on lines +27 to +30
* [Trends](trends/v0.1/)

This describes the `trends` capability for providers that help
discovering content that is currently trending.


Is there a "write" API for a FASP to say "Hey these things are trending according to me" ?

Collaborator Author

Defining the trends APIs will be one of our next steps. So far we are thinking more about the server querying the FASP and not the FASP announcing trends. But nothing is set in stone yet.

Comment on lines +21 to +25
* [Data Sharing](data_sharing/v0.1/)

All discovery providers need a way to request content from fediverse
servers. This specification describes both an API to subscribe to new
content as well as an API to retrieve existing content.


As mentioned above, I think half of this needs to be moved up to general to support other use-cases like spam, harassment, labeling, content detection, etc.

Comment on lines +225 to +228
"objectUris": [
"https://fediverse-server.example.com/@example/2342",
"https://fediverse-server.example.com/@other/8726"
]


Would it be worth adding in an actorUris here? Such that I can easily check which actors I need to fetch for this batch of data, before fetching the actual posts themselves?


Comment on lines +308 to +314
Once indexed or persisted in any way, FASP MUST periodically re-check
both content and account data. At least once every week FASP MUST
revalidate that the content / account is still publicly available and
still allowed to be indexed. If the content has been changed these
changes MUST be applied in the FASPs data storage as well. If FASP have
been notified of changes through their subscriptions they MAY suspend
the periodical check for this object for the next week.


This feels like it's going to potentially put excessive load on downstream servers, so instead would be best handled by forwarding Update/Delete activities.

Collaborator Author

Yes, I worry about this as well. But is this really worse than for example "regular" search engine crawlers?

Comment on lines +47 to +49
In order to subscribe to new content, FASPs can subscribe to events
by making a `POST` call to the `/data_sharing/v0/event_subscriptions`
endpoint on the fediverse server.


We should also add a GET /data_sharing/v0/event_subscriptions to retrieve all of the current subscriptions that a FASP has with a server.

Likewise, there should probably be some sort of notification if the Fediverse server disconnects the FASP.
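A hypothetical shape for that listing endpoint, if it were added (neither the response envelope nor the field names are specified anywhere yet):

```http
GET /data_sharing/v0/event_subscriptions

HTTP/1.1 200 OK
Content-Type: application/json

{
  "subscriptions": [
    { "id": "132542", "category": "content" },
    { "id": "132543", "category": "account" }
  ]
}
```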
