Length Limit for CID/Multihash? #4918

Closed
kevina opened this issue Apr 3, 2018 · 23 comments
Comments

@kevina
Contributor

kevina commented Apr 3, 2018

Right now (and to my surprise) there doesn't seem to be any length limit. When identity hashes are used (see #4910), an 18k file hashed with the identity hash just works. I am even able to add that file to a directory entry with the files API and retrieve it.

$ ipfs add t0250-files-api.sh --hash=id -q > hash
$ ipfs cat `cat hash` > thefile
$ diff t0250-files-api.sh thefile 
$ ipfs files cp /ipfs/`cat hash` /tmp2
$ ipfs files ls
tmp
tmp2
$ ipfs files ls -l
tmp	z45btRgLsWtUjnxUC6eCo1EWfquee6nzwdLPDGyM7YAjiZHn6iArvVr546aaCXwsinDV7pLuzx3CfM7KiAXqf9CcqjjfVU9g4toFd9VE7VMASTrxDYWCzohasmJy1CbGhoArqQp5PyNPC6SBMYMqPQsGuUNXAtEBphybn9xRw3q9U7uyw77gMz3RzRSCo5D7nd6F3KcoQYuxD6NoGuqFuYoE2PHhZLksDd5StdH2a69WEWF2Y9RP1ZMZ5igg8XZH6DREV3h3	188
tmp2	 and cat commands8YceaLocPms5oBeeYVo34TWwft94ismUB4VsVJFE782J49WqXETWBbahnDtNd45R1WTmPKh7KawZQGtCcC26xM1skGUjdBqoQdiWaGXK93j9B7zsLx5vqVgVL57XpHGRrUaWxeRSpfakpFSt454uBGiakGk4jaB5ZB7RfkocpyVCas3hyGMKAUGUcVvVGYaBy2TB9rvkLUFwBo1A9QabbNeuoz6RD55mbJXD4f85ErKjct7LooLVCMXWp7pfguQyvinqb6hYAYMVDj6Kp1BHh25yRGy5DUbjfnbbxtMcjerXSZgp56HKjrKtiDXXbbwNjhfCHkjmHxd5QTUZCaunHyPNCcfqQxANPS39Vs5L6VCkP9uuaFAyEEcf8CNqovFnhDTkesvw9e9w9WWk8aKhWgx7aXFfVMkcG6bLGuAvB2eybyEBLg5FVSJ523YcRpzx9fM1Y7PjoSCwmqkL4CiiL6Ados27vJedrJg2ewhQJvo7hLsiGgkyKZAUcStpCyZFRgye47Xi3LcaK5KQ2fqZBMCHDrCHzBziQM8ZwmHCt1qHM3rpAm1jbj7mQdxqY75yJ1cq78BmAv5ZYvkWGEKgDHK9fUZ857YH4icAQe8P6mAiALDUinVRyJwWXXjAwznwDJ1kFqoxxgBvGcCZL5m2w6qBzJhDP5ogV7vEq7LZzJc12qeQjw3NBszJiFeMJgZiBXdi7CKy5DMFtGH52s37wHTm1MxAaqhv2pJ6rKKL9FnPTSTgqJNYYV9nJH9CzoVGb1Mm15FJnref5PR2mAJqSa9UY1wSKg9JeyZG7i2ZnUUdZn5JbUvq4EcPUAKHUYvYzQFtE9V81apsuioLNemWUacAaPQSmyja9DZpi8R8Jn5SxZYE1XFqDZXGr2qS8L3RyGwssggjtpyY9asKQPUomiRzn38GnSnzTexXUaLwFhSYiA2fSySg7pBV9tATmvArpG7gbZWkHwa2EzYyJvivJbA9dJnCQo5XMJwJECHSEYa7YUkFQmFtfVMJfMv4gSCUN8JPTsBma2P9tsRixjD5XA1GSXDLoqHzFsmiDKPqyro8xdo75LohgXF9xyFA61PLgnfJ7Jfo2ZPouSNophPMXDWWaeuVp3wEaEumeacVnB8LmBcdgwN2UBbhLCLL8ZD7cNRUzJhJTV3g49Hm6uxGrSmQPCSHyAN4ha14aDxXA2HKeGdN4e2ksMCLXTSEP6FCk8Dp5Kfe3p2T2skSaNjF8NtsWnjLM3zCka9k1dn2X9hh9Hb11twpznB4XCciz7vXk7wxk7wj7DWoUNrRFkk7USCCdbEeSQxaiNexv9HdRSosz5rrzCCPCSf6QB1pA7wzy5RJZ44Pbd6U1n4kNX7KJAtBQYxAGbHLJLvwywJTvkMPRwG5ju7Z1kzU3yofpoZzGC53JbLBpVg5KbXTehsvBEaoZQ2YBRYwMbXFGSfSGsR5uUbb5WDipYZ4K2H46aLz1NJPM5G5uv3ioPRwTs2mhUuRCtWtX4dWYiDECuMYhtPgb9ybGWqakgL3BCPfBsRB1SDYZLcWqbFDScpvD2AnYszdZXHhU4WWNmQ2FcDBysXwE4fZyesFEJrC3vr51uPYCqHnHQdkosQDiV3xdQTfEo3Yoeyd3cw7ogECkKBPzBYpnXE5y1M9zAREokNBZu9YBj9Di6oBnSR8KPHdkaE1YZbkGuuqMiuDnY1DYmvy2LAKNxV5ZrkFq2Gn6PsQ27YTszg38vwJCHMvZTrVPM27QMM3bWtPMQkuFWdNKJufcwNY73F2u89jNvZKjpSUZCtMWs47fs5jqL86wQAReNVKxTVE8eRjwrfMpB2twoYTFb4TuyZLZEfUUr9bDyNsKXVy1E3X48cWgKSizdUj6twjjh7Q1HTqnaCa19EL9chFRCp2RxWX9F8ZFwVDU3PEWQR2FB4Ges8d78EgePPzrEVEiAijnxZhUV28YwdVbwPDENcEKFXV5EafZ6N7WG
DtKEMNinNdfzMPHNbyGuTCsKZiKoQekt7cFKMDfAh7	19028
$ ipfs files read /tmp2 > thefile
$ diff  t0250-files-api.sh thefile # files the same
@kevina
Contributor Author

kevina commented Apr 5, 2018

For now the wrapper blockstore enforces a limit of 64 bytes.

@ivan386
Contributor

ivan386 commented Apr 5, 2018

Why do we need a limit?

@kevina
Contributor Author

kevina commented Apr 5, 2018

Very large keys are likely to create a problem somewhere.

Right now the biggest problem is displaying them. I am not against allowing identity hashes of a few kbytes in directory entries as long as we do something about displaying them. One idea is to replace the key with something like <inlined>. This means the only way to access the content is via the directory entry, since for all intents and purposes it doesn't have a hash.

@Stebalien
Member

Yeah, the issue is for people.

@kevina
Contributor Author

kevina commented Apr 6, 2018

BTW: the 64-byte (512-bit) limit comes from the fact that this is the largest output size of modern cryptographic hash functions.

@kevina
Contributor Author

kevina commented Apr 6, 2018

Another idea is to impose a length limit on the display of hashes. That limit would be the size required to display a CID with a 64-byte hash (which will vary depending on the multibase used). Anything longer would be truncated, with a ... displayed at the end.

@ivan386
Contributor

ivan386 commented Apr 7, 2018

We can use the index of the link within the block to display it or to get the data/link.

[block cid]:1

We already have a limit on block size, so there is no need to limit its parts. If a link does not fit within the current block's limit, move it to a sub-block. If a link does not fit within the block limit at all, return an error.

This could be used for micro tar files in a directory block:
Tar header is 512 bytes + Tar Padding max 1535 bytes + link to file data 512 bits max.

@Stebalien
Member

The issue here is that people need to use CIDs. That is, /ipfs/QmId... should always be reasonably short and the user shouldn't care if the CID is inline or not.

For unixfs2, we may want to consider inline files but that changes some of the semantics of ipfs (e.g., not all files will be directly addressable without indirecting through a directory).

@kevina
Contributor Author

kevina commented Apr 10, 2018

@Stebalien if we allow long id-hashes we effectively allow inline files. You could access one via a very long hash string, but no sane person will want to do that, so you effectively have to access it via the directory entry. If we allow this, the only issue is with display. I can think of two possibilities. We replace the hash with something like <inline>:

QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn 4    l/
<inline>                                       2048 afile

or we truncate long hashes

QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn 4    l/
z45fakeidhashvLxQVkXqqLX5R1X345qqfHbsf67hvA... 2048 afile

where we determine some metric for the maximum length.
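The two display options above can be sketched roughly as follows. This is only an illustration: `display_key`, `INLINE_PLACEHOLDER`, and the 46-character width (the length of the `Qm...` hash in the listing) are assumptions made for the example, not anything go-ipfs provides.

```python
# Hypothetical names for illustration; not part of go-ipfs.
INLINE_PLACEHOLDER = "<inline>"


def display_key(key: str, max_len: int, use_placeholder: bool = False) -> str:
    """Return a listing-friendly form of an encoded CID string."""
    if len(key) <= max_len:
        return key
    if use_placeholder:
        # Option 1: hide the oversized identity hash entirely.
        return INLINE_PLACEHOLDER
    # Option 2: truncate so the total width is max_len, ending in "...".
    return key[: max_len - 3] + "..."


entries = [
    ("QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn", 4, "l/"),
    ("z45fakeidhash" + "x" * 200, 2048, "afile"),  # made-up long id-hash
]
for key, size, name in entries:
    print(f"{display_key(key, 46):<46} {size:<4} {name}")
```

Passing `use_placeholder=True` gives the `<inline>` variant instead of the truncated one.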

@mib-kd743naq
Contributor

@ivan386 We already have limits of block size. No need limits for it parts.

1MB is way too large. See below.

you could access it via an a very long hash string, but no sane person will want to do that

They might want to in a pinch (i.e. useful when the IPFS network is unavailable for some reason). What we need to design for is ensuring that it is always possible to do so.

The easiest hard limitation that comes to mind is the URL length. Given there is a desire to move over to base32 encoding in the future, this translates to a max of:

2000 =
  ( maximum_data_length + 3_bytes_of_CID_prefix + 2_bytes_of_varint_size_under_16k)
    *
  log(256) / log(32)
   +
  ~25_bytes_to_hold_the_url_prefix_https://ipfs.io/ipfs/

2000 = ( maximum_data_length + 5 ) * 8/5 + 25

maximum_data_length = 1229

To allow larger leading URLs we should probably cap this at 1200

There is one more consideration: bookmarks could be capped at 260 characters on some browsers. If we take that as the basis ( should we? ) we get a maximum_data_length of only 141 bytes
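The arithmetic above can be reproduced with a short sketch. It follows the same simplifications as the formula in this comment (8 encoded characters per 5 bytes, 5 bytes of CID overhead, a 25-character URL prefix, no multibase prefix character); the function name and defaults are assumptions chosen for the example.

```python
def max_inline_bytes(url_limit: int, url_prefix_len: int = 25, overhead: int = 5) -> int:
    """Largest payload whose base32 identity CID fits in url_limit characters.

    Base32 emits 8 characters for every 5 bytes, so the byte budget is
    (url_limit - url_prefix_len) * 5 / 8, minus the CID/varint overhead.
    """
    encoded_budget = url_limit - url_prefix_len
    return encoded_budget * 5 // 8 - overhead


print(max_inline_bytes(2000))  # 1229 (URL length limit)
print(max_inline_bytes(260))   # 141  (bookmark length limit)
```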

@ivan386
Contributor

ivan386 commented Apr 11, 2018

@mib-kd743naq

They might want to in a pinch ( i.e. useful when the IPFS network is unavailable for some reason ). What we need to design for is ensuring that it is always possible.to do so.

Pin the underlying block. The index of the link in the block might be used to limit the pin.

The data URI scheme is the web's analog of the identity hash.

Data protocol URL size limitations

@Stebalien
Member

if will allow long id-hashes we effectively allow inline files, you could access it via an a very long hash string, but no sane person will want to do that, so you effectively have to access via the directory entry. If we allow this the only issue is with display.

So, for unixfs2, we actually don't need to use CIDs for this. Instead, we can just allow inline files:

{
    "file1": {
        ... // metadata
        data: Cid(...),
    },
    "file2": {
        ... // metadata,
        data: [bytes...],
    },
}

The primary motivation for inline CIDs is that it allows us to concisely point to inlined objects. If we can't do that (e.g., have a large, unwieldy CID), then there isn't much of a point in using CIDs.
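As a rough sketch of the shape described above, a directory entry's data could be either a link or the bytes themselves. unixfs2 has no agreed definition, so every name here (`CidLink`, `FileEntry`, `read_file`, the `fetch` callback) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class CidLink:
    cid: str  # encoded CID of a separately stored block


@dataclass
class FileEntry:
    metadata: dict
    data: Union[CidLink, bytes]  # either a link or the file bytes inline


def read_file(entry: FileEntry, fetch: Callable[[str], bytes]) -> bytes:
    """Return file contents, fetching by CID only when the data is not inline."""
    if isinstance(entry.data, bytes):
        return entry.data
    return fetch(entry.data.cid)


# Usage with a stand-in block store (a plain dict).
store = {"QmExample": b"stored elsewhere"}
directory = {
    "file1": FileEntry(metadata={}, data=CidLink("QmExample")),
    "file2": FileEntry(metadata={}, data=b"inline bytes"),
}
for name, entry in directory.items():
    print(name, read_file(entry, store.__getitem__))
```

Note that `file2` has no CID of its own in this layout, which is the semantic change mentioned earlier in the thread: it is only reachable through the directory.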

@ivan386
Contributor

ivan386 commented Apr 19, 2018

@Stebalien what about inline data in files in any place?

@kevina
Contributor Author

kevina commented Apr 19, 2018

@Stebalien unixfs2 is not defined yet, and there is no clear consensus on what to include (that is unless I am missing something).

@kevina
Contributor Author

kevina commented Apr 19, 2018

In particular I am not sure allowing inline files is such a good idea; at the very least it requires careful thought.

I am more inclined to allow larger identity hashes.

@Stebalien
Member

CIDs need to remain usable in paths. I agree that we'd need to be careful about inline files (and I'm not even sure if we should support them) but CIDs must remain usable by humans.

We need to focus on the motivations:

  • CIDs shouldn't be more than ~50% of the size of the file (max): Inline CIDs.
  • Transferring lots of small blocks is inefficient: Make it efficient (e.g., batch transfer).
  • Storing lots of small blocks is inefficient: Use a better datastore.

@mib-kd743naq
Contributor

CIDs must remain usable by humans.

@Stebalien do you have thoughts on the ~140byte figure I derived in #4918 (comment) ?

@kevina
Contributor Author

kevina commented Jun 2, 2018

This needs to be decided on; changing it later will create compatibility problems. I would prefer to use a nice power-of-two size. Right now I am trying to decide between 64 and 128 bytes for the length of the hash component. If we continue to use 256-bit (= 32-byte) hashes, then 64 satisfies the requirement "CIDs shouldn't be more than ~50% of the size of the file (max): Inline CIDs.". 128 bytes will be a bit more flexible. While it will create annoyingly long hashes, they are still manageable, for example when part of the URL as described above.

@Stebalien @whyrusleeping @Kubuxu (others?) thoughts?

@Stebalien
Member

One thing to remember is that we're talking about a cutoff, not a maximum. That is, we're not forbidding users from creating larger CIDs, we're saying that all files smaller than X will be embedded directly in the CID.

So, if we say 140 or even 128, we will generate a ton of 140/128 byte CIDs automatically. 128 looks like:

/ipfs/zDL1jdad4g2Dzzabm1PboDzAd9Xh5HBpk4B6vccGziRVW7KHXSZifGEjtNbB4DxmZnciQiyxrqGx3AoRgaKkVM7PsuKWMbrSHPQ21ARmPouAypAG2PYxdFWAMhrrSax3nKvz3aLxbmUAQWHYGWxtkTHLhVGqQZkX8Yxq2UBTzveQaczj

On the other hand, a 64 byte (base58 encoded) CID looks like:

/ipfs/z3QhWPA3C2ZgtVBGRP8SwU1eEu4on2fuPxv8y1WonNNqMr6kjQnLw78e311KBwqznkxJqwgFKRob3oq779Q5uXzZ5

Currently, a V1 CID looks like:

/ipfs/zb2rhgb5oHpycXA39nhQSknjAWW59r5TvcmLh8QAAd3mQAXgD


Honestly, even the 64 byte CID is a bit long. We may even want to consider 52 bytes, as the resulting path (including the prefix) will fit in 78 bytes (under 80 characters):

/ipfs/zVvDErb6sZbLuCww5AxA4dy4a7Mro6ZGcvXxnAD8NiARgFWgQZnyxSP5tkrXRUqiCZUHFFgT

However, that doesn't give us much room to work with.

So, I'd say that 64 is a max, at the very least.
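The cutoff idea can be sketched as below: payloads at or under the cutoff are embedded with the identity multihash, and anything larger is hashed. The multicodec/multihash codes and varint framing follow the multiformats tables, but `cid_for`, its `cutoff` parameter, the choice of the raw codec, and the use of "at most" rather than "smaller than" are assumptions made for the example. Note that `cutoff` here counts payload (digest) bytes, so the full binary CID is a few bytes longer.

```python
import hashlib

RAW_CODEC = 0x55  # multicodec "raw"
IDENTITY = 0x00   # multihash "identity"
SHA2_256 = 0x12   # multihash "sha2-256"


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cid_for(data: bytes, cutoff: int = 64) -> bytes:
    """Binary CIDv1 that inlines the data when it is at most `cutoff` bytes."""
    if len(data) <= cutoff:
        code, digest = IDENTITY, data
    else:
        code, digest = SHA2_256, hashlib.sha256(data).digest()
    return varint(1) + varint(RAW_CODEC) + varint(code) + varint(len(digest)) + digest


print(len(cid_for(b"x" * 64)))  # 68: 4 prefix bytes + 64 inlined bytes
print(len(cid_for(b"x" * 65)))  # 36: 4 prefix bytes + 32-byte sha2-256 digest
```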

@kevina
Contributor Author

kevina commented Jun 3, 2018

That is, we're not forbidding users from creating larger CIDs

Actually that was part of the plan, unless we create a special rule that says id hashes cannot be longer than XX, but other CIDs could, which doesn't make sense to me.

We could just allow longer CIDs and just not use them by default, as I don't think that will break anything, and now that I think about it, that might be the best way forward.

Also I want to be clear when I say 64 bytes, I mean a 64 byte digest length, the complete Cid with the prefix (including the multihash one) will likely be a bit longer.

If we do set a hard limit then 64 should be an absolute minimum in case someday we want to use the full 512 bits of some cryptographic hashes.

Also note, if we really want to consider things such as display width (which I don't think is such a good idea), consider that we may switch to base 32, which is slightly longer.

I now see three options

  1. Don't set a hard limit on CID digest size, but by default id hashes will have a maximum digest length (and thus content length) of 64 bytes
  2. Set a hard limit of 128 bytes on digest length (to keep things from getting too out of hand, but also to not artificially limit our options) but limit id hashes to 64 bytes by default.
  3. Set a hard limit of 64 bytes on digest length and thus limit id hashes to this length

I am now leaning towards (1), and possibly going with (2). (3) is an option but I don't see a technical reason to force CID length to this.
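To make the difference between the three options concrete, here is a minimal sketch that models each as a pair of settings. `DigestPolicy`, `accepts`, and `inlines_by_default` are hypothetical names used only for this comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DigestPolicy:
    hard_limit: Optional[int]  # None means no hard limit on digest length
    identity_default: int      # largest payload inlined by default


OPTION_1 = DigestPolicy(hard_limit=None, identity_default=64)
OPTION_2 = DigestPolicy(hard_limit=128, identity_default=64)
OPTION_3 = DigestPolicy(hard_limit=64, identity_default=64)


def accepts(policy: DigestPolicy, digest_len: int) -> bool:
    """Whether a CID with this digest length is valid under the policy."""
    return policy.hard_limit is None or digest_len <= policy.hard_limit


def inlines_by_default(policy: DigestPolicy, data_len: int) -> bool:
    """Whether data of this size is turned into an id hash without opting in."""
    return data_len <= policy.identity_default and accepts(policy, data_len)


for name, policy in [("1", OPTION_1), ("2", OPTION_2), ("3", OPTION_3)]:
    print(name, accepts(policy, 100), inlines_by_default(policy, 100))
```

Under this model a 100-byte id hash is accepted by options (1) and (2) but rejected by (3), and none of the three would create it by default.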

@ivan386
Contributor

ivan386 commented Jun 3, 2018

It is important to me that the program can understand identifiers of any size and doesn't discard blocks that use them.

@Stebalien
Member

I've moved this to the CID repo as it's a spec issue and we need to involve people outside of Go.

@Stebalien
Member

New issue: multiformats/cid#21
