星火 SparkCN

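Pain-point analysis, published 2026/05/29

The pain points are a preliminary AI distillation of the upstream raw evidence; no additional China-market research is included.

Pain points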

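Binary protocol designers need a compact, unambiguous variable-length integer encoding, but existing schemes such as LEB128 allow non-canonical encodings: the same value can be represented by more than one byte sequence, which the article links to a subtle signature-verification bug. The article notes that most varint designs treat canonicality as an afterthought, enforced by a runtime check in the decoder rather than by the structure of the encoding itself, which adds complexity and room for error. bijou64 makes each number representable in only one way, but commenters point out trade-offs. One reports that a similar length-in-first-byte design lost to ULEB128 and sentinel values once SIMD instructions were used. Another notes that bijou64 offers only 500 values that fit in 2 bytes or fewer, whereas LEB128 stays at 2 bytes up to 2^14, which matters when the numbers serve as tags or identifiers (as in the multicodec project) or as network message lengths. Developers therefore still have to trade off security, performance, and compactness, and none of the schemes discussed satisfies all three at once.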
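To make the canonicality point concrete, here is a minimal Python sketch that is not taken from the article. The function names are hypothetical. It shows a plain unsigned LEB128 decoder accepting several byte strings for the same value, together with the kind of runtime check a decoder has to add if it wants to reject them.

```python
from typing import Tuple


def decode_uleb128(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, bytes_consumed)."""
    value = 0
    shift = 0
    for i in range(offset, len(data)):
        byte = data[i]
        value |= (byte & 0x7F) << shift  # low 7 bits carry payload
        shift += 7
        if byte & 0x80 == 0:  # high bit clear marks the last byte
            return value, i - offset + 1
    raise ValueError("truncated LEB128 value")


def decode_uleb128_canonical(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Same decoder, plus a runtime check that rejects padded encodings."""
    value, size = decode_uleb128(data, offset)
    # In a multi-byte encoding, a final byte of 0x00 adds no payload bits,
    # so a shorter encoding of the same value must exist.
    if size > 1 and data[offset + size - 1] == 0x00:
        raise ValueError("non-canonical (overlong) LEB128 encoding")
    return value, size
```

With the plain decoder, `b"\x01"`, `b"\x81\x00"`, and `b"\x81\x80\x00"` all decode to 1, so two parties can hold different bytes for what they consider the same message. The second function is the "check bolted on afterwards" style that the article contrasts with an encoding that is canonical by construction.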

External Article

External article summary

An accidentally fast variable-length integer encoding

External article source

Article title: bijou64
Source URL: https://www.inkandswitch.com/tangents/bijou64/
Host: www.inkandswitch.com

§ Dossier

Selected HN comments

The problem is that this breaks down once you try to use SIMD instructions. I'd developed a similar kind of approach to encoding integers (and IEEE 754 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches. But testing proved that when you move to SIMD instructions, ULEB128 ( https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty... ) or sentinel values ( https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo... ) win every time because of the parallelization opportunities. The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

kstenerud

This reminded me of ISO 7816-4 BER-TLV encodings, which use the format defined in ISO/IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there's no offsetting involved, making it slightly less compact, but also dead simple.

EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:

  -- The Yubikey 4 has an off-by-one bug which
  -- declares tag length of 255 (for the 0x53 outer
  -- tag of a certificate DO) when there are only 254
  -- bytes remaining in the reply. The reply is
  -- chained across two packets, but the off-by-one is
  -- probably related to the over-long encoded length
  -- (0x82 0x00 0xff instead of 0x81 0xff).
  --
  -- [snip packet captures]
  --
  -- Yubico's ykpiv_fetch_object function in ykpiv.c
  -- (confirmed 1.4.3-1.5.0) contains a read (memmove)
  -- overflow when the declared inner BER-TLV length
  -- (of the 0x53 tag) is longer than what was
  -- received over the wire. That makes Yubico's
  -- library oblivious to the issue. Relatedly, the
  -- set_length function has an off-by-one bug (length
  -- < 0xff instead of length <= 0xff) which produces
  -- an over-long encoded length. That doesn't by
  -- itself explain why the Yubikey 4 transmits a
  -- truncated logical reply unless the same code is
  -- being used.

wahern
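The overlong length described in this comment can be illustrated with a short Python sketch. This is not Yubico's code: `decode_ber_length` and its `strict` flag are hypothetical names, and the strict rule shown is the DER-style minimal-length requirement, used here only as one plausible way to reject overlong forms.

```python
from typing import Tuple


def decode_ber_length(data: bytes, offset: int = 0,
                      strict: bool = False) -> Tuple[int, int]:
    """Decode a BER definite length; return (length, bytes_consumed)."""
    first = data[offset]
    if first < 0x80:
        return first, 1  # short form: the byte is the length itself
    count = first & 0x7F  # long form: low 7 bits give the octet count
    if count == 0:
        raise ValueError("indefinite length is not handled in this sketch")
    body = data[offset + 1:offset + 1 + count]
    if len(body) != count:
        raise ValueError("truncated length field")
    length = int.from_bytes(body, "big")
    if strict and (length < 0x80 or body[0] == 0x00):
        # Minimality check: the short form or fewer octets would suffice.
        raise ValueError("overlong length encoding")
    return length, 1 + count
```

In the lenient mode both `0x81 0xff` and `0x82 0x00 0xff` decode to 255, which is the pair of encodings quoted in the comment. With `strict=True` only the two-byte form is accepted.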

Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128. The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references! The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128); that way, when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful", but it's a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.

i2talics
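The fixed-width trick in this comment can be sketched in a few lines of Python. The function names and the toy byte buffer are hypothetical, and real toolchains handle relocations with far more machinery; the only point shown is that a padded 5-byte LEB128 slot can be overwritten without shifting neighbouring bytes.

```python
def encode_uleb128_padded(value: int, width: int = 5) -> bytes:
    """Encode value as unsigned LEB128 using exactly `width` bytes."""
    if value < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    for i in range(width):
        byte = value & 0x7F
        value >>= 7
        if i < width - 1:
            byte |= 0x80  # continuation bit on every byte except the last
        out.append(byte)
    if value:
        raise ValueError("value does not fit in the requested width")
    return bytes(out)


def patch_reference(buf: bytearray, offset: int, address: int,
                    width: int = 5) -> None:
    """Overwrite a padded placeholder in place; the buffer size is unchanged."""
    buf[offset:offset + width] = encode_uleb128_padded(address, width)
```

A placeholder emitted as `encode_uleb128_padded(0)` occupies the same five bytes as the final address, so `patch_reference` never moves the code around it. The cost is the non-canonical padding that a strictly canonical encoding would forbid.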

I like the denormalization of VLE ints (with or without zig-zag encoding of negatives); it helps support out-of-band information, such as nulls and other signals in serialization protocols, with minimal overhead. For example, you can use a denormalized zero to signal null. You can still define a canonical encoding where denormalizations have specific meaning or signal an error.

juancn
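As a rough illustration of this idea, the Python sketch below reserves one padded LEB128 encoding of zero as a null marker and treats every other padded form as an error. The choice of `0x80 0x00` as the marker and the name `decode_nullable` are assumptions made for the example; the comment does not specify a concrete format.

```python
from typing import Optional, Tuple

# Assumption for this sketch: a two-byte padded zero means "null".
NULL_MARKER = b"\x80\x00"


def decode_nullable(data: bytes, offset: int = 0) -> Tuple[Optional[int], int]:
    """Decode one value; return (value or None, bytes_consumed)."""
    if data[offset:offset + 2] == NULL_MARKER:
        return None, 2
    value = 0
    shift = 0
    for i in range(offset, len(data)):
        byte = data[i]
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            size = i - offset + 1
            if size > 1 and byte == 0x00:
                # Any other padded form has no assigned meaning here.
                raise ValueError("unassigned denormalized encoding")
            return value, size
    raise ValueError("truncated value")
```

Here `b"\x00"` is the ordinary zero, `b"\x80\x00"` is null, and `b"\x81\x00"` is rejected, so the format stays canonical for real values while still carrying one out-of-band signal.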

I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte). The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers. [1]: https://github.com/multiformats/multicodec

stebalien
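The LEB128 size figures in this comment follow from its 7 payload bits per byte, which the small Python sketch below works through. The helper names are hypothetical, and only LEB128 is computed; the figure of 500 for bijou64 is the commenter's and is not derived here.

```python
def uleb128_size(value: int) -> int:
    """Number of bytes canonical unsigned LEB128 needs for value."""
    if value < 0:
        raise ValueError("unsigned values only")
    # Each byte carries 7 payload bits; zero still needs one byte.
    return max(1, (value.bit_length() + 6) // 7)


def uleb128_values_within(max_bytes: int) -> int:
    """How many distinct values fit in at most max_bytes bytes."""
    return 2 ** (7 * max_bytes)
```

`uleb128_size(2**14 - 1)` is 2 and `uleb128_size(2**14)` is 3, so 16,384 values fit in two bytes or fewer, compared with the 500 the commenter cites for bijou64. At the other end, `uleb128_size(2**64 - 1)` is 10, which is the "extra 10th byte" needed for the full uint64 range.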

Source data · Raw Archive

source: Hacker News
upstream_source: hacker_news
upstream_item_id: 48323992
daily_ranking_item_id: fd2b8c79-fa96-4b4a-839a-d9bbde414ec9
rank_date: 2026-05-30
rank: 7
name: Bijou64: A variable-length integer encoding
tagline: www.inkandswitch.com
votes_count: 190
comments_count: 70
created_at_on_source: 2026-05-29T15:03:32.000Z
source_url: https://news.ycombinator.com/item?id=48323992
website_url: https://www.inkandswitch.com/tangents/bijou64/

media / source-specific data

{
  "author": "justinweiss",
  "hn_item_id": 48323992,
  "external_url": "https://www.inkandswitch.com/tangents/bijou64/"
}

raw_payload

{
  "by": "justinweiss",
  "id": 48323992,
  "url": "https://www.inkandswitch.com/tangents/bijou64/",
  "kids": [
    48325099,
    48327115,
    48326265,
    48329771,
    48324781,
    48324942,
    48325019,
    48326671,
    48325332,
    48326400,
    48324865,
    48326063,
    48324889,
    48326176,
    48325724,
    48327963,
    48326121,
    48324960,
    48327306,
    48325035,
    48325402,
    48325508,
    48324675
  ],
  "time": 1780067012,
  "type": "story",
  "score": 190,
  "title": "Bijou64: A variable-length integer encoding",
  "descendants": 70
}

source_raw_snapshot

{
  "id": "8bd22fca-9a8d-4a7e-b3c7-347e49086ca7",
  "daily_ranking_item_id": "fd2b8c79-fa96-4b4a-839a-d9bbde414ec9",
  "source": "hacker_news",
  "external_id": "48323992",
  "fetched_at": "2026-05-29T22:01:20.907Z",
  "story_raw": {
    "by": "justinweiss",
    "id": 48323992,
    "url": "https://www.inkandswitch.com/tangents/bijou64/",
    "kids": [
      48325099,
      48327115,
      48326265,
      48329771,
      48324781,
      48324942,
      48325019,
      48326671,
      48325332,
      48326400,
      48324865,
      48326063,
      48324889,
      48326176,
      48325724,
      48327963,
      48326121,
      48324960,
      48327306,
      48325035,
      48325402,
      48325508,
      48324675
    ],
    "time": 1780067012,
    "type": "story",
    "score": 190,
    "title": "Bijou64: A variable-length integer encoding",
    "descendants": 70
  },
  "stats_raw": {
    "time": 1780067012,
    "score": 190,
    "descendants": 70
  },
  "aux_raw": {
    "external_url": "https://www.inkandswitch.com/tangents/bijou64/",
    "hn_comment_url": "https://news.ycombinator.com/item?id=48323992",
    "normalized_text": null,
    "external_article": {
      "title": "bijou64",
      "excerpt": "It’s nice when you work on security and accidentally get some performance for free. This is the story of a small encoding called bijou64 — a variable-length integer (varint) encoding that we developed for the Subduction CRDT sync protocol. It was intended to fix a subtle signature-verification bug by making each number only representable a single way. It turned out to also run a few times faster than the more common varint LEB128 .\n\nWe didn’t set out to write a fast varint, but it turns out that our design constraints made for an encoding that has to do less work.\n\nMany binary protocols need a compact way to encode integers that are usually small but occasionally large. Variable-length integer encodings (“varints”) solve this, but most designs treat canonicality as an afterthought — something enforced by a runtime check in the decoder rather than by the structure of the encoding itself.\n\nSince it’s the most common varint, we’re going to pick on LEB128 a bit here. I want to emphasize how much LEB128 is a great choice for many projects, and the reasons that it was not a good choice for us also applies to the other formats that we looked at. It just happened to not be a perfect fit fo",
      "final_url": "https://www.inkandswitch.com/tangents/bijou64/",
      "fetched_at": "2026-05-29T22:01:18.106Z",
      "description": "An accidentally fast variable-length integer encoding"
    },
    "selected_comments": [
      {
        "id": 48325099,
        "raw": {
          "by": "kstenerud",
          "id": 48325099,
          "kids": [
            48325533,
            48325932,
            48327921
          ],
          "text": "The problem is that this breaks down once you try to use SIMD instructions. I&#x27;d developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;05b91f6fe7d6b0718686830abfb5028157c3fd28&#x2F;bonjson.md#length-field\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;05b91f6fe7d6b07186...</a> ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.<p>But testing proved that when you move to SIMD instructions, ULEB128 (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#typed-array\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#ty...</a>) or sentinel values (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#long-string\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#lo...</a>) win every time because of the parallelization opportunities.<p>The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.",
          "time": 1780070871,
          "type": "comment",
          "parent": 48323992
        },
        "body": "The problem is that this breaks down once you try to use SIMD instructions. I'd developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches. But testing proved that when you move to SIMD instructions, ULEB128 ( https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty... ) or sentinel values ( https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo... ) win every time because of the parallelization opportunities. The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.",
        "is_op": false,
        "author": "kstenerud",
        "raw_body": "The problem is that this breaks down once you try to use SIMD instructions. I&#x27;d developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;05b91f6fe7d6b0718686830abfb5028157c3fd28&#x2F;bonjson.md#length-field\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;05b91f6fe7d6b07186...</a> ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.<p>But testing proved that when you move to SIMD instructions, ULEB128 (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#typed-array\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#ty...</a>) or sentinel values (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#long-string\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#lo...</a>) win every time because of the parallelization opportunities.<p>The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.",
        "created_at": 1780070871,
        "reply_count": 3
      },
      {
        "id": 48327115,
        "raw": {
          "by": "wahern",
          "id": 48327115,
          "text": "This reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in  ISO&#x2F;IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there&#x27;s no offsetting involved, making it slightly less compact, but also dead simple.<p>EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:<p><pre><code>  -- The Yubikey 4 has an off-by-one bug which\n  -- declares tag length of 255 (for the 0x53 outer\n  -- tag of a certficate DO) when there are only 254\n  -- bytes remaining in the reply. The reply is\n  -- chained across two packets, but the off-by-one is\n  -- probably related to the over-long encoded length\n  -- (0x82 0x00 0xff instead of 0x81 0xff).\n  --\n  -- [snip packet captures]\n  --\n  -- Yubico&#x27;s ykpiv_fetch_object function in ykpiv.c\n  -- (confirmed 1.4.3-1.5.0) contains a read (memmove)\n  -- overflow when the declared inner BER-TLV length\n  -- (of the 0x53 tag) is longer than what was\n  -- received over the wire. That makes Yubico&#x27;s\n  -- library oblivious to the issue. Relatedly, the\n  -- set_length function has an off-by-one bug (length\n  -- &lt; 0xff instead of length &lt;= 0xff) which produces\n  -- an over-long encoded length. That doesn&#x27;t by\n  -- itself explain why the Yubikey 4 transmits a\n  -- truncated logical reply unless the same code is\n  -- being used.</code></pre>",
          "time": 1780078602,
          "type": "comment",
          "parent": 48323992
        },
        "body": "This reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in ISO/IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there's no offsetting involved, making it slightly less compact, but also dead simple. EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround: -- The Yubikey 4 has an off-by-one bug which -- declares tag length of 255 (for the 0x53 outer -- tag of a certficate DO) when there are only 254 -- bytes remaining in the reply. The reply is -- chained across two packets, but the off-by-one is -- probably related to the over-long encoded length -- (0x82 0x00 0xff instead of 0x81 0xff). -- -- [snip packet captures] -- -- Yubico's ykpiv_fetch_object function in ykpiv.c -- (confirmed 1.4.3-1.5.0) contains a read (memmove) -- overflow when the declared inner BER-TLV length -- (of the 0x53 tag) is longer than what was -- received over the wire. That makes Yubico's -- library oblivious to the issue. Relatedly, the -- set_length function has an off-by-one bug (length -- < 0xff instead of length <= 0xff) which produces -- an over-long encoded length. That doesn't by -- itself explain why the Yubikey 4 transmits a -- truncated logical reply unless the same code is -- being used.",
        "is_op": false,
        "author": "wahern",
        "raw_body": "This reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in  ISO&#x2F;IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there&#x27;s no offsetting involved, making it slightly less compact, but also dead simple.<p>EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:<p><pre><code>  -- The Yubikey 4 has an off-by-one bug which\n  -- declares tag length of 255 (for the 0x53 outer\n  -- tag of a certficate DO) when there are only 254\n  -- bytes remaining in the reply. The reply is\n  -- chained across two packets, but the off-by-one is\n  -- probably related to the over-long encoded length\n  -- (0x82 0x00 0xff instead of 0x81 0xff).\n  --\n  -- [snip packet captures]\n  --\n  -- Yubico&#x27;s ykpiv_fetch_object function in ykpiv.c\n  -- (confirmed 1.4.3-1.5.0) contains a read (memmove)\n  -- overflow when the declared inner BER-TLV length\n  -- (of the 0x53 tag) is longer than what was\n  -- received over the wire. That makes Yubico&#x27;s\n  -- library oblivious to the issue. Relatedly, the\n  -- set_length function has an off-by-one bug (length\n  -- &lt; 0xff instead of length &lt;= 0xff) which produces\n  -- an over-long encoded length. That doesn&#x27;t by\n  -- itself explain why the Yubikey 4 transmits a\n  -- truncated logical reply unless the same code is\n  -- being used.</code></pre>",
        "created_at": 1780078602,
        "reply_count": 0
      },
      {
        "id": 48326265,
        "raw": {
          "by": "i2talics",
          "id": 48326265,
          "kids": [
            48328121
          ],
          "text": "Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.<p>The problem is linking: a compiler needs to emit code into independent translation units, which contain &quot;missing&quot; references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don&#x27;t know where the location of other code is yet, we don&#x27;t know how big the number representing that location is yet, which means that we don&#x27;t know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!<p>The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers <i>can</i> be converted to a non-canonical 5 byte form that is &quot;wasteful&quot; but its a worthwhile tradeoff because it solves this issue. Other integers that don&#x27;t need to be linked can be packed in a smaller var int form to save space.",
          "time": 1780075199,
          "type": "comment",
          "parent": 48323992
        },
        "body": "Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128. The problem is linking: a compiler needs to emit code into independent translation units, which contain \"missing\" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references! The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is \"wasteful\" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.",
        "is_op": false,
        "author": "i2talics",
        "raw_body": "Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.<p>The problem is linking: a compiler needs to emit code into independent translation units, which contain &quot;missing&quot; references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don&#x27;t know where the location of other code is yet, we don&#x27;t know how big the number representing that location is yet, which means that we don&#x27;t know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!<p>The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers <i>can</i> be converted to a non-canonical 5 byte form that is &quot;wasteful&quot; but its a worthwhile tradeoff because it solves this issue. Other integers that don&#x27;t need to be linked can be packed in a smaller var int form to save space.",
        "created_at": 1780075199,
        "reply_count": 1
      },
      {
        "id": 48329771,
        "raw": {
          "by": "juancn",
          "id": 48329771,
          "text": "I like the denormalization of VLE ints (with or without zig-zag encoding of negatives), it helps support out of band information, such as nulls and other signals in serialization protocols with minimal overhead.<p>For example you can use a denormalized zero to signal null.<p>You can still define a canonical encoding where denormalizations have specific meaning or signal an error.",
          "time": 1780091356,
          "type": "comment",
          "parent": 48323992
        },
        "body": "I like the denormalization of VLE ints (with or without zig-zag encoding of negatives), it helps support out of band information, such as nulls and other signals in serialization protocols with minimal overhead. For example you can use a denormalized zero to signal null. You can still define a canonical encoding where denormalizations have specific meaning or signal an error.",
        "is_op": false,
        "author": "juancn",
        "raw_body": "I like the denormalization of VLE ints (with or without zig-zag encoding of negatives), it helps support out of band information, such as nulls and other signals in serialization protocols with minimal overhead.<p>For example you can use a denormalized zero to signal null.<p>You can still define a canonical encoding where denormalizations have specific meaning or signal an error.",
        "created_at": 1780091356,
        "reply_count": 0
      },
      {
        "id": 48324781,
        "raw": {
          "by": "stebalien",
          "id": 48324781,
          "kids": [
            48325213
          ],
          "text": "I&#x27;ve used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).<p>The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you&#x27;re using these numbers as tags&#x2F;identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 &lt;= 2 byte numbers.<p>[1]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;multiformats&#x2F;multicodec\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;multiformats&#x2F;multicodec</a>",
          "time": 1780069817,
          "type": "comment",
          "parent": 48323992
        },
        "body": "I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte). The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers. [1]: https://github.com/multiformats/multicodec",
        "is_op": false,
        "author": "stebalien",
        "raw_body": "I&#x27;ve used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).<p>The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you&#x27;re using these numbers as tags&#x2F;identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 &lt;= 2 byte numbers.<p>[1]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;multiformats&#x2F;multicodec\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;multiformats&#x2F;multicodec</a>",
        "created_at": 1780069817,
        "reply_count": 1
      }
    ],
    "presentation_fields": {
      "title": "Bijou64: A variable-length integer encoding",
      "tagline": "www.inkandswitch.com",
      "website_url": "https://www.inkandswitch.com/tangents/bijou64/",
      "canonical_url": "https://news.ycombinator.com/item?id=48323992"
    },
    "external_url_hostname": "www.inkandswitch.com",
    "selected_comments_raw": [
      {
        "by": "kstenerud",
        "id": 48325099,
        "kids": [
          48325533,
          48325932,
          48327921
        ],
        "text": "The problem is that this breaks down once you try to use SIMD instructions. I&#x27;d developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;05b91f6fe7d6b0718686830abfb5028157c3fd28&#x2F;bonjson.md#length-field\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;05b91f6fe7d6b07186...</a> ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.<p>But testing proved that when you move to SIMD instructions, ULEB128 (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#typed-array\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#ty...</a>) or sentinel values (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#long-string\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;kstenerud&#x2F;bonjson&#x2F;blob&#x2F;main&#x2F;bonjson.md#lo...</a>) win every time because of the parallelization opportunities.<p>The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.",
        "time": 1780070871,
        "type": "comment",
        "parent": 48323992
      },
      {
        "by": "wahern",
        "id": 48327115,
        "text": "This reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in  ISO&#x2F;IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there&#x27;s no offsetting involved, making it slightly less compact, but also dead simple.<p>EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:<p><pre><code>  -- The Yubikey 4 has an off-by-one bug which\n  -- declares tag length of 255 (for the 0x53 outer\n  -- tag of a certficate DO) when there are only 254\n  -- bytes remaining in the reply. The reply is\n  -- chained across two packets, but the off-by-one is\n  -- probably related to the over-long encoded length\n  -- (0x82 0x00 0xff instead of 0x81 0xff).\n  --\n  -- [snip packet captures]\n  --\n  -- Yubico&#x27;s ykpiv_fetch_object function in ykpiv.c\n  -- (confirmed 1.4.3-1.5.0) contains a read (memmove)\n  -- overflow when the declared inner BER-TLV length\n  -- (of the 0x53 tag) is longer than what was\n  -- received over the wire. That makes Yubico&#x27;s\n  -- library oblivious to the issue. Relatedly, the\n  -- set_length function has an off-by-one bug (length\n  -- &lt; 0xff instead of length &lt;= 0xff) which produces\n  -- an over-long encoded length. That doesn&#x27;t by\n  -- itself explain why the Yubikey 4 transmits a\n  -- truncated logical reply unless the same code is\n  -- being used.</code></pre>",
        "time": 1780078602,
        "type": "comment",
        "parent": 48323992
      },
      {
        "by": "i2talics",
        "id": 48326265,
        "kids": [
          48328121
        ],
        "text": "Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.<p>The problem is linking: a compiler needs to emit code into independent translation units, which contain &quot;missing&quot; references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don&#x27;t know where the location of other code is yet, we don&#x27;t know how big the number representing that location is yet, which means that we don&#x27;t know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!<p>The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers <i>can</i> be converted to a non-canonical 5 byte form that is &quot;wasteful&quot; but its a worthwhile tradeoff because it solves this issue. Other integers that don&#x27;t need to be linked can be packed in a smaller var int form to save space.",
        "time": 1780075199,
        "type": "comment",
        "parent": 48323992
      },
      {
        "by": "juancn",
        "id": 48329771,
        "text": "I like the denormalization of VLE ints (with or without zig-zag encoding of negatives), it helps support out of band information, such as nulls and other signals in serialization protocols with minimal overhead.<p>For example you can use a denormalized zero to signal null.<p>You can still define a canonical encoding where denormalizations have specific meaning or signal an error.",
        "time": 1780091356,
        "type": "comment",
        "parent": 48323992
      },
      {
        "by": "stebalien",
        "id": 48324781,
        "kids": [
          48325213
        ],
        "text": "I&#x27;ve used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).<p>The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you&#x27;re using these numbers as tags&#x2F;identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 &lt;= 2 byte numbers.<p>[1]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;multiformats&#x2F;multicodec\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;multiformats&#x2F;multicodec</a>",
        "time": 1780069817,
        "type": "comment",
        "parent": 48323992
      }
    ]
  },
  "selection_meta": {
    "discussion_depth": "top_comments_v1",
    "external_article": {
      "status": "ok",
      "final_url": "https://www.inkandswitch.com/tangents/bijou64/",
      "status_code": 200,
      "content_type": "text/html; charset=utf-8",
      "failure_reason": null
    },
    "snapshot_version": "hn_story_v3",
    "selected_comments_count": 5,
    "external_article_resolved": true,
    "text_normalization_applied": false
  },
  "created_at": "2026-05-29T22:01:21.110Z",
  "updated_at": "2026-05-29T22:01:21.110Z"
}