Huge Binaries (fzakaria.com)

216 points by todsacerdoti a month ago | 15 comments

yjftsjthsd-h a month ago
> I had observed binaries beyond 25GiB, including debug symbols. How is this possible? These companies prefer to statically build their services to speed up startup and simplify deployment. Statically including all code in some of the world’s largest codebases is a recipe for massive binaries.
I am very sympathetic to wanting nice static binaries that can be shipped around as a single artifact[0], but... surely at some point we have to ask if it's worth it? If nothing else, that feels like a little bit of a code smell; surely if your actual executable code doesn't even fit in 2GB it's time to ask if that's really one binary's worth of code or if you're actually staring at like... a dozen applications that deserve to be separate? Or get over it the other way and accept that sometimes the single artifact you ship is a tarball / OCI image / EROFS image for systemd[1] to mount+run / self-extracting archive[2] / ...
[0] Seriously, one of my background projects right now is trying to figure out if it's really that hard to make fat ELF binaries.
[1] https://systemd.io/PORTABLE_SERVICES/
[2] https://justine.lol/ape.html > "PKZIP Executables Make Pretty Good Containers"
- jmmv a month ago
  This is something that always bothered me while I was working at Google too: we had an amazing compute and storage infrastructure that kept getting crazier and crazier over the years (in terms of performance, scalability and redundancy) but everything in operations felt slow because of the massive size of binaries. Running a command line binary? Slow. Building a binary for deployment? Slow. Deploying a binary? Slow.
  The answer to an ever-increasing size of binaries was always "let's make the infrastructure scale up!" instead of "let's... not do this crazy thing maybe?". By the time I left, there were some new initiatives towards the latter and the feeling that "maybe we should have put limits much earlier" but retrofitting limits into the existing bloat was going to be exceedingly difficult.
  - joatmon-snoo a month ago
    There's a lot of tooling built on static binaries:
    - google-wide profiling: the core C++ team can collect data on how much of fleet CPU % is spent in absl::flat_hash_map re-bucketing (you can find papers on this publicly)
    - crashdump telemetry
    - dapper stack trace -> codesearch
    Borg literally had to pin the bash version because letting the bash version float caused bugs. I can't imagine how much harder debugging L7 proxy issues would be if I had to follow a .so rabbit hole.
    I can believe shrinking binary size would solve a lot of problems, and I can imagine ways to solve the .so versioning problem, but for every problem you mention I can name multiple other probable causes (eg was startup time really execvp time, or was it networked deps like FFs).
    MaskRay a month ago
    We are missing tooling to partition a huge binary into a few larger shared objects.
    As my post https://maskray.me/blog/2023-05-14-relocation-overflow-and-c... puts it (linked by the author, thanks! But I maintain lld/ELF rather than having "written" it; it's the engineering work of many folks).
    Quoting the relevant paragraphs below:
    ## Static linking
    In this section, we will deviate slightly from the main topic to discuss static linking. By including all dependencies within the executable itself, it can run without relying on external shared objects. This eliminates the potential risks associated with updating dependencies separately.
    Certain users prefer static linking or mostly static linking for the sake of deployment convenience and performance aspects:
    * Link-time optimization is more effective when all dependencies are known. Providing shared object information during executable optimization is possible, but it may not be a worthwhile engineering effort.
    * Profiling techniques are more efficient dealing with one single executable.
    * The traditional ELF dynamic linking approach incurs overhead to support [symbol interposition](https://maskray.me/blog/2021-05-16-elf-interposition-and-bsy...).
    * Dynamic linking involves PLT and GOT, which can introduce additional overhead. Static linking eliminates the overhead.
    * Loading libraries in the dynamic loader has a time complexity `O(|libs|^2*|libname|)`. The existing implementations are designed to handle tens of shared objects, rather than a thousand or more.
    Furthermore, the current lack of techniques to partition an executable into a few larger shared objects, as opposed to numerous smaller shared objects, exacerbates the overhead issue.
    In scenarios where the distributed program contains a significant amount of code (related: software bloat), employing full or mostly static linking can result in very large executable files. Consequently, certain relocations may be close to the distance limit, and even a minor disruption (e.g. add a function or introduce a dependency) can trigger relocation overflow linker errors.
    jcalvinowens a month ago
    > We are missing tooling to partition a huge binary into a few larger shared objects
    Those who do not understand dynamic linking are doomed to reinvent it.
    Filligree a month ago
    There’s no way my proxy binary actually requires 25GB of code, or even the 3GB it is. Sounds to me like the answer is a tree shaker.
    Sesse__ a month ago
    Google implemented the C++ equivalent of a tree shaker in their build system around 2009.
    setheron a month ago
    AFAIK, to be "fast" the front-end services probably include nearly all the services they need in order to avoid hops, so you can't really shake that much away.
  - lenkite a month ago
    Maybe I am missing something, but why didn't they just leverage dynamic libraries?
    btilly a month ago
    When I was at Google, on an SRE team, here is the explanation that I was given.
    Early on Google used dynamic libraries. But weird things happen at Google scale. For example Google has a dataset known, for fairly obvious reasons, as "the web". Basically any interesting computation with it takes years. Enough to be a multiple of the expected lifespan of a random computer. Therefore during that computation, you have to expect every random thing that tends to go wrong, to go wrong. Up to and including machines dying.
    One of the weird things that become common at Google scale is cosmic bit flips. With static binaries, you can figure out that something went wrong, kill the instance, launch a new one, and you're fine. That machine will later launch something else and also be fine.
    But what happens if there was a cosmic bit flip in a dynamic library? Everything launched on that machine will be wrong. This has to get detected, then the processes killed and relaunched. Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...wind up broken for the same reason! Often the killed process will relaunch on the bad machine, failing again! This will continue until someone reboots the machine.
    Static binaries are wasteful. But they aren't as problematic for the infrastructure as detecting and fixing this particular condition. And, according to SRE lore circa 2010, this was the actual reason for the switch to static binaries. And then they realized all sorts of other benefits. Like having a good upgrade path for what would normally be shared libraries.
    ambrosio a month ago
    > But what happens if there was a cosmic bit flip in a dynamic library?
    I think there were more basic reasons we didn't ship shared libraries to production.
    1. They wouldn't have been "shared", because every program was built from its own snapshot of the monorepo, and would naturally have slightly different library versions. Nobody worried about ABI compatibility when evolving C++ interfaces, so (in general) it wasn't possible to reuse a .so built at another time. Thus, it wouldn't actually save any disk space or memory to use dynamic linking.
    2. When I arrived in 2005, the build system was embedding absolute paths to shared libraries into the final executable. So it wasn't possible to take a dynamically linked program, copy it to a different machine, and execute it there, unless you used a chroot or container. (And at that time we didn't even use mount namespaces on prod machines.) This was one of the things we had to fix to make it possible to run tests on Forge.
    3. We did use shared libraries for tests, and this revealed that ld.so's algorithm for symbol resolution was quadratic in the number of shared objects. Andrew Chatham fixed some of this (https://sourceware.org/legacy-ml/libc-alpha/2006-01/msg00018...), and I got the rest of it eventually; but there was a time before GRTE, when we didn't have a straightforward way to patch the glibc in prod.
    That said, I did hear a similar story from an SRE about fear of bitflips being the reason they wouldn't put the gws command line into a flagfile. So I can imagine it being a rationale for not even trying to fix the above problems in order to enable dynamic linking.
    > Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...wind up broken for the same reason!
    I did see this failure mode occur for similar reasons, such as corruption of the symlinks in /lib. (google3 executables were typically not totally static, but still linked libc itself dynamically.) But it always seemed to me that we had way more problems attributable to kernel, firmware, and CPU bugs than to SEUs.
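The quadratic symbol-resolution cost mentioned in point 3 above can be illustrated with a toy model. This is only a sketch under stated assumptions, not glibc's actual algorithm: the function name `count_probes`, the one-symbol-per-library setup, and the worst-case reference pattern are all made up for illustration. Every library references a symbol that lives in the last library of the search order, and each lookup walks the search order from the front.

```python
def count_probes(num_libs: int) -> int:
    """Count library probes in a toy front-to-back symbol lookup."""
    # Assumption: each library exports exactly one symbol, named after its index.
    libs = [{f"sym{i}"} for i in range(num_libs)]
    # Assumed worst case: the wanted symbol is defined by the last library.
    wanted = f"sym{num_libs - 1}"
    probes = 0
    for _ in range(num_libs):  # one unresolved reference per library
        for exported in libs:  # walk the global search order
            probes += 1
            if wanted in exported:
                break
    return probes


for n in (10, 100, 1000):
    print(n, count_probes(n))
```

In this model the probe count is exactly `num_libs ** 2`, which is why a design tuned for tens of shared objects can struggle with a thousand or more.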
    btilly a month ago
    Thanks. It is nice to hear another perspective on this.
    But here is a question. How much of SEUs not being problems were because they weren't problems? Versus because there were solutions in place to mitigate the potential severity of that kind of problem? (The other problems that you name are harder to mitigate.)
    ambrosio a month ago
    Memory and disk corruption definitely were a problem in the early days. See https://news.ycombinator.com/item?id=14206811 for example. I also recall an anecdote about how the search index basically became unbuildable beyond a certain size due to the probability of corruption, which was what inspired RecordIO. I think ECC RAM and transport checksums largely fixed those problems.
    It's pretty challenging for software to defend against SEUs corrupting memory, especially when retrofitting an existing design like Linux. While operating Forge, we saw plenty of machines miscompute stuff, and we definitely worried about garbage getting into our caches. But my recollection is that the main cause was individual bad CPUs. We would reuse files in tmpfs for days without reverifying their checksums, and while we considered adding a scrubber, we never saw evidence that it would have caught much.
    Maybe the CPU failures were actually due to radiation damage, but as they tended to be fairly sticky, my guess is something more like electromigration.
    kridsdale1 a month ago
    As a developer depending on the infrastructure and systems you guys make reliable every day inside Google, Bless You. Truly.
    When Forge has a problem, I might as well go on a nature hike.
    dh2022 a month ago
    In Azure - which I think is at Google scale - everything is dynamically linked. Actually a lot of Azure is built on C# which does not even support static linking...
    Statically linking being necessary for scaling does not pass the smell test for me.
    mbreese a month ago
    I never worked for Google, but have seen some strange things like bit flips at more modest scales. From the parent description, it looks like defaulting to static binaries is helping to speed up troubleshooting to remove the “this should never happen, but statistically will happen every so often” class of bugs.
    As I see it, the issue isn’t requiring static compiling to scale. It’s requiring it to make troubleshooting or measuring performance at scale easier. Not required, per se, but very helpful.
    btilly a month ago
    Exactly. SRE is about monitoring and troubleshooting at scale.
    Google runs on a microservices architecture. It's done that since before that was cool. You have to do a lot to make a microservices architecture work. Google did not advertise a lot of that. Today we have things like Datadog that give you some of the basics. But for a long time, people who left Google faced a world of pain because of how far behind the rest of the world was.
    btilly a month ago
    Azure's devops record is not nearly as good as Google's was.
    The biggest datasets that ChatGPT is aware of being processed in complex analytics jobs on Azure are roughly a thousand times smaller than an estimate of Google's regularly processed snapshot of the web. There is a reason why most of the fundamental advancements in how to parallelize data and computations - such as map-reduce and BigTable - all came from Google. Nobody else worked at their scale before they did. (Then Google published it, and people began to implement it. Then failed to understand what was operationally important to making it actually work at scale...)
    So, despite how big it is, I don't think that Azure operates at Google scale.
    For the record, back when I worked at Google, the public internet was only the third largest network that I knew of. Larger still was the network that Google uses for internal API calls. (Do you have any idea how many API calls it takes to serve a Google search page?) And larger still was the network that kept data synchronized between data centers. (So, for example, you don't lose your mail if a data center goes down.)
    arccy a month ago
    perhaps that's why azure has such a bad reputation in the devops crowd.
    dh2022 a month ago
    Does AWS have a good reputation in devops? Because large chunks of AWS are built on Java - which also does not offer static linking (bundling a bunch of *.jar files into one exe does not count as static linking). Still does not pass the smell test.
    arccy a month ago
    In AWS, only the very core Infra-as-a-Service that they dogfood can be considered "good". Everything else that's more Platform-as-a-Service can be considered a half-baked, leaky abstraction. Anything they release as "GA", especially around re:Invent, should be avoided for a minimum of 6 months to 1 year, since it's more like a public beta with some guaranteed bugs.
    dh2022 a month ago
    In AWS, only the very core Infra-as-a-Service that they dogfood can be considered "good" - large chunks of which are, by the way, written in Java. I think you are proving my point...
    arccy a month ago
    which just means Java isn't affected? Or your definition, which doesn't count bundled, non-shared jars as static linking, is wrong, since they achieve the same effect.
    selkin a month ago
    > But what happens if there was a cosmic bit flip in a dynamic library?
    You'd need multiple of those, because you have ECC. Not impossible, but getting all those dice rolled the same way requires even bigger scale than Google's.
    cozzyd a month ago
    Sounds like Google should put their computers at Homestake
    tmoertel a month ago
    One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely. In environments like Google’s it's important to know that what you have deployed to production is exactly what you think it is.
    See for more: https://google.github.io/building-secure-and-reliable-system...
    inkyoto a month ago
    > One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely.
    It depends.
    If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.
    One solution could be bundling the binary or related multiple binaries with the operating system image but that would incur a multidimensional overhead that would be unacceptable for most people and then we would be talking about «an application binary statically linked into the operating system» so to speak.
    tmoertel a month ago
    > If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.
    The whole point of Binary Provenance is that there are no unaccounted-for artifacts: Every build should produce binary provenance describing exactly how a given binary artifact was built: the inputs, the transformation, and the entity that performed the build. So, to use your example, you'll always know which artefacts were linked against that bad version of libc.
    See https://google.github.io/building-secure-and-reliable-system...
    inkyoto a month ago
    I am well aware of and understand that.
    However,
    > […] which artefacts were linked against that bad version of libc.
    There is one libc for the entire system (a physical server, a virtual one, etc.), including the application(s) that have/have been deployed into an operating environment.
    In the case of the entire operating environment (the OS + applications) being statically linked against a libc, the entire operating environment has to be re-linked and redeployed as a single concerted effort.
    In dynamically linked operating environments, only the libc needs to be updated.
    The former is a substantially more laborious and inherently more risky effort unless the organisation has achieved a sufficiently large scale where such deployment artefacts are fully disposable and the deployment process is fully automated. Not many organisations practically operate at that level of maturity and scale, with FAANG or similar scale being a notable exception. It is often cited as an aspiration, yet the road to that level of maturity is windy and is fraught with many shortcuts in real life which result in the binary provenance being ignored or rendering it irrelevant. The expected aftermath is, of course, a security incident.
    tmoertel a month ago
    What is the point you're trying to make?
    I claimed that Binary Provenance was important to organizations such as Google where it is important to know exactly what has gone into the artefacts that have been deployed into production. You then replied "it depends" but, when pressed, defended your claim by saying, in effect, that binary provenance doesn't work in organizations that have immature engineering practices where they don't actually follow the practice of enforcing Binary Provenance.
    But I feel like we already knew that practices don't work unless organizations actually follow them.
    So what was your point?
    inkyoto a month ago
    My point is that static linking alone and by itself does not meaningfully improve binary provenance and is mostly expensive security theatre from a provenance standpoint due to a statically linked binary being more opaque from a component attribution perspective – unless an inseparable SBOM (which is cryptographically tied to the binary), plus signed build attestations are present.
    Static linking actually destroys the boundaries that a provenance consumer would normally want due to erasure of the dependency identities rendering them irrecoverable in a trustworthy way from the binary by way of global code optimisation, inlining (sometimes heavy), LTO, dead code elimination and alike. It is harder to reason about and audit a single opaque blob than a set of separately versioned shared libraries.
    Static linking, however, is very good at avoiding «shared/dynamic library dependency hell» which is a reliability and operability win. From a binary provenance standpoint, it is largely orthogonal.
    Static linking can improve one narrow provenance-adjacent property: fewer moving parts at deploy and run time.
    The «it depends» part of the comment concerned the FAANG-scale level of infrastructure and operational maturity where the organisation can reliably enforce hermetic builds and dependency pinning across teams, produce and retain attestations and SBOMs bound to release artefacts, rebuild the world quickly on demand and roll out safely with strong observability and rollback. Many organisations choose dynamic linking plus image sealing because it gives them similar provenance and incident response properties with less rebuild pressure at a substantially smaller cost.
    So static linking mainly changes operational risk and deployment ergonomics, not evidentiary quality about where the code came from and how it was produced, whereas dynamic linking, on the other hand, may yield better provenance properties when the shared libraries themselves have strong identity and distribution provenance.
    NB Please do note that the diatribe is not directed at you in any way; it is an off-hand remark about people who ascribe purported benefits to static linking because «Google does it», without taking into account the overall context, maturity and scale of the operating environment that Google et al. operate at.
  - bfrog a month ago
    Sounds like Google could really use Nix
  - darubedarob a month ago
    I think Google of all companies could build a good autostripper, reducing binaries by adding partial load assembly on misses. It can't be much slower than shovelling a full monorepo assembly plus symbols into RAM.
    loeg a month ago
    The low-hanging fruit is just not shipping the debuginfo, of course.
    usefulcat a month ago
    Is compressed debug info a thing? It seems likely to compress well, and if it's rarely used then it might be a worthwhile thing to do?
    loeg a month ago
    It is: https://maskray.me/blog/2022-01-23-compressed-debug-sections
    But the compression ratio isn't magical (approx. 1:0.25, for both zlib and zstd in the examples given). You'd probably still want to set aside debuginfo in separate files.
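As a rough back-of-the-envelope check of what that ratio buys, here is a small sketch. The function name `estimate_size_gib` is hypothetical, and the inputs are assumptions taken from figures quoted in this thread: a 25 GiB binary, roughly 95% of it debuginfo, and debug sections compressing to about a quarter of their size.

```python
def estimate_size_gib(total_gib: float, debug_fraction: float, ratio: float) -> float:
    """Estimate binary size after compressing only the debug sections."""
    debug = total_gib * debug_fraction  # bytes spent on debuginfo
    rest = total_gib - debug            # code, data, everything else
    return rest + debug * ratio         # only debuginfo shrinks


# Assumed inputs: 25 GiB total, 95% debuginfo, compressed to 25% of original.
size = estimate_size_gib(25, 0.95, 0.25)
print(f"{size:.1f} GiB")
```

Under these assumptions the file is still several GiB, which is consistent with the suggestion to keep debuginfo in separate files rather than rely on compression alone.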
    Gibbon1 a month ago
    Small brained primate comment.
    With embedded firmware you only flash the .text and the like to the device. But you still can debug using the .elf file. In my case if I get a bus fault I'll pull the offending address off the stack and use bintools and the .elf to show me who was naughty. I think if you have a crash dump you should be able to make sense of things as long as you keep the unstripped .elf file around.
- shevy-java a month ago
  > https://systemd.io/PORTABLE_SERVICES/
  Systemd and portable?
  - yjftsjthsd-h a month ago
    Portable across systemd/Linux systems, yes:)
- jcelerier a month ago
  What's wild to me is not using -gsplit-dwarf to have separate debug info and "normal-sized" binaries
  - jeffbee a month ago
    Google contributed the code, and the entire concept, of DWARF fission to both GCC and LLVM. This suggests that rather than overlooking something obvious that they'll be embarrassed to learn on HN, they were aware of the issues and were using the solutions before you'd even heard of them.
    sionisrecur a month ago
    A case of the left hand not knowing what the right hand is doing?
    jeffbee a month ago
    There's no contradiction, no missing link in the facts of the story. They have a huge program, it is 2GiB minus epsilon of .text, and a much larger amount of DWARF stuff. The article is about how to use different code models to potentially go beyond 2GiB of text, and the size of the DWARF sections is irrelevant trivia.
    jcelerier a month ago
    > They have a huge program, it is 2GiB minus epsilon of .text,
    but the article says 25+GiB including debug symbols, in a single binary?
    also, I appreciate your enthusiasm in assuming that because some people do something in an organization, it is applied consistently everywhere. Hell, if it were Microsoft, other departments would try to shoot down the "debug tooling optimization" dept
    loeg a month ago
    Yes, the 25GB figure in the article is basically irrelevant to the 2GB .text section concern. Most ELF files that size are 95%+ debuginfo.
    jcelerier a month ago
    yes and that's what I'm saying, I find it crazy to not split the debug info out. At least on my machine it really makes a noticeable difference of load time if I load a binary which is ~2GB with debug info in or the same binary which is ~100MB with debug info out.
    Mic92 a month ago
    Doesn't make any difference in practice. The debug info is never mapped into memory by the loader. This only matters if you want to store the two separately, i.e. lazy-load debug symbols if needed.
    jcelerier a month ago
    this is just not true. I just tried with one of my binaries which is 3.2G unstripped, and 150MB-ish stripped. Unstripped takes 23 seconds until the window shows up, stripped takes ~a second
    jeffbee a month ago
    There is something wacky going on with your system, or the program is written in a way that makes it traverse the debug info if it is present. What program is it?
    For example I can imagine desktop operating system antivirus/integrity checks having this effect.
    jeffbee a month ago
    ELF is just a container format and you can put literally anything into one of its sections. Whether the DWARF sections are in "the binary" or in another named file is really quite beside the point.
- forrestthewoods a month ago
  If you have 25gb of executables then I don’t think it matters if that’s one binary executable or a hundred. Something has gone horribly horribly wrong.
  I don’t think I’ve ever seen a 4gb binary yet. I have seen instances where a PDB file hit 4gb and that caused problems. Debug symbols getting that large is totally plausible. I’m ok with that at least.
  - niutech a month ago
    Llamafile (https://llamafile.ai) can easily exceed 4GB due to containing LLM weights inside. But remember, you cannot run >4GB executable files on Windows.
  - wolfi1 a month ago
    I did: it was a Spring Boot fat jar with an NLP. I had to deploy it to the biggest instance AWS could offer, and the costs were enormous.
    selkin a month ago
    Java bytecode is always dynamically linked.
    wolfi1 a month ago
    still, if I remember correctly I had to reserve 6gig of memory so that the jvm could actually start
  - loeg a month ago
    If you haven't seen a 25GB binary with debuginfo, you just aren't working in large, templated, C++ codebases. It's nothing special there.
    forrestthewoods a month ago
    Not quite. I very much work in large, templated, C++ codebases. But I do so on Windows, where the symbols are in a separate file the way the lord intended.
  - throwawaymobule a month ago
    A few ps3 games I've seen had 4GB or more binaries.
    This was a problem because code signing meant it needed to be completely replaced by updates.
    swiftcoder a month ago
    > A few ps3 games I've seen had 4GB or more binaries.
    Is this because they are embedding assets into the binary? I find it hard to believe anyone was carrying around enough code to fill 4GB in the PS3 era...
    throwawaymobule a month ago
    I assume so, there were rarely any other files on the disc in this case.
    It varied between games, one of the battlefields (3 or bad company 2) was what I was thinking of. It generally improved with later releases.
    The 4GB file size was significant, since it meant I couldn't run them from a backup on a fat32 usb drive. There are workarounds for many games nowadays.
10000truths a month ago
Debug symbol size shouldn't be influencing relocation jump distances - debug info has its own ELF section.
Regardless of whether you're FAANG or not, nothing you're running should require an executable with a 2 GB large .text section. If you're bumping into that limit, then your build process likely lacks dead code elimination in the linking step. You should be using LTO for release builds. Even the traditional solution (compile your object files with -ffunction-sections and link with --gc-sections) does a good job of culling dead code at function-level granularity.
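Conceptually, the `-ffunction-sections` plus `--gc-sections` approach above amounts to a reachability walk over sections. The following is a minimal sketch of that idea only, not how any particular linker implements it; the function name `live_sections`, the section names, and the reference graph are all hypothetical.

```python
from collections import deque


def live_sections(references: dict, roots: list) -> set:
    """Return every section reachable from the root sections."""
    live = set(roots)
    queue = deque(roots)
    while queue:
        section = queue.popleft()
        for target in references.get(section, ()):
            if target not in live:
                live.add(target)
                queue.append(target)
    return live


# Hypothetical input: one section per function, with the sections each one references.
references = {
    ".text.main": [".text.parse", ".text.serve"],
    ".text.parse": [".text.log"],
    ".text.serve": [".text.log"],
    ".text.log": [],
    ".text.unused_helper": [".text.log"],
}
kept = live_sections(references, [".text.main"])
dropped = sorted(set(references) - kept)
print(dropped)  # ['.text.unused_helper']
```

Anything that cannot be reached from the entry point is discarded, which is why the granularity of sections (one per function rather than one per object file) matters so much.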
- saagarjha a month ago
  Google Chrome ships as a 500 MB binary on my machine, so if you're embedding a web browser, that's how much you need minimum. Now tack on whatever else your application needs and it's easy to see how you can go past 2 GB if you're not careful. (To be clear, I am not making a moral judgment here, I am just saying it's possible to do. Whether it should happen is a different question.)
  - throwawaymobule a month ago
    Do you have some special setup?
    Chromium is in the hundred and something MB range on mine last I looked. Might expand to more on install.
    saagarjha a month ago
    I just checked Google Chrome Framework on my Mac, it was a little over 400 MB. Although now that I think about it it's probably a universal binary so you can cut that in half?
    trevor-e a month ago
    Yea looks like Chrome ships a universal binary with both x86_64 and arm64.
    sznio a month ago
    makes sense, chromium on my Fedora system takes up 234MB.
- yablak a month ago
  FAANGs were deeply involved in designing LTO. See, e.g.,
  https://research.google/pubs/thinlto-scalable-and-incrementa...
  And other refs.
  And yet...
  - jeffbee a month ago
    Google also uses identical code folding. It's a pretty silly idea that a shop that big doesn't know about the compiler flags.
    Orphis a month ago
    Google is made of many thousands of individuals. Some experts will be aware of all those, some won't. In my team, many didn't know about those details as they were handled by other build teams for specific products or entire domains at once.
    But since each product in some different domains had to actively enable those optimizations for themselves, they were occasionally forgotten, and I found a few in the app I worked for (but not directly on).
    jeffbee a month ago
    ICF seems like a good one to keep in the box of flags people don't know about because like everything in life it's a tradeoff and keeping that one problematic artifact under 2GiB is pretty much the only non-debatable use case for it.
yablak a month ago
> We would like to keep our small code-model. What other strategies can we pursue?
Move all the hot BBs near each other, right?
Facebook's solution: https://github.com/llvm/llvm-project/blob/main/bolt%2FREADME...
Google's:
https://lists.llvm.org/pipermail/llvm-dev/2019-September/135...
- setheron a month ago
  but for x86_64, as of right now, if even a single call needs more than 31 bits you have to upgrade the whole code section to the large code model.
  BOLT AFAIU is more about cache locality of putting hot code near each other and not really breaking the 2GiB barrier.
  - jeffbee a month ago
    Why? Can't the linker or post-link optimizer reduce all near calls, leaving the more complicated mov with immediate form only where required?
    mananaysiempre a month ago
    Once the compiler has generated a 32-bit relative jump with an R_X86_64_PLT32 relocation, it’s too late. (A bit surprising for it to be a PLT relocation, but it does make some sense upon reflection, and the linker turns it into a direct call if you’re statically linking.) I think only RISC-V was brave enough to allow potentially size-changing linker relaxation, and incidentally they screwed it up (the bug tracker says “too late to change”, which brings me great sadness given we’re talking about a new platform).
    On x86-64 it would probably be easier to point the relative call to a synthesized trampoline that does a 64-bit one, but it seems nobody has bothered thus far. You have to admit that sounds pretty painful.
stnclsa month ago
> The simplest solution however is to use -mcmodel=large which changes all the relative CALL instructions to absolute JMP.
Makes sense, but in the assembly output just after, there is not a single JMP instruction. Instead, CALL <immediate> is replaced with putting the address in a 64-bit register, then CALL <register>, which makes even more sense. But why mention the JMP thing then? Is it a mistake or am I missing something? (I know some calls are replaced by JMP, but that's done regardless of -mcmodel=large)
- dwattttta month ago
  I would assume loose language, referring to a CALL as a JMP. However, of the two reasons given to dislike the large code model, register pressure isn't relevant to that particular snippet.
  It's performing a call, ABIs define registers that are not preserved over calls; writing the destination to one of those won't affect register pressure.
- loega month ago
  I think the author is just noting that the construction is similar to an 8-byte JMP instruction. The text now reads:
  > The simplest solution however is to use -mcmodel=large which changes all the relative CALL instructions to absolute 64bit ones; kind of like a JMP.
  (We still need to use CALL in order to push a return address.)
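A rough sketch of the two call shapes being discussed, assuming the usual x86-64 encodings: `E8` plus a rel32 displacement for the near call, and `movabs rax, imm64` followed by `call rax` for the far form. The helper name `encode_call`, the addresses, and the choice of `rax` as the scratch register are made up for illustration; a real toolchain does this through relocations and picks its own register.

```python
import struct

REL32_MIN = -(1 << 31)
REL32_MAX = (1 << 31) - 1

def encode_call(site: int, target: int) -> bytes:
    """Encode a call placed at address `site` that goes to `target` (hypothetical helper)."""
    # rel32 is measured from the end of the 5-byte CALL instruction.
    disp = target - (site + 5)
    if REL32_MIN <= disp <= REL32_MAX:
        # E8 rel32: near relative call, 5 bytes.
        return b"\xE8" + struct.pack("<i", disp)
    # 48 B8 imm64 = movabs rax, target; FF D0 = call rax. 12 bytes in total.
    return b"\x48\xB8" + struct.pack("<Q", target) + b"\xFF\xD0"

# Illustrative addresses only.
near = encode_call(0x401000, 0x402000)
far = encode_call(0x401000, 0x401000 + (3 << 30))  # target 3 GiB away

print(len(near), near.hex())
print(len(far), far.hex())
```

Both forms are still CALLs, so the return address gets pushed either way; the far form just costs more bytes and a scratch register. With -mcmodel=large the compiler emits the long form for every call, not only the ones that need it.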
meisela month ago
> Responses to my publication submissions often claimed such problems did not exist
I see this often even in communities of software engineers, where people who are unaware of certain limitations at scale will announce that the research is unnecessary.
loega month ago
Sure! But there's a sleight of hand in the numbers here where we're talking about 25GB binaries with debuginfo and then 2GB maximum offsets in the .text section. Of those 25GB binaries, probably 24.5 of them are debuginfo. You have to get into truly huge binaries before >2GB calls become an issue.
(I wonder, but have no particular insight into, whether LTO builds can do smarter things here: most calls are local, but the handful of far calls can use the more expensive spelling.)
- benlivengooda month ago
  At Google I worked with one statistics aggregation binary[0] that was ~25GB stripped. The distributed build system wouldn't even build the debug version because it exceeded the maximum configured size for any object file. I never asked if anyone had tried factoring it into separate pipelines, but my intuition is that the extra processing overhead wouldn't have been worth splitting the business logic that way; once the exact set of necessary input logs is in memory you might as well do everything you need to them given the dramatically larger ratio of data size to code size.
  [0] https://research.google/pubs/ubiq-a-scalable-and-fault-toler...
- inkyotoa month ago
  > […] 2GB maximum offsets in the .text section
  … on the x86 ISA because it encodes the 32-bit jump/call offset directly in the instruction, right after the opcode.
  Whilst most RISC architectures do allow PC-relative branches, the offset is relatively small as 32-bit instruction words do not have enough room to squeeze a large offset in.
  «Long» jumps and calls are indirect branches / calls done via registers where the entirety of 64 bits is available (address alignment rules apply in RISC architectures). The target address has to be loaded / calculated beforehand, though. Available in RISC and x86 64-bit architectures.
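For a sense of scale, a small sketch of the direct-branch reach implied by the immediate width. The three encodings listed (x86-64 rel32, AArch64 B/BL with a 26-bit immediate scaled by 4, RISC-V JAL with a 20-bit immediate scaled by 2) are the ones I'm confident about; the helper name `reach_bytes` is made up, and the figure is the backward limit (the forward limit is one unit shorter).

```python
def reach_bytes(imm_bits: int, scale: int) -> int:
    """Backward reach of a signed immediate of `imm_bits` bits counted in units of `scale` bytes."""
    return (1 << (imm_bits - 1)) * scale

branches = {
    "x86-64 CALL/JMP rel32": reach_bytes(32, 1),
    "AArch64 B/BL imm26": reach_bytes(26, 4),
    "RISC-V JAL imm20": reach_bytes(20, 2),
}

for name, reach in branches.items():
    print(f"{name}: +/-{reach // (1 << 20)} MiB")
```

So x86-64 is actually the generous one here; the fixed-width ISAs hit their direct-branch limit far earlier and rely on linker-inserted thunks or register-indirect branches beyond it.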
doubletwoyoua month ago
25 GiB for a single binary sounds horrifying
at some point surely some dynamic linking is warranted
- nneonneoa month ago
  To be fair, this is with debug symbols. Debug builds of Chrome were in the 5GB range several years ago; no doubt that’s increased since then. I can remember my poor laptop literally running out of RAM during the linking phase due to the sheer size of the object files being linked.
  Why are debug symbols so big? For C++, they’ll include detailed type information for every instantiation of every type everywhere in your program, including the types of every field (recursively), method signatures, etc. etc., along with the types and locations of local variables in every method (updated on every spill and move), line number data, etc. etc. for every specialization of every function. This produces a lot of data even for “moderate”-sized projects.
  Worse: for C++, you don’t win much through dynamic linking because dynamically linking C++ libraries sucks so hard. Templates defined in header files can’t easily be put in shared libraries; ABI variations mean that dynamic libraries generally have to be updated in sync; and duplication across modules is bound to happen (thanks to inlined functions and templates). A single “stuck” or outdated .so might completely break a deployment too, which is a much worse situation than deploying a single binary (either you get a new version or an old one, not a broken service).
  - 01HNNWZ0MV43FFa month ago
    I've hit the same thing in Rust, probably for the same reasons.
    Isn't the simple solution to use detached debug files?
    I think Windows and Linux both support them. That's how mobile platforms like Android and iOS get useful crash reports out of small binaries: they just upload the stack trace and some service like Sentry translates that back into source line numbers. (It's easy to do manually too.)
    I'm surprised the author didn't mention it first. A 25 GB exe might be 1 GB of code and 24 GB of debug crud.
    nicoburnsa month ago
    > Isn't the simple solution to use detached debug files?
    It should be. But the tooling for this kind of thing (anything to do with executable formats including debug info and also things like linking and cross-compilation) is generally pretty bad.
    dwattttta month ago
    > I think Windows and Linux both support them.
    Detached debug files have been the default (only?) option in MS's compiler since at least the 90s.
    I'm not sure at what point it became hip to do that around Linux.
    kvemkona month ago
    Since at least October 2003 on Debian:
    [1] "debhelper: support for split debugging symbols"
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=215670
    [2] https://salsa.debian.org/debian/debhelper/-/commit/79411de84...
  - yjftsjthsd-ha month ago
    Can't debug symbols be shipped as separate files?
    bregmaa month ago
    The problem is that when a final binary is linked everything goes into it. Then, after the link step, all the debug information gets stripped out into the separate symbols file. That means at some point during the build the target binary file will contain everything. I can not, for example, build clang in debug mode on my work machine because I have only 32 GB of memory and the OOM killer comes out during the final link phase.
    Of course, separate debug files make no difference at runtime since only the LOAD segments get loaded (by either the kernel or the dynamic loader, depending). The size of a binary on disk has little to do with the size of a binary in memory.
    jceleriera month ago
    > The problem is that when a final binary is linked everything goes into it
    I don't think that's the case on Linux: when using -gsplit-dwarf, the debug info is put in separate files at the object file level and is never linked into binaries.
    yablaka month ago
    Yes, but it can be more of a pain keeping track of pairs. In production though, this is what's done. And given a fault, the debug binary can be found in a database and used to gdb the issue given the core. You do have to limit certain online optimizations in order to have useful tracebacks.
    This also requires careful tracking of prod builds and their symbol files... A kind of symbol db.
    loega month ago
    Yes, absolutely. Debuginfo doesn't impact .text section distances either way, though.
  - tempaya month ago
    I’ve seen LLVM-dependent builds hit well over 30GB. At that point it started breaking several package managers.
- 0xbadcafebeea month ago
  To be fair, they worked at Google, their engineering decisions are not normal. They might just decide that 25 GiB binaries are worth a 0.25% speedup at start time, potentially resulting in tens of millions of dollars' worth of difference. Nobody should do things the way Google does, but it's interesting to think about.
- flohofwoea month ago
  The overall size wouldn't get smaller just because it is dynamically linked; on the contrary, it would grow, because DLLs are a dead code elimination barrier. 25 GB is insane either way, something must have gone horribly wrong very early in the development process (also, why even ship with debug information included? That doesn't make sense in the first place).
- dilyevskya month ago
  Won't make a bit of difference because everything is in a sort of container (not Docker) anyway. Unless you're suggesting those libraries be distributed as a base image to every possible Borg machine your app can run on, which is an obvious non-starter.
MaskRaya month ago
Note, sections without the SHF_ALLOC flag, such as `.debug_*` sections, do not contribute to the relocation distance pressure. Many 10+GiB binaries (likely due to not using split DWARF) might have much smaller code+data and not be even close to the limit.
However, Google, Meta, and ByteDance have encountered x86-64 relocation distance issues with their huge C++ server binaries. To my knowledge industry users in other domains haven't run into this problem.
To address this, Google adopted the medium code model approximately two years ago for its sanitizer and PGO instrumentation builds. CUDA fat binaries also caused problems. I suggest that linker script `INSERT BEFORE/AFTER` for orphan sections (https://reviews.llvm.org/D74375 ) served as a key mitigation.
I hope that a range extension thunk ABI, similar to AArch64/Power, is defined for the x86-64 psABI. It is better than the current long branch pessimization we have with -mcmodel=large.
---
It seems that nobody has run into this .eh_frame_hdr implementation limitation yet:
* `.eh_frame_hdr -> .text`: GNU ld and ld.lld only support 32-bit offsets (`table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4;`) as of Dec 2025.
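To make the SHF_ALLOC point concrete, a minimal sketch that splits a section list into "counts towards relocation distance" and "doesn't". The flag values are the standard ELF ones; the section sizes are made-up numbers chosen to resemble a ~25GB binary, not measurements of any real one.

```python
# Standard ELF section flag bits.
SHF_WRITE = 0x1
SHF_ALLOC = 0x2       # section occupies memory at run time
SHF_EXECINSTR = 0x4
SHF_MERGE = 0x10
SHF_STRINGS = 0x20

MIB = 1 << 20

# (name, sh_flags, sh_size): sizes are invented for illustration.
sections = [
    (".text", SHF_ALLOC | SHF_EXECINSTR, 1500 * MIB),
    (".rodata", SHF_ALLOC, 300 * MIB),
    (".data", SHF_WRITE | SHF_ALLOC, 100 * MIB),
    (".debug_info", 0, 18000 * MIB),
    (".debug_str", SHF_MERGE | SHF_STRINGS, 5000 * MIB),
]

total = sum(size for _, _, size in sections)
# Only SHF_ALLOC sections get addresses, so only they stretch call/jump distances.
allocated = sum(size for _, flags, size in sections if flags & SHF_ALLOC)

print(f"on disk:   {total // MIB} MiB")
print(f"allocated: {allocated // MIB} MiB")
print("fits in 2 GiB:", allocated < (1 << 31))
```

On a real binary the same flags and sizes show up in the section header listing from `readelf -S`, where an `A` in the flags column marks SHF_ALLOC.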
wyldfirea month ago
> What other strategies can we pursue?
You can use thunks/trampolines. lld can make them for some architectures, presumably also for x86_64. Though I don't know why it didn't in your case.
But, like the large code model, it can be expensive to add trampolines, both in icache performance and in plain execution cost if a trampoline is on a particularly hot path.
- setherona month ago
  In many ways that is what the PLT is also.
  This is what my next post will explore. I ran into some issues with the GOT that I'll have to explore solutions for.
  I'm writing this for myself mostly. The whole idea for code models when you have thunks feels unnecessary.
- setherona month ago
  follow-up: https://fzakaria.com/2025/12/29/huge-binaries-i-thunk-theref...
  - wyldfirea month ago
    > With this information, the necessity of code-models feels unecessary [sic]. Why trigger the cost for every callsite when we can do-so piecemeal as necessary with the opportunity to use profiles to guide us on which methods to migrate to thunks.
    Does the linker have access to the same hotness information that the compiler uses during PGO? Well -- presumably it could, even if it doesn't now. But it would be like a heuristic with a hotness threshold? Do linkers "do" heuristics?
shevy-javaa month ago
25GB seems excessive, but I keep on having the basic compile toolchain as statically compiled executables. It simply works better when things go awry.
a_t48a month ago
I've seen terrible, terrible binary sizes with Eigen + debug symbols, due to how Eigen lazy evaluation works (I think). Every math expression ends up as a new template instantiation.
- forrestthewoodsa month ago
  Eigen is one of the worst libraries when it comes to both exe size and compile times. <shudder>
  - a_t48a month ago
    In terms of compile times, boost geometry is somehow worse. You're encouraged to import boost/geometry.hpp, which includes every module, which stalls compile times by several seconds just to parse all the templates. It's not terrible if you include just the headers you need, but that's not the "default" that most people use.
    forrestthewoodsa month ago
    boost is on my “do not ever use ever oh my god what are you doing stop it” list. It’s so bad.
    a_t48a month ago
    Same.
nicebytea month ago
shameless plug: if you want to understand the content of this post better, first read the first half of my article on jumps [1] (up to syscall). goes into detail about relocations and position-independent code.
[1] https://gpfault.net/posts/asm-tut-4.html
reactordeva month ago
Oh man, that first paragraph. “Such problems don’t exist…” what a gaslighting response to a publication submission. The least they could do is ask where this problem emerges, and you can hand-wave your answer without revealing business IP.
Also, we, as an industry of software engineers, need to re-examine these hard limits we thought could never be reached. Such as the .text limits.
Anyway, very good read.
geriksona month ago
The HN de-sensationalize algo for submission titles needs tweaking. Original title is simply "Huge Binaries".
- acosmisma month ago
  agreed. Binaries is a bit too sensational for my taste. this can be further optimized.
  - fuzzfactora month ago
    "Files So Big They Might As Well Be Trinaries".
  - binaryturtlea month ago
    "Bins"? :)
    bayindirha month ago
    01.
    Why not?
    DHRicoFa month ago
    False