Past discussions:
https://news.ycombinator.com/item?id=33492122
https://news.ycombinator.com/item?id=14906429
https://news.ycombinator.com/item?id=12147703
https://news.ycombinator.com/item?id=9857392
https://news.ycombinator.com/item?id=6799336
At the very least, this should feature in the title. In fact, previous submissions from over a decade ago also feature the year in the title.
Otherwise it conveys the idea that this is some major epiphany.
Then it could have a bot create links to past submissions, like the OP did, and use the best title arrived at for resubmissions.
It does have the ability to catch multiple submissions of new articles but that probably has a time window of a day or so.
I suppose you could compare against the Wayback Machine, and I'm not sure I'd try to compare with an LLM or RAG.
I guess it would have been surprising in 2006?
And obviously Google would have been dealing with servers, where it'd be even older as they'd probably have been using PAE and each process would be able to allocate and address a full 4GB.
Common consumer laptops had 2-4 GB of RAM in 2006.
https://everymac.com/systems/apple/macbook_pro/specs/macbook...
I remember pretty clearly that if you had anything above 512MB of RAM in 2006, you very much had a top-of-the-line machine.
That’s a different claim than your original statement of having 100MB max.
Apple released multiple different models in 2006, "early 2006" had either 512 or 1GB base (1GB for T2600, the 17" and the high-end 15) expandable to 2GB; and "late 2006" had 1GB to 2GB base expandable to 4.
> Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible.
This is incorrect. Generally it's just a little excessive to try to solve the halting problem in a library's unit tests ;). You don't have to test a program for all possible inputs, only for all materially unique state transitions. In a binary search, each variable can be zero, one, some(*), max, or overflowed. The combination of these is not infinite: the "some" values that cause different behavior become far more manageable once they are grouped into chunks with the same behavior.
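As a sketch of what that case list looks like in practice, here is a plain-Java set of boundary checks (run against the standard library's binary search just to keep it short; the max/overflow categories are only noted in a comment, since exercising them needs an array of roughly 2^30 elements):

    import java.util.Arrays;

    public class BinarySearchBoundaryTests {
        // Categories from the comment above: zero elements, one element, and "some",
        // grouped into chunks with the same behavior (hit at either end or the middle,
        // miss below / between / above). The max and overflow cases would need an
        // array of roughly 2^30 elements, so they are only noted here, not allocated.
        public static void main(String[] args) {
            int[] empty = {};
            int[] one = {7};
            int[] some = {1, 3, 5, 7, 9};

            check(Arrays.binarySearch(empty, 7) < 0, "zero elements: nothing found");
            check(Arrays.binarySearch(one, 7) == 0, "one element: hit");
            check(Arrays.binarySearch(one, 8) < 0, "one element: miss");
            check(Arrays.binarySearch(some, 1) == 0, "some: hit at the first slot");
            check(Arrays.binarySearch(some, 5) == 2, "some: hit in the middle");
            check(Arrays.binarySearch(some, 9) == 4, "some: hit at the last slot");
            check(Arrays.binarySearch(some, 0) < 0, "some: miss below the range");
            check(Arrays.binarySearch(some, 4) < 0, "some: miss between elements");
            check(Arrays.binarySearch(some, 10) < 0, "some: miss above the range");
            System.out.println("all boundary cases passed");
        }

        static void check(boolean ok, String what) {
            if (!ok) throw new AssertionError(what);
        }
    }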
If a unit test only runs on a server and not on the developers' laptops, then it's not going to get written. Ideally someone would write a test that is tagged to run only on the server, but that is a lot of extra work if it isn't already a common practice on the project. Even now I would be quite wary of producing a max-size-input test for an array of ints, and especially of objects; that is going to raise some questions for being a slow and highly resource-consuming test.
If I were worrying about the same thing in aerospace programming, however, there is no question that test would get written, and we would ensure it ran on the hardware it was designed to run on. In typical business software it's less common to run potentially very long-running tests and simulations of the machine's states; everyone wants to go faster than that, especially for the sake of a max-input test.
In this case, it's not a bug (cannot get incorrect result) but an unsupported input.
But maybe it was uncommon to have arrays larger than 64K in the 80s due to segmented memory?
Java defines an int as 32-bit signed and nothing else; it calls 16-bit integers shorts. So it definitely has to be 8GB for the array.
Sun was using Solaris machines rather than PCs, and they were higher spec on things like memory, but I still doubt they had 2GB, let alone the 8GB+ needed to run this test. That sort of memory didn't become routine for another decade; Sandy Bridge was being tested with 8GB in 2011 [2].
It also goes to show how much things have stagnated: a desktop computer with 16GB has been standard for a long time. In the preceding decade we went from 128MB to 8GB as normal, and in the 15 years since then normal has only gone to 16-32GB, which is nowhere near the same pace of progress in memory density.
[1] https://www.anandtech.com/show/566/6 [2] https://www.anandtech.com/show/4083/the-sandy-bridge-review-...
You are presuming that the algorithm is correct in the first place. Here's a contrived example:
binary_search(array, key) {
    if (key == 453) explode_world()
    // Do correct binary search.
}
So you also need to prove that explode_world is not called, or that something similar but less contrived is not in there.

The number of relevant material states is bounded and knowable. It's neither unbounded (1, 2, 3, 4, 5, 6, ...) nor unknowable, as the algorithm is available.
> each variable can be zero, one, some(*), max, overflowed
These are the value ranges that I test in my unit tests, plus at and before some powers of 2 (e.g. 65535 and 65536), and -1. That is because -1 is uniquely used in some systems to indicate error, and thus trips some bugs.

In the code below we can see a formal verification tool (GNATProve) detect both the original error and the out-of-bounds access that it causes. Doing the arithmetic using a larger type clears both reported errors without the need for any additional annotations for GNATProve.
function Search (A : A_Type; Target : Integer) return Integer is
   Left  : Integer := A'First;
   Right : Integer := A'Last;
begin
   while Left <= Right loop
      declare
         Mid : Integer := (Left + Right) / 2;
      begin
         if A (Mid) = Target then
            return Mid;
         elsif A (Mid) < Target then
            Left := Mid + 1;
         elsif A (Mid) > Target then
            Right := Mid - 1;
         end if;
      end;
   end loop;
end Search;
GNATProve output:
Phase 1 of 2: generation of Global contracts ...
Phase 2 of 2: flow analysis and proof ...
wrapper.adb:12:36: medium: overflow check might fail, cannot prove lower bound for Left + Right
12 | Mid : Integer := (Left + Right) / 2;
| ~~~~~~^~~~~~~~
reason for check: result of addition must fit in a 32-bits machine integer
wrapper.adb:12:45: info: division check proved
wrapper.adb:14:19: medium: array index check might fail
14 | if A (Mid) = Target then
| ^~~
reason for check: value must be a valid index into the array
These kinds of problems are present in very many standard treatments of algorithms, and in some cases they don't survive their first contact with real-world use.
"needs a carry bit or a wider type" is common for arithmetic operations that actually use the range.
Basically we should favor using built-in functions as much as possible, because those should be more reliable and better tested than the ad hoc code we write. And compilers should optimize those built-in functions well, so there is no extra cost in using them.
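For instance, a minimal Java sketch that leans on java.util.Arrays.binarySearch rather than a hand-rolled loop (the demo values are arbitrary):

    import java.util.Arrays;

    public class BuiltinSearchDemo {
        public static void main(String[] args) {
            int[] sorted = {2, 3, 5, 8, 13, 21, 34};

            // The library call encapsulates the midpoint arithmetic, so callers
            // never touch the overflow-prone expression themselves.
            int hit = Arrays.binarySearch(sorted, 13);   // index of the key
            int miss = Arrays.binarySearch(sorted, 4);   // negative: -(insertion point) - 1

            System.out.println("found 13 at index " + hit);                     // 4
            System.out.println("4 would be inserted at index " + (-miss - 1));  // 2
        }
    }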
Another fun case, besides integer overflow, is negative values: in Java and C/C++ (i + j)/2 will round towards j, but i + (j - i)/2 will round towards i instead. Sometimes the difference matters.
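A tiny illustration of that rounding difference (plain Java; the values are arbitrary, chosen so the true midpoint is negative):

    public class MidpointRoundingDemo {
        public static void main(String[] args) {
            int i = -7, j = 2;   // true midpoint is -2.5

            // Integer division truncates toward zero, so for a negative sum
            // (i + j) / 2 lands on -2, the j side of the true midpoint ...
            System.out.println((i + j) / 2);       // -2

            // ... while i + (j - i) / 2 lands on -3, the i side.
            System.out.println(i + (j - i) / 2);   // -3
        }
    }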
> In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.
At some point we have to draw an arbitrary line. Even an "arbitrary precision" calculation is bounded by system memory.
"bug" is not well-defined, or perhaps "bug" is more of a continuous label as opposed to discrete.
int mid = (low + high) / 2;
> Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1).
I would really challenge calling these kinds of effects a "bug" or "breakage".
It's like calling Newton's law of gravity broken because it's not accurate at predicting how galaxies move.
Things are engineered and tested for a certain scale.
Knowing which tools to use at which scale is part of the craft of engineering.
More often they’re engineered and tested for an arbitrary scale. The limits aren’t considered, behavior at the edges isn’t accounted for, and it’s assumed it will be good enough for real world inputs.
The use of `int` tends to be a dead giveaway. There are some cases where it’s clearly correct: where the spec says so (like argv), where you’re starting from a smaller type and it’s impossible for the calculations to overflow in an int (like adding two uint8), that sort of thing. And there are cases where it’s subtly correct, because you know the range of the value is sufficiently limited, either by mechanics or by spec.
But most of the time, int gets chosen because it’s the apparent default and it’s easy to type. No analysis has been done to see if it’s correct or if you want to declare your code to only support inputs in a certain range.
It’s really clear if you’ve written a binary search (or anything else that works on general arrays) in C and you use int as the index type. There’s pretty much no scenario where that makes sense. In theory you could analyze the entire program and prove that over-large arrays are never passed in, and keep doing it to make sure it stays that way, but that’s not realistic. If the programmer actually took one second to think about the appropriate data types, they’d use size_t rather than int.
You can still have this bug with size_t, of course. But it won’t be “this falls apart with arrays over 1G elements on 64-bit systems that can easily handle them.” If you declare that you wrote the obvious midpoint calculation with size_t because you didn’t intend to support byte arrays larger than half the address space, it’s at least plausible.
I think it actually comes from the opposite of portability. Access to different kinds of systems wasn’t common then. If you were learning and working on a system where int is 32 bits and pointers are 32 bits, and other possibilities are just vague mentions in whatever books you’re learning from, it’s very easy to get into the habit of thinking that int is the right type for a 32-bit quantity and for something that can hold a pointer.
Also, for something like half my programming life, a 2+GB array was basically unobtainable.
In my work, I have a lot of data that is > 2GB, so int32_t vs uint32_t is very meaningful, and usually using a uint32_t is just delaying upgrading to int64_t or uint64_t.
Going in the other direction, a point cloud can usually be represented using 3 uint16_t and that saves a lot of memory vs using uint32_t or uint64_t.
Make it explicit. If the array is too large, throw an IllegalArgumentException. Document the limit. Then I agree with you.
Otherwise, if an allowed input crashes the program at runtime with a random exception, I respectfully disagree.
Not doing that is not good enough.
Another example is summing lots of floats naively instead of using Kahan's algorithm.
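For reference, a minimal Java sketch of Kahan (compensated) summation; the class name and the 0.1-filled demo array are just illustrative:

    public class KahanDemo {
        // Compensated summation: carry along the low-order bits each addition loses.
        static double kahanSum(double[] xs) {
            double sum = 0.0;
            double c = 0.0;            // running compensation for lost low-order bits
            for (double x : xs) {
                double y = x - c;      // apply the correction to the next term
                double t = sum + y;    // big + small: low-order bits of y may be lost
                c = (t - sum) - y;     // recover what was just lost
                sum = t;
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] xs = new double[10_000_000];
            java.util.Arrays.fill(xs, 0.1);

            double naive = 0.0;
            for (double x : xs) naive += x;

            System.out.println("naive: " + naive);
            System.out.println("kahan: " + kahanSum(xs));  // much closer to 1,000,000
        }
    }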
It's like we had a theory of gravity that doesn't work on Mars because we have unnecessary division by (m-Mars' mass) in our laws :) It wouldn't be good physics.
Hopefully implemented by someone who cared enough to avoid bugs or at least document the range of the arguments.
This is a bit like arguing that every building humans have built is flawed because it wouldn't withstand a significant meteor strike. That's not how we do it. We consider maximum likely use cases and then build in safety margins of at least 10^2, sometimes up to 10^6. If you think there's a possibility you're dealing with hundreds of millions of elements, you stop using int32 entirely. You don't wait for 100 million to turn into 4 billion and crash.
Yes! If an API has no documented (and preferably enforced) limits on its arguments, the caller has every right to assume the full range of the argument types is supported.
Else it is very much a bug.
“Be reasonable, dude” is ok when you order 1,000 beers. It’s not an appropriate response when calling a binary search API.
Maybe, but only if you (and only you, not an attacker) can control the case you are in.
Also, what makes you think I'm not adapting the algorithm? I mean, I set the search indexes at the address of the beginning and end of the totally-not-an-array array so as to not have to do a relative memory read, because those are not supported in my microcontroller, or if they are supported, it's an extra cycle, and my timing budget is quite tight. Also, I don't do a div by 2, but a shift right by 1, same reason.
Or just put the algorithm in a 16-bit microcontroller, put some table that needs to be looked up (think precomputed sine table), put that table near the end of the memory range, and just make the mistake to call binary search specifying the start and end memory positions of the table.
It's such a blatant example of a bug that I struggle to believe anyone can even remotely conceive of, and pivot to, the idea that "nuh-uh, this is a bad spec, not a bad implementation!".
Production code is often filled with edge-case bugs that simply never come up. Working for 100x the expected use case is generally good enough; if you're approaching the limits where 2^31 is a potential issue, then you are also approaching the case where 2^32 definitely will be.
The actual problem is management and developers not treating software that someone could meaningfully attack as a different category.
Nah, that sounds impossible. I'm sure it has never ever happened.
There's a user moving around in a game, and there's opening a workplace file; these are inherently different kinds of risks. If you're building a Flappy Bird clone, you can know exactly how it's going to interact with the world.
At some point during my undergrad I realized this and tried to be really careful when implementing algorithms, but it's stupidly hard to do in a tidy way, at least in old C. It's not practical, and people would just rather close their eyes and live in blissful avoidance.
You could argue that the array itself could take up most of the space leaving no room for the indices, but that's hardly a fault with the algorithm, as now you've got a computer that basically can't do anything due to overloading. Whereas overflowing a 32 bit integer is a much more likely occurrence that arguably the algorithm should account for.
I find it valuable to understand when my code will crash.
I suppose the issue is moot now on 64-bit architectures, as the difference between 2^62 and 2^63 isn't relevant. (2^62 bytes is 4611686 terabytes.)
Edit: Well not quite, more like a small closet.
Independent from RV64I, which in turn is independent from (and predates) RV32I.
Yeah, it's very common for computers to have byte addressable 4 exabytes of storage...
Having said that, with the advances of Electron apps, you might very well have a point...
Or you could just write the code so it isn't vulnerable to integer overflow.
If not, then frankly you're into such unexpected-behaviour territory that you should check your whole codebase rather than rely on stuff working just because it compiled.
And we all know how everyone loves writing and understanding integration tests... (I personally do but most problems in the industry wouldn't be a thing if more people stopped to write them)
Most code at every company I've worked at and project I've built is littered with technical incorrectness and buggy half-measure implementations. It's human nature or time pressure or something like that, but the application continues providing business value despite that.
A binary search implementation should still work, regardless of the array length, or have the limitation documented.
And of course an array "with length over a billion" can be totally valid, depending on the use case, your tradeoffs, available memory, etc. It could even be the optimal data structure for some use cases.
Also, big arrays being (supposedly) a code smell doesn't mean that code handling them improperly is not buggy.
Those APIs use long as their offset, unlike the 32-bit ints used by arrays, and would avoid having to copy the data into some other object.
There has been some discussion over the years about how arrays could be changed in the JVM to support longer lengths, but doing so without breaking existing code and while providing truly useful functionality without providing obvious footguns isn’t as easy as you might think.
If your code has arrays over a billion elements, then it will fall over the moment someone inputs slightly larger data
In this instance, the url is slightly different from previous submissions so some more clever fuzzy matching or using only the title would be needed.
Of course some minimal implementation without AI techniques could already handle many cases. My AI suggestion was not dead-serious ;)
article:published_time - datetime - When the article was first published.
article:modified_time - datetime - When the article was last changed.
For example, I pulled up a random article on another website, and found these <meta> tags in the <head>:
<meta property="article:published_time" content="2025-01-11T13:00:00.000Z">
<meta property="article:modified_time" content="2025-01-11T13:00:00.000Z">
For pages that contain this metadata, it would be a cheaper/faster implementation than using an LLM, but using an LLM as a fallback could easily provide you with the publication date of this Google article.
[0]: https://ogp.me/
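A minimal sketch of that cheaper path, assuming the jsoup HTML parser (the LLM fallback is left out, and the class/method names are just illustrative):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PublishedDateSniffer {
        // Returns the OpenGraph publication timestamp if the page exposes one, else null.
        static String publishedTime(String url) throws Exception {
            Document doc = Jsoup.connect(url).get();
            String value = doc.select("meta[property=article:published_time]").attr("content");
            return value.isEmpty() ? null : value;   // e.g. "2025-01-11T13:00:00.000Z"
        }

        public static void main(String[] args) throws Exception {
            System.out.println(publishedTime(args[0]));
        }
    }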
In the submission title, a simple regex for the presence of a date with a standard format (e.g. %Y) would suffice.
Matching it to the article might or might not be possible, but that would already be enough (assuming having the date is a good thing, which I'm not certain at all)
Outside that, no clue, been a long time since I last wrote crawlers, admittedly. Though it can't be too difficult to crowd-source origin date parsers per domain?
But hey, if any LLM's free tier can achieve it, then why not. My point was that many people worked on that particular problem historically. It would be a shame if we can't use any of their hard work.
(Also, what brabel said: https://news.ycombinator.com/item?id=42665755.)
(Or you could use a language with arbitrary-sized integers. Python surged in popularity in 2005, as I recall.)
Interestingly, calculating the midpoint of doubles seems quite complex (according to gcc at least) [1]
If you look at every algorithm through that lens, interpreting them with the actual type instead of an idealised one, then bugs like this will jump out at you. (Similarly, compiler optimizer passes and the like all need to account for this.)
Then there is no way your array can be so big that high + low overflows, unless it's an array of bytes spanning more than half the entire 64-bit virtual address space, which is a fiction.
(If you're binary searching some abstract space that is not an array, I retract my ignorant statements.)
Well... If your proof made the (false) assumption that int is an unbounded integral type, then you didn't prove the program is correct at all. What you proved was that an algorithm in some ideal universe is correct. But your program is a different beast that lives in a specific programming language.
Here is a relevant line from Go, even with a comment that it's about overflow:
https://github.com/golang/go/blob/19e9231/src/sort/search.go...
Anyway... Merge sort doesn't even need to be recursive in the first place. It's always taught that way in CS classes, but it can just as easily be written with nested for loops. On the outermost for loop, you double the sublist size until you exceed the whole size.
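A minimal sketch of that nested-loop (bottom-up) formulation in Java; the names and the small demo array are just illustrative, and for simplicity it assumes lengths well below 2^30 so the width arithmetic itself stays inside int range:

    import java.util.Arrays;

    public class BottomUpMergeSort {
        // Iterative merge sort: the outer loop doubles the sorted-run width,
        // the inner loop merges each pair of adjacent runs of that width.
        static void sort(int[] a) {
            int n = a.length;
            int[] buf = new int[n];
            for (int width = 1; width < n; width *= 2) {
                for (int lo = 0; lo < n; lo += 2 * width) {
                    int mid = Math.min(lo + width, n);
                    int hi = Math.min(lo + 2 * width, n);
                    merge(a, buf, lo, mid, hi);
                }
            }
        }

        // Merge the sorted runs a[lo..mid) and a[mid..hi) through a scratch buffer.
        static void merge(int[] a, int[] buf, int lo, int mid, int hi) {
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) buf[k++] = a[i++];
            while (j < hi) buf[k++] = a[j++];
            System.arraycopy(buf, lo, a, lo, hi - lo);
        }

        public static void main(String[] args) {
            int[] xs = {5, 1, 4, 1, 5, 9, 2, 6};
            sort(xs);
            System.out.println(Arrays.toString(xs));  // [1, 1, 2, 4, 5, 5, 6, 9]
        }
    }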
That algorithm has some weird performance drops at 2^n+1 sorted elements. https://bruceediger.com/posts/mergesort-investigation-8/
int mid = low + ((high - low) / 2);
is probably what most of us originally came up with before we saw the shorter, more elegant, but overflow-prone approach. For me, that is the most readable way to do it because it lines up both conceptually and visually with how I'm thinking of the problem.
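For completeness, a small side-by-side of the overflow-prone expression and the two safe variants (the unsigned-shift form is, if I remember right, the one the article suggests for Java; indices are assumed non-negative):

    public class MidpointVariants {
        public static void main(String[] args) {
            int low = 1 << 30;                      // 1073741824
            int high = Integer.MAX_VALUE - 1;       // 2147483646

            int broken = (low + high) / 2;          // the sum wraps to a negative value
            int safe1 = low + (high - low) / 2;     // the difference can't overflow when low <= high
            int safe2 = (low + high) >>> 1;         // unsigned shift reads the wrapped bits correctly

            System.out.println(broken);             // negative
            System.out.println(safe1);              // 1610612735
            System.out.println(safe2);              // 1610612735
        }
    }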
So many programs are wrong because of this and they are just ticking bombs.
Processors are doing a reasonable job, but there are certainly things they could do differently. Some languages have pretty specific integer types. And then you have C. Java at least makes all the integer widths consistently sized on all platforms, but neither provides good ways to express and manage overflow or carry. And the architects of Java decided that developers aren't qualified to use unsigned types, which makes a lot of tasks much harder.
These have completely different requirements and edge cases. It was a mistake trying to tackle all of these with a singular type.
Specifically, text files. Binary search doesn't need the exact midpoint, just somewhere nearby, so I pick the midpoint and look around for line boundaries.
On normal sorted text files, this works fairly well. I was able to search a list of a billion IP addresses while touching only a few pages.
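A rough Java sketch of that technique, assuming a sorted file with one entry per line, '\n' line endings, and a byte-per-character encoding (RandomAccessFile.readLine is byte-oriented); the names are just illustrative:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SortedFileSearch {
        // Back up from `pos` to the start of the line containing it.
        static long lineStart(RandomAccessFile f, long pos) throws IOException {
            while (pos > 0) {
                f.seek(pos - 1);
                if (f.read() == '\n') break;   // the previous byte ends a line, so pos starts one
                pos--;
            }
            return pos;
        }

        // True if `key` appears as a complete line in the sorted file.
        static boolean contains(RandomAccessFile f, String key) throws IOException {
            long lo = 0, hi = f.length();      // the key line, if present, starts in [lo, hi)
            while (lo < hi) {
                long mid = lo + (hi - lo) / 2; // a byte offset, not a line number
                long start = lineStart(f, mid);
                f.seek(start);
                String line = f.readLine();
                long next = f.getFilePointer();
                int cmp = line.compareTo(key);
                if (cmp == 0) return true;
                if (cmp < 0) lo = next;        // this line and everything before it are too small
                else hi = start;               // this line and everything after it are too big
            }
            return false;
        }

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
                System.out.println(contains(f, args[1]));
            }
        }
    }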
shift left is worse in that case right?
Also not correct if the sum overflows.
TFA is actually the article that made this common knowledge; it's from 2006.