Past discussions:
https://news.ycombinator.com/item?id=33492122
https://news.ycombinator.com/item?id=14906429
https://news.ycombinator.com/item?id=12147703
https://news.ycombinator.com/item?id=9857392
https://news.ycombinator.com/item?id=6799336
At the very least, this should feature in the title. In fact, previous submissions from over a decade ago also feature the year in the title.
Otherwise it conveys the idea that this is some major epiphany.
Then it could have a bot create links to past submissions, like the OP did, and use the best title arrived at for resubmissions.
It does have the ability to catch multiple submissions of new articles but that probably has a time window of a day or so.
I suppose you could compare against the Wayback Machine, and I'm not sure I'd try to compare with an LLM or RAG.
I guess it would have been surprising in 2006?
And obviously Google would have been dealing with servers, where it'd be even older as they'd probably have been using PAE and each process would be able to allocate and address a full 4GB.
Common consumer laptops had 2-4 GB of RAM in 2006.
https://everymac.com/systems/apple/macbook_pro/specs/macbook...
I remember pretty clearly that if you had anything above 512MB of RAM in 2006, you very much had a top-of-the-line machine.
That’s a different claim than your original statement of having 100MB max.
Apple released multiple different models in 2006, "early 2006" had either 512 or 1GB base (1GB for T2600, the 17" and the high-end 15) expandable to 2GB; and "late 2006" had 1GB to 2GB base expandable to 4.
> Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible.
This is incorrect. Generally it's just a little excessive to try to solve the halting problem in a library's unit tests ;). You don't have to test a program for all possible inputs, only for all materially unique state transitions. In a binary search, each variable can be zero, one, some(*), max, or overflowed. The combination of these is not infinite: the "some" values that cause different behavior become far more manageable once they are grouped into chunks with the same behavior.
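As a sketch of what that case list looks like in practice, here is a plain-Java set of boundary checks (run against the standard library's binary search just to keep it short; the max/overflow categories are only noted in a comment, since exercising them needs an array of roughly 2^30 elements):

    import java.util.Arrays;

    public class BinarySearchBoundaryTests {
        // Categories from the comment above: zero elements, one element, and "some",
        // grouped into chunks with the same behavior (hit at either end or the middle,
        // miss below / between / above). The max and overflow cases would need an
        // array of roughly 2^30 elements, so they are only noted here, not allocated.
        public static void main(String[] args) {
            int[] empty = {};
            int[] one = {7};
            int[] some = {1, 3, 5, 7, 9};

            check(Arrays.binarySearch(empty, 7) < 0, "zero elements: nothing found");
            check(Arrays.binarySearch(one, 7) == 0, "one element: hit");
            check(Arrays.binarySearch(one, 8) < 0, "one element: miss");
            check(Arrays.binarySearch(some, 1) == 0, "some: hit at the first slot");
            check(Arrays.binarySearch(some, 5) == 2, "some: hit in the middle");
            check(Arrays.binarySearch(some, 9) == 4, "some: hit at the last slot");
            check(Arrays.binarySearch(some, 0) < 0, "some: miss below the range");
            check(Arrays.binarySearch(some, 4) < 0, "some: miss between elements");
            check(Arrays.binarySearch(some, 10) < 0, "some: miss above the range");
            System.out.println("all boundary cases passed");
        }

        static void check(boolean ok, String what) {
            if (!ok) throw new AssertionError(what);
        }
    }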
If a unit test only runs on a server and not on the developers' laptops, then it's not going to get written. Ideally someone would write a test that is tagged to run only on the server, but that is a lot of extra work if it isn't already a common practice on the project. Even now I would be quite wary of producing a max-size-input test for an array of ints, and especially of objects; that is going to raise some questions for being a slow and highly resource-consuming test.
If I were worrying about the same thing in aerospace programming, however, there is no question that test would get written, and we would ensure it ran on the hardware it was designed to run on. In typical business software it's less common to run potentially very long-running tests and simulations of the machine's states; everyone wants to go faster than that, especially for the sake of a max-input test.
In this case, it's not a bug (cannot get incorrect result) but an unsupported input.
But maybe it was uncommon to have arrays larger than 64K in the 80s due to segmented memory?
Java defines an int as 32-bit signed and nothing else; it calls 16-bit integers shorts. So it definitely has to be 8GB for the array.
Sun was using Solaris machines rather than PCs, and they were higher spec on things like memory, but I still doubt they had 2GB, let alone the 8GB+ needed to run this test. That sort of memory didn't become routine for another decade; Sandy Bridge was being tested with 8GB in 2011 [2].
It also goes to show how much things have stagnated: a desktop computer with 16GB has been standard for a long time. In the preceding decade we went from 128MB to 8GB as normal, and in the 15 years since then normal has only gone to 16-32GB, which is nowhere near the same pace of progress in memory density.
[1] https://www.anandtech.com/show/566/6 [2] https://www.anandtech.com/show/4083/the-sandy-bridge-review-...
You are presuming that the algorithm is correct in the first place. Here's a contrived example:
binary_search(array, key) {
    if (key == 453) explode_world()
    // Do correct binary search.
}
So you also need to prove that explode_world is not called, or that something similar but less contrived is not in there.

The number of relevant material states is bounded and knowable. It's neither unbounded (1, 2, 3, 4, 5, 6, ...) nor unknowable, as the algorithm is available.
> each variable can be zero, one, some(*), max, overflowed
These are the value ranges that I test in my unit tests, plus at and before some powers of 2 (e.g. 65535 and 65536), and -1. That is because -1 is uniquely used in some systems to indicate error, and thus trips some bugs.

In the code below we can see a formal verification tool (GNATProve) detect both the original error and the out-of-bounds access that it causes. Doing the arithmetic using a larger type clears both reported errors without the need for any additional annotations for GNATProve.
function Search (A : A_Type; Target : Integer) return Integer is
   Left  : Integer := A'First;
   Right : Integer := A'Last;
begin
   while Left <= Right loop
      declare
         Mid : Integer := (Left + Right) / 2;
      begin
         if A (Mid) = Target then
            return Mid;
         elsif A (Mid) < Target then
            Left := Mid + 1;
         elsif A (Mid) > Target then
            Right := Mid - 1;
         end if;
      end;
   end loop;
end Search;
GNATProve output:
Phase 1 of 2: generation of Global contracts ...
Phase 2 of 2: flow analysis and proof ...
wrapper.adb:12:36: medium: overflow check might fail, cannot prove lower bound for Left + Right
12 | Mid : Integer := (Left + Right) / 2;
| ~~~~~~^~~~~~~~
reason for check: result of addition must fit in a 32-bits machine integer
wrapper.adb:12:45: info: division check proved
wrapper.adb:14:19: medium: array index check might fail
14 | if A (Mid) = Target then
| ^~~
reason for check: value must be a valid index into the array
These kinds of problems are present in very many standard treatments of algorithms, and in some cases they don't survive their first contact with real-world use.
"needs a carry bit or a wider type" is common for arithmetic operations that actually use the range.
Basically we should favor using built-in functions as much as possible, because those should be more reliable and better tested than the ad hoc code we write. And compilers should optimize those built-in functions well, so there is no extra cost in using them.
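For instance, a minimal Java sketch that leans on java.util.Arrays.binarySearch rather than a hand-rolled loop (the demo values are arbitrary):

    import java.util.Arrays;

    public class BuiltinSearchDemo {
        public static void main(String[] args) {
            int[] sorted = {2, 3, 5, 8, 13, 21, 34};

            // The library call encapsulates the midpoint arithmetic, so callers
            // never touch the overflow-prone expression themselves.
            int hit = Arrays.binarySearch(sorted, 13);   // index of the key
            int miss = Arrays.binarySearch(sorted, 4);   // negative: -(insertion point) - 1

            System.out.println("found 13 at index " + hit);                     // 4
            System.out.println("4 would be inserted at index " + (-miss - 1));  // 2
        }
    }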
Another fun case, besides integer overflow, is negative values: in Java and C/C++ (i + j)/2 will round towards j, but i + (j - i)/2 will round towards i instead. Sometimes the difference matters.
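A tiny illustration of that rounding difference (plain Java; the values are arbitrary, chosen so the true midpoint is negative):

    public class MidpointRoundingDemo {
        public static void main(String[] args) {
            int i = -7, j = 2;   // true midpoint is -2.5

            // Integer division truncates toward zero, so for a negative sum
            // (i + j) / 2 lands on -2, the j side of the true midpoint ...
            System.out.println((i + j) / 2);       // -2

            // ... while i + (j - i) / 2 lands on -3, the i side.
            System.out.println(i + (j - i) / 2);   // -3
        }
    }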
> In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.
At some point we have to draw an arbitrary line. Even an "arbitrary precision" calculation is bounded by system memory.
"bug" is not well-defined, or perhaps "bug" is more of a continuous label as opposed to discrete.
int mid = (low + high) / 2;
> Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1).
I would really challenge calling these kinds of effects a "bug" or "breakage".
It's like calling Newton's law of gravity broken because it's not accurate at predicting how galaxies move.
Things are engineered and tested for a certain scale.
Knowing which tools to use at which scale is part of the craft of engineering.
More often they’re engineered and tested for an arbitrary scale. The limits aren’t considered, behavior at the edges isn’t accounted for, and it’s assumed it will be good enough for real world inputs.
The use of `int` tends to be a dead giveaway. There are some cases where it’s clearly correct: where the spec says so (like argv), where you’re starting from a smaller type and it’s impossible for the calculations to overflow in an int (like adding two uint8), that sort of thing. And there are cases where it’s subtly correct, because you know the range of the value is sufficiently limited, either by mechanics or by spec.
But most of the time, int gets chosen because it’s the apparent default and it’s easy to type. No analysis has been done to see if it’s correct or if you want to declare your code to only support inputs in a certain range.
It’s really clear if you’ve written a binary search (or anything else that works on general arrays) in C and you use int as the index type. There’s pretty much no scenario where that makes sense. In theory you could analyze the entire program and prove that over-large arrays are never passed in, and keep doing it to make sure it stays that way, but that’s not realistic. If the programmer actually took one second to think about the appropriate data types, they’d use size_t rather than int.
You can still have this bug with size_t, of course. But it won’t be “this falls apart with arrays over 1G elements on 64-bit systems that can easily handle them.” If you declare that you wrote the obvious midpoint calculation with size_t because you didn’t intend to support byte arrays larger than half the address space, it’s at least plausible.
I think it actually comes from the opposite of portability. Access to different kinds of systems wasn’t common then. If you were learning and working on a system where int is 32 bits and pointers are 32 bits, and other possibilities are just vague mentions in whatever books you’re learning from, it’s very easy to get into the habit of thinking that int is the right type for a 32-bit quantity and for something that can hold a pointer.
Also, for something like half my programming life, a 2+GB array was basically unobtainable.
In my work, I have a lot of data that is > 2GB, so int32_t vs uint32_t is very meaningful, and usually using a uint32_t is just delaying upgrading to int64_t or uint64_t.
Going in the other direction, a point cloud can usually be represented using 3 uint16_t and that saves a lot of memory vs using uint32_t or uint64_t.
Make it explicit. If the array is too large, throw an IllegalArgumentException. Document the limit. Then I agree with you.
Otherwise, if an allowed input crashes the program at runtime with a random exception, I respectfully disagree.
Not doing that is not good enough.
Another example is summing lots of floats naively instead of using Kahan's algorithm.
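For reference, a minimal Java sketch of Kahan (compensated) summation; the class name and the 0.1-filled demo array are just illustrative:

    public class KahanDemo {
        // Compensated summation: carry along the low-order bits each addition loses.
        static double kahanSum(double[] xs) {
            double sum = 0.0;
            double c = 0.0;            // running compensation for lost low-order bits
            for (double x : xs) {
                double y = x - c;      // apply the correction to the next term
                double t = sum + y;    // big + small: low-order bits of y may be lost
                c = (t - sum) - y;     // recover what was just lost
                sum = t;
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] xs = new double[10_000_000];
            java.util.Arrays.fill(xs, 0.1);

            double naive = 0.0;
            for (double x : xs) naive += x;

            System.out.println("naive: " + naive);
            System.out.println("kahan: " + kahanSum(xs));  // much closer to 1,000,000
        }
    }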
It's like we had a theory of gravity that doesn't work on Mars because we have unnecessary division by (m-Mars' mass) in our laws :) It wouldn't be good physics.
Hopefully implemented by someone who cared enough to avoid bugs or at least document the range of the arguments.
This is a bit like arguing that every building humans have built is flawed because it wouldn't withstand a significant meteor strike. That's not how we do it. We consider maximum likely use cases and then build in safety margins of at least 10^2, sometimes up to 10^6. If you think there's a possibility you're dealing with hundreds of millions of elements, you stop using int32 entirely. You don't wait for 100 million to turn into 4 billion and crash.
Yes! If an API has no documented (and preferably enforced) limits on its arguments, the caller has every right to assume the full range of the argument types is supported.
Else it is very much a bug.
“Be reasonable, dude” is ok when you order 1,000 beers. It’s not an appropriate response when calling a binary search API.
Maybe, but only if you (and only you, not an attacker) can control the case you are in.
Also, what makes you think I'm not adapting the algorithm? I mean, I set the search indexes at the address of the beginning and end of the totally-not-an-array array so as to not have to do a relative memory read, because those are not supported in my microcontroller, or if they are supported, it's an extra cycle, and my timing budget is quite tight. Also, I don't do a div by 2, but a shift right by 1, same reason.
Or just put the algorithm in a 16-bit microcontroller, put some table that needs to be looked up (think precomputed sine table), put that table near the end of the memory range, and just make the mistake to call binary search specifying the start and end memory positions of the table.
It's such a blatant example of a bug that I struggle to believe anyone can even remotely conceive of, and pivot to, the idea that "nuh-uh, this is a bad spec, not a bad implementation!".
Production code is often filled with edge-case bugs that simply never come up. Working for 100x the expected use case is generally good enough; if you're approaching the limits where 2^31 is a potential issue, then you are also approaching the case where 2^32 definitely will be.
The actual problem is management and developers not treating software that someone could meaningfully attack as a different category.
Nah, that sounds impossible. I'm sure it has never ever happened.
There's a user moving around in a game, and there's opening a workplace file; these are inherently different kinds of risks. If you're building a Flappy Bird clone, you can know exactly how it's going to interact with the world.
At some point during my undergrad I realized this and tried to be really careful when implementing algorithms, but it's stupidly hard to do in a tidy way, at least in old C. It's not practical, and people would just rather close their eyes and live in blissful avoidance.
You could argue that the array itself could take up most of the space leaving no room for the indices, but that's hardly a fault with the algorithm, as now you've got a computer that basically can't do anything due to overloading. Whereas overflowing a 32 bit integer is a much more likely occurrence that arguably the algorithm should account for.
I find it valuable to understand when my code will crash.
I suppose the issue is moot now on 64-bit architectures, as the difference between 2^62 and 2^63 isn't relevant. (2^62 bytes is 4611686 terabytes.)
Edit: Well not quite, more like a small closet.
Independent from RV64I, which in turn is independent from (and predates) RV32I.
Yeah, it's very common for computers to have byte addressable 4 exabytes of storage...
Having said that, with the advances of Electron apps, you might very well have a point...
Or you could just write the code so it isn't vulnerable to integer overflow.
If not, then frankly you're into such unexpected-behaviour territory that you should check your whole codebase rather than rely on stuff working just because it compiled.
And we all know how everyone loves writing and understanding integration tests... (I personally do but most problems in the industry wouldn't be a thing if more people stopped to write them)
Most code at every company I've worked at and project I've built is littered with technical incorrectness and buggy half-measure implementations. It's human nature or time pressure or something like that, but the application continues providing business value despite that.
A binary search implementation should still work, regardless of the array length, or have the limitation documented.
And of course an array "with length over a billion" can be totally valid, depending on the use case, your tradeoffs, available memory, etc. It could even be the optimal data structure for some use cases.
Also, big arrays being (supposedly) a code smell doesn't mean that code handling them improperly is not buggy.
Those APIs use long as their offset, unlike the 32-bit ints used by arrays, and would avoid having to copy the data into some other object.
There has been some discussion over the years about how arrays could be changed in the JVM to support longer lengths, but doing so without breaking existing code and while providing truly useful functionality without providing obvious footguns isn’t as easy as you might think.
If your code has arrays over a billion elements, then it will fall over the moment someone inputs slightly larger data
In this instance, the url is slightly different from previous submissions so some more clever fuzzy matching or using only the title would be needed.
Of course some minimal implementation without AI techniques could already handle many cases. My AI suggestion was not dead-serious ;)
article:published_time - datetime - When the article was first published.
article:modified_time - datetime - When the article was last changed.
For example, I pulled up a random article on another website, and found these <meta> tags in the <head>:
<meta property="article:published_time" content="2025-01-11T13:00:00.000Z">
<meta property="article:modified_time" content="2025-01-11T13:00:00.000Z">
For pages that contain this metadata, it would be a cheaper/faster implementation than using an LLM, but using an LLM as a fallback could easily provide you with the publication date of this Google article.
[0]: https://ogp.me/
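A minimal sketch of that cheaper path, assuming the jsoup HTML parser (the LLM fallback is left out, and the class/method names are just illustrative):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PublishedDateSniffer {
        // Returns the OpenGraph publication timestamp if the page exposes one, else null.
        static String publishedTime(String url) throws Exception {
            Document doc = Jsoup.connect(url).get();
            String value = doc.select("meta[property=article:published_time]").attr("content");
            return value.isEmpty() ? null : value;   // e.g. "2025-01-11T13:00:00.000Z"
        }

        public static void main(String[] args) throws Exception {
            System.out.println(publishedTime(args[0]));
        }
    }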
In the submission title, a simple regex for the presence of a date with a standard format (e.g. %Y) would suffice.
Matching it to the article might or might not be possible, but that would already be enough (assuming having the date is a good thing, which I'm not certain at all)
Outside that, no clue, been a long time since I last wrote crawlers, admittedly. Though it can't be too difficult to crowd-source origin date parsers per domain?
But hey, if any LLM's free tier can achieve it, then why not. My point was that many people worked on that particular problem historically. It would be a shame if we can't use any of their hard work.
(Also, what brabel said: https://news.ycombinator.com/item?id=42665755.)
(Or you could use a language with arbitrary-sized integers. Python surged in popularity in 2005, as I recall.)
Interestingly, calculating the midpoint of doubles seems quite complex (according to gcc at least) [1]
If you look at every algorithm through that lens, interpreting them with the actual type instead of an idealised one, then bugs like this will jump out at you. (Similarly, compiler optimizer passes and the like all need to account for this.)
Then there is no way your array can be so big that high + low overflows, unless it's an array of bytes spanning more than half the entire 64-bit virtual address space, which is a fiction.
(If you're binary searching some abstract space that is not an array, I retract my ignorant statements.)
Well... If your proof made the (false) assumption that int is an unbounded integral type, then you didn't prove the program is correct at all. What you proved was that an algorithm in some ideal universe is correct. But your program is a different beast that lives in a specific programming language.
Here is a relevant line from Go, even with a comment that it's about overflow:
https://github.com/golang/go/blob/19e9231/src/sort/search.go...
Anyway... Merge sort doesn't even need to be recursive in the first place. It's always taught that way in CS classes, but it can just as easily be written with nested for loops. On the outermost for loop, you double the sublist size until you exceed the whole size.
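A minimal sketch of that nested-loop (bottom-up) formulation in Java; the names and the small demo array are just illustrative, and for simplicity it assumes lengths well below 2^30 so the width arithmetic itself stays inside int range:

    import java.util.Arrays;

    public class BottomUpMergeSort {
        // Iterative merge sort: the outer loop doubles the sorted-run width,
        // the inner loop merges each pair of adjacent runs of that width.
        static void sort(int[] a) {
            int n = a.length;
            int[] buf = new int[n];
            for (int width = 1; width < n; width *= 2) {
                for (int lo = 0; lo < n; lo += 2 * width) {
                    int mid = Math.min(lo + width, n);
                    int hi = Math.min(lo + 2 * width, n);
                    merge(a, buf, lo, mid, hi);
                }
            }
        }

        // Merge the sorted runs a[lo..mid) and a[mid..hi) through a scratch buffer.
        static void merge(int[] a, int[] buf, int lo, int mid, int hi) {
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) buf[k++] = a[i++];
            while (j < hi) buf[k++] = a[j++];
            System.arraycopy(buf, lo, a, lo, hi - lo);
        }

        public static void main(String[] args) {
            int[] xs = {5, 1, 4, 1, 5, 9, 2, 6};
            sort(xs);
            System.out.println(Arrays.toString(xs));  // [1, 1, 2, 4, 5, 5, 6, 9]
        }
    }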
That algorithm has some weird performance drops at 2^n+1 sorted elements. https://bruceediger.com/posts/mergesort-investigation-8/
int mid = low + ((high - low) / 2);
is probably what most of us originally came up with before we saw the shorter, more elegant, but overflow-prone approach. For me, that is the most readable way to do it because it lines up both conceptually and visually with how I'm thinking of the problem.
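For completeness, a small side-by-side of the overflow-prone expression and the two safe variants (the unsigned-shift form is, if I remember right, the one the article suggests for Java; indices are assumed non-negative):

    public class MidpointVariants {
        public static void main(String[] args) {
            int low = 1 << 30;                      // 1073741824
            int high = Integer.MAX_VALUE - 1;       // 2147483646

            int broken = (low + high) / 2;          // the sum wraps to a negative value
            int safe1 = low + (high - low) / 2;     // the difference can't overflow when low <= high
            int safe2 = (low + high) >>> 1;         // unsigned shift reads the wrapped bits correctly

            System.out.println(broken);             // negative
            System.out.println(safe1);              // 1610612735
            System.out.println(safe2);              // 1610612735
        }
    }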
So many programs are wrong because of this and they are just ticking bombs.
Processors are doing a reasonable job, but there are certainly things they could do differently. Some languages have pretty specific integer types. And then you have C. Java at least makes all the integer widths consistently sized on all platforms, but neither provides good ways to express and manage overflow or carry. And the architects of Java decided that developers aren't qualified to use unsigned types, which makes a lot of tasks much harder.
These have completely different requirements and edge cases. It was a mistake trying to tackle all of these with a singular type.
Specifically, text files. Binary search doesn't need the exact midpoint, just somewhere nearby, so I pick the midpoint and look around for line boundaries.
On normal sorted text files, this works fairly well. I was able to search a list of a billion IP addresses while touching only a few pages.
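A rough Java sketch of that technique, assuming a sorted file with one entry per line, '\n' line endings, and a byte-per-character encoding (RandomAccessFile.readLine is byte-oriented); the names are just illustrative:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SortedFileSearch {
        // Back up from `pos` to the start of the line containing it.
        static long lineStart(RandomAccessFile f, long pos) throws IOException {
            while (pos > 0) {
                f.seek(pos - 1);
                if (f.read() == '\n') break;   // the previous byte ends a line, so pos starts one
                pos--;
            }
            return pos;
        }

        // True if `key` appears as a complete line in the sorted file.
        static boolean contains(RandomAccessFile f, String key) throws IOException {
            long lo = 0, hi = f.length();      // the key line, if present, starts in [lo, hi)
            while (lo < hi) {
                long mid = lo + (hi - lo) / 2; // a byte offset, not a line number
                long start = lineStart(f, mid);
                f.seek(start);
                String line = f.readLine();
                long next = f.getFilePointer();
                int cmp = line.compareTo(key);
                if (cmp == 0) return true;
                if (cmp < 0) lo = next;        // this line and everything before it are too small
                else hi = start;               // this line and everything after it are too big
            }
            return false;
        }

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
                System.out.println(contains(f, args[1]));
            }
        }
    }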
shift left is worse in that case right?
Also not correct if the sum overflows.
TFA is actually the article that made this common knowledge; it's from 2006.