Float Exposed(float.exposed)

390 pointsby SomaticPiratea day ago24 comments

vismit2000a day ago
This remains one of the best explanations on the topic: https://fabiensanglard.net/floating_point_visually_explained... Saw this when I just started using HN and such posts only inspired me to stick to it: https://news.ycombinator.com/item?id=29368529
- cdavid18 hours ago
  Maybe I am too mathematically enclined, but this was not easy to understand.
  The ELI5 explanation of floating point: they approximately give you the same accuracy (in terms of bits) independently of the scale. Whether your number if much below 1, around 1, or much above 1, you can expect to have as much precision in the leading bits.
  This is the key property, but internalizing it is difficult.
  - ekelsen14 hours ago
    I like "between each power of 2, there are the same number of numbers."
    So between 1/2 and 1 there are the same number of numbers as between 1024 and 2048. If you have 1024 numbers between each power of 2, then each interval is 1/2048 in the first case and 1 in the second case.
    I reality there are usually:
    bfloat16: 128 numbers between each power of 2
    float16: 1024 numbers between each power of 2
    float32: 2*23 numbers (~8 million) between each power of 2
    float64: 2*52 numbers (~4.5 quadrillion) between each power of 2
    13 hours ago
    undefined
  - hnuser12345617 hours ago
    Or, say, you can write any number you want, but it has to be a whole number from 0 to 9, and you can only make the number bigger or smaller by moving the decimal point, and you can only move the decimal point up to 10 spaces. And you can add or remove a negative sign in front.
- larodi20 hours ago
  and makes so much sense in the context of this recent blog from TM's researchers
  https://news.ycombinator.com/item?id=45200925
- nesk_a day ago
  I've never seen this topic so well explained, thank you for sharing!
  - Etherlord8720 hours ago
    From just scanning through the article quickly, I don't see there the mantissa similarly easily explained, but there is a very intuitive way to think of it. Because the mantissa (like everything else) is encoded in binary, the first explicit (because there's implicit 1. at the beginning) digit of it means either 0/2 or 1/2 (just like in decimal the first digit after the dot means either 0/10 or 1/10 or 2/10...), the next digit is (0/2² = 0/4) or 1/4, third digit is 0/8 or 1/8 etc. You can visualize this by starting at the beginning of the "window", and then you divide the window into 2 halves: now the first digit of the mantissa tells you if you stay at the beginning of the first half, or move to the beginning of the 2nd half. Now whatever half you picked, you divide it into 2 halves again, and use the next bit of mantissa to tell you if you should advance to the next half. So you just keep subdividing, and so the more bits in the mantissa you have, the more you can subdivide the window, and if the exponent (after applying bias) is equal exactly the number of explicit bits in the mantissa, the smallest subdivision cell has length equal exactly 1. Incrementing exponent by 1 will now double the window size, and without additional subdivision each cell has length equal exactly 2 meaning each next float number now increments by 2.
    (keep in mind there is a subnormal range where there's implicit 0. at the beginning of the mantissa instead)
    To reiterate, increasing the exponent by 1 doubles the window size, so the exponent describes how many times the window size was doubled while the number of bits of mantissa describes how many times you can do the reverse and "half" it, hence the exponent to mantissa bits relation.
- a day ago
  undefined
jjcoba day ago
One problem I was struggling with: What's the shortest, unambiguous decimal representation of a float?
For example, if you use single precision floats, then you need up to 9 digits of decimal precision to uniquely identify a float. So you would need to use a printf pattern like %.9g to print it. But then 0.1 would be output as 0.100000001, which is ugly. So a common approach is to round to 6 decimal digits: If you use %.6g, you are guaranteed that any decimal input up to 6 significant digits will be printed just like you stored it.
But you would no longer be round-trip safe when the number is the result of a calculation. This is important when you do exact comparisons between floats (eg. to check if data has changed).
So one idea I had was to try printing the float with 6 digits, then scanning it and seeing if it resulted in the same binary representation. If not, try using 7 digits, and so on, up to 9 digits. Then I would have the shortest decimal representation of a float.
This is my algorithm:
```
    int out_length;
    char buffer[32];
    for (int prec = 6; prec<=9; prec++) {
        out_length = sprintf(buffer, "%.*g", prec, floatValue);
        if (prec == 9) {
            break;
        }
        float checked_number;
        sscanf(buffer, "%g", &checked_number);
        if (checked_number == floatValue) {
            break;
        }
    }
```
I wonder if there is a more efficient way to determine that shortest representation rather than running printf/scanf in a loop?
- lifthrasiira day ago
  Your problem is practically important as it can be considered as the "canonical" string representation of given floating point number (after adding the closestness requirement). There are many efficient algorithms as a result, selected algorithms include Dragon4, Grisu3, Ryu and Dragonbox. See also Google's double-conversion library which implements first two of them.
  - jjcob21 hours ago
    Thank you for pointing me to all the prior work on this!
    I am surprised how complex the issue seems to be. I assumed there might be an elegant solution, but the problem seems to be a lot harder than I thought.
    pklausler13 hours ago
    See also the code in LLVM Flang for binary/decimal conversions. Minimal-digit decimals don’t come up in Fortran as often as other languages, but I use them when allowed by the output format. Interestingly, the last digit of a minimal decimal conversion is often irrelevant, so long as it exists.
  - unnah20 hours ago
    Since such algorithms were developed in the 1990's, nowadays you can expect your language's standard library to use them for float-to-decimal and decimal-to-float conversions. So all you need to do in code is to print the float without any special formatting instructions, and you'll get the shortest unique decimal representation.
    lifthrasiir20 hours ago
    Except that C specifies that floating point numbers should be printed in a fixed precision (6 decimal digits) when no precision is given. Internally they do use some sort of float-to-decimal algorithms [1], but you can't get the shortest representation out of them.
    [1] Some (e.g. Windows CRT) do use the shortest representation as a basis, in which case you can actually extract it with large enough precision (where all subsequent digits will be zeros). But many libcs print the exact representation instead (e.g. 3.140000000000000124344978758017532527446746826171875 for `printf("%.51f", 3.14)`), so they are useless for our purpose.
    unnah18 hours ago
    Ok, you got me there. Looks like they fixed that in C++17 with std::to_chars. https://en.cppreference.com/w/cpp/utility/to_chars.html
    Sharlin18 hours ago
    That's what the %g format specifier is for.
    printf("%f\n", 3.14); // 3.140000 printf("%g\n", 3.14); // 3.14
    lifthrasiir18 hours ago
    This is a common misconception. Quoting the ISO C (actually the final draft of C23, but should be enough for this purpose):
    > g, G: A double argument representing a floating-point number is converted in style f or e (or in style F or E in the case of a G conversion specifier), depending on the value converted and the precision. Let P equal the precision if nonzero, 6 if the precision is omitted, or 1 if the precision is zero. Then, if a conversion with style E would have an exponent of X: if P > X ≥ −4, the conversion is with style f (or F) and precision P − (X + 1). otherwise, the conversion is with style e (or E) and precision P − 1.
    Note that it doesn't say anything about, say, the inherent precision of number. It is a simple remapping to %f or %e depending on the precision value.
    Sharlin18 hours ago
    Hmm, is that then just an extension that's de facto standard? Every compiler I tried at godbolt.org prints 3.140000 and 3.14 respectively.
    lifthrasiir18 hours ago
    3.14 is the correct answer for both %g and the shortest possible representation. Try 1.0 / 3.0 instead; it won't show all digits required for round-tripping.
- Etherlord8720 hours ago
  Don't worry about negative comments, yes, this is not the best way, but (if there's no error, I didn't analyze it thoroughly) it's often good enough; if it works, then it works.
  I once wanted to find a vector for which Euler rotation (5°, 5°, 0) will result with the same vector, so I just ran a loop of million iterations or so which would take a starting vector, translate it randomly slightly (add a small random vector) and see if after rotation it's closer to original than the previous vector would be after rotation, if not, discard the change otherwise keep it. The script ran for a couple seconds on Python and with decreasing translation vector based on iteration number, I got perfect result (based on limited float precision of the vector). 2s would be terribly slow in a library but completely satisfying for my needs :D
  - oasisaimlessly13 hours ago
    FYI, you can solve your problem in closed form by converting your Euler angles into the 'rotation vector'[1] or 'axis-angle'[2] representations and then normalizing the resulting vector.
    [1]: https://en.wikipedia.org/wiki/Axis%E2%80%93angle_representat...
    [2]: https://en.wikipedia.org/wiki/Axis%E2%80%93angle_representat...
    Etherlord878 hours ago
    Cool! Thanks for that; it makes perfect sense: if you convert the rotation to axis-angle, and then take that axis, obviously a vector rotated around the axis defined by itself won't change.
    I just tested it:
    from bpy import context as C from mathutils import Vector, Euler from math import radians as rad r = Euler((rad(5), rad(5), 0)) ob = C.object ob.rotation_euler = r ob.rotation_mode = 'AXIS_ANGLE' a, x, y, z = ob.rotation_axis_angle v = Vector((x, y, z)) print(v) v.rotate(r) print(v) print("--")
    Can be done without using an object:
    from mathutils import Vector, Euler from math import radians as rad r = Euler((rad(5), rad(5), 0)) v = Vector(r.to_quaternion().axis) print(v) v.rotate(r) print(v) print("--")
  - Sharlin18 hours ago
    I guess that's faster than working out the eigenvectors by hand.
- IshKebab15 hours ago
  > I wonder if there is a more efficient way to determine that shortest representation rather than running printf/scanf in a loop?
  Yes, just `printf("%f", ...);` will get you that.
  The actual algorithms to do the float->string conversion are quite complicated. Here is a recent pretty good one: https://github.com/ulfjack/ryu
  I think there's been an even more recent one that is even more efficient than Ryu but I don't remember the name.
- forrestthewoodsa day ago
  std::numeric_limits<float>::max_digits10
  https://en.cppreference.com/w/cpp/types/numeric_limits/max_d...
- burnt-resistor21 hours ago
  No, just no. And, never use sscanf().
  This is totally pointless when serialization to and from an unsigned integer that's binary equivalent would be perfectly reversible and therefore wouldn't lose any information.
  double f = 0.0/0.0; // might need some compiler flags to make this a soft error. double g; char s[9]; assert(sizeof double == sizeof uint64_t); snprintf(s, 9, "%0" PRIu64, *(uint64_t *)(&f)); snscanf(s, 9, "%0" SCNu64, (uint64_t *)(&g));
  If you want something shorter, apply some sort of heuristic that doesn't sacrifice faithful reproduction of the original representation, e.g., idempotency.
  - Lvl999Noob21 hours ago
    Off-topic, For this kind of pointer casting, shouldn't you be using a union? I believe this is undefined behaviour, as written.
    Someone18 hours ago
    As written, it is UB, yes, but certainly in C++, and, I think, also in C, using a union is undefined behavior, too. I think (assuming isn’t and float to be of the same size) the main risk is that, if you do
    union { float f; int i; } foo; foo.f = 3.14; printf(“%x”, foo.i);
    that the compiler can think the assignment to foo.f isn’t used anywhere, and thus can chose not to do it.
    In C++, you have to use memmove (compilers can and often do recognize that idiom)
    eska20 hours ago
    Yes, it violates the standard, although in practice it should work because the alignment is the same.
  - lifthrasiir21 hours ago
    If the goal is just to reliably serialize and deserialize floating point numbers, use `%a` instead. It will print and read hexadecimal floating point literals.
ridiculous_fisha day ago
My favorite FP Fun Fact is that float comparisons can (almost) use integer comparisons. To determine if a > b, reinterpret a and b as signed ints and just compare those like any old ints. It (almost) works!
The implication is that the next biggest float is (almost) always what you get when you reinterpret its bits as an integer, and add one. For example, start with the zero float: all bits zero. Add one using integer arithmetic. In int-speak it's just one; in float-speak it's a tiny-mantissa denormal. But that's the next float; and `nextafter` is implemented using integer arithmetic.
Learning that floats are ordered according to integer comparisons makes it feel way more natural. But of course there's the usual asterisks: this fails with NaNs, infinities, and negative zero. We get a few nice things, but only a few.
- Sniffnoya day ago
  This isn't accurate. It's true for positive numbers, and when comparing a positive to a negative, but false for comparisons between negative numbers. Standard floating point uses sign-magnitude representation, while signed integers these days use 2s-complement. On negative numbers, comparisons are reversed between these two encodings. Incrementing a float as if it were an integer will, in ordinary circumstances, get you the next one larger in magnitude, but with the same sign -- i.e., you go up for positives but down for negatives. Whereas with signed integers, you always go up except when there's an overflow into the sign bit.
  A more correct version of the statement would be that comparison is the same as on sign-magnitude integers. Of course, this still has the caveats you already mentioned.
  - adgjlsfhk1a day ago
    One of the unambiguously nice things about Posits (unlike floats) is that they use a 2s compliment scheme which makes it actually true for all values that they sort like integers.
    Sniffnoya day ago
    I was going to say "well, you've still got NaR", but apparently that's been defined to sort less than all other posits? Huh, OK.
    adgjlsfhk118 hours ago
    yeah. having a total order just makes everything so much nicer. total order is one of the defining properties of the reals, and realistically if the user calls sort (or puts one in a B-tree), you have to put the NaNs at one side or the other (unless you're C/C++ and allow that to launch the nukes)
  - ridiculous_fisha day ago
    You're right, thank you for the correction.
- Sharlin18 hours ago
  For what it's worth, here's the Rust std implementation [1] of the total (as in, makes NaNs comparable) comparison algorithm given by IEE 751:
  let mut left = self.to_bits() as i32; let mut right = other.to_bits() as i32; // In case of negatives, flip all the bits except the sign // to achieve a similar layout as two's complement integers left ^= (((left >> 31) as u32) >> 1) as i32; right ^= (((right >> 31) as u32) >> 1) as i32; left.cmp(&right)
  [1] https://doc.rust-lang.org/src/core/num/f32.rs.html#1348
- splicera day ago
  Thanks for HexFiend man! I use several times a week.
SomaticPiratea day ago
This came during my OMSCS Game AI course as an example of the dangers of using floats to represent game object location. If you get further from the origin point (or another game element you are referencing from) you are losing precision as the float needs to use more of the significand to store the larger value.
- efskapa day ago
  I love how that became part of the "mythology" of Minecraft as the "Far Lands", where travelling far from the world origin made terrain generation and physics break down, subtly at first, and less so as you kept going.
  It's like the paranormal trope of an expedition encountering things being disconcertingly "off" at first, and then eventually the laws of nature start breaking down as well. All because of float precision.
  - Terr_11 hours ago
    That makes me think of games where the story places the protagonist in a simulation, which is incredibly convenient for the real-world developers and authors. All the quirks and limitations of the real game (invisible barriers, fail on killing a story-critical NPC, etc.) can be blamed on issues in the fictional one.
    For example, the Assassin's Creed series.
  - Frotag12 hours ago
    Was curious what this looked like so did a google.
    From the wiki, far lands ("spongy walls of terrain") aren't caused by precision loss but integer overflow in the terrain generation.
    https://minecraft.wiki/w/Far_Lands
- pixelfarmera day ago
  If you have a large set of lets say floats in the range between 0 and 1 and you add them up, there is the straightforward way to do it and there is a way to pair them all up, add the pairs, and repeat that process until you have the final result. You will see that this more elaborate scheme actually gets a way more accurate result vs. simply adding them up. This is huge, because there is a lot of processing done with floats where this inherent issue is entirely disregarded (and has an impact on the final result).
  Donald Knuth has all of that covered in one of his "The Art of Computer Programming" books, with estimations about the error introduced, some basic facts like a + (b + c) != (a + b) + c with floats and similar things.
  And believe it or not, there have been real world issues coming out of that corner. I remember Patriot missile systems requiring a restart because they did time accounting with floats and one part of the software didn't handle the corrections for it properly, resulting in the missiles going more and more off-target the longer the system was running. So they had to restart them every 24 hours or so to keep that within certain limits until the issue got fixed (and the systems updated). There have been massive constructions breaking apart due to float issues (like material thicknesses calculated too thin), etc.
  - KeplerBoya day ago
    There are some really simple examples of that. Just try adding 1 to a half precision float in a loop. The accumulator will stop increasing at a mere 2048, since 2049 is not representable it rounds 2048 + 1 back down to 2048.
- mwkaufmaa day ago
  Define boundary conditions -- how much precision do you need? Then you can compute the min/max distances. If the "world" needs to be larger, then prepare to divide it into sectors, and have separate global/local coordinates (e.g. No Man's Sky works this way).
  Really though, games are theater tech, not science. Double-Precision will be more than enough for anything but the most exotic use-case.
  The most important thing is just to remember not to add very-big and very-small numbers together.
  - kbolino15 hours ago
    The problem with double-precision in video games is that the GPU hardware does not support it. So you are plagued with tedious conversions and tricks like "global vs. local coordinates" etc.
    mwkaufma14 hours ago
    100! OTOH - Constant factor scaling between game and render world space fixes a lot (gfx often need less precision than physics). - most view coords are in view or clip space, which are less impacted, so large world coords tend not to code-sprawl even when introduced.
  - skeezyboy19 hours ago
    > Define boundary conditions -- how much precision do you need?
    imagine if integer arithmetic gave wrong answers in certain conditions lol why did we choose the current compromise?
    kbolino15 hours ago
    In my experience, most code that operates on integers does not anticipate overflow or wraparound. So it is almost always guaranteed to produce wrong results when these conditions occur, and is only saved by the fact that usually they doesn't occur in practice.
    It is odd to me that every major CPU instruction set has ALU codes to indicate when these conditions have occurred, and yet many programming languages ignore them entirely or make it hard to access them. Rust at least has the quartet of saturating, wrapping, checked, and unchecked arithmetic operations.
    mwkaufma8 hours ago
    They're not "wrong" -- the error bars are well-defined.
    Signed Integer Overflow OTOH is Undefined Behavior, so it's worse.
- danhau20 hours ago
  Kerbal Space Program required a lot of clever engineering to cram an entire solar system into the constrains of 32 bit floats. There are articles and videos out there, highly recommended!
  - mrguyorama6 hours ago
    Didn't they just do the normal "The world is centered on you" trick? And then switched to 64 bit coordinates anyway?
omoikanea day ago
Previously I was using this site to explore floating point representations:
https://www.h-schmidt.net/FloatConverter/IEEE754.html
This one has the extra feature of showing the conversion error, but it doesn't support double precision.
- Etherlord8720 hours ago
  I was looking through the comments to see if someone already mentioned it. Great webpage!
  However the one in OP has an amazing graph intuitively explaining the numeric space partitioning - the vertical axis is logarithmic, and the horizontal is linear for each row on its own, but rows are normalized to fit the range between the logarithmic values on the vertical axis. I guess it's obvious once you're comfortably understanding floats and could do with some explanations for those still learning it.
yuvadama day ago
This is cool, looks visually a lot like a CIDR range calculator [1] I built a few years ago to help me understand network ranges better.
These types of visualizations are super useful.
[1] - https://cidr.xyz
porker12 hours ago
It doesn't seem to have been shared in this thread but my favourite site on this topic is https://0.30000000000000004.com/
CraigJPerrya day ago
The award for "most fun integer" in 32 bit float is 16777217 (and 9007199254740992 for 64bit).
It's sometimes fun to have these kinds of edge cases up your sleeve when testing things.
- lifthrasiira day ago
  For 64-bit, 9007199254740991 is also known as `Number.MAX_SAFE_INTEGER` in JavaScript. Note that this value is not even; the next integer, ..0992 is still safe to represent by its own, but it can't be distinguished from a definitely unsafe integer ..0993 rounded to ..0992.
- simonaska day ago
  You mean ±9,007,199,254,740,993.0 for 64-bit floats. :-)
  For other curious readers, these are one beyond the largest integer values that can be represented accurately. In other words, the next representable value away from zero after ±16,777,216.0 is ±16,777,218.0 in 32 bits -- the value ±16,777,217.0 cannot be represented, and will be rounded somewhat hand-wavingly (usually towards zero).
  Precision rounding is one of those things that people often overlook.
pmarreck15 hours ago
I'm glad this exists, except for the fact that IEEE754 is the devil, and posits are better (not assuming hardware support). (Even better than both are bignum rationals, but those are the slowest.)
- the__alchemist15 hours ago
  It's a compromise that works well in a broad range of approaches. All the alternatives perform better in some domains, and worse in others.
  - pmarreck15 hours ago
    Fair enough. I just don't like the element of surprising inaccuracy.
rwmja day ago
Good job on showing the hexadecimal version. I recently had a challenging C bug where I was using printf("%a") to show the underlying representation, which uses hexadecimal, and it was a little confusing at first. This site would have been helpful.
a day ago
undefined
slatera day ago
Why tf is there a .exposed TLD?
- shooa day ago
  > THE .EXPOSED TLD This TLD is attractive and useful to end-users as it better facilitates search, self-expression, information sharing and the provision of legitimate goods and services. Along with the other TLDs in the Donuts family, this TLD will provide Internet users with opportunities for online identities and expression that do not currently exist. In doing so, the TLD will introduce significant consumer choice and competition to the Internet namespace – the very purpose of ICANN’s new TLD program.
  > .EXPOSED will be utilized by registrants seeking new avenues for expression on the Internet. There is a deep history of progressivity and societal advancement resulting from the online free expressions of criticism. Individuals and groups will register names in .EXPOSED when interested in editorializing, providing input, revealing new facts or views, interacting with other communities, and publishing commentary.
  -- https://icannwiki.org/.exposed
  - Joker_vD16 hours ago
    > better facilitates... the provision of legitimate goods and services.
    Like what? A risque lingerie shop at balls.exposed or something? And new TLDs don't in any way facilitate "better search", you know, nor "information sharing".
    > Along with the other TLDs in the Donuts family
    Sorry, the what family?
    > online identities and expression that do not currently exist.
    What does this phrase even mean?
    > the TLD will introduce significant consumer choice and competition to the Internet namespace – the very purpose of ICANN’s new TLD program.
    "Considered harmful" etc.
    > Individuals and groups will register names in .EXPOSED when interested in editorializing, providing input, revealing new facts or views, interacting with other communities, and publishing commentary.
    Still not sure how "provision of legitimate goods" fits into this. Or the floating point formats, for that matter.
- maxbonda day ago
  For the same reason there's a .sucks TLD. There's a market for it.
  - magackamea day ago
    So windows.sucks and linux.sucks are available and 2000 USD/year, emacs.sucks is 200 USD/year and vi.sucks is already registered (but no website unfortunately)!
    On the other hand linux.rocks and windows.rocks are taken (no website), vi.rocks is 200 USD/year and emacs.rocks is just 14 USD/year.
    microsoft.sucks redirects to microsoft.com, but microsoft.rocks is just taken :thinking:
    8cvor6j844qw_d6a day ago
    Pretty sure there's a domain monitoring service for similarly or something along these lines that buys up domains like these to prevent usage.
    On that note, I've been trying to see if GoDaddy will buy a domain and resell for higher price by searching for some plausibly nice domain names on their site. They haven't took the "bait" yet.
    15155a day ago
    MarkMonitor
noxa14 hours ago
Would be cool if this supported the various fp8 formats that have been shipped on GPUs recently!
librastevea day ago
fantastic explanation and very nice to see both the decimal and binary representations together
https://raku.org defaults to Rationals (Rats) and provides FatRat for arbitrary precision
otherwise even relatively normal calculations (eg what’s the mass of electron in quectogram) fail
- saghm17 hours ago
  "FatRat" is a hilarious name for a numeric type (and I mean this in a purely positive way). I've always had an impression of Perl as a language with a bit of a quirky, fun community, and it's nice to see that might have carried over to Raku as well.
  - librasteve11 hours ago
    well Larry (Wall) was a linguist and adept at coining memorable memes
    saghm10 hours ago
    For sure! I've honestly never written Perl, but I still often end up sharing the "three virtues" with junior engineers I work with. There's a certain timeless quality to a lot of the stuff he's come up with. (Plus, those of us who don't write Perl still owe him a debt of gratitude even just for the `patch` utility!)
tempodox21 hours ago
Fantastic. A visual interactive explanation of how floating-point representation works.
burnt-resistora day ago
Far too superficial because it lacks explanation of non-normal conditions (denormals, zeroes, infinities, sNaNs, and qNaNs) and the complete mapping of values. This isn't properly educational.
elvircrn19 hours ago
Good stuff. Having fp8/fp4 would be great, too!
- lifthrasiir19 hours ago
  There are too many fp8/fp4 formats out there, many with different trade-offs. At least fp16 is standardized...
  - elvircrn19 hours ago
    Yes, all of them would be great.
    lifthrasiir6 hours ago
    Hmm, you successfully nerd-sniped me. AFAIK there are at least:
    - 8-bit floating point types: (S1)E4M3 (finite only, has no-negative-zero variant), (S1)E5M2 (has no-negative-zero variant), E8M0
    - 6-bit floating point types: (S1)E2M3, (S1)E3M2
    - 4-bit floating point types: (S1)E2M1, NF4
    - Block floating point types: MXFP8 (E5M2/E4M3*32+E8M0), MXFP6 (E3M2/E2M3*32+E8M0), MXFP4 (E2M1*32+E8M0), NXFP [1]
    [1] https://arxiv.org/pdf/2412.19821
    Do you have anything to add on?
a day ago
undefined
dorianmariecoma day ago
not obvious you need to press enter to change the value
makeworld16 hours ago
See also https://integer.exposed/
- jcalx13 hours ago
  Can't wait for someone to buy boolean.exposed and teach me about some esoteric representation of booleans in memory that I'd never considered (either that or it's a very simple page).
- aarestad15 hours ago
  ...authored by [Julia Evans](https://jvns.ca) of tech-explainer fame :)
llm_nerd19 hours ago
The best explanation on floats - https://dennisforbes.ca/blog/features/floating_point/underst...
fnord7720 hours ago
assuming this is IEEE 754
KingLancelot13 hours ago
[dead]
GistNoesisa day ago
For a .exposed domain, it's not really shocking.
The real shocking fact about floating point is that they are even used at all.
It's throwing out of the window the most basic property operations on number should have : "associativity" and all that for a gain in dynamic range which is not necessary most of the time.
The associativity we expect to hold is (a+b)+c == a+(b+c) and (ab)c == a(bc) and these don't hold for floats even though most math formulas and compiler optimizations rely on these to hold. It's a sad miracle that everything somehow still works out OK most of the time.
You lose determinism most of the time with respect to compiler optimizations, and platform reproducibility if processor don't exactly respect IEE-754 (or is it IEE-854).
The real problem comes when you want to use parallelism. With things like atomic operations and multiple processor doing things out of order, you lose determinism and reproducibility, or add a need for synchronisation or casting operations everywhere.
Even more problematic, is that because number operations are used so often, they are set in "stone", and are implemented at the hardware level. And they use much more transistor because they are more complex than integer arithmetic.
Real programmers don't use floating points, only sloppy lazy ones do.
Real programmers use fixed point representation and make sure the bounds don't overflow/underflow unexpectedly.
Let's ban all hardware floating-point implementation : Just imagine future alien archeologists having a laugh at us when they look at our chips and think "no wonder they were doomed they can't even do a+b right : its foundations were built on sand".
- the-grumpa day ago
  Real programmers use an abacus and never take roots or calculate logarithms/powers.
  Amazing confidence on display here.
- jcranmer16 hours ago
  > these don't hold for floats even though most math formulas and compiler optimizations rely on these to hold.
  Most compilers have an option like -fassociative-math that explicitly allows optimization under the assumption of associativity and distributivity.
  > Real programmers use fixed point representation and make sure the bounds don't overflow/underflow unexpectedly.
  So you complain that floating-point is bad because it's not associative but then suggest that we use fixed-point instead (which is also nonassociative), but it's okay, because it's fine as long as you do thing that programmers rarely do.
  > Let's ban all hardware floating-point implementation : Just imagine future alien archeologists having a laugh at us when they look at our chips and think "no wonder they were doomed they can't even do a+b right : its foundations were built on sand".
  Ah, you're the kind of person who sees that 0.1 + 0.2 != 0.3 and decides to go on a crusade against floating-point because of it. By the way, fixed point has that exact same bug: it's a fault that is caused by the base being different more than the other principles of the floating-point type.
  Floating-point has trade-offs; a good programmer understands the trade-offs of floating-point and will not ask more of it than it can provide.
- Findecanor21 hours ago
  I haven't seen compilers optimise integer `a+(b+c)` to `(a+b)+c` that much either.
  I'm looking for ways to do that in my home-made compiler back-end. If you've got an example (on compiler explorer, a paper, blog or whatever), I'd be interested in reading about it.
  I'd agree that fixed point would have sufficed in many cases where people use floating point. But floating point can be more convenient, with the increased range providing some protection against overflow.
  - GistNoesis17 hours ago
    These optimizations are often hidden when the compiler need to do some loop interchange optimizations.
    If the order of the operations is important because you don't have associativity then you can't legally do it.
    You can have special flags (-fassociative-math) for floats which allows to treat them as if they are associative but these mean your program result will depend on which optimization the compiler picked.
    And it turns out that these loop reordering optimizations are really useful when you need to do some backward automatic differentiation. Because all the loops are basically iterated in reverse for the automatically generated code of the backward pass.
    But the memory access pattern for the backward pass are not contiguous if you don't interchange the loop order, which the compiler can't do legally because of floats. Nor can he then merge loops together. Which is really useful because if you can merge the forward pass with the backward pass then you don't have to keep values inside a "tape".
    So basically you can't rely on compiler optimizations, so your auto-differentiator can't benefit from existing compiler progress. (You can have look either at Julia Zygote, or enzyme which rely on compiler optimizations chaining well). Or you write backward passes manually.
  - guipsp18 hours ago
    You have for sure seen this in constant propagation.
    Findecanor18 hours ago
    Pruning the data-flow graph depth-first, sure, but moving edges in it is beyond anything I've read so far.
- librastevea day ago
  try rotating & shading triangles in 3D with integer math … FPUs have earned their place
  - GistNoesisa day ago
    https://en.wikipedia.org/wiki/Fixed-point_arithmetic : allows you to have some thing which is integer math but works like floats. It's integer operations and bit shifts so really fast.
    The limitation is the minimal quantization level. But for a 3d engine let's say your base increment is nanometers. Then you set your maximum dimension let's say 1000km. You only have to be able to represent number up 10^20 so 64-bit fixed point number is good enough.
    Do everything in 128-bit fixed point numbers, and float are no more problem for anything scientific.
    lifthrasiira day ago
    In modern systems float ops are often as fast as corresponding integer ops, so fixed point numbers are not necessarily faster now.
    librasteve21 hours ago
    for general computation, I think Rationals (https://raku.org) are a good choice - and Raku has big Int as standard also
    nevertheless, us Weitek guys made 32-bit FPUs to do 3D graphics (pipeline, 1 instruction per clock) P754, IBM, DEC standards to power SGI, Sun etc
    this is still the best format to get graphics throughout per transistor (although the architectures have got a bit more parallel)
    then 64-bit became popular for CAD (32-bit means the wallpaper in your aircraft carrier might sometimes be under the surface of your wall)
    Findecanor17 hours ago
    An alternative numerical notation uses decimals but marks which digits at the end that are repeating. With enough digits, this format can represent all rational numbers that can be written in the standard numerator/denominator format.
    It does of course work with base 2 and exponents as well so you could still be using floating-point format, only with additional meta-data indicating the repeating range. When a result degenerates into a number that can't fit within the number of digits, you would be left with a regular floating-point number.
    I'd want to write a simple calculator that uses this numerical format but I have only been able to find algorithms for addition and subtraction. Every description I've found of the format has converted to the regular numerator/denominator form before multiplication and division.
    simonaska day ago
    No 3D engine in the real world uses 64-bit coordinates. With 32-bit coordinates, you could not hope to represent things in nanometers (you'd be stuck in a cube roughly 4x4x4 meters). Realistically you might choose millimeters, but that would certainly start to produce visible artifacts.
    For games and most simulation, the "soft failure" of gradual precision loss is much more desirable than the wildly wrong effects you would get from fixed-point overflow.
    Etherlord8720 hours ago
    This kind of problem appears also with floats, just later with 32-bit floats than with 64-bit ints.
    And the solution to this problem is to adjust your coordinate space, e.g. make every nanometer represented as `1` but have the containing object matrix have scale fields set to 1e-9.
    So this is not a theoretical problem, just a practical one: the z-fighting you get with floats, would happen much more often with integers - you absolutely can avoid it in both cases, but practically 3D engines are designed with performance in mind, and so some assumptions lead to limitations and you would get more of them with integers.
    GistNoesis20 hours ago
    The https://en.wikipedia.org/wiki/Z-fighting issue is the proof you often need those 64-bits.
    It's kind of a chicken and egg problem where people use floats because there are FPUs available. All the engineering effort which went into dealing with floats and the problem that comes with them, would have been better invested in making integers faster.
    We went onto the wrong path, and inertia keep us going on the wrong path. And now the wrong path is even more tempting because all efforts have made it more practical and almost as good. We hide the precision complexity to the programmer but it's still lurking around instead of being tamed.
    The absolute GPU cluster-fuck with as many floating types as you can write on a napkin while drunk at the bar, mean that at the end of the day your neural network is non-deterministic, and you can't replicate any result from your program from 6 month ago, or last library version. Your simulations results therefore are perishable.
    Inability to replicate results mean that you can't verify weight modifications to your neural networks haven't been tampered by an adversary. So you just lose all fighting chance to build a secure system.
    You also can't share work in a distributed fashion because since verification is not possible you can't trust any computation that you haven't done yourself.
    TinkersW15 hours ago
    On the CPU side, yes 64 bits is a good idea, but when transferring to the GPU you simply make the camera location 0,0,0, and transform everything relative to it, thus you can easily use 32 bit float and have no z-fighting or any other precision related issues(a logarithmic depth buffer also helps).
    Regarding 64 bit double vs 64 bit fixed width, I don't think there is a really good reason to bother with fixed width, it adds more instructions, and will require a custom debug visualizer to inspect the values.
    Bit shifts, at least in SSE/AVX2 etc, are only able to run on a single port, so they actually aren't such a great idea(not sure about scalar, I don't bother to optimize scalar code in this way).
    eska17 hours ago
    Regarding your second paragraph, those issues are equally catastrophic for game engines. Therefore they generally use (float x,y,z,int zone_id) to reset the origin and avoid floating point errors. Think MMOs, open world games, etc. There are talks about this from all the way back to Dungeon Siege up to Uncharted
- imtringueda day ago
  I hope nobody listens to lunatics like you. Fixed point sucks. Who on earth wants to analyze every single algorithm for its numerical stability? If you work with FPGAs, then converting a known to be working algorithm to fixed point is one of the most time consuming things you can do and it eats DSP slices like crazy. Every square or square root operation will cause a shiver to run down your spine because of the lack of dynamic range.
  Edit:
  For those who want more context:
  https://vbn.aau.dk/ws/portalfiles/portal/494103077/WPMC22_Ko...
  Here is a big fat warning:
  >There is one important downside though, which relates to the fact that designing fixed-point algorithms is a significantly more complicated task as compared to similar floating-point based algorithms. This fact essentially has two major consequences: >1) The designer must have an extended mathematical knowledge about the numerical characteristics of the algorithms, and >2) The development time is in some cases longer than for equivalent floating-point systems.
- simonaska day ago
  Quantization is also precision loss.