We expected this to hurt performance, but we were unable to measure any impact in practice.
Everyone still working in memory-unsafe languages should really just do this IMO. It would have mitigated this Mongo bug.
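For anyone wanting to do the same, a minimal sketch of that kind of scrubbing wrapper (malloc_usable_size is a glibc extension, and the wrapper name is mine):

#include <malloc.h>  /* malloc_usable_size (glibc extension) */
#include <string.h>
#include <stdlib.h>

/* Paint freed memory with a recognizable pattern before releasing it.
 * Caveat (the subject of this thread): a compiler that understands the
 * builtin semantics of free may elide this memset as a dead store, so a
 * real build needs something like the volatile trick shown further down. */
void scrubbing_free(void *ptr) {
    if (ptr == NULL)
        return;
    size_t n = malloc_usable_size(ptr);
    memset(ptr, 0xDE, n);  /* 0xDE = arbitrary "freed memory" pattern */
    free(ptr);
}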
Looks like this is the default in OpenBSD.
Note that many malloc implementations will do this for you given an appropriate environment; e.g., setting MALLOC_CONF to junk:free will do this on FreeBSD (whose libc malloc is jemalloc; opt.junk is the corresponding mallctl option).
see here: https://godbolt.org/z/rMa8MbYox
I did, of course, test it, and anyway we now run into the "freed memory" pattern regularly when debugging (yes including optimized builds), so it's definitely working.
#include <stdlib.h>
#include <string.h>

void free(void* ptr);      /* redundant redeclaration (already in <stdlib.h>) */
void not_free(void* ptr);  /* opaque function the compiler knows nothing about */

void test_with_free(char* ptr) {
    ptr[5] = 6;
    /* Launder memset through a volatile function pointer so the compiler
     * cannot prove the call is memset and elide the stores as dead. */
    void *(* volatile memset_v)(void *s, int c, size_t n) = memset;
    memset_v(ptr + 2, 3, 4);
    free(ptr);
}

void test_with_other_func(char* ptr) {
    ptr[5] = 6;
    void *(* volatile memset_v)(void *s, int c, size_t n) = memset;
    memset_v(ptr + 2, 3, 4);
    not_free(ptr);
}

[0] https://llvm.org/docs/LangRef.html#llvm-memset-intrinsics
[1] https://gitweb.git.savannah.gnu.org/gitweb/?p=gnulib.git;a=b...
The Linux kernel extensively uses gcc extensions. That doesn't inherently make it insecure.
The C committee gave you memset_explicit. But note that there is still no guarantee that information cannot leak. This is generally a very hard problem, as information can leak in many different ways; it may have been copied by the compiler, for instance. Fully memory-safe languages (so "Safe Rust", but not necessarily real-world Rust) would offer a bit more protection by default, but then there are still side-channel issues.
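For reference, a minimal sketch of its intended use, assuming a C23 toolchain that actually ships memset_explicit in <string.h> (function names here are mine):

#include <string.h>
#include <stdlib.h>

void scrub_and_free(char *secret, size_t n) {
    memset_explicit(secret, 0, n);  /* C23: the stores are never elided */
    free(secret);
}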
Creating memset_explicit won't fix existing code. "Oh but what if maybe" is just cope.
If I do memset then free, then that's what I want to do.
And the way things go I won't be surprised if they break memset_explicit for some other BS reason and then make you use memset_explicit_you_really_mean_it_this_time
Once you accept that optimizing compilers do, well, optimizations, the question is what should be allowed and what not. Inlining "memset" and eliminating dead stores are both simply optimizations which people generally want.
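To make that concrete, a sketch of the pattern this whole thread is about (naming is mine): because the compiler knows the builtin semantics of both memset and free, the stores are provably dead and may be dropped at -O2.

#include <stdlib.h>
#include <string.h>

void scrub_and_free_naive(char *secret, size_t n) {
    memset(secret, 0, n);  /* dead store: the memory is freed immediately after */
    free(secret);
}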
If you want a store not to be eliminated by a compiler, you can make it volatile. The C standard says such a store cannot be deleted by optimizations. The criticism of this was that later undefined behavior could "undo" it by "travelling in time". We made it clear in ISO C23 that this is not allowed (and I believe it never was), against protests from some compiler folks. Compilers still do not fully conform to this, which shows the limited power WG14 has to change reality.
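A minimal sketch of the volatile approach (my naming), where every store is a volatile access that the standard says may not be deleted:

#include <stddef.h>

void scrub_volatile(volatile unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] = 0;  /* volatile access: cannot be optimized away */
}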
> Once you accept that optimizing compilers do, well, optimizations
Why in tarnation is it optimizing out a write through a pointer right before a function call that takes said pointer? Imagine it were any other function besides free; see how ridiculous that sounds?
Many of us who don't like working under such conditions have just moved on to other languages.
Optimizing out a function call to a heap pointer (especially memset) seems wrong to me. You called the function, it should call the function!
But that's the C language again: saving time by not wearing a seatbelt or checking the tire pressure, to shave 10 seconds off a 2-hour trip.
There are still ways to obtain the desired behavior. Just put the call in a DLL or SO that implements what you need. The compiler cannot inspect the behavior of functions across module boundaries, so it cannot tell whether removing the call preserves semantics (for example, the external function might send the contents of the buffer to a file), so it will not remove it.
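A sketch of what I mean, assuming a hypothetical secure_wipe compiled into a separate shared object (and no LTO, which can see across module boundaries):

#include <stdlib.h>
#include <stddef.h>

extern void secure_wipe(void *buf, size_t len);  /* lives in its own .so */

void wipe_and_free(char *ptr, size_t len) {
    /* Opaque call: the compiler must assume the buffer contents are
     * observed, so the wipe cannot be proven dead. */
    secure_wipe(ptr, len);
    free(ptr);
}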
When you "call" a "function" in the source you're not specifying to the compiler that you want a specific opcode in the generated executable, you're merely specifying a particular observable behavior. This is why optimizations such as inlining and TCO are valid. If the compiler can prove that a heap allocation can be turned into a stack allocation, or even removed altogether (e.g. free(malloc(1ULL << 50))), the fact that these are exposed to the programmer as "functions" he can "call" poses no obstacle.
> In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or through volatile access to an object).
Problem is, calling an external library function has a needed side effect: calling that library function. I do not see language that allows simply not doing that based on assumed but unknown function behaviour.

Again, could you please explain how a compiler can decide to remove a call to a function in an external dynamically loaded library, one that is not known at compile time, simply based on the name of the function (i.e. not because the call is unreachable)? I do not see any such language in the standard.
And yes, calling unknown function from a dynamically loaded library totally is a side effect.
> And yes, calling unknown function from a dynamically loaded library totally is a side effect.
The thing is that malloc/free aren't "unknown function[s]". From the C89 standard:
> All external identifiers declared in any of the headers are reserved, whether or not the associated header is included.
And from the C23 standard:
> All identifiers with external linkage in any of the following subclauses (including the future library directions) and errno are always reserved for use as identifiers with external linkage
malloc/free are defined in <stdlib.h> and so are reserved names, so compilers are able to optimize under the assumption that malloc/free will have the semantics dictated by the standard.
In fact, the C23 standard explicitly provides an example of this kind of thing:
> Because external identifiers and some macro names beginning with an underscore are reserved, implementations can provide special semantics for such names. For example, the identifier _BUILTIN_abs could be used to indicate generation of in-line code for the abs function. Thus, the appropriate header could specify
#define abs(x) _BUILTIN_abs(x)
> for a compiler whose code generator will accept it.

What a side effect is, is explained in 5.1.2.3. Calling a function is only a side effect when the function contains a side effect, such as modifying an object, a volatile access, or I/O.
As they're freely replaceable through dynamic loading, and designed for that, I would strongly suggest that they are among the most magical areas of the C standard.
We get a whole section for those in the standard: 7.24.3 Memory management functions
Hell, malloc is allowed to return you _less than you asked for_:
> The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and size less than or equal to the size requested
That was one of the first optimisations we had, back with Fortran and COBOL. Before C existed - and as B started life as a stripped down Fortran compiler, the history carried through.
The K&R book describes the buddy system for malloc, and how its design makes it suitable for compiler optimisations - including ignoring a write to a pointer that does nothing, because the pointer will no longer be valid.
I'd suspect that eliding suitable malloc/free pairs would not break most existing code because most existing code simply does not depend on malloc/free doing anything other than and/or beyond what the C standard requires.
How would you propose that eliding free(malloc(x)) would break "most" existing code, anyways?
Or somebody would try to plug in mimalloc/jemalloc or a debug allocator and wonder what's going on.
Such a program would continue to function as normal; the dirty data would just be left on the stack. If the developer wants to clear that data too, they'd just have to modify the compiler to overwrite the stack just before (or just after) moving the stack pointer.
>Or somebody would try to plug in mimalloc/jemalloc or a debug allocator and wonder what's going on.
Again, that wouldn't be broken. They would see that no dynamic allocations were performed during that particular section. Which would be correct.
> As an example, user kentonv wrote: "I patched the memory allocator used by the Cloudflare Workers runtime to overwrite all memory with a static byte pattern on free". And compiler would, like, "nah, let's leave all that data on stack".
Strictly speaking, I don't think eliding malloc/free would "break" those programs because that behavior is there for security if/when something else goes wrong, not as part of the software's regular intended functionality (or at least I sure hope nothing relies on that behavior for proper functioning!).
> Or somebody would try to plug in mimalloc/jemalloc [...] and wonder what's going on.
Why would mimalloc/jemalloc/some other general-purpose allocator care that it doesn't have to execute a matching malloc/free pair any more than the default allocator?
I'm not sure debug allocators would care either? If you're trying to debug mismatched malloc/free pairs then the ones the compiler elides are the ones you don't care about anyways since those are the ones that can be statically proven to be "self-contained" and/or correct. If you're gathering statistics then you probably care more about the malloc/free calls that do occur (i.e., the ones that can't be elided), not those that don't.
In any case, if you want to use a malloc/free implementation that promises more than the C standard does (e.g., special byte pattern on free, statistics/debug info tracking, etc.) there's always -fno-builtin-malloc (or memset_explicit if you're lucky enough to be using C23). Of course, the tradeoff is that you give up some potential performance.
I think that's an overly narrow reading of the footnote. I don't see an obvious reason why "such names" in the footnote should only cover "some macro names beginning with an underscore" and not also "external identifiers". And if implementations are allowed to define special semantics for "external identifiers", then... well, that's exactly what they did!
In addition, there's still the as-if rule. The semantics of malloc/free are defined by the C standard; if the compiler can deduce that there is no observable difference between a version of the program that calls those and a version that does not, why does it matter that the call is emitted? A function call in and of itself is not a side effect, and since the C standard dictates what malloc/free do the compiler knows their possible side effects.
Furthermore, the addition of memset_explicit and its footnote ("The intention is that the memory store is always performed (i.e. never elided), regardless of optimizations. This is in contrast to calls to the memset function (7.26.6.1)") implies that eliding calls is in fact acceptable behavior when optimizations are enabled. If eliding calls were not permissible when optimizing then what's the point of memset_explicit?
> There should be no case where compiler could assume anything about a function it does not see based simply on it's name.
Again, external identifiers defined by the C standard are reserved. Reserved external identifiers aren't just for show. From the C89 standard:
> If the program defines an external identifier with the same name as a reserved external identifier, even in a semantically equivalent form, the behavior is undefined.
And from C23:
> If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), the behavior is undefined.
This means that yes, under modern compilers' interpretation of UB, compilers can assume things about functions based on their names, because modern compilers generally optimize assuming UB does not happen. The compiler does not need to see the function's implementation, because the standard's specification is the implementation as far as it is concerned.
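To make that concrete, a sketch of the kind of definition that is formally UB; an optimizer remains free to assume that calls to malloc follow the standard's semantics rather than this body:

#include <stddef.h>

void *malloc(size_t n) {  /* UB: "malloc" is a reserved external identifier */
    (void)n;
    return NULL;          /* "never allocate" -- the compiler need not honor this */
}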
> 7.1.3 Reserved Identifiers
> [snip]
> Macro names and identifiers with external linkage that are specified in the C standard library clauses.
> This proposal does not propose any changes to these reserved identifiers.
Furthermore, that paper doesn't make the use of reserved external identifiers not UB, so there's no change there either.
If you need better performance, write your own allocator optimized for your specific use case — it's not that hard.
Besides, if you don't need to clear old allocations, there are likely other optimizations you'll be able to find which would never fly in a system allocator.
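A sketch of one such allocator (all names illustrative): a bump arena for workloads where everything is released at once, so there is no per-free bookkeeping and no clearing at all.

#include <stddef.h>

typedef struct {
    unsigned char *base;  /* backing buffer */
    size_t         used;
    size_t         cap;
} arena;

void *arena_alloc(arena *a, size_t n) {
    size_t aligned = (a->used + 15u) & ~(size_t)15;  /* 16-byte alignment */
    if (aligned > a->cap || n > a->cap - aligned)
        return NULL;                                 /* out of space */
    void *p = a->base + aligned;
    a->used = aligned + n;
    return p;
}

void arena_reset(arena *a) {
    a->used = 0;  /* release everything in O(1) */
}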
* Don't worry about a schema.
* Don't worry about persistence or durability.
* Don't worry about reads or writes.
* Don't worry about connectivity.
This is basically the entire philosophy, so it's not surprising at all that users would also not worry about basic security.
https://www.shodan.io/search?query=mongodb
https://www.shodan.io/search?query=mysql
https://www.shodan.io/search?query=postgresql
But this must be roughly proportional to each database's overall popularity.
That sort of inclination to push off doing the right thing now, to spare yourself a headache today, probably overlaps with "let's just make the db publicly exposed" instead of doing the work of setting up an internal network.
Which is such a cop-out, because there is always a schema. The only questions are whether it is designed and documented, and where it's implemented. Mongo requires some very explicit schema decisions, otherwise performance will quickly degrade.
Kleppmann chooses "schema-on-read" vs. "schema-on-write" for the same concept, which I find harder to grasp mentally, but it does describe when schema validation needs to occur.
But now we can at least rest assured that the important data in MongoDB is just very hard to read, given the lack of schemas.
Probably all of that nasty "schema" work and tech debt will finally be done by hackers trying to make use of that information.
I suspect that this is in part due to historical inertia and exposure to SecDB designs.[0] Financial instruments can be hideously complex and they certainly are ever-evolving, so I can imagine a fixed schema for an essentially constantly shifting time-series universe would be challenging. When financial institutions began to adopt the SecDB model, MongoDB was available as a high-volume, "schemaless" KV store with a reasonably good scaling story.
Combine that with the relatively incestuous nature of finance (they tend to poach and hire from within their own ranks) and the average tenure of an engineer in one organisation being less than 4 years, and you have an osmotic process of spreading "this at least works in this type of environment" knowledge. Add the naturally risk-averse nature of finance[ß] and you can see how one successful early adoption will quickly proliferate across the industry.
0: This was discussed at HN back in the day too: https://calpaterson.com/bank-python.html
ß: For an industry that loves to take financial risks - with other people's money of course, they're not stupid - the players in high finance are remarkably risk-averse when it comes to technology choices. Experimentation with something new and unknown carries a potentially unbounded downside with limited, slowly emerging upside.
Sometimes it comes from a misconception that your schema should never have to change as features are added, and so you need to cover all cases with 1-2 omni tables. Often named "node" and "edge."
I honestly feel like the opposite, at least if you're the only consumer of the data. I'd never really go out of my way to use a dynamically typed language, and since I'm already going to have to do something to get the data into my own language's types, it doesn't really make a huge difference to me what format it used to be in. When there are a variety of clients being used, though, this logic might not apply.
It’s probably better to check what you’re working on than to blindly assume this thing you’ve gotten from somewhere is the right shape anyway.
I never said mongodb was wrong in that post, I just said it accumulated tech debt.
Let's stop feeling attacked over the negatives of tradeoffs
In any case, you quite literally said there was a "lack of schemas", and I disagreed with that characterization. I certainly didn't feel attacked by it; I just didn't think it was the most accurate way to view things from a technical perspective.
The end result is "everyone" kind of knows that if you put a PostgreSQL instance up publicly facing without a password or with a weak/default password, it will be popped in minutes and you'll find out about it because the attackers are lazy and just running crypto-mine malware, etc.
Mongo has spent its entire existence pretending to be a SQL database by poorly reinventing everything you get for free in postgres or mysql or cockroach.
Scylla! Yes, it will store and fetch your simple data very quickly with very good operational characteristics. Not so good for complex querying and indexing.
That being said, the question was genuine: because I don't keep up with the ecosystem, I don't know whether it's ever valid practice to have a NoSQL db exposed to the internet.
Allocators are an interesting place to focus on for security. Chris did amazing work there for Blink that eventually rolled out to all of Chromium. The docs are a fun read.
https://blog.chromium.org/2021/04/efficient-and-safe-allocat...
https://chromium.googlesource.com/chromium/src/+/master/base...
I would rather not use it, but I see that there are legitimate cases where MongoDB or DynamoDB is a technically appropriate choice.
Absence of evidence is not evidence of absence...
Specifically, it looks like the exfiltration primitive relies on errors being emitted, and those errors are what leak the data. They're also rather characteristic. One wouldn't reasonably expect MongoDB to hold onto all raw traffic data flowing in and out, but would absolutely expect them to have the error logs, at least for some time back.
Do other CVE reports come with stronger statements? I’m not sure they do. But maybe you can provide some counterexamples that meet your bar.
It is standard, yes. The problem with it as a statement is that it's true even if you've collected exactly zero evidence. I can say I don't have evidence of anyone being exploited, and it's definitely true.
It is also a pretty standard response indeed. But now that it was highlighted, maybe it does deserve some scrutiny? Or is saying silly, possibly misleading things okay if that's what everyone has always been doing?
What would break if the compiler zero'd it first? Do programs rely on malloc() giving them the data that was there before?
Do yourself a favour, use ToroDB instead (or even straight PostgreSQL's JSONB).
Honestly, aside from the "<emoji> impact" section, which really has an LLM smell (but remember that some people legitimately do this, since it's in the LLM training corpus), this feels more like LLM-assisted writing (translated? reworded? grammar-checked?) than a pure "explain this" prompt.
I did some research with it, and used it to help create the ASCII art a bit. That's about it.
I was afraid that adding the emoji would trigger someone to think it's AI.
In any case, nowadays I basically always get at least one comment calling me an AI on a post that's relatively popular. I assume it's more a sign of the times than the writing...
In hindsight, I would not even have thought about it if not for the comment I replied to. LLM prose fails to keep me reading whole paragraphs; I find myself skipping roughly the second half of every one, which was definitely not the case for your article. I did somewhat skip at the emoji heading, not because of LLMs, but because of a saturation of emojis in some contexts that don't really need them.
I should have written "this could be LLM-assisted" instead of "this feels more like LLM-assisted", but, well, words.
Again, sorry, don't get discouraged by the LLM witch hunt.
If the material is wrong, explain why. Otherwise, shut up.