We expected this to hurt performance, but we were unable to measure any impact in practice.
Everyone still working in memory-unsafe languages should really just do this IMO. It would have mitigated this Mongo bug.
Note that many malloc implementations will do this for you given an appropriate environment; e.g. with jemalloc (FreeBSD's default malloc), setting MALLOC_CONF to "junk:free" (the opt.junk option) will do it.
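If jemalloc is in play, the knob is the opt.junk option, set through the MALLOC_CONF environment variable; a sketch of how one might enable it (the program name is a placeholder):

```shell
# Junk-fill freed memory for one run; jemalloc reads MALLOC_CONF at startup.
MALLOC_CONF="junk:free" ./myprog

# Fill on allocation as well, which also helps flush out use-before-init bugs.
MALLOC_CONF="junk:true" ./myprog
```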
Looks like this is the default in OpenBSD.
see here: https://godbolt.org/z/rMa8MbYox
I did, of course, test it, and anyway we now run into the "freed memory" pattern regularly when debugging (yes including optimized builds), so it's definitely working.
#include <stdlib.h>
#include <string.h>

void free(void *ptr);
void not_free(void *ptr);

void test_with_free(char *ptr) {
    ptr[5] = 6;
    void *(* volatile memset_v)(void *s, int c, size_t n) = memset;
    memset_v(ptr + 2, 3, 4);
    free(ptr);
}

void test_with_other_func(char *ptr) {
    ptr[5] = 6;
    void *(* volatile memset_v)(void *s, int c, size_t n) = memset;
    memset_v(ptr + 2, 3, 4);
    not_free(ptr);
}

[0] https://llvm.org/docs/LangRef.html#llvm-memset-intrinsics
[1] https://gitweb.git.savannah.gnu.org/gitweb/?p=gnulib.git;a=b...
The C committee gave you memset_explicit. But note that there is still no guarantee that information cannot leak. This is generally a very hard problem, as information can leak in many different ways; it may have been copied by the compiler, for instance. Fully memory-safe languages (so "Safe Rust", but not necessarily real-world Rust) would offer a bit more protection by default, but then there are still side-channel issues.
Creating memset_explicit won't fix existing code. "Oh, but what if maybe" is just cope.
If I do memset and then free, then that's what I want done.
And the way things are going, I won't be surprised if they break memset_explicit for some other BS reason and then make you use memset_explicit_you_really_mean_it_this_time.
Once you accept that optimizing compilers do, well, optimizations, the question is what should be allowed and what not. Inlining "memset" and eliminating dead stores are both simply optimizations which people generally want.
If you want a store not to be eliminated by a compiler, you can make it volatile. The C standard says it cannot be deleted by optimizations. The criticism of this was that later undefined behavior could "undo" it by "travelling in time". We made it clear in ISO C23 that this is not allowed (and I believe it never was), against protests from some compiler folks. Compilers still do not fully conform to this, which shows the limited power WG14 has to change reality.
> Once you accept that optimizing compilers do, well, optimizations
Why in tarnation is it optimizing out a write through a pointer right before a function call that takes said pointer? Imagine it's any other function besides free; see how ridiculous that sounds?
* Don't worry about a schema.
* Don't worry about persistence or durability.
* Don't worry about reads or writes.
* Don't worry about connectivity.
This is basically the entire philosophy, so it's not surprising at all that users would also not worry about basic security.
That sort of inclination to push off doing the right thing probably overlaps with "let's just expose the db publicly" instead of doing the work of setting up an internal network, all to save yourself a headache down the line.
Which is such a cop out, because there is always a schema. The only questions are whether it is designed and documented, and where it's implemented. Mongo requires some very explicit schema decisions, otherwise performance will quickly degrade.
Kleppmann uses "schema-on-read" vs "schema-on-write" for the same concept, which I find harder to grasp mentally, but it does describe when schema validation needs to occur.
But now we can at least rest assured that the important data in MongoDB is just very hard to read, given the lack of schemas.
Probably all of that nasty "schema" work and tech debt will finally be done by hackers trying to make use of that information.
I suspect that this is in part due to historical inertia and exposure to SecDB designs.[0] Financial instruments can be hideously complex and they certainly are ever-evolving, so I can imagine a fixed schema for essentially constantly shifting time series universe would be challenging. When financial institutions began to adopt the SecDB model, MongoDB was available as a high-volume, "schemaless" KV store, with a reasonably good scaling story.
Combine that with the relatively incestuous nature of finance (they tend to poach and hire from within their own ranks) and the average tenure of an engineer at one organisation being less than four years, and you have an osmotic process of spreading "this at least works in this type of environment" knowledge. Add the naturally risk-averse nature of finance[ß] and you can see how one successful early adoption will quickly proliferate across the industry.
0: This was discussed at HN back in the day too: https://calpaterson.com/bank-python.html
ß: For an industry that loves to take financial risks - with other people's money of course, they're not stupid - the players in high finance are remarkably risk-averse when it comes to technology choices. Experimentation with something new and unknown carries a potentially unbounded downside with limited, slowly emerging upside.
Sometimes it comes from a misconception that your schema should never have to change as features are added, and so you need to cover all cases with 1-2 omni tables. Often named "node" and "edge."
It’s probably better to check what you’re working on than blindly assuming this thing you’ve gotten from somewhere is the right shape anyway.
I never said mongodb was wrong in that post, I just said it accumulated tech debt.
Let's stop feeling attacked over the negatives of tradeoffs.
You very often have both NoSQL and SQL at scale.
NoSQL is used for high availability of data at scale - iMessage famously uses it for message threads, EA famously uses it for gaming matchmaking.
What you do is have both SQL and NoSQL. The NoSQL is basically caches of resources for high availability. Imagine you are making a social media app... Yes of course you have a SQL database that stores all the data, but you maintain API caches of posts in NoSQL.
Why? This gets to some of your other black vs white insults: NoSQL is typically WAY FASTER than SQL. That's why you use it. It's way faster to read a JSON file from a hard drive than it is to query a SQL database, always has been. So why not use NoSQL for EVERYTHING? Well, because you have duplicated data everywhere since it's not relational, it's just giant caches essentially. You also will get slow queries when the documents get huge.
Anyway you need both. It's not an either/or thing. I cannot believe this many years later people do not know the purpose of SQL and NoSQL and do not understand that it is not a competition at all. You want both!
Mongo has spent its entire existence pretending to be a SQL database by poorly reinventing everything you get for free in postgres or mysql or cockroach.
-JS devs after "Signing In With Facebook" to MongoDB Atlas
AKA me
Sorry guys, I broke it
The end result is "everyone" kind of knows that if you put a PostgreSQL instance up publicly facing without a password, or with a weak/default password, it will be popped in minutes, and you'll find out about it because the attackers are lazy and just running crypto-mining malware, etc.
I would rather not use it, but I see that there are legitimate cases where MongoDB or DynamoDB is a technically appropriate choice.
Absence of evidence is not evidence of absence...
Specifically, it looks like the exfiltration primitive relies on errors being emitted, and those errors are what leak the data. They're also rather characteristic. One wouldn't reasonably expect MongoDB to hold onto all raw traffic data flowing in and out, but would absolutely expect them to have the error logs, at least for some time back.
Do other CVE reports come with stronger statements? I'm not sure they do. But maybe you can provide some counterexamples that meet your bar.
It is standard, yes. The problem with it as a statement is that it's true even if you've collected exactly zero evidence. I can say I don't have evidence of anyone being exploited, and it's definitely true.
It is also a pretty standard response indeed. But now that it was highlighted, maybe it does deserve some scrutiny? Or is saying silly, possibly misleading things okay if that's what everyone has always been doing?
Almost always when you hear about emails or payment info leaking (or when Twitter stored passwords in plaintext, lol), it's from logs. And a lot of the time, logs are in NoSQL because the data is only ever needed in that same JSON format and in a very highly available way (all you Heroku users tailing logs all day, yw), and then almost nobody encrypts phone numbers and emails etc. when they end up in logs.
There's basically no security around logs, actually. They're just snapshots of the backend data being sent around, and nobody ever cares about them.
Anyway it has nothing to do with the choice to use NoSQL, it has more to do with how neglected security is around it.
Btw, in case you are wondering: both the Twitter plaintext-password case and the Rainbow Six Siege data leak you mention were logs that leaked. NoSQL-backed logs, sure, but it's more about the data security around logging IMO.
What would break if the compiler zeroed it first? Do programs rely on malloc() giving them the data that was there before?
Honestly, aside from the "<emoji> impact" section, which really has an LLM smell (but remember that some people legitimately do this, since it's in the LLM training corpus), this feels more like LLM-assisted writing (translated? reworded? grammar-checked?) than the output of a pure "explain this" prompt.
I did some research with it, and used it to help create the ASCII art a bit. That's about it.
I was afraid that adding the emoji would trigger someone to think it's AI.
In any case, nowadays I basically always get at least one comment calling me an AI on a post that's relatively popular. I assume it's more a sign of the times than the writing...