OP should go through the rest of the code and see if there are other places where collections are potentially operated on by multiple threads.
>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.
That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure sequences of operations are thread-safe? That is, are you sure the object that owns the TreeMap is thread-safe? I wouldn't bet on it.
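To make it concrete, here's a minimal sketch (a hypothetical Inventory class, nothing from the article) of how a synchronized wrapper still leaves a check-then-act sequence racy, and what guarding the whole sequence looks like:

    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;

    class Inventory {
        // Individual get/put calls are now thread-safe...
        private final Map<String, Integer> stock =
                Collections.synchronizedMap(new TreeMap<>());

        // ...but this check-then-act sequence is still a race:
        // another thread can insert the key between containsKey and put.
        void addIfAbsentBroken(String sku) {
            if (!stock.containsKey(sku)) {
                stock.put(sku, 0);
            }
        }

        // The compound operation has to be guarded as a whole, e.g. by
        // synchronizing on the wrapper (the synchronized wrappers document
        // that they use themselves as the lock object).
        void addIfAbsent(String sku) {
            synchronized (stock) {
                if (!stock.containsKey(sku)) {
                    stock.put(sku, 0);
                }
            }
        }
    }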
>Controversial Fix: Track visited nodes
Don't do that! The collection will still not be thread-safe and you'll just fail in some other, more subtle way... either now, or in the future (if/when the implementation changes in the next Java release).
>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.
That's not a good take-away! OP, the problem is that you violated the contract of the collection, which clearly states that it isn't thread-safe. The problem ISN'T the side effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side effect of high CPU utilization.
---------
When working with code that is operated on by multiple threads, the only surefire strategy I found was to make every possible object immutable and to limit any object that could not be made immutable to small, self-contained and highly controlled sections. We rewrote one of the core modules following these principles, and it went from being a constant source of issues to one of the most resilient sections of our codebase. Having these guidelines in place also made code reviews much easier.
This is a very important point! And, in my experience, a common mistake by programmers who aren't great at concurrency.
Let's say you have a concurrent dynamic array. The array class is designed to be thread-safe for the explicit purpose of being shared between threads. You want to access element 10, but you want to be sure that you're not out of bounds. So you do this:
    if (array.size() > 10) {
        array.element(10).doSomething();
    }
It doesn't matter how thread-safe this array class is: these two operations combined are NOT thread-safe, because thread-safety of the class only means that `.size()` and `.element()` on their own are not going to cause any races. But it's entirely possible another thread removes elements in between you checking the size and accessing the element, at which point you'll (at best) get an out-of-bounds crash.

The way to fix it is to either use atomic methods on the class which do both (something like `.element_or_null()` or whatever), or to not bother with a concurrent dynamic array at all and instead just use a regular one that you guard with a mutex (so the mutex is held during both operations, and during whatever other operations other threads perform on the array).
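In Java-ish terms, a rough sketch of both fixes (names like Elements and Item are made up for illustration):

    import java.util.ArrayList;
    import java.util.List;

    class Elements {
        private final List<Item> items = new ArrayList<>(); // plain, non-thread-safe list
        private final Object lock = new Object();

        // Fix 1: guard the whole check-then-act sequence with one lock,
        // and make sure every other access to the list uses the same lock.
        void useTenth() {
            synchronized (lock) {
                if (items.size() > 10) {
                    items.get(10).doSomething();
                }
            }
        }

        // Fix 2: expose an atomic "get or null" style operation so callers
        // never have to combine size() and get() themselves.
        Item tenthOrNull() {
            synchronized (lock) {
                return items.size() > 10 ? items.get(10) : null;
            }
        }
    }

    class Item {
        void doSomething() { /* ... */ }
    }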
Concurrent transactions can screw it up, and you wouldn't notice until an escalation happens.
Ironically, .NET repeated the same mistake with the same outcome - it's just less obvious because for .NET the switch from synchronized-by-default to unsync-by-default happened when generics were introduced, so type safety of the new collections was the part that devs generally paid the most attention to.
This means there are three ways to resolve this:
* Add synchronization with locks, etc.
* Don't share mutable access, e.g. single-owner model with channels.
* Make data immutable: an insight originally from purely functional languages. I believe Google invested quite heavily in this model with Guava.
Rust lets you choose which one of the three to give up, and it also statically prevents all three from happening at the same time.
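A rough Java sketch of the third option, just to illustrate the idea (class and field names are made up): readers only ever see complete, immutable maps, and writers publish a fresh copy.

    import java.util.HashMap;
    import java.util.Map;

    class Snapshot {
        // Readers always see a complete map that will never be mutated.
        private volatile Map<String, Integer> current = Map.of();

        Map<String, Integer> read() {
            return current;
        }

        // Writers copy, modify the copy, then publish the new immutable map.
        // (Guava's ImmutableMap or a persistent structure with structural
        // sharing avoids paying for the full copy on every write.)
        synchronized void put(String key, int value) {
            Map<String, Integer> next = new HashMap<>(current);
            next.put(key, value);
            current = Map.copyOf(next);
        }
    }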
That can still break if a reader makes multiple calls and implicitly assumes that the data structure at large hasn't been mutated between them.
At least in the single-writer scenario, it all breaks down if a lock-free reader requires multiple items that are related to one another. It's an easy oversight to make though.
Fixing it sucks because you usually have to go and add locking to your previously fast and efficient lock-free setup.
This is true -- generally the solution to this is to ask for everything in a single message.
https://mailinator.blogspot.com/2009/06/beautiful-race-condi...
‘Not thread safe’ really means ‘not thread safe’.
The hard part is exactly all the combinations in which locks can be held and given up, of which there is a non-trivial number for more complex programs/data structures. That's where modeling software can help, by iterating over all the possible cases.
On your suggestion to make objects immutable: yes, and then one actually uses different data structures, ones designed to be used as immutable data structures, making use of structural sharing to avoid or limit an explosion in memory usage. So again it's a case of "use the right data structures for the job". Using purely functional data structures makes one side of the problem simply disappear. The other side might not even come up in many cases: what if the threads depend on other versions of the data structure, which the other threads create (i.e. they need intermediate results of other threads)? Then one has a real headache coming one's way, and probably needs yet other data structures.
One additional idea I had, if one really wants to shoehorn this onto the already-used mutable data structure: one could write something that serializes the access attempts before they hit the data structure. That could also include grouping accesses into transactions and only running complete transactions. Which then makes me wonder whether that is getting close to implementing a database...
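A sketch of that serialization idea in plain Java (hypothetical names): funnel every access through a single-threaded executor, so the mutable structure is only ever touched by one thread, and a whole group of operations can be submitted as one "transaction".

    import java.util.TreeMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class SerializedMap {
        private final TreeMap<String, Integer> map = new TreeMap<>(); // only ever touched by 'owner'
        private final ExecutorService owner = Executors.newSingleThreadExecutor();

        // A whole group of operations runs as one task, i.e. one "transaction".
        Future<Integer> incrementAndGet(String key) {
            return owner.submit(() -> {
                int next = map.getOrDefault(key, 0) + 1;
                map.put(key, next);
                return next;
            });
        }
    }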
I don't intend to be mean, I just think the idea of independent execution in a shared memory environment is fundamentally flawed. And the effort in trying to get such follies working should have been put into better process models.
Where did you get that?
Immutability does not mean replicating the 'process model'. It means removing (limiting) an entire class of bugs from your multithreaded code. There will still be shared areas - and those should be controlled.
    public void someFunction(SomeType relatedObject,
                             List<SomeOtherType> unrelatedObjects) {
        ...
        treeMap.put(relatedObject.a(), relatedObject.b());
        ...
        // unrelatedObjects is used later on in the function so the
        // parameter cannot be removed
    }
That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.

You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.
Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.
I haven't personally encountered a buggy comparator without a total order.
I recently wrote a blog post which (while not the main focus) includes a discussion on randomized testing of sort algorithms in the face of inconsistent comparators: https://sunshowers.io/posts/monads-through-pbt/ section 2.
To give one example, a common error is to compare two int values via subtraction instead of using Integer::compare (or an equivalent implementation), which can give incorrect results due to overflow and modulo semantics.
Another common source of comparator bugs is when people compare floats or doubles and they don’t account for NaN, which is unequal to everything, including itself!
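Minimal illustrations of both bugs (these snippets are mine, not from the article):

    import java.util.Comparator;

    class ComparatorBugs {
        // Broken: overflows, e.g. compare(Integer.MIN_VALUE, 1) yields a
        // positive result instead of a negative one.
        static final Comparator<Integer> BY_SUBTRACTION = (a, b) -> a - b;

        // Correct: handles the full int range.
        static final Comparator<Integer> BY_COMPARE = Integer::compare;

        // Broken for NaN: (a < b) and (a > b) are both false when either is
        // NaN, so NaN compares "equal" to everything and the order is
        // inconsistent.
        static final Comparator<Double> NAIVE_DOUBLE =
                (a, b) -> a < b ? -1 : (a > b ? 1 : 0);

        // Correct: Double.compare defines a total order (NaN sorts last).
        static final Comparator<Double> BY_DOUBLE_COMPARE = Double::compare;
    }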
In Java, the usual symptom of comparator bugs is that sort throws the infamous “Comparison method violates its general contract!” exception.
Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.
"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)
Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and no longer have a mental model of your code that corresponds with reality. That is a very dangerous position to be in: it is very easy to find yourself stuck in quicksand not too far afterwards.
Yes. Absolutely.
You don't believe your software is correct because your tests don't fail. You believe your software is correct because you have a mental model of your code. If your tests are not failing but your software is not behaving correctly, then your mental model of your code is broken.
Consider something like Unreal Engine for example. It's not realistic to expect to have a full mental image of the entire system in such a case.
At least in theory the tests are supposed to cover the observable behavior that matters. So I figure if the tests pass all is well. If I still find something broken then I need to add a test case for it.
Sure, but then you divide the larger system into smaller components where each team is responsible for one or few of these individual pieces and the chief architect is responsible for making sure of how the pieces are put together.
> At least in theory the tests are supposed to cover the observable behavior that matters. So I figure if the tests pass all is well. If I still find something broken then I need to add a test case for it.
But you sure as hell hope that the engineer implementing your database has a decent mental model for the thread safety of his code and doesn't introduce subtle concurrency bugs just because his tests are still green. You also hope that he understands that he needs to call fsync to actually flush the data to disk instead of going yolo (systems never crash and disks never fail). How are you supposed to cover the user-observable behavior in this case? You cut off the power supply to your system / pull the disk while writing to the database and assert that all the statements that got committed actually persisted? And how many times do you repeat that test to really convince yourself that you are not leaving behind a bug that will only happen in production systems, say, once every three years?
I am only giving database and multithreading as examples because they are the most obvious, but I think the principle applies more generally. Take the simplest piece of code everyone learns to write first thing in uni, quicksort. If you don't have a sufficient mental model for how that algorithm works, what amount of tests will you write to convince yourself that your implementation is correct?
And then you have ravioli code in the large. It is not going to make it easier to understand the bigger system, but it will make it harder to debug.
https://en.wikipedia.org/wiki/Spaghetti_code#Ravioli_code
If you can reproduce an error, you can fix it. Do that.
If you cannot reproduce it after a day of trying and it doesn’t happen often, don’t fix it.
There is some optimal level of splitting things up so that it's understandable and editable without overdoing it on the abstraction front. We need a term for that.
SQLite famously has more test-related code than database code.
Multithreading correctly is difficult enough that multiple specialized modeling languages exist and those are cumbersome enough to use that most people don't. In practice you avoid bugs there by strictly adhering to certain practices regarding how you structure your code.
You mention fsync but that famously does not always behave as you expect it to. Look up fsyncgate just for starters. It doesn't matter how well you think you understand your code if you have faulty assumptions about the underlying system.
Generally you come across to me as overconfident in your ability to get things right by carefully reasoning about them. Of course it's important to do that but I guarantee things are still going to break. You will never have a perfect understanding of any moderately large system. If you believe you do then you are simply naive. Plan accordingly (by which I mean, write tests).
Tests failing implies the code is incorrect. Tests not failing does not imply that the code is correct.
I don't think that's what's being suggested. Tests not failing when your code does implies that you are missing test cases. In other words things are underspecified.
Haskell is the extreme example of this. If it successfully compiles then it most likely does exactly what you intended but it might be difficult to get it to compile in the first place.
I am really confused. Have you guys never written any multithreaded code? You can write the most disgusting thread-unsafe code without a single lock and be perfectly green on all your tests. And who in the world can write tests to simulate all possible timing scenarios to test for race conditions?
I give multithreading as just the most egregiously obvious example that this "tests can prove correctness" idea is fundamentally broken, but I think it applies more generally.
>Haskell is the extreme example of this. If it successfully compiles then it most likely does exactly what you intended but it might be difficult to get it to compile in the first place.
Absolutely 100% of the safety of Haskell comes from the mental model (functional programming, immutable data structures, etc.) and none from the test cases (although their community appears to do even testing slightly better than others).
> this "tests can prove correctness" idea
You are the only one putting forward such an idea. It's not that I think tests passing proves correctness. It's that I know from experience that I don't fully understand the system. The code that breaks is always a surprise to me because if it wasn't then I would have fixed it before it broke.
So if my code breaks and tests don't catch it then I figure that before I fix it I should add a test for it.
Of course there are some categories such as multithreading that you generally can't test. So you take extra care with those, and then inevitably end up sinking time into debugging them later regardless.
This was the very first comment you made that started this thread:
>> If none of your tests fail then does it really matter?
Sometimes it's just not cost effective to solve every last bit of jank.
If you write software that controls safety critical systems then obviously that statement does not apply. But if you write webapps or games or ...
Fair. Despite the lengthy argument, I don't think our stances are all that drastically different. I am not saying that you shouldn't write tests. Just that the mental model comes first and that the tests are informed by it (in an ideal world the test are an automatically executable specification of your mental model).
Still I don't understand why you insist that "there exists an automated test for it" has to be the definition of whether something matters.
> But if you write webapps or games or ...
In which case it might just be fine to YOLO it without tests.
People have been delivering valuable software without automated tests for decades. Checkout the code of the linux kernel from circa 2005 and see how many tests there are.
I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since
I did something similar with SpotBugs. There were existing warnings I couldn't get time to fix, so I configured Maven to fail if the count exceeded the level at which I enabled it.
This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
Our tests are often written with a list of known exceptions. However, such tests also fail if an exception is no longer needed - with a congratulatory message and a notice that this exception should be removed from the list. This ensures that the list gets shorter and shorter.
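A sketch of that pattern (hypothetical file names, JUnit 5 style): the test fails both when a new violation appears and when a listed one disappears.

    import static org.junit.jupiter.api.Assertions.assertTrue;

    import java.util.Set;
    import java.util.stream.Collectors;
    import org.junit.jupiter.api.Test;

    class KnownViolationsTest {
        // Files currently allowed to violate the rule.
        private static final Set<String> KNOWN_VIOLATIONS = Set.of(
                "legacy/OrderImporter.java",
                "legacy/ReportWriter.java");

        @Test
        void violationsMatchTheKnownList() {
            Set<String> actual = scanForViolations(); // project-specific check

            Set<String> newViolations = actual.stream()
                    .filter(f -> !KNOWN_VIOLATIONS.contains(f))
                    .collect(Collectors.toSet());
            Set<String> fixed = KNOWN_VIOLATIONS.stream()
                    .filter(f -> !actual.contains(f))
                    .collect(Collectors.toSet());

            assertTrue(newViolations.isEmpty(),
                    "New violations introduced: " + newViolations);
            assertTrue(fixed.isEmpty(),
                    "Congratulations, these are fixed! Please remove them from the list: " + fixed);
        }

        private Set<String> scanForViolations() {
            return Set.of(); // placeholder for the real scan
        }
    }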
It worked really well to incrementally improve things without forcing people to deal with them all the time. People would from time to time make sure they cleaned up a number of issues in the files they happened to be working on, but they didn't have to do them all (which can be a problem with strategies that, for example, lint only the changed files but require 0 errors). We threw a small line chart up on one of our dashboards to provide some sense of overall progress. I think we got it down to zero or thereabouts within a year or so.
That feels less arbitrary than a magic number (because it is!) and I've seen it work.
However, management felt kinda burned because that was a bunch of time and, unsurprisingly, nobody was measurably more productive afterwards (it turns out those are just shitty code tidbits, not highly correlated with the areas where it is miserable to make changes). Some of the over-refactorings probably made things harder.
It was a lovely measurable metric, making it an easy sell in advance. Which maybe was the problem idk.
Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.
Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult it is to triage an issue the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time it reveals that your most knowledgeable team members have large knowledge gaps when it comes to comprehending your system.
This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (i.e. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.
With third party libraries, I've too-often found myself reading the code to figure something out, although that's a frustrating enough experience I generally wouldn't wish on other people.
there's nothing pragmatic about it. once I get into the habit of ignoring a few warnings that effectively means all warnings will be ignored
It then creates immense value by avoiding a lot of risk and uncertainty for little effort.
Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not on a shortish term. But staying at zero is nearly free.
This is even more so with those "fifteen flickering tests" or those 23 tests that have been failing and been ignored or skipped for years.
It's also why I commonly set up a CI, testing systems, linters, continuous deployment before anything else. I'll most often have an entire CI and guidelines and build automation to deploy something that will only say "hello world". Because it's much easier to keep it green, clean and automated than to move there later on
They want to take time out to write a lot of unit tests, but they're not willing to change the process to allow/expect devs to add unit tests along with each feature they write.
I'll be surprised if all the tests are still passing two months after this project, since nobody runs them.
I don't typically write tests first so it's not true TDD but it's been a big personal process improvement and quality boost.
For me such gigs are a red flag and immediate turn down (I'm freelancer with enough opportunities, luxury position, I know).
I would consider it really weird if management dictates exactly what tools and steps a carpenter must take to repair a chair. Or when the owner of a hotel tells the chef what steps are allowed when preparing fish. We trust the carpenter or chef to know this best. To know best how to employ their skills given the context.
If management doesn't trust the experts they hire to make the right choice in how they work, what tools they use, what steps they take, etc. that's a red flag: either they are hiring the wrong people (and the micromanaging is an attempt to fix that) or they don't think the experts are expert enough to make decisions on their own.
For me, this goes for tools (e.g. management dictates I must work on their windows machine with their IDE and other software) for processes (management forbids tests, or requires certain rituals around merges etc) and for internals (management forbidding or requiring certain abstractions, design patterns etc)
To be clear: a team, through, or via a management, should have common values and structures and such. And it makes perfect sense for management to define the context (e.g. this is a proof of concept, no need for the rigid quality here. Or we must get these features out of the door before thursday, nothing else matters.) It's when the management dictates how teams or experts must achieve this that it becomes a red flag to me.
I haven't been wrong in this. These red-flags almost always turned out to hint at underlying, deeply rooted cultural problems that caused all the technical troubles.
Wouldn't they just run as part of the build? At least for Java, JUnit tests run as part of the build by default.
Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.
If manual input can generate undefined behavior, if you depend on a human making a correct decision, or if you're dealing with real-world behavior using incomplete sensors to generate a model... sometimes the only reasonable target is "fail gracefully". You cannot expect to generate right outputs from wrong inputs. It's not wrong to blame the user when economics, not just laziness, says that you need to trust the user not to do something unimaginable.
I think this is the kind of situation where a little professionalism would have prevented the issue: Handling uncaught exceptions in your threadpool/treemap combo would have prevented the problem from happening.
When VCs only give you effectively 9 months of runway (3 months of coffees, 9 months of actual getting work done, 3 months more coffees to get the next round, 3 more months because all the VCs backed out because your demo wasn't good enough), move fast and break things is how things are done.
If giving startups 5 years of runway was the norm, then yeah, we could all be professional.
I think you're professing a false dichotomy. Is it unprofessional to "move fast, break things"?
I'm a slow-moving yak shaver partly due to conscious intention. I admire some outcomes from engineers that break things, like big rockets.
I definitely think we learn fast by breaking things: assuming we are scientific enough to design to learn without too much harm/cost.
I work in unmanned aerospace. It started with 55lb quadcopters and got… significantly bigger from there.
I’ve thought a ton about what you’re saying over the last 5-6 years. I have broken things. My coworkers and underlings have broken things. We’ve also spent a bunch of time doing design review, code review, FEA review, big upfront design, and all those expensive slow things.
Here’s, for me, the dividing line: did we learn something new and novel? Or did we destroy kilobux worth of hardware to learn something we could have learned from a textbook, or doing a bit of math, or getting a coworker to spend a few hours reviewing our work before flying it?
And I 100% agree with your last statement: you can “move fast and break things for science” professionally. But… if something breaks when you weren’t expecting it to, the outcome of the post-mortem really should be surprising new knowledge and not “this made it to prod without review and was a stupid mistake”
Most of the time it is. The sorry state of software in general is a testament to that.
Pretty sure what you actually mean is that you treat some warnings as errors, and disable the others. I would find it hard to believe you're building with literally all the compiler warnings enabled all the time.
-Wall,-Wextra,-Werror,-Wno-type-limits,-Wno-attributes,-Wno-deprecated-declarations
And of course some are suppressed with a pragma for a short section of specific segments of the code where you can make the argument that it's appropriate during code review, but those pragmas stick out like a sore thumb.

A lot, yes, but there are definitely a lot more than 2 that you still don't have enabled. Might be worth looking into them if you haven't already -- you will definitely disable some of them in the process.
(And I assume you already have clang-tidy, but if not, look into that too.)
But having shared, mutable state without locks or other protection should never have passed code review in the first place. I would look at that commit to try to determine how it happened, then try to change the team's processes to prevent that.
The way people structure code, it might even be non-obvious that there's a use of something like TreeMap, as it will be abstracted away into "addNode" method.
Still a red flag, since the process when introducing threads should be "ensure data structures allow for concurrent write and read, or guard them otherwise", but when the task says "performance improvement" and one's got 14x or 28x improvement already, one might skip that other "bit" due to sheer enthusiasm.
It’s a nuisance issue, sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorrhoids?
There is something going wrong in your body that has hemorrhoids as a downstream effect. Surgery can't fix the root cause.
If you have constipation then consider the following: the large intestine has bacteria that process your undigested food and this can have many nasty consequences. What is going wrong in the small intestine that leads to this?
Otherwise, I see the cleanups and refactoring as part of normal development process. There is no need to put such tasks in Jira - they must be done as preparation for the regular tasks. I can imagine that some companies take agile too seriously and want to micromanage every little task, but I guess lack of time for refactoring is not the biggest problem.
debugging in codebases with a lot of magic (rails) is hard. it can be very difficult to follow calls around unless you're quite the expert. certain styles of programming really frustrate me, but then again I program like a scientist so the kinds of things I'm prone to do frustrate software engineers (for loops nested 8 deep, preference for single character variables, etc.)
Even without corruption a race condition can cause significant performance issues by causing the same work to be done many times with only one result being kept.
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle
For warnings: at least explained in a comment, where it has been decided irrelevant (preferably with a pragma to turn off the warning as locally as possible).
Strange behaviour I prefer to get rid of. I've found code marked (by me at least once!) "not sure why this works, but" which very much no longer works, and had to rewrite in a rush where if it had been addressed earlier there would have been more time to be careful.
The only time to fix it is just after it is discovered - or else mostly never, because it becomes expensive to build up the context in your mind again.
You need to find more disciplined teams.
There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It a team culture factor that mostly just involves staying on top of these concerns as soon as they're encountered so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.
Switching that 500k LOCs Ruby or JavaScript project over to Rust ain't happening quickly in neither small nor big team.
I agree. I hate lingering warnings. Unfortunately, at the time of this bug I did not have static analysis tools to detect these code smells.
I get that YMMV based on the org, but I find that more often than not, it’s expected that you are properly maintaining the software that you build, including the things like fixing warnings as you notice them. I can already feel the pushback coming, which is “no but really, we are NOT allowed to do ANYTHING but feature work!!!!” and… okay, I’m sorry, that really sucks… but maybe, MAYBE, some of that is you being defeatist, which is only adding to this culture problem. Oh well, now’s not the time to get fired for something silly like going against the status quo, so… I get it either way.
But for some solace https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
This is the kind of thing you should probably just do, not seek buy-in for from whoever chooses what you should work on. Unless it is going to take some extreme amount of time.
Hyperbole sure, but made me chuckle and is a nice reminder to check all assumptions.
I myself do try to fix most warnings -- some are just false positives.
So I guess the new user / student was trying to do things in parallel to speed things up when they chopped up their large text file into N (32 or 64) sections based on line number (not separate files), and then ran N copies of perl in parallel, each processing its own set of lines from that one file.
Not only did you have a large amount (for back then) of RAM used by N copies of the perl interpreter (separate processes, not threads, mind you!) each processing its own data; any attempt to swap was also interleaved with frantic seeking to a different section of the same file to read a few more lines for whichever of the N processes was stalled on IO. Also, the Jth process probably had to read J/N of the entire file just to get to its section, so the first section of the file was read N times, the next N-1, then N-2, etc.
We (me and the fellow-student-trusted-as-sysadmin who had the root password) couldn't even get a login prompt on the console. Luckily, I had a logged-in session (ssh from an actual x-terminal - a large-screen "dumb" terminal), and "su" prompted for a password after 20-30 minutes of running it. After another 5-10 minutes, we had a root session and were able to run top and figure out what was going on. Killing the offending processes (after trying to contact the user) restored the system back to normal.
Edit: forgot to say: they had the right idea, but totally didn't understand the system's limitations. It was SEVERELY I/O limited with that hard drive and relatively low RAM, so just processing the data linearly would have easily been the best approach unless the amount of data to be kept would have gotten too large.
It's odd he gets so hung up on NPE being a critical ingredient when that doesn't appear to be present in the original manifestation. There's no reason to think you can't have a bug like this in C just because it doesn't support exceptions.
To me it's all about class invariants. In general, they don't hold while a mutator is executing and will only be restored at the end. Executing another mutator before the invariants are reestablished is how you corrupt your data structure. If you're not in a valid state when you begin you probably won't be in a valid state when you finish.
Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.
There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!
Hopefully someone will invent something like STM [1] in the distant year of 2007 or so [2]. It has actual thread-safe data structures. Not just the current choice between wrong-answer-if-you-dont-lock and insane-crashing-if-you-dont-lock.
[1] https://www.adit.io/posts/2013-05-15-Locks,-Actors,-And-STM-...
A task is an abstraction over those primitives in any language. To my knowledge, TBB's task graph abstracts over a threadpool using exactly that concept.
From what I've seen, Swift is the only language that properly handles concurrency. I'm taking another crack at Rust, but the fact that everyone uses tokio for anything parallel makes me feel like the language doesn't have great support for concurrency; it just has decent typing, which isn't a surprise to anyone.
(Well, go is not even memory safe under data races!)
Also, Java is one of the languages where you can just add `synchronized` as part of the method signature, and while this definitely doesn't solve the problem, I don't think "divorced from the data" is accurate.
I think the primary issue here is that synchronized should be the default case, with optionally lifting that very strict restriction for multi-threaded access.
Nonetheless, objects with methods that would do STM on their inner state would be a pretty cool design.
I've written highly concurrent software with bog-standard hash maps plus channels. There are so many advantages to this style, such as events being linearized (and thus being easy to test against, log, etc).
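For anyone who hasn't seen the style, a rough Java sketch of it (names made up): a single owner thread drains a queue, and it is the only thing that ever touches the plain HashMap.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class EventLoop {
        record Event(String key, int value) {}

        private final BlockingQueue<Event> channel = new ArrayBlockingQueue<>(1024);
        private final Map<String, Integer> state = new HashMap<>(); // owned by exactly one thread

        // Producers on any thread just enqueue events.
        void publish(Event e) throws InterruptedException {
            channel.put(e);
        }

        // A single consumer thread applies events in order, so the plain
        // HashMap never sees concurrent access and the history is linear.
        void run() throws InterruptedException {
            while (true) {
                Event e = channel.take();
                state.merge(e.key(), e.value(), Integer::sum);
            }
        }
    }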
Optimistic concurrency in general is a useful design pattern in many cases, though.
In other words, use Rust.
We have almost one '9' of uptime!
    int balance = 0;
    int get() { ... }
    void set(int balance) { ... }
    void withdraw(int amt) {
        if (amt <= balance) {
            balance -= amt;
        }
    }
it will be flagged immediately in code review as a race condition and/or something that doesn't guarantee its implied balance >= 0 invariant. Because threading.

But lift the logic up into a REST controller like pretty much all web app backends:
GET /balance/get
POST /balance/set
POST /balance/withdraw
And it will sail straight through review. (Because we pretend each caller isn't its own Thread.)

I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.
I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.
[1] This undersynchronization was far from the only problem with that codebase.
...which I don't think I've ever seen before, but nicely explains the problem I saw, and coincidentally is from the same year!
Unlike the author’s hash set proposal it would require almost no memory or CPU overhead and may be more likely to be accepted.
That being said, in the decade plus I’ve used C# I’ve never found that I failed to consider concurrent access on data structures in concurrent situations.
It's not a fix but it is a good mitigation strategy.
Infinite loops are one of the nastiest kind of bugs. Although they are easy to spot in a debugger they can have really unfriendly consequences like OP's "can barely ssh in" situation.
Infinite loops in library code are particularly unpleasant.
Here's a story of a horror bughunt where the main characters are C++, select() and thread brandishing an exception: https://news.ycombinator.com/item?id=42532979
Once the data structure gets into the illegal state, every subsequent thread gets trapped in the same logic bomb, instead of erroring on an NPE which is the more likely illegal state.
Is there a way to ensure that whatever happens (CPU, network overloaded etc) one can always ssh in? Like reserve a tiny bit of stuff to the ssh daemon?
https://www.freedesktop.org/software/systemd/man/latest/syst...
You can also use sibling knobs to increase the CPU and IO weights of the unit, for example by setting this to something higher than 100:
https://www.freedesktop.org/software/systemd/man/latest/syst...
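For example, something along these lines as a drop-in for the SSH unit (a sketch; the unit is ssh.service on Debian-family systems, sshd.service elsewhere, and the numbers are arbitrary):

    # /etc/systemd/system/ssh.service.d/override.conf  (or: systemctl edit ssh.service)
    [Service]
    CPUWeight=1000   # default weight is 100; higher wins more CPU under contention
    IOWeight=500     # same idea for disk IO
    MemoryMin=64M    # try to keep this much of sshd's memory resident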
From time to time I’ll run something dumb on my machine (e.g. an aggressive GC on the wrong repo) and if I don’t catch the ramp-up the machine will lock up until the OOM killer finds the right process. Sufficiently locked up, accessing alternate ttys to kill the offending process won’t work either.
I guess I could just reserve ssh then ssh into it from an other computer but…
java.util.concurrent is one of the best libraries ever. If you do something related to concurrency and don't reach for it to start, you're gonna have a bad time.
The fact that ConcurrentTreeMap doesn't exist in java.util.concurrent should be ringing loud warning bells.
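To be fair, a concurrent sorted map does exist there; it's just built on a skip list instead of a balanced tree, which rather proves the point that skip lists tolerate concurrent modification far more gracefully than a red-black tree:

    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Safe for concurrent use and keeps keys sorted, unlike TreeMap.
    ConcurrentNavigableMap<Long, String> ordered = new ConcurrentSkipListMap<>();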
If you're doing that anyway, it tends to be easier and more reliable to forget about thread safety at the bottom, and instead design your program so that thread safety happens in larger blocks, and you have big chunks of code that are effectively single-threaded and don't have to worry about it.
The concurrent collections are great, but they don’t save you from thinking about concurrent accesses and data consistency at a higher level, and managing concurrent operations externally as necessary.
No 32 core (thread, likely) machine should ever normally be in a state where someone can "barely ssh onto it". Is Java really that janky? Or is "barely ssh onto it" a bit hyperbolic?
The problem is that all processes run with pretty much the same priority level, so there's no differentiation in the scheduler between interactive ssh sessions and other programs that are consuming all the CPU, and most of the time you only realize this fact far after the point where you can SSH in and renice the misbehaving processes.
This would go on for a few minutes in which the machine was completely unresponsive including the mouse.
So it seems possible still to bring a high core count machine to its knees. But something then is indeed very wrong.
I wonder if the full race detector, go run -race, would catch it or not. I also want to explore whether, if the RB tree used a slice instead of two different struct members, that would trigger the runtime race detector. So many things to try when I get back to a computer with Go on it.
As a sidebar: I'm almost distracted by the clarity. The well-formed structure of this article is a perfect template for an AI evaluation of a problem.
It'd be interesting to generate a bunch of these articles, by scanning API docs for usage constraints, then searching the blog-sphere for articles on point, then summarizing issues and solutions, and even generating demo code.
Then presto! You have online karma! (and interviews...)
But then if only there were a way to credit the authors, or for them to trace some of the good they put out into the world.
So, a new metric: PageRank is about (sharing) links-in karma, but AuthorRank is partly about the links out, and in particular the degree to which they seem to be complete in coverage and correct and fair in their characterization of those links. Then a complementary page-quality metric identifies whether the page identifies the proper relationships between the issues, as reflected elsewhere, as briefly as possible in topological order.
Then given a set of ordered relations for each page, you can assess similarity with other pages, detect copying (er, learning), etc.
Why not just divide 100% by number of cores and make that the max, so you don't need to know the number of cores to know the actual utilization? Or better yet, have those microcontrollers that Intel tries to pass off as E and L cores take up a much smaller percentage to fit their general uselessness.
Programming Language Memory Models
"Note that this implementation is not synchronized. If multiple threads access a map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with an existing key is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedSortedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map..."
This is so important. The idea that code is thread-safe just because it uses a thread-safe data structure, and its cousin "This is thread-safe, because I have made all these methods synchronized" are... not frequent, but I've seen them expressed more often than I'd like, which is zero times.
The business logic is thread unsafe because it uses a TreeMap concurrently. It should either use something else or lock around all usages of the TreeMap. It does not seem to be "in itself" wrong, meaning for any other cause than the usage of TreeMap.
I did have to hand-roll something that partially replicates Rust's RwLock<T> recently, but the resulting semantics turned out to be decent, even if not providing the exact same level of assurance.
You wouldn't notice the load from 32 CPU-pegging threads or processes on a 32-core host when ssh'ing in. Sounds like the OP is maybe leaving out what the load on the machine was; maybe it was more like thousands of spinning threads?
In the graphic from the example we would keep track like this:
    low: -    high: -
    low: 11   high: -
    low: 23   high: -
    low: 23   high: 26
Error: now we see item 13, but that is not inside our span!
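That check can be written down directly as the classic BST range-validation walk (a sketch, assuming a plain binary-search-tree node with int keys):

    class Node {
        int key;
        Node left, right;
    }

    class TreeCheck {
        // Walks the tree keeping the (low, high) span every key must fall into.
        // A key outside its span (e.g. seeing 13 inside span (23, 26)) means
        // the structure has been corrupted.
        static boolean isValid(Node n, long low, long high) {
            if (n == null) return true;
            if (n.key <= low || n.key >= high) return false;
            return isValid(n.left, low, n.key) && isValid(n.right, n.key, high);
        }

        static boolean isValid(Node root) {
            return isValid(root, Long.MIN_VALUE, Long.MAX_VALUE);
        }
    }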
- 1 CPU at 100% = 100%
- 10 CPUs at 100% = 100%
- 1 CPU at 100% + 9 at 0% = 10%
Is that not right? Or is CPU utilisation not usually normalised?
[1] https://gist.github.com/ben-manes/6312727adfa2235cb7c5e25cae...