If you want, you can read NULL as UNKNOWN, and suddenly a whole bunch of operations involving them become a lot more intuitive:
1. TRUE OR UNKNOWN = TRUE, because you know you have at least one TRUE already.
2. TRUE AND UNKNOWN = UNKNOWN, because you don't know whether you have two TRUEs or not. It's just out there.
3. UNKNOWN XOR UNKNOWN = UNKNOWN, because it could darn near be anything: TRUE XOR TRUE, TRUE XOR FALSE, FALSE XOR FALSE, FALSE XOR TRUE... Internalizing this is where SQL's use of NULL / UNKNOWN really becomes intuitive.
4. (TRUE AND FALSE) XOR (TRUE OR UNKNOWN) = (FALSE) XOR (TRUE) per #1 = TRUE. See, it's consistent, you just need to keep in mind that if you have a lot of known UNKNOWNs they're quite parasitic and your final outcome is likely to be, itself, an UNKNOWN. Just like in real life!
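These identities are easy to spot-check (a minimal sketch, assuming PostgreSQL, where <> on booleans plays the role of XOR and NULL stands in for UNKNOWN):

SELECT (TRUE OR NULL)                        AS rule_1,  -- TRUE
       (TRUE AND NULL)                       AS rule_2,  -- NULL, i.e. UNKNOWN
       (NULL::boolean <> NULL::boolean)      AS rule_3,  -- NULL, i.e. UNKNOWN
       ((TRUE AND FALSE) <> (TRUE OR NULL))  AS rule_4;  -- TRUE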
> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.
So NULLs are not comparable but they are part of the same set.
It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.
Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.
I believe this confusion comes from conflating the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, the tool will enter NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.
Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLs part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ..." returns several rows for known/measurable values but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing: several rows for known/measurable values but only one row for "UNKNOWN/ERROR".
In the SQL spec, by default, unique indexes consider nulls distinct because you're adding a tuple to the relation, and this is done by equality.
When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.
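A minimal sketch of that distinction (any dialect; table name hypothetical):

CREATE TABLE t (v int);
INSERT INTO t VALUES (1), (NULL), (NULL);

SELECT DISTINCT v FROM t;            -- two rows: 1 and a single NULL group
SELECT count(*) FROM t WHERE v = v;  -- 1: the NULL rows fail even "v = v"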
The behaviour is mathematically well defined, but it doesn’t match boolean logic.
I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.
Two unknown values are assumed to be different, but they are not distinct from each other.
For example, take two boxes; in each box is a die, and the value of the box is the value shown on the die inside. You don't know the value since you can't see the die; it may even change as you manipulate the box, so it is unknown, NULL in SQL. Because of that, you assume the two boxes have different values. They are, however, indistinguishable, so, not distinct. All you need to know is that you hold two boxes with dice in them; which one you hold doesn't matter, and that's what "SELECT DISTINCT" tells you: that you have two boxes and it doesn't matter which is which, even though they have different values.
The literal definition of distinct is:
>recognizably different in nature from something else of a similar type.
If you want to get down to it nothing is "equal" or the same.
Is a temperature measurement of 25C the same as another of 25C? No; these measurements are approximations of the actual values, which are not equal to each other. They are distinct; they have just been lumped into the same 25C group due to the resolution of the measurement. Yet equality works just fine on that value in sql.
I have used SQL for a long time; null handling is weird and inconsistent and a waste of time. For all the language bugs due to the existence of null, at least I can count on null=null and not write garbage like value=param or (param is null and value is null)
Irrelevant. What matters is the meaning in the context of SQL.
> weird and inconsistent and a waste of time. For all the language bugs due to the existence of null
There are necessary, semantic cases that need to be dealt with. How else would you do it?
Also, it's really weird to use "bugs" to refer to well defined and well documented behavior.
It is extremely easy, using partial uniques and the proper operators, to treat nulls as non-distinct values, and the approach we have allows some very important functionality that aligns with standard statistical data treatment, which was influential in how SQL logic was originally designed.
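For example (a sketch, assuming PostgreSQL; table and index names hypothetical, and NULLS NOT DISTINCT needs version 15+):

CREATE TABLE users (id int PRIMARY KEY, email text);

-- Spec default: NULLs are distinct, so any number of NULL emails is allowed
CREATE UNIQUE INDEX users_email_key ON users (email);

-- Partial unique index: allow at most one row with a NULL email
CREATE UNIQUE INDEX users_email_one_null ON users ((email IS NULL)) WHERE email IS NULL;

-- Or, since PostgreSQL 15:
-- CREATE UNIQUE INDEX users_email_nnd ON users (email) NULLS NOT DISTINCT;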
Yes, the measurements are the same.
The actual temperatures probably are not, but measurements are not the same as the thing measured.
By that logic, two unknown (null) measurements are the same regardless of the actual value, which I agree with.
Whether two unknown measurements are the same is unknown.
The values in the db are the same in both cases, which is what I would like my db language to deal with, without making assumptions about what that value actually means.
I see no value in treating null as special when it comes to equality in a sql db; in fact, in my experience it is a hindrance that it does so.
Is 0 for a temp measurement an unknown sentinel or an actual measurement? How about 2,147,483,647? Great, probably not a measurement, but now it's always included in greater-than queries; same with the max negative value and less-than.
Null separates the value into its own distinct group and prevents it from being included in range queries, due to it not being an actual numeric value, while most languages still allow you to compare equality using standard operators.
Sum types would be great in sql, but currently we get a union of the sql type and null, so null for sentinel values it is, except for having to use weird syntax to compare it.
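A sketch of that separation (hypothetical readings table):

SELECT * FROM readings WHERE temp > 0;      -- a sentinel like 2147483647 would match; NULL never does
SELECT * FROM readings WHERE temp IS NULL;  -- the failed measurements, kept out of every range query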
A real sum type would be nice, but when you're using null then you need to accept that null was not designed with your specific use case in mind.
NULL is not a value.
NULL is a statement that a value is not available, for unspecified reasons.
If you want a particular value where a query would return NULL, it's your job to replace the NULLs with the contextually-appropriate value, e.g., using COALESCE(), to provide it.
It's a convenience shortcut to allow more complicated data models to be represented in simpler table structures than a fully normalized NULL-free data model would require, and to provide information about missing data (which can be used with things like COALESCE, where appropriate) when a more complex data model is simplified into a resultset via a query with JOINs, etc.
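For instance, a sketch with a hypothetical customers/orders schema: the outer join manufactures NULLs for customers with no orders, and COALESCE supplies the contextually-appropriate value:

SELECT c.name,
       COALESCE(SUM(o.total), 0) AS lifetime_spend  -- 0 instead of NULL for customers with no orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;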
I do not agree; it is most certainly a value that is stored in the database.
>NULL is a statement that a value is not available, for unspecified reasons.
Again I disagree; it is a value that denotes that the value is of a different type than the one specified in the column's data type. The meaning of that value is for the user to decide. The system is literally storing a value that can be compared against using special equality syntax (IS NULL, IS DISTINCT FROM, etc).
The actual column definition is a sum type defining the possible values in the column; it is a constraint on the values:

columnA int null
columnB int not null
If someone asks you what the value of a column in a result is when null do you say "I don't know" or do you say "null"?
For all these statements about what null means philosophically, and the history of why it is treated the way it is in SQL, there is little compelling argument for the value of having the equals operator always return false when comparing nulls, so that one must instead use a separate syntax to properly compare null values for equality, other than that it's for historical reasons and changing it would be difficult.
This adds no value over typical programming languages where the normal equality operator can be used; it is weird and the source of confusion and even more bugs than typical null handling creates.
I would prefer a database with fully fleshed out sum types rather than marking a column nullable; then a column could be marked as, say, a number + string + special sentinel type, or whatever combination of types makes sense for my application. If that db did exist, I am sure its equality operator would properly compare type and value to give a consistent binary result rather than the nonsense that is SQL null equality.
Your statements about using coalesce don't seem compelling to me, maybe I am misunderstanding: to efficiently search for a value with an index you must use that value. I should be searching with column IS NULL, not Coalesce(column, [special value]) = [special value], which would be extremely inefficient.
Many languages have null coalescing operators and still use the standard equality operators for null. Coalescing to a special numerical value for a numerical column to represent a sentinel is again a waste of time that leads to strange greater-than/less-than issues. Given a type system that allows nulls, I would rather use IS NULL than coalesce (that would be a further step backward), but even better would be = NULL, getting rid of the IS operator and its variants.
> recognizably different in nature from something else of a similar type.
But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.
Edit:
And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.
Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.
Two measurements of 25C are not recognizably different, therefore they are equal, correct, regardless of whether the actual temperatures are the same?
Two measurements of unknown are not recognizably different therefore they are equal in the context of the database.
Having null!=null has never been intuitive to me, especially since every other programming language treats them as equal. I am not hoping this gets changed, I know SQL is too far along for that, but I can still complain about it and agree it's weird.
>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.
It's needed with parametrized sql when your db doesn't support "is not distinct from", which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predicate exists for this very reason: to have yet another way to express equality that includes nulls: https://modern-sql.com/caniuse/T151
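Side by side (a sketch; `t`, `value`, and the `:param` placeholder are hypothetical, engine support per the T151 link above):

-- the verbose workaround:
SELECT * FROM t WHERE value = :param OR (value IS NULL AND :param IS NULL);

-- the standard spelling, where supported:
SELECT * FROM t WHERE value IS NOT DISTINCT FROM :param;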
Indeed, the sqlite page the pull quote is from says as much.
Postgres lets you control that behaviour when creating the constraint (or index)
Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.
Making nulls distinct on a "select distinct" or a "group by" would be completely useless.
In a unique column you'll normally have an index, so NULL becomes a special value in the index; but in a SELECT DISTINCT you probably won't have an index, which means a full scan is performed and then every row has to be compared with every other row.
But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.
The amount of false bullshit doesn't make a qualitative difference.
The only distinction worth making is that people should not take something as truth just because it is written in a book or a blog post, or because the author has a degree.
Most technical writing is actually at the start of the technical journey; there are more people there, and it's more interesting to talk about strategy and ideas when it's not "messy" with internal details and the trade-offs you make.
I encourage folks at any technical level to write, and I encourage folks to check authors' work with a critical eye no matter how long the source has been in the industry; no amount of experience can protect you from being wrong.
People think that if someone wrote a blog post with technical details and it got upvoted, the author somehow has to be an expert.
I quote:
"I graduated top of my class with a BSc in Computer Science [...]. I have a strong background in software engineering and technical leadership"
Which is fine! It's really hard to be truly expert in both. There's a reason why "programmer" and "database administrator" used to be two different professions. I'd like to think that I'm better than your average developer at flogging RDBMSes, but most DBAs I've worked with can still run circles around me when it comes to information modeling and database & query optimization.
It's not getting worse, you're getting better.
HN has for a long time been where I go for whatever you call the tech equivalent of watching stoners think they're having a deep conversation.
Ergonomics matter.
Your link makes the same mistake I already addressed. It conflates nullable booleans with tri-state logic.
Null is not a value. It is the absence of a value.
> The SQL null value basically means “could be anything”.
This is wrong. Null means it could be any valid value but that value is unknown. If the datatype is DATE then the value cannot be boolean TRUE or the string ‘purple’.
If I’m comparing a value of type date to a null I still think it works as it should if value is “unknown”. What greater insight or context do we have if it’s a small-int null?
Because the possible values are known.
> What greater insight or context do we have if it’s a small-int null?
The insight is that null is not a value. It’s not a smallint or a boolean or a date. It’s the absence of a possible value. The only way to see a null as tri-state is to conflate it with a nullable boolean. This is an incorrect mental model which leads to confusion.
If a student is sick and has not taken the exam, yes you could enter -99 to represent they did not take the test. But if you want to find the class average, you would have to do something like this:
select avg(case when score = -99 then null else score end) as class_avg from …
Or you could have entered null to begin with.
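That is, with NULL stored to begin with, the aggregate already does the right thing, since AVG skips NULLs (a sketch; table name hypothetical):

select avg(score) as class_avg from exams;  -- NULL scores are simply ignored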
You have a list of people and you ask if they own a car. You didn't get around to asking George, so that, somehow means he owns a car because you are using boolean logic? Or does it mean he doesn't own a car, because you are using boolean logic?
No, it means you haven't gathered this data point and don't know.
If there are three possible values, TRUE, FALSE and NULL (unknown), then you're probably working with something like Kleene logic. You can't truly be working with Boolean logic, though, any more than you can be doing integer arithmetic when 15.37 is a member of your domain.
To put it another way, if we're talking about the mathematical definition of boolean algebra and not just some programming language's quirky implementation that happens to be called "bool", then boolean values would by definition be non-nullable. That logic that allows nulls has been implemented using the same unified set of keywords and operator names is a pragmatic decision that simplifies the language implementation and spec, not a principled one that tries to be pedantic about mathematical terminology.
That's why it's name-squatting. Rather than introduce a 'kleene' datatype & operations, and let the user opt-in, they decided that in our 'bool' world, the following is not a truism:
a = a or a <> a
This is how it should be.
> Somehow means he owns a car because you are using boolean logic?
This is how it unfortunately is. There are 3 people, and there are 3 people who don't have a NULL car. Therefore George has a car.
CREATE TABLE people(name text, carId uuid);
INSERT INTO people values('Bill', '40c8a2d7-1eb9-40a9-b064-da358d6cee2b');
INSERT INTO people values('Fred', '3446364a-e4a5-400f-bb67-cbcac5dc2254');
INSERT INTO people values('George', NULL);
SELECT Count(*) FROM people WHERE name NOT IN (
SELECT name FROM people WHERE carId = NULL
);
Elsewhere people have argued that NULL propagates, so that your small unknowns infect larger queries. I could get behind that line of thinking, but the query above confidently returns 3.

This is not what you are asking with your query: as someone else stated, NULL is meant to be "UNKNOWN", or "it could be any valid value".
So nothing is ever equal to something that can be anything, because even another NULL (i.e. unknown) value is in general different.
So in the line
SELECT name FROM people WHERE carId = NULL
the condition will always be false. Now if instead you meant to search for the rows where carId is actually unknown, you have to write

SELECT name FROM people WHERE carId IS NULL
And your query will return, as one may expect, 2.

It's worse than that. It's that something is neither equal nor not-equal to something else.
Whether you can implement something differently as a workaround is immaterial. It's weird, per the title.
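For what it's worth, the same subtlety bites even harder when NULLs end up inside a NOT IN list; a sketch reusing the people table from above:

SELECT Count(*) FROM people
WHERE carId NOT IN (SELECT carId FROM people WHERE name = 'George');
-- returns 0, not 2: "x NOT IN (NULL)" is UNKNOWN for every row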
You could say that a boolean column with a NULL value is FALSE like how a lot of programming languages coerce it but if you wanted that you would just make a default of FALSE. The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
You mean neither true nor false?
We bring in the 'Trinity of thought' a priori and forget about the advantages and costs.
You have the principle of the excluded middle, the principle of identity, and the principle of non-contradiction.
If your problem's territory fits, things are convenient.
Once you introduce known, unknown, or unknowable unknowns, the classic model falls down.
Unfortunately you have to choose what to sacrifice based on the context of the problem at hand.
This is exactly where Rice's theorem, the halting problem, etc. arise.
AC from ZF(C) and IID from statistics bring PEM, which gives or forces the classical behavior.
The 'non-trivial' properties of Rice's theorem are exactly those properties where PEM doesn't hold.
The frame problem in machine learning is also related. As are Gödel's (in)completeness theorems.
Sometimes you can use failure as negation, other times you can use methods like accepting that NULL is an unknown.
But you have to decide what you can sacrifice and still solve real problems.
I think thinking of a problem and its complement is better.
E.g. P is the truthy T and co-P is the truthy F. Obviously P = co-P means that the traditional method works, but jump to NP vs co-NP and it doesn't.
NP as chained ORs, co-NP as chained ANDs, is another way of thinking about it.
https://excelbaby.com/learn/the-difference-between-empty-nul...
(and sometimes Missing)
ANSI SQL:1991 provides an optional feature that introduces a special value UNKNOWN to boolean expressions [1]. But few databases support it. MSSQL [2] is one of the few that do. As I understand it, it's not a data type that can be used in tables, but is only a potential result of boolean operations.
[1] https://modern-sql.com/concept/three-valued-logic
[2] https://learn.microsoft.com/en-us/sql/t-sql/language-element...
SQL needs to join 21st century type systems... or needs to get replaced altogether. SQL is the FORTRAN of relational programming languages, but hangs around because every time somebody tries to replace it they throw the relational-theory baby out with the bath-water.
and what is an alternative to sql ... quel?
Kleene's TRUE-FALSE-UNKNOWN logic is weird.
SQL nulls effectively violate the reflexive property of equality, because X=X does not result in a value of TRUE.
And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
So that means that X=X is effectively FALSE in SQL*.
That is a clown language.
*(yes, I have the same opinion about ANSI NaN, but that wouldn't come up so often if Javascript didn't love NaN for some stupid reason.)
NULL is not equivalent to FALSE; it is neither FALSE nor TRUE. It has the same effect as FALSE as the final result of evaluating a WHERE clause condition only because WHERE clause conditions allow a row to be included only when they evaluate strictly to TRUE. But if NULL were equivalent to FALSE in a WHERE clause, then a WHERE clause condition which would evaluate to NULL, when negated, would be equivalent to TRUE; instead it remains NULL, which remains not TRUE.
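A quick illustration of that point (a sketch, assuming a table foo with a nullable column bar, and a dialect with the standard IS NOT TRUE boolean test, e.g. PostgreSQL):

SELECT * FROM foo WHERE bar = 1;                -- rows with a NULL bar are excluded
SELECT * FROM foo WHERE NOT (bar = 1);          -- still excluded: NOT NULL is NULL
SELECT * FROM foo WHERE (bar = 1) IS NOT TRUE;  -- this one finally includes them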
I realized earlier I was using the term NULL, but going forwards let's use the ANSI SQL concept where the null state of a Boolean is called UNKNOWN. You'll have to forgive me for using the term NULL this far, but in my defense the concept of NULL column-values and UNKNOWN boolean expression results are pretty intertwingled.
SELECT * FROM foo WHERE NOT (1 = NULL)
resolves to

SELECT * FROM foo WHERE UNKNOWN

which is the same value that

SELECT * FROM foo WHERE 1 = NULL

resolves to.

So the WHERE clause is treating UNKNOWN/NULL as equivalent to false. The rest of the Boolean algebra is not.
SQL likes to have it both ways. Sometimes UNKNOWN/NULL is equivalent to false sometimes it's not.
It does this because UNKNOWN/NULL Booleans are incredibly inconvenient and tedious and sometimes they'd rather not think about them.
I'd like to do that as well, but this hoary old language won't give me the same permission it gives itself.
I don't think any databases treat `NULL` as `FALSE` in the WHERE clause. `SELECT * FROM foo WHERE bar = NULL` doesn't return rows with a NULL in the bar column. `SELECT * FROM foo WHERE bar != NULL` doesn't return rows without NULL in the bar column. `SELECT * FROM foo WHERE (bar = 'a') = NULL;` doesn't return rows where bar is not equal to `a`[1]. As far as I know every DB treats NULL as what it is, an unknown value.
It also doesn't to my mind violate the reflexive property because NULL is not equal to anything. It is a marker for an unknown value, not a value in and of itself. If you have a database of every person in a room and what color shirt they're wearing, and in your database, Alice and Bob both have NULL in their "shirt_color" column, that does not mean that Alice and Bob have the same color shirt. Nor does it mean that they don't have the same color shirt. Nor does it mean that someone with a green colored shirt has the same color shirt as Bob or Alice. It doesn't mean they don't have a shirt either. It means you don't have a record of/don't know what color their shirts are. You can't violate the reflexive property because you can't say what color shirt they have. You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
1 <> NULL => Boolean UNKNOWN,
so SELECT * FROM foo WHERE 1 <> NULL returns nothing.
1 = NULL => Boolean UNKNOWN,
so SELECT * FROM foo WHERE 1 = NULL returns nothing.
That's the thing that's being treated as FALSE. That UNKNOWN. Not the value of NULL itself.
> You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
That's not how "=" works. If you want a relationship for testing equality that handles unknown, don't call it equality.
Basic properties of equality, from Wikipedia
https://en.wikipedia.org/wiki/Equality_(mathematics)
- Reflexivity: for every a, one has a = a.
- Symmetry: for every a and b, if a = b, then b = a.
- Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
edit:
We can also see the incoherence of this concept when we look at set theory.
Because UNKNOWN booleans are neither true nor false, if you use them in a WHERE clause you get the ugly result that the set of
X ⋃ Xᶜ
is not everything.

I think the difference between how we're looking at this is that, for me, there is no "value" of NULL. NULL has NO value, which is why you can't do `1 = NULL` or `1 <> NULL` and have to specifically use `1 IS NOT NULL` or `1 IS DISTINCT FROM NULL`
> That's not how "=" works. If you want a relationship for testing equality [that] handles unknown, don't call it equality.
> Basic properties of equality, from Wikipedia
> https://en.wikipedia.org/wiki/Equality_(mathematics)
> - Reflexivity: for every a, one has a = a.
> - Symmetry: for every a and b, if a = b, then b = a.
> - Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
Sure, that's all well and good, but equality only can work IF you know what the values are on either side. NULL means you don't know what a value is and therefore you can't answer whether or not a = NULL because NULL isn't known yet.
Or let me put it another way, is the launch price of the iPhone 17 != $799? The answer is neither true nor false, because the launch price of the iPhone 17 is unknown. We can make reasonable guesses, but if I give you a database with entries for iPhones from the first model and placeholder rows for the 17, 18, 19 and 20 models, none of them belong in the list when someone asks "which iPhones cost $799 at launch?" But equally true that none of them belong in the list when someone asks "which iPhones did not cost $799 at launch?"
No, it's being treated as UNKNOWN, and the semantics of SELECT...WHERE only returns rows where the value of the condition is TRUE.
I think you need to look into https://en.wikipedia.org/wiki/Negation_as_failure
Being based on someone's logic is not sufficient. Most weird things are based on some (weird) logic.
1. x ∧ 1 = x (identity law for conjunction)
2. x ∨ 0 = x (identity law for disjunction)
This is the mental model that I use and teach: a NULL is a virus that will likely make everything it touches NULL... unless cordoned off by a COALESCE or IFNULL or something else that explicitly speaks about its NULL-sensitive behavior in its documentation.
So if you see a bug where a WHERE or FILTER clause is unexpectedly failing, and you're in a time crunch, rather than pounding your head against the wall, start sprinkling COALESCE statements around (while being thoughtful about the domain meaning of what you're doing) and it's more likely than not to solve your problem!
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison involves a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` return nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html
Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.
That leads to an even more confusing point: some systems (at least SQLite) overload NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.
And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT( * ) FROM t;` should be the same as `SELECT COUNT( * ) FROM t WHERE t.x = t.x OR t.x <> t.x`, but they are not the same because the latter does not return NULL rows. This has led to real query optimizer bugs. For more examples see this paper https://dl.acm.org/doi/10.14778/3551793.3551818
[1] https://modern-sql.com/concept/three-valued-logic#compatibil...
Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.
Note also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.
(Of course, this is rather awkward in practice, and when NULL is there, it's inevitably going to be used for that instead.)
Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!
Most of the time it's not that important, and people can and do shortcut "null" to mean "not present", but then the issues with using null in equality statements are a result of taking the shortcut, not necessarily of the logic around null.
It’s not.
That said, as someone self-taught in SQL, I agree NULL was not a good choice.
Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.
[1]: https://stackoverflow.com/a/79270181
[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...
[3]: https://www.postgresql.org/docs/current/functions-comparison...
You could imagine a world where instead of:
SELECT ( email, name, outstanding_balance ) FROM accounts WHERE outstanding_balance > 0 AND last_payment > 60 ORDER BY name
the queries looked more like:
accounts: outstanding_balance > 0 & last_payment > 60 => email, ^name, outstanding_balance
There were plenty of contemporary languages that looked like the latter, but few survive to this day. SQL's relative verbosity was not seen as a problem then and is definitely not one today.
Besides, if the verbosity was a problem they could easily shorten it to UNK. That would have been perfectly normal.
I also don't use UNIQUE constraints, so maybe that has something to do with it.
For example I learned the hard way that the DB we use at work does not index NULL values.
And once in a while if I'm tired or stressed I might forget about UNKNOWN and thus that "Col <> 42" does not return rows where Col is NULL.
Not that better naming would prevent such surprises, but I still think the current naming is less than optimal from a pedagogical perspective. At least I see this at times when teaching our support folks SQL (many have domain background and not a technical background).
So NULL can capture multiple distinct concepts: unknown values (say, as a result of a calculation), not applicable (where the data has been deliberately left out), and missing data (possibly due to instrumentation or user error).
If you do it this way you can avoid some confusion as to what NULL represents.
John Doe NULL IS EQUAL John Doe NULL
John Doe NULL NOT EQUAL John Doe UNKNOWN
John Doe UNKNOWN NOT EQUAL John Doe UNKNOWN
Determining if any particular input is NULL or UNKNOWN is a tricky problem, but at least this gets the programmer thinking about it up front and structuring their code to behave in the sanest possible manner.
Why would anyone want to use another database?
Maybe it is a driver-specific issue? I have used Python and Java, and haven't updated any of my code or dependencies because of a major Postgres update
Personally I think upgrades are the one thing MySQL has on Postgres at this point.
Usually it takes almost no time, although it might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0, when Oracle completely rewrote the data format (making things like atomic DDL possible).
Of course, lots of people also think it's a feature. But those aren't very vocal. Anyway, this is a main reason why people keep using old postgres versions, they have to think about upgrading, and they often don't.
I'd posit that only a tiny fraction of PostgreSQL uses have to know or care that vacuuming is a thing because the autovacuum default handle it for them.
For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.
It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.
* use of a cloud provider that favours another database (like SQL server)
* some people claim mysql/maria is faster for them, simpler, or has a better support for replication
* use of sqlite for lightweight or portable apps
* domain specific databases that still use sql as their query language
* someone may want to use another database for fun or to learn something new
In practice, it behaves as "unknown" in some contexts, as "missing value" in other contexts, and sometimes it's just plain WTF like SUM() returning NULL rather than 0 if there are no rows.
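That last one is easy to reproduce (a sketch; PostgreSQL syntax, table name hypothetical):

SELECT SUM(amount)   FROM payments WHERE false;             -- NULL, not 0
SELECT COUNT(amount) FROM payments WHERE false;             -- 0
SELECT COALESCE(SUM(amount), 0) FROM payments WHERE false;  -- the usual workaround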
Going by Wikipedia, I see that SQL is from 1974 and C from 1972. Were there earlier uses/drafts where `null` is "unknown" instead of "unset"?
C. A. R. Hoare, 1965-11:
"In order to enable references to represent partial functional relationships, i.e. ones which do not necessarily yield a value, a special reference value null is introduced. This value fails to refer to a record, and any attempt to use it to refer to a record leads to an undefined result."
In his proposal, Hoare has followed the COBOL 60 terminology for some of the additions to ALGOL, i.e. "record" instead of the word "structure", introduced by IBM PL/I in 1964-12, and "record class" instead of "structure type", hence the "class" of SIMULA 67, from where the word "class" has spread into all OOP languages.
In Hoare's proposal, references a.k.a. pointers could point only to values belonging to user-defined types, i.e. records a.k.a. structures, not to primitive types.
E.F. Codd added nulls to the relational model in 1970, so that does pre-date C. The concept is even older than that, I imagine.
If there might actually not be such a value, you're supposed to change your schema to reflect that.
In all my years, I've never used null for that. If I don't have a value yet then generally I'm not writing any part of the record. I only ever use nulls for the absence of a value. Creating a new table for every potential optional column is the "academic relational way" and also insane. :)
What you're describing is closer to how people do it in practice.
But then if your sensor fails to record a measurement don't you end up with NULL for that row's temperature_id?
LogEntry(LogEntryId, Date)
Temperature(TemperatureId, LogEntryId, DegreesF)
If there is no temperature measured, then you don't create a record in Temperature.
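Though note that reading it back reintroduces the NULL at query time; a sketch against the schema above:

SELECT le.Date, t.DegreesF
FROM LogEntry le
LEFT JOIN Temperature t ON t.LogEntryId = le.LogEntryId;
-- DegreesF comes back NULL for entries with no Temperature row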
I mean sure, you could do yet another table. But honestly that level of normalization is much more work than it's worth. Just because it's some academic definition doesn't make it right.
> But honestly that level of normalization is much more work than it's worth
Yes. I question whether it's worth anything to begin with.
I wonder who first added that macro? Was it there from the beginning?
Just random thoughts...
Which is why I continue to be such an ORM skeptic. I agree that they're convenient. But I do worry that we've now got an entire generation of engineers who regularly interact with relational databases, but have largely been spared the effort of learning how they actually work.
As another commenter pointed out, if you've learned basic relational algebra then the way SQL nulls behave seems obvious and logically consistent. The logic is the same as the logic behind the comparison rules for NaN in IEEE floats. It's the behavior of C-style nulls that is, always and forever, a billion-dollar mistake.
That's an important one. It would be super nice to have a SQL dialect that works more like LINQ where you can compose your queries easily. I always hate it when I have to write SQL directly. It's super powerful but the syntax just isn't designed well. To me it feels like a throwback to the good old FORTRAN or COBOL days: you can get stuff done but modern languages are so much better.
But this has never been their primary purpose and it's not what they are good at. ORMs are supposed to map the relational model into an object oriented model so that you can work with objects rather than sets of tuples. And that's exactly how people use them.
ORMs incentivise people to replace simple and declarative set operations with complex procedural code operating on individual objects.
ORMs are just a terrible idea - conceptually messy, hard to debug and optimise, full of needless complexity.
Entity Framework did try to cater to the "SQL is scary, let me use objects" crowd, and that is the majority of how it's used, and that is a mistake in my opinion. But it is also very good at supporting relational algebra within C# and composing queries dynamically; ironically, it's best at it if you disable or avoid many of its features (dynamic subclassing, linked objects). Pass IQueryables around and compose them together and life is good. Updating needs work, but updates have always been a weakness for SQL too.
and that's why ORMs are so unpopular and entirely absent from successful production applications for the past 30 years
This is true for an HTTP library as much as it is an ORM.
No, it is only true for bad abstractions.
Stop thinking of ORMs as trying to hide the details of SQL and you'll stop hating them. Instead think of them as a way to compose SQL dynamically, with the full power of your language. SQL is an awful language to write application logic in, because it has horrible support for abstraction, composition, encapsulation, dependency injection, etc. The ORM gives you a way to produce SQL in an environment that actually supports basic software engineering principles. Scattering ORM logic everywhere in the codebase is the point: putting all your SQL in one data access layer is like putting all your arithmetic in one calculation layer. Why would you ever do that? What's wrong with seeing a plus sign in more than one file? What's wrong with seeing language-encoded relational logic in more than one file?
I can guarantee you the popularity is not "aesthetic". And convenience is a real thing that actually does reduce costs. People complain about ORMs, but have you seen the absolute horse-shit-level code that people jam into SQL functions and procedures to do the utterly most basic things? The standard for what ends up in SQL Stored procedures is the most unmaintainable garbage in the entire software engineering ecosystem.
The reason for a data layer is because the underlying data representation might change. For example, you might change the schema to handle some new performance requirement. Now you have to hunt down and change everywhere you've queried that table in your whole code base. Every time you directly call SQL you are coupling your code strongly to the database schema which is ultimately an implementation detail.
> And convenience is a real thing that actually does reduce costs.
I know convenience is a real thing. I also know that it very often increases costs in the long run.
Well you certainly can't do it all in SQL, because SQL doesn't support basic software engineering principles at all. That means you're picking a parent language that does support the principles you want, and attempting to build a library that allows you to build, pass around, and compose relational queries in a way that leverages the benefits of the parent language. To do this I'd argue you need these things:
1. a structured representation of the (possibly incomplete) relational query as it's being built, e.g. SQLAlchemy's internal representation or IQueryable in C#
2. a method of building those representations in a way that conforms to the norms of the parent language and leverages its benefits as much as possible, e.g. LINQ or direct use of the various extension methods on IQueryable. It seems like you could pick either SQLAlchemy or SQLAlchemy Core depending on whether you want to think in terms of objects or queries (I'd usually pick queries), but I'm not that familiar with SQLAlchemy.
3. a method of translating the query representation into a particular SQL dialect.
I don't know what exactly your definition of "ORM" is, but I'd argue that if any of those steps involve an object-level representation of the schema of your tables (e.g. "class Employee { int ID }" type of thing) then it's an ORM. Do you need that? Well, no, probably not, but why not? C# is strongly typed, so why wouldn't I tell its type system what the schema of my table is? That's part of leveraging the benefits of the parent language.
> The reason for a data layer is because the underlying data representation might change. For example, you might change the schema to handle some new performance requirement. Now you have to hunt down and change everywhere you've queried that table in your whole code base.
This is an argument for strong typing, not for putting all your code that has the word "select" in it in the same file. And building queries with a strongly typed representation of your schema in the parent language is the point of the ORM!
Without an ORM, you still have to hunt down tons of spots where things change when you change your schema. How do you ever not have to do this?
> Every time you directly call SQL you are coupling your code strongly to the database schema which is ultimately an implementation detail.
There exists some canonical relational model of your domain. This model is not an implementation detail, it is the foundation of most of your code. I choose to keep my SQL schema as close as possible to this canonical relational model as I can. The fact that a Monitoring Instrument has an Operating Status is a fundamental fact about my model; whether that column is called "operating_status" or "OperatingStatus" or "operating_status_ID" is an implementation detail. It's just hard to agree with you that "the database schema" is merely implementation detail -- clearly parts of it are not, unless it's completely untethered from your canonical domain model (which would be a nightmare, of course). Of course your code is strongly coupled to the fundamental domain model upon which it is operating. I'd certainly agree that it'd be nice to not have to worry about implementation details like the exact naming convention of the columns -- which, of course, ORMs let you do.
>> And convenience is a real thing that actually does reduce costs.
> I know convenience is a real thing. I also know that it very often increases costs in the long run.
Code that is easier to write is generally also easier to read and maintain. That's what I mean when I say "convenience". Simplicity and convenience go together.
Django seems to be an outlier on the side of "an ORM that's actually good". Whenever people have specific technical complaints about ORMs, it's generally not a problem or there's already a solution in Django you just have to use. It's usually only when you get to the more conceptual stuff like object-relational mismatch that such complaints tend to apply to Django.
You're not really missing anything other than query builders, which are worse for other reasons.
This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs
That was definitely one part; another part was sharp corners in MySQL (at least as of 20 years ago; I would be surprised if many of them haven't been rounded off in the meantime). The last part was places with no DBA with developers unaware of how to handle schema migrations.
[1] https://www.amazon.com/Database-Technology-Nulls-Considered-...
More importantly, x = value instead of (x = value and x IS NOT NULL) is almost always a mistake, and a stupidly subtle one at that. And for this curse, we get… nothing particularly useful from these semantics.
Also the x != NULL case is completely cursed
It's a three-valued logic (though not trinary, which would use a base-3 number system) in a three-valued algebra: specifically, the relational algebra. The outcome of a logical test has three values: true, false, or NULL; this is distinct from Boole's algebra where outcomes have a continuous value between 0 and 1 inclusive.
Could you explain how this makes sense then?
SELECT ...
WHERE NULL
If NULL is just "unknown" then shouldn't this be a type error? Moreover, could you explain why the heck this ought to be empty?
WITH T AS (SELECT 1 AS C1,
           NULL AS C2)
SELECT C1, C2
FROM T AS T1
INNER JOIN T AS T2
USING (C1, C2);
As a human this looks insane to me, "relational algebra" be damned. You find a row, then you look it up again, and it's not there? What the hell?

I will die on the hill that regular C-like nulls are the actual thing that's weird.
The real billion dollar mistake [1] was the damage it made on the minds of developers.
Most languages nowadays do get nulls right, even PHP of all things.
Ironically NULL is probably the safest pointer value in C, as any dereferences from it (and thousands of pages that follow it) are guaranteed to crash.
That's a waste of a page.
Unless you're talking about some kind of device which supports virtual memory, but also by default sets up mappings, including at 0 which seems weird to me.
Memory mapping on all hardware with which I am familiar (which is not a small integer) is done by pages. If you have to fault on one address in a page (eg. zero page, stack canaries, heap guards) you have to fault on every single address in that page. But "memory is cheap". Until it's not.
Besides, memory access is twice as fast on the zero page! Wait, you're telling me the 6502 isn't the standard anymore? ;p
This depends on the context, no? I doubt there exists someone with a contact list on their phone which has every single field for every single contact populated.
There needs to be some way to codify that a field in a data record is unpopulated. Using the "zero value" for the type of the field (e.g., the empty string) is reasonable, but is this necessarily better than NULL? I reckon an argument can be made that this approach is just as likely to lead to bugs.
I'm not necessarily in favor of NULL, for what it's worth, but I can't think of an adequate replacement which doesn't reduce to "NULL in sheep's clothing".
Null is more confusing because it means different things in different languages. Sometimes it's a more constrained uncertainty, eg this definitely doesn't exist. But in sql it's a less constrained uncertainty, like "undefined" in math. The value of this thing couldn't make sense in this context, but we can't make assertions about its existence.
E. F. Codd thought about this issue.[0]
> Codd indicated in his 1990 book The Relational Model for Database Management, Version 2 that the single Null mandated by the SQL standard was inadequate, and should be replaced by two separate Null-type markers to indicate why data is missing. In Codd's book, these two Null-type markers are referred to as 'A-Values' and 'I-Values', representing 'Missing But Applicable' and 'Missing But Inapplicable', respectively. Codd's recommendation would have required SQL's logic system be expanded to accommodate a four-valued logic system. Because of this additional complexity, the idea of multiple Nulls with different definitions has not gained widespread acceptance in the database practitioners' domain. It remains an active field of research though, with numerous papers still being published.
If your signup form has an optional field for middle name which I don’t fill, it can absolutely be because I don’t have a middle name. It’s undefined and known to be so.
> E. F. Codd thought about this issue.[0]
And because four value logic was left out, nulls have to fulfil multiple incompatible roles, and ends up being weird.
The fact that some NULLs are indeed undefined doesn’t contradict my point that it would be inaccurate to refer to all NULLs as UNDEFINED.
> It’s undefined and known to be so… And because four value logic was left out, nulls have to fulfil multiple incompatible roles, and ends up being weird.
You know it to be undefined, but if you just leave that field blank on the form, the database still can’t know whether that’s because you don’t have one or because you didn’t provide the one you have, unless there was a field on the form that allowed you to explicitly indicate that you don’t have one.
There's the Closed World Assumption in a database.
What's interesting is, do we mean that in our data that attribute has no value? Or do we mean the real-world object represented by the data does not have that attribute?
Does null mean
a) We don't know the value of this attribute for this object, or
b) We do know that there is no value for this attribute in the real-world object represented by our data.
In JavaScript because there is both null and undefined it is easy to assume that undefined means we don't know the value and null means we do know it has no value.
EXAMPLE: The attribute 'spouse'. Some people have a spouse some don't. So what does it mean if the value of the field 'spouse' is null? That we know there is no spouse, or that we don't know who the spouse is if any.
In practical terms we can say null means "We don't know" which includes the case that there is no spouse.
Javascript objects have two kinds of undefined that are both represented by the same value. You have to use another method to see which it is, and I've seen "foo" in this example used for the same thing as "null" in your example:
>> z = {foo: undefined}
Object { foo: undefined }
>> z.foo
undefined
>> z.bar
undefined
>> z.hasOwnProperty('foo')
true
>> z.hasOwnProperty('bar')
false
This is something you have to account for because the key is still there if you try to remove a value by just setting it to undefined:

>> Object.keys(z)
Array [ "foo" ]
>> for (let k in z) { console.info(k); }
foo

This is the right way to remove the key:

>> delete z.foo
true
>> z
Object { }
>> Object.keys(z)
Array []
https://stackoverflow.com/questions/203493/why-does-oracle-9...
That was a fun bug to find out, after having dealt with quite a few other DBs over the years. It was one of those "No, but surely" and "This can't be! This is Oracle!" moments. Found it while porting some old code that needed to store an empty string as being distinct from a NULL in that same column.
The princess's eyes are as blue as the stone of Galveston
Have you seen the princess's eyes?
No!
Have you seen the blue stone of Galveston?
No!
So you're comparing something you've never seen with something else you've never seen!
That's NULL comparison
The value of NULL in a particular table cell is simply a way to indicate 'no value'. If you want the values in a column to be unique, cases where there are no values shouldn't be considered.
This plays out similarly in practice. For example, you may want to allow users to optionally reserve a username, and if they do, those usernames should be unique. It's hard to imagine a use case where by wanting a field to be both optional (nullable) and unique, you mean that the field should be optional for a single record (!) and required for all the rest. Of course, you mean that IF there is a value, THEN it should be unique.
(Although in rare cases that is even weirder: https://stackoverflow.com/a/58998043 )
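A minimal sketch of the optional-username case (names hypothetical):

CREATE TABLE users (id int PRIMARY KEY, username text UNIQUE);

INSERT INTO users VALUES (1, NULL);     -- ok
INSERT INTO users VALUES (2, NULL);     -- also ok: NULLs don't collide
INSERT INTO users VALUES (3, 'alice');  -- ok
INSERT INTO users VALUES (4, 'alice');  -- rejected: duplicate key value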
Too many newbies hear that NULL is bad, so they declare all columns as NOT NULL and end up inserting ad hoc values like 0, -1, '', or {} when they inevitably come across cases where they don't have data. Which is even worse than NULL.
I think we would have been better-off by treating FKs (and maybe outer JOINs) as a special case, and using 2-value logic everywhere else.
If you don't like null semantics, you're free to use sentinel values. You can make all the sentinel values the same, or you can make them all different. Either way, you or someone who has to use your system will be back here tomorrow complaining about how weird it is.
`1 is not distinct from NULL` => false
`NULL is not distinct from NULL` => true
`0 is not distinct from 1` => false
https://modern-sql.com/concept/null
(Note: I am not affiliated with that blog/website in any way, shape, or form.)
Or, the user must define a default value in the query itself.
Yes, tedious; but precise, and it forces the programmer to really prepare for the "unknown" scenario.
https://blog.rustprooflabs.com/2022/07/postgres-15-unique-im...
Practically speaking, I go with not null and always set a default value.
Since the data we were getting was sourced from an RDBMS, I wanted NULL to be a first class concept in the DSL, with similar traits.
Early on, I simply made any expression that involved a NULL result in NULL. Naively this was all well and good, but it failed spectacularly in condition statements.
Instead of A = NULL == false, I had A = NULL == NULL. And, as you can imagine, a single NULL in the expression would just pollute the entire thing, and since NULL was considered as FALSE for conditionals, any NULL in an expression made the entire thing, eventually, FALSE.
Naturally I went back and made the comparison operators always return booleans. But it was a fun little side effect at the time.
Partially, we use ... and I'll be very broad here ... "variables" as boxes we look in for answers. Answers to questions, answers to "Hey I put something in there for the time being to reference later." If I went into programming terms rather than just SQL, sometimes we get meta-answers.
You haven't made the box yet (declared the variable).
You haven't decided how the box is structured (picked a type or a length or something).
Okay, you did those but the box is virgin (nothing has been placed in the box yet).
That kind of thing. An empty set for "yes, you asked but nothing meets those criteria."
And lastly, undefined != null, because undefined relates to structures, indicating that a field was not defined when the structure was created
But let's continue the logic of deleted_at being NULL indicating an active account, which seems to be the intent here. You end up doing things like:
SELECT /* ... */
FROM accounts
WHERE email_address = '...'
AND deleted_at IS NULL
Depending on your database, that may or may not index well. More problematic, you may end up with privacy leaks if someone forgets the last conditional.

If anything, you want to reverse this so someone has to go out of their way to explicitly select deleted accounts. There are multiple strategies for this, eg using an active_accounts view or table.
Lastly, there are lots of potential reasons for an account to be disabled or otherwise not visible/accessible. Takedowns, court orders, site safety, hacked accounts and so on.
Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
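One of the strategies mentioned above, sketched (schema hypothetical):

CREATE VIEW active_accounts AS
SELECT * FROM accounts WHERE deleted_at IS NULL;

-- Day-to-day queries go through the view, so selecting deleted accounts
-- requires touching the base table explicitly:
SELECT * FROM active_accounts WHERE email_address = '...';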
Then don't do that. It's kind of a leap to say soft deletes are categorically bad because someone might confuse "deleted" with "inactive". My users table does the super-advanced thing of having both columns. The ORM also doesn't forget to add the not-null criterion. There's also zero databases in active use where it poses a problem to indexing.
Soft deletes suck in their own way, but none of the alternatives are perfect either.
Overloading timestamps to carry a boolean on null is awesome as long as you decide that's what you're doing and use one of the several standard techniques to dodge the easily avoided potential downside.
This isn't a valid security concern, any more than any other incorrect sql query would be. A dev can always write a bad query; you need another way to address that, and it's not more likely here because of the null.
Imo, SQL should add "NOTHING", add "UNKNOWN" as a synonym for "NULL", and deprecate "NULL".
edit: an empty string, false, 0 are all values.
NULLs still have their (rare) place, but the foremost issue with query results is that they are tabular rather than hierarchical. The main culprits being (1) outer joins that represent or induce nonsensical operations and (2) lack of non-null “zero” values for types like date. Of course hierarchies can make querying more complex, but mostly in cases where the relational logic goes crazy itself and you had to go tabular anyway.
If you think of it, distinct, group by and windowing feel like workarounds in tabular mode but would be natural to hierarchies, because everything is naturally distinct and grouped-by by design and windows are basically subtables in these rows.
Bonus points: you could fetch "SELECT FROM a, b_rows LEFT JOIN b AS b_rows …" in a single query without duplicating `a`s and nullifying `b`s when N <> 1. And when you aggregate through a column in `b`, there's no headache about what to do with join-produced NULLs (unless `b` columns are nullable by your design, then it's on you). And when it all arrives at a client, it's already well-shaped for ui, processing, etc. No more:
last_a_id = undefined
for (row of rows) {
if (row.id != last_a_id) {
…
last_a_id = row.id
}
…
}
I’m pretty sure you recognize this programming idiom immediately.

Before you criticize: I’m not talking about hierarchical/OO tables. Only about ways of getting and handling query results. You still can reshape a relation like you want. The difference is that a database engine doesn’t have to put it all onto a (N x M x …) table and instead creates sort of a subset of relations which is efficient in space and natural to walk through. It already does that when walking through indexes, selects are naturally hierarchical. All it has to do is to track relations it went through rather than just dumping rows from a set of cursors that it knows the start and end points of, but loses this knowledge by writing into a plain table.
Personally I wish more languages were like Python or Ruby and had chosen None or Nil over Null, which alleviates the confusion a bit, as those names better indicate that it's NOT an "unknown" (1 unknown value != 1 other unknown, which intuitively makes sense.) In Ruby or Python it's more obvious that None and Nil are "nothing" types and therefore equivalence makes sense (nil == nil and None == None are both true)