Another way to look at it is the functional core, imperative shell pattern.
Wrapping up your dict in a value object (a dataclass, or whatever the equivalent is in your language) early on means you handle the ugly stuff first. Parse, don't validate. Resist the temptation of optional fields. Is there really anything you can do if the field is null? No? Then don't make it optional. Let it crash early on. Clearly define your data.
If you have put your data in neat value objects you know what is in them. You know the types. You know all required fields are there. You will be so much happier. No checking for null throughout the code, no checking for empty strings. You can just focus on the business logic.
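To make the "parse at the boundary" idea concrete, here is a minimal sketch. The `User` type, its fields, and `parse_user` are hypothetical names made up for illustration; the point is only that missing or malformed input fails once, at the edge, rather than deep inside the business logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical record type; the field names are illustrative only.
    email: str
    age: int


def parse_user(raw: dict) -> User:
    """Turn an untrusted dict into a User, failing loudly at the boundary."""
    email = raw["email"]  # KeyError here if the field is missing
    if not isinstance(email, str) or not email:
        raise ValueError("email must be a non-empty string")
    # int() raises on None or on junk such as "unknown"
    return User(email=email, age=int(raw["age"]))


user = parse_user({"email": "ada@example.com", "age": "36"})
```

Everything downstream of `parse_user` can take a `User` and skip the null and empty-string checks.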
Seriously so much suffering can be avoided by just following this pattern.
A good explanation of this is: https://www.destroyallsoftware.com/talks/boundaries
dict is an implementation of a hash table. Hash tables are designed for O(1) lookup of items. As such, they are arrays which are much bigger than the number of items they store, to allow hashing items into integers and sidestep collisions. They’re meant to act like an index that contains many records, not a single record.
A single record is more like a tuple, except you want named access instead of title = movie[0], release_year = movie[1], etc. And Python had that, in NamedTuple, but it was kinda magical and no one used it (shoutout Raymond Hettinger).
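For anyone who hasn't seen it, a small sketch of the named-tuple record described above, using `typing.NamedTuple` from the standard library. The `Movie` fields mirror the `movie[0]` / `movie[1]` example; the values are made up.

```python
from typing import NamedTuple


class Movie(NamedTuple):
    title: str
    release_year: int


movie = Movie("Alien", 1979)

# Named access and positional access read the same slot.
same_slot = movie.title == movie[0]

# It still unpacks like a plain tuple.
title, release_year = movie
```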
Granted, this rant is pretty much the meme with the guy explaining something to a brick wall, in that dicts are so firmly entrenched as the "record" type of choice in Python (but not so in other languages: struct, case class, etc. and JSON doesn’t just deserialize to a weak type but I digress).
A place where dicts for hard coded keys makes sense is notebooks. The convenience is worth it and it’s unlikely to get out of hand.
namedtuple is widely used in Python code, especially before the introduction of dataclasses.
No matter what you do, a lookup into an array will always be quicker than a hash lookup if you don't need to do a linear search, and in a lot of cases even the linear search will be quicker.
A struct field access in other languages is a pointer plus an offset, which to my knowledge is also true of Python classes using __slots__. There's no reason to use a dict if you know the contents of the data; use a dataclass with slots=True, if only because no hash function is run on every lookup into the data structure.
Dicts are useful for looking things up: if you have a bunch of objects that you need to access and modify, you should use a dict.
If you are using the dict as a container like car = {"make": "honda", "color": "red"}, you should use a proper object like a class, dataclass, or pydantic model based on whether you need validation, type safety, etc. This drastically reduces bugs and code complexity, helps others reason about your code, gives you access to better tooling, etc.
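A rough sketch of the split described above: the record becomes a class, and the dict is kept for what it is good at, which is indexing many records by key. `Car` and the licence-plate keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Car:
    make: str
    color: str


# The record is a class; the dict is only the index from key to record.
cars_by_plate: dict[str, Car] = {
    "ABC-123": Car(make="honda", color="red"),
    "XYZ-789": Car(make="toyota", color="blue"),
}

cars_by_plate["ABC-123"].color = "green"  # look up by key, then modify
```

A misspelled field such as `Car(make="honda", colour="red")` raises a `TypeError` at construction, where the dict version would accept the typo silently.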
>"solution : use dataclasses"
Damn, it's almost like using an untyped language for large projects is not a great idea.
Yes, decades ago I was also fascinated by Python and its ease of doing stuff (the compiler doesn't complain that I missed something), but with time I grew fond of statically typed languages... they simply catch swaths of errors earlier...
I've written software with both typed and untyped languages and never had problems (out of the usual) with them.
Ah... yes... because static languages don't do that by forcing you to properly model everything. And as a bonus you can easily navigate between everything and not fear that you missed something while refactoring...
So yeah that small not hit portion of code can always be a time bomb if it does not get tested...
That has nothing to do with the language, and can happen in any language.
leycec, the magic dev behind it, managed to make a full Python type checker with super advanced features and about 0 overhead. It's crazy.
>Beartype now implicitly type-checks all annotated classes, callables, and variable assignments across all submodules of all packages.
"Typed Python" does poorly (compared to e.g. Typescript) on things like overloading functions, generics, structural subtyping, et al.
Golang does typing, but JSONs are PITA to handle.
Try parsing something like `[{"a": 1, "b": "c", "d": [], "e": {}}, null, 1, "2"]` in Go.
Types are a blessing as well as a curse.
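For comparison, a sketch of the same payload in Python. Decoding is a one-liner with the standard `json` module, but the type question doesn't go away; it moves to the point of use, where each element has to be inspected. `describe` is a hypothetical helper written for this example.

```python
import json

payload = '[{"a": 1, "b": "c", "d": [], "e": {}}, null, 1, "2"]'
items = json.loads(payload)


def describe(item: object) -> str:
    """Name the JSON kind of one decoded element."""
    if item is None:
        return "null"
    if isinstance(item, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(item, dict):
        return "object"
    if isinstance(item, list):
        return "array"
    if isinstance(item, (int, float)):
        return "number"
    if isinstance(item, str):
        return "string"
    return "other"


kinds = [describe(item) for item in items]
```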
We used to have strict typed XML. Nobody even bothered.
Not JSONs in general, but a sane API would never return something like that.
> We used to have strict typed XML. Nobody even bothered.
Nowadays there is OpenAPI, GraphQL, protobuf, etc. and people do bother about such things.
Yeah, because it was ugly as hell and not human-readable.
re: Python. I like PyRight/PyLance for Python typing, it seems to "just work" afaict. I also like msgspec for dataclass like behavior [2].
---
But the same issue exists as other dynamic languages, how do you know what the type is of the item you are accessing?
If you know the array will be laid out exactly like that before you make the request you can always create a custom parser to return a struct with those fields name what they actually are instead of arbitrary data.
The only valid way to parse that dynamically is to try and fail in a loop which is inefficient enough that you should stop using whatever API returns that monstrosity.
I know how to deal with missing values or variability in maps, and so do a lot of people.. what am I missing here?
When the data is not uniform (different keys point to differently-typed values), and not as dynamic (maybe your data model evolves over time, but certain functions always expect certain keys to be present), a dict is like a cancer. Sure, it's simple at first, but wait until the same dict gets passed around to a hundred different functions instead of properly-typed parameters. I just quit my tech job at a company that shall remain nameless, partially because the gigantic Ruby codebase I was working on had a highly advanced form of this cancer, and at that point it was impossible to remove. You were never sure if the dict you were supplying to some function had all the necessary keys for the function it would eventually invoke 50 layers down the call stack. But changing every single call-site would involve such a major refactor that everybody just kept defining their functions to accept these opaque mega-dicts. So many bugs resulted because of this. That was far from the only problem with that codebase, but it was a major recurring theme.
I learned this lesson the hard way.
If getting a field of your object had the same syntax as getting a value from a dict you could easily replace dicts with smarter, more rigid types at any point.
My dream is a language that has the containers share as much interface as possible so you can easily swap them out according to your needs without changing most of the code that refers to them. Like easily swap dict for BTreeMap or Redis.
I think the closest is Scala, but it had fallen out of favor before I had a chance to get to know it.
Here is Rich Hickey with an extreme counter example although I would argue he's really demonstrating against getters and setters. https://www.youtube.com/watch?v=aSEQfqNYNAc
As a result, they are very powerful and simple to use.
The issue is that the concrete types are implicit. Depending on the language, runtime or type system expressing the type in a “better” way might be very hard or un-ergonomic.
There are useful ideas in this post but I’d be careful not to throw the baby out with the bath water. Dicts are right there. There are dict literals and dict comprehensions. Reach for more specific dict-likes when it really matters.
If you use a real type, you never have to worry about this.
If you're duck typing, you find this out in the best case when your unit tests exercise it, and in the worst case by a support call when that 1/1000 error handling path finally gets exercised in production.
In typescript using plain JS objects is very straightforward. Of course you have to validate the schema at your system boundaries. But you'll have to do this either way.
So: if this works very well in TS, it can't be dicts themselves; it must be the way they are integrated into and handled in Python.
This leads me to the conclusion that the arguments presented in the article might be the wrong ones.
(But I still think the conclusion the article arrives at is okay. I just don't think there's a strong case being made in the article about whether to prefer data classes or typed dicts.)
External API <--dict--> Ser/De <--model--> Business Logic
Life's all great until "External API" adds a field that your model doesn't know about, it gets dropped when you deserialize it, and then when you send it back (or around somewhere else) it's missing a field. There's config for this in Pydantic, but it's not the default, and isn't for most ser/de frameworks (TypeScript is a notable exception here).
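One way to avoid the silent drop, sketched with only the standard library: keep unknown keys in a side bag and write them back on serialization. This is an illustration of the idea, not how Pydantic or any other framework implements it; `Repo`, its fields, and the `extra` attribute are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    # Hypothetical model: two known fields plus a bag for everything else.
    name: str
    stars: int
    extra: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict) -> "Repo":
        known = {"name", "stars"}
        extra = {k: v for k, v in raw.items() if k not in known}
        return cls(name=raw["name"], stars=raw["stars"], extra=extra)

    def to_dict(self) -> dict:
        # Unknown fields are written back out untouched.
        return {**self.extra, "name": self.name, "stars": self.stars}


incoming = {"name": "demo", "stars": 3, "archived": False}
outgoing = Repo.from_dict(incoming).to_dict()
```

The business logic still sees typed `name` and `stars`, while `archived` survives the round trip even though the model has never heard of it.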
Closed enums have a similar tradeoff.
Dropping unknown/unused fields makes sense in 99% of cases.
Changing your API and assuming everything just keeps working is a nonsense cowboy attitude to software compatibility, even if some frameworks bend over backwards to support it through magic that's hidden from the developer. Furthermore, many programming languages are simply incapable of doing this, and this approach to APIs is immediately restricting those languages from use.
Finally, transforming objects to an internal domain model is really the cornerstone of a lot of recent well-thought-out programming discipline, and this API design is throwing that in the garbage. It's explicitly asking you to mess up your service architecture, spreading bad architecture like a virus to all systems that interact with the API.
https://www.youtube.com/watch?v=aSEQfqNYNAc
But ok, it's less bad in Python since objects are dicts anyway and you don't need getters.
Linting tools will pick up on every instance where you forgot to rename the fields of a class, but won't do the same for dicts.
This of course means TypedDicts don't give you run-time validation. For that, and for full-blown custom types in general, I tend to favor msgspec Structs: https://jcristharif.com/msgspec/benchmarks.html#json-seriali....
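A quick sketch of what "no run-time validation" means in practice for `typing.TypedDict`. The `Car` type is hypothetical. A static checker such as Pyright or mypy should flag the assignment below, but the interpreter runs it without complaint because the value is just a plain dict.

```python
from typing import TypedDict


class Car(TypedDict):
    make: str
    year: int


# A static type checker should reject this; at run time nothing happens.
car: Car = {"make": "honda", "year": "not a number"}

is_plain_dict = type(car) is dict
```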
Citation needed? Pydantic is really quite fast, and you can pass raw JSON responses into it.
It may be slower (depending on the validators or structure), but I’d expect it to be comparably fast to the stdlib JSON module.
Pydantic v1 was slow enough for them to write a lot of the core logic in Rust for Pydantic v2, and for the previous sloth to have been an argument people launched against it if you look back at threads on here and Reddit comparing it to other libraries.
class GetThingResult
  def initialize(json)
    @json = json
  end

  # single thing
  def thing_id
    @json.dig('wrapper', 'metadata', 'id')
  end

  # multiple things
  def history
    @json['history'].map { |h| ThingHistory.new(h) }
  end

  # ... two dozen more things
end
Most developers will carry their previous language paradigms into their new ones. But if types, DDD (Domain-Driven Design), and classes are what you're looking for, then Python isn't the best fit. Python doesn't have compiler features that work well with those paradigms, such as dead code removal/tree shaking. However, starting out with dictionaries and then moving over to dataclasses is a great strategy.[1] As a small note, it's kind of ironic that the statically typed language Go took inferred typing with their := operator, while there is now a movement in Python to write foo: str = "bar".
You lose the algebra of dicts - and it’s a rich algebra to lose since in Python it’s not just all the basic obvious stuff but it’s also powerful things like dict comprehensions and ordering guarantees (3.7+ only).
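For readers unsure what that "algebra" covers, a few of the built-in operations in one place. The keys and values are made up; everything used here is plain dict behavior from the standard language.

```python
defaults = {"color": "red", "doors": 4}
overrides = {"doors": 2, "turbo": True}

# Merge: the right-hand side wins on clashing keys.
merged = {**defaults, **overrides}

# Comprehension: build a new dict from an existing one.
upper = {key.upper(): value for key, value in merged.items()}

# Key views support set operations.
shared = defaults.keys() & overrides.keys()

# Iteration follows insertion order (guaranteed since Python 3.7).
order = list(merged)
```

None of this carries over automatically to a dataclass, which is the cost being pointed at.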
You tightly couple to a definition - in the simple GitHubRepository example this is unlikely to be problematic. In the real world, coupling like this[1] to objects trying to capture domain data with dynamic structures is regularly the stuff of nightmares.
The over-arching problem with the approach given is that it puts code above data. You take what could be a schema, inert data about inert data, and instead use code. But it might also be an interesting case to consider as a slippery slope - if you can put code concerns above data concerns then maybe soon you will see cases where code concerns rank higher than the users of your software?
[1] - by coupling like this I mean the “parse don’t validate” school of thought which says as soon as you get a blob of data from an external source, be it a file, a database or in this case a remote service, you immediately tie yourself to a rocket ship whose journey can see you explosively grow the number of types to accurately capture the information needed for every use case of the data. You could move this parsing operation to be local to the use case of the data (much better) rather than have it here at the entry point of the data to the system but often times (although not always) we can arrive at a simpler solution if we are clever enough to express it in a style that can easily be understood by a newbie to programming. That often means relying on the common algebra of core types rather than introducing your own types.
I use it to parse and validate incoming webhook data in my Python AWS Lambda functions, then re-use the protobuf types when I later ship the webhook data to our Flutter-based frontend. Adding extensions to the protobuf fields gives me a nice, structured way to add flags and metadata to different fields in the webhook message. For example, I can add table & column names to the protobuf message fields, and have them automatically be populated from the DB with some simple helper functions. Avoids me needing to write many lines of code that look like:
MyProtoClass.field1 = DB.table.column1.val
MyProtoClass.field2 = DB.table.column2.val
If you're programming correctly and take encapsulation seriously, then whatever shape incoming data in a dict has isn't something you should take issue with; you just need to check whether what you care about is in it (or not) and handle that within your own context appropriately.
Rich Hickey once gave a talk about something like this talking about maps in Clojure and I think he made the analogy of the DHL truck stopping at your door. You don't care what every package in the truck is, you just care if your package is in there. If some other data changes, which data always does, that's not your concern, you should be decoupled from it. It's just equivalent to how we program networked applications. There are no global semantics or guarantees on the state of data, there can't be because the world isn't in sync or static, there is no global state. There's actually another Hickey-ism along the lines of "program on the inside the same way you program on the outside". Dicts are cool, just make sure that you're always responsible for what you do with one.
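The "only care about your own package" style might look like this in Python. It is a sketch of the idea as described in the comment, not code from the talk; `shipping_label` and the order keys are hypothetical.

```python
from collections.abc import Mapping


def shipping_label(order: Mapping) -> str:
    """Use only the keys this function cares about; ignore the rest."""
    try:
        name, city = order["name"], order["city"]
    except KeyError as missing:
        raise ValueError(f"order is missing {missing}") from None
    return f"{name}, {city}"


label = shipping_label(
    {"name": "Ada", "city": "London", "coupon": "X1", "notes": None}
)
```

Extra keys can come and go without touching this function; it only fails when one of its own two keys is absent.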
"Ignore fields coming from the API if you don’t need them. Keep only those that you use."
IMO this addresses only one part of the problem, namely "sanitize your inputs". But if you follow this, and therefore end up with a dict whose keys are known and always the same, using something "struct-like" (dataclasses, attrs, pydantic, ...) is just SO much more ergonomic :)
This is great if you know what you need from the start. If you only find out what you need after passing your data through multiple layers and modules of your system then you need to backtrack through all your code to the place of creation.
If you have immutable data structures then you have to backtrack through multiple places where your data is used from previous structures to create new ones to pass your additional data through all that.
So if your data travels through let's say 3 immutable types to reach the place you are working on then even if you know exactly where the new field that you need originates, you need to alter 3 types and 3 places where data is read from one type and crammed into another.
If you have a dict that you fill with all you got from the api there's zero work involved with getting the new piece of information that you thought you didn't need but you actually do. It's just there.
- The only way to figure out which parameters are even possible was to search through the code for the uses of the dict.
- Default values were decided on the spot all over the place (input.getOrDefault(..)).
- Parameter names had to be typed out each time, so better be careful with correct spelling.
- Having a concise overview how the input is handled (sanitized) was practically impossible.
0/10 design decision, would not recommend.
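For contrast, a sketch of how the four complaints above look when the input is a dataclass instead of a dict. `SearchInput` and its fields are hypothetical; the original code base isn't shown, so this is only an illustration of the alternative.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SearchInput:
    # Hypothetical parameters; names and defaults are declared in one place.
    query: str
    page: int = 1
    page_size: int = 20

    def __post_init__(self) -> None:
        # One place to sanitize instead of checks scattered across callers.
        if self.page < 1 or self.page_size < 1:
            raise ValueError("page and page_size must be positive")


params = SearchInput(query="dicts")

# The full list of possible parameters is available without grepping the code.
known_parameters = [f.name for f in fields(SearchInput)]
```

A misspelled parameter name becomes a `TypeError` at the call site rather than a silently ignored key.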
Plus all this 1995-era OOP and domain-driven-design crap, "business logic" and data layers and all this other architectural rigidity and usually-needless complexity, layers of boilerplate (and then tools to automate the generation of that), etc.
If your function takes a dict, and is called from many different places, document the dict format in the function comment. Or yes, create a dataclass if it saves more trouble than its additional boilerplate and code and maintenance causes. But take it case by case and aim for simplicity. Most of the time I call out to an API in python, I process its JSON/dict response right after the call, using maybe 10% of the data returned. That's so much cleaner and simpler than writing a whole Data Object Layer, to be used by my API Interface Layer, to talk to my Business Logic layer, etc.
For these kinds of people, no amount of rational evidence or argument is going to convince them this is bad. They practically make an identity out of eschewing anything that seems too orderly or too designed.
(Luckily, at work, most of us on our team like `Pydantic` and also (some of us more than others) type-checking, so these people are dragged along)
Python seems to have many different kinds of "better classes" - the article mentions `dataclass` and `TypedDict`, and AFAIK there are also two different kinds of named tuple (`collections.namedtuple` and `typing.NamedTuple`).
What are the advantages of these "better classes" over traditional classes? How would you choose which of the four (or more?) kinds to use?
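Not a full answer, but here are the four side by side so the differences are visible. The `Point*` names are hypothetical; all four constructs are from the standard library.

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple, TypedDict

PointA = namedtuple("PointA", ["x", "y"])  # immutable tuple, no annotations


class PointB(NamedTuple):  # the same kind of tuple, with type annotations
    x: int
    y: int


@dataclass
class PointC:  # a regular class: mutable by default, not a tuple
    x: int
    y: int


class PointD(TypedDict):  # a plain dict at run time, typed only for checkers
    x: int
    y: int


a, b, c, d = PointA(1, 2), PointB(1, 2), PointC(1, 2), PointD(x=1, y=2)
```

As a loose rule of thumb (an opinion, not an official guideline): the two named tuples suit small immutable records that should still behave like tuples, `dataclass` suits records that need mutation, defaults, or methods, and `TypedDict` suits data that has to stay a dict, such as JSON passed straight through.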
So people are trying to force Python to be something it isn’t in adherence to their ideology, but it fails to gain consensus because there’s a sizable cohort that use Python because it isn't those things.
So we get repeated implementations, from each ideologically motivated group.
un-annotated tuples and too many func params are cancer.
Un-annotated tuples and too many func params are OK, because at least they are pushed and popped from the stack.
Calls and rets without a prologue and epilogue on the other hand…
Or many, many stacks you can't comprehend nor amend.
I dare to add a new `key` to a dict, can you modify a func call or a tuple with confidence?
but i’m also trying to move on and do things differently today.
let’s just say the situation is displeasing and leave it at that.
If you want an immutable mapping, why not use an enum?
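A small sketch of the enum-as-immutable-mapping idea. `Color` and its values are hypothetical. It works when the keys are fixed and known up front, which is the assumption here; it is not a general replacement for a dict whose keys arrive at run time.

```python
from enum import Enum


class Color(Enum):
    RED = "#ff0000"
    GREEN = "#00ff00"


by_name = Color["RED"].value      # lookup by key, like a dict
by_value = Color("#00ff00").name  # reverse lookup comes for free

try:
    Color.RED = "#000000"  # members cannot be rebound
    rebound = True
except AttributeError:
    rebound = False
```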
In particular, whenever anyone thinks that "deep clone vs shallow clone" is a meaningful distinction, that means their types are utterly void of meaning.