I suppose we need a new rule, "Any sufficiently successful data store eventually sprouts at least one ad hoc, informally-specified, inconsistency-ridden, slow implementation of half of a relational database"
PS: I've worked at Elastic for a long time, so it is fun to see the arguments for a young product.
But that's beside the point. When people say "RDBMS" or "filesystem" they mean the full suite of SQL queries and POSIX semantics, neither of which you get with KV stores like BigTable or distributed storage like Colossus.
The simplest example of POSIX semantics that gets discarded quickly is the "fast folder move" operation. This is difficult or impossible to achieve when your keys encode the full path of the file, and is generally easier to implement with hierarchical directory entries. However, many applications are absolutely fine with "write entire file, read file, delete file" semantics, which enables huge simplifications and optimizations!
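To make the cost concrete, here is a toy sketch (a plain Python dict, not any real store's API) of why a "folder move" is expensive when keys are full paths: every object under the old prefix has to be rewritten.

    # Toy path-keyed store: keys are full paths, values are object bytes.
    kv = {
        "/photos/2023/a.jpg": b"...",
        "/photos/2023/b.jpg": b"...",
        "/docs/readme.txt": b"...",
    }

    def move_folder(kv, old_prefix, new_prefix):
        # O(number of objects under the prefix): copy each entry to the new
        # key, then delete the old one. There is no single "rename" to do.
        for key in [k for k in kv if k.startswith(old_prefix)]:
            kv[new_prefix + key[len(old_prefix):]] = kv.pop(key)

    move_folder(kv, "/photos/", "/archive/photos/")
    # With hierarchical directory entries this would be one metadata update.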
Spanner backing GCS actually explains how public Google Cloud Storage always had consistent object listings, while S3 only implemented that around 2020. I always suspected there must be some very hard-to-implement piece that AWS didn't have until 2020. Makes sense now that that piece was Spanner.
Problems we faced using Elasticsearch as the main db (ecommerce company): high load and high RAM usage; the db went down and more RAM was needed. Luckily we had ES experts on the infra team, who helped us a lot.
To write and then read your data back, you need to refresh the index or wait for a refresh. More inserts means more index refreshes, which ES is not designed for, so inserts become slow. You need to find a way to insert in bulk.
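For example, a minimal bulk-indexing sketch with the official Python client's bulk helper (elasticsearch-py, 8.x-style kwargs); the index name and documents are made up:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def index_orders(orders):
        # One bulk request instead of thousands of single index calls,
        # so you don't pay refresh/indexing overhead per document.
        actions = (
            {"_index": "orders", "_id": o["id"], "_source": o}
            for o in orders
        )
        helpers.bulk(es, actions)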
The API started, couldn't find the ES alias because of a connection issue, and created a new alias (our code did that when it couldn't find the alias; bad idea). Oops, the whole dataset behind the alias was gone.
The most important thing when using ES as the main db is to use the "keyword" type for every field you don't full-text search.
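A minimal mapping sketch along those lines (8.x Python client; index and field names are made up): "keyword" for fields you only filter, sort, or aggregate on, "text" only where you actually need full-text search.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(
        index="products",
        mappings={
            "properties": {
                "sku":         {"type": "keyword"},  # exact match, sortable, aggregatable
                "category":    {"type": "keyword"},
                "price":       {"type": "double"},
                "description": {"type": "text"},     # the only analyzed, full-text field
            }
        },
    )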
No transactions: if the second insert fails you need to delete the first insert by hand, which makes the code look ugly.
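A sketch of what that manual compensation tends to look like (index names and documents are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def create_order(order, invoice):
        es.index(index="orders", id=order["id"], document=order)
        try:
            es.index(index="invoices", id=invoice["id"], document=invoice)
        except Exception:
            # Second write failed: undo the first one by hand, since there is
            # no multi-document transaction to roll back.
            es.delete(index="orders", id=order["id"])
            raise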
Advantages: you can search, every field is indexed, reads are super fast. Fast development, easy to learn. We never faced data loss, even when the db crashed.
Anyone in engineering who recommends using a search engine as a primary data store is taking on risk of data loss for their organization that most non-engineering people do not understand.
In one org I worked for, we put the search engine in front of the database for retrieval, but we also made sure that the data was going to Postgres.
It is true that Elasticsearch was not designed for it, but there is no reason why another "search engine" designed for that purpose couldn't fit that role.
I used it about 7 years ago. Text search was not that heavily used, but we utilized keyword filters heavily. It's like having a database where you can throw any query at it and get a response in reasonable time, because you're effectively creating an index on every field.
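The kind of filter-only query this enables looks roughly like this (8.x Python client; index and fields are made up): a bool filter over keyword fields, with no text analysis or scoring involved.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="products",
        query={
            "bool": {
                "filter": [
                    {"term": {"category": "shoes"}},       # exact match on a keyword field
                    {"range": {"price": {"lte": 100}}},
                ]
            }
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["sku"])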
My take is that ES is good for exploration and faster development, but you should switch to SQL as soon as the product is successful if you're using it as the main db.
They messed up a $30 million project big time at a previous company. My CTO swore never to recommend them.
I’ve either been involved with or adjacent to dozens of Accenture projects at 5 companies over the last 20 years, and not a single one had a satisfactory outcome.
I’ve never heard a single story of “Accenture came in, and we got what we wanted, on time and on budget.” Cases of “we got a minimum viable solution for $100m instead of $30m, and it was four years late” seem more typical.
I've also found they do a good job of cultivating a cadre of executives who float between companies and hire them wherever they land, while getting wined and dined.
If you hire your own people, you can make them feel invested in how well the business is doing, get features out the door tomorrow, and build toward the larger thing over time.
What does a $30 million mess-up look like?
"A $30 million mess-up" can look like (at least) two things. It can be $30 million was spent on a project that earned $0 revenue and was ultimately canceled, or it can look like $x was spent on a project to win a $30 million contract but a competitor won the contract instead.
Even if they didn't understand what ES is versus a "normal" database, I'm sure some of those people ran into issues where their "db" either got corrupted or lost data even while testing and building their system around it. This was general knowledge at the time; it was no secret that from time to time things got corrupted and indexes needed to be rebuilt.
It doesn't happen all the time, but way more than zero times, and it's understandable: Lucene is not a DB engine or "DB grade" storage engine; they had other, more important things to solve in their domain.
So when I read stories of data loss and things going south, I don't have sympathy for anyone involved other than the unsuspecting final clients. These people knew, or more or less knew, and chose to ignore it and be lazy.
I agree.
It's been a while since I touched it, but as far as I can remember ES has never pretended to be your primary store of information. It was mostly juniors who reached for it for transaction processing, and I had to disabuse them of the notion that it was fit for purpose there.
ES is for building a searchable replica of your data. Every ES deployment I made or consulted on sourced its data from some other durable store, and the only things that wrote to it were replication processes or backfills.
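A minimal sketch of that pattern, assuming Postgres as the system of record and a periodic backfill job (the table, columns, and connection details are made up):

    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    pg = psycopg2.connect("dbname=shop")
    es = Elasticsearch("http://localhost:9200")

    def backfill_products():
        # Postgres stays the system of record; this job only pushes a
        # searchable copy into ES and can be re-run from scratch at any time.
        with pg.cursor() as cur:
            cur.execute("SELECT id, sku, category, description FROM products")
            actions = (
                {
                    "_index": "products",
                    "_id": row[0],
                    "_source": {"sku": row[1], "category": row[2], "description": row[3]},
                }
                for row in cur
            )
            helpers.bulk(es, actions)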
The best example is the IoT marketing, as if it could handle the load without a bazillion shards, and since when does a text engine want telemetry?
> things got corrupted and indexes needed to be rebuilt.
How are Postgres and Elastic any different here?
These new data stores don't usually require that level of durability or reliability.
I would never. Ever. Bet my savings on ES being stable enough to always be online to take in data, or predictable in retaining the data it took in.
It feels very best-effort and as a consultant, I recommend orgs use some other system for retaining their logs, even a raw filesystem with rolling zips, before relying on ES unless you have a dedicated team constantly monitoring it.
- No edge case is thrown at them
- No part of the system is stressed (software modules, OS, firmware, hardware)
- No plug is pulled
Crank the requests to 11 or import a billion rows of data with another billion relations and watch what happens. The main problem isn't the system refusing to serve a request or throwing "No soup for you!" errors, it's data corruption and/or wrong responses.
Now I work for a company whose log storage product has ES inside, and it seems to shit the bed more often than it should - again, could be bugs, could be running "clusters" of 1 or 2 instead of 3.
Storing logs in ElasticSearch is just stupid, as it does not preserve order:
Elastic’s own consultants will tell you this …
Turns out running complicated large distributed systems requires a bit more than a ./apply, who would have guessed it?
Which is why you supply the parameter

refresh: "wait_for"

in your writes. This makes the write wait until a refresh has made it searchable before the request completes.

> "schema migrations require moving the entire system of record into a new structure, under load, with no safety net"
Use index aliases. Create a new index with the new mapping, then make a reindex request from the old index to the new one. When it finishes, change the alias to point to the new index.
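A sketch of that alias-swap migration with the 8.x Python client (index and alias names are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # 1. Create the new index with the new mapping.
    es.indices.create(
        index="products_v2",
        mappings={"properties": {"sku": {"type": "keyword"}}},
    )

    # 2. Copy everything over server-side.
    es.reindex(
        source={"index": "products_v1"},
        dest={"index": "products_v2"},
        wait_for_completion=True,
    )

    # 3. Atomically repoint the alias that readers and writers use.
    es.indices.update_aliases(actions=[
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add": {"index": "products_v2", "alias": "products"}},
    ])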
The other criticisms are more valid, but not entirely: for example, no database ”just works” without carefully tuning the memory-related configuration for your workload, schema and data.
Outside of memory, log_duration, temp_file_limit, a good query plan visualizer and some backup and replication (e.g. PGBackrest and Patroni) are also generally recommended if self-hosting. Patroni doesn't even need an external config store anymore, which is great since you can just run it on 3-4 nodes and get a high-quality, easy-to-manage HA PostgreSQL cluster.
But those two parameters are pretty much all it takes to have PostgreSQL process thousands of transactions per second without further tuning. Even our larger DBs hosting simple REST applications (as opposed to ETL/data warehousing) had to grow quite a lot before further configuration was necessary, if at all.
Checkpointing probably becomes the next issue then, but modern PostgreSQL actually has great logging there -- "Checkpoints occur too frequently, consider these 5 config parameters to look at". And don't kill VACUUM jobs; as a consultant once joked, he sometimes earns thousands of dollars just to say "You killed a VACUUM job? Don't do that".
So yeah, actually running PostgreSQL takes a few considerations, but compared to 10 - 15 years ago, you can get a lot with little effort.
The difference is that waiting for a refresh ensures the data will be available for search after the "insert" finishes; forcing a refresh might be a footgun that will cripple the servers.
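A sketch of the difference on a single write (8.x Python client; index and document are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    doc = {"sku": "abc-123", "price": 42.0}

    # Default: returns immediately; the doc becomes searchable on the next
    # scheduled refresh.
    es.index(index="products", id="1", document=doc)

    # Blocks until a refresh has made the doc searchable; safer under load.
    es.index(index="products", id="1", document=doc, refresh="wait_for")

    # Forces an immediate refresh; fine on a toy cluster, a footgun at scale.
    es.index(index="products", id="1", document=doc, refresh=True)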
A database is a big proposition: transactions, indexes, query processing, replication, distribution, etc. A fair number of use cases are just "Take this data and give it back to me when I ask for it".
ES (or any other not-a-database) might not be a full-bore DBMS. But it might be what you need.
The one thing relational databases don't have, that you might need, is scaling. Maintaining data consistency implies a certain level of non-concurrency. Conversely, maintaining perfect concurrency implies a certain level of data inconsistency.
You could maybe consider Rel if you have a particular type of workload, but, realistically, just use a tablational database. It will be a lot easier and is arguably better.
I've seen some examples of people using ES as a database, which I'd advise against for pretty much the reasons TFA brings up, unless I can get by with just YAGNI reasoning.
Feel like the Christmas Story kid --
>simplicity, and world-class performance, get started with XXXXXXXX.
A crummy commercial?
And it was good for a lot of things.
Metrics are much more efficient and are the tool of choice for longer term storage and debugging.
I work with a cluster that holds 500+ TB of logs (most are stored for a year, some for 5 years because of regulations) in searchable snapshots backed by a locally hosted S3 solution. I can do filtering across most of the data in less than 10 seconds.
Some especially gnarly searches may take around 60-90 seconds on the first run as the searchable snapshots are mounted and cached, but subsequent searches in the cached dataset are obviously as fast as any other search in hot data.
Obviously Elasticsearch isn't without its quirks and drawbacks, but I have yet to come across anything that performs better and is more flexible for logs — especially in terms of architectural freedom and bang-for-the-buck.
It should obviously NOT be a "main" database but part of an ETL pipeline for search purposes for instance.
It is much too late for that, but you're right that we'd be wise to put effort into undoing that. This is exactly how you end up with people using Elasticsearch as a primary datastore. When someone hears that they need a database, a database is what you are going to see them pick.
If we regularly used the proper terminology with appropriate specificity, then those without the deep technical knowledge required to understand all the different kinds of databases and the tradeoffs that come with them would be able to narrow their search to the solutions that fit the specification.
I think their API is great and I have had amazing results with it. Their recent innovations around quantization (BBQ) have been amazing for my use case: building an agentic movie database for discovering movies and getting personalized movie recommendations.
There are benefits to not using your database for everything, even if it adds a bit of complexity by introducing another dependency. If the benefits outweigh the cost of that complexity, reaching for Elastic has almost always been worth it for me.
(Obviously I'm referring to a famous YouTube video on the subject)
https://clickhouse.com/docs/engines/table-engines/mergetree-...