Where I work, we have maybe a hundred different sources of structured logs: our own applications, Kubernetes, databases, CI/CD software, lots of system processes. There's no common schema other than the basics (timestamp, message, source, Kubernetes metadata). Apps produce all sorts of JSON fields, and we have thousands and thousands of fields across all these apps.
It'd be okay to define a small core subset, but we'd need a sensible "catch-all" rule for the rest. All fields need to be searchable, but it's of course OK if performance is a little worse for non-core fields, as long as you can go into the schema and explicitly add a field to speed it up.
Also, how does Greptime scale with that many fields? Does it do fine with thousands of columns?
I imagine it would be a good idea to have one table per source. Is it easy/performant to search multiple tables (union ordered by time) in a single query?
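To make the question concrete, this is roughly the query shape I have in mind (app_logs and k8s_logs are made-up per-source tables):

```sql
-- Hypothetical per-source tables; search each, then merge and order by time.
SELECT ts, source, message FROM app_logs WHERE message LIKE '%timeout%'
UNION ALL
SELECT ts, source, message FROM k8s_logs WHERE message LIKE '%timeout%'
ORDER BY ts DESC
LIMIT 100;
```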
Secondly, I prefer wide tables that consolidate all sources, for easier management and scalability. With GreptimeDB's columnar storage based on Parquet, unused columns don't incur storage costs.
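As a rough sketch of what that can look like (the column names here are purely illustrative), a single wide table only needs one TIME INDEX column, and source-specific fields can be added over time:

```sql
-- Illustrative wide log table; GreptimeDB requires exactly one TIME INDEX column.
CREATE TABLE logs (
  ts          TIMESTAMP TIME INDEX,
  source      STRING,
  pod         STRING,
  message     STRING,
  customer_id STRING,   -- example of an app-specific field
  PRIMARY KEY (source)
);

-- Further app-specific fields can be added later, e.g.:
-- ALTER TABLE logs ADD COLUMN request_id STRING;
```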
I find it interesting that Greptime is completely time-oriented. I don't think you can create tables without a time PK? The last time I needed log storage, I ended up picking ClickHouse, because it has no such restrictions on primary keys. We use non-time-based tables all the time, as well as dictionaries. So it seems Greptime is a lot less flexible?
Could you elaborate on why you find this inconvenient? I assumed logs, for example, would naturally include a timestamp.
ClickHouse is such a wonderful database precisely because it's so incredibly flexible. While most data I interact with is time-based, I also store lots of non-time-based data there to complement the time-based tables. The rich feature set of table engines, materialized views, and dictionaries means you have a lot of different tools to pick from to design your solution. For example, to optimize ETL lookups, I use a lot of dictionaries, which are not time-based.
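For context, a typical lookup dictionary of mine looks roughly like this (the names are made up); it's backed by a plain customers table and refreshed every few minutes:

```sql
-- Hypothetical ClickHouse dictionary backed by a non-time-based customers table.
CREATE DICTIONARY customer_dict
(
    customer_id   UInt64,
    customer_name String
)
PRIMARY KEY customer_id
SOURCE(CLICKHOUSE(TABLE 'customers'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- Lookup at query or insert time:
SELECT dictGet('customer_dict', 'customer_name', toUInt64(42));
```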
As an example, let's say I'm ingesting logs into Greptime and some log lines have a customer_id. I would like the final table, or at least a view, to be cross-referenced with the customer so that it can include the customer's name. I suppose one would have to continually ingest customer data into a Greptime table with today's date, and then join on today's date?
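To make it concrete, I'm imagining something along these lines (table and column names are hypothetical, and I have no idea how well Greptime would optimize it):

```sql
-- Hypothetical: a customers snapshot re-ingested each day with its own timestamp,
-- and logs joined against the most recent snapshot to pull in customer_name.
WITH latest_customers AS (
  SELECT customer_id, customer_name
  FROM customers
  WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM customers)
)
SELECT l.ts, l.message, c.customer_name
FROM logs AS l
LEFT JOIN latest_customers AS c ON c.customer_id = l.customer_id
WHERE l.ts >= now() - INTERVAL '1 hour';
```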
Something about the sentence structure is just off-putting.
You can predict and control costs with these models.
Does that mean you have to use SQL (or the visual SQL builder) to query logs, and you don't get access to a log query language the way Kibana gives you KQL and Lucene syntax?
If so, I think it's a little disingenuous to write an article comparing the ELK stack, which is open source and comes with a perfectly usable query UI, to Greptime's equivalent, which is not.