In my benchmarks, Neo4j crashed on every graph I tried <https://www.unum.cloud/blog/2020-11-12-graphs>, making SQLite and Postgres much more viable options even for network-processing workloads. So I wouldn’t be surprised to learn that people actually use pgRouting and Supabase in that setting.
With the rise of Postgres-compatible databases, I’m wondering if it’s worth refreshing the project. Similarly, there are now more graph DBs compatible with Cypher, like MemGraph, which should probably work much better than Neo4j.
Just starting to review it, but my front-of-mind questions: 1) How do I handle persistence? It looks like some code is missing. 2) Do you support multi-tenancy (a B2B SaaS graph backend handling relations scoped to a tenant)?
Thanks
1) You can persist a graph to disk. By default this uses protobuf (`save_to_file`); however, we’re migrating to Parquet in the next release for better performance, because we noticed that loading a 100M-edge graph from scratch (CSV, Pandas, or raw Parquet) is actually faster (~1M rows/sec) than loading from persisted proto, which isn’t ideal. There’s also a private version that uses custom memory buffers for on-disk storage, handling updates and compaction automatically.
2) You can run a Raphtory instance either as a GraphQL server or as an embedded library. For the server, multiple users can query the persisted graphs, which are stored in a simple folder structure with namespaces (for different graphs). For now, access control needs to be managed externally, but it's on our roadmap!
Would be interesting to see updated benchmarks comparing these newer options against PostgreSQL extensions.
And MemGraph is nice, but it's memory-only, whereas Neo4j is designed for super-large graphs that live on the filesystem. Not really that comparable.
We wanted to store a graph in Postgres and ended up writing some recursive queries to pull subgraphs, then layered NetworkX over that for more complex graph operations. We did that for a short while but then switched to Neo4j because of how comparatively easy it was to write queries (although the Python support for Neo4j was severely lacking). We never really stressed it on dataset size, though.
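The shape of it was roughly this (schema and names are illustrative, not our actual code; a plain-Python BFS stands in for the NetworkX layer):

```python
from collections import deque

# Recursive CTE we'd send to Postgres to pull the subgraph reachable
# from a seed node (illustrative schema: edges(src, dst)).
SUBGRAPH_SQL = """
WITH RECURSIVE subgraph AS (
    SELECT src, dst FROM edges WHERE src = %(seed)s
    UNION
    SELECT e.src, e.dst
    FROM edges e
    JOIN subgraph s ON e.src = s.dst
)
SELECT src, dst FROM subgraph;
"""

def reachable(edges, seed):
    """In-memory stand-in for the NetworkX layer: BFS over the
    edge list the query above would return."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
print(sorted(reachable(edges, 1)))  # → [1, 2, 3]
```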
I did manage to crash Redis' graph plugin pretty quickly when I was testing that.
Could you share what the data structure and scale was?
Postgres worked fine, but Cypher is so much more expressive and handles stuff like loop detection for you; Neo4j was much easier to work with. Performance was never really an issue with either.
I've left that world now, but if you're in the market for a graph store again, it might be something to look at.
Really graph is a feature and not a product.
And couldn't disagree more that graph is a feature. You really want something optimised for it (query language / storage approach) as the data structure is so different in every way from a relational or document store.
No, it is not.
[1] https://en.wikipedia.org/wiki/Worst-case_optimal_join_algori...
Graph processing can create a substantial amount of intermediate data if it is done in typical join-implementation fashion (nested loops or hash join). So it may appear that graph processing needs a tailored approach.
But what can help graph algorithms can help SQL query execution as well, and vice versa; see the link above.
For example, TPC-DS contains queries that (indirectly) join the same tables multiple times (query 4, for example). This is, basically, a kind of centrality-metric computation for a graph represented by the tables.
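A toy illustration of the intermediate-data point in plain Python: counting triangles with pairwise (binary) joins materializes every 2-path first, while a worst-case-optimal-style generic join binds one vertex at a time and intersects adjacency sets, so dead ends are never enumerated.

```python
# Star graph: hub 0 connected to 100 spokes; it contains no triangles.
edges = {(0, i) for i in range(1, 101)} | {(i, 0) for i in range(1, 101)}
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)

# Binary-join plan: materialize every 2-path (the intermediate result),
# then filter for the closing edge.
two_paths = [(a, b, c) for (a, b) in edges for c in adj[b] if c != a]
triangles_join = [t for t in two_paths if (t[2], t[0]) in edges]

# Generic (worst-case optimal) join: extend one variable at a time,
# intersecting adjacency sets; the empty intersection prunes immediately.
triangles_wco = [(a, b, c)
                 for a in adj
                 for b in adj[a]
                 for c in adj[a] & adj[b] if c != a and c != b]

print(len(two_paths), len(triangles_join), len(triangles_wco))
# → 9900 0 0: the binary join built 9,900 intermediate rows for an
#   answer of zero; the intersection-based plan built none.
```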
Graph query languages also make those kinds of queries much easier to express in the first place.
There's also the query planner layer to think about too.
Perhaps we're talking past each other about the word "optimised".
But that is not my position! Postgres has many index access methods: hash, btree, brin, gin, gist, and there are extensions for rum, bloom, skip scans, geospatial indexes such as sp-gist, & vector indexes like IVF/HNSW (see pgvector). I mean, as far as graph databases are concerned, besides pgRouting, there's also Apache AGE, which is a graph-"optimised" Postgres.
You should learn more about Postgres and databases in general. See the comment above (https://news.ycombinator.com/item?id=43203833), which is closely related to the argument I am actually making.
There are some other interesting extensions in this space - onesparse[0] is early in development but pretty exciting, as it builds on SuiteSparse, which is very mature.
                     LiveJournal               Orkut
    Nodes:             3,997,962           3,072,441
    Edges:            34,681,185         117,185,037
    Triangles:       177,820,130         627,583,972

                     Seconds  Edges/Second   Seconds  Edges/Second
    Tri Count LL:       2.69    12,892,634     32.03     3,658,602
    Tri Count LU:       1.78    19,483,812     16.38     7,156,338
    Tri Centrality:     1.45    23,918,059     12.22     9,589,610
    Page Rank:          7.12     4,870,953     23.14     5,064,176
Orkut was as big as I could go due to limited RAM. One of my constraints is limited access to big-enough hardware for the kinds of Graphs Of Unusual Size (billions of edges, trillions of triangles) where we can really flex the scale that CUDA support gives us. Stay tuned!
That said, there are graph databases.
https://memelang.net/03/ https://github.com/memelang-net/memesql3
I was thinking that since RDS supports PL/Rust and PostgreSQL's SPI, and croaring-rs is an allowed crate there, I could build upon that.
I figure I can use that to represent many graphs with, say, 100s to ~100M nodes and many relations between them. But each graph would be scoped to a tenant (company / B2B SaaS use case).
I was thinking that by using PL/Rust, storing the roaring bitmap on the DB server in a bytea, and using SPI, I can mutate and query the bitmap with croaring with minimal network overhead. Running via SPI locally on the DB server eliminates the overhead of shipping the bitmap back to my application code.
PostgreSQL also gives me transaction safety for updates, etc., plus support for other column-based data such as my tenant-ID column, some JSONB for relationship metadata to query on, and so on.
Basically something like https://jazco.dev/2024/04/20/roaring-bitmaps/ but on postgres. Given I need to support many tenanted graphs & we're already using citus this seems like something that is feasible at a larger scale too.
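For anyone who hasn't seen the pattern: the core idea is just "each node's neighbor set is one compressed bitmap, stored as bytes". A minimal sketch in plain Python, using an int as a stand-in for a roaring bitmap (croaring inside PL/Rust would replace this; the byte round-trip models the bytea column):

```python
# Adjacency-as-bitmap sketch. A Python int stands in for a roaring
# bitmap; bit i set means "there is an edge to node i".

def add_edge(bitmap: int, dst: int) -> int:
    return bitmap | (1 << dst)

def neighbors(bitmap: int):
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

def to_bytea(bitmap: int) -> bytes:
    # What would land in the bytea column.
    return bitmap.to_bytes((bitmap.bit_length() + 7) // 8, "little")

def from_bytea(raw: bytes) -> int:
    return int.from_bytes(raw, "little")

# Adjacency for one tenant's graph: node -> bitmap of neighbor ids.
adj = {}
for src, dst in [(1, 2), (1, 3), (2, 3)]:
    adj[src] = add_edge(adj.get(src, 0), dst)

# Round-trip each bitmap through its stored byte form.
stored = {node: to_bytea(bm) for node, bm in adj.items()}
print(neighbors(from_bytea(stored[1])))  # → [2, 3]
```

The real win with roaring over a plain bitset is that sparse, high-cardinality id spaces stay small on disk, and set operations (union/intersection for multi-hop queries) stay fast.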
I was wondering, though, whether I am going to need to create some operator classes to index relations a bit better (that seems likely, I think).
I am aware of https://github.com/ChenHuajun/pg_roaringbitmap but would prefer to use int64s and maybe start out on RDS instead of having to add another workload to our self hosted citus cluster/s.
Happy to be told I am a fool, and any insights would be nice. I am potentially going to try this out on some of our datasets, because our product team is basically laying out a vision where they want a graph powering a bunch of things.
I don't like the idea of Neo4j when we're already deep into PostgreSQL for a bunch of workloads (~20+ TB table workloads, etc., so we have some reasonable in-house PG experience).
Also huge thanks to the author of the blog post. I had been looking at pgRouting and wondering with a tilted head.. hmm seems like we can just use this as a graph DB. So that is also on my list to test out.
Apache AGE™ is a PostgreSQL extension that provides graph database functionality.
The OP's article is more of a hack, and a good one! It seems like you can achieve a lot of what you might expect from a graph database with pgRouting functions and good old SQL.
Once big graphs get involved, scalable systems, especially those that separate storage from compute and price accordingly, get much more interesting. We work with partners like Databricks, Google Spanner, AWS Neptune, etc., who have different sweet spots that really depend on workload and context; they're all pretty different. OLTP vs OLAP, etc.
A relationship is an association between relations/tables: parent-child, node/edge, etc., depending on the model, extensions, and so on.
There are three basic models of databases:
    model name   | basic data structure
    -------------+---------------------
    relational   | tables
    hierarchical | trees
    network      | graphs
A "relation" in an RDBMS under Codd's rules is just a table data structure with some additional rules. Part of those rules: a named table, with named and typed attributes (columns), and data in the form of rows of tuples.
pgvector is nearest-neighbor search over tuple values, often within a single table/relation, while pgRouting is graph traversal over relational data.
There is a bit more to it: in the relational model the data is independent of the schema, and no RDBMS is pure.
It may help to realize that pgvector is about finding neighbors among tuples in a relation/table, which is very different from graph traversal.
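The contrast fits in a few lines of plain Python (toy data, no Postgres involved): nearest-neighbor ranks tuples by distance in vector space, while traversal follows explicit edges; neither answers the other's question.

```python
import math
from collections import deque

# Nearest-neighbor (the pgvector-style question):
# which stored vector is closest to the query vector?
vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
query = (0.9, 0.1)
nearest = min(vectors, key=lambda k: math.dist(vectors[k], query))

# Traversal (the pgRouting-style question):
# which nodes are reachable from "a" by following edges?
graph = {"a": ["c"], "c": ["b"]}
seen, q = {"a"}, deque(["a"])
while q:
    for nxt in graph.get(q.popleft(), []):
        if nxt not in seen:
            seen.add(nxt)
            q.append(nxt)

print(nearest)       # → b  (closest in vector space)
print(sorted(seen))  # → ['a', 'b', 'c']  (reachable via edges)
```

Note that "b" is the nearest neighbor even though it is reachable from "a" only through the far-away "c": distance and connectivity are independent structures.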