Right now parqeye looks mainly single-file focused. Do you have plans for a “dataset mode” that takes a dir/S3 prefix and surfaces per-file/row-group summaries (row counts, min/max, null %, schema diffs vs a reference file) using just Parquet stats so it scales to tens of GB? Or do you see parqeye intentionally staying a single-file inspector?
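To make the question concrete, here is a rough sketch of the kind of footer-only scan I mean, in Python with pyarrow. This is not how parqeye works today. The function names (`schema_diff`, `null_pct`, `summarize_file`, `summarize_dataset`) are made up for illustration, and it only handles a local directory, not an S3 prefix.

```python
from pathlib import Path


def schema_diff(reference, other):
    """Compare two {column: type-string} mappings (hypothetical helper)."""
    shared = set(reference) & set(other)
    return {
        "added": sorted(set(other) - set(reference)),
        "removed": sorted(set(reference) - set(other)),
        "retyped": sorted(c for c in shared if reference[c] != other[c]),
    }


def null_pct(null_count, num_rows):
    """Null percentage for one column chunk; 0.0 for an empty row group."""
    return 100.0 * null_count / num_rows if num_rows else 0.0


def summarize_file(path):
    """Summarize one file from its footer only; no data pages are read."""
    # Imported lazily so the pure helpers above work without pyarrow installed.
    import pyarrow.parquet as pq

    md = pq.read_metadata(path)
    schema = pq.read_schema(path)
    columns = {n: str(t) for n, t in zip(schema.names, schema.types)}

    row_groups = []
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        stats = {}
        for j in range(rg.num_columns):
            col = rg.column(j)
            st = col.statistics
            if st is None:  # the writer did not record statistics for this chunk
                continue
            stats[col.path_in_schema] = {
                "min": st.min if st.has_min_max else None,
                "max": st.max if st.has_min_max else None,
                "null_pct": (
                    null_pct(st.null_count, rg.num_rows)
                    if st.has_null_count
                    else None
                ),
            }
        row_groups.append({"rows": rg.num_rows, "stats": stats})

    return {
        "path": str(path),
        "rows": md.num_rows,
        "columns": columns,
        "row_groups": row_groups,
    }


def summarize_dataset(directory):
    """Summarize every *.parquet file in a local directory.

    Assumption: the first file (sorted by name) acts as the reference schema.
    """
    files = sorted(Path(directory).glob("*.parquet"))
    summaries = [summarize_file(p) for p in files]
    reference = summaries[0]["columns"] if summaries else {}
    for s in summaries:
        s["schema_diff"] = schema_diff(reference, s["columns"])
    return summaries
```

Since only footers are read, the cost should grow with the number of files and row groups rather than with the data size, which is why I would expect this to hold up at tens of GB.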
[1] https://github.com/Vitruves/nail-parquet [2] https://github.com/NixOS/nixpkgs/pull/449066
duckdb -c "from 'foo.parquet'"
but maybe still useful for other formats, multi-file, or remote situations
https://github.com/llimllib/personal_code/blob/c1a74b1b9527f...
Another seemingly extremely similar project released in the last few days: https://github.com/raulcd/datanomy
Some kind soul made this repository back then, and I found it on something like the 13th page of Google while in the depths of despair. It is my most treasured GitHub star, the shining beacon that saved me. I see it has saved 17 other people too.
https://github.com/casidiablo/parquet-tools-for-dumb-people-...
Native Mac/Windows app with multi-threaded parsing (simdjson), automatic nested object flattening, and handles 10M+ rows instantly.
For HN: Use code HN100 for free access
https://iotdatasystems.gumroad.com/
Built with C++ for native performance (~6MB app, not Electron).
Would love feedback from folks working with large JSONL files.
I think you can afford the extra characters to show the whole page in portrait mode. (iPhone 16 pro Safari)
Also just added a Data Plot feature for visualizing numeric columns.
Thanks to everyone who reported the issue!
I did submit a feature request for vi keybindings, though I could look into contributing this myself if I find a bit of spare time.
The other thing that surprised me was the size of the binaries: 90MB for a TUI tool (x64 Linux)? I wonder what makes up the bulk of that. Is there an issue with LTO? Another commenter noticed this as well.
It also looks like you are building against a relatively recent glibc (2.34), which limits compatibility with older systems. Building against an older glibc can be hard to do, so I am not faulting you here, and you do provide a musl fallback, which is appreciated. (Mandatory notice, just in case you were not aware of this: the musl allocator can dramatically degrade the performance of Rust programs.)
A few more ideas for improvements (you probably already have your own laundry list):
- Mouse support?
- Seeing that you do have graphs, it would be fun to see a scatter plot as well as a distribution plot under statistics in the "Row Groups" tab (though you probably pull these from the metadata, so that would require further processing, which may be out of scope).
Python (uv + dataiter, but easy to modify for pandas or polars): https://github.com/otsaloma/dataiter/blob/master/bin/di-open
R (as per the comment, this also requires ~/.Rprofile code, nanoparquet in this case): https://github.com/otsaloma/R-tools/blob/master/r-load
Will take a look when I get to my laptop!
Also allows you to do computations on the data in place.
BTW, you can use duckdb with their ui plugin to have an interactive view of your data, not only parquet.
I tried to install with brew, but it told me my CLI tools were "too out of date". Never seen that before! And I had just upgraded, too.
Will try again tomorrow.