[…]
Real-life CSV data is usually consistent. By that I mean tabular data typically has a fixed number of columns: rows suddenly exhibiting an inconsistent number of columns are frowned upon. What's more, each column often holds a homogeneous data type: integers, floating-point numbers, raw text, dates, etc. Finally, rows tend to have comparable sizes in bytes. We would be fools not to leverage this consistency.
So now, before doing any reckless jumping, let's start by analyzing the beginning of our CSV file to record some statistics that will be useful down the line.
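A minimal sketch of what such a sampling pass could look like, assuming the statistics of interest are the column count and the distribution of row sizes in bytes (the function name and return shape are hypothetical, not the article's actual code):

```python
import csv
import statistics

def sample_csv_stats(path, sample_size=1000):
    """Read up to `sample_size` rows from the start of a CSV file and
    record simple statistics a jumping parser could rely on later."""
    field_counts = []
    row_byte_sizes = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        for i, row in enumerate(reader):
            if i >= sample_size:
                break
            field_counts.append(len(row))
            # Approximate serialized size: field bytes + separators + newline.
            row_byte_sizes.append(sum(len(field) for field in row) + len(row))
    return {
        # The most common field count seen in the sample.
        "expected_fields": max(set(field_counts), key=field_counts.count),
        "mean_row_bytes": statistics.mean(row_byte_sizes),
        "stdev_row_bytes": statistics.pstdev(row_byte_sizes),
    }
```

Inferring per-column types (integer, float, date, text) would follow the same pattern: attempt parses on each sampled field and keep the narrowest type that always succeeds.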
[…]
“Anyway, we now have what we need to be able to jump safely.”
“Safely”. An attacker who controls a row in that file can easily embed data that satisfies the statistical checks, thus injecting rows of their choosing.
The author acknowledges as much, writing: “This technique is reasonably robust and will let you jump safely.”
I agree with “reasonably robust”, but not with “will let you jump safely”.
This is clearly not the sort of thing you should expose to untrusted input; it is an optimization technique, in the same way you would not use a fast but DoS-able hash function for a hashmap whose keys an attacker can choose.
The statistics used here are a bit different, though.
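The critique can be made concrete with a sketch. Suppose the parser, after jumping to a random byte offset and resyncing to the next newline, validates the candidate line against the recorded statistics (all names and the tolerance parameter below are hypothetical, not taken from the article):

```python
def row_looks_valid(line, stats, tolerance=3.0):
    """Heuristic check: does `line` match the sampled statistics?
    `stats` holds expected_fields, mean_row_bytes, stdev_row_bytes."""
    fields = line.split(",")
    if len(fields) != stats["expected_fields"]:
        return False
    # Accept rows within `tolerance` standard deviations of the mean size.
    deviation = abs(len(line) - stats["mean_row_bytes"])
    return deviation <= tolerance * max(stats["stdev_row_bytes"], 1.0)

stats = {"expected_fields": 3, "mean_row_bytes": 6.0, "stdev_row_bytes": 1.0}
row_looks_valid("1,2,3", stats)    # a genuine row passes
# An attacker can embed "9,9,9\n" inside a single quoted field; after a
# naive newline resync, that fragment also passes every check above:
row_looks_valid("9,9,9", stats)
```

The check only ever sees bytes, so any attacker-controlled field long enough to contain a plausible fake row defeats it, which is why "reasonably robust" holds but "safely" does not.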