181 points by brentroose | 3 days ago | 18 comments
  • brentroose3 days ago
    A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more.

    That's why I built a performance challenge for the PHP community.

    The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prizes is the very sought-after PhpStorm Elephpant, of which we only have a handful left).

    I hope people will have fun with it :)

    • Tade02 days ago
      Pitch this to whoever is in charge of performance at Wordpress.

      A Wordpress instance will happily take over 20 seconds to fully load if you disable cache.

      • rectang2 days ago
        Are you talking about a new, empty WordPress instance running the default theme? Because if so, that doesn't match my anecdotal experience.

        If you're talking about a WordPress instance with arbitrary plugins running an arbitrary theme, then sure — but that's an observation about those plugins and themes, not core.

        As someone who has to work with WordPress, I have all kinds of issues with it, but "20 seconds to load core with caching disabled" isn't one of them.

        • lossyalgo2 days ago
          Can concur. I bought a plugin a few years ago after using the free version for many years, wanting to support the devs for making such a useful plugin. I installed it on a few sites and left my PC running overnight with a tab open to the plugin, then woke up the next day to a lovely rebooted Windows (I hate how the default Windows behavior after a BSOD is a reboot with ZERO indication that it crashed or whether it was an update that rebooted).

          Re-opened all my tabs, and cue the same waking up the next day to a freshly rebooted Windows, which made me suspicious. I assumed at that point it must have been a BSOD, so I dug into the Windows event logs and eventually realized it was Firefox. I restored the tabs yet again and left the browser open overnight, while installing more and more debugging tools for Firefox, none of which helped me track down the culprit. What pissed me off the most was that Firefox even allowed a process to consume > 30GB of RAM and cause my PC to crash!

          I finally caught it one night after > 10 BSODs - the tab had been open for 20+ hours, and right as it started to spiral out of control and my PC was about to crash, as programs were starting to error out and Windows was madly paging things to disk, I got lucky and was able to open about:memory to see the culprit: this plugin had some kind of memory leak that wasn't noticeable until it suddenly went nuts. I emailed the devs multiple times with the full debug output and was ignored for weeks until they finally responded, which pissed me off even more, having finally paid for the pro version only to be greeted with this. The free version didn't seem to have this issue either, which was like an extra slap in the face.

          Naked Wordpress is plenty fast, but as soon as you start adding sketchy plugins and Themes, things can spiral out of control.

          • breakingcups2 days ago
            A memory leak should not be able to BSOD your install. Something else is wrong with your machine, too.
      • embedding-shape2 days ago
        Microbenchmarks are very different from optimizing performance in real applications in wide use, though: they could do great on this specific benchmark but still have no clue how to actually make something large like Wordpress perform OK out of the box.
      • tracker12 days ago
        Wordpress is something that I cannot believe hasn't been displaced by a service that uses a separate application for editing and delivery.

        It seems like something like vercel/cloudflare could host the content side, published as a worker serving mostly-static content from a larger application; that would be more beneficial and run better with less risk, for that matter. Having the editing app and auth served from the same location is just begging for the issues WP and its plugins have seen.

        • devmor2 days ago
          As someone who built full ecommerce websites on wordpress over 15 years ago, I can tell you exactly why it hasn't been replaced - the plugin/theme ecosystem.

          There are tens of thousands of plugins and themes to make a Wordpress website do whatever you want and look however you want, either for free or a very low fee. You have to replace that entire ecosystem for the same price to replace Wordpress.

          No matter how many times people get hacked, the perceived value of getting something for nothing outweighs the eventual cost.

          • hparadiz2 days ago
            I did a short contract a few years back where multiple WordPress plugins were pulling different versions of guzzle and I had to use a namespace rewriter to be able to run multiple guzzle versions at the same time.

            The thing about WordPress is you can put it on a box and lock it down so hard you just treat it as an untrusted process on your server.

          • shimman2 days ago
            Maybe it's just my poor imagination but how many plugins are truly unique to WP that you can't find on other CMSs? The only ones that come to mind would be those plugins that help connect to various B2B or B2C workflows, is that where the gold is mostly found?
            • AlienRobot2 days ago
              WP essentially lets plugins do anything they want. The plugins are just scripts that register callbacks to events. WP calls events on BASICALLY EVERY FUNCTION. This is without exaggeration. I don't remember the exact names right now, but if you have a function like wp_get_title that gets the title of a post, there will be a "get_title" event that can modify which title is returned. So for every function, first the data is computed using the default WP way, then plugins are allowed to discard all that work and replace it with something else entirely. There are events for deciding the canonical URL, for deciding the description of a post, for deciding whether RSS links will be displayed or not (the callback just returns true or false), etc.

              In other words, every property can be modified through global event callbacks. Some events are called very early in the whole pipeline that let plugins just render whatever they want (e.g. render custom XML sitemaps).
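
              From memory, registering one of those callbacks from a plugin looks roughly like this (the real filter for post titles is 'the_title'; treat the details as approximate):

                add_filter('the_title', function ($title, $post_id) {
                    // WP computes the title the default way first, then hands it to
                    // every registered callback, which can return anything it wants.
                    return strtoupper($title);
                }, 10, 2);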

      • monkey_monkey2 days ago
        That's often a skill issue.
        • almosthere2 days ago
          skill issue being they only know php
      • rkozik19892 days ago
        Much like anything else, your performance is going to vary a lot based on the architecture of the implementation. You really shouldn't deploy anything into production without some kind of caching, whether that's done in the application itself or with memcached/redis, varnish, or OPcache.
        • slopinthebag2 days ago
          Either you use a slow language and deal with caching or you use a fast language and just put Cloudflare/Bunny/etc in front.
          • paulryanrogers2 days ago
            Aren't CF, Bunny, etc CDNs and therefore cache?
        • LoganDark2 days ago
          > You really shouldn't deploying anything into production without some kind of caching.

          Citation needed? You only need cache if a render is expensive to produce.

      • The_Presidenta day ago
        I’ve long since abandoned WP but this seems like an SQL resource issue as it bogs up against the oom reaper dealing with no swap. WordPress is like a mid level V6 Kia with all the options and a huge aftermarket.
    • gib4442 days ago
      > A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds

      That's a huge improvement! How much was low hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (E.g. parallelism, faster SQL queries etc)

      • brentroose2 days ago
        Almost all, actually. I wrote about it here: https://stitcher.io/blog/11-million-rows-in-seconds

        A couple of things I did:

        - Cursor based pagination
        - Combining insert statements
        - Using database transactions to prevent fsync calls
        - Moving calculations from the database to PHP
        - Avoiding serialization where possible
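
        To illustrate the first three, here's a simplified sketch (not the actual code; it assumes a PDO connection in $pdo, and the table and column names are made up):

          // Seek past the last seen primary key instead of using OFFSET,
          // which gets slower as the offset grows.
          $select = $pdo->prepare(
              'SELECT id, url, visited_at FROM visits WHERE id > ? ORDER BY id LIMIT 1000'
          );

          $lastId = 0;
          while (true) {
              $select->execute([$lastId]);
              $rows = $select->fetchAll(PDO::FETCH_ASSOC);
              if ($rows === []) {
                  break;
              }
              $lastId = end($rows)['id'];

              // One transaction per batch: the engine fsyncs once per commit
              // instead of once per inserted row.
              $pdo->beginTransaction();

              // Combine the whole batch into a single multi-row INSERT.
              $placeholders = implode(',', array_fill(0, count($rows), '(?, ?)'));
              $insert = $pdo->prepare("INSERT INTO daily_counts (url, day) VALUES $placeholders");

              $params = [];
              foreach ($rows as $row) {
                  $params[] = $row['url'];
                  $params[] = substr($row['visited_at'], 0, 10); // date derived in PHP, not in SQL
              }
              $insert->execute($params);
              $pdo->commit();
          }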

        • tiffanyh2 days ago
          Aren’t these optimizations less about PHP, and more about optimizing how you're using the database?
          • toast02 days ago
            PHP is kind of like C. It can be very fast if you do things right, and it gives you more than enough rope to tie yourself in knots.

            Making your application fast is less about tuning your runtime and more about carefully selecting what you do at runtime.

            Runtime choice does still matter: an environment where you can reasonably separate sending database queries from receiving the results (async communication), or that otherwise lets you pipeline requests, will tend to have higher throughput if used appropriately, though batching queries can narrow the gap. Languages with easy parallelism can make individual requests faster, at least while you have available resources. Etc.

            A lot of popular PHP programs and frameworks start by spending lots of time assembling a beautiful sculpture of objects that will be thrown away at the end of the request. Almost everything is going to be thrown away at the end of the request; making your garbage beautiful doesn't usually help performance.

            • tiffanyh2 days ago
              Would love to read more stories by you toast0 on things you've optimized in the past (given the huge scale you've worked on). Lessons learned, etc. I always find your comments super interesting :)
              • toast02 days ago
                <3 I always love seeing your comments and questions, too!

                Well on the subject of PHP, I think I've got a nice story.

                The more recent one is about Wordpress. One day, I had this conversation:

                Boss: "will the blog stay up?"

                toast0: "yeah, nobody goes to the blog, it's no big deal"

                Boss: "they will"

                toast0: "oh, ummmm we can serve a static index.html and that should work"

                Later that day, he posted https://blog.whatsapp.com/facebook I took a snapshot to serve as index.html and the blog stayed up. A few months later, I had a good reason to tear out WordPress (which I had been wanting to do for a long time), so I spent a week and made FakePress which only did exactly what we needed and could serve our very exciting blog posts in something like 10-20 ms per page view instead of whatever WordPress took (which was especially not very fast if you hit a www server that wasn't in the same colo as our database servers). That worked pretty well, until the blog was rewritten to run on the FB stack --- page weight doubled, but since it was served by the FB CDN, load time stayed about the same. The process to create and translate blog entries was completely different, and the RSS was non-compliant: I didn't want to include a time with the date, and there is/was no available timeless date field in any of the RSS specs, so I just left the time out ... but it was sooo much nicer to run.

                Sadly, I haven't been doing any large scale optimization stuff lately. My work stuff doesn't scale much at the moment, and personal small scale fun things include polishing up my crazierl [1] demo (will update the published demo in the next few days or email me for the release candidate url), added IPv6 to my Path MTU Discovery Test [2] since I have somewhere to run IPv6 at MTU 1500, and I wrote memdisk_uefi [3], which is like Syslinux's MEMDISK but in UEFI. My goal with memdisk_uefi is to get FreeBSD's installer images to be usable with PXE in UEFI ... as of FreeBSD 15.0, in BIOS mode you can use PXE and MEMDISK to boot an installer image; but UEFI is elusive --- I got some feedback from FreeBSD suggesting a different approach than what I have, but I haven't had time to work on that; hopefully soonish. Oh and my Vanagon doesn't want to run anymore ... but it's cold out and I don't seem to want to follow the steps in the fuel system diagnosis, so that's not progressing much... I did get a back seat in good shape though so now it can carry 5 people nowhere instead of only two (caveat: I don't have seat belts for the rear passengers, which would be unsafe if the van was running)

                [1] https://crazierl.org/

                [2] http://pmtud.enslaves.us/

                [3] https://github.com/russor/memdisk_uefi

                • tiffanyh2 days ago
                  Re: PHP vs a rendered index.html … your story brings back fond memories of my college days (around 2001–2002).

                  I was a full-time student but also worked for the university’s “internet group.” We ran a homegrown PHP CMS (this was before WordPress/Movable Type), and PHP still felt pretty new. Perl was everywhere, but I was pushing PHP because I’d heard Yahoo had started using it.

                  Around then, the university launched its first online class registration system. Before that it was all phone/IVR. I warned our team lead the web server would melt down on registration day because every student would be hammering refresh at 9am to get the best class times and professors. He brushed it off, so I pre-rendered the login page as a static index.html and dropped it in the web root.

                  He noticed, got mad (he had built the CMS and was convinced it could handle the load), and deleted my pre-rendered index.html. So young and dumb me wrote a cron job that pinged the site every few minutes, and if it looked down, it copied my static index.html back into the web directory. Since Apache would serve index.html ahead of PHP, it became an instant fallback page.

                  Sure enough, at 9am the entire university website went down. Obviously orders of magnitude less scale than your FB story (and a way less exciting event), but for my small university it was a brief moment of panic. But my little cron job kicked in and at least kept the front door standing.

                  While I’m not in active day-to-day development anymore, I do still work in tech and think a lot about ways to avoid computation. And something I’ve learned, both from reading your posts over the years and from my own personal experiences, is just how big you can scale when you architect in a way that “just pushes bits” (eg “index.html”) as opposed to computing/transforming/rendering something … and I’m not sure you can ever really learn that except through real world experience.

                  Regarding your links, I’ve seen you post about 1 before and have read about it - it looks very cool. I don’t recall seeing 2 or 3 before and look forward to reading more about those. Thanks as always for your insights!

                  • toast02 days ago
                    > Regarding your links, I’ve seen you post about 1 before and have read about it - it looks very cool. I don’t recall seeing 2 or 3 before and look forward to reading more about those. Thanks as always for your insights!

                    So #1 now has dist connection stuff as of a few hours ago. Not super obvious, but you can load two (or more) nodes and call nodes() and see they're connected. Dist connection opens up lots of neat possibilities... but I do need to add an obvious application so it's like actually neat instead of just potentially neat.

                    #2 is a pretty neat way to diagnose path mtu problems. And I've been seeing people use it and link to it on networking forums all over, even forums in other languages. Which is pretty awesome. Maybe a few links in forums over the past year, but it's always cool to see people using stuff I built mostly for me. :)

                    #3 is like I dunno, probably not that useful, I think you could do a lot of similar stuff already, but it felt like a tool that was missing... but I also got some feedback that maybe there's other ways to do it already too, so shrug. But pxe booting is always fun.

          • hu32 days ago
            It's still valid as an example to the language community of how to apply these optimizations.
          • swasheck2 days ago
            in all my years doing database tuning/admin/reliability/etc, performance problems have overwhelmingly been in the bad query/bad data pattern categories. the data platform is rarely the issue
            • tosti2 days ago
              The worst offenders I've seen were looping over a shitty ORM
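
              Something like this (pseudo-ORM; repository and method names made up), where every loop iteration quietly fires its own query:

                foreach ($orderRepository->findAll() as $order) {   // 1 query
                    $customer = $order->getCustomer();               // + 1 lazy-loaded query per order
                    // ...
                }
                // A single joined / eager-fetched query would return the same data in one round trip.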
              • cobbzilla2 days ago
                hey don’t forget, that shitty ORM also empowers you to write beautiful, fluent code that, under the hood, generates a 12-way join that brings down your entire database.
              • edoceo2 days ago
                And that is true across languages.
      • Joel_Mckay2 days ago
        In general, it is bad practice to touch transaction datasets in php script space. Like all foot-guns it leads to Read-modify-write bugs eventually.

        Depending on the SQL engine, there are many PHP Cursor optimizations that save moving around large chunks of data.
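
        For example (a sketch; the attribute shown is MySQL-driver-specific, and other engines expose their own cursor options), switching off client-side buffering in PDO streams rows as you iterate instead of pulling the whole result set into memory:

          $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
          $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

          $stmt = $pdo->query('SELECT url, visited_at FROM visits');
          foreach ($stmt as $row) {
              // Each row is fetched from the server on demand, so memory stays flat
              // no matter how large the result set is.
          }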

        Clean cached PHP can be fast for REST transactional data parsing, but it is also often used as a bodge language by amateurs. PHP is not slow by default or meant to run persistently (low memory use is nice), but it still gets a lot of justified criticism.

        Erlang and Elixir are much better for clients/host budgets, but less intuitive than PHP =3

    • NorwegianDudea day ago
      Fun challenge, but running the benchmark on Apple hardware is a weird decision as Apple doesn't even have server hardware. Would make much more sense to run it on a dedicated Linux box as that is more accessible and more realistic.
    • contingencies2 days ago
      Hehe. Optimization ... it's a good way to learn. Earlier in my career I did a lot of PHP. Usually close to bare.

      Other than the obvious point that writing an enormous JSON file is a dubious goal in the first place (really): while PHP can be very fast, this is probably faster to implement in shell with sed/grep, or ... almost certainly better ... by loading into sqlite then dumping out from there. Your optimization path then likely becomes index specification and processing, and, after the initial load, potentially query or instance parallelization.

      The page confirms sqlite is available.

      If the judges whinge and shell_exec() is unavailable as a path, the more whinge-tolerant alternative is to use PHP's sqlite feature and then dump to JSON.
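
      A rough sketch of that path (file, table, and column names invented; nothing tuned):

        $db = new SQLite3('/tmp/visits.sqlite');
        $db->exec('CREATE TABLE visits (url TEXT, ts TEXT)');

        // Bulk load the CSV inside a single transaction.
        $db->exec('BEGIN');
        $ins = $db->prepare('INSERT INTO visits VALUES (:url, :ts)');
        $fh = fopen('data.csv', 'rb');
        while (($line = fgets($fh)) !== false) {
            [$url, $ts] = explode(',', rtrim($line), 2);
            $ins->bindValue(':url', $url, SQLITE3_TEXT);
            $ins->bindValue(':ts', $ts, SQLITE3_TEXT);
            $ins->execute();
            $ins->reset();
        }
        $db->exec('COMMIT');

        // Aggregate per page per day, then dump to JSON.
        $res = $db->query(
            "SELECT url, substr(ts, 1, 10) AS day, COUNT(*) AS c FROM visits GROUP BY url, day"
        );
        $out = [];
        while ($r = $res->fetchArray(SQLITE3_ASSOC)) {
            $out[$r['url']][$r['day']] = $r['c'];
        }
        file_put_contents('output.json', json_encode($out, JSON_PRETTY_PRINT));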

      If I wanted to achieve this for some reason in reality, I'd have the file on a memory-backed blockstore before processing, which would yield further gains.

      Frankly, this is not much of a programming problem; it's more of a systems problem, but it's not being specced as such. That shows, in my view, an immature conception of the real problem domain (likely IO bound). Right tool for the job.

    • ge962 days ago
      5 days to 30 seconds? What kind of factor/order of magnitude is that damn

      What takes 5 days to run

      • hosteur2 days ago
        Poorly made analytics/datawarehouse stuff.
      • slopinthebag2 days ago
        One query per column per row
    • CyberDildonics2 days ago
      Using a language that is 100x slower than naive native programs to do a "speed challenge" is like spending your entire day speed walking to run errands when you can just learn how to drive a car.
    • lofaszvanitt2 days ago
      Do not update the leaderboard.... at all.
    • user39393822 days ago
      exec(‘c program that does the parsing’);

      Where do I get my prize? ;)

      • brentroose2 days ago
        The FAQ states that solutions like FFI are not allowed because the goal is to solve it with PHP :)
        • kpcyrd2 days ago
          What about using the filesystem as an optimized dict implementation?
    • onion2k2 days ago
      A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds.

      When people say leetcode interviews are pointless I might share a link to this post. If that sort of optimization is possible, there is a data structures and algorithms problem in the background somewhere.

      • nicoburns2 days ago
        I find that these kinds of optimizations are usually more about technical architecture than leetcode. Last time I got speedups this crazy, the biggest win was reducing the number of network/database calls. There were also optimisations around reducing allocations and pulling expensive work out of hot loops. But leetcode interview questions don't tend to cover any of that.

        They tend to be about the implementation details of specific algorithms and data structures. Whereas the important skill in most real-world scenarios would be to understand the trade-offs between different algorithms and data structures so that you pick an appropriate off-the-shelf implementation to use.

        • LollipopYakuza2 days ago
          I agree. The "advanced" leetcode is about those last % of optimization. But when network latency is involved in a flow, it is usually the most obvious low hanging fruit.
      • tuetuopay2 days ago
        Well leetcode asks you to implement the data structure, not how and when to use which data structure. I don’t need to know how to implement a bloom filter on a whiteboard off the top of my head to know when to use it.
        • Twirrim2 days ago
          Hell, the number of times I've used a lot of the data structures that come up in leetcode exercises without at least looking at some reference material is pretty small. I usually assume I'm going to misremember it, and go double check before I write it so I don't waste ages debugging later.
      • slopinthebag2 days ago
        Do you think they achieved that performance optimisation with a networked service because they switched from insertion sort to quicksort?
        • hparadiz2 days ago
          I did the same thing in PHP before. The issue was a foreach over a foreach with a search. The fix was to build the result set as you populate the main array of objects: as you add stuff to the array, if there's a value that matches, you throw it into another "results" array for each cardinality (unique value) that exists. Since in PHP objects are always just pointers, your results arrays are relatively painless. Just a series of int32 integers, basically. Then when you need an answer, the result is instant. I ended up getting an 80-90% speed up. This is not just a PHP thing either; folks end up doing this type of stuff in every language.
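
          A minimal sketch of that pattern (class and helper names invented): index items by the value you'd otherwise search for, while you're loading them.

            $items = [];
            $byCategory = [];   // the "results" arrays, keyed by each unique value

            foreach (loadRows() as $row) {   // loadRows() stands in for wherever the data comes from
                $item = new Item($row);
                $items[] = $item;
                // Object variables are cheap handles in PHP, so this stores a reference, not a copy.
                $byCategory[$item->category][] = $item;
            }

            // Later: an instant lookup instead of a nested foreach with a search.
            $matches = $byCategory['some-category'] ?? [];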
  • pxtail2 days ago
    Side note - I wasn't aware that there is active collectors scene for Elephpants, awesome!

    https://elephpant.me/

    • t1234s2 days ago
      Elephpants should be for second and third place. First place should be the double-clawed hammer.
    • thih92 days ago
      Excellent project. My favorites: the joker, php storm, phplashy, Molly.
  • semiquaver2 days ago
    Are they just confused about what characters require escaping in JSON strings or is PHP weirder than I remember?

        {
            "\/blog\/11-million-rows-in-seconds": {
                "2025-01-24": 1,
                "2026-01-24": 2
            },
            "\/blog\/php-enums": {
                "2024-01-24": 1
            }
        }
    • daviddoran2 days ago
      PHP has always escaped forward slashes to help prevent malicious JSON from injecting tags into JavaScript I believe. Because it was common for PHP users to json_encode some data and then to write it out into the HTML in a script tag. A malicious actor could include a closing script tag, and then could inject their own HTML tags and scripts etc.
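
      Roughly the scenario (illustrative values): with the default escaping, a "</script>" inside the data can no longer terminate the surrounding script tag.

        $data = ['comment' => '</script><script>alert(1)</script>'];

        echo json_encode($data);
        // {"comment":"<\/script><script>alert(1)<\/script>"}

        echo json_encode($data, JSON_UNESCAPED_SLASHES);
        // {"comment":"</script><script>alert(1)</script>"}  <- can break out of an inline <script> block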
    • idoubtit2 days ago
      The weirdness is partly in JSON. In the JSON spec, the slash (named "solidus" there) is the only character that can be written plainly or prefixed with a backslash (AKA "reverse solidus").

      See page 4, section 9 of the latest ECMA for JSON: https://ecma-international.org/wp-content/uploads/ECMA-404_2...

    • CapitaineToinon2 days ago
      That's the default output when using json_encode with the JSON_PRETTY_PRINT flag in php.
      • idoubtit2 days ago
        > That's the default output when using json_encode with the JSON_PRETTY_PRINT flag in php.

        JSON_PRETTY_PRINT is irrelevant. Escaping slashes is the default behavior of json_encode(). To switch it off, use JSON_UNESCAPED_SLASHES.
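
        For example (output shown in the comments):

          $data = ['/blog/php-enums' => ['2024-01-24' => 1]];

          echo json_encode($data, JSON_PRETTY_PRINT);
          // {
          //     "\/blog\/php-enums": {
          //         "2024-01-24": 1
          //     }
          // }

          echo json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
          // {
          //     "/blog/php-enums": {
          //         "2024-01-24": 1
          //     }
          // }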

    • poizan422 days ago
      > The output should be encoded as a pretty JSON string.

      So apparently that is what they consider "pretty JSON". I really don't want to see what they would consider "ugly JSON".

      (I think the term they may have been looking for is "pretty-printed JSON" which implies something about the formatting rather than being a completely subjective term)

      • ourmandave2 days ago
        Pretty JSON not meaning formatting, but more "That was pretty JSON of you."
  • Xeoncross2 days ago
    This is why I jumped from PHP to Go, then why I jumped from Go to Rust.

    Go is the most batteries-included language I've ever used. Instant compile times mean I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it, just like Python.

    Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, its syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but you can't beat its performance and safety. It removes a whole lot of tests as those situations just aren't possible.

    If Rust scares you use Go. If Go scares you use Rust.

    • thinkingtoilet2 days ago
      It's almost comical how often people bring up Rust. "Here's a fun PHP challenge!" "Let's talk about Rust..."
      • hu32 days ago
        Yep. It's like a crossfit vegan religion at this point.

        You don't even have to ask. They will tell you and usually add nothing to the conversation while doing so.

        Quite off-putting.

      • Xeoncross2 days ago
        Sorry, but it's honestly just how a lot of our journeys went. We started on scripting languages like PHP/Ruby/Lua (self-taught) or Java/VB/C#/Python (college) and then slowly expanded to other languages as we realized we were being held back by our own tools. Each new language/relationship makes you kick yourself for putting up with things so long.
        • thinkingtoilet2 days ago
          I understand that but there's a time and a place. Rust has nothing to do with this. 100% of the people on this site understand that this challenge can be done faster in C, or Rust, or whatever. This is a PHP challenge. Perhaps we could discuss the actual submission as opposed to immediately derailing it.
          • Klonoar2 days ago
            > I understand that but there's a time and a place.

            Dude, this is a website where a bunch of developer nerds congregate and talk shop. They're fine, this is the same kind of shit that's been happening across these kinds of sites for decades.

        • tracker12 days ago
          I don't know about that... I like Rust a lot... but I also like a lot of things about C# or TS/JS... I'll still reach for TS first (Deno) for most things, including shell scripting.
      • slopinthebag2 days ago
        I mean, it's kinda like complaining that people are mentioning excavators on your "how I optimised digging a massive ditch with teaspoons" post.
        • hu32 days ago
          It's more like mentioning excavators in posts about the best vehicles to do groceries with.
    • tracker12 days ago
      Can't speak for go... but for the handful of languages I've thrown at Claude Code, I'd say it's doing the best job with Rust. Maybe the Rust examples in the wild are just better compared to say C#, but I've had a much smoother time of it with Rust than anything else. TS has been decent though.
    • codegeek2 days ago
      I am not smart enough to use Rust, so take this with a grain of salt. However, its syntax just makes me go crazy. Go/Golang, on the other hand, is a breath of fresh air. I think unless you really need that additional 20% improvement that Rust provides, Go should be the default for most projects between the two.
      • Xeoncross2 days ago
        I hear you, advanced generics (for complex unions and such) with TypeScript and Rust are honestly unreadable. It's code you spend a day getting right and then no one touches it.

        I'm just glad modern languages stopped throwing and catching exceptions at random levels in their call chain. PHP, JavaScript and Java can (though not always) have unreadable error handling paths, often without augmenting the error with any useful information, so you're left relying on the stack trace to try to piece together what happened.

  • tzs2 days ago
    What's a decent time for this?

    I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.

    The first part, using this command [1], produces output lines that look like this:

      219,/blog/php-81-before-and-after,2021-06-21
    
    and is sorted by URL path and then date.

    With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio). But with 100 million it took 1220 seconds, virtually all of which was sorting.

    Turning that into JSON via a shell script [2] was about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array).

    So basically 22 minutes. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB, there are under 300 pages, and the largest page count is under 1000, so building the output up in memory as the unsorted input is scanned and then sorting that would clearly be way, way faster.

    [1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'

    [2]

      #!/bin/zsh
      echo "{"
      PAGE=none
      while read LINE; do
          COLS=("${(@s/,/)LINE}")
          COUNT=${COLS[1]}
          URL=${COLS[2]}
          DATE=${COLS[3]}
          if [ $URL != $PAGE ]; then
              if [ $PAGE != "none" ]; then
                  echo
                  echo "    },"
              fi
              PAGE=$URL
              echo "    \"\\$URL\": {"
              FINISHDATE=no
          else
              if [ $FINISHDATE = "yes" ]; then
                  echo ","
              fi
          fi
          echo -n "        \"$DATE\": $COUNT"
          FINISHDATE=yes
      done
      echo
      echo "}"
  • chrismarlow92 days ago
    I don't have time to put together a submission but I'm willing to bet you can use this:

    https://github.com/kjdev/php-ext-jq

    And replicate this command:

    jq -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv

    And it will be faster than anything you can do in native php

    Edit: I'm assuming none of the urls have a comma with this but it's more about offloading it through an extension, even if you custom built it

    • Retr0id2 days ago
      The rules exclude FFI etc.
  • Twirrim2 days ago
    I took a quick look; the dependency on PHP 8.5 is mildly irritating, as even Ubuntu 26.04 isn't lined up to ship with that version - it's on 8.4.11.

    You mention in the README that the goal is to run things in a standard environment, but then you're using a near bleeding edge PHP version that people are unlikely to be using?

    I thought I'd just quickly spin up a container and take a look out of interest, but now it looks like I'll have to go dig into building my own PHP packages, or compiling my own version from scratch to even begin to look at things?

  • tveita2 days ago
    > Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

    Except that the generator script generates dates relative to time() ?

    • brentroose2 days ago
      True, it's a bug that I'm going to fix, but it only impacts local test data sets and not the real benchmark :)
  • tzs2 days ago
    You should say in the output formatting rules that the pages should be output in the order in which they first appear in the input file. Currently it only specifies the order of the visits within the entry for each page.
  • spiderfarmer2 days ago
    Awesome. I’ll be following this. I’ll probably learn a ton.
  • Retr0id2 days ago
    How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)
  • lofaszvanitt2 days ago
    Submit at the very end, so others won't know you have a better solution.
  • csjh2 days ago
    Obligatory DuckDB solution:

    > duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"

    Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.

  • tzsa day ago
    I'm looking at the leaderboard and it raises some interesting questions. Currently the fastest are ~3.4 seconds.

    Yesterday the README said that benchmarks were run on a "Premium Intel Digital Ocean Droplet with 2vCPUs and 1.5GB of available memory".

    Today it says they are run on a "Mac Mini M1 with 12GB of RAM of available memory", which if the net is to be believed is quite a bit faster than the DO Droplet they said they had been using. I'm going to assume those 3.4 seconds results on the leaderboard were benchmarked on the Mac.

    I've got an M2 Max Mac Studio which should be faster than the Mac Mini.

    A program to do this challenge must read the entire input file, and it is going to have to do at least some computation for every character in the file while parsing.

    So I thought to try to get an idea of what an upper limit might be for how fast this could be done. One idea for that was this:

      $ time LC_ALL=C wc -l data.csv
    
    The idea is wc should be written in C or C++, and counting lines just requires checking each character to see if it is a newline, so it is pretty minimal computation. LC_ALL=C should keep any Unicode stuff from happening that might slow it down.

    This is taking 7.1 seconds. (Same without LC_ALL=C BTW).

    OK, that was unexpected. I then wrote a line counter in C. Allocate a buffer of size N, loop doing (read N bytes from stdin into buffer, scan those bytes counting '\n's) until no more input. With a 1 MiB buffer it took 1 second. With a 1024 byte buffer it took 4.3 seconds. With a 512 byte buffer it took 7.1 seconds.

    So...maybe wc just has a small buffer?

    Then I decided to try "wc -c". That's 0.008 seconds. That's faster than "cat > /dev/null" (0.6 seconds), suggesting that "wc -c" is not reading the file. Someone probably decided to special-case requests for just the number of characters and uses stat/fstat to get the file size, or seeks to the end and gets the offset, or something like that.

    I then looked at the source for wc [1]. It does indeed special case things like -c. It also special cases -l, because lines, unlike words, can be counted without having to deal with locale stuff.

    But my guess that it is using a small buffer is wrong. Buffer size is 1 MiB, same as mine. So why does my line counter take 1 second while "wc -l" takes 7.1 seconds?

    Looking at it I see that wc is also finding the longest line, even if you have only asked for the number of lines. When I add finding the longest line to mine it then takes 5.1 seconds.

    There is also more error handling in wc. Mine just loops as long as read() > 0 and then prints the stats and exits, where as wc loops as long as read() != 0, and then in the loop does an "if (len < 0)" to see if there was an error.

    There is also a check in the loop in wc to see if a flag that gets set on SIGINFO is set. If it is then wc prints the current stats.

    Still, on the 7 GB data.csv file, with a 1 MiB read buffer, the read loop should run under 7000 times, so that "if (len < 0)" and "if (siginfo)" are only going to be evaluated under 7000 times, and their enclosed code is only going to run if there is a read error (for the first) or every time I hit CTRL-T (for the second). In my tests that's 0 times for both of those.

    That's not nearly enough to explain why it is 2.1 seconds slower than my line counter, which now has the same buffer size, finds the longest line too, and, aside from those two if statements (evaluated under 7000 times and never taken), is essentially the same loop.

    Maybe later I'll see what it takes to build wc locally and try to find where the time is going.

    [1] https://github.com/apple-oss-distributions/text_cmds/blob/te...
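
    (For what it's worth, the same chunked-count idea in PHP, as a rough, unbenchmarked sketch: read in 1 MiB blocks and count newlines per block.)

      $fh = fopen('data.csv', 'rb');
      $lines = 0;
      // fread returns '' at EOF and false on error; count newlines per chunk.
      while (($buf = fread($fh, 1 << 20)) !== false && $buf !== '') {
          $lines += substr_count($buf, "\n");
      }
      fclose($fh);
      echo $lines, PHP_EOL;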

  • poizan422 days ago
    > The output should be encoded as a pretty JSON string.

    ...

    > Your parser should store the following output in $outputPath as a JSON file:

        {
            "\/blog\/11-million-rows-in-seconds": {
                "2025-01-24": 1,
                "2026-01-24": 2
            },
            "\/blog\/php-enums": {
                "2024-01-24": 1
            }
        }
    
    They don't define what exactly "pretty" means, but superfluous escapes are not very pretty in my opinion.
    • kijin2 days ago
      They probably mean "Should look like the output of json_encode($data, JSON_PRETTY_PRINT)". Which most PHP devs would be familiar with.
      • poizan422 days ago
        It sounds plausible, but they really need to spell out exactly what the formatting requirements are, because it can make a huge difference in how efficiently you can write the json out.
        • maleldil2 days ago
          It's a challenge for PHP programmers. I imagine the relevant people would recognise that format.
  • matei882 days ago
    It reminds me of a good read about optimizing PHP for the 1 billion rows challenge. TL;DR: at some point you hit a limit in PHP's stream layer.

    https://dev.to/realflowcontrol/processing-one-billion-rows-i...

  • wangzhongwang2 days ago
    [dead]
  • tomaytotomato2 days ago
    Tempted to submit a Java app wrapped in PHP exec() :D
    • brentroose2 days ago
      The rules state that FFI and the like aren't allowed because the goal is to do it in PHP :)