3 pointsby zokrezyla month ago4 comments
  • clauseckera month ago
    You can do it like this, assuming A is the mask of newlines and B is the mask of non-spaces.

    1. Compute M1 = ~A & ~B, which is the mask of all spaces that are not newlines 2. Compute M2 = M1 + (A << 1) + 1, which is the first non-space or newline after each newline and then additional bits behind each such newline. 3. Compute M3 = M2 & ~M1, which removes the junk bits, leaving only the first match in each section

    Here is what it looks like:

        10010000 = A
        01100110 = B
        00001001 = M1 = ~A & ~B
        00101010 = M2 = M1 + (A << 1) + 1
        00100010 = M3 = M2 & ~M1
    
    Note that this code treats newlines as non-spaces, meaning if a line comprises only spaces, the terminating NL character is returned. You can have it treat newlines as spaces (meaning a line of all spaces is not a match) by computing M4 = M3 & ~A.
    • zokrezyla month ago
      Thanks for the suggestion. Indeed, that would be the first approach. That is how I started. This is however not considering the state (am I inside a 'statement' or not)

      statement meaning string from first non-space till next EOL or EOF.

      Problem starts when you need to cover the "corner cases". Without the corner cases the algo is not algo.

      • zokrezyla month ago
        Do you mind trying out your solution? The code is in https://github.com/zokrezyl/yaal-cpp-poc Thanks a lot!

        Obviously if your solutions gets closed to the memory bandwith limit, we will proudly mention it!

        • clauseckera month ago
          So I've thought about it and I don't really feel like spending more time to convince you that this works. If you have questions I am happy to answer them, but please write your own code.
          • zokrezyla month ago
            It's fine and thank you! I am playing arround with the idea, in theory all is good.. Only thing is that things like "first non ..." often involve branching that corrupts the prediction ability of the CPU. Therefore I kindly invited you to show it in code.
            • clauseckera month ago
              You can find the first set bit in an integer with a machine instruction, it's completely branch free. gcc has __builtin_ctz() for this. You'll either need to iterate over all set bits (so one branch per set bit) or use a compression instruction (requiring AVX-512) to turn the bit set into a set of integers.

              That said, as you seem to actually want to do something with the results, you'll take a branch per match anyway, so I don't see the problem.

      • clauseckera month ago
        This does track the state. If you want to track it across multple vectors of input, you'll need to carry it over manually.
  • camel-cdra month ago
    The problem should be equivalent to: https://www.reddit.com/r/simd/comments/1hmwukl/mask_calculat...

    Falvyu's and bremac's solution seems to be the best.

    • zokrezyla month ago
      thanks for pointing out! I tried the borrowing trick from the previous segment, was pretty obvious, but for some reason failed as could not avoid at least one conditinonal... will try again.
      • zokrezyla month ago
        For my problem describe under the link above the suggestions above eliminate indeed the branches, but same time the extra instructions slow down the same as my initial branches. Meaning, detecting newlines would work almost 100% of memory throughput, but detecting first non-space reduces the speed to bit above 50% of bandwith

        https://gist.github.com/zokrezyl/8574bf5d40a6efae28c9569a8d6...

  • pestatijea month ago
    wheres the code?...have a look at codereview[5], the whole site is geared for this kind of challenges

    [5] codereview.stackexchange.com

    • zokrezyla month ago
      I do not have one "implementation" but have been trying with different approaches that all delivered under 50% of memory bandwith... I guess if anyone can purpose a solution should be from scratch... The problem is that all approaches I tried end up generating unpredictable branches that do not allow the CPU to optimally keep loading text from memory.
    • zokrezyla month ago
  • zokrezyla month ago
    [dead]