Anthropic's original take home assignment open sourced (github.com)

639 points by myahio a month ago | 59 comments

lbreakjai a month ago
I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.
- epolanski a month ago
  Computing is a very broad topic. Even Linus or Carmack have no skills or knowledge about countless topics that would be mundane to you.
  It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.
  Our ability to process and master new topics is part of the job.
  I'm sure you've done that countless times.
- TrackerFF a month ago
  Well it is a specialized problem. If you've never worked on anything similar previously, it is going to take time. Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems - after college I interviewed for various electronics/hardware companies where you'd get asked to optimize low-level code - which would have looked quite foreign, if you had never actually worked on such problems before.
  - Onavo a month ago
    If you ask an EE to debug react state management code without prior exposure they won't do too well either. But on the other hand they can easily pick up most of it after a week long crash course while training a performance engineer who can optimize code for a specific architecture would take months.
    sublinear a month ago
    > they can easily pick up most of it after a week long crash course
    I have to disagree and question what you mean by "optimization". It's very easy to write web code that technically accomplishes a task, but does so poorly. This is the natural consequence of having so many options available.
    The vast majority of web devs with less than 5 years of experience simply don't understand plain javascript well enough. It's a longstanding problem that devs will reach for the most ergonomic tools, not the best tools.
    Lacking sufficient experience, they can't help it. This happens in all programming languages and in all layers of software. AI slop is even worse because it tends towards the mean.
    ontouchstart a month ago
    Engineering is more or less about getting familiar with the proper tools and using them to solve specific problems: adding new features, debugging, refactoring and optimizing.
    And the tools themselves are built by other engineers and they need new features, debugging, optimization etc. It is turtles all the way down.
    But each layer has its own jargon, conventions and unwritten hacks. That is where experience comes in. Once you get out of a rabbit hole or pothole, you are one step closer to becoming the “domain expert”. There is no short cut.
    johnnyanmac a month ago
    >The vast majority of web devs with less than 5 years of experience simply don't understand plain javascript well enough
    they are never tested on it, and many won't dig that deep in the day-to-day. Whose fault is it that they don't know plain javascript well enough? That's the result of shipping "content" over any other metric of proper software engineering.
    Funnily enough I did take a mini-course (not a week, but we're talking maybe 100 hours of work as a recreational online summer class) in plain javascript at my university. Quite the quirky language. But this was in ES3 or so, so maybe there's many more guard rails these days against the core jank that makes up JS
    ignoramous a month ago
    > EE to debug react state management ... easily pick up most of it after a week long crash course while training a performance engineer ... would take months
    Isn't that mostly because as you go up the abstraction layer, tools and docs to teach yourself the tricks of the trade fast are in abundance (let alone a popular layer like React)? Which in turn is likely a function of incentives and opportunities.
    fny a month ago
    It's because the higher up the stack you go, tools become more declarative and literate. Calling sort is far easier than understanding the algorithm for example.
    giancarlostoro a month ago
    > Calling sort is far easier than understanding the algorithm for example.
    This was one of my gripes in college, why am I implementing something if I just need to understand what it does? I'm going to use the built-in version anyway.
    robmccoll a month ago
    Because that's the entire point of college. It's supposed to teach you the fundamentals - how to think, how to problem solve, how to form mental models and adapt them, how things you use actually work. Knowing how different sorting functions work and what the tradeoffs are allows you to pick the best sorting function for your data and hardware. If the tools you have aren't doing the job, you can mend them or build new tools.
    godelski a month ago
    So you know which sort to call because there isn't a right answer for all cases.
    And so you can write your own because you're probably going to want to sort data in a specific way. Sort doesn't mean in numerical increasing or decreasing order, it means whatever order you want. You're sorting far more often than you're calling the sort function.
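A minimal sketch of the "whatever order you want" point, using Python's built-in `sorted` with a `key` function. The ticket records and the severity table are made up for illustration.

```python
# Hypothetical records: (name, severity, age_in_days)
tickets = [
    ("disk full", "low", 3),
    ("login broken", "high", 1),
    ("typo in docs", "low", 10),
    ("data loss", "high", 7),
]

# Assumed priority table: a lower rank sorts first.
SEVERITY_RANK = {"high": 0, "low": 1}


def triage_order(ticket):
    """Most severe first; within one severity, oldest first."""
    _name, severity, age = ticket
    return (SEVERITY_RANK[severity], -age)


ordered = sorted(tickets, key=triage_order)
print([name for name, _, _ in ordered])
```

The call to `sorted` is one line; the ordering logic in `triage_order` is where the actual sorting decision lives.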
    ksenzee a month ago
    The problem is that a computer science degree isn't the right training for most software engineering jobs.
    giancarlostoro a month ago
    My degree was not specifically CS, it was a related degree, the focus was on landing jobs, but they still covered some CS concepts because some students were in fact doing a CS degree. I was more focused on "show me what I need to build things". I have never had to hand-craft any algorithm in my 15 years of coding, it just makes no sense to me. Someone else figured it out, I'm content understanding the algorithms.
    shakna a month ago
    In my twenty years, I've rerolled famous algorithms "every now and then".
    It's almost wild to me that you never have.
    Sometimes you need a better sort for just one task. Sometimes you need a parser because the data was never 100% standards compliant. Sometimes you need to reread Knuth for his line-breaking algorithm.
    Fwirt a month ago
    My high school computer science teacher (best one I ever had) once told us this anecdote when we were learning sorting algorithms:
    He was brought in by the state to do some coaching for existing software devs back in the 90s. When he was going over the various different basic algorithms (insertion sort, selection sort, etc.) one of the devs in the back of the class piped up with, "why are you wasting our time? C++ has qsort built in."
    When you're processing millions of records, many of which are probably already sorted, using an insertion sort to put a few new records into a sorted list, or using selection sort to grab the few records you need to the front of the queue, is going to be an order of magnitude faster than just calling qsort every time.
    Turned out he worked for department of revenue. So my teacher roasted him with "oh, so you're the reason it takes us so long to get our tax returns back."
    Thinking that you can just scoot by using the built-in version is how we get to the horrible state of optimization that we're in. Software has gotten slow because devs have gotten lazy and don't bother to understand the basics of programming anymore. We should be running a machine shop, not trying to build a jet engine out of Lego.
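As a rough sketch of the anecdote's idea in Python (the function names and data are invented for illustration), adding a few records to an already sorted list can be done by inserting each one at its position instead of re-sorting everything:

```python
import bisect


def add_records_resort(sorted_records, new_records):
    """Baseline: append everything and sort the whole list again."""
    return sorted(sorted_records + new_records)


def add_records_insert(sorted_records, new_records):
    """Insertion-style: binary-search each new record's slot and insert it."""
    result = list(sorted_records)
    for record in new_records:
        bisect.insort(result, record)
    return result


existing = list(range(0, 1000, 2))  # already sorted
incoming = [7, 501, 998, 3]

# Both approaches must agree on the final order.
assert add_records_insert(existing, incoming) == add_records_resort(existing, incoming)
```

Whether the insertion approach is actually faster depends on the data sizes and on the library sort in use. Python's built-in sort, for example, already takes advantage of pre-sorted runs, so this is something to measure rather than assume.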
    johnnyanmac a month ago
    I mean, the lesson I got from my 10X class was pretty much that: "never write your own math library, unless you're working on maintaining one yourself".
    Funnily enough, this wasn't limited to contributing to some popular OS initiative. You can call YAGNI, but many companies do in fact have their own libraries to maintain internally. So it comes up more than you expect.
    On a higher level, the time I took to implement a bunch of sorts helped me be able to read the docs for sort(), realize it's a quicksort implementation, and make judgements like
    1. yeah, that works
    2. this is overkill for my small dataset, I'll just whip up basic bubblesort
    3. oh, there are multiple sort APIs and some sorts are in-place. I'll use this one
    4. This is an important operation and I need a more robust sorting library. I'll explain it to the team with XYZ
    The reasoning was the important lesson, not the ability to know what sorting is.
    komali2 a month ago
    > why am I implementing something if I just need to understand what it does?
    So you can pass job interviews, of course!
  - johnnyanmac a month ago
    >Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems
    I'll take any interviews at this point in time.
    But yes, every domain has its jargon. I work tangentially to this and quickly understood this as a GPGPU problem. A relatively elementary one if you studied this space, though a time limit of 2 hours seems overly restrictive if you aren't actively studying this stuff.
- fergie a month ago
  I'm 30 years in, and literally don't understand the question.
  - WithinReason a month ago
    After a quick look this can be seen as a low level GPU/TPU optimization problem where you have to consider the throughput and depth of different arithmetic pipelines. If you want to hire people who understand how to do that you unfortunately have to give them such a convoluted task and emulate the relevant parts of HW. (In reality this is probably more like TPU since it has scalar pipelines, but the optimization methods are not that different)
    The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.
    WithinReason a month ago
    This also shows that a performance engineer's job, even at Anthropic, is to be a glorified human compiler, who is often easily beaten by LLMs.
    0xffff2 a month ago
    > who is often easily beaten by LLMs
    Is that really the case? My experience is fairly limited, but I've found that the LLM's willingness to fill in plausible sounding (but not necessarily at all accurate) numbers where it needs them to be a significant hindrance when asking it to think about performance.
    scottyah a month ago
    I think the job is to be one of the few that's better than LLMs.
    johnnyanmac a month ago
    And how would one do that these days if they didn't spend their career doing this pre-LLM? Just expect to study and perform such projects as a hobby for a few years on the side? These are specialized problems that you only really do for a few select companies.
    cheikhcheikh a month ago
    I mean yeah... You kind of have to learn this stuff (performance engineering) by yourself (a strong education background helps a lot of course). There are transferable parts of it and there are platform-specific parts where you need to be somewhat familiar with GPUs.
    johnnyanmac a month ago
    Seems like another catch-22 when companies still care about 3-5 years of experience in industry, even if you work on some hobby projects. I'm not in this sector but I had similar struggles getting noticed in another specific domain despite studying it for a while.
  - bsder a month ago
    Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.
    However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".
    That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.
    Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.
    I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!
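To illustrate the complaint about an attribute that is "defined and deleted by a member function", here is a hypothetical stand-in written for this thread. It is not the repository's code; the class, method names, and data are assumptions.

```python
class Machine:
    """Hypothetical stand-in, not the repository's class."""

    def __init__(self):
        self.scratch = [0] * 8

    def step_dynamic(self, writes):
        # The attribute exists only while this method runs, so readers and
        # static tools cannot discover it by looking at __init__.
        self.scratch_write = {}
        for addr, value in writes:
            self.scratch_write[addr] = value
        for addr, value in self.scratch_write.items():
            self.scratch[addr] = value
        del self.scratch_write

    def step_local(self, writes):
        # Same behaviour with a plain local variable: nothing leaks onto self.
        pending = {}
        for addr, value in writes:
            pending[addr] = value
        for addr, value in pending.items():
            self.scratch[addr] = value


m = Machine()
m.step_dynamic([(0, 5)])
print(hasattr(m, "scratch_write"))  # the attribute was deleted again
m.step_local([(1, 9)])
print(m.scratch[:2])
```

Both methods leave `scratch` in the same state; the difference is only in how easy the temporary state is to find when reading the class.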
    arsl16 a month ago
    What is variable typing?
    bsder a month ago
    The types on the variables. Python recently adopted "gradual typing", but it isn't enforced by default. Consequently, you may have to actually execute a Python program to determine what an unlabeled variable type is.
    A lot of people write Python code and then run "AI" on it to fill in the variable types. This, of course, is error prone and shitty. And the AI will miss strange usages like the one I flagged.
    Although I am sorry for phrasing it as "variable typing". I can see how you might read that as "typing that varies" instead.
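A small sketch of what "isn't enforced by default" means in practice. The function name is invented; the behaviour shown is standard Python, where annotations are metadata that only an external checker such as mypy acts on.

```python
def cycles_saved(before: int, after: int) -> int:
    """Annotated as int -> int, but Python does not check this at runtime."""
    return before - after


# Runs without error even though floats are passed and a float comes back.
result = cycles_saved(10.5, 0.25)
print(type(result).__name__, result)

# The annotations are stored on the function but never enforced.
print(cycles_saved.__annotations__)
```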
  - mike_hearn a month ago
    The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.
    If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:
    > Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator.
    However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line:
    https://github.com/anthropics/original_performance_takehome/...
    This function is described only as:
    > Like reference_kernel2 but building actual instructions. Scalar implementation using only scalar ALU and load/store.
    The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the python function reference_kernel2 which is found in problem.py.
    What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
    At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off; since then the concept has been largely dead. I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW (very long instruction word) machine multiple operations are packed together into one wide instruction, each occupying a "slot", and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core, and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple bits of data simultaneously.
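To make "done ahead of time" concrete, here is a toy packer, a sketch only. The `Op` format, the engine names, and the slot limits are assumptions made up for this example and are not the simulator's actual data structures. It walks a program in order and starts a new bundle whenever an op depends on something written in the current bundle or its engine has no free slot.

```python
from collections import namedtuple

# Hypothetical op format and slot limits, chosen only for illustration.
Op = namedtuple("Op", "engine dest srcs")
SLOT_LIMITS = {"alu": 2, "load": 1}


def pack_bundles(ops):
    """Greedy in-order packing: an op joins the current bundle only if its
    engine has a free slot and it does not depend on (or overwrite) anything
    written in that same bundle."""
    bundles, current, written, used = [], [], set(), {}
    for op in ops:
        depends = any(s in written for s in op.srcs) or op.dest in written
        full = used.get(op.engine, 0) >= SLOT_LIMITS[op.engine]
        if current and (depends or full):
            bundles.append(current)
            current, written, used = [], set(), {}
        current.append(op)
        written.add(op.dest)
        used[op.engine] = used.get(op.engine, 0) + 1
    if current:
        bundles.append(current)
    return bundles


program = [
    Op("load", "a", ()),          # a = memory read
    Op("load", "b", ()),          # b = memory read
    Op("alu", "c", ("a",)),       # c = f(a)
    Op("alu", "d", ("b",)),       # d = f(b)
    Op("alu", "e", ("c", "d")),   # e = c + d
]
bundles = pack_bundles(program)
print(len(program), "ops in", len(bundles), "bundles")
```

A real scheduler would also reorder independent ops to fill empty slots; this sketch only packs in program order, which is enough to show where the cycle savings come from.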
    This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so with byte-wide lanes you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
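A minimal sketch of broadcasting and a lane-wise operation over a flat scratch array. `VLEN`, the addresses, and the helper names are assumptions for illustration, not the simulator's instruction set.

```python
VLEN = 8  # assumed vector length in 32-bit words, for illustration only


def broadcast(scratch, dest, value):
    """Copy one scalar into VLEN consecutive scratch words starting at dest."""
    for lane in range(VLEN):
        scratch[dest + lane] = value & 0xFFFFFFFF


def vector_add(scratch, dest, a, b):
    """Lane-wise 32-bit add of the vectors at a and b into dest."""
    for lane in range(VLEN):
        scratch[dest + lane] = (scratch[a + lane] + scratch[b + lane]) & 0xFFFFFFFF


scratch = [0] * 32
scratch[0:VLEN] = range(VLEN)   # vector [0, 1, ..., 7] at address 0
broadcast(scratch, 8, 0xFF)     # eight copies of 0xFF at address 8
vector_add(scratch, 16, 0, 8)   # lane i holds i + 0xFF
print(scratch[16:24])
```

With 32-bit lanes each broadcast word holds 0x000000FF; in hardware the loops above would be a single vector instruction each.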
    And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
    So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
    Does that help? Sounds like a fun exercise :)
    Edit: I just checked and Google TPUs are much more VLIW like so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.
    HarHarVeryFunny a month ago
    It does seem a bit of a strange challenge - a bit reminiscent of high school math problems where understanding the question was as much part of it as actually solving the problem when you understood it.
    Since the focus of the challenge appears(?) intended to be optimization, not reverse engineering, it's a bit odd that they don't give a clear statement of what the kernel is meant to be computing. Perhaps the challenge is intended to be a combination of the two, but then the correct reverse engineering part of it becomes a gate for the optimization part, else you'll be solving the wrong problem.
    Given the focus on results achieved by Opus 4.5, maybe that's the main point - to show how well Opus can reverse engineer something like this. If they gave the actual clear problem statement, then maybe you could brute force an optimal solution using tree search.
    HarHarVeryFunny a month ago
    I just threw this prompt at Gemini, and it seems (I haven't analyzed the problem to see if it is correct), to be able to extract a clear understanding of the problem, and a specification for the kernel.
    "Can you "reverse engineer" what the kernel in this optimization exercise is actually doing - write a specification for it?
    https://github.com/anthropics/original_performance_takehome"
    Gemini says it's doing inference on a random forest - taking a batch of inputs, running each one through each decision tree, and for each input outputting the sum of these decision tree outputs - the accumulated evidence.
    HarHarVeryFunny a month ago
    So looking at the actual code (reference_kernel() in problem.py), this "random forest inference" is completely wrong!
    It's doing some sort of binary tree traversal, but the hashing and wrap around looks weird - maybe just a made up task rather than any useful algorithm?
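For readers who have not opened the repository, the following is a guess at the general shape being described: a batch of walkers over an array-backed binary tree, with a hash step and a wrap back to the root. It is a sketch written for this thread, not the repository's code; the hash function, the branching rule, and all names are assumptions.

```python
def toy_hash(x):
    """Stand-in mixing function; the real kernel's hash is not reproduced here."""
    x = (x * 2654435761) & 0xFFFFFFFF
    return x ^ (x >> 16)


def walk_batch(tree, indices, values, rounds):
    """Advance every walker one level per round over an array-backed binary tree."""
    n = len(tree)
    for _ in range(rounds):
        for i in range(len(indices)):  # walkers are independent of each other
            values[i] = toy_hash(values[i] ^ tree[indices[i]])
            child = 2 * indices[i] + (1 if values[i] % 2 == 0 else 2)
            indices[i] = child if child < n else 0  # wrap back to the root
    return indices, values


tree = list(range(1, 16))  # 15 nodes: a complete binary tree of depth 4
idx, vals = walk_batch(tree, [0] * 4, [10, 20, 30, 40], rounds=6)
print(idx, vals)
```

Each walker's next node depends on its own previous hash, so a single walk is inherently serial. The inner loop over walkers has no cross-dependencies, which is where vector lanes and packed slots can be applied.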
    saagarjha a month ago
    Yes, it’s made up.
    fc417fc802 a month ago
    This isn't "reverse engineering"; it's merely "being able to read fairly simple code you didn't write". A much simpler version of the kernel is provided at the end of problem.py as reference_kernel2.
    If you can't make sense of such a small codebase or don't immediately recognize the algorithm that's being used (I'm guilty of the latter) then you presumably aren't someone that they want to hire.
    HarHarVeryFunny a month ago
    Fair enough, and there are clues in the comments too, but why not just provide the specification of the kernel (inputs and outputs) as part of the problem?
    fc417fc802 a month ago
    They do. They provide reference_kernel which shows the algorithm itself, build_mem_image which shows the data format you will be working with, and finally reference_kernel2 which implements said algorithm on said data format.
    They then provide you with a very naive implementation that runs on their (very simple) VLIW architecture that you are to optimize.
    If at the end of that someone is still lost I think it is safe to say it was their goal that person should fail.
    HarHarVeryFunny a month ago
    Well, yes, they have a reference implementation as documentation, just as they have the simulator as documentation for the ISA ...
    The problem is about pipelining memory loads and ALU operations, so why not just give clear documentation and state the task rather than "here's a kernel - optimize it"? ¯\_(ツ)_/¯
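A toy cycle count for the pipelining idea, under stated assumptions: one load slot and one ALU slot per cycle, and a loaded value usable in the following cycle. The schedule format is invented for this sketch and is not the simulator's.

```python
def serial_schedule(n):
    """One op per cycle: load item i, then compute on item i."""
    cycles = []
    for i in range(n):
        cycles.append([("load", i)])
        cycles.append([("alu", i)])
    return cycles


def pipelined_schedule(n):
    """Overlap the load of item t with the ALU op on item t - 1."""
    cycles = []
    for t in range(n + 1):
        bundle = []
        if t < n:
            bundle.append(("load", t))
        if t >= 1:
            bundle.append(("alu", t - 1))
        cycles.append(bundle)
    return cycles


n = 8
print(len(serial_schedule(n)), len(pipelined_schedule(n)))
```

Under these assumptions the serial version takes 2n cycles and the overlapped one takes n + 1, since only the first load and the last ALU op run without a partner.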
    fc417fc802 a month ago
    Presumably that is only one of two purposes, with the other being to test your ability to efficiently read, understand, and edit low level code that you didn't write. I imagine you'd regularly run into raw PTX if you worked for them in the relevant capacity.
    And perhaps a third purpose is to use the simulator to test your ability to reason about hardware that you are only just getting familiar with.
    HarHarVeryFunny a month ago
    I would assume that anyone optimizing kernels at Anthropic has full documentation and specs for what they are working on, as well as a personal butler attending to their every need. This is big money work - every 1% performance improvement must translate to millions of cost savings.
    Maybe they specified the challenge in this half-assed way to deliberately test those sorts of skills (even if irrelevant to the job), or maybe it was just lazily put together.
    The other thing to note is that if you look at what the reference_kernel() is actually doing, it really looks like a somewhat arbitrary synthetic task (hashes, wraparound), so any accurate task specification would really need to be a "line by line" description of the steps, at which point you may as well just say "here's some code - do this".
    menaerus a month ago
    In a fast-paced domain such as this one, and especially given the (global) competition, the development/leadership process is most likely chaotic, and the "best" practices that we would normally find at slower-paced companies cannot be followed here. I think that by underspecifying the assignment they wanted to test a candidate's ability to fit into such an environment, apart from the obvious reason, which is to filter out insufficiently motivated candidates.
    saagarjha a month ago
    They do, but documentation is not always complete or correct.
    mike_hearn a month ago
    > as well as a personal butler attending to their every need
    I think they do and his name is Claude ;)
    dist-epoch a month ago
    > but which requires all parallelism to be statically declared ahead of time
    this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles
    another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place
    forgotpwd16 a month ago
    This is a nice writeup. Thanks. Another commenter said it would have taken them 2h just to sketch out ideas; sans LLMs it would have taken me more than 2h just to collect all this info, let alone start optimizing it.
    mike_hearn a month ago
    It took me about 10 minutes to generate that writeup the old fashioned 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed as it's a job interview question you're not allowed, but now I see other comments saying it was allowed. That would let you get much further.
    I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
    maccard a month ago
    I've not written a VM before, but the comments in perf_takehome.py and problem.py explain the basics of this.
    I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).
    fc417fc802 a month ago
    Same thought. I doubt they provided additional explanation to candidates - it seems that basic code literacy within the relevant domain is one of the first things being tested.
    On the whole I don't think I'd perform all that well on this task given a short time limit but it seems to me to be an extremely well designed task given the stated context. The reference kernel easily fits on a single screen and even the intrinsic version almost does. I think this task would do a good job filtering the people they don't want working for them (and it seems quite likely that I'm borderline or maybe worse by their metric).
    owlbite a month ago
    I think calling VLIW "an abandoned design" is somewhat of an exaggeration; such architectures are pretty common for embedded audio processing.
    matt_d a month ago
    Worth adding on that note:
    From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack, https://patricktoulme.substack.com/p/from-jax-to-vliw-tracin...
    Google’s Training Chips Revealed: TPUv2 and TPUv3, HotChips 2020, https://hc32.hotchips.org/assets/program/conference/day2/Hot...
    Ten Lessons From Three Generations Shaped Google’s TPUv4i, ISCA 2021, https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf
    mike_hearn a month ago
    Thanks, that JAX writeup was interesting.
    mike_hearn a month ago
    Sure. I did mention DSPs. But how many people write code for DSPs?
    HarHarVeryFunny a month ago
    x86-64 SSE and AVX are also SIMD
    vel0city a month ago
    SIMD and VLIW are somewhat similar but very different in the end.
    HarHarVeryFunny a month ago
    True.
    The ISA in this Anthropic machine is actually both, VLIW and SIMD, and both are relevant to the problem.
    b40d-48b2-979e a month ago
    > Sounds like a fun exercise :)
    I'll be honest, that sounds like the opposite of fun since the worst parts of my job are touching the parts of a Python codebase that are untyped. The sad part is this work codebase isn't even that old, maybe a few years, and the developers definitely should have known better if they had anyone capable leading them. Alas, they're all gone now.
    Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.
    carschno a month ago
    On the one hand, this exercise probably reflects a realistic task. Daily engineering work comprises a lot of reverse engineering and debugging of messy code. On the other hand, this does not seem very suitable as an isolated assignment. The lack of code base-specific context has a lot of potential for frustration. I wonder what they really tested on the candidates, and whether this was what they wanted to filter for.
    fc417fc802 a month ago
    > The lack of code base-specific context has a lot of potential for frustration.
    I think that's one of the intentional points. Being able to quickly understand what the provided source code is doing.
    fergie a month ago
    Wow! Thanks for the explanation :)
    mannyv a month ago
    "Performance can be optimized by not using python."
  - measurablefunc a month ago
    Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any places where they actually disallow cheating b/c it says they only check the final state of the memory² so seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest number of instructions to compute the final state but again, it's not clear what they're actually measuring b/c if you know the final state you can cheat & there is no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.
    ¹https://github.com/anthropics/original_performance_takehome/...
    ²https://github.com/anthropics/original_performance_takehome/...
    saagarjha a month ago
    Well, they read your code in the actual hiring loop.
    measurablefunc a month ago
    My point still stands. I don't know what the LLM is doing so my guess is it's cheating unless there is evidence to the contrary.
    red75prime a month ago
    I guess your answer to "Try to run Claude Code on your own 'ill-defined' problem" would be "I'm not interested." Correct? I think we can stop here then.
    KeplerBoy a month ago
    Well that's certainly a challenge when you use LLMs for this test driven style of programming.
    saagarjha a month ago
    Why do you assume it’s cheating?
    measurablefunc a month ago
    Because it's a well-known failure mode of neural networks & scalar valued optimization problems in general: https://www.nature.com/articles/s42256-020-00257-z
    saagarjha a month ago
    Again, you can just read the code
    measurablefunc a month ago
    You're missing the point. There is no evidence to support their claims which means they are more than likely leaking the memory into the LLM prompt & it is cheating by simply loading constants into memory instead of computing anything. This is why formal specifications are used to constrain optimization. Without proof that the code is equivalent you might as well just load constants into memory & claim victory.
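The concern about loading constants can be shown with a toy harness. Everything here (the reference function, the kernels, the check names) is hypothetical and says nothing about what the repository's tests actually do; it only shows why a check on one fixed final state is weaker than a check over varied inputs.

```python
import random


def reference(inputs):
    """Hypothetical reference computation."""
    return [(x * 31 + 7) & 0xFFFFFFFF for x in inputs]


def honest_kernel(inputs):
    return [(x * 31 + 7) & 0xFFFFFFFF for x in inputs]


FIXED_INPUT = [1, 2, 3, 4]


def cheating_kernel(inputs):
    # Ignores its input and returns constants memorised for FIXED_INPUT.
    return [38, 69, 100, 131]


def passes_fixed(kernel):
    """Weak check: compares the final state for one known input only."""
    return kernel(FIXED_INPUT) == reference(FIXED_INPUT)


def passes_randomised(kernel, trials=20, seed=0):
    """Stronger check: compares against the reference on varied inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        inputs = [rng.randrange(2**32) for _ in range(4)]
        if kernel(inputs) != reference(inputs):
            return False
    return True


for kernel in (honest_kernel, cheating_kernel):
    print(kernel.__name__, passes_fixed(kernel), passes_randomised(kernel))
```

Randomised inputs are not a proof of equivalence, but they do rule out the specific "hard-code the final memory" shortcut described above.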
    fc417fc802 a month ago
    > There is no evidence to support their claims
    Do you make a habit of not presuming even basic competence? You believe that Anthropic left the task running for hours, got a score back, and never bothered to examine the solution? Not even out of curiosity?
    Also if it was cheating you'd expect the final score to be unbelievably low. Unless you also suppose that the LLM actively attempted to deceive the human reviewers by adding extra code to burn (approximately the correct number of) cycles.
    measurablefunc a month ago
    This has nothing to do w/ me & consistently making it a personal problem instead of addressing the claims is a common tactic for people who do not know what it means to present evidence for their claims. Anthropic has not provided the necessary evidence for me to conclude that their LLM is not cheating. I have no opinion on their competence b/c that is not what is at issue. They could be incompetent & not notice that their LLM is cheating at their take home exam but I don't care about that.
    fc417fc802 a month ago
    You are implying that you believe them to be incompetent since otherwise you would not expect evidence in this instance. They also haven't provided independent verification of their claims - do you suspect them of lying as well?
    How do you explain the specific score that was achieved if as you suggest the LLM simply copied the answer directly?
    measurablefunc a month ago
    Either they have proof that their LLM is not cheating or they don't. The linked post does not provide evidence that the LLM is not cheating. I don't have to explain anything on my end b/c my claim is very simple & easily refuted w/ the proper evidence.
    red75primea month ago
    And? Anthropic is not aware of this 2020 paper? The problem is not solvable?
    measurablefunca month ago
    Why are you asking me? Email & ask Anthropic.
    red75primea month ago
    Obviously, because you use this old paper as an argument.
    measurablefunca month ago
    I don't have any insider information on what they know or don't know so you're welcome to keep asking nonsensical questions but eventually I'll stop answering.
  - PeterStuera month ago
    Which part exactly are you having trouble with?
    - Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator
  - karmajunkiea month ago
    Thank goodness, I thought it was just me...
- mangatmodia month ago
  Being smart is different from having the knowledge. If you learn about these concepts and work on these problems, then you will be able to solve them.
  It's not about you being average, just a different knowledge set.
- xenihna month ago
  It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.
- chisteva month ago
  What we know is a drop, what we don't know is an ocean.
- elzbardicoa month ago
  There's a big chance you're falling in a subtle form of imposter syndrome that manifests itself by largely over-estimating the average skill level.
  But this is good. Staying humble makes you hungrier for learning.
- ActorNightlya month ago
  Yours is a good mentality to have because it creates the emotional drive to learn more, so don't lose that. That being said, this isn't really that complicated. It's just a matter of taking enough time to look at the code and understand how it's structured. I feel like the thing that differentiates developer skill is pretty much being able to do that, specifically in the process of having the model of the program in your head.
  - sigbottlea month ago
    Does it?
    For me, I've had that mentality for the longest time and I didn't get anything done because, well, "I'm just average".
    For me, a little bit of arrogance (there's no way I couldn't do X, let's go do it), even if I end up "looking stupid" (see, I told you it was that hard!), was far more valuable to my development
- gervwyka month ago
  Don’t stress, it's very likely that this problem was vibe coded :) It’s insane how much better Claude Code is compared to alternatives lately.
- LouisSayersa month ago
  It's the type of thing you'd be exposed to in a computer science degree - operating systems / compilers.
  Always room to learn in software :)
- deadbabea month ago
  If you think you’re average, you’re not average.
- apsurda month ago
  disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.
  the hot take is, there are other games.
  - tuetuopaya month ago
    This is the opposite of leet code.
    Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked here is exactly the core of what a performance engineer will do at anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
    This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
  - saagarjhaa month ago
    This is explicitly not Leetcode, in fact its goal is to attract optimizers
  - sevenzeroa month ago
    Also leetcode does not really provide insight into one's ability to design business solutions. Whether it be system design, just some small feature implementation or communication skills within a team. It's just optimizers jerking each other off on some cryptic problems 99.999999999% of developers will never see in real life. Maybe it would've been useful like 30 years ago, but all commonly used languages have all these fancy algorithms baked into their stdlib, why would I ever have to implement them myself?
    lbreakjaia month ago
    But this is an interview problem at Anthropic, not at your local CRUD factory. They _are_ looking for the optimizers, because they _are_ working on cryptic problems the 99.9999% of us will never encounter.
    thorncoronaa month ago
    Or more likely, the commonality is how you're applying your software skills?
    In every other field it's helpful to understand the basics. I don't think software is the exception here.
    sevenzeroa month ago
    Understanding basics is very different to being able to memorize algorithms. I really don't see why I'd ever have to implement stuff like quicksort myself somewhere. Yes I know what recursion is, yes I know what quick sort is, so if I ever need it I know what to look for. Which was good enough throughout my career.
pvalue005a month ago
I suspect this was released by Anthropic as a DDOS attack on other AI companies. I prompted 'how do we solve this challenge?' into gemini cli in a cloned repo and it's been running non-stop for 20 minutes :)
- bjackmana month ago
  Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".
  I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
  - aiiotnoodlea month ago
    I've noticed using antigravity and vscode, Gemini 3 pro often comes back with model too busy or something like that and basically 500s.
    Seems like capacity because it works a lot better late at night.
    I don't see the same with the claude models in antigravity.
    menaerusa month ago
    I also noticed that and I also noticed that it starts to struggle when the workspace "tab" you're working in gets longer - it basically gets stuck at "Starting agent ...". I initially thought it must be a very big context that the model is struggling with but since restarting the "app" and kill -9 fixes it, it suggests that it's a local issue. Strange.
    trillica month ago
    Anecdotally, I notice better performance and output quality across most providers outside of 8a-5p ET.
    bjackmana month ago
    Yeah that's a separate issue though, it predates the time when the looping issues got really common, for me at least.
  - mixela month ago
    I saw this too. Sometimes it "thinks" inside of the actual output and it's much more likely to end up in the loop of "I am ready to answer" while it is doing that already
  - sva_a month ago
    I feel like sometimes it just loops those messages when it doesn't actually generate new tokens. But I might be wrong
    bjackmana month ago
    There are some other failure modes that all feel kinda vaguely related that probably help with building a hypothesis about what's going wrong:
    Sometimes Gemini tools will just randomly stop and pass the buck back to you. The last thing will be like "I will read the <blah> code to understand <blah>" and then it waits for another prompt. So I just type "continue" and it starts work again.
    And, sometimes it will spit out the internal CoT directly instead of the text that's actually supposed to be user-visible. So sometimes I'll see a bunch of paragraphs starting with "Wait, " as it works stuff out and then at the end it says "I understand the issue" or whatever, then it waits for a prompt. I type "summarise" and it gives me the bit I actually wanted.
    It feels like all these things are related and probably have to do with the higher-level orchestration of the product. Like I assume there are a whole bunch of models feeding data back and forth to each other to form the user-visible behaviour, and something is wrong at that level.
    hackpelicana month ago
    At one point it started spitting out its CoT in the comments of the code it’s supposed to be changing.
    bjackmana month ago
    Ah yeah I've seen that too. Definitely seems related.
    I suspect this is also something like the "inverse" of a prompt hijacking situation. Basically it's losing track of where its output is flowing to (whereas prompt injection is when it loses track of where its input is flowing from).
- bird0861a month ago
  Which Gemini model did you use? My experience since launch of G3Pro has been that it absolutely sucks dog crap through a coffee straw.
  - pvalue005a month ago
    /model: Auto (Gemini 3) Let Gemini CLI decide the best model for the task: gemini-3-pro, gemini-3-flash
    After ~40 minutes, it got to:
    The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
    It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
    I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
    light_hue_1a month ago
    Keep in mind that the boat on competing with machines to generate assembly sailed for 99% of programmers half a century ago. It is not surprising that this is an area where AI is strong.
    IsToma month ago
    Did you check that it did the things it claims it did?
    triyambakama month ago
    > grow some veggies and raise chickens.
    Maybe Claude will be able to do that soon, too.
    ecea month ago
    After an hour with a few prompts, the first working version got to 3529 cycles (41x speedup) for me. I was using Gemini 3 pro preview.
    apsurda month ago
    we've lost the plot.
    you can't compete with an AI on doing an AI performance benchmark?
    kqra month ago
    This is not an AI performance benchmark, this is an actual exercise given to potential human employees during a recruitment process.
  - bird0861a month ago
    Hilarious that this got a downvote, hello Satya!
  - Mashimoa month ago
    > sucks dog crap through a coffee straw.
    That would be impressive.
    stronglikedana month ago
    Only if the dog didn't get too much human food the night before.
    anematodea month ago
    New LLM benchmark incoming? I bet once it's done, people will still say it's not AGI.
    dotancohena month ago
    When they get the hardware capable of that, a different industry will be threatened by AI. The oldest industry.
    darepublica month ago
    Song of Solomon I guess
    cess11a month ago
    Textile?
    nineteen999a month ago
    The emperor's (empresses?) new textile.
languid-photica month ago
Naively tested a set of agents on this task.
Each ran the same spec headlessly in their native harness (one shot).
Results:
```
    Agent                        Cycles     Time
    ─────────────────────────────────────────────
    gpt-5-2                      2,124      16m
    claude-opus-4-5-20251101     4,973      1h 2m
    gpt-5-1-codex-max-xhigh      5,402      34m
    gpt-5-codex                  5,486      7m
    gpt-5-1-codex                12,453     8m
    gpt-5-2-codex                12,905     6m
    gpt-5-1-codex-mini           17,480     7m
    claude-sonnet-4-5-20250929   21,054     10m
    claude-haiku-4-5-20251001    147,734    9m
    gemini-3-pro-preview         147,734    3m
    gpt-5-2-codex-xhigh          147,734    25m
    gpt-5-2-xhigh                147,734    34m
```
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
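To put the cycle counts above on a common scale, here is a rough sketch that converts them to speedups. It assumes that 147,734 (the count shared by the four slowest rows) is the unoptimized baseline; the thread never states this outright, though it is consistent with the "2799 cycles, a 52x speedup" and "3529 cycles (41x speedup)" figures quoted elsewhere in the discussion. The names `BASELINE_CYCLES` and `speedup` are made up for illustration.

```python
# Assumption: 147,734 cycles is the unoptimized baseline (inferred from the
# table above, not stated explicitly anywhere in the thread).
BASELINE_CYCLES = 147_734


def speedup(cycles: int) -> float:
    """Return how many times faster a result is than the assumed baseline."""
    return BASELINE_CYCLES / cycles


# Cycle counts copied from the table and from Anthropic's stated target.
reported = {
    "gpt-5-2": 2_124,
    "claude-opus-4-5-20251101": 4_973,
    "gpt-5-1-codex-max-xhigh": 5_402,
    "Anthropic target": 1_487,
}

for name, cycles in reported.items():
    print(f"{name:<28} {cycles:>7,} cycles  {speedup(cycles):6.1f}x")
```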
- lawrencechena month ago
  codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." ~53 minutes.
  - jstummbilliga month ago
    Will you look at this man's prompting skills?!
  - dudewhocodesa month ago
    Serious prompt engineering right here
  - mettamagea month ago
    Wow, is gpt-5-2-codex-xhigh really that good in general? Is this the 200$ per month version?
    woadwarrior01a month ago
    gpt-5.2-codex xhigh with OpenAI codex on the $20/month plan got to 1526 cycles with OP's prompt for me. Meanwhile claude code with Opus 4.5 on the team premium plan ($150/month) gave up with a bunch of contrived excuses at 3433 cycles.
- HarHarVeryFunnya month ago
  That Claude Opus 4.5 result of 4,973 is what you get if you just vectorize the reference kernel. In fact you should be under 4,900 doing that with very little effort (I tried doing this by hand yesterday).
  The performance killer is the "random" access reads of the tree node data which the scalar implementation hides, together with the lack of load bandwidth, and to tackle that you'd have to rewrite the kernel to optimize the tree data loading and processing.
- ponyousa month ago
  Very interesting thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it ended it seems like there is a lot more potential.
- a24ja month ago
  Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.
  - languid-photica month ago
    Sure!
    https://github.com/voratiq/voratiq
    a24ja month ago
    Thanks so much!!
- raphaelja month ago
  Could you try with some open-weighted models, e.g. Qwen3-coder, GLM-4.7 or Devstral-2?
  - kevindaya month ago
    I tried GLM-4.7 running locally on a beefy GPU server, in about 3 minutes it got to 25846 cycles, but then struggled in circles for about 90 minutes without making any meaningful progress, making the same mistakes repeatedly and misdiagnosing the cause most of the time. It seems to understand what needs to happen to reach the goal, but keeps failing on the implementation side. It seemed to understand that to beat the target an entirely new approach would be required (it kept leaning towards a wavefront design), but wasn't seeing the solution due to the very limited ISA.
- forgotpwd16a month ago
  Could you make a repo with solutions given by each model inside a dir/branch for comparison?
  - kitrak95a month ago
    Are you giving instructions to a stranger on the internet?
    forgotpwd16a month ago
    Instructions?! Just asked since GP already did it. No need to realize top comment's "DDOS attack on other AI companies" joke.
    edf13a month ago
    I think he’s asking rather than giving instructions
    pelagicAustrala month ago
    He's prompting
- giancarlostoroa month ago
  I do wonder how Grok would compare, specifically their Claude Code Fast model.
game_the0rya month ago
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.
- paxysa month ago
  This is simply to enter the recruiting pipeline. once you're in you will do the same leetcode interviews as everyone else.
  - alt227a month ago
    You would hope that if you manage to beat their engineers best optimisations at launch, then you would leapfrog a certain amount of the initial stages.
    Then again, this may just be a way to get free ideas at optimising their product from outside the box.
    benlivengooda month ago
    One could use any number of LLMs on a take-home problem so in-person interviews are a must.
    legela month ago
    One could use any number of LLMs on real-world problems.
    Why are we still interviewing like it's 1999?
    game_the0rya month ago
    Old habits die hard. And engineers are pretty lazy when it comes to interviews, so just throwing the same leetcode problem into coder pad in every interview makes interviews easier for the person doing the interview.
    selkina month ago
    If you want people to interview better, you have to both allocate resources to it, and make it count on perf. It’s not laziness, it’s ROI.
    yodsanklaia month ago
    As an interviewer, I ask the same problems because it makes it much easier to compare candidates.
    game_the0rya month ago
    How do you know if one candidate happened to see the problem on leetcode and memorized the solution versus one who struggled but figured it out slower?
    yodsanklaia month ago
    It's very easy to tell, but it doesn't make much difference. The best candidates have seen the problems before and don't even try to hide it, they just propose their solution right away.
    I try give positive feedback for candidates who didn't know the problem but could make good use of hints, or had the right approach. But unfortunately, it's difficult to pass a Leetcode interview if you haven't seen a similar problem to what is asked before. Most candidates I interview nowadays seem to know all questions.
    That's what the company has decided so we have to go along. The positive side is that if you do your part, you have good chances of being hired, even if you disagree with the process.
    bradlysa month ago
    It doesn’t matter. It’s about looking for candidates who have put in the time for your stupid hazing ritual. It signals on people who are willing to dedicate a lot of time to meaningless endeavors for the sake of employment.
    This type of individual is more likely to follow orders and work hard - and most importantly - be like the other employees you hired.
    legela month ago
    Once upon a time, the "stupid hazing ritual" made sense.
    Now it means company is stupid.
    benlivengooda month ago
    Because if you want to hire engineers then you have to ask engineering questions. Claude and GPT and Gemini are super helpful but they're not autonomous coders yet so you need an actual engineer to vet their outcome still.
  - driverdana month ago
    Is this a fact or an assumption?
- yodsanklaia month ago
  It would take something like one week full time to work on this. It's not something you can do if you have a full-time job and apply to several other companies. I find it unreasonable to ask a candidate to spend that much time for an uncertain result.
  It's true that being ready for leetcode takes practice, but at least it's standard so you can re-use the skills to other interviews. Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
  - tcoff91a month ago
    As long as there are qualified candidates willing to do unreasonable tasks for the chance to work at a company, there's not much incentive for the company to change their system. Those people will also probably work unreasonably hard and make unreasonable sacrifices for the company.
  - menaerusa month ago
    > It's not something you can do if you have a full-time job
    > I find it unreasonable to ask a candidate to spend that much time
    And same for some reason does not apply to leetcode style interviews?
    > It would take something like one week full time to work on this
    I am not sure if this is satire or what? You need months of continuous preparation to be ready for the leetcode style interview.
    > Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
    No, it is not. This is specifically the type of job you would be doing tomorrow at Anthropic team if hired. And they are specifically hiring people who are already good enough at that very task. The same cannot be said for the leetcode, not even remotely comparable.
abra0a month ago
This is a really fun problem! I suggest anyone who likes optimization in a very broad sense to try their hand at it. Might be the most fun I've had while interviewing. I had to spend a week-worth of evenings on it to fully scratch the itch, and I managed to get 1112 cycles. But that was mostly manual, before the current crop of agentic models (clopus 4.5, gpt5.2). I wonder how far you can RalphWiggum it!
- lukaha month ago
  I've never heard AI-assisted coding referred to as "RalphWiggum"ing a problem, and now I will have to use that always. Thank you.
  - usgroupa month ago
    https://awesomeclaude.ai/ralph-wiggum
- clocksmitha month ago
  Did you get an offer?
avaera month ago
It's pretty interesting how close this assignment looks to demoscene [1] golf [2].
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
- wiz21ca month ago
  I was in the demoscene long ago and that kind of optimisation is definitely in the ballpark of what we did: optimize algorithm down to machine code level (and additionally, cheat like hell to make you believe we ran the algorithm for real :-)).
  But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
  - saagarjhaa month ago
    It’s some useless problem like a random tree walk or something like that, the actual algorithm is not particularly important to the problem
    psb217a month ago
    Yeah, I assume it was partly chosen since the problem structure provides some convenient hooks for selectively introducing subtle and less subtle inefficiencies in the baseline algorithm that match common optimization patterns.
- KeplerBoya month ago
  perfetto is pretty widely used for such traces, because building a viewer for your traces is a completely avoidable pain.
- nice_bytea month ago
  it's designed to select for people who can be trusted to manually write ptx :-)
sureglymopa month ago
Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.
As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.
- forgotpwd16a month ago
  Unless misread, 2 hours isn't the time limit for the candidate to do this but the time Claude eventually needed to outperform best returned solution. Best candidate could've taken 6h~2d to achieve this result.
  - fhd2a month ago
    Their Readme.md is weirdly obsessed with "2 hours":
    "before Claude Opus 4.5 started doing better than humans given only 2 hours"
    "Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
    "Claude Opus 4.5 after 2 hours in our test-time compute harness"
    "Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
    So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours", models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
    Would be quite curious to know though. How I usually design take home assignments is:
    1. Candidate has several _days_ to complete (usually around a week).
    2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
    But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
  - alcasaa month ago
    No the 2 hours is their time limit for candidates. The thing is that you are allowed to use any non-human help for their take homes (open book), so if AI can solve it in below 2 hours, it's not very good at assessing the human.
    saagarjhaa month ago
    4 hours but AI help is (was?) allowed. I assume it was retired because of Opus basically oneshotting it
    alcasaa month ago
    Fair enough. I feel like designing AI-proof take-homes is getting ever more futile. Given the questions need to be sufficiently low context to be human-doable in a short time and timespans for AI tasks increasing, I'm not sure take homes can actually serve any filtering function whatsoever, besides checking if applicants are willing to put in a minimal amount of effort.
amirhirscha month ago
I'm at 1137 with one hour with opus now... Pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage...
I think I'm going to get sub 900 since i just realized i can in-parallel compute whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4 with less delay.....
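The "bits 16 and 0" observation can be checked with a small sketch. It assumes the last hash stage has the form `(a ^ C) ^ (a >> 16)` on 32-bit values, which matches the `a ^ (a>>16)` expression mentioned further down the thread; the constant below is a placeholder, not the one from the repository, and the function names are hypothetical.

```python
import random

MASK32 = 0xFFFFFFFF
CONST = 0x9E3779B9  # hypothetical constant, chosen only for illustration


def final_stage(a: int) -> int:
    """Assumed shape of the last hash stage: XOR with a constant and a shifted copy."""
    return ((a ^ CONST) ^ (a >> 16)) & MASK32


def parity_from_input(a: int) -> int:
    """Low bit of final_stage(a), computed only from bits 0 and 16 of the input."""
    return (a & 1) ^ ((a >> 16) & 1) ^ (CONST & 1)


# Spot-check the shortcut against the full computation on random 32-bit inputs.
rng = random.Random(0)
for _ in range(10_000):
    a = rng.getrandbits(32)
    assert final_stage(a) & 1 == parity_from_input(a)
```

Because XOR acts bit by bit, the low bit of the result depends only on bit 0 of `a`, bit 16 of `a`, and bit 0 of the constant, so the odd/even decision does not have to wait for the full stage to finish.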
- WithinReasona month ago
  Submit it to the leaderboard: https://www.kerneloptimization.fun/
  - amirhirscha month ago
    I think I can hit #1 (current #1 is 1000). sub 900 not possible though.
    Let me put down my thought process: You have to start to think of designing a 6-slot x8-len vector pipeline doing 48 hashes in parallel first, which needs at least 10 steps (if you convert three stages to multiply adds and do parallel XORs for the other three). The problem with 10 cycle hashing is you need to cram 96 scalar xors along side your vector pipeline, so that will use all 12 ALUs for 8 of those cycles. Leaving you only 24 more scalar ops per hash cycle which isn’t enough for the 48 tree value xors..
    so you must use at least 11 steps per hash, with 96 xors (including the tree value xor) done in the scalar alus using 8 steps, and giving 3*12 Alu ops per hash cycle. You need 12 more ops per hash to do odd/even, so you must be 12 stages, and just do all of the hash ops in valu, 4 cycles of 12 alus doing modulo, 8 cycles x 12 alus free
    With 12 steps and 48 in parallel, your absolute minimum could be 4096/48 x 12 = 1,024 cycles. Since stage 10 can be optimized (you don’t need the odd/even modulo cycle, and can use some of those extra scalar cycles to pre-xor the constant), that can save you ~10 cycles. 1024 gonna be real hard, but I can imagine shenanigans to get it down to 1014, sub-1000 possible by throwing more xor to the scalar alus.
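    The arithmetic behind that bound can be sketched as below. It takes the commenter's model at face value: 4096 hashes in total (256 items over 16 rounds, as in the load schedule elsewhere in the thread), 48 hashes in flight (6 vector slots of 8 lanes), and a fixed number of steps per hash. The function name and constants are illustrative, and this is a throughput estimate only, not a proof of what the simulator allows.

```python
from fractions import Fraction

TOTAL_HASHES = 256 * 16      # 4096 hashes: 256 items over 16 rounds (assumed split)
PARALLEL_HASHES = 6 * 8      # 48 hashes in flight: 6 vector slots x 8 lanes


def min_cycles(steps_per_hash: int) -> Fraction:
    """Throughput lower bound if every hash costs `steps_per_hash` issue steps."""
    return Fraction(TOTAL_HASHES * steps_per_hash, PARALLEL_HASHES)


for steps in (10, 11, 12):
    print(f"{steps} steps per hash -> {float(min_cycles(steps)):.1f} cycles")
```

    Under these assumptions, 12 steps per hash gives exactly 1024 cycles, which is why getting below 1000 requires shaving steps or moving work onto otherwise idle scalar ALUs.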
    icelancera month ago
    > sub 900 not possible though.
    I performed a similar analysis to you and found it very difficult to imagine sub-1000. Your comment I think convinced me that it may be possible, though. Interesting.
    I'm below the threshold for recruiting but not below Claude at the moment. Not sure where I am going wrong.
    amirhirscha month ago
    Here’s some other hints: combine hash stages 2 and 3, it can be two muladds and a XOR
    For the first several rounds (when every tree value is in use) combine the stage 5 XOR with the subsequent round’s tree XORs. You can determine even/odd in hash stage 5 starting with a ^ (a>>16) without XORing the constant, then you only need one XOR; this saves you a ton of XORs.
    Create separate instruction bundles for the first round, rounds 1-5 (combining hash stages 5 XOR with next round tree XORs) and 6-9 (not every tree node is used anymore), round 10 round 11-14 and round 15 and combine them.
    you can use add_imm in parallel to load consts. in stage 0 you have to load the tree first and the vals; by later stages, when everything is in scratch, you could use 12 scalar XORs and 6 vector XORs on scratch. once you vload vals, you can start to do XORs but can only advance so much at a time, so I’m starting to work on getting hash stages moving to different rounds faster to hide the initial vloads and get to the heavy load section sooner and spread the load pain.
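    The multiply-add hints rest on a simple identity: with 32-bit wraparound, a stage of the form `(a + C) + (a << k)` equals `a * (2**k + 1) + C`, so an add, a shift and another add collapse into one multiply-add. A minimal sketch follows; the constant and shift are placeholders rather than the repository's actual stage parameters, and the function names are hypothetical.

```python
import random

MASK32 = 0xFFFFFFFF


def stage_add_shift(a: int, const: int, shift: int) -> int:
    """Add/shift form of a stage: two adds and a shift, with 32-bit wraparound."""
    return ((a + const) + (a << shift)) & MASK32


def stage_muladd(a: int, const: int, shift: int) -> int:
    """The same stage as a single multiply-add: a * (2**shift + 1) + const."""
    return (a * ((1 << shift) + 1) + const) & MASK32


# Placeholder parameters, for illustration only.
CONST, SHIFT = 0x12345678, 12

rng = random.Random(0)
for _ in range(10_000):
    a = rng.getrandbits(32)
    assert stage_add_shift(a, CONST, SHIFT) == stage_muladd(a, CONST, SHIFT)
```

    The identity only covers stages built from additions and left shifts; stages that mix in XOR or right shifts do not reduce this way, which fits the comment's split between multiply-add stages and XOR stages.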
  - menaerusa month ago
    Why do you need an X account for it? Seems like a ridiculous requirement
- lalaland1125a month ago
  How do you avoid the load bottleneck?
  - amirhirscha month ago
    ======================================================================
    BROADCAST LOAD SCHEDULE
    ======================================================================
    Round | Unique | Load Strategy
    ------|--------|------------------------------------------
    0     | 1      | 1 broadcast → all 256 items
    1     | 2      | 2 broadcasts → groups
    2     | 4      | 4 broadcasts → groups
    3     | 8      | 8 broadcasts → groups
    4     | 16     | 16 broadcasts → groups
    5     | 32     | 32 broadcasts → groups
    6     | 63     | 63 loads (sparse, use indirection)
    7     | 108    | 108 loads (sparse, use indirection)
    8     | 159    | 159 loads (sparse, use indirection)
    9     | 191    | 191 loads (sparse, use indirection)
    10    | 224    | 224 loads (sparse, use indirection)
    11    | 1      | 1 broadcast → all 256 items
    12    | 2      | 2 broadcasts → groups
    13    | 4      | 4 broadcasts → groups
    14    | 8      | 8 broadcasts → groups
    15    | 16     | 16 broadcasts → groups
    Total loads with grouping: 839
    Total loads naive: 4096
    Load reduction: 4.9x
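    The totals in that schedule can be reproduced with a few lines. The per-round unique counts are copied from the table; the variable names are made up for illustration, and the sketch only redoes the arithmetic without modelling the simulator's load slots.

```python
# Unique tree nodes touched in each of the 16 rounds, copied from the schedule.
UNIQUE_PER_ROUND = [1, 2, 4, 8, 16, 32, 63, 108, 159, 191, 224, 1, 2, 4, 8, 16]
ITEMS = 256  # items processed per round, per the "all 256 items" rows

grouped_loads = sum(UNIQUE_PER_ROUND)          # one load per unique node
naive_loads = ITEMS * len(UNIQUE_PER_ROUND)    # one load per item per round
reduction = naive_loads / grouped_loads

print(f"Total loads with grouping: {grouped_loads}")
print(f"Total loads naive: {naive_loads}")
print(f"Load reduction: {reduction:.1f}x")
```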
  - amirhirscha month ago
    take advantage of index collisions, optimizing round 0 and 11, speculative pre-loading, and the early branch predictor (which now I am doing looking at bits output at stage 3)
    lzhoua month ago
    it's actually pretty funny since opus will suggest both of these with enough prying (though with a single-prompt it might not try it).
fabian4a month ago
[flagged]
- tap12783487a month ago
  [flagged]
  - epiccolemana month ago
    It definitely bears all the LLM hallmarks we've come to know. emdash, the "this isn't X. it's Y" structure - and then, to cap it off, a single pithy sentence to end it.
    nostrademonsa month ago
    Also bears all the hallmarks of an ordinary post (by someone fairly educated) on the Internet. This would make sense, because LLMs were trained on lots of ordinary posts on the Internet, plus a fair number of textbooks and scientific papers.
    epiccolemana month ago
    The — character is the biggest cause of suspicion. It's difficult to type manually so most people - myself included - substitute the easily typed hyphen.
    I know real people do sometimes use it, but it's a smell.
    nostrademonsa month ago
    I think some software will automatically substitute "smart quotes" for regular quotes and an em-dash for a double hyphen -- I know MS Word used to do this. Curious if any browsers do. This comment was typed in Brave, which doesn't appear to, but I didn't check if Chrome or IE or Opera does.
    menaerusa month ago
    The comment was not wrong though so I am not sure I understand if flagging it for the sole "it was most likely written by the use of AI" reason is completely valid.
    haliskerbasa month ago
    I've noticed people who are using LLMs more, myself included, are starting to talk like that.
    Oops I mean, you're absolutely right, those ARE hallmark signs of an LLM. Let me breakdown why this isn't just your imagination but actually...
bytesandbitsa month ago
Having done a bunch of take home for big (and small) AI labs during interviews, this is the 2nd most interesting one I have seen so far.
- pettersa month ago
  And the answer to the obvious follow-up question is...?
  - mrklola month ago
    Milk before cereals
    matthews3a month ago
    Milk, then cereal, then bowl!
    Xmd5aa month ago
    How about a bowl, and then, 30 minutes ~ 1 hour later, milk with cereals?
  - darkwatera month ago
    Maybe it's under NDA :)
  - kevthecodera month ago
    42
  - reader9274a month ago
    fries
  - bytesandbitsa month ago
    imbue
koolbaa month ago
What is the actual assignment here?
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
- glalondea month ago
  "Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator." from perf_takehome.py
- vermilinguaa month ago
  Think that means you failed :(
  - nice_bytea month ago
    +1
    being cryptic and poorly specified is part of the assignment
    just like real code
    in fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.
    throwaway81523a month ago
    I didn't see much cryptic except having to click on "perf_takehome.py" without being told to. But, 2 hours didn't seem like much to bring the sample code into some kind of test environment, debug it enough to work out details of its behaviour, read through the reference kernel and get some idea of what the algorithm is doing, read through the simulator to understand the VM instruction set, understand the test harness enough to see how the parallelism works, re-code the algorithm in the VM's machine language while iterating performance tweaks and running simulations, etc.
    Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
    tayo42a month ago
    2 hours does seem short. It took me a half hour to get through all you listed and figure out how to get the valu instruction working.
    I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
    Idk maybe I'm slow or really not qualified.
    ithkuila month ago
    My instinct to read about the problem was to open the "problem.py" file, which states "Read the top of perf_takehome.py for more introduction"
    So yeah. They _could_ have written it much more clearly in the readme.
    nice_bytea month ago
    it's "cryptic" for an interview problem. e.g. the fact that you have to actually look at the vm implementation instead of having the full documentation of the instruction set from the get go.
    throwaway81523a month ago
    That seems normal for an interview problem. They put you in front of some already-written code and you have to fix a bug or implement a feature. I've done tons of those in live interviews. So that part didn't bother me. It's mostly the rather large effort cost in the case where the person is a job applicant, vs an unknown and maybe quite low chance of getting hired.
    With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
    There's another thread about this article, which explains an analogous situation about being asked to read AI slop: https://zanlib.dev/blog/reliable-signals-of-honest-intent/
    avaera month ago
    It's definitely cleaner than what you will see in the real world. Research-quality repositories written in partial Chinese with key dependencies missing are common.
    IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is because it would make evaluation harder + more costly.
NightBlossoma month ago
I just withdrew my application over this test. It forces an engineering anti-pattern: requiring runtime calculation for static data (effectively banning O(1) pre-computation).
When I pointed out this contradiction via email, they ignored me completely and instead silently patched the README to retroactively enforce the rule.
It’s not just a bad test; it’s a massive red flag for their engineering culture. They wasted candidates' time on a "guess the hidden artificial constraint" game rather than evaluating real optimization skills.
- hackern3972a month ago
  This isn't the gotcha moment you think it is. Storing the result on disk is some stupid "erm achkually" type solution that goes against the spirit of the optimization problem.
  They want to see how you handle low level optimizations, not get tripped over some question semantics.
  - NightBlossoma month ago
    You are missing the point. This isn't "storing result on disk." In high-performance engineering, if the input is static and known at build time, the only correct optimization is pre-computation.
    I didn't simply "skip" the problem. I implemented a compiler that solves the problem entirely at build time, resulting in O(0) runtime execution.
    Here is the actual "Theorem" I implemented in my solution. If a test penalizes this approach because it "goes against the spirit," then the test is fundamentally testing for inefficiency.
    """ Theorem 1 (Null Execution): Let P: M → M be a program with postcondition φ(M). If ∃M' s.t. φ(M') ∧ M ≅ M', then T(P) = 0.
    Complexity: O(n) compile-time, O(0) runtime """
    If they wanted to test runtime loop optimizations, they should have made the inputs dynamic.
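To make the disagreement above concrete, here is a minimal sketch contrasting the two approaches on a toy problem. Everything in it (the tuple-based instruction format, `build_runtime_kernel`, `build_precomputed_kernel`, `run`) is hypothetical and is not the take-home's simulator or API. It only illustrates why a kernel built for fixed inputs can collapse to a constant and a store, and why that result is tied to those exact inputs.

```python
# Toy illustration only: a made-up instruction format, not the take-home's VM.

def build_runtime_kernel(n):
    """Emit instructions that sum mem[0..n-1] into mem[n] at run time."""
    program = [("const", "acc", 0)]
    for i in range(n):
        program.append(("load", "tmp", i))
        program.append(("add", "acc", "acc", "tmp"))
    program.append(("store", n, "acc"))
    return program

def build_precomputed_kernel(known_inputs):
    """Fold the whole computation at build time; valid only for these inputs."""
    total = sum(known_inputs)
    return [("const", "acc", total), ("store", len(known_inputs), "acc")]

def run(program, mem):
    """Execute the toy program on mem and return the number of steps taken."""
    regs = {}
    for op, *args in program:
        if op == "const":
            regs[args[0]] = args[1]
        elif op == "load":
            regs[args[0]] = mem[args[1]]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":
            mem[args[0]] = regs[args[1]]
    return len(program)  # one instruction per step in this toy model

inputs = [3, 1, 4, 1, 5]
mem_a = inputs + [0]
mem_b = inputs + [0]
steps_runtime = run(build_runtime_kernel(len(inputs)), mem_a)
steps_folded = run(build_precomputed_kernel(inputs), mem_b)

# The folded kernel is wrong as soon as the inputs differ from the ones it was built for.
mem_c = [1, 1, 1, 1, 1, 0]
run(build_precomputed_kernel(inputs), mem_c)
```

In this toy model the folded kernel still needs two steps to write its answer, and `mem_c` ends up holding the sum of the old inputs rather than the new ones, which is the trade-off the two commenters are arguing about.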
nine_ka month ago
This is a kind of task that's best solved by possibly spending more than the allocated 2 hours on it, once any obvious low-hanging fruit is picked. An optimization task is what a machine does best. So the real problem would be to construct a machine that would be able to run the optimization. A right optimization framework that results from the effort could also efficiently solve many more similar problems in the future.
I understand that this test is intended to somehow test the raw brainpower, the ability to tackle an unfamiliar and complicated domain, and to work under stress. But I hope it's not representative of the actual working conditions at Anthropic. It's like asking a candidate to play a Quake deathmatch when hiring to a special forces assault squad.
- saagarjhaa month ago
  > So the real problem would be to construct a machine that would be able to run the optimization.
  This is a valid way to solve the problem.
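One way to read "construct a machine that would be able to run the optimization" is an automatic scheduler that packs independent operations into the same cycle. Below is a minimal sketch of greedy list scheduling, assuming a VLIW-style machine where each cycle can issue a limited number of operations per engine. The `SLOTS` limits, the op names, and the `schedule` helper are illustrative assumptions, not values or functions from the repository.

```python
# Greedy list scheduling sketch. SLOTS and the op list are invented for illustration.
SLOTS = {"alu": 2, "load": 1, "store": 1}

def schedule(ops):
    """ops: list of (name, engine, deps). Returns a list of cycles, each a list of op names."""
    done = set()          # ops issued in an earlier cycle
    pending = list(ops)
    bundles = []
    while pending:
        used = {engine: 0 for engine in SLOTS}
        bundle = []
        for op in list(pending):
            name, engine, deps = op
            # An op may issue only if all of its dependencies issued in earlier cycles
            # and its engine still has a free slot in this cycle.
            if all(d in done for d in deps) and used[engine] < SLOTS[engine]:
                used[engine] += 1
                bundle.append(name)
                pending.remove(op)
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        done.update(bundle)
        bundles.append(bundle)
    return bundles

ops = [
    ("ld_a", "load", []),
    ("ld_b", "load", []),
    ("add1", "alu", ["ld_a", "ld_b"]),
    ("mul1", "alu", ["ld_a"]),
    ("st", "store", ["add1"]),
]
bundles = schedule(ops)
```

Here five operations fit into four cycles because `mul1` can share a cycle with the second load. A real optimizer would likely also reorder by critical path and handle latencies, which this sketch ignores.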
tucnaka month ago
The snarky writing of "if you beat our best solution, send us an email and MAYBE we think about interviewing you" is really something, innit?
- ahussaina month ago
  They wrote:
  > If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
  That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
  - 0x3fa month ago
    I suppose you could interpret it either way, but having dealt with their interview pipeline I'd choose the snark.
    dude250711a month ago
    Yeah, a nerd bypassed HR and showed their true character. They are swimming in easy money.
  - lovicha month ago
    That paraphrases to
    "do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
    It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
    miki123211a month ago
    I suspect this is partially legal CYA.
    There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in a US-sanctioned country with no plans to move, is well known for using the n-word on social media or has previously broken an NDA, Anthropic probably doesn't want to interview them.
    andrubya month ago
    I understand how it can be interpreted as snarky, but how could it have been written better? It's a hard path to walk and recruiting/interviewing is inherently sensitive it seems.
    Aurornisa month ago
    > It's a hard path to walk and recruiting/interviewing is inherently sensitive it seems.
    Hiring and interviewing is in a weird place right now. We’re coming off of a period where tech jobs were easy to get and companies were competing for candidates. A lot of candidates quickly got used to the idea of companies working hard to charm and almost beg them to join. When those candidates encounter what it’s like to apply for highly competitive companies who have 1000x more applicants than they’d ever consider, the resulting straightforwardness can be shocking.
    lovicha month ago
    The original
    >If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
    Not condescending
    > If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
    entroxa month ago
    But now the meaning is different: you went from a potential interview to a guaranteed one.
    lovicha month ago
    No fucking shit, I paraphrased Anthropic's comments as
    > do better than we have publicly admitted most of humanity can do, and we may deign to interview you
    If you think telling someone that after passing a test that 99.999% of humanity cannot pass, that they _may_ get an interview, you are being snarky/condescending.
    retsibsia month ago
    That's not how paraphrasing works. They probably intentionally held back from guaranteeing an interview, for various reasons. One that seems obvious to me is that with the bar set at "Claude Opus 4.5's best performance at launch", it's plausible that someone could meet it by feeding the problem into an LLM. If a bunch of people do that, they won't want to waste time interviewing them all.
    Nevermarka month ago
    Or honest?
    You may want to consider the distribution and quantity of replies before stating that you WILL do something that might just waste more people’s time or not be practical.
    The classy thing to do would be responding to every qualifying submission, even if it’s just to thank everyone and let some people know the field was very competitive if an interview won’t be happening.
    YetAnotherNicka month ago
    So I like these public challenges, but as someone who set some public questions, ask any company who ran any public contest for their opinion. The pool is filled with scammers who either bought the solutions through sites like Chegg or sometimes even just stackoverflow.
    lovicha month ago
    Ok, so they have a reason to be condescending in your mind.
    Does that change the fact that they are condescending?
    lechatonnoira month ago
    i think by your logic, the only thing that they do that is condescending is to say that an interview is not guaranteed.
    people are mentioning that they do this for a reason, which explains away that behavior, so yeah, it kinda does change the fact of whether they are being condescending.
    andrubya month ago
    pedantic: 0.001% of humanity is still 80K humans which might be a lot.
    (yes, yes, not every human will try this test)
    throwaway743a month ago
    I took the "perhaps" as a decision to be considered by the applicant, considering they'd be competent enough to get in at a place of their choice, not just anthropic.
    lovicha month ago
    Does the applicant or the employer decide if an interview happens in your experience?
    Do you think if the applicants are really in that level of demand that they would be getting a take home test instead of being actively recruited?
    Legitimately lay out your understanding of a world where an employer is chasing after employees who are high in demand, give them a test that is expected to take hours, and have a hedged bet in their wording, instead of saying we will absolutely hire you if you pass X bar?
- riffraffa month ago
  I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".
  - Bootvisa month ago
    Should have asked Claude how to write it better.
  - maercha month ago
    In that case, removing „perhaps“ would have helped a lot. It is not about maybe being hired, but about maybe being interviewed.
    dmurraya month ago
    They don't want to guarantee an interview to everyone who sends them an improved solution, either.
    If three people send them improvements, they'll probably get interviews. If three thousand do, the problem is easier than they thought or amenable to an LLM or one bright person figured out a trick and shared it with all his classmates or colleagues or all of GitHub.
- NewJazza month ago
  They may not be able to hire folks in certain jurisdictions. Or even interview them. (Iran, NK)
- kristopolousa month ago
  If you're an asshole that wants millions of dollars...i mean there's still places to say no
- sourcegrifta month ago
  Pride comes before fall thankfully
- altmanaltmana month ago
  it's anthropic. their entire marketing is just being a pompous ass and AI fear mongering.
FriendlyMikea month ago
They should just have you create a problem that can't be solved by an llm in two hours. That's the real problem here
- ec109685a month ago
  Solvable in more than 2 but not less than 2 would be the real trick.
- OisinMorana month ago
  "You have 1 minute to design a maze that takes 2 minutes to solve"
NitpickLawyera month ago
The writing was on the wall for about half a year (publicly) now. The oAI 2nd place at the atcoder world championship competition was the first one, and I remember it being dismissed at the time. Sakana also got 1st place in another atcoder competition a few weeks ago. Google also released a blog a few months back on gemini 2.5 netting them 1% reduction in training time on real-world tasks by optimising kernels.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
- cgearharta month ago
  I think this is the actual “bitter lesson”—the scalable solution (letting LLMs bang against the problem nonstop) will eventually far outperform human effort. There will come a point—whether sooner or later—where this’ll be the expected norm for handling such problems. I think the only question is whether there is any distinction between problems like this (clearly defined with a verifiable outcome) vs the space of all interesting computer programs. (At the moment I think there’s space between them. TBD.)
- lostmsua month ago
  1% doesn't sound like a lot at all.
  - _aavaa_a month ago
    That depends on how close to the theoretical max you think they are.
- myahioa month ago
  Sakana is a grift from what I understand
  - NitpickLawyera month ago
    Eh. I'd call them overly enthusiastic :) I know they publish hype-y stuff, they jumped the gun on a few things, I get that. But their recent result was on a "live" contest, and they did share agent traces, so that's likely a legit result.
tayo42a month ago
I wonder if the Ai is doing anything novel? Or if it's like a brute force search of applying all types of existing optimizations that already exist and have been written about.
- piokocha month ago
  How something that generates next token, given a list of previous tokens, can do something novel?
  - rellfya month ago
    By that same logic, humans would not be able to do anything novel either.
LarsKrimia month ago
I liked the core challenge. Finding the balance of ALU and VALU, but I think that the problem with the load bandwidth could lead to problems
Like optimizing for people who assume the start indices always will be zero. I am close to 100% sure that's required to get below 2096 total loads but it's just not fun
If it however had some kind of dynamic vector lane rotate that could have been way more interesting
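The ALU/VALU balance and the load-bandwidth worry can be framed as a resource bound: no schedule can finish faster than the busiest engine allows. A rough sketch follows. The per-cycle slot counts and the ALU/VALU op counts are made-up placeholders rather than figures from the simulator; only the 2096 loads come from the comment above.

```python
import math

# Placeholder slot limits per cycle; check the simulator source for the real ones.
slots_per_cycle = {"alu": 12, "valu": 6, "load": 2}

def cycle_lower_bound(op_counts, slots):
    """Each engine needs at least ceil(ops / slots) cycles; the slowest engine dominates."""
    per_engine = {e: math.ceil(n / slots[e]) for e, n in op_counts.items()}
    bottleneck = max(per_engine, key=per_engine.get)
    return per_engine[bottleneck], bottleneck

# 2096 loads is the figure quoted in the comment; the ALU and VALU counts are invented.
bound, engine = cycle_lower_bound(
    {"alu": 6000, "valu": 5400, "load": 2096}, slots_per_cycle
)
```

With these placeholder numbers the load engine is the bottleneck at 1048 cycles, so shifting work between ALU and VALU would not help until the number of loads itself goes down.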
eisbawa month ago
I got to 1364 cycles for now, semi-manually: Using design space exploration organized via backlog.md project, and then recombination from that. 20 agents in parallel.
Asked to generate drawio for the winner so I can grok it more easily, then I gave feedback.
Edit: 1121 cycles
- karmasimidaa month ago
  Same just make it a survival game
- eisbawa month ago
  1023 cycles
seamossfeta month ago
I'm getting flashbacks from my computer engineering curriculum. Probably the first place I'd start is replacing comparison operators on the ALU with binary arithmetic since it's much faster than branch logic. Next would probably be changing the `step` function from brute iterators on the instructions to something closer to a Btree? Then maybe a sparse set for the memory management if we're going to do a lot of iterations over the flat memory like this.
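The first suggestion, replacing comparisons and branches with arithmetic, is often called a branchless select. Here is a minimal sketch of the idea in plain Python rather than in the take-home's instruction set; `select`, `step_branchy`, and `step_branchless` are illustrative names, and the parity rule is just an example.

```python
def select(cond, a, b):
    """Return a if cond is 1, b if cond is 0, using only arithmetic."""
    return cond * a + (1 - cond) * b

def step_branchy(value):
    # Branching version: pick 1 for even values, 2 for odd values.
    if value % 2 == 0:
        return 1
    return 2

def step_branchless(value):
    # Same result with bit arithmetic only: the low bit is 0 for even, 1 for odd.
    return 1 + (value & 1)

results = [(step_branchy(v), step_branchless(v)) for v in range(8)]
```

On a machine where every instruction has a fixed cycle cost, the win comes from needing fewer operations and no control flow, not from avoiding branch mispredictions.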
Maroa month ago
> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
- saagarjhaa month ago
  4 hours
- mrklola month ago
  Oh, I thought candidates got 2 hours but now I am confused too
throwaway0123_5a month ago
> Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours
Is this saying that Claude matched the best human performance, where the human had two hours? I think that is the correct reading, but I'm not certain they don't mean that Claude had two hours, and matched the best human performance where the human had an arbitrary amount of time. The former is impressive but the latter would be even more so.
pickpocketa month ago
I cleared this assignment but did not clear the follow up interview that was way easier than this. So I gave up on tech interviews in general, stayed where I was.
- arsl16a month ago
  I got this but I am an embedded SWE, might not be my cup of tea
kristianpaula month ago
“If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.”
- afro88a month ago
  > at launch
  Does this confirm they actually do knee cap models after the launch period to save money, without telling users?
  - mediamana month ago
    No, they later updated the harness for this and it subsequently got better scores.
- sevenzeroa month ago
  The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.
nottorpa month ago
Is it "write 20 astroturfing but somewhat believable posts about the merits of "AI" and how it is going to replace humans"?
- atomliba month ago
  I'm afraid that position is already filled by the CEO.
- falloutxa month ago
  It should be "can you gaslight a CEO into firing 90% of their software engineers?"
demirbey05a month ago
It's more of a showcase than a take-home assignment. I couldn't understand what the task is, only performance comparisons between their LLMs
- measurablefunca month ago
  The task is ill-defined.
  - saagarjhaa month ago
    You make it faster
    measurablefunca month ago
    Fewer instructions doesn't mean it's faster. It can be faster but it's not guaranteed in general. Obvious counterexample is single threaded vs multi-threaded code. Single threaded code will have fewer instructions but won't necessarily be faster.
    saagarjhaa month ago
    It does in this case; you can read the assignment to see that it is all single-threaded
    measurablefunca month ago
    I read it, you're mistaken.
    saagarjhaa month ago
    I did the assignment my guy
    measurablefunca month ago
    That's great but I didn't ask & that's still not addressing my point.
    saagarjhaa month ago
    I didn’t ask you to be rude or wrong either, yet here we are. The assignment is explicitly single core and cycle accurate. Your point is completely irrelevant and shows a disconnect with the content being discussed.
    measurablefunca month ago
    It's neither rude nor wrong to ask for evidence to support claims being made in what appears to be corporate advertising. The claim is their LLM is better than a person, I asked for evidence. None was presented. It's not complicated.
    saagarjhaa month ago
    You first claimed this task was poorly specified (it’s not) and then completely misrepresented what it’s looking for. When I pointed this out you became defensive and claimed this was not your point at all. That’s what I’m talking about.
    measurablefunca month ago
    Still not addressing any of my points & making it personal is not going to make you any less confused.
    saagarjhaa month ago
    You’re going to have to lay out “your points” or there is no way anybody is going to respond to them. I’ve been replying to what you’ve been writing.
    measurablefunca month ago
    People manage to respond to them just fine.
    saagarjhaa month ago
    Well, I won't.
    measurablefunca month ago
    Noted.
torginusa month ago
Are you allowed to change the instruction sequence? I see some optimization opportunities - obviously the correct thing would be to write an optimizing compiler, but considering the time allotted, I'd guess you could hand-optimize it, but that feels like cheating.
- saagarjhaa month ago
  Yes, in fact this will be one of the first things you will want to do.
Incipienta month ago
>so we can be appropriately impressed and perhaps discuss interviewing.
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
- sponnatha month ago
  I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.
- qbanea month ago
  Remember: It is a company that keeps saying how much production code can be written by AI in xx years, but at the same time recruiting new engineers.
- yodsanklaia month ago
  Thanks for noticing this. I got the same feeling when reading this. It may not sound like much, and it doesn't mean it's an insufferable place to work, but it's a hint it might be.
  Rant: On a similar note, I recently saw a post on Linkedin from Mistral, where they were bragging to recruit candidates from very specific schools. That sounded very pretentious (and also an HR mistake on several levels IMHO).
mips_avatara month ago
Going through the assignment now. Man it’s really hard to pack the vectors right
svilen_dobreva month ago
if anyone is interested in trying their agent-fu, here's a more-real-world rabbit-hole i went down optimizing in 2024. Note this is now a dead project, no one's using it, and probably same for the original. i managed to get it 2x-4x faster than the original, took me several days then. btw there are some 10x optimizations possible but they break a few edge cases, so not entirely correct.
https://github.com/svilendobrev/transit-python3
htrpa month ago
Idle side note: surprised that https://github.com/anthropic is just some random dude in Australia
arsl16a month ago
Fellas should I even attempt it? I got it recently and lets say it brings back memories of computer architecture class.
spencerflema month ago
Oh wow it’s by Tristan Hume, still remember you from EyeLike!
- Graziano_Ma month ago
  I recognized the name and dug around too. I played DEFCON CTF with him back in the day!
karmasimidaa month ago
I am able to beat this 1487 benchmark by switching between LLMs, doesn't seem that hard lol. Albeit, I do not fully understand what the solution is, loll
- lostmsua month ago
  Yeah, GPT 5.2 on high got down to 1293 on the 5th try (about 32mins).
piokocha month ago
Interesting... Who would spend hours working for free for some company that promised only that they would invite you for a job interview. Maybe.
- Aurornisa month ago
  When this was being used it was probably given to candidates who had already started the interview loop and been screened.
  The current e-mail invitation in the README is just another avenue for exceptional people to apply. If someone is already highly qualified from their background and resume they can go through the front door (direct application). For those who have incredible talent but not necessarily the background or resume to unlock the front door yet, this is a fun way to demonstrate it.
- cjrpa month ago
  I guess someone who enjoys solving these kinds of problems anyway, and thinks the potential upside if they do get hired is worth it.
saagarjhaa month ago
Oh, this was fun! If you like performance puzzles you should really do it. Actually I might go back and see if I can improve on it this weekend…
greesila month ago
This is a knowledge test of GPU architecture?
- avaera month ago
  Kind of, but not any particular GPU.
  The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
  But presumably similar principles apply.
- benreesmana month ago
  It's a test of polyhedral layout algebra, what NVIDIA calls CuTe and the forthcoming C++ standard calls std::mdspan.
  This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
  - saagarjhaa month ago
    You can get pretty far without needing to care about this fwiw
    greesila month ago
    Not far enough if you're turning cash into waste heat with GPUs :)
sublimefirea month ago
Did a bit of soul searching and manually optimised to 1087 but I give up. What is the number we are chasing here? IMO I would not join a company giving such a vague problem because you can feel really bad afterwards, especially if this does not open a door to the next stage of the interview. As an alternative we could all instead focus on a real kernel and improve it :)
- trishumea month ago
  Author of the take-home here: That's quite a good cycle count, substantially better than Claude's, you should email it to performance-recruiting@anthropic.com.
pshirshova month ago
Yet Claude is the only agent which deadlocks (blocks in GC forever) after an hour of activity.
potato-peelera month ago
What does clock cycles mean? Don’t think they are referring to the cpu clock?
NightBlossoma month ago
I could only cut it down to 41 cycles.
pickpocketa month ago
i cleared this one but didn't clear the follow up interview that was way easier than this
mayankda month ago
Problem solving is eternal!
zeroCaloriesa month ago
It shocks me that anyone supposedly good enough for anthropic would subject themselves to such a one sided waste of time.
- pclmulqdqa month ago
  I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.
  - heavyset_goa month ago
    I don't do take home assignments, but when I did, I would offer to do it at my hourly rate, even if it was just an hour. It's time I would otherwise spend making money.
    Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
    Does help that I have a very public web presence and portfolio, though.
    theptipa month ago
    For many reasons, you’re not gonna get into Anthropic with that attitude.
    PlanksVariablea month ago
    And Anthropic will never land heavyset_go with their attitude. I guess we’re at an impasse.
    heavyset_goa month ago
    I don't care
    dheeraa month ago
    Time is the issue, not money.
    I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
    If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
    ramraj07a month ago
    I have foregone our take home for exceptional candidates, but let me ask you, do you also demand compensation for in person or zoom call 1-1 interviews? Surely that's the same time of your life.
    zeroCaloriesa month ago
    It signals a degree of investment from the other side if they're willing to burn their own time talking to you. I can understand a small screening process to filter candidates, but I'm not going to do your silly dance for multiple hours if you're not going to do it with me.
    heavyset_goa month ago
    They're paying with their time, and I have questions I want to ask them. It's a mutually beneficial experience.
    Being told "here do this arbitrary thing that will take 4 hours of your time and maybe we'll look at it, and then if we even bother to do that, maybe we'll respond" is different than an interview where both parties invest their time face-to-face.
  - Aurornisa month ago
    > I generally have a policy of "over 4 hours and I charge for my time."
    Worth mentioning that demanding to be paid to apply for a company is usually equivalent to rejecting the job. Most companies are going to end the interview there. Few HR departments would allow one applicant to be paid for the same interview loop as other candidates.
    I was helping out in a mentoring program during the ZIRP period when the idea of charging companies for take-home interviews started to become popular. I can’t think of anyone it actually worked for in that group. I’ve heard anecdotes online of some people doing it with success, but any company like Anthropic is just going to close your application and move on if you request to be paid for applying. They have a zillion other qualified candidates in line.
    If someone is giving a take-home problem that looks like you’re actually doing work for the company, that’s a different story. This problem is not actually work, obviously.
    pclmulqdqa month ago
    Yeah, I have told HR people this and been rejected. I do say this upfront because I don't want to send you a surprise bill. The main response I get is "OK, that's fine, don't spend more than 4 hours on it." The Anthropic recruiter told me, "no problem, it's a 4-hour test anyway."
    Aurornisa month ago
    > I do say this upfront because I don't want to send you a surprise bill.
    Sending a company a surprise bill that they didn't agree upon is bad practice. Interviews are customarily not compensated, so it's unreasonable to surprise bill someone for it.
    If you send a company a surprise bill for the interview, it's going to give the HR people a good laugh as they cross you off the candidates list. Everyone involved is going to forever remember you as the person who tried surprise billing for the interview and make a mental note to never interview you again at future companies.
    It's not a good thing to try.
    pclmulqdqa month ago
    I only mention this because I think some people have done that.
  - whateveraccta month ago
    4 hours continuous or no? I can't imagine finding 4 hours of straight focus.
    ryanjshawa month ago
    These kinds of roles are for youngsters with minimal commitments who are looking for their shot to break into a wild industry. It’s not for the middle aged single parent with FTE and just enough free time to do an extra load of laundry.
    saagarjhaa month ago
    Continuous
    whateveraccta month ago
    damn that sucks
    i guess that ensures you either hire the childless
    or those with children who are fine with not being present for that long willingly (so they are probably gonna be job-obsessed enough)
    or they are currently unemployed so they won't have an existing job as anchoring leverage
    well played, anthropic
    saagarjhaa month ago
    I’m trying to imagine what would make it impossible to not pay attention to your children for four hours and the only thing I can think of that can’t be scheduled around is…a very young newborn, maybe? If they’re prone to waking up constantly?
    whateveraccta month ago
    Babies and toddlers need parental care and attention too.
    saagarjhaa month ago
    Usually you can get four hours where they’re not likely to bother you.
    scottyaha month ago
    I can't imagine wanting to hire someone as an FTE who is unable to spend 4hrs working in a day.
    whateveraccta month ago
    i can't, but i've put out staff-level work and been happily paid accordingly for years now
    nobody i know ever spends 4hrs uninterrupted working remotely lolol
- djmipsa month ago
  If you look at it as a puzzle game then it's not any different than the time you use to play other games.
  - aleph_minus_onea month ago
    > it's not any different than the time you use to play other games.
    This assumes that the candidate has a lot of time for playing other games.
- browningstreeta month ago
  I’ve been sent the Anthropic interview assignments a few times. I’m not a developer so I don’t bother. At least at the time they didn’t seem to have technical but not-dev screenings. Maybe they do now.
  - throwa356262a month ago
    Care to elaborate on the first part?
    Did you apply for a position? Did they send you the assignment without prior discussion?
- sealecka month ago
  Why is writing code to execute a program using the fewest instructions possible on a virtual machine a waste of time?
  - 0x3fa month ago
    The expected time you spend on it is much less than the expected time they'll spend on it.
  - efilifea month ago
    you don't get paid for it
- mips_avatara month ago
  It’s kind of an interesting problem.
dhruv3006a month ago
I wonder if OpenAI follows suit.
- rvza month ago
  They should.
SinghCodera month ago
why is their github handle anthropics and not anthropic :D
alexpadulaa month ago
Looks rather fun!
mrdootdoota month ago
“In English, Data”
yasmineroy3324 days ago
[dead]
OhNoNotAgain_99a month ago
[dead]
mannykannota month ago
I beat the target by deleting the parts that were causing the cycle count to be too high. /s
- eisbawa month ago
  submit and see if Anthropic accepts it
kartibbba month ago
[flagged]
kartibbba month ago
[flagged]
tmp-127853716a month ago
[flagged]
- falloutxa month ago
  Well, working under someone who keeps insisting software engineering is dead sounds like a toxic work environment.
- woofa month ago
  "1) Python is unreadable."
  Would you prefer C or C++?
  "2) AI companies are content with slop and do not even bother with clear problem statements."
  It's a filter. If you don't get the problem, you'll waste their time.
  "3) LOC and appearance matter, not goals or correctness."
  The task was goal+correctness.
  "4) Anthropic must be a horrible place to work at."
  Depends on what you do. For this position it's probably one of the best companies to work at.
  - tap12783487a month ago
    It is a filter for academics who write horrible Python code and feel smart, yes.
    I think they also have open positions for stealing other people's code and DDoS-ing other people's websites.
  - am17ana month ago
    > "1) Python is unreadable." Would you prefer C or C++?
    Unironically, yes. Unless I never plan to look at that code again.
myahioa month ago
[flagged]
jackblemminga month ago
Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
- onion2ka month ago
  > And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
  You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
  I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
- tmulea month ago
  Your comment history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)
  - jackblemminga month ago
    And they’re all dumber than John von Neumann, who cares?
    margalabargalaa month ago
    Transitively, you haven't thought the most thoughts or cared the most about anything, therefore we should disregard what you think and care about?
    jackblemminga month ago
    The person replying was trying to turn the conversation into some sort of IQ pissing contest. Not sure why, that seems like their own problem. I was reminding them that there is always someone smarter.
    wiseowisea month ago
    Your comment history is littered with “nerds”, “elite”, “better” and all sorts of comparisons.
    > I was reminding them that there is always someone smarter.
    And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
    efilifea month ago
    > Do you have some high school trauma?
    I am not sure ad personam is appropriate here
    wiseowisea month ago
    This is a thread about their personality.
    https://news.ycombinator.com/item?id=46701378
    jackblemminga month ago
    Where I come from, nerd is a term of endearment buddy.
    > And even with this comment you literally do not understand that you have some skewed view of the world.
    I’m well aware I don’t have a perfect view of reality and the map isn’t the territory. Do you?
    wiseowisea month ago
    My bad. I jumped to an incorrect conclusion. Sorry.
- mugluga month ago
  If they're hiring performance engineers then they're hiring for exactly these sets of skills.
  It's a take-home test, which means some people will spend more than a couple of hours on it to make their answer really good. They would have gone after those people in particular.
- Analemma_a month ago
  This would be an inappropriate assignment for a web dev position, but I'm willing to bet that a 1% improvement in cycles per byte in inference (or whatever) saves Anthropic many millions of dollars. This is one case where the whiteboard assignment is clearly related to the actual job duties.
- rvza month ago
  > Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
  Good. That should be the minimum requirement.
  Not another Next.js web app take home project.
- saagarjhaa month ago
  The solution was explicitly graded on creativity fwiw