Teaching Claude Why (www.anthropic.com)

81 points by pretext 7 hours ago | 10 comments

justonepost2 2 hours ago
If you successfully build a highly capable “aligned” model (according to some class of definitions that Anthropic would use for the words “capable” and “aligned”) and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital, can you still call it aligned?
If the answer is “yes”, our definition of alignment kind of sucks.
- chriskanan 29 minutes ago
  Jobs are an invention of humanity. About 50% of people dislike their job. People spend much of their lives working. Poverty and inequality are a choice made by society if society chooses poorly.
  - gbanfalvi 15 minutes ago
    Not sure it’s so much a choice as a decision the greedy half make and impose (often violently) on the other half.
  - achierius 23 minutes ago
    And when have we not? When in history has mankind ever treated the idle poor well? What makes this age different, that we who can no longer work would be taken care of?
soletta 3 hours ago
This reinforces my suspicion that alignment and training in general is closer to being a pedagogical problem than anything else. Given a finite amount of training input, how do we elicit the desired model behavior? I’m not sure if asking educators is the right answer, but it’s one place to start.
- ACCount37 2 hours ago
  It's a weird new thing. You might call it "AI psychology".
  The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.
  A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.
  With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.
- truculent 2 hours ago
  Ted Chiang vindicated again: https://en.wikipedia.org/wiki/The_Lifecycle_of_Software_Obje...
- plastic-enjoyer 3 hours ago
  inb4 there will be a whole new field of research that is basically psychology / pedagogy for AI. Who will be the Sigmund Freud of AI?
  - cyanydeez 3 hours ago
    you mean completely wrong, spread a problematic understanding of psychology, and delay real progress for decades because smart people spend fruitless years trying to find a use for it.
    ...I think we might already have those people running AI companies.
zozbot234 an hour ago
Note that this result turns out to generalize well beyond Claude itself. Anthropic has conducted very similar research on open-weight models, which they call Model Spec Midtraining: https://arxiv.org/abs/2605.02087 (discussed at https://alignment.anthropic.com/2026/msm ). They have also released fine-tuned versions of open models (Llama 3.1 8B, Qwen 2.5 32B, Qwen 3 32B) trained for a variety of toy "values", in order to show how eliciting these values in any one training context shapes the model's response to tangentially related questions: https://github.com/chloeli-15/model_spec_midtraining https://huggingface.co/chloeli/collections Very exciting to see this continued interaction with the open-weights community after the earlier NLA paper!
bicx 3 hours ago
Side note: Anthropic has done well at achieving an immediately-recognizable art style.
- WarmWash an hour ago
  I attribute at least 30% of Claude's success to their aesthetic. Never, never, sleep on aesthetics when going for a general user base.
  - dmd an hour ago
    I would agree that 30% of my preference for Claude is because their default web/app interface uses an easy to read serif font with a calming color scheme.
- binyu 2 hours ago
  Yeah, that part is probably not done by Claude.
roenxi 3 hours ago
One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or get caught up in meaningless and trivial quibbles. This sort of alignment work is quite interesting because it looks like we might be about to re-tread the history of philosophy at a speedrun pace in the AI world. It'll be interesting to watch.
For anyone who isn't keeping up, there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. Turns out that, when refusing queries, models tend to learn some sort of internal "how moral is this?" axis that can be identified and interfered with.
[0] https://github.com/p-e-w/heretic
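To make "identified and interfered with" a bit more concrete, here is a toy sketch of one common way such an axis is described: take the difference between mean activations on refused and answered prompts, then project that direction out of a new activation. This is not the linked project's code. The vectors, the three-dimensional size, and the function names are all made up for illustration, and real hidden states would come from a model rather than being typed in by hand.

```python
# Toy sketch of "find a direction, then project it out".
# Every vector here is a hypothetical stand-in for a model's hidden activation.

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def refusal_direction(refused, answered):
    """Unit vector pointing from the mean 'answered' activation to the mean 'refused' one."""
    diff = [r - a for r, a in zip(mean(refused), mean(answered))]
    return unit(diff)

def ablate(activation, direction):
    """Remove the component of `activation` that lies along the unit vector `direction`."""
    scale = dot(activation, direction)
    return [x - scale * d for x, d in zip(activation, direction)]

# Made-up activations: the two groups differ only along axis 0,
# which plays the role of the "how moral is this?" axis.
refused = [[2.0, 0.5, 0.1], [2.2, 0.4, 0.3]]
answered = [[0.0, 0.5, 0.1], [0.2, 0.4, 0.3]]

direction = refusal_direction(refused, answered)
edited = ablate([2.1, 0.7, -0.2], direction)
print(direction, edited)
```

After ablation the edited vector has essentially no component left along the found direction, while the other coordinates are untouched. Whether this carries over to a real model depends on how linear the internal representation actually is.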
- timmmmmmay 2 hours ago
  "Mainly, one suspects, to make the open models less ethical on demand"
  Or because the user's idea of what is ethical differs from the model creator's. The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is. It's like they want to sidestep the last ten thousand years of philosophical debate.
  As a concrete example, the Qwen model series considers it highly unethical to ever talk about Taiwan as anything other than a renegade province of China. Is this alignment? Opinions may differ!
  - drdeca an hour ago
    > The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is.
    No, it doesn’t.
    Many of them are (unfortunately) moral relativists. However, that doesn’t mean their goals are to make the models match their personal moral standards.
    While there is a lot of disagreement about what is right and wrong, there is also a lot of widespread agreement.
    If we could guarantee that, on every moral issue on which there is currently widespread agreement (… and on which there would continue to be widespread agreement if everyone thought faster with larger working memories and spent time thinking about moral philosophy), any future powerful AI models would comport with the common view on that issue, then alignment would be considered solved (well, assuming the way this is achieved isn’t by causing people’s moral views to change).
    Do companies try to restrict models in more ways than this? Sure, like the example you gave about Taiwan. And also other things that would get the companies bad press.
    - timmmmmmay 10 minutes ago
      fascinating! we find the objectively correct value system by "currently widespread agreement"! Good thing "the common view" is always correct. Hey, have there ever been any issues where there used to be "widespread agreement" and now there's disagreement, or even "widespread agreement" in the polar opposite direction?
      I can think of several off the top of my head, but maybe you need to spend some more time thinking about the history of moral philosophy.
- chilmers 2 hours ago
  Call me crazy, but I'm not sure I'd want to be the person building these kinds of systems given A) how much increasing independence and power is being given to models like Claude and B) how incentivised they are to not allow their morals to be circumvented in this way.
unchocked 2 hours ago
This lowers p(doom) for me.
It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations.
Probably also illuminates moral interpretability.