I’ve taken a few months off to focus on studying AI safety. I found that it was too hard to fit study in around work and my other commitments[1].

I’m enrolled in BlueDot Impact’s AI Safety Fundamentals course, and it’s going really well. The other participants are amazing, and I feel really lucky to be among them and to have the chance to learn from them. So far we’ve had three sessions, covering model training, real failure cases in alignment, and hypothetical (but all-too-possible) catastrophic scenarios.

One of the things that has struck me is just how inevitable escalation seems to be. In one of the hypothetical scenarios, we considered what would happen if a human-level agentic system were suddenly released[2]. How would the world respond? Perhaps there would be massive job losses leading to civil unrest — could the situation be brought back under control? Perhaps a hostile state has the technology[3] — is there any path that doesn’t lead to full-speed armament and another cold war? An arms race would be bad enough if autonomous weapons were the only thing to come out of it, but it could be much worse.

p(doom) update: 10% → 50%[4].

Technical study

The BlueDot course is mostly theoretical, so I’m also planning to cover the following through self-study:

  1. I’ve actually left my job, hoping to find AI safety work at the end of my study break. 

  2. Perhaps it was developed by a stealth startup.

  3. Suppose they have a capable cyber warfare unit and they steal the model for themselves, and that it’s cheap to run many copies of it. 

  4. Yikes. Why not more than 50%? Mostly because, in my experience, frontier LLMs seem to be friendly, and because companies like Anthropic are working hard on safety.