Reading list

The papers, courses and books we keep coming back to in sessions. You don't need to read everything — pick the section that matches where you are.

Orientation

If you're new to all of this, start here. No technical background needed.

AI Safety Fundamentals

BlueDot Impact

Free, structured courses on alignment and governance. The single best on-ramp; several of us went through it ourselves.

Risks from advanced AI — problem profile

80,000 Hours

A careful, readable case for why this might be one of the most pressing problems — with honest treatment of the counterarguments.

Core views on AI safety

Anthropic, 2023

A frontier lab's own account of when and why AI could become dangerous, and what they think can be done about it.

Statement on AI risk

CAIS, 2023

One sentence, signed by many of the field's leading researchers and lab heads. Useful as a marker of how mainstream the concern has become.

AISafety.com

Field map

A living map of the AI safety ecosystem — organisations, courses, funders, communities. Good for finding your next step.

Technical alignment

The core research problem: how do you make capable AI systems reliably do what their designers and users intend?

Concrete problems in AI safety

Amodei et al., 2016

The paper that translated vague worries into concrete research problems — reward hacking, safe exploration, robustness to distribution shift. Still a useful vocabulary.

The alignment problem from a deep learning perspective

Ngo, Chan & Mindermann, 2022

Why standard training methods could produce systems that pursue goals their developers didn't intend. Probably the best single technical overview.

Transformer Circuits

Anthropic, ongoing

Mechanistic interpretability — actually opening up neural networks to see what's computed inside. Start with "Toy Models of Superposition."

Constitutional AI: harmlessness from AI feedback

Bai et al., 2022

Anthropic's method for training models against a written set of principles, using AI feedback rather than only human labels. One of the techniques used to align frontier models in practice.

Alignment Forum

Forum

Where the field argues with itself in public. Uneven, but the best place to watch ideas get stress-tested.

Shallow review of technical AI safety, 2025

LessWrong, 2025

A broad survey of where technical safety research stands right now — useful for seeing which problems are getting traction and which remain open.

Risk & strategy

The bigger-picture arguments — for and against treating advanced AI as a potential catastrophe.

Is power-seeking AI an existential risk?

Carlsmith, 2022

The most careful version of the x-risk argument, built as an explicit chain of premises with probabilities attached — so you can see exactly where you'd disagree.

An overview of catastrophic AI risks

Hendrycks et al., 2023

A taxonomy of how things could go badly — malicious use, racing dynamics, organisational failure, rogue systems — beyond just the classic misalignment story.

Why I am not (as much of) a doomer (as some people)

Scott Alexander, 2023

A thoughtful case for lower risk estimates, from inside the conversation. We try to keep the strongest versions of both sides on the table.

AI 2027

Kokotajlo et al., 2025

A concrete, contested scenario forecast of the next few years. Read it less as prediction and more as a way to make your own expectations explicit.

Governance

Policy levers, international coordination, and what governments can do now.

GovAI research

Centre for the Governance of AI

A leading research hub for AI governance, covering compute governance, frontier-lab policy and international agreements.

Governance of the Transition to AGI

UNCPGA, 2025

A report for the UN General Assembly on governing the transition to AGI. It sets out urgent policy considerations at a time when industry leaders anticipate AGI within this decade.

Managing extreme AI risks amid rapid progress

Bengio et al., Science 2024

A short consensus piece by senior researchers on what governments should do now. A good baseline for policy discussions.

India

The domestic AI landscape — policy, community, and ecosystem.

IndiaAI

Govt. of India

The national AI programme — useful for understanding where Indian state capacity and attention currently sit.

India AI Tracker

Tracks AI developments, policy moves and ecosystem growth across India — a useful pulse-check on the domestic landscape.

Carnegie India — technology research

Carnegie India

Some of the most serious writing on AI policy from an Indian vantage point.

Effective Altruism India

EA India

The Indian EA community hub — connects people working on high-impact causes including AI safety, and a good way to find others thinking about these problems locally.

Books

For longer-form treatment. Links go to overviews; copies circulate within the group.

The Alignment Problem

Brian Christian, 2020

The best narrative introduction — how alignment research actually emerged, told through the people doing it.

Human Compatible

Stuart Russell, 2019

The co-author of the field's standard textbook argues that AI's standard objective is itself the problem, and proposes a redesign.

Superintelligence

Nick Bostrom, 2014

The book that put x-risk on the map. Dated in places, but many of the core concepts still frame the debate.

Going deeper

If the reading has you wanting to contribute, these are the doors people in our group have actually used or recommend.

MATS (ML Alignment & Theory Scholars)

Research programme

A mentored research programme that has become one of the main pipelines into full-time alignment work.

ARENA

Technical bootcamp

An intensive technical curriculum (interpretability, RL, evals) — also excellent for self-study; the materials are open.

Apart Research

Hackathons & sprints

Regular global research sprints and hackathons — a low-cost way to try doing safety research before committing to it.

80,000 Hours job board

Careers

The most complete listing of safety-relevant roles across research, engineering, policy and operations.