johnswentworth
I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)

  • Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).
  • The "Getting What We Measure" scenario from Paul's old "What Failure Looks Like" post.
  • The "fusion power generator scenario".
  • Perhaps someone trains a STEM-AGI, which can't think about humans much at all. In the course of its work, that AGI reasons that an oxygen-rich atmosphere is very inconvenient for manufacturing, and aims to get rid of it. It doesn't think about humans at all, but the human operators can't understand most of the AI's plans anyway, so the plan goes through. As an added bonus, nobody can figure out why the atmosphere is losing oxygen until it's far too late, because the world is complicated and becomes more so with a bunch of AIs running around, and no one AI has a big-picture understanding of anything either (much like today's humans have no big-picture understanding of the whole human economy/society).
  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they'r
links 1/13/2025: https://roamresearch.com/#/app/srcpublic/page/01-13-2025

  • https://www.construction-physics.com/p/why-skyscrapers-became-glass-boxes
    • plain glass box skyscrapers were, in fact, more cost-effective for developers. it's not all about architectural tastes. architects in real life are very far from all-powerful.
      • in fact, I really think people should stop writing books/movies/etc about auteur architects; it only encourages more young people to go into architecture and become unemployed. I'm looking at you, Francis Ford Coppola
  • https://www.betonit.ai/p/the-typical-man-disgusts-the-typical
    • pointing in the right direction, but overstated/inflammatory. women don't go around being "disgusted" by every man they interact with socially.
    • rather, most women find the idea of having sex with a randomly selected unfamiliar man disgusting, even if there's nothing particularly the matter with him. typical straight women are cautious/selective about sex and fairly slow to warm up sexually to new people. not much "lust at first sight."
    • but yeah, getting rejected when you ask women out does not in fact mean you are inadequate or unattractive! getting rejections in dating is normal, just like every author gets rejected manuscripts and every job applicant gets rejected from jobs. the average man gets a lot of "no"s and at least one "yes", and eventually marries a "yes."
    • also i share Bryan Caplan's view that women shouldn't be offended by being asked out by someone they aren't interested in. sure, persistent harassment can be a problem, but a simple question isn't.
  • https://nabeelqu.substack.com/p/principles Nabeel Qureshi
    • I agree with most of this, but "you don't need much sleep" is very individual. some of us very much need plenty of sleep and our lives improve dramatically when we face that fact.
  • things I googled while reading about Venice
    • https://en.m.wikipedia.org/wiki/Festa_della_Sensa
    • https://en.m.wikipedia
Sometimes I wonder if people who obsess over the "paradox of free will" are having some "universal human experience" that I am missing out on. It has never seemed intuitively paradoxical to me, and all of the arguments about it seem either obvious or totally alien. Learning more about agency has illuminated some of the structure of decision making for me, but hasn't really affected this (apparently) fundamental inferential gap. Do some people really have this overwhelming gut feeling of free will that makes it repulsive to accept a lawful universe?
Raemon
My Current Metacognitive Engine

Someday I might work this into a nicer top-level post, but for now, here's the summary of the cognitive habits I try to maintain (and reasonably succeed at maintaining). Some of these are simple TAPs, some of them are more like mindsets.

  • Twice a day, asking “what is the most important thing I could be working on and why aren’t I on track to deal with it?”
    • you probably want a more specific question (“important thing” is too vague). Three example specific questions (but, don’t be a slave to any specific operationalization)
      • what is the most important uncertainty I could be reducing, and how can I reduce it fastest?
      • what’s the most important resource bottleneck I can gain, or contribute to the ecosystem, and what would gain me that resource the fastest?
      • what’s the most important goal I’m backchaining from?
  • Have a mechanism to iterate on your habits that you use every day, and frequently update in response to new information
    • for me, this is daily prompts and weekly prompts, which are:
      • optimized for being the efficient metacognition I obviously want to do each day
      • include one skill that I want to level up in, that I can do in the morning as part of the meta-orienting (such as operationalizing predictions, or “think it faster”, or whatever specific thing I want to learn to attend to or execute better right now)
  • The five requirements each fortnight:
    • be backchaining
      • from the most important goals
    • be forward chaining
      • through tractable things that compound
    • ship something
      • to users every fortnight
    • be wholesome
      • (that is, do not minmax in a way that will predictably fail later)
    • spend 10% on meta (more if you’re Ray in particular but not during working hours. During working hours on workdays, meta should pay for itself within a week)
  • Correlates:
    • have a clear, written model of what you’re backchaining from
    • have a clear, written model of
"In an argument between a specialist and a generalist, the expert usually wins by simply (1) using unintelligible jargon, and (2) citing their specialist results, which are often completely irrelevant to the discussion. The expert is, therefore, a potent factor to be reckoned with in our society. Since experts both are necessary and also at times do great harm in blocking significant progress, they need to be examined closely. All too often the expert misunderstands the problem at hand, but the generalist cannot carry through their side to completion. The person who thinks they understand the problem and does not is usually more of a curse (blockage) than the person who knows they do not understand the problem." -- Richard W. Hamming, “The Art of Doing Science and Engineering”

***

(Side note: I think there's at least a 10% chance that a randomly selected LessWrong user thinks it was worth their time to read at least some of the chapters in this book. Significantly more users would agree that it was a good use of their time (in expectation) to skim the contents and introduction before deciding if they're in that 10%. That is to say, I recommend this book.)

Recent Discussion

"There's a 70% chance of rain tomorrow," says the weather app on your phone. "There’s a 30% chance my flight will be delayed," posts a colleague on Slack. Scientific theories also include chances: “There’s a 50% chance of observing an electron with spin up,” or (less fundamental) “This is a fair die — the probability of it landing on 2 is one in six.”

We constantly talk about chances and probabilities, treating them as features of the world that we can discover and disagree about. And it seems you can be objectively wrong about the chances. The probability of a fair die landing on 2 REALLY is one in six, it seems, even if everybody in the world thought otherwise. But what exactly are these things called “chances”?

Readers...

One pet peeve of mine is that actual weather forecasts for the public don't disambiguate interpretations of rain chance. Is it the chance of any rain at some point in that day or hour? Is it the expected proportion of that day or hour during which it will be raining?
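To make the gap between those two readings concrete, here is a minimal sketch. It assumes, purely for illustration, that each hour of the day rains independently with the same chance; real weather is not like this, and the function names are hypothetical.

```python
def chance_of_any_rain(hourly_chance, hours=24):
    """Reading 1: probability of at least one rainy hour.

    Assumes (for illustration only) that hours are independent.
    """
    return 1.0 - (1.0 - hourly_chance) ** hours


def expected_rainy_fraction(hourly_chance):
    """Reading 2: expected fraction of hours with rain.

    By linearity of expectation this is just the per-hour chance,
    whether or not the hours are independent.
    """
    return hourly_chance
```

Under these assumptions a 5% hourly chance gives roughly a 71% chance of any rain during the day, while the expected rainy fraction of the day stays at 5%. The same underlying weather could honestly be reported with either number.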

transhumanist_atom_understander
Yes, de Finetti's theorem shows that if our beliefs are unchanged by exchanging members of the sequence, that's mathematically equivalent to having some "unknown probability" that we can learn by observing the sequence. Importantly, this is always against some particular background state of knowledge, in which our beliefs are exchangeable. We ordinarily have exchangeable beliefs about coin flips, for example, but may not if we had less information (such as not knowing they're coin flips) or more (like information about the initial conditions of each flip). In my post on unknown probabilities, I give more detail on how they are defined, which turns out to involve a specific state of background knowledge, so they only act like a "true" probability relative to that background knowledge. I also describe how they can be interpreted as part of a physical understanding of the situation. Personally, rather than observing that my beliefs are exchangeable and inferring an unknown probability as a mathematical fiction, I would rather "see" the unknown probability directly in my understanding of the situation, as described in my post.
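For reference, the standard statement of de Finetti's representation for binary outcomes can be written as below. The notation is an assumption added here, not taken from the comment: the $X_i$ are the 0/1 outcomes of an infinitely extendable exchangeable sequence, $\theta$ plays the role of the "unknown probability", and $\mu$ is the belief distribution over $\theta$ implied by the background state of knowledge.

```latex
P(X_1 = x_1, \dots, X_n = x_n)
  = \int_0^1 \theta^{k} (1 - \theta)^{n - k} \, d\mu(\theta),
\qquad k = \sum_{i=1}^{n} x_i
```

The right-hand side depends on the observations only through the count $k$, which is exactly the exchangeability condition: reordering the sequence leaves the probability unchanged.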
winstonBosan
Great stuff! I don't have strong fundamentals in math and statistics but I was still able to hobble along and understand the post. It reminds me of what Rissanen said about data/observation - that data is really all we have, and there is no true state of nature. Our job is to squeeze as much alpha out of observation as possible, instead of trying to find a "true" generator function. This post hit the same spot for me :)

TLDR: LessWrong + Lighthaven need about $3M for the next 12 months. Donate or send me an email, DM, signal message (+1 510 944 3235), or public comment on this post, if you want to support what we do. We are a registered 501(c)3, have big plans for the next year, and due to a shifting funding landscape need support from a broader community more than in any previous year. [1] 

I've been running LessWrong/Lightcone Infrastructure for the last 7 years. During that time we have grown into the primary infrastructure provider for the rationality and AI safety communities. "Infrastructure" is a big fuzzy word, but in our case, it concretely means: 

  • We build and run LessWrong.com and the AI Alignment Forum.[2]
  • We built and run Lighthaven (lighthaven.space), a ~30,000 sq.
...

Have donated $400. I appreciate the site and its team for all it's done over the years. I'm not optimistic about the future wrt AI (I'm firmly on the AGI doom side), but I nonetheless think that LW made a positive contribution on the topic.

Anecdote: In 2014 I was on a LW Community Weekend retreat in Berlin which Habryka either organized or did a whole bunch of rationality-themed presentations in. My main impression of him was that he was the most agentic person in the room by far. Based on that experience I fully expected him to eventually accomplish some arbitrary impressive thing, though it still took me by surprise to see him specifically move to the US and eventually become the new admin/site owner of LW.

FeepingCreature
Haven't heard back yet...
Chipmonk
I wonder if you could set up a conditional donation? “I donate $X, minus if total donations exceed $3M”
Dmitrii Krasheninnikov
Donated $100 for now. Thanks for the great work!

Epistemic status -- sharing rough notes on an important topic because I don't think I'll have a chance to clean them up soon.

Summary

Suppose a human used AI to take over the world. Would this be worse than AI taking over? I think plausibly:

  • In expectation, human-level AI will better live up to human moral standards than a randomly selected human. Because:
    • Humans fall far short of our moral standards.
    • Current models are much more nice, patient, honest and selfless than humans.
      • Though human-level AI will have much more agentic training for economic output, and a smaller fraction of HHH training, which could make them less nice.
    • Humans are "rewarded" for immoral behaviour more than AIs will be
      • Humans evolved under conditions where selfishness and cruelty often paid high dividends, so evolution
...

Almost no competent humans have human extinction as a goal. AI that takes over is clearly not aligned with the intended values, and so has unpredictable goals, which could very well be ones which result in human extinction (especially since many unaligned goals would result in human extinction whether they include that as a terminal goal or not).

I don't think we have good evidence that almost no humans would pursue human extinction if they took over the world, since no human in history has ever achieved that level of power. 

Most historical conquerors ...

kave
I think this is slightly a non sequitur. I take Tom to be saying "AIs will care about stuff that is natural to express in human concept-language" and your evidence to be primarily about "AIs will care about what we tell it to", though I could imagine there being some overflow evidence into Tom's proposition. I do think the limited success of interpretability is an example of evidence against Tom's proposition. For example, I think there's lots of work where you try and replace an SAE feature or a neuron (R) with some other module that's trying to do our natural language explanation of what R was doing, and that doesn't work.
eggsyntax
Thanks, that's a totally reasonable critique. I kind of shifted from one to the other over the course of that paragraph.  Something I believe, but failed to say, is that we should not expect those misgeneralized goals to be particularly human-legible. In the simple environments given in the goal misgeneralization spreadsheet, researchers can usually figure out eventually what the internalized goal was and express it in human terms (eg 'identify rulers' rather than 'identify tumors'), but I would expect that to be less and less true as systems get more complex. That said, I'm not aware of any strong evidence for that claim, it's just my intuition. I'll edit slightly to try to make that point more clear.
Nathan Helm-Burger
I thought the argument about the kindly mask was assuming that the scenario of "I just took over the world" is sufficiently out-of-distribution that we might reasonably fear that the in-distribution track record of aligned behavior might not hold?

I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently, I (and co-authors) released "Alignment Faking in Large Language Models", which provides empirical evidence for some components of the scheming threat model.

One question that's really important is how likely scheming is. But it's also really important to know how much we expect this uncertainty to be resolved by various key points in the future. I think it's about 25% likely that the first AIs capable of obsoleting top human experts[1] are scheming. It's really important for me to know whether I expect to make basically no updates to my P(scheming)[2] between here and the advent of potentially dangerously scheming models, or whether I expect...

Abilities/intelligence come almost entirely from pretraining, so all the situation awareness and scheming capability that current (and future similar) frontier models possess is thus also mostly present in the base model.

Yes, but for scheming, we care about whether the AI can self-locate itself as an AI using its knowledge. The fact that (at a minimum) sampling from the system is required for it to self-locate as an AI might make a big difference here.

Who cares if it greatly reduces competitiveness in experimental training runs?

Yes, reducing situati...

Traditional economics thinking has two strong principles, each based on abundant historical data:

  • Principle (A): No “lump of labor”: If human population goes up, there might be some wage drop in the very short term, because the demand curve for labor slopes down. But in the longer term, people will find new productive things to do, such that human labor will retain high value—in other words, the demand curve will move right. Indeed, if anything, the value of labor will ultimately go up, not down—for example, dense cities are engines of economic growth!
  • Principle (B): “Experience curves”: If the demand for some product goes up, there might be some price increase in the very short term, because the supply curve slopes up. But in the longer term, people will
...
Steven Byrnes
I’m still pretty sure that you think I believe things that I don’t believe. I’m trying to narrow down what it is and how you got that impression. I just made a number of changes to the wording, but it’s possible that I’m still missing the mark. When I stated Principle (A) at the top of the post, I was stating it as a traditional principle of economics. I wrote: “Traditional economics thinking has two strong principles, each based on abundant historical data”, and put in a link to a wikipedia article with more details. You see what I mean? I wasn’t endorsing it as always and forever true. Quite the contrary: The punchline of the whole article is: “here are three traditional economic principles, but at least one will need to be discarded post-AGI.” I did some rewriting of this part, any chance that helps?
Lorec

When I stated Principle (A) at the top of the post, I was stating it as a traditional principle of economics. I wrote: “Traditional economics thinking has two strong principles, each based on abundant historical data”,

I don't think you think Principle [A] must hold, but I do think you think it's in question. I'm saying that, rather than taking this very broad general principle of historical economic good sense, and giving very broad arguments for why it might or might not hold post-AGI, we can start reasoning about superintelligent manufacturing [includ...

Hzn
The purely technical reason why principle A does not apply in this way is opportunity cost. Let's say S is a highly productive worker who could generate $500,000 for the company over 1 year. Moreover S is willing to work for only $50,000! But if investing $50,000 in AI instead would generate $5,000,000, the true cost of hiring S is actually $4,550,000.
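One way to reproduce the $4,550,000 figure is sketched below. It reads "true cost" as the wage paid plus the extra output forgone by not choosing the alternative; that reading is an assumption about the commenter's arithmetic, and the function and parameter names are hypothetical.

```python
def true_cost_of_hiring(wage, worker_output, alternative_output):
    """Economic cost of hiring, under one reading of the comment's arithmetic.

    Assumption: the same budget could instead buy `alternative_output`,
    so choosing the worker forgoes (alternative_output - worker_output)
    on top of the wage actually paid.
    """
    forgone_gain = alternative_output - worker_output
    return wage + forgone_gain


# Figures from the comment: $50,000 wage, $500,000 output from S,
# $5,000,000 output from spending the same $50,000 on AI.
cost = true_cost_of_hiring(50_000, 500_000, 5_000_000)
```

With the comment's numbers, `cost` comes to 4,550,000. The result is driven almost entirely by `alternative_output`, so the conclusion depends on how cheap and productive the AI option is taken to be.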
Steven Byrnes
We can imagine a hypothetical world where a witch cast a magical spell that destroyed 99.9999999% of existing chips, and made it such that it’s only possible to create one new computer chip per day. And the algorithms are completely optimized—as good as they could possibly be. In that case, the price of compute would get bid up to the maximum economic value that it can produce anywhere in the world, which would be quite high. The company would not have an opportunity cost, because using AI would not be a cheap option. See what I mean? You’re assuming that the price of AI will wind up low, instead of arguing for it. As it happens, I do think the price of AI will wind up low!! But if you want to convince someone who believes in Principle (A), you need to engage with the idea of this race between the demand curve speeding to the right versus the supply curve speeding to the right. It doesn’t just go without saying.
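The "race" between the two curves can be illustrated with a toy model. The sketch below assumes straight-line demand and supply curves, which is a simplification introduced here and not something the comment claims; all names and numbers are hypothetical.

```python
def equilibrium_price(demand_intercept, demand_slope,
                      supply_intercept, supply_slope):
    """Price at which quantity demanded equals quantity supplied.

    Assumed linear curves (illustration only):
      demand: Q = demand_intercept - demand_slope * P
      supply: Q = supply_intercept + supply_slope * P
    Raising an intercept shifts that curve to the right.
    """
    return (demand_intercept - supply_intercept) / (demand_slope + supply_slope)


baseline = equilibrium_price(100, 1, 10, 2)
demand_shifts_right = equilibrium_price(130, 1, 10, 2)  # price rises
supply_shifts_right = equilibrium_price(100, 1, 40, 2)  # price falls
both_shift_equally = equilibrium_price(130, 1, 40, 2)   # price unchanged
```

In this toy model the price ends up low only if the supply curve moves right faster than the demand curve does, which is the point that has to be argued for.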

We should only use AGI once to make it so that no one, including ourselves, can use it ever again.

I'm terrified of both getting atomized by nanobots and of my sense of morality disintegrating in Extremistan. We don't need AGI to create a post-scarcity society, cure cancer, solve climate change, build a Dyson sphere, colonize the galaxy, or any of the other sane things we're planning to use AGI for. It will take us hard work and time, but we can get there with the power of our own minds. In fact, we need that time to let our sense of morality adjust to our ever-changing reality. Even without AGI, most people already feel that technological progress is too fast for them to keep up.

Some of the...

Thane Ruthenis
AGI is not the only technology or set of technologies that could be used to let a small set of people (say, 1-100) attain implacable, arbitrarily precise control over the future of humanity. Some obvious examples:

  • Sufficiently powerful industrial-scale social-manipulation/memetic warfare tools.
  • Superhuman drone armies capable of reliably destroying designated targets while not destroying non-designated targets.
  • Self-replicating nanotechnology capable of automated end-to-end manufacturing of arbitrarily complicated products out of raw natural resources.
  • Brain uploading, allowing the creation of a class of infinitely exploitable digital workers with AGI-level capabilities.

Any of those would be sufficient to remove the need to negotiate the direction of the future with vast swathes of humanity. You can just brainwash them into following your vision, or threaten them into compliance with overwhelming military power, or just disassemble them into raw materials for superior manufacturers.

Should we ban all of those as well? Generalizing, it seems that we should ban technological progress entirely. What if there's some other pathway to ultimate control that I've overlooked when I've thought about it for a minute? Perhaps we should all return to the state of nature?

I don't mean to say you don't have a point. Indeed, I largely agree that there are no humans or human processes that humanity-as-a-whole is in the epistemic position to trust with AGI (though there are some humans I would trust with it; it's theoretically possible to use it ethically). But "we must ban AGI, it's the unique Bad technology" is invalid. Humanity's default long-term prospects seem overwhelmingly dicey even without it.

I don't have a neat alternate proposal for you. But what you're suggesting is clearly not the way.

I appreciate you engaging and writing this out. I read your other post as well, and really liked it.

I do think that AGI is the unique bad technology. Let me try to engage with the examples you listed:

  • Social manipulation: I can't even begin to imagine how this could let 1-100 people have arbitrarily precise control over the entire rest of the world. Social manipulation implies that there are societies, humans talking to each other. That's just too large a system for a human mind to fully take the combinatorial explosion of all possible variables into account. Ma
...
Aram Panasenco
I think this post may be what you're referring to. I really like this comment in that post: Providing for material needs is less than 0.0000001% of the range of powers and possibilities that an AGI/ASI offers. Consider the trans debate. Disclaimer: I'm not trying to take any side in this debate, and am using it for illustrative purposes only. A hundred years ago someone saying "I feel like I'm in the wrong body and feel suicidal" could only be met with one compassionate response, which is to seek psychological or spiritual help. Now scientific progress has advanced enough that it can be hard to determine what the compassionate response is. Do we have enough evidence to determine whether puberty blockers are safe? Are hospitals holding the best interests of the patients at heart or trying to maximize profit from expensive surgeries? If a person is prevented from getting the surgery and kills themselves, should the person who kept them from getting the surgery be held liable? If a person does get the surgery, but later regrets it, should the doctors who encouraged them be held liable? Should doctors who argue against trans surgery lose their medical licenses? ASI will open up a billion possibilities that will go up to such a scale that if the difficulty of determining whether eating human babies is moral is a 1.0 and the difficulty of determining whether encouraging trans surgeries is moral is a 2.0, each of those possibilities will be in the millions. Our sense of morality will just not apply, and we won't be able to reason ourselves into a right or wrong course of action. That which makes us human will drown in the seas of black infinity.
Seth Herd
I'm sorry, I just don't have time to engage on these points right now. You're talking about the alignment problem. It's the biggest topic on LessWrong. You're assuming it won't be solved, but that's hotly debated among people like me who spend tons of time on the details of the debate. My recommended starting point is my Cruxes of disagreement on alignment difficulty post. It explains why some people think it's nearly impossible, some think it's outright easy, and people like me who think it's possible but not easy are working like mad to solve it before people actually build AGI.

I'm behind on a couple of posts I've been planning, but am trying to post something every day if possible. So today I'll post a cached fun piece on overinterpreting a random data phenomenon that's tricked me before.

Recall that a random walk or a "drunkard's walk" (as in the title) is a sequence of vectors $x_0, x_1, x_2, \dots$ in some $\mathbb{R}^n$ such that each $x_{t+1}$ is obtained from $x_t$ by adding noise. Here is a picture of a 1D random walk as a function of time:

Weirdly satisfying.
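A 1D walk like the one pictured can be generated with a few lines of code. This is a minimal sketch that assumes Gaussian noise with a fixed step size; the post does not say which noise distribution was used, and the function name is hypothetical.

```python
import random


def random_walk_1d(n_steps, step_std=1.0, seed=None):
    """Return the positions of a 1D random walk starting at 0.

    Each position is the previous one plus independent Gaussian noise
    (the Gaussian choice is an assumption made for illustration).
    """
    rng = random.Random(seed)
    positions = [0.0]
    for _ in range(n_steps):
        positions.append(positions[-1] + rng.gauss(0.0, step_std))
    return positions


walk = random_walk_1d(1000, seed=0)
```

Plotting `walk` against its index gives a picture of the same kind. Changing the seed gives a different trajectory each time, many of which look like they contain trends.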

A random walk is the "null hypothesis" for any ordered collection of data with memory (a.k.a. any "time series" with memory). If you are looking at some learning process that updates state to state with some degree of stochasticity, seeing a random walk...

Interesting! Perhaps one way to not be fooled by such situations could be to use a non-parametric statistical test. For example, we could apply permutation testing: by resampling the data to break its correlation structure and performing PCA on each permuted dataset, we can form a null distribution of eigenvalues. Then, by comparing the eigenvalues from the original data to this null distribution, we could assess whether the observed structure is unlikely under randomness. Specifically, we’d look at how extreme each original eigenvalue is relative to those...
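The permutation idea described above could be sketched roughly as follows. Several choices here are assumptions rather than anything the comment specifies: each column is shuffled independently, only the largest eigenvalue is compared, and that eigenvalue is approximated by power iteration in plain Python instead of a full PCA routine. All function names are hypothetical.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a dataset given as equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvalue(matrix, iterations=200):
    """Approximate largest eigenvalue of a covariance matrix (power iteration)."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-zero start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        eigenvalue = norm  # length of M @ v for the previous unit vector v
    return eigenvalue


def permutation_p_value(rows, n_permutations=200, seed=0):
    """Fraction of column-shuffled datasets whose top eigenvalue is at
    least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = top_eigenvalue(covariance_matrix(rows))
    columns = [list(col) for col in zip(*rows)]
    at_least_as_large = 0
    for _ in range(n_permutations):
        shuffled = []
        for col in columns:
            col_copy = col[:]
            rng.shuffle(col_copy)  # breaks links across columns and time
            shuffled.append(col_copy)
        permuted_rows = [list(r) for r in zip(*shuffled)]
        if top_eigenvalue(covariance_matrix(permuted_rows)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_permutations + 1)
```

One design caveat: shuffling each column independently also destroys the time ordering, so what the null distribution represents depends on the resampling scheme. Whether this particular scheme is the right null for random-walk data is left open here.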

Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Threads, Bluesky, or Farcaster.

Contents

  • From me and RPI
  • Jobs and fellowships
  • Other opportunities
  • Events
  • Questions
  • Announcements
  • Commentary on the wildfires
  • Sam Altman: AI workers in 2025, superintelligence next
  • Never underestimate elasticity of supply
  • “The earnestness and diligence of smart technical people”
  • “Americans born on foreign soil”
  • Undaunted
  • Eli Dourado’s model of policy change
  • Stats
  • Links
  • AI
  • Inspiration
  • Politics
  • China biotech rising
  • Predictions about war
  • Why did we wait so long for the camera?
  • Housing without homebuilders
  • Charts
  • Fun

From me and RPI

Jobs and fellowships

  • Epoch AI hiring a Technical Lead “to develop a next-generation computer-use benchmark
...
ryan_b

Regarding The Two Cultures essay:

I have gained so much buttressing context from reading dedicated history about science and math that I have come around to a much blunter position than Snow's. I claim that an ahistorical technical education is technically deficient. If a person reads no history of math, science, or engineering, then they will be a worse mathematician, scientist, or engineer, full stop.

Specialist histories can show how the big problems were really solved over time.[1] They can show how promising paths still wind up being wrong, and the ...