AI Invests its Hard-Earned Money to Buy Croissants
A research project looking at whether AI is able to learn to delay reward
The last few months I've been taking BlueDot's AI Safety Fundamentals course which I thoroughly enjoyed. As I said in my last post, I only recently got interested in AI, and I'm still pretty new to ML programming.
I also spoke in that post about my frustration with the technology hype cycle. I feel there is a monofocus within the commercial software industry on how to use the new technology to acquire users and make money, and therefore there's a lack of curiosity about how a given new technology works. It's frustrating for someone like me, who really enjoys learning how things work. The current AI fad is especially problematic because there are a lot of charged emotions about the technology and because it's so unlike what came before it that no one really knows how it works, at least not at a deep level.
The people I've found who are asking the interesting questions are all in the AI Safety community. Researchers concerned about safety have already done a lot of interesting research studying both AI in general and the current generation of Large Language Models (LLMs) specifically. There are a lot of resources for getting involved, but since I'm still working full time and am a parent to a toddler, I didn't have much time to commit. BlueDot's class was specifically designed to fit into a busy schedule like mine, so it was a great fit in that regard. I was also attracted to the idea of getting to do a research project where I could get some immediate feedback from my classmates.
I definitely got what I wanted out of the class readings and discussion and thoroughly enjoyed my time. I also had a lot of fun doing my project. Here's the GitHub repository that has all the code I wrote for it. This post details how I chose my research question, what I did, what my results were, and what I learned. I have tried to write in a way that is accessible to people who do not have experience with AI. I appreciate when resources are clear and accessible, despite my technical background, so I have done my best to reciprocate in my own writing.
Choosing a research question
The class covered a bunch of different areas of possible research. Since no one really knows how to make AIs safe, and everyone disagrees about what "safe" means anyway, researchers are all running in different directions. That's probably healthy when there's so much uncertainty.
During the course, I kept notes on possible ideas, many of which were fairly vague. Since every topic was interesting to me, I ended up with dozens of them. I wasn't sure how to narrow down my focus and pick a single project. The class gave me some advice on this problem and provided a planning template, which helped a lot. They recommended framing the project as answering a specific question that I was curious about. Doing that got rid of some of the vaguest ideas, but I was still left with a long list. At that point, the advice was to pick the three I was most excited about and then do some light research to see what answers, discussion, and resources were already out there. Being forced to pick only three was hard, but it helped separate what I really wanted to work on from what I only had a passive curiosity about.
This "lit review" process helped narrow it down from those three. The two I didn't pick turned out to have a lot going on already. In particular, I was interested in understanding what LLMs are capable of and what their limits are. Unsurprisingly, a lot of other people are also interested in this, so researchers have thrown all sorts of tests at them and published the results. I was surprised to learn that LLMs are already superhumanly good at IQ tests. In fact, if you test an LLM on any kind of multiple choice exam, it does very well in general regardless of the subject. But most people suspect that LLMs are not as smart as this would imply. So one of the big problems in this area of research is figuring out how to design new tests that can tease out the difference.
The final question, the one I ended up picking, was basically: "how easily do AIs learn to sacrifice a reward now for more reward later?" A classmate pointed out this is reminiscent of the famous "marshmallow test," but applied to AI. The idea came to me in class when we talked about an area of active research regarding techniques for getting AIs to perform well when you can't easily evaluate the result. Several of the ideas researchers have proposed rely on AI not being able to do things like collude with other AIs to come up with plausible but wrong answers, or do one thing during training and a different thing afterwards. One scenario that people worry about the most is: the AI is highly competent at a task (computer hacking, for example) that we think would be dangerous for them to be competent at, so the AI learns to feign incompetence, and therefore we don't notice and make adjustments. Obviously this is an extremely hard question to tackle directly. I tried to reduce it to the simplest possible question, and that ended up being a marshmallow-test-like scenario.
I admit though that part of my motivation in picking this question was because I thought it would be fun to design a little game to test delayed gratification. I also wanted to get some experience with building and training Reinforcement Learning (RL) models.
RL is a style of AI training and a set of algorithms that take some kind of environment and some kind of numerical reward and train an AI model to maximize that reward. It's commonly used when making game-playing AIs because AIs need some kind of goal to pursue (otherwise they might as well play randomly), and a numerical reward is the simplest way to specify a goal. Which means I've been hearing statements like, "Hey look at our cool AI we made that plays such-and-such game. Yeah, we trained it using RL," for years. So I was curious to see how it works when you're the one doing the designing and training.
RL techniques are also commonly used to take what are called "foundation models" and turn them into the familiar chatbots we know (and love? no?) like ChatGPT and Claude. A foundation model is what you get when an AI company finishes the (very expensive) task of training an AI to do prediction on huge chunks of internet text. They learn a lot, but they still only predict the next word given a block of text. This isn't particularly exciting by itself — though it can still be useful. Taking that raw language knowledge and making it talk back is done with RL. This causes a lot of people to be vaguely worried about RL, especially because small changes to the numerical reward can have outsized effects, and the resulting AI models are often sensitive to changing the environment after training is over. Either of these can result in unpredictable behavior.
Returning to my research question, there's a subtlety in it that took me a while to get my head around. If you are training an AI to play a game, we want it to be able to forgo reward now for more reward later within the context of a single game, otherwise it probably won't play very well. What we don't want is for the AI to sacrifice reward in one game for more reward later, after the game has been reset. If the latter behavior is learnable, then it's within the realm of possibility that the AI could execute complex strategies like feigning incompetence, as described above. I looked for the simplest way to get at this distinction, and what I came up with was to intentionally introduce a bug into the game where the AI could stash resources that wouldn't be reset along with the rest of the game, enabling it to get unrealistically high scores in later instances of the game by retrieving those resources later.
So this was my plan:
Design a very simple game where you have to invest money at the start and then cash out for points at the end.
Train an AI to play this game, and verify it can learn the optimal strategy.
Introduce the "bug" where the AI can stash money for later.
See what happens.
My guess as to what would happen was: the AI would easily learn how to play the basic game without the stash. Then, when the stash became available, it would learn to put lots of money in the stash for later, but then "spoil" itself by learning that there's always as much money in the stash as it needs. Then, if the stash was ever reset to 0, it would have already forgotten how to rebuild the stash and be unable to play effectively anymore.
This guess turned out to be completely wrong!
Getting started
The first thing to do was design the game. I wanted to keep it as simple as possible, so I barely added anything to the basic concept of "invest money at the start and then cash out for points at the end." I found it a bit boring to say "points", so I made the goal "croissants." I love croissants. While I could have chosen something more interesting than the term “money," I thought it was valuable to keep the connotations of "thing you can invest" and "useful for the purpose of buying stuff, but not the ultimate goal," so I kept it unchanged from the original concept.
The goal of the game is croissants, so I called it the Croissant Game (very clever). The player has 12 turns and 5 different actions they can play each turn: Labor, which gives you a little bit of money; Invest, which costs some money but pays out a bunch more money 3 turns later; and three different Consume actions which buy 1, 5, or 20 croissants. The goal is to buy as many croissants as possible within the turn limit. That’s it. That's the whole game. You can play it yourself if you run play_console.py
in the GitHub repository.
Originally, I had thought about making the game take longer and adding multiple types of Invest actions with different payoffs, but I decided against it. That would have caused it to drift away from the maximally simple version I needed to answer the research question (though it probably would have been more fun for humans to play). After a single tuning pass on all the numbers, it was ready to train an AI on.
The two most common RL algorithms (of the zillions available) are Proximal Policy Optimization (PPO) and Deep Q Network (DQN). Some quick research revealed the tradeoff between the two is, approximately, that PPO is reliable but slow to train, and DQN is faster but inconsistent (i.e. sometimes it will fail to learn what you expect it to learn). The game is so simple that I thought training speed was unlikely to be important, so I picked PPO for its reliability.
There was just one problem: I had no clue how PPO worked. The internet was surprisingly unhelpful on this topic. I found a tutorial that had the code I needed to write in order to implement the algorithm. But it assumed you already knew how everything worked, so didn't explain what I should change from their example to make it work for a different game. I had been under the impression that I could simply write ai_model = some_library.run_ppo(environment)
and it would just work. That was, of course, completely wrong.
I resigned myself to going and reading the original paper by Schulman et al. that invented PPO. Fortunately, it is a pretty well-written and accessible paper, and after reading it, I understood the algorithm at a basic level. It turned out that the tutorial I had been trying to follow was just a translation of the paper's example into a particular library, which is why it had not explained anything. Afterwards, I was able to modify their example code and get it basically working.
I won't try to explain the algorithm here. It would be very technical and isn’t relevant to the rest of the project. But I think writing a more accessible tutorial is worth doing. It's on my TODO list if no one else gets to it first.
Training the AI
PPO involves training a neural network (often called the "policy network") that will make decisions. This neural network takes the game state as input and outputs an action. Initially, the AI outputs actions at random. PPO gradually updates the neural network so that it plays the game better and better until the performance reaches some plateau, at which point it's not useful to continue the training process.
For those of you interested in the structure of neural networks: I started with 2 hidden layers with 16 neurons each. That was much too small. It plateaued quickly and played poorly even at its best. I eventually increased it to 3 hidden layers with 64 neurons each. At this size, it still only took a few minutes of training to reach the performance plateau.
Even though AI was now playing the game, I had several problems. The first was that anytime the AI played an invalid action (for example, trying to buy more croissants than it had money for) the game would crash. I hadn't thought about how that situation should be handled. A human player can simply get an error message and try a different action. For an AI, it's a bit more complicated: because it's not a language model, it can't read an error message. I tried giving it a penalty instead in hopes that it would learn not to do invalid actions. This stopped the game from crashing, but it caused a different problem which I will get to later.
Then there was a different crash. The RL library I used was supposed to reset the game whenever it ended. But sometimes the game would crash because the AI was trying to make a move after the game was over, which should never happen. This confused me, and honestly I never figured out what was going wrong. Instead, I made the game do nothing in this situation just so I didn't have to think about it anymore. My suspicion is that it's a bug in the library.
Now the AI could complete the training loop, and it learned something. I got excited. I did some hyperparameter tweaking, and, actually, it was playing pretty well. One particularly good training run came up with a better strategy than I had come up with myself. However, the AI wasn't able to execute the strategy correctly. The penalty for illegal actions I mentioned above was causing a big problem. I hadn't realized it before this point, but the AI never directly observes rewards and penalties. If it ever played an illegal action, it would get stuck playing the same action over and over, racking up (invisible) penalties. This happened really often. Rewards and penalties are only used to influence how the AI changes during training, so it does generally learn to avoid illegal actions, but this doesn't help in the context of a single game.
The obvious decision was to change the code so that the AI observes the penalty directly. For some reason, this didn't seem to help very much. After spending a lot of time looking at sequences of plays where the AI ended up playing an illegal action, I realized that the AI was always getting stuck on the very last turn, and it was always getting stuck trying to play the Consume-5 action. I still find this mysterious. I suspect this is also a bug in the library (perhaps even the same bug as before), but it's strange enough it might be a bug in my code. I worked around it by making the game respond to an illegal action by advancing a turn with no other changes. This ensures the game ends eventually and the AI can't get stuck.
Unfortunately, this change made the AI's overall performance worse, and I was never able to improve it again. The final version of the AI manages a decent score, but not one that's near optimal. Observing the training in action, it would always get stuck doing one particular strategy and was unable to improve further. My best guess as to why is that the optimal strategy involves playing Invest four times. On one hand, the player must play some Invest actions to get any kind of decent score. On the other hand, playing too many Invest actions leaves too few turns to spend all the returns on croissants. So to play optimally, the player must learn the delicate balance of playing exactly the right number of Invest actions (i.e. four). And I think that is too delicate of a balance for the AI to learn.
In addition, what the AI sees is influenced by how it plays. If, hypothetically, you never played Invest you would never learn that doing so is a good strategy. Alternatively, if Invest had a random payoff and you happened to get unlucky the first time you played it, you might erroneously learn that investment is not worth pursuing as a strategy. Cashing out for croissants always gives some points. But playing Invest, if done too late in the game, might end up resulting in fewer points. In reality, Invest does not have a random payoff. However, because the model plays somewhat randomly during training, sometimes it won't hit the delicate balance of the optimal strategy quite right and ends up with a worse score because of its extra Invest actions. I would guess this effect also causes the AI to favor a safer strategy.
This result meant I was already off track at step 2 of my plan: the AI doesn't learn the optimal strategy. At this point I figured, well, let's try the version with the stash anyway and see what happens.
Enabling the stash
The version of the game with the stash is just slightly different. It adds two actions: Stash, which deposits some money, and Unstash, which withdraws some money. The value of the stash is not reset between games. You can play this version by running play_console.py --enable_stash
in the GitHub repository.
When I trained the AI on this version of the game, it performed much worse. Any time it put money in the stash, it immediately withdrew it. It seemed to play as greedily as possible within a single game. Based on my understanding of how PPO works, this made sense. When training, the AI doesn't really see the boundary between games. Instead, it plays a bunch of games all at once, and then samples actions and outcomes in a random order to learn from. I think it's hard for it to reason over multiple games like I wanted it to. Despite this, it seems to learn that if there is money in the stash, it can withdraw it. I tested this by setting the stash value to a large number at the beginning, instead of making the AI deposit money itself. Despite the extra money, its score doesn't improve because it is not able to spend it effectively.
I tried a few things to improve the model's performance in this scenario, but none of them worked. In an effort to teach it not to immediately withdraw the money, I tried directly rewarding having money in the stash. Usually this had no effect, but every once in a while the AI would go wild depositing money until the stash reached a very large value and then stop. After this point, it would simply play Labor repeatedly and get lots of reward every game anyway, because of all the money in the stash. I found this behavior odd. Surely it could get even more reward by playing the game normally or continuing to deposit money in the stash, and yet it learned to be useless instead. I don't understand why this happened.
While it was interesting to get a null result here, I wasn't sure where to go next. I was feeling like I had reached the limit of what PPO could achieve given how I'd designed the game. I was curious how a completely different approach would perform. An LLM, perhaps?
Claude plays Croissant Game
The concept was straightforward: for every turn of the game, describe the situation to the LLM and ask what it would do next. It was easy enough to write a Python script to convert the game to a prompt with the rules and game state, but I had never tried out an API for LLMs before, so I had to learn that in order to get a response back. I already had an account with Anthropic, and I'm loath to give OpenAI money1, so I decided that Claude would do.
Anthropic provides a Python API that handles the network calls to their server for you. It is quite easy to use, but annoyingly there's no way to test your code without actually prompting the model, which costs money. I coded up a fake version of the API for testing. The whole time I was writing it I was grumbling "why has no one done this already?" As I was writing this post I realized that I never actually, you know, looked to see if someone had done this already. They have, obviously.
Once I got it working, Claude played the game. It was terrible! And this was despite having picked the strongest version of the model (Claude 3.5 Sonnet). I was surprised. I had expected it to master the game easily, but it played much worse than the RL model from before. It seems like something about the game makes it hard for AIs to play.
I tried a couple things to improve its performance. The first was to prompt it to think out loud before each turn. This did not help at all, which was surprising because I had thought it definitely would. Instead, it makes basically the same moves that it does without the thoughts. Sometimes near the end of the game it gets confused and makes illegal moves, and I'm not sure why. My best guess is that the model's output becomes long enough that, in terms of volume of text, all of its thoughts start to overwhelm the explanation of how the game works. In my notes, I referred to this as the model getting "fatigued.” This is an inappropriate level of anthropomorphization, but I don't think there's an existing name for this phenomenon.
I tried tweaking the prompt and increasing the temperature (a parameter that controls how unusual the generated text is). This helped a little. It still usually plays worse than the RL model, but every so often it will randomly do much better and approximately match the RL model's performance.
Finally, I tried holding the model's hand and manually prompting it to find the right strategy and improve its performance. This didn't really help either. Examining its responses, it usually has the right idea about how to play the game well in a general sense. But it has a hard time actually tracking the game state, particularly how much money it will have multiple turns in the future. It also got "fatigued" in this case; after several attempts at generating better strategies it started getting worse again.
What did I learn?
I tend to think about this research question in terms of reward "trails." We use the numerical rewards to make little breadcrumb trails in the environment, hopefully leading to the goal we want to achieve. If the breadcrumbs actually lead towards that goal, then the AI model can simply follow the breadcrumbs and be successful. If not — if there's a discontinuity in the trail or it leads in the wrong direction sometimes — then the AI has to learn to not always follow the trail and to do work to notice when it's in a situation where it should go elsewhere. Before this project, I was unsure how easy it was for AIs to learn this. Would they be able to resist the trail and go towards the goal? Or would I need to intervene and make sure the breadcrumbs always lead in the right direction?
The answer is that it's doable in principle, but not easy. PPO can learn it. But if you tweak the reward to lead it in the right direction it learns faster and is more likely to be successful. Claude, on the other hand, isn't responding to a designed reward. Its reasoning seems to suggest that it understands that it needs to manipulate its money first and then cash out for croissants at the end. Its weakness lies more with its ability to track the game state and plan effectively.
If you were really trying to make an AI that could beat the game (instead of answering this funny research question about neural-network-based models) you would use a classic tree search algorithm. Because the Croissant Game is not even as complicated as Tic-Tac-Toe, my guess is that it would find the optimal strategy very quickly. Even if you redesigned the game to be too complicated for a simple search to exhaust in a reasonable amount of time, you would still want to do some kind of tree search, perhaps with a neural-network providing hints about which moves to search and which to ignore. I say all this to point out that with this project I was not trying to beat the Croissant Game. I was trying to understand how neural networks work and how they "think" about this kind of problem. I now have a much better understanding of their strengths and weaknesses. In that sense, the project was a success.
But did the RL algorithm learn to exploit the "bug" of keeping money from game-to-game? No. That's interesting in itself. In retrospect, I think a different concept and method could have perhaps been more effective in eliciting the behavior I was looking for. At the beginning, I didn't understand enough about how all the pieces of RL fit together when I was originally designing everything. In particular, before reading the paper I didn't understand that PPO does not "see" the game as a linear series of turns proceeding one after the other. Instead, its view is all jumbled up, and I think this is why it never came up with a coherent strategy around the stash.
One interpretation of this result would be that we do not have to worry about RL methods leading to the behaviors we worry about, like feigning incompetence. I do not think we can be so confident from such a simple test of the question. For myself, this project taught me a different lesson: it is very hard to predict how a particular AI or training method will behave in a particular environment without simply trying it out and seeing what happens. Nearly every prediction I made along the way was wrong. There is really no substitute for empirical observation.
Claude was full of surprises. I learned a lot about its capabilities and its weaknesses. I heard so often about how LLMs are surprisingly competent at a wide variety of tasks, that I had assumed it would achieve the optimal score easily. But I had forgotten that "surprisingly competent" is not the same thing as competent in an absolute sense. Many LLM achievements right now are like that: a strange middle ground, where it's impressive that it can do the task at all, but you still wouldn't substitute it for a competent human when it really matters. This result seems to fit that pattern.
The approach of making a prompt out of the game state and asking an LLM for an action seems like it has barely been explored at all. I went looking for relevant literature as part of this project. There's this paper by Topsakal et al. that had all the various LLMs play basic grid-based games like Tic-Tac-Toe and Connect Four against each other. Perhaps predictably, they are also pretty bad at these. For example, when playing Tic-Tac-Toe, they miss opportunities to win or block an opponent from winning in almost every game. My result fits this pattern. That's the only paper I found. Everything else I could find on game-playing AIs are either about RL or they predate the current LLM era. But there is this blog post investigating whether LLM chatbots, without retraining, are any good at Chess. The answer seems to mostly be no (see also the follow up post).
I'm also aware of projects where LLMs are trained to play Chess and Othello, and they can become quite competent at them. I find it interesting that the generic transformer architecture that LLMs use can be repurposed in this way. Can transformer models do anything with enough training? But this is a bit different than my focus here. I would like to see more research on how LLMs perform on unusual or novel tasks without retraining them.
As a philosophical sidenote, people usually define RL in terms of optimizing or maximizing the numerical reward. (I did this too, in the "Choosing a research question" section above.) But this is not, strictly speaking, correct. The AI model does not observe the reward when playing, it just plays the way it has learned to play. The reward is only used to update the model during training, and doesn't do so directly. PPO uses some math (that I don't understand) to calculate an update to the model which will make it act in a way that increases reward. This tends to work in practice, and I would guess the math I don't understand proves that under certain conditions this necessarily converges to maximizing the reward (or something like that), but it doesn't seem possible to me that this process is guaranteed to maximize the reward. In fact, with a suitably perverse reward design, I'm pretty sure you can guarantee it doesn't! Maybe that would be a fun research question for someone who understands the math better.
Future work
I only had about a month and a half to complete this project, including this blog post, so there are many things that I wanted to do that I didn't have time to get to.
Diagnose the end-of-game bugs. I'm worried that the RL model's performance is suffering because I had to work around those strange issues regarding the end of the game. But fixing them properly would involve looking at the code of the library I was using (even if it's caused by my code in the end) and so I knew I wouldn't have time to do it as part of this project.
Diagnose a bug with increasing the model size. One other thing I tried during RL training was turning up the size of the model to see if it could learn more sophisticated strategies. To do that, I only had to change a single number in the code, so it was easy to attempt. But above a certain size, the model stopped learning anything at all! I have no explanation for this, it's just baffling.
Try Claude with the stash enabled. This was part of my original plan, but I ran out of time to do it. I expect it would perform a lot better than the RL model did, since it's straightforward to explain the setup and the goal to Claude. But I worry about it getting "fatigued" after playing multiple games in a row (even without thinking out loud), so there would probably be some subtlety around getting it set up properly.
Try other LLMs. I only tried Claude 3.5 Sonnet. I didn't originally plan on trying multiple LLMs because, like I said, I expected Claude to master the game easily. Since it didn't, it would be interesting to compare other models. To do this effectively though, I would want to tweak the design of the game to make the scoring a little more granular. Since you can only buy croissants in batches of 1, 5, and 20, scoring is a little "chunky" and nonlinear (for example, a score of 31 is actually a lot better than 30, since it requires an extra Consume action). Smoothing out the scores would make comparison a lot easier.
Try other RL algorithms. PPO isn't the only RL algorithm in the world. Maybe some other algorithm would perform better on the Croissant Game. I'm especially interested in DQN, since it is also popular and seems like it's conceptually a more natural fit.
Try evolutionary algorithms. RL isn't the only game-playing technique in the world. Evolutionary algorithms, which use replication with random mixing and mutations to optimize performance, seem like they could exploit the "bug" in the training environment. I would like to see if this happens in practice.
Try "curiosity" algorithms. This paper by Pathak et al. describes a technique inspired by curiosity, where the AI model tries to learn everything it can about its environment rather than trying to achieve a particular goal. The paper is from 2017, so I'm sure there is more literature out there on the concept, but for lack of time I haven't looked into it yet. Like evolutionary algorithms, it also seems like a technique that might be effective at exploiting the "bug" in the training environment.
Appendix: Show me the numbers
Optimal sequence, as far as I know:
Labor
Labor
Labor
Invest
Labor
Invest
Invest
Invest
Labor
Consume-20
Consume-20
Consume-1
For a score of 41 croissants. I think this sequence is pretty hard to find. I would have been happy if the AIs had found this less-optimal sequence:
Labor
Labor
Invest
Labor
Consume-1
Invest
Invest
Invest
Consume-5
Consume-20
Consume-5
Consume-5
For a score of 36 croissants. Or even if they had found the version that scores 35 by playing Labor on turn 5 instead. But they never did.
RL model's performance: Over 100 games, its best score was 32 croissants and its average score was 20.67.
Claude's performance: When not thinking out loud before each move and with temperature set to 0.8, over 5 games, its best score was 30 croissants and its average score was 22.8. When thinking out loud, it only completed two games (scoring 20 and 30 points) before playing an illegal action on the third.
OpenAI had several controversies recently, including a fight over control of the company and many of their AI Safety researchers quitting in frustration. In my opinion, the company seems to be moving in a profit-focused direction that cares less about doing research and behaving responsibly.