Using an RNN for 2D Tile Map Synthesis

Like most of the tech world, lately I’ve been using ChatGPT a lot for various tasks – text translation, general Q&A, and above all else, hobby coding. It’s remarkable how high it raises the floor for what you can accomplish with limited knowledge in a particular field. Naturally, the things neural networks are capable of have been living rent-free in my head for the past little while, until I finally had an idea I wanted to try.

Before we dive into this weekend experiment, a disclaimer: I knew, and still know, next to nothing about neural networks. I don’t know fundamentally how or why they work, what the results mean, why some parameter tweaks are more effective than others, or, worst of all, how to understand and truly improve the results. To me, a neural network is a black box that you can throw a bunch of data at and say “understand this”. So you should take everything I say with a healthy dose of skepticism.

Here were my preconceptions as of yesterday:

  • You need a LOT of data (thousands if not millions of examples)
  • You need a server rack of GPUs with tons of VRAM
  • You need a lot of patience because they take ages to train

This is of course true for state-of-the-art LLMs like GPT-3 and GPT-4, but what about for smaller stuff?

The Future of Randomizers

Like many people, I started with the Super Metroid item randomizer; the most recent one I’ve played is Bloodstained RotN’s. If you’re unfamiliar with them, a randomizer is similar to a roguelike in that it shuffles content around in a game to provide a fresh experience – items, enemies, and sometimes map layouts. They’re fun to watch, and the popular ones like ZOOTR continue to see updates, but merely shuffling content eventually becomes stale. I want to see synthesized data, and in particular, maps.

However, switching from shuffling existing content with some form of logic to creating brand new content is a gigantic leap. Some randomizers add new assets, but I’m talking about procedurally generated content. Never-before-seen maps, routes, puzzles, towns, obstacles, and experiences to prolong that inevitable staleness.

We won’t be anywhere close to solving this in this blog post, but it serves as the motivation to see if neural networks can help make it possible. Without further ado, let’s get into it.

Gathering the Data

For this experiment, I’ve chosen to work with a dataset that fulfills these requirements:

  • Limited amounts of data
  • Complex in nature
  • Lots of unique tiles

I chose these requirements because they generalize to most 2D games made after 1990. The game I’m working with is Zelda: Oracle of Seasons.

The overworld is a 16×16 grid of rooms that are 10×8 tiles each. Each room can use a tileset of 256 unique tiles, although many are shared or are just palette-swapped.

To maximize the amount of training data available and to limit niche cases, my processing merged graphically identical tiles at the 2bpp and palette-index level. Some of the remaining tiles are still visually and functionally identical, but I’ve kept them in because I wanted to do as little processing as possible.
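
To give a rough idea of what that merging step looks like (an assumed sketch, not my actual tooling), it boils down to hashing each tile’s raw bytes and collapsing duplicates onto a single ID:

def dedupe_tiles(raw_tiles):
    # Map graphically identical tiles (identical raw 2bpp + palette-index bytes)
    # onto a single ID each; returns per-tile IDs and the unique tile count.
    ids, seen = [], {}
    for t in raw_tiles:
        ids.append(seen.setdefault(t, len(seen)))
    return ids, len(seen)

# e.g. dedupe_tiles([b'\x01\x02', b'\x03\x04', b'\x01\x02']) -> ([0, 1, 0], 2)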

The result is a whopping 660 tiles. For the map data, there are 20,480 tiles in total, giving each unique tile an average of just 31 usages. The final number is actually lower because some maps did end up getting filtered out, but more on that later.

Wave Function Collapse

Before we get started, let’s review one of the more popular procedural generation (or tile-based synthesis, if you’ve got your top hat on) algorithms out there.

Wave Function Collapse (WFC) is, at its core, the algorithm behind videos you’ve likely seen from Paul Merrell and games like Townscaper. It’s a method of solving a tile grid given a set of connectivity rules or constraints. How you generate those rules is up to you, but one popular method is to use an example map.

We have that! So we can use the base map to generate the constraints, and solve it to get the map, right? Well, let’s see how that looks.

The good part is that it follows all the rules derived from the source map. But there are some major problems that are very hard to solve with WFC. For starters, it is extremely slow, taking almost 30 seconds to generate that tiny map. It also failed frequently, which is the unfortunate reality of WFC when you have a large tileset and are either missing tiles for certain cases or missing rules.

Lastly – and this is the real problem – it isn’t very map-like. WFC blindly thrown at a blank canvas will just solve constraints and pick an appropriate tile. It’s better suited to working in tandem with some other algorithm that generates higher-level data. For example, you could have 3 section tiles called “ground”, “trees”, and “water”, and then use those to determine which tiles to place based on the constraints.

This is where you have to start thinking really hard about what you’re trying to do. Surely you can look at the above example and devise some algorithms to kind of generate similar maps, but that’s time-consuming, challenging, and limiting. In the context of randomizers, obviously everything will be tailored, but there’s got to be another solution.

Borrowing from GPT

We’re going to borrow a couple of ideas from GPT (Generative Pre-trained Transformer). At the most basic level, it predicts the next word in a sequence of words using a transformer model trained on a whole bunch of text.

I said word, but actually, language models like GPT use something called a token. In the context of GPT, a token is 1 or more characters, but is more often a piece of or sometimes an entire word. A program to convert a string of text into tokens is called a tokenizer, and usually with language models, you’d train a tokenizer using something like SentencePiece to determine what the optimal ‘dictionary’ is to represent a source text via tokens.
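
For reference, training and using a tokenizer looks roughly like this (we won’t need any of this; sentencepiece and the corpus file here are assumptions for illustration):

import sentencepiece as spm

# Train a small subword vocabulary from some text corpus...
spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='tok', vocab_size=4000)

# ...then use it to turn strings into token IDs and back.
sp = spm.SentencePieceProcessor(model_file='tok.model')
ids = sp.encode('the stump is missing a corner')
print(sp.decode(ids))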

Instead of words, however, we’re going to treat tile IDs as our tokens. Given that our dataset contains 660 tiles, that’s 661 unique tokens. Yes, 661, because text models usually have special system tokens, like </s> or <eos>, to mark the end of the text. We’re going to use that extra ID as our padding token to indicate an out-of-bounds tile, since our sequences need to be the same length.

And while we could theoretically train a tokenizer on our dataset, there are two reasons we won’t:

  • Our data set is tiny
  • A 2D map is not the same as a string of text, which we’ll talk more about later

The goal is simple then. We’ll train the network on sequences of data and on inference, the tokens it generates will be our tile IDs! How hard can it be…?

Starting with 1D

Let’s start small by devising a neural network to generate rows of tiles. First, we need to decide what kind of model we want to train. As its name suggests, GPT uses the transformer model introduced in the Attention Is All You Need paper. For our tiny map, we’ll be using something much simpler: a recurrent neural network (RNN). I’ll explain it in more detail once we get there.

I’ll be using Python and PyTorch for the coding part. If you’re like me 24 hours ago and have never trained a network from scratch or worked much with PyTorch, it’s daunting at first but extremely powerful and easy to get the hang of. It’s essentially a framework filled with model types that make it trivial to string together new network structures – all you have to do is fill out a class. Most of the code will be off the shelf.

Note: Due to copyright reasons, I can’t distribute the dataset I’m using. Sorry!

Loading the Data

First and foremost, we need to load the data. I have it stored in map.dat, which is a binary file containing the integer IDs of the tiles.

import numpy as np
raw_data = np.fromfile('map.dat', dtype=np.int32)

Next is to determine our vocabulary size, AKA how many unique tiles there are. Remember we have 1 extra for the padding token.

w = 10
h = 8
num_rooms = 256

vocab_size = len(np.unique(raw_data))
vocab_size += 1
room_data = np.zeros((num_rooms, h, w), dtype=np.int32)
for i in range(num_rooms):
    for y in range(h):
        for x in range(w):
            room_data[i, y, x] = raw_data[i * w * h + y * w + x]
            
PAD_TOKEN = vocab_size - 1
print(f'vocab_size = {vocab_size}') # vocab_size = 661

Sequencing the Data

We’re going to start by having the network just generate rows of data. Therefore, we need to divide our data into horizontal lines.

There’s one challenge – the source data contains discontinuous rooms, which means the tiles at the edges of adjacent rooms don’t follow the rules. Rooms that sit next to each other in storage aren’t necessarily connected in-game; map transitions can lead to a different room entirely. Later on, we will handle this better, but because I want to show it’s possible, we’re going to work with room-level data.

We’re going to iterate over each row of each room and collect 4-token (4-tile, but I’ll use “tile” and “token” interchangeably from now on) sequences that go up to the edge of the room.

sequence_length = 4
flattened_data = []

def sample_room(room, x, y):
    world_x = (room % 16) * 10 + x
    world_y = (room // 16) * 8 + y
    if world_x < 0 or world_y < 0:
        return PAD_TOKEN
    if world_x >= 160 or world_y >= 128:
        return PAD_TOKEN
    room = world_y // 8 * 16 + world_x // 10
    x = world_x % 10
    y = world_y % 8
    return room_data[room, y, x]

num_samples = 0
for room in range(num_rooms):
    for x in range(0, w - sequence_length):
        for y in range(0, h):
            values = []
            for xo in range(0, sequence_length):
                values.append(sample_room(room, x + xo, y))

            flattened_data.extend(values)
            num_samples += 1

data = np.array(flattened_data)
print(f'num_samples = {num_samples}') # num_samples = 12288

Next we need to build the dataset in a way that the network can ingest during training.

import torch
from torch.utils.data import Dataset, DataLoader

sequence_length = 4
batch_size = 256

class TileMapDataset(Dataset):
    def __init__(self, data, sequence_length):
        self.data = data
        self.sequence_length = sequence_length

    def __len__(self):
        v = len(self.data) // self.sequence_length
        return v

    def __getitem__(self, index):
        return (torch.tensor(self.data[index * self.sequence_length : index * self.sequence_length + self.sequence_length - 1], dtype=torch.long),
                torch.tensor(self.data[index * self.sequence_length + 1 : index * self.sequence_length + self.sequence_length], dtype=torch.long))

dataset = TileMapDataset(data, sequence_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

The idea behind training a model on a 4-token sequence is that if we give it the previous 3 tokens, it will predict the 4th. This is why people refer to ChatGPT as “just predicting the next word”.

To facilitate this, we use a common strategy: shifting the target tokens by one. In practice, this means building one tensor with the first N-1 tokens and another tensor with the last N-1 tokens. So, for example, we’re essentially telling the network this:

Sequence:  [1, 2, 3, 4]
Input:     [1, 2, 3   ]
Target:    [   2, 3, 4]

The terminology you’ll sometimes see is “IDs” and “labels”, but I’m using “input” and “target” because they’re more intuitive to me. The dataset code above returns exactly this pair for a given index.

No Sliding Window?

Usually when doing sequence training, you’ll see a sliding window: __len__ will return something like len(self.data) - self.sequence_length, and each returned item will be a window of the data, with the target offset by one.

I’ve chosen to do this manually in our data-processing step, because while it’s not strictly relevant in the row case, it is relevant for when we extend to 2D.
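
For completeness, a sliding-window version of the dataset would look roughly like this (not what we’re using):

class SlidingWindowDataset(Dataset):
    def __init__(self, data, sequence_length):
        self.data = data
        self.sequence_length = sequence_length

    def __len__(self):
        # one window per starting offset
        return len(self.data) - self.sequence_length

    def __getitem__(self, index):
        window = self.data[index : index + self.sequence_length]
        return (torch.tensor(window[:-1], dtype=torch.long),
                torch.tensor(window[1:], dtype=torch.long))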

Creating the Model

Before coding the model, let’s first understand what we’re creating. Typically, neural networks work with floats, which is very different from our integer token IDs. So what do we do?

The way this works is simple – the model converts the tokens into vectors that it can work with, and then calculates the weights of each potential token in the vocabulary at the end. In other words, we’re going to train our model to predict the likelihood any given token would come next.

There are three layers we need to accomplish this:

  • Embedding -> Token to vector
  • RNN -> Sequence magical black box
  • Fully Connected -> Vector to token probability

I’m grossly over-simplifying, but in truth, I don’t fully understand them. I just know that these are layers that fundamentally make up a basic token prediction model. Of course, smart people will tell you that “the embedding layer helps capture semantic information and establish a relationship between tokens”, but as far as I’m concerned, it’s how we convert tokens into something the network can work with.

Now, let’s write the code.

import torch.nn as nn

embedding_dim = 8
hidden_dim = 16
n_layers = 1

class TileModelRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout=0.0):
        super(TileModelRNN, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        rnn_output, _ = self.rnn(embedded)
        
        output = self.fc(rnn_output)
        return output
    
model = TileModelRNN(vocab_size, embedding_dim, hidden_dim, vocab_size, n_layers).cuda()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of parameters: {num_params}') # Number of parameters: 152405

The parameters chosen are quite small. I’ll explain why I chose them later on. I’ve also included the dropout parameter, though we won’t be using it.

Training

It’s time to get training. I’m just using an off-the-shelf training setup because it works well. The only things we care about right now are the learning rate and number of epochs.

The way neural networks learn is by iteratively making lots of really tiny adjustments. If you’ve ever written a physics engine, it’s sort of like solving a frame. You divide your timestep, you solve each constraint one-by-one, and eventually (hopefully) end up with a system that satisfies the constraints. The learning rate is sort of like your timestep, where a higher rate can lead to a faster convergence, but also tends to overshoot and produce a bad model. A lower learning rate takes longer to converge, but one that’s too low can also fail to converge.

The number of epochs is how many times training will cycle through the data. There’s a bunch of information out there about being careful with this value, as too many epochs can lead to overfitting – the model learning “too well”, to the point of picking up outliers. In the context of a text generation model, this would mean repeating sequences of words verbatim. For us, this is not completely undesirable.

Finally, we won’t be doing a train/validation split. This is used to help prevent overfitting and to gauge how the model is dealing with unseen data. Given that our dataset is so small, every little bit helps, and producing a more fitted model actually produces better maps.
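
For reference, a split would normally only take a couple of lines in PyTorch (again, we’re skipping this):

from torch.utils.data import random_split

train_size = int(0.9 * len(dataset))
train_set, val_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size)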

import keyboard  # optional: lets us break out of training early with the 'k' key

learning_rate = 3e-4
num_epochs = 500

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
model.train()

for epoch in range(num_epochs):
    if keyboard.is_pressed('k'):
        break

    for i, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.cuda(), targets.cuda()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 10 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")

torch.save(model, f'model_rnn.pth')

The code reports the loss after every 10 steps. In general, the lower the loss, the better. This, outside of just gauging how the model performs, is what we’ll use when tuning parameters and structuring the model. Here’s what we get:

There’s a lot to unpack, but ultimately we can see it converges to a colossal value of 1.68.

Testing the Model

At last, it’s finally time to see how it works! We know the input to our model is a sequence of 3 tokens. To start, those will be padding tokens; after that, the input will be the last 3 tokens that have been generated, and the model will predict the 4th.

import torch.nn.functional as F

MAP_SIZE = 64  # generated maps are 64x64 tiles

def generate_tilemap(model, ix, iy, length=MAP_SIZE):
    generated_map = [PAD_TOKEN] * (length * length)

    def sample_map(x, y):
        if x < 0 or x >= length or y < 0 or y >= length:
            return PAD_TOKEN
        return generated_map[y * length + x]

    with torch.no_grad():
        for y in range(length):
            for x in range(length):
                row = []
                for i in range(-(sequence_length - 1), 0):
                    row.append(sample_map(x + i, y))

                input_tensor = torch.tensor(row, dtype=torch.long).unsqueeze(0).cuda()
                
                raw_output = model(input_tensor)
                raw_output = raw_output[0, -1]

                raw_output = F.softmax(raw_output, dim=-1)
                
                next_token = torch.multinomial(raw_output, 1).item()

                generated_map[y * length + x] = next_token

    return generated_map

model.eval()
generated_map = generate_tilemap(model, 0, 0)

Now, this function will generate a length x length array of tiles, or a map of independent rows. Let’s see what it looks like with an untrained model:

And then after training it:

Now, remember that this has only been trained on individual rows, so we only care if the tiles look to be correct horizontally. And it’s definitely picking up on some of the rules, so this is encouraging!

Parameter Tuning

When setting up the initial parameters, namely:

embedding_dim = 8
hidden_dim = 16
n_layers = 1
num_epochs = 500

I chose these to illustrate what happens when they’re off. When the dimensions are too small, the training loss won’t go below a certain threshold. There are multiple things that can cause this, but in our case, the dimensions need to be increased. Let’s see what some revised parameters look like:

embedding_dim = 128
hidden_dim = 256
n_layers = 1
num_epochs = 100

Wow, what a difference! Not only did the loss decrease, but it did so significantly faster. Granted, this might not always be a good thing – it can be a sign that the model is too large or training data is insufficient, but again this was for demonstrative purposes.

Changing the RNN Type

Before we work on 2D, let’s make a couple changes to the model structure. The RNN works well, but it has a couple drop-in improvements we can make. A loss of 1.1 is still quite high.

The first is to simply change the RNN to a GRU. The GRU is a more sophisticated recurrent network and provides better results and a much better fitted model. Apparently it helps handle longer contexts, like an LSTM does, but is faster. Longer contexts aren’t super relevant to us, but to me it’s a black box that works better, so we’re rolling with it.

The other is to make the layer bidirectional. What’s interesting about this setting is that technically it’s meant more for cases where you have information about what comes next. But it improves training time and the empirical results of the model significantly, which is hard to argue with.

self.rnn = nn.GRU(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True, bidirectional=True)
self.fc = nn.Linear(hidden_dim*2, output_dim)

I think 1D is looking good, so let’s move onto 2D.

Expanding to 2D

This is where things start becoming exciting, but also more complex. Generating and training horizontally was trivial because sequential tokens mapped sequentially to space. However, that’s no longer the case. Let’s explore a couple options.

The first is to train a second network on vertical sequences and, at inference time, combine the two probability distributions to choose the tile. That kind of works, but in practice it fails because many tiles are part of a larger structure that this approach can’t learn effectively. For example, the stump.

There are also tile dependencies where one tile will have to look to diagonal neighbors or “ahead” as well in order to choose the correct tile, so ultimately a second model isn’t the solution.

Another idea is to train on whole rooms. We can open up the ability to sample across rooms and make the input to the model a rectangle. While this might work, training on longer sequences of dozens of tokens isn’t trivial, and asking the model to correctly match up tokens potentially far from the current token in a long sequence would most likely require the attention layers that the transformer model introduces. In practice, I haven’t been able to get that kind of model to work well at all (our small dataset doesn’t help), nor have I had any luck with simpler networks devised for longer sequences, like the LSTM or the GRU we’re using.

The final idea we’re going with is to train the model on short sequences containing the relevant neighbors and add a positional embedding layer. This keeps the neighbor kernel small, helps the model make correct tile matching decisions, and allows us to more efficiently scale the training context. While the positional embedding layer is optional, it does drastically improve overall “flow” of the map.

Refining the Training Data

First we’re going to define the neighbor kernel, which is just an array of relative coordinates that we’ll be using. This is what a full 4-tile sequence looks like, where the model will be trained to predict that middle bottom grass one.

neighbor_kernel = [(0, -1), (1, -1), (-1, 0)]
sequence_length = len(neighbor_kernel) + 1

The next step is to build our sequences in this format. I’m also going to open up cross-room sampling, which also means using a list of rooms to exclude from the sampling. This is so the model doesn’t learn from cases like these:

exclude_rooms = {8, 9, 0xA, 0x18, 0x19, 0x1A, 0x28, 0x29, 0x2A, 0x38, 0x39, 0x3A, 0x13, 0x23, 0x33, 0xC9}

flattened_data = []

def sample_room(room, x, y):
    if room in exclude_rooms:
        return None
    world_x = (room % 16) * 10 + x
    world_y = (room // 16) * 8 + y
    if world_x < 0 or world_y < 0:
        return PAD_TOKEN
    if world_x >= 160 or world_y >= 128:
        return PAD_TOKEN
    room = world_y // 8 * 16 + world_x // 10
    x = world_x % 10
    y = world_y % 8
    return room_data[room, y, x]

num_samples = 0
for room in range(num_rooms):
    for x in range(0, w):
        for y in range(0, h):
            values = []
            for dx, dy in neighbor_kernel:
                values.append(sample_room(room, x+dx, y+dy))

            values.append(sample_room(room, x, y))
            if None in values:
                continue

            flattened_data.extend(values)
            num_samples += 1

data = np.array(flattened_data)
print(f'num_samples = {num_samples}') # num_samples = 19200

Now we just have to slightly modify the inference to build the sequence correctly:

sequence = []
for dx, dy in neighbor_kernel:
    sequence.append(sample_map(x+dx, y+dy))

And that’s it! Let’s see how the training fares and how it performs. Note that we now have more samples to use, hence the higher step count for the same number of epochs:

For a first test, this is awesome! We can definitely see this working much better than wave function collapse. There is more open space, and there are parts that feel like a real map. However, there are some issues:

  • There tends to be a lot of repetition while also lacking variety
  • Many of the tiles look to be breaking rules

Temperature, Repetition Penalty, and Top-K Sampling

If you’ve used the oobabooga web interface for using LLaMa, then you’ve probably noticed these parameters. They’re simple and yet very effective methods of addressing some of the above problems.

  • Temperature – When picking the most likely token to come after a sequence, every token has some chance of being used. Currently we use torch.multinomial to choose one with the weights fresh from the model. The temperature tweaks the weights in a way where a temperature tending towards 0 results in deterministic evaluations, and 1 doesn’t affect the weights (which tends to result in more variation). We want to use a lower temperature as this tends to reduce the amount of chaos.
  • Repetition Penalty – Right now the model doesn’t care if the same tile repeats 100 times in a row, but we can make it so it does by introducing the repetition penalty. This will “penalize” or lower the probability of a specific token being selected the more it occurs.
  • Top-K Sampling – Even if a token is the least likely to be chosen, it still has a chance. We can take the top K most likely tokens and then just sample from those instead, although in practice this tends to not do much given our small data size.

These are fairly simple to implement. For the repetition penalty, I’m going to go with a scaling approach instead of subtractive where it looks at the last 8 tiles in the -X and -Y directions.

def softmax_with_temperature(logits, temperature=1.0):
    return F.softmax(logits / temperature, dim=-1)

def apply_repetition_penalty(logits, data, x, y, penalty_value):
    def sample_map(x, y):
        if x < 0 or x >= MAP_SIZE or y < 0 or y >= MAP_SIZE:
            return vocab_size - 1
        return data[y * MAP_SIZE + x]

    for dx in range(-8, 0):
        logits[sample_map(x + dx, y)] *= penalty_value
    for dy in range(-8, 0):
        logits[sample_map(x, y + dy)] *= penalty_value

    return logits

def top_k_probs(probs, k):
    top_k_values, top_k_indices = torch.topk(probs, k)
    top_k_probs = top_k_values / top_k_values.sum()
    return top_k_probs, top_k_indices

def sample_from_top_k(probs, k):
    top_probs, top_indices = top_k_probs(probs, k)
    sampled_index = torch.multinomial(top_probs, 1).item()
    next_token = top_indices[sampled_index].item()
    return next_token

def generate_tilemap(model, ix, iy, length=MAP_SIZE, temperature=TEMPERATURE, rep_penalty=REPETITION_PENALTY):
    generated_map = [PAD_TOKEN] * (length * length)

    def sample_map(x, y):
        if x < 0 or x >= length or y < 0 or y >= length:
            return PAD_TOKEN
        return generated_map[y * length + x]

    with torch.no_grad():
        for y in range(length):
            for x in range(length):
                sequence = []
                for dx, dy in neighbor_kernel:
                    sequence.append(sample_map(x+dx, y+dy))

                input_tensor = torch.tensor(sequence, dtype=torch.long).unsqueeze(0).cuda()
                
                raw_output = model(input_tensor)
                raw_output = raw_output[0, -1]

                raw_output = apply_repetition_penalty(raw_output, generated_map, x, y, penalty_value=rep_penalty)
                combined_probs = softmax_with_temperature(raw_output, temperature)
                
                #next_token = torch.multinomial(combined_probs, 1).item()
                next_token = sample_from_top_k(combined_probs, 100)

                generated_map[y * length + x] = next_token

    return generated_map

Nice! Let’s see how it looks with a temperature of 0.4 and repetition penalty of 0.95:

That’s looking much better! We have an acceptable amount of repetition like in the original map, variety, and fewer errors.

Enforcing Tile Constraints

Right now, the model is fully in charge of picking a tile, and left unchecked it can pick a wrong one. While this next step won’t completely solve the issue, it will improve things by having the model pick the most likely tile that satisfies a connectivity constraint, falling back to the most likely tile overall if none does.

def extract_tile_pairs():
    valid_pairs_x = set()
    valid_pairs_y = set()

    for room in range(num_rooms):
        for y in range(h):
            for x in range(w):
                current_tile = sample_room(room, x, y)
                    
                neighbor_tile = sample_room(room, x-1, y)
                if neighbor_tile != None:
                    valid_pairs_x.add((current_tile, neighbor_tile))
                    
                neighbor_tile = sample_room(room, x, y-1)
                if neighbor_tile != None:
                    valid_pairs_y.add((current_tile, neighbor_tile))

    return valid_pairs_x, valid_pairs_y

valid_tile_pairs_x, valid_tile_pairs_y = extract_tile_pairs()
def invalid_tile(token, x, y):
    # True if placing 'token' at (x, y) breaks a learned adjacency rule
    if (token, sample_map(x-1, y)) not in valid_tile_pairs_x:
        return True
    if (token, sample_map(x, y-1)) not in valid_tile_pairs_y:
        return True
    return False
    
...

if x > 0 and y > 0 and invalid_tile(next_token, x, y):
    top_likely = torch.topk(combined_probs, 25)
    found = False
    for i in range(top_likely.indices.size(0)):
        token = top_likely.indices[i].item()
        if not invalid_tile(token, x, y):
            next_token = token
            found = True
            break
    if not found:
        for i in range(vocab_size - 1):
            token = i
            if not invalid_tile(token, x, y):
                next_token = token
                found = True
                break

Ditching the Padding

What I’m noticing is that every map starts with tile 0, the blue sky tile, because the only data the model has seen with padding on the top-left contains that tile (it’s the very top-left corner of the map). There are two ways of dealing with this. One is to manually pad each sample so the model learns how to deal with tiles in padded regions, but this would drastically increase the number of samples with little benefit.

The solution we’re going with instead is to actually just not include padded samples in our training data. Right now we pad the samples like this:

if world_x < 0 or world_y < 0:
    return PAD_TOKEN
if world_x >= 160 or world_y >= 128:
    return PAD_TOKEN

An easier way is to simply skip those samples. With the tile constraints in place, we can guarantee that the top row is constructed of correct tiles which sets the rest of the map up nicely, although the left column will still be disjointed.
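
Here’s a minimal sketch of that change – returning None outside the map, so the existing “if None in values: continue” check drops those samples, just like it already does for excluded rooms:

def sample_room(room, x, y):
    if room in exclude_rooms:
        return None
    world_x = (room % 16) * 10 + x
    world_y = (room // 16) * 8 + y
    # out-of-bounds now means "skip this sample" instead of "pad it"
    if world_x < 0 or world_y < 0 or world_x >= 160 or world_y >= 128:
        return None
    room = world_y // 8 * 16 + world_x // 10
    return room_data[room, world_y % 8, world_x % 10]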

If you wanted to take it one step further, then given that our network only uses immediate neighbors, you could simply pad the generation to produce a 66×66 map instead of a 64×64 one, for example, and then ignore the outer ring.
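
That would look something like this (assuming generate_tilemap returns the flat tile list as above):

padded_size = 66  # 64 + 1 extra tile on each side
full_map = generate_tilemap(model, 0, 0, length=padded_size)
# keep only the 64x64 interior so the disjointed border never shows
inner_map = [full_map[y * padded_size + x]
             for y in range(1, padded_size - 1)
             for x in range(1, padded_size - 1)]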

Tweaking the Data

I’ve wanted to do as little processing as possible to keep this project generalized, but I’m going to modify a couple of the samples. My dataset uses the default game state, which teaches the model that some undesirable features are okay. Here’s what I’m talking about:

With these fixed, a perfect model should no longer place cave entrances above the ground, stairs that don’t reach the ground, and bridges that lead to water.

Positional Embedding

Our current maps are pretty good, but they still feel pretty similar to wave function collapse, in that the model creates a lot of narrow passages and features that feel incomplete, like the sandpit.

One thing the model may or may not be learning well is the spatial relationship between tiles. Fortunately, we can do something about that by introducing a positional embedding layer. It works like the token embedding layer, except it encodes where in the neighbor kernel each token sits rather than which tile it is.

class PositionalEmbedding(nn.Module):
    def __init__(self, embedding_dim):
        super(PositionalEmbedding, self).__init__()
        
        self.positional_embeddings = nn.Embedding(sequence_length - 1, embedding_dim)
        self.position_to_idx = {}
        for i, pos in enumerate(neighbor_kernel):
            self.position_to_idx[pos] = i
    
    def forward(self, positions):
        # 'positions' is only used for its batch size and device; the lookup
        # indices always follow the fixed neighbor_kernel ordering
        indices = [self.position_to_idx[pos] for pos in neighbor_kernel]

        indices = torch.tensor(indices, dtype=torch.long).to(positions.device).unsqueeze(0).repeat(positions.size(0), 1)

        return self.positional_embeddings(indices)

And next we add it to the model:

class TileModelRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout=0.0):
        super(TileModelRNN, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.positional_embedding = PositionalEmbedding(embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)

    def forward(self, x, positions):
        token_embedded = self.embedding(x)
        pos_embedded = self.positional_embedding(positions)
        
        embedded = token_embedded + pos_embedded
        rnn_output, _ = self.rnn(embedded)
        
        output = self.fc(rnn_output)
        return output

Next we have to add it to our training dataset.

def __getitem__(self, index):
    return (torch.tensor(self.data[index * self.sequence_length : index * self.sequence_length + self.sequence_length - 1], dtype=torch.long),
            torch.tensor(self.data[index * self.sequence_length + 1 : index * self.sequence_length + self.sequence_length], dtype=torch.long),
            torch.tensor(neighbor_kernel, dtype=torch.long))

And lastly, we add it to the training loop:

for i, (inputs, targets, positions) in enumerate(dataloader):
    inputs, targets, positions = inputs.cuda(), targets.cuda(), positions.cuda()
    outputs = model(inputs, positions)

Before testing it, we need to do a few comparisons. Firstly, I’ve increased the model size because the loss was once again a bit too high.

embedding_dim = 512
hidden_dim = 512
n_layers = 1

I know it’s hard to really see with that kind of scaling, but it does make a big difference in our next test.

We can’t fairly compare a 128×256 model without positional embedding against a 512×512 model with it, because the increased size alone could be making all the difference. So here’s the new 512×512 model without and with the positional embedding.

No positional embedding:

With positional embedding:

I think it helps not just with tile errors, but also the feel of the map. It’s hard to really say for sure though given that our neighbor kernel is still small.

Improving the Look-Ahead

Right now, the model is only looking at the immediate left, up, and up-right neighbors. This is problematic because it can choose a tile that satisfies the neighbor constraint check, only to be left in a bad state on the next one. We can improve this by extending the sequence to look further ahead. Note that we can choose any arbitrary shape, but despite its short length, this has produced the best overall results.

neighbor_kernel = [(0, -1), (1, -1), (2, -1), (-1, 0)]

Results

With the longer sequence length and all the tricks in the bag, let’s see what we get!

Temperature = 0.5, Repetition Penalty = 0.95

Temperature = 0.15, Repetition Penalty = 0.95

Temperature = 0.85, Repetition Penalty = 0.99

Let’s try with a different tileset (note that I didn’t modify these datasets):

Temperature = 0.15, Repetition Penalty = 0.95

Temperature = 0.85, Repetition Penalty = 0.99

Analysis

The first correctly synthesized 2D map blew my mind when I saw it. Not just because it worked, but because going into this, my expectation was that I’d be lucky if it managed to produce a single tree correctly, given that a tree is a 2×2 set of tiles that depends on the other trees around it. All of the results here use the connectivity constraints, but those only help produce a cleaner result. We can see what the model truly learned by disabling constraint enforcement:

This is just wild to me. It wasn’t told “offset trees have their corner changed”, “stumps have to be complete”, “land meeting water has a special tile”, or what the distribution of features is. You don’t see the map plastered with 100 chests and stumps because the data had them few and far between.

Of course, we can spot plenty of incorrect tiles if we scrutinize it, but part of this is because the neighbor kernel is short and the source dataset just isn’t large enough for the model to learn everything that is and isn’t okay. What I believe is happening is that the model generates tiles that look okay for a given sequence, only to advance one or two tiles and find there isn’t any tile it knows of that’s allowed to be there.

There are also still some cases in the data that the model learned was okay because I didn’t remove them. The original map has a small number of chimneys, so something like this almost can’t even be seen as an outlier:

But this is still impressive. Remember that the model has only been trained to look at the tile immediately to its left. We could extend the neighbor kernel to also include 2 tiles to the left, which improves the houses, but not without the rest of the model suffering from the longer sequence length and the same fixed amount of data (note this is still with neighbor constraints disabled):
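
For reference, that extended kernel just adds one more offset on the left (the exact ordering shown here is one possible arrangement):

neighbor_kernel = [(0, -1), (1, -1), (2, -1), (-2, 0), (-1, 0)]
sequence_length = len(neighbor_kernel) + 1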

And while it gets the left-to-right stuff correct more often now, vertical tiles can still frequently fail both with and without the constraint enforcement:

If I were smarter and more experienced, I’d be able to pick the model apart at some of these tiles to figure out what it was thinking when it selected them. For example, the witch’s hut only appears once in the entire map and has very limited tiles to select, so why with an extended context window did it fail to place any correct tiles on the bottom half?

Conclusion

On the whole, this worked so much better than I thought it would. This little experiment that occupied my weekend was enough to satisfy the curious itch I had as to whether or not it would work. Of course, some post-processing could fix most if not all of the errors, but that’s beside the point. This dispelled all of my preconceptions going into it, which, if you recall, were:

  • You need a LOT of data – Our dataset was extremely small in comparison to the model size, and yet it produced some really cool and believable parts.
  • You need a server rack of GPUs with tons of VRAM – All of my tests were done on my consumer-grade PC. It never used more than 500MB of VRAM, even with the larger sizes with 9.9M parameters. I didn’t make any attempts to use lower precision networks either.
  • You need a lot of patience because they take ages to train – From scratch, the large model takes only 10 seconds to train to usable results, capping out at around a minute for full convergence. The slowest part is actually inference, where Python has to look up sets with thousands of entries. Of course, this changes based on the model size, number of epochs, and sequence length, but I work on projects that take longer than that just to compile. So for small-scale stuff, this definitely isn’t the case.

I’d also be curious to see a better method of generation. Right now the main problem is that the model sets itself up for failure by picking a tile that leads to an invalid configuration. The base game also has features spanning several tiles vertically (like propelling flowers at the base of a cliff) that our neighbor kernel just can’t capture. Perhaps instead of generating in scanline order, the map could be generated in alternating rows, with the model trained on that ordering so verticality is better represented in the dataset.

Another solution is looking to transformers/attention or using a different kind of model like a diffusion model or GAN. Regardless, I’m really happy with how this little venture into AI turned out.

Closing Thoughts

First, a huge shoutout to ChatGPT for the help. Like I mentioned up front, I’ve never really worked seriously with neural networks before. I did some experimentation with QLoRAs and LLaMa when they were new, which helped me understand things like loss, inference, and the training loop, but never anything from scratch. ChatGPT is what suggested the transformer at first, then the RNN, the GRU/LSTM, making it bidirectional, and finally the positional embedding layer, not to mention it wrote a lot of that code. I still don’t really understand what these are or how they work, simply because I haven’t read their papers.

In fact I haven’t read any AI whitepapers. I’m still working heavily with voxels and game dev, but it’s good to stay in the loop with this stuff. Hopefully I didn’t make any major blunders when describing any of the concepts or what’s usually used. I’m going by little tidbits of information I’ve picked up along the way, much of which is thanks to all the awesome open source developers, Hugging Face’s tutorials, and Computerphile.

Anyway, if you try this, lemme know on Twitter! I’d love to see results for different games and improvements made by people who actually know what they’re doing.
