Learning Objectives
  1. Load, inspect, and clean tabular data using Julia's DataFrames ecosystem
  2. Explore categorical and numerical features through summary statistics and visualizations
  3. Filter and subset data to answer specific questions about your dataset
  4. Apply basic probability to a real decision-making scenario (balancing combat encounters)

Introduction

❧ ✾ ❧
What is Dungeons & Dragons?

Dungeons & Dragons (D&D) is a tabletop role-playing game where players navigate a story guided by a Dungeon Master (DM), who designs the world, controls monsters, and adjudicates rules. Combat is resolved by rolling a 20-sided die (d20). Every monster has a stat block with key attributes: Hit Points (HP), the damage it can absorb; Armor Class (AC), the d20 threshold to land a hit; and Challenge Rating (CR), a rough difficulty index for the party.

❧ ✾ ❧

Why Data-Informed Dungeon Mastering?

A DM preparing encounters faces a design problem: the Monster Manual contains hundreds of creatures, and browsing by hand is tedious and error-prone. Treating monster selection as a data management problem changes this: you can filter, sort, group, and visualize attributes programmatically, asking precise questions like "Which Large undead have AC below 14?". This is the same principle behind any evidence-based workflow: systematic exploration of your option space leads to better decisions.

About This Notebook

This notebook walks through an exploratory data analysis of D&D 5th Edition monster statistics using Julia. The workflow (load, inspect, explore, visualize, filter, analyze) is universal and transfers directly to any tabular dataset in any domain.

⚙ Running This Notebook

To run the source notebook interactively, you will need to:

  1. Install Julia (v1.9 or later recommended).
  2. Install VS Code and the Julia extension.
  3. Open the .ipynb notebook file in VS Code; the Julia extension provides full notebook support with syntax highlighting, inline plots, and a built-in REPL.
  4. When prompted, install the required packages: CSV, DataFrames, DataFramesMeta, Statistics, StatsPlots.
❧ ✾ ❧
Why Julia?

Julia is an open-source language designed for technical computing. It resolves a common tension: languages easy to write (Python, R) tend to be slow, while fast languages (C, Fortran) are harder to prototype in. Julia compiles just-in-time, so readable high-level code runs at speeds comparable to compiled languages. Its syntax reads close to mathematical notation, and its ecosystem covers statistics, data manipulation, and visualization. If you are new to Julia, the official documentation and Julia Academy are good starting points.

❧ ✾ ❧
Screenshot of the Julia REPL (Read-Eval-Print Loop) interactive terminal
The Julia REPL: an interactive terminal where you can test expressions, load packages, and explore data line by line before committing to a script.

Setup and Data Loading

# --- Package imports ---
using CSV              # Read/write CSV files
using DataFrames       # Tabular data structures
using DataFramesMeta   # Convenient macros: @subset, @groupby
using Statistics        # mean, median, std, etc.
using Random           # Reproducible random sampling
using StatsPlots       # Plotting: bar, boxplot, violin, pie

# Global plot defaults for readability
default(
    size = (820, 580),
    guidefontsize = 12,
    tickfontsize = 10,
    titlefontsize = 14,
    legendfontsize = 11,
    fontfamily = "sans-serif",
    margin = 5Plots.mm,
    dpi = 150
)

A DataFrame is a spreadsheet in code: rows are records (monsters), columns are attributes (HP, size, type). This structure lets you filter, group, summarize, and visualize patterns the same way for any tabular dataset.
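To make this concrete, here is a minimal sketch of building a small DataFrame by hand (toy values, not the monster dataset):

```julia
using DataFrames

# A tiny hand-built table: each keyword argument becomes a column
# (toy values for illustration only)
toy = DataFrame(
    name = ["Goblin", "Ogre", "Dragon"],
    hp   = [7, 59, 256],
    size = ["Small", "Large", "Gargantuan"])

nrow(toy)    # number of rows (records): 3
names(toy)   # column names: ["name", "hp", "size"]
```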

To load a CSV file into a DataFrame in Julia:

# Load the dataset; semicolon suppresses output
df_monsters = CSV.read("cleaned_monsters_basic.csv", DataFrame);

# Preview the first 5 rows
first(df_monsters, 5)
Name | Type | Alignment | AC | HP | Str | Speed | CR
Aboleth | Aberration | Lawful Evil | 17 | 135 | 21 | 10 | 10
Acolyte | Humanoid (any race) | Any alignment | 10 | 9 | 10 | 30 | 1/4
Adult Black Dragon | Dragon | Chaotic Evil | 19 | 195 | 23 | 40 | 14
Adult Blue Dragon | Dragon | Lawful Evil | 19 | 225 | 25 | 40 | 16
Adult Brass Dragon | Dragon | Chaotic Good | 18 | 172 | 23 | 40 | 13

This data was downloaded from Kaggle.

Inspecting column names

Before any analysis, check what features (columns) are available:

for name in names(df_monsters)
    print(name, "\t")
end
Column1  name  size  monster_type  alignment  ac  hp  strength  str_mod  dex  dex_mod  con  con_mod  intel  int_mod  wis  wis_mod  cha  cha_mod  senses  languages  cr  str_save  dex_save  con_save  int_save  wis_save  cha_save  speed  swim  fly  climb  burrow  number_legendary_actions  history  perception  stealth  persuasion  insight  deception  arcana  religion  acrobatics  athletics  intimidation

The dataset contains 45 columns. For this notebook, we will focus on a manageable subset: name, monster_type, alignment, ac, hp, strength, speed, and cr.
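One way to narrow a table to such a subset is `select`, which keeps only the named columns in the order given. A sketch on a toy stand-in table (toy values, not the real dataset):

```julia
using DataFrames

# Toy stand-in for df_monsters with a handful of its 45 columns
df = DataFrame(name = ["Aboleth"], monster_type = ["Aberration"],
               senses = ["darkvision 120 ft."], ac = [17], hp = [135])

# select keeps only the named columns, in the order given
df_small = select(df, :name, :monster_type, :ac, :hp)
names(df_small)  # ["name", "monster_type", "ac", "hp"]
```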


Exploring Categorical Features

Before computing statistics, check what values exist in your data. This prevents surprises like typos, unexpected categories, or missing entries.

How many different sizes exist?

unique(df_monsters[!, :size])
6-element Vector{String15}:
 "Large"
 "Medium"
 "Huge"
 "Gargantuan"
 "Small"
 "Tiny"
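Unique values tell you what exists; counting each category tells you how the data is distributed. A minimal sketch with a toy column (toy values, not the real dataset):

```julia
using DataFrames

# Toy size column (stand-in for df_monsters[!, :size])
df = DataFrame(size = ["Medium", "Large", "Medium", "Tiny", "Medium"])

# groupby splits by category; nrow counts the rows in each group
counts = combine(groupby(df, :size), nrow => :count)
sort!(counts, :count, rev = true)
counts.size[1], counts.count[1]  # most common category and its count
```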

Visualizing Distributions

Bar plot: Monster count by size

When you have discrete categories (Small, Medium, Large), bar charts make comparison intuitive: taller bars mean more items. Sorting bars by a logical order (Tiny → Gargantuan) or by frequency helps you immediately spot which categories dominate and which are rare.

function plot_frequency_distribution(string_array;
        title="Frequency Distribution",
        size=(800,600),
        rotation=45,
        var_order=nothing)

    # Count occurrences of each category
    freq_dict = Dict{String, Int}()
    for s in string_array
        freq_dict[s] = get(freq_dict, s, 0) + 1
    end

    # Sort by frequency (descending) by default
    sorted_pairs = sort(collect(freq_dict), by=x->x[2], rev=true)

    # Optional: reorder by a custom index (e.g. Tiny→Gargantuan)
    if var_order !== nothing
        if length(var_order) != length(sorted_pairs)
            error("var_order length must match unique elements")
        end
        ordered_pairs = [sorted_pairs[i] for i in var_order]
    else
        ordered_pairs = sorted_pairs
    end

    labels = [pair[1] for pair in ordered_pairs]
    counts = [pair[2] for pair in ordered_pairs]

    bar(labels, counts,
        title=title,
        xlabel="Categories",
        ylabel="Frequency",
        size=size,
        xrotation=rotation,
        legend=false,
        color=:steelblue)
end
plot_frequency_distribution(df_monsters[!, :size],
    title="Monster Size",
    var_order=[4, 5, 1, 2, 3, 6])
Bar chart showing monster size frequency distribution
Monster count by size category, ordered from Tiny to Gargantuan.
What to notice: Medium monsters dominate the dataset (~40%), reflecting the Monster Manual's emphasis on humanoid-scale enemies. For a DM, this means the most variety is available in the Medium category, but if you want Large or Huge creatures for dramatic set-piece encounters, your options narrow, making systematic filtering more valuable.

Pie chart: Size as proportion of the whole

Pie charts work best when you want to emphasize that categories are parts of a whole (100%). "Around 40% of all monsters are Medium-sized" reads more intuitively as a pie slice than a bar. Avoid pie charts when you have more than 5–6 categories.

function pie_chart_feat(string_array;
        title="PieChart Distribution",
        size=(600,400),
        legendfontsize=10)

    # Count occurrences
    freq_dict = Dict{String, Int}()
    for s in string_array
        freq_dict[s] = get(freq_dict, s, 0) + 1
    end

    labels = collect(keys(freq_dict))
    counts = collect(values(freq_dict))

    # Compute percentages and build "Label (X%)" strings
    total = sum(counts)
    percentages = round.((counts ./ total) .* 100, digits=1)
    labels_pct = [string(l, " (", p, "%)")
                  for (l, p) in zip(labels, percentages)]

    pie(labels_pct, counts,
        title=title, legend=:outertopright,
        size=size, legendfontsize=legendfontsize)
end
pie_chart_feat(df_monsters[!, :size],
    title="Monster Size Distribution",
    size=(1000,800),
    legendfontsize=16)
Pie chart showing monster size distribution percentages
Proportional breakdown of monster sizes in the D&D 5e Monster Manual.

Boxplots: How do numerical features vary across sizes?

Boxplots show the median, interquartile range (middle 50%), and outliers for each category. Violin plots add a density curve revealing the full distributional shape. Use boxplots for compact comparisons; violins when shape matters (bimodality, heavy skew).

The DataFramesMeta package provides the @groupby macro, which partitions a DataFrame into sub-tables by a categorical variable, similar to SQL's GROUP BY or pandas' groupby().
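A minimal sketch of that pattern on a toy table (toy values; `@groupby` splits, then `combine` summarizes each group):

```julia
using DataFramesMeta, Statistics

# Toy table: two size categories with HP values
df = DataFrame(size = ["Tiny", "Tiny", "Huge", "Huge"],
               hp   = [3, 5, 120, 160])

gdf = @groupby(df, :size)                        # one sub-table per size
med = combine(gdf, :hp => median => :median_hp)  # summarize each group
```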

function plot_distributions_by_category(df, categorical_col, numeric_cols;
        title="Distributions by Category",
        size=(1600, 600),
        plot_type=:boxplot)

    # Split DataFrame into sub-tables by category
    # ($ tells the macro to read the column name from the variable)
    grouped_df = @groupby(df, $categorical_col)
    categories = [key[categorical_col] for key in keys(grouped_df)]

    # Determine subplot grid layout (max 3 columns)
    n_plots = length(numeric_cols)
    n_cols = min(3, n_plots)
    n_rows = ceil(Int, n_plots / n_cols)

    plots_array = []

    for (i, col) in enumerate(numeric_cols)
        all_values = Float64[]
        all_labels = String[]

        for category in categories
            # Look up the sub-table for this group key
            category_data = grouped_df[(category,)]
            values = category_data[!, col]
            clean_values = filter(!ismissing, values)
            clean_values = Float64.(clean_values)

            if !isempty(clean_values)
                append!(all_values, clean_values)
                append!(all_labels,
                    fill(string(category), length(clean_values)))
            end
        end

        if plot_type == :violin
            p = violin(all_labels, all_values,
                title=string(col), legend=false)
        elseif plot_type == :boxplot
            p = boxplot(all_labels, all_values,
                title=string(col), legend=false)
        end
        push!(plots_array, p)
    end

    plot(plots_array...,
        layout=(n_rows, n_cols), size=size,
        plot_title=title)
end
plot_distributions_by_category(df_monsters,
    :size, [:hp, :speed, :strength])
Boxplots of HP, speed, and strength by monster size
Distribution of HP, speed, and strength across monster size categories.
What to notice: HP scales strongly with size; Gargantuan creatures have median HP far above the others, with wide spread. Speed is relatively consistent across sizes (most creatures move 30–40 ft), but Gargantuan creatures show more variance. Strength increases with size as expected, but the overlap between Medium and Large is substantial, meaning a Large creature is not always stronger than a Medium one. For a DM, this means size alone is not a reliable proxy for difficulty: you need to look at the full stat profile.

Filtering and Subsetting

Real-world analysis rarely uses the entire dataset at once. You filter to answer specific questions; in our case, "Which Large monsters have surprisingly low HP?"

Are there Large monsters with low HP?

Using the @subset macro from DataFramesMeta, we can combine conditions. The dot (.) prefix on operators means "apply element-wise"; Julia checks each row individually.
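The dot in action on a plain vector:

```julia
hps = [135, 9, 195]

# Without the dot, `<` between a vector and a scalar is an error;
# with it, the comparison runs once per element
mask = hps .< 20  # Bool[false, true, false]
```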

using DataFramesMeta

# Filter: Large monsters with fewer than 20 HP
# The dot (.) broadcasts the comparison across all rows
low_hp_large = @subset(df_monsters,
    :hp .< 20,
    :size .== "Large")
Name | Type | Alignment | AC | HP | Str | Speed | CR
Axe Beak | Beast | Unaligned | 11 | 19 | 14 | 50 | 1/4
Camel | Beast | Unaligned | 9 | 15 | 16 | 50 | 1/8
Constrictor Snake | Beast | Unaligned | 12 | 13 | 15 | 30 | 1/4
Crocodile | Beast | Unaligned | 12 | 19 | 15 | 20 | 1/2
Draft Horse | Beast | Unaligned | 10 | 19 | 18 | 40 | 1/4
Elk | Beast | Unaligned | 10 | 13 | 16 | 50 | 1/4
Giant Goat | Beast | Unaligned | 11 | 19 | 17 | 40 | 1/2
Giant Lizard | Beast | Unaligned | 12 | 19 | 15 | 30 | 1/4
Giant Owl | Beast | Neutral | 12 | 19 | 13 | 5 | 1/4
Giant Sea Horse | Beast | Unaligned | 13 | 16 | 12 | 0 | 1/2
Hippogriff | Monstrosity | Unaligned | 11 | 19 | 17 | 40 | 1
Riding Horse | Beast | Unaligned | 10 | 13 | 16 | 60 | 1/4
Warhorse | Beast | Unaligned | 11 | 19 | 18 | 60 | 1/2

Of the 13 results, 12 are Beasts and one is a Monstrosity (the Hippogriff), all with CRs at or below 1. These are large-bodied but fragile creatures, useful for wildlife encounters or travel sequences where you want imposing visuals without deadly stakes.

Filtering by multiple categories

For more complex selections, say you are a DM looking for Small-to-Large Aberrations and Dragons:

# Define allowed values for each filter
monster_sizes = ["Small", "Medium", "Large"]
monster_types = ["Aberration", "Dragon"]

# .∈ broadcasts "is element of" row-wise
# Ref() prevents Julia from iterating over the array itself
monster_options = @subset(df_monsters,
    :size .∈ Ref(monster_sizes),
    :monster_type .∈ Ref(monster_types));
size(monster_options, 1)
26

This gives 26 candidates. Here are the first six:

Name | Type | Alignment | AC | HP | Str | Speed | CR
Aboleth | Aberration | Lawful Evil | 17 | 135 | 21 | 10 | 10
Black Dragon Wyrmling | Dragon | Chaotic Evil | 17 | 33 | 15 | 30 | 2
Blue Dragon Wyrmling | Dragon | Lawful Evil | 17 | 52 | 17 | 30 | 3
Brass Dragon Wyrmling | Dragon | Chaotic Good | 16 | 16 | 15 | 30 | 1
Bronze Dragon Wyrmling | Dragon | Lawful Good | 17 | 32 | 17 | 30 | 2
Chuul | Aberration | Chaotic Evil | 16 | 93 | 19 | 30 | 4

Random selection

When you have a filtered pool and want to pick one at random (for a random encounter table or to break decision paralysis):

# Simulate a dice roll to pick a random monster from the pool
monster_dice_roll = rand(1:size(monster_options, 1))
monster_options[monster_dice_roll, :]
Name | Type | Alignment | AC | HP | Str | Speed | CR
Chuul | Aberration | Chaotic Evil | 16 | 93 | 19 | 30 | 4

Random selection is foundational in data science more broadly: train/test splits, bootstrap sampling, and randomized assignment all use this same operation.
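A minimal sketch of that idea: a seeded shuffle-and-cut split of a small pool (hypothetical pool and 75/25 split, using only the Random standard library):

```julia
using Random

Random.seed!(42)  # fix the seed so the split is reproducible
pool = ["Aboleth", "Chuul", "Ghast", "Wight",
        "Lich", "Zombie", "Wraith", "Specter"]

# Shuffle indices, then cut: first 75% "train", remainder "test"
idx = shuffle(1:length(pool))
cut = floor(Int, 0.75 * length(pool))
train, test = pool[idx[1:cut]], pool[idx[cut+1:end]]
(length(train), length(test))  # (6, 2)
```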

Campaign-specific filtering with regex

Most encounter-building depends on the campaign. If you're running Curse of Strahd, you want Undead creatures specifically. The @subset macro works with regular expressions via occursin():

# Case-insensitive regex match on monster_type
rege = r"Undead"i
my_monster_selection = @subset(df_monsters,
    occursin.(rege, :monster_type))

This returns 18 Undead monsters, from the Skeleton (CR 1/4) to the Lich (CR 21):

Name | Type | Alignment | AC | HP | Str | Speed | CR
Ghast | Undead | Chaotic Evil | 13 | 36 | 16 | 30 | 2
Ghost | Undead | Any alignment | 11 | 45 | 7 | 0 | 4
Ghoul | Undead | Chaotic Evil | 12 | 22 | 13 | 30 | 1
Lich | Undead | Any Evil | 17 | 135 | 11 | 30 | 21
Minotaur Skeleton | Undead | Lawful Evil | 12 | 67 | 18 | 40 | 2
Mummy | Undead | Lawful Evil | 11 | 58 | 16 | 20 | 3
Mummy Lord | Undead | Lawful Evil | 17 | 97 | 18 | 20 | 15
Ogre Zombie | Undead | Neutral Evil | 8 | 85 | 19 | 30 | 2
Shadow | Undead | Chaotic Evil | 12 | 16 | 6 | 40 | 1/2
Skeleton | Undead | Lawful Evil | 13 | 13 | 10 | 30 | 1/4
Specter | Undead | Chaotic Evil | 12 | 22 | 1 | 0 | 1
Vampire | Undead | Lawful Evil | 16 | 144 | 18 | 30 | 13
Vampire Spawn | Undead | Neutral Evil | 15 | 82 | 16 | 30 | 5
Warhorse Skeleton | Undead | Lawful Evil | 13 | 22 | 18 | 60 | 1/2
Wight | Undead | Neutral Evil | 14 | 45 | 15 | 30 | 3
Will-O'-Wisp | Undead | Chaotic Evil | 19 | 22 | 1 | 0 | 2
Wraith | Undead | Neutral Evil | 13 | 67 | 6 | 0 | 5
Zombie | Undead | Neutral Evil | 8 | 22 | 13 | 20 | 1/4

Having this pool organized and queryable lets you design a progression of encounters across a campaign like Curse of Strahd.


Balancing a Fight

When putting together a fight, we want to balance challenge against fun. A useful lens is the chance that our players will hit a given monster on each roll. Keeping with the undead theme, let's calculate the distribution of success chances against these monsters to see if our players are ready to take on the living dead!

❧ ✾ ❧
A Brief Probability Spell

A d20 roll produces a uniform distribution over the integers $\{1, 2, \ldots, 20\}$; each outcome has a $\frac{1}{20} = 5\%$ chance. To hit a monster, the player needs:

$$\text{d20 roll} + \text{modifiers} \geq \text{AC}$$

So the probability of hitting is:

$$P(\text{hit}) = \frac{\text{number of rolls that meet or exceed (AC - modifiers)}}{20}$$

With two special cases from the D&D rules: a natural 20 always hits (minimum 5% chance regardless of AC), and a natural 1 always misses (maximum 95% chance regardless of modifiers).

This is structurally identical to any threshold-detection problem in statistics or engineering.
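As a sanity check, here is a small simulation (with an assumed AC of 15 and a +2 modifier) comparing the analytic formula against empirical d20 rolls:

```julia
using Random

# Hypothetical example: AC 15 monster, +2 attack modifier
ac, modifier = 15, 2

# Analytic probability: faces meeting AC - modifier, with the
# nat-20 / nat-1 rules folded in by clamping the needed roll to 2..20
need = clamp(ac - modifier, 2, 20)
p_exact = (21 - need) / 20            # (21 - 13) / 20 = 0.4

# Empirical check: simulate a large number of d20 attack rolls
Random.seed!(1)
n = 100_000
p_sim = count(_ -> rand(1:20) >= need, 1:n) / n  # close to p_exact
```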

❧ ✾ ❧

Calculating hit probabilities across Undead monsters

We compute this as follows:

# Hit probability for a d20 roll against a given AC
function calculate_d20_probability(threshold::Int, modifier::Int)
    effective_threshold = threshold - modifier

    # D&D rules: nat 20 always hits, nat 1 always misses
    if effective_threshold > 20
        return 0.05  # Natural 20 always hits
    elseif effective_threshold <= 1
        return 0.95  # Natural 1 always misses
    else
        # Count how many d20 faces meet the threshold
        favorable_outcomes = 21 - effective_threshold
        return favorable_outcomes / 20.0
    end
end

# Load data and filter to Undead only
df = CSV.read("cleaned_monsters_basic.csv", DataFrame)
filtered_df = filter(row -> row.monster_type == "Undead", df)
sort!(filtered_df, :ac)  # Order by Armor Class for readable plots

# Four scenarios: no modifier, +5, and two players (+2 vs +7)
mod_0 = 0
mod_5 = 5
p1_mod = 2
p2_mod = 7

# Calculate probabilities
# (renamed to monster_names: assigning to `names` would clash with
#  the DataFrames function of the same name used earlier)
monster_names = filtered_df.name
chance_0 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(mod_0))
chance_5 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(mod_5))
chance_p1 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(p1_mod))
chance_p2 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(p2_mod))
# --- Three-panel dashboard ---

# Plot 1: No Modifier
p1 = bar(monster_names, chance_0,
    title = "Base Success (Modifier: 0)",
    color = :lightgrey,
    xticks = :all, xrotation = 45,
    ylims = (0, 1), legend = false)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash)

# Plot 2: Single Modifier (+5)
p2 = bar(monster_names, chance_5,
    title = "Modified Success (Modifier: +5)",
    color = :skyblue,
    xticks = :all, xrotation = 45,
    ylims = (0, 1), legend = false)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash)

# Plot 3: Player Comparison (legend enabled so the player labels show)
p3 = bar(monster_names, [chance_p1 chance_p2],
    title = "Player Comparison (+2 vs +7)",
    label = ["Player 1 (+2)" "Player 2 (+7)"],
    color = [:orange :purple],
    fillalpha = 0.5,
    ylabel = "Hit Probability",
    xticks = :all, xrotation = 45,
    ylims = (0, 1.1),
    legend = :topright)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash, label="")

# Combine into a single 3-row figure
plot(p1, p2, p3,
    layout = (3, 1),
    size = (1200, 1200),
    margin = 10Plots.mm)
Dashboard showing hit probability across undead monsters for different player modifiers
Hit probability for Undead monsters under four modifier scenarios. Red dashed line marks the 50% threshold.
What to notice: The red dashed line marks 50%, the coin-flip threshold. With no modifier (top panel), most Undead monsters sit at or below a 50% hit rate, meaning players will miss more often than they hit. The Zombie and Ogre Zombie (AC 8) are the easiest to hit. Adding a +5 modifier (middle panel) pushes nearly every monster above the 50% line; the encounter feels much more manageable.

The bottom panel is where it gets interesting for DMs. Player 1 (+2) struggles against AC 16+ monsters (Vampire, Lich), dropping to around 35% hit chance, while Player 2 (+7) stays above 50% against everything. This asymmetry matters: if you put a Vampire against this party, Player 1 will feel ineffective in direct combat, which might be frustrating, or might be a deliberate design choice that pushes them toward creative problem-solving. The data lets you make that choice intentionally rather than discovering it mid-session.


Summary

This notebook demonstrated a universal data analysis workflow applied to D&D monster statistics:

  1. Load and inspect your data to understand its structure and available features.
  2. Explore categorical variables (size, type) to see what categories exist and how they're distributed.
  3. Visualize distributions to compare numerical features (HP, speed, strength) across groups.
  4. Filter and subset to answer specific questions relevant to your use case.
  5. Apply quantitative reasoning, here basic probability, to inform decisions.

The tools and thinking transfer directly to any domain with tabular data: ecological surveys, clinical records, economic indicators, or sensor measurements. The key insight is the same in all of them: systematically exploring your data gives you access to options and patterns that intuition alone would miss. For a Dungeon Master, that means better-balanced encounters, more variety, and more confidence in design choices. For a researcher, it means better experimental design and more robust conclusions.


Appendix: Julia ↔ Python Quick Reference

If you're coming from Python with pandas and matplotlib, Julia will feel familiar. The syntax is clean and readable, but Julia compiles your code, which typically results in faster execution.

Concept | Python (pandas) | Julia (DataFrames.jl)
Import library | import pandas as pd | using DataFrames
Read CSV | df = pd.read_csv(...) | df = CSV.read(..., DataFrame)
Access column | df['column'] | df[!, :column]
Unique values | df['col'].unique() | unique(df[!, :col])
Row count | df.shape[0] | size(df, 1)
Filter rows | df[df['hp'] < 20] | @subset(df, :hp .< 20)
Group by | df.groupby('col') | @groupby(df, :col)
Random sample | df.sample(1) | df[rand(1:nrow(df)), :]
Bar plot | df['col'].value_counts().plot(kind='bar') | bar(labels, counts)
Boxplot | sns.boxplot(x=..., y=...) | boxplot(labels, values)

The ! in df[!, :column] means "give me the actual column, not a copy." Think of it as Julia being explicit about data access.
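The difference is observable: a minimal sketch showing that mutating the `!` column writes through to the DataFrame while the `:` copy does not:

```julia
using DataFrames

df = DataFrame(hp = [10, 20])

col_view = df[!, :hp]  # the actual column vector
col_copy = df[:, :hp]  # an independent copy

col_view[1] = 99       # mutates df
col_copy[2] = 0        # does NOT touch df

df.hp  # [99, 20]
```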