Learning Objectives
  1. Load, inspect, and clean tabular data using Julia's DataFrames ecosystem
  2. Explore categorical and numerical features through summary statistics and visualizations
  3. Filter and subset data to answer specific questions about your dataset
  4. Apply basic probability to a real decision-making scenario (balancing combat encounters)

Introduction

❧ ✾ ❧
What is Dungeons & Dragons?

Dungeons & Dragons (D&D) is a tabletop role-playing game where players navigate a story guided by a Dungeon Master (DM), who designs the world, controls monsters, and adjudicates rules. Combat is resolved by rolling a 20-sided die (d20). Every monster has a stat block with key attributes: Hit Points (HP), the damage it can absorb; Armor Class (AC), the d20 threshold to land a hit; and Challenge Rating (CR), a rough difficulty index for the party.

❧ ✾ ❧

Why Data-Informed Dungeon Mastering?

A DM preparing encounters faces a design problem: the Monster Manual contains hundreds of creatures, and browsing by hand is tedious and error-prone. Treating monster selection as a data management problem changes this: you can filter, sort, group, and visualize attributes programmatically, asking precise questions like "Which Large undead have AC below 14?". This is the same principle behind any evidence-based workflow: systematic exploration of your option space leads to better decisions.

About This Notebook

This notebook walks through an exploratory data analysis of D&D 5th Edition monster statistics using Julia. The workflow (load, inspect, explore, visualize, filter, analyze) is universal and transfers directly to any tabular dataset in any domain.

⚙ Running This Notebook

To run the source notebook interactively, you will need to:

  1. Install Julia (v1.9 or later recommended).
  2. Install VS Code and the Julia extension.
  3. Open the .ipynb notebook file in VS Code; the Julia extension provides full notebook support with syntax highlighting, inline plots, and a built-in REPL.
  4. When prompted, install the required packages: CSV, DataFrames, DataFramesMeta, Statistics, StatsPlots.
❧ ✾ ❧
Why Julia?

Julia is an open-source language designed for technical computing. It resolves a common tension: languages easy to write (Python, R) tend to be slow, while fast languages (C, Fortran) are harder to prototype in. Julia compiles just-in-time, so readable high-level code runs at speeds comparable to compiled languages. Its syntax reads close to mathematical notation, and its ecosystem covers statistics, data manipulation, and visualization. If you are new to Julia, the official documentation and Julia Academy are good starting points.

❧ ✾ ❧
Screenshot of the Julia REPL (Read-Eval-Print Loop) interactive terminal
The Julia REPL: an interactive terminal where you can test expressions, load packages, and explore data line by line before committing to a script.

Setup and Data Loading

# --- Package imports ---
using CSV              # Read/write CSV files
using DataFrames       # Tabular data structures
using DataFramesMeta   # Convenient macros: @subset, @groupby
using Statistics        # mean, median, std, etc.
using Random           # Reproducible random sampling
using StatsPlots       # Plotting: bar, boxplot, violin, pie

# Global plot defaults for readability
default(
    size = (820, 580),
    guidefontsize = 12,
    tickfontsize = 10,
    titlefontsize = 14,
    legendfontsize = 11,
    fontfamily = "sans-serif",
    margin = 5Plots.mm,
    dpi = 150
)

A DataFrame is a spreadsheet in code: rows are records (monsters), columns are attributes (HP, size, type). This structure lets you filter, group, summarize, and visualize patterns the same way for any tabular dataset.
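To make this concrete, here is a minimal sketch of building a small DataFrame by hand (toy values, not the monster dataset):

```julia
using DataFrames

# A tiny hand-built table: each keyword argument becomes a column
# (toy values for illustration only)
toy = DataFrame(
    name = ["Goblin", "Ogre", "Dragon"],
    hp   = [7, 59, 256],
    size = ["Small", "Large", "Gargantuan"])

nrow(toy)    # number of rows (records): 3
names(toy)   # column names: ["name", "hp", "size"]
```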

To load a CSV file into a DataFrame in Julia:

# Load the dataset; semicolon suppresses output
df_monsters = CSV.read("cleaned_monsters_basic.csv", DataFrame);

# Preview the first 5 rows
first(df_monsters, 5)
Name | Type | Alignment | AC | HP | Str | Speed | CR
Aboleth | Aberration | Lawful Evil | 17 | 135 | 21 | 10 | 10
Acolyte | Humanoid (any race) | Any alignment | 10 | 9 | 10 | 30 | 1/4
Adult Black Dragon | Dragon | Chaotic Evil | 19 | 195 | 23 | 40 | 14
Adult Blue Dragon | Dragon | Lawful Evil | 19 | 225 | 25 | 40 | 16
Adult Brass Dragon | Dragon | Chaotic Good | 18 | 172 | 23 | 40 | 13

This data was downloaded from Kaggle.

Inspecting column names

Before any analysis, check what features (columns) are available:

for name in names(df_monsters)
    print(name, "\t")
end
Column1  name  size  monster_type  alignment  ac  hp  strength  str_mod  dex  dex_mod  con  con_mod  intel  int_mod  wis  wis_mod  cha  cha_mod  senses  languages  cr  str_save  dex_save  con_save  int_save  wis_save  cha_save  speed  swim  fly  climb  burrow  number_legendary_actions  history  perception  stealth  persuasion  insight  deception  arcana  religion  acrobatics  athletics  intimidation

The dataset contains 45 columns. For this notebook, we will focus on a manageable subset: name, monster_type, alignment, ac, hp, strength, speed, and cr.
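One way to narrow a table to such a subset is `select`, which keeps only the named columns in the order given. A sketch on a toy stand-in table (toy values, not the real dataset):

```julia
using DataFrames

# Toy stand-in for df_monsters with a handful of its 45 columns
df = DataFrame(name = ["Aboleth"], monster_type = ["Aberration"],
               senses = ["darkvision 120 ft."], ac = [17], hp = [135])

# select keeps only the named columns, in the order given
df_small = select(df, :name, :monster_type, :ac, :hp)
names(df_small)  # ["name", "monster_type", "ac", "hp"]
```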


Exploring Categorical Features

Before computing statistics, check what values exist in your data. This prevents surprises like typos, unexpected categories, or missing entries.

How many different sizes exist?

unique(df_monsters[!, :size])
6-element Vector{String15}:
 "Large"
 "Medium"
 "Huge"
 "Gargantuan"
 "Small"
 "Tiny"
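Unique values tell you what exists; counting each category tells you how the data is distributed. A minimal sketch with a toy column (toy values, not the real dataset):

```julia
using DataFrames

# Toy size column (stand-in for df_monsters[!, :size])
df = DataFrame(size = ["Medium", "Large", "Medium", "Tiny", "Medium"])

# groupby splits by category; nrow counts the rows in each group
counts = combine(groupby(df, :size), nrow => :count)
sort!(counts, :count, rev = true)
counts.size[1], counts.count[1]  # most common category and its count
```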

Visualizing Distributions

Bar plot: Monster count by size

When you have discrete categories (Small, Medium, Large), bar charts make comparison intuitive: taller bars mean more items. Sorting bars by a logical order (Tiny → Gargantuan) or by frequency helps you immediately spot which categories dominate and which are rare.

function plot_frequency_distribution(string_array;
        title="Frequency Distribution",
        size=(800,600),
        rotation=45,
        var_order=nothing)

    # Count occurrences of each category
    freq_dict = Dict{String, Int}()
    for s in string_array
        freq_dict[s] = get(freq_dict, s, 0) + 1
    end

    # Sort by frequency (descending) by default
    sorted_pairs = sort(collect(freq_dict), by=x->x[2], rev=true)

    # Optional: reorder by a custom index (e.g. Tiny→Gargantuan)
    if var_order !== nothing
        if length(var_order) != length(sorted_pairs)
            error("var_order length must match unique elements")
        end
        ordered_pairs = [sorted_pairs[i] for i in var_order]
    else
        ordered_pairs = sorted_pairs
    end

    labels = [pair[1] for pair in ordered_pairs]
    counts = [pair[2] for pair in ordered_pairs]

    bar(labels, counts,
        title=title,
        xlabel="Categories",
        ylabel="Frequency",
        size=size,
        xrotation=rotation,
        legend=false,
        color=:steelblue)
end
plot_frequency_distribution(df_monsters[!, :size],
    title="Monster Size",
    var_order=[4, 5, 1, 2, 3, 6])
Bar chart showing monster size frequency distribution
Monster count by size category, ordered from Tiny to Gargantuan.
What to notice: Medium monsters dominate the dataset (~40%), reflecting the Monster Manual's emphasis on humanoid-scale enemies. For a DM, this means the most variety is available in the Medium category, but if you want Large or Huge creatures for dramatic set-piece encounters, your options narrow, making systematic filtering more valuable.

Pie chart: Size as proportion of the whole

Pie charts work best when you want to emphasize that categories are parts of a whole (100%). "Around 40% of all monsters are Medium-sized" reads more intuitively as a pie slice than a bar. Avoid pie charts when you have more than 5–6 categories.

function pie_chart_feat(string_array;
        title="PieChart Distribution",
        size=(600,400),
        legendfontsize=10)

    # Count occurrences
    freq_dict = Dict{String, Int}()
    for s in string_array
        freq_dict[s] = get(freq_dict, s, 0) + 1
    end

    labels = collect(keys(freq_dict))
    counts = collect(values(freq_dict))

    # Compute percentages and build "Label (X%)" strings
    total = sum(counts)
    percentages = round.((counts ./ total) .* 100, digits=1)
    labels_pct = [string(l, " (", p, "%)")
                  for (l, p) in zip(labels, percentages)]

    pie(labels_pct, counts,
        title=title, legend=:outertopright,
        size=size, legendfontsize=legendfontsize)
end
pie_chart_feat(df_monsters[!, :size],
    title="Monster Size Distribution",
    size=(1000,800),
    legendfontsize=16)
Pie chart showing monster size distribution percentages
Proportional breakdown of monster sizes in the D&D 5e Monster Manual.

Boxplots: How do numerical features vary across sizes?

Boxplots show the median, interquartile range (middle 50%), and outliers for each category. Violin plots add a density curve revealing the full distributional shape. Use boxplots for compact comparisons; violins when shape matters (bimodality, heavy skew).

The DataFramesMeta package provides the @groupby macro, which partitions a DataFrame into sub-tables by a categorical variable, similar to SQL's GROUP BY or pandas' groupby().
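A minimal sketch of that pattern on a toy table (toy values; `@groupby` splits, then `combine` summarizes each group):

```julia
using DataFramesMeta, Statistics

# Toy table: two size categories with HP values
df = DataFrame(size = ["Tiny", "Tiny", "Huge", "Huge"],
               hp   = [3, 5, 120, 160])

gdf = @groupby(df, :size)                        # one sub-table per size
med = combine(gdf, :hp => median => :median_hp)  # summarize each group
```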

function plot_distributions_by_category(df, categorical_col, numeric_cols;
        title="Distributions by Category",
        size=(1600, 600),
        plot_type=:boxplot)

    # Split DataFrame into sub-tables by category
    # ($ tells the macro to read the column name from the variable)
    grouped_df = @groupby(df, $categorical_col)
    categories = [key[categorical_col] for key in keys(grouped_df)]

    # Determine subplot grid layout (max 3 columns)
    n_plots = length(numeric_cols)
    n_cols = min(3, n_plots)
    n_rows = ceil(Int, n_plots / n_cols)

    plots_array = []

    for (i, col) in enumerate(numeric_cols)
        all_values = Float64[]
        all_labels = String[]

        for category in categories
            # Look up the sub-table for this group key
            category_data = grouped_df[(category,)]
            values = category_data[!, col]
            clean_values = filter(!ismissing, values)
            clean_values = Float64.(clean_values)

            if !isempty(clean_values)
                append!(all_values, clean_values)
                append!(all_labels,
                    fill(string(category), length(clean_values)))
            end
        end

        if plot_type == :violin
            p = violin(all_labels, all_values,
                title=string(col), legend=false)
        elseif plot_type == :boxplot
            p = boxplot(all_labels, all_values,
                title=string(col), legend=false)
        end
        push!(plots_array, p)
    end

    plot(plots_array...,
        layout=(n_rows, n_cols), size=size,
        plot_title=title)
end
plot_distributions_by_category(df_monsters,
    :size, [:hp, :speed, :strength])
Boxplots of HP, speed, and strength by monster size
Distribution of HP, speed, and strength across monster size categories.
What to notice: HP scales strongly with size; Gargantuan creatures have median HP far above the others, with wide spread. Speed is relatively consistent across sizes (most creatures move 30–40 ft), but Gargantuan creatures show more variance. Strength increases with size as expected, but the overlap between Medium and Large is substantial, meaning a Large creature is not always stronger than a Medium one. For a DM, this means size alone is not a reliable proxy for difficulty: you need to look at the full stat profile.

Filtering and Subsetting

Real-world analysis rarely uses the entire dataset at once. You filter to answer specific questions; in our case, "Which Large monsters have surprisingly low HP?"

Are there Large monsters with low HP?

Using the @subset macro from DataFramesMeta, we can combine conditions. The dot (.) prefix on operators means "apply element-wise"; Julia checks each row individually.
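The dot in action on a plain vector:

```julia
hps = [135, 9, 195]

# Without the dot, `<` between a vector and a scalar is an error;
# with it, the comparison runs once per element
mask = hps .< 20  # Bool[false, true, false]
```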

using DataFramesMeta

# Filter: Large monsters with fewer than 20 HP
# The dot (.) broadcasts the comparison across all rows
low_hp_large = @subset(df_monsters,
    :hp .< 20,
    :size .== "Large")
Name | Type | Alignment | AC | HP | Str | Speed | CR
Axe Beak | Beast | Unaligned | 11 | 19 | 14 | 50 | 1/4
Camel | Beast | Unaligned | 9 | 15 | 16 | 50 | 1/8
Constrictor Snake | Beast | Unaligned | 12 | 13 | 15 | 30 | 1/4
Crocodile | Beast | Unaligned | 12 | 19 | 15 | 20 | 1/2
Draft Horse | Beast | Unaligned | 10 | 19 | 18 | 40 | 1/4
Elk | Beast | Unaligned | 10 | 13 | 16 | 50 | 1/4
Giant Goat | Beast | Unaligned | 11 | 19 | 17 | 40 | 1/2
Giant Lizard | Beast | Unaligned | 12 | 19 | 15 | 30 | 1/4
Giant Owl | Beast | Neutral | 12 | 19 | 13 | 5 | 1/4
Giant Sea Horse | Beast | Unaligned | 13 | 16 | 12 | 0 | 1/2
Hippogriff | Monstrosity | Unaligned | 11 | 19 | 17 | 40 | 1
Riding Horse | Beast | Unaligned | 10 | 13 | 16 | 60 | 1/4
Warhorse | Beast | Unaligned | 11 | 19 | 18 | 60 | 1/2

Of the 13 results, 12 are Beasts and one is a Monstrosity (the Hippogriff), all with CRs at or below 1. These are large-bodied but fragile creatures, useful for wildlife encounters or travel sequences where you want imposing visuals without deadly stakes.

Filtering by multiple categories

For more complex selections, say you are a DM looking for Small-to-Large Aberrations and Dragons:

# Define allowed values for each filter
monster_sizes = ["Small", "Medium", "Large"]
monster_types = ["Aberration", "Dragon"]

# .∈ broadcasts "is element of" row-wise
# Ref() prevents Julia from iterating over the array itself
monster_options = @subset(df_monsters,
    :size .∈ Ref(monster_sizes),
    :monster_type .∈ Ref(monster_types));
size(monster_options, 1)
26

This gives 26 candidates. Here are the first six:

Name | Type | Alignment | AC | HP | Str | Speed | CR
Aboleth | Aberration | Lawful Evil | 17 | 135 | 21 | 10 | 10
Black Dragon Wyrmling | Dragon | Chaotic Evil | 17 | 33 | 15 | 30 | 2
Blue Dragon Wyrmling | Dragon | Lawful Evil | 17 | 52 | 17 | 30 | 3
Brass Dragon Wyrmling | Dragon | Chaotic Good | 16 | 16 | 15 | 30 | 1
Bronze Dragon Wyrmling | Dragon | Lawful Good | 17 | 32 | 17 | 30 | 2
Chuul | Aberration | Chaotic Evil | 16 | 93 | 19 | 30 | 4

Random selection

When you have a filtered pool and want to pick one at random (for a random encounter table or to break decision paralysis):

# Simulate a dice roll to pick a random monster from the pool
monster_dice_roll = rand(1:size(monster_options, 1))
monster_options[monster_dice_roll, :]
Name | Type | Alignment | AC | HP | Str | Speed | CR
Chuul | Aberration | Chaotic Evil | 16 | 93 | 19 | 30 | 4

Random selection is foundational in data science more broadly: train/test splits, bootstrap sampling, and randomized assignment all use this same operation.
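A minimal sketch of that idea: a seeded shuffle-and-cut split of a small pool (hypothetical pool and 75/25 split, using only the Random standard library):

```julia
using Random

Random.seed!(42)  # fix the seed so the split is reproducible
pool = ["Aboleth", "Chuul", "Ghast", "Wight",
        "Lich", "Zombie", "Wraith", "Specter"]

# Shuffle indices, then cut: first 75% "train", remainder "test"
idx = shuffle(1:length(pool))
cut = floor(Int, 0.75 * length(pool))
train, test = pool[idx[1:cut]], pool[idx[cut+1:end]]
(length(train), length(test))  # (6, 2)
```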

Campaign-specific filtering with regex

Most encounter-building depends on the campaign. If you're running Curse of Strahd, you want Undead creatures specifically. The @subset macro works with regular expressions via occursin():

# Case-insensitive regex match on monster_type
rege = r"Undead"i
my_monster_selection = @subset(df_monsters,
    occursin.(rege, :monster_type))

This returns 18 Undead monsters, from the Skeleton (CR 1/4) to the Lich (CR 21):

Name | Type | Alignment | AC | HP | Str | Speed | CR
Ghast | Undead | Chaotic Evil | 13 | 36 | 16 | 30 | 2
Ghost | Undead | Any alignment | 11 | 45 | 7 | 0 | 4
Ghoul | Undead | Chaotic Evil | 12 | 22 | 13 | 30 | 1
Lich | Undead | Any Evil | 17 | 135 | 11 | 30 | 21
Minotaur Skeleton | Undead | Lawful Evil | 12 | 67 | 18 | 40 | 2
Mummy | Undead | Lawful Evil | 11 | 58 | 16 | 20 | 3
Mummy Lord | Undead | Lawful Evil | 17 | 97 | 18 | 20 | 15
Ogre Zombie | Undead | Neutral Evil | 8 | 85 | 19 | 30 | 2
Shadow | Undead | Chaotic Evil | 12 | 16 | 6 | 40 | 1/2
Skeleton | Undead | Lawful Evil | 13 | 13 | 10 | 30 | 1/4
Specter | Undead | Chaotic Evil | 12 | 22 | 1 | 0 | 1
Vampire | Undead | Lawful Evil | 16 | 144 | 18 | 30 | 13
Vampire Spawn | Undead | Neutral Evil | 15 | 82 | 16 | 30 | 5
Warhorse Skeleton | Undead | Lawful Evil | 13 | 22 | 18 | 60 | 1/2
Wight | Undead | Neutral Evil | 14 | 45 | 15 | 30 | 3
Will-O'-Wisp | Undead | Chaotic Evil | 19 | 22 | 1 | 0 | 2
Wraith | Undead | Neutral Evil | 13 | 67 | 6 | 0 | 5
Zombie | Undead | Neutral Evil | 8 | 22 | 13 | 20 | 1/4

Having this pool organized and queryable lets you design a progression of encounters across a campaign like Curse of Strahd.


Balancing a Fight

When putting together a fight, we want to balance challenge against fun. A useful lens is the chance that our players will hit a given monster on each roll. Keeping with the undead theme, let's calculate the distribution of success chances against these monsters to see if our players are ready to take on the living dead!

❧ ✾ ❧
A Brief Probability Spell

A d20 roll produces a uniform distribution over the integers $\{1, 2, \ldots, 20\}$; each outcome has a $\frac{1}{20} = 5\%$ chance. To hit a monster, the player needs:

$$\text{d20 roll} + \text{modifiers} \geq \text{AC}$$

So the probability of hitting is:

$$P(\text{hit}) = \frac{\text{number of rolls that meet or exceed (AC - modifiers)}}{20}$$

With two special cases from the D&D rules: a natural 20 always hits (minimum 5% chance regardless of AC), and a natural 1 always misses (maximum 95% chance regardless of modifiers).

This is structurally identical to any threshold-detection problem in statistics or engineering.
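As a sanity check, here is a small simulation (with an assumed AC of 15 and a +2 modifier) comparing the analytic formula against empirical d20 rolls:

```julia
using Random

# Hypothetical example: AC 15 monster, +2 attack modifier
ac, modifier = 15, 2

# Analytic probability: faces meeting AC - modifier, with the
# nat-20 / nat-1 rules folded in by clamping the needed roll to 2..20
need = clamp(ac - modifier, 2, 20)
p_exact = (21 - need) / 20            # (21 - 13) / 20 = 0.4

# Empirical check: simulate a large number of d20 attack rolls
Random.seed!(1)
n = 100_000
p_sim = count(_ -> rand(1:20) >= need, 1:n) / n  # close to p_exact
```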

❧ ✾ ❧

Calculating hit probabilities across Undead monsters

We compute this as follows:

# Hit probability for a d20 roll against a given AC
function calculate_d20_probability(threshold::Int, modifier::Int)
    effective_threshold = threshold - modifier

    # D&D rules: nat 20 always hits, nat 1 always misses
    if effective_threshold > 20
        return 0.05  # Natural 20 always hits
    elseif effective_threshold <= 1
        return 0.95  # Natural 1 always misses
    else
        # Count how many d20 faces meet the threshold
        favorable_outcomes = 21 - effective_threshold
        return favorable_outcomes / 20.0
    end
end

# Load data and filter to Undead only
df = CSV.read("cleaned_monsters_basic.csv", DataFrame)
filtered_df = filter(row -> row.monster_type == "Undead", df)
sort!(filtered_df, :ac)  # Order by Armor Class for readable plots

# Four scenarios: no modifier, +5, and two players (+2 vs +7)
mod_0 = 0
mod_5 = 5
p1_mod = 2
p2_mod = 7

# Calculate probabilities
# (renamed to monster_names: assigning to `names` would clash with
#  the DataFrames function of the same name used earlier)
monster_names = filtered_df.name
chance_0 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(mod_0))
chance_5 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(mod_5))
chance_p1 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(p1_mod))
chance_p2 = calculate_d20_probability.(
    Int.(filtered_df.ac), Ref(p2_mod))
# --- Three-panel dashboard ---

# Plot 1: No Modifier
p1 = bar(monster_names, chance_0,
    title = "Base Success (Modifier: 0)",
    color = :lightgrey,
    xticks = :all, xrotation = 45,
    ylims = (0, 1), legend = false)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash)

# Plot 2: Single Modifier (+5)
p2 = bar(monster_names, chance_5,
    title = "Modified Success (Modifier: +5)",
    color = :skyblue,
    xticks = :all, xrotation = 45,
    ylims = (0, 1), legend = false)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash)

# Plot 3: Player Comparison (legend enabled so the player labels show)
p3 = bar(monster_names, [chance_p1 chance_p2],
    title = "Player Comparison (+2 vs +7)",
    label = ["Player 1 (+2)" "Player 2 (+7)"],
    color = [:orange :purple],
    fillalpha = 0.5,
    ylabel = "Hit Probability",
    xticks = :all, xrotation = 45,
    ylims = (0, 1.1),
    legend = :topright)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash, label="")

# Combine into a single 3-row figure
plot(p1, p2, p3,
    layout = (3, 1),
    size = (1200, 1200),
    margin = 10Plots.mm)
Dashboard showing hit probability across undead monsters for different player modifiers
Hit probability for Undead monsters under four modifier scenarios. Red dashed line marks the 50% threshold.
What to notice: The red dashed line marks 50%, the coin-flip threshold. With no modifier (top panel), most Undead monsters sit at or below a 50% hit rate, meaning players will miss more often than they hit. The Zombie and Ogre Zombie (AC 8) are the easiest to hit. Adding a +5 modifier (middle panel) pushes nearly every monster above the 50% line; the encounter feels much more manageable.

The bottom panel is where it gets interesting for DMs. Player 1 (+2) struggles against AC 16+ monsters (Vampire, Lich), dropping to around 35% hit chance, while Player 2 (+7) stays above 50% against everything. This asymmetry matters: if you put a Vampire against this party, Player 1 will feel ineffective in direct combat, which might be frustrating, or might be a deliberate design choice that pushes them toward creative problem-solving. The data lets you make that choice intentionally rather than discovering it mid-session.


Summary

This notebook demonstrated a universal data analysis workflow applied to D&D monster statistics:

  1. Load and inspect your data to understand its structure and available features.
  2. Explore categorical variables (size, type) to see what categories exist and how they're distributed.
  3. Visualize distributions to compare numerical features (HP, speed, strength) across groups.
  4. Filter and subset to answer specific questions relevant to your use case.
  5. Apply quantitative reasoning, here basic probability, to inform decisions.

The tools and thinking transfer directly to any domain with tabular data: ecological surveys, clinical records, economic indicators, or sensor measurements. The key insight is the same in all of them: systematically exploring your data gives you access to options and patterns that intuition alone would miss. For a Dungeon Master, that means better-balanced encounters, more variety, and more confidence in design choices. For a researcher, it means better experimental design and more robust conclusions.


Appendix: Julia ↔ Python Quick Reference

If you're coming from Python with pandas and matplotlib, Julia will feel familiar. The syntax is clean and readable, but Julia compiles your code, which typically results in faster execution.

Concept | Python (pandas) | Julia (DataFrames.jl)
Import library | import pandas as pd | using DataFrames
Read CSV | df = pd.read_csv(...) | df = CSV.read(..., DataFrame)
Access column | df['column'] | df[!, :column]
Unique values | df['col'].unique() | unique(df[!, :col])
Row count | df.shape[0] | size(df, 1)
Filter rows | df[df['hp'] < 20] | @subset(df, :hp .< 20)
Group by | df.groupby('col') | @groupby(df, :col)
Random sample | df.sample(1) | df[rand(1:nrow(df)), :]
Bar plot | df['col'].value_counts().plot(kind='bar') | bar(labels, counts)
Boxplot | sns.boxplot(x=..., y=...) | boxplot(labels, values)

The ! in df[!, :column] means "give me the actual column, not a copy." Think of it as Julia being explicit about data access.
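The difference is observable: a minimal sketch showing that mutating the `!` column writes through to the DataFrame while the `:` copy does not:

```julia
using DataFrames

df = DataFrame(hp = [10, 20])

col_view = df[!, :hp]  # the actual column vector
col_copy = df[:, :hp]  # an independent copy

col_view[1] = 99       # mutates df
col_copy[2] = 0        # does NOT touch df

df.hp  # [99, 20]
```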