- Load, inspect, and clean tabular data using Julia's DataFrames ecosystem
- Explore categorical and numerical features through summary statistics and visualizations
- Filter and subset data to answer specific questions about your dataset
- Apply basic probability to a real decision-making scenario (balancing combat encounters)
Introduction
Dungeons & Dragons (D&D) is a tabletop role-playing game where players navigate a story guided by a Dungeon Master (DM), who designs the world, controls monsters, and adjudicates rules. Combat is resolved by rolling a 20-sided die (d20). Every monster has a stat block with key attributes: Hit Points (HP), the damage it can absorb; Armor Class (AC), the d20 threshold to land a hit; and Challenge Rating (CR), a rough difficulty index for the party.
Why Data-Informed Dungeon Mastering?
A DM preparing encounters faces a design problem: the Monster Manual contains hundreds of creatures, and browsing by hand is tedious and error-prone. Treating monster selection as a data management problem changes this: you can filter, sort, group, and visualize attributes programmatically, asking precise questions like "Which Large undead have AC below 14?". This is the same principle behind any evidence-based workflow: systematic exploration of your option space leads to better decisions.
About This Notebook
This notebook walks through an exploratory data analysis of D&D 5th Edition monster statistics using Julia. The workflow (load, inspect, explore, visualize, filter, analyze) is universal and transfers directly to any tabular dataset in any domain.
To run the source notebook interactively:

- Install Julia (v1.9 or later recommended).
- Install VS Code and the Julia extension.
- Open the `.ipynb` notebook file in VS Code; the Julia extension provides full notebook support with syntax highlighting, inline plots, and a built-in REPL.
- When prompted, install the required packages: `CSV`, `DataFrames`, `DataFramesMeta`, `Statistics`, `StatsPlots`.
Julia is an open-source language designed for technical computing. It resolves a common tension: languages easy to write (Python, R) tend to be slow, while fast languages (C, Fortran) are harder to prototype in. Julia compiles just-in-time, so readable high-level code runs at speeds comparable to compiled languages. Its syntax reads close to mathematical notation, and its ecosystem covers statistics, data manipulation, and visualization. If you are new to Julia, the official documentation and Julia Academy are good starting points.
Setup and Data Loading
# --- Package imports ---
using CSV # Read/write CSV files
using DataFrames # Tabular data structures
using DataFramesMeta # Convenient macros: @subset, @groupby
using Statistics # mean, median, std, etc.
using Random # Reproducible random sampling
using StatsPlots # Plotting: bar, boxplot, violin, pie
# Global plot defaults for readability
default(
size = (820, 580),
guidefontsize = 12,
tickfontsize = 10,
titlefontsize = 14,
legendfontsize = 11,
fontfamily = "sans-serif",
margin = 5Plots.mm,
dpi = 150
)
A DataFrame is a spreadsheet in code: rows are records (monsters), columns are attributes (HP, size, type). This structure lets you filter, group, summarize, and visualize patterns, the same way for any tabular data.
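To make the spreadsheet analogy concrete, here is a minimal, self-contained sketch using a three-row toy table (illustrative stats, not the monster dataset we load below):

```julia
using DataFrames

# Each column is an attribute; each row is one record (monster)
toy = DataFrame(name = ["Goblin", "Ogre", "Wyvern"],
                hp   = [7, 59, 110],
                size = ["Small", "Large", "Large"])

nrow(toy)             # number of records
toy[!, :hp]           # the HP column as a vector
toy[toy.hp .> 50, :]  # only the rows where HP exceeds 50
```

The same three operations (count rows, pull a column, filter rows) recur throughout the rest of this notebook on the real dataset.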
To load a CSV file into a DataFrame in Julia:
# Load the dataset; semicolon suppresses output
df_monsters = CSV.read("cleaned_monsters_basic.csv", DataFrame);
# Preview the first 5 rows
first(df_monsters, 5)
| Name | Type | Alignment | AC | HP | Str | Speed | CR |
|---|---|---|---|---|---|---|---|
| Aboleth | Aberration | Lawful Evil | 17 | 135 | 21 | 10 | 10 |
| Acolyte | Humanoid (any race) | Any alignment | 10 | 9 | 10 | 30 | 1/4 |
| Adult Black Dragon | Dragon | Chaotic Evil | 19 | 195 | 23 | 40 | 14 |
| Adult Blue Dragon | Dragon | Lawful Evil | 19 | 225 | 25 | 40 | 16 |
| Adult Brass Dragon | Dragon | Chaotic Good | 18 | 172 | 23 | 40 | 13 |
This data was downloaded from Kaggle.
Inspecting column names
Before any analysis, check what features (columns) are available:
for name in names(df_monsters)
print(name, "\t")
end
Column1 name size monster_type alignment ac hp strength str_mod dex dex_mod con con_mod intel int_mod wis wis_mod cha cha_mod senses languages cr str_save dex_save con_save int_save wis_save cha_save speed swim fly climb burrow number_legendary_actions history perception stealth persuasion insight deception arcana religion acrobatics athletics intimidation
The dataset contains 45 columns. For this notebook, we will focus on a manageable subset: name, monster_type, alignment, ac, hp, strength, speed, and cr.
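Narrowing to that subset can be sketched with `select`, shown here on a one-row stand-in for the full 45-column table:

```julia
using DataFrames

# Illustrative stand-in: a few of the dataset's 45 columns
df = DataFrame(name = ["Aboleth"], ac = [17], hp = [135],
               str_save = [missing], history = [missing])

# select keeps only the columns you name, in the order given
df_focus = select(df, :name, :ac, :hp)
names(df_focus)  # ["name", "ac", "hp"]
```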
Exploring Categorical Features
Before computing statistics, check what values exist in your data. This prevents surprises like typos, unexpected categories, or missing entries.
How many different sizes exist?
unique(df_monsters[!, :size])
6-element Vector{String15}:
"Large"
"Medium"
"Huge"
"Gargantuan"
"Small"
"Tiny"
Visualizing Distributions
Bar plot: Monster count by size
When you have discrete categories (Small, Medium, Large), bar charts make comparison intuitive: taller bars mean more items. Sorting bars by a logical order (Tiny → Gargantuan) or by frequency helps you immediately spot which categories dominate and which are rare.
function plot_frequency_distribution(string_array;
title="Frequency Distribution",
size=(800,600),
rotation=45,
var_order=nothing)
# Count occurrences of each category
freq_dict = Dict{String, Int}()
for s in string_array
freq_dict[s] = get(freq_dict, s, 0) + 1
end
# Sort by frequency (descending) by default
sorted_pairs = sort(collect(freq_dict), by=x->x[2], rev=true)
# Optional: reorder by a custom index (e.g. Tiny→Gargantuan)
if var_order !== nothing
if length(var_order) != length(sorted_pairs)
error("var_order length must match unique elements")
end
ordered_pairs = [sorted_pairs[i] for i in var_order]
else
ordered_pairs = sorted_pairs
end
labels = [pair[1] for pair in ordered_pairs]
counts = [pair[2] for pair in ordered_pairs]
bar(labels, counts,
title=title,
xlabel="Categories",
ylabel="Frequency",
size=size,
xrotation=rotation,
legend=false,
color=:steelblue)
end
plot_frequency_distribution(df_monsters[!, :size],
title="Monster Size",
var_order=[4, 5, 1, 2, 3, 6])
Pie chart: Size as proportion of the whole
Pie charts work best when you want to emphasize that categories are parts of a whole (100%). "Half of all monsters are Medium-sized" reads more intuitively as a pie slice than a bar. Avoid pie charts when you have more than 5–6 categories.
function pie_chart_feat(string_array;
title="PieChart Distribution",
size=(600,400),
legendfontsize=10)
# Count occurrences
freq_dict = Dict{String, Int}()
for s in string_array
freq_dict[s] = get(freq_dict, s, 0) + 1
end
labels = collect(keys(freq_dict))
counts = collect(values(freq_dict))
# Compute percentages and build "Label (X%)" strings
total = sum(counts)
percentages = round.((counts ./ total) .* 100, digits=1)
labels_pct = [string(l, " (", p, "%)")
for (l, p) in zip(labels, percentages)]
pie(labels_pct, counts,
title=title, legend=:outertopright,
size=size, legendfontsize=legendfontsize)
end
pie_chart_feat(df_monsters[!, :size],
title="Monster Size Distribution",
size=(1000,800),
legendfontsize=16)
Boxplots: How do numerical features vary across sizes?
Boxplots show the median, interquartile range (middle 50%), and outliers for each category. Violin plots add a density curve revealing the full distributional shape. Use boxplots for compact comparisons; violins when shape matters (bimodality, heavy skew).
The DataFramesMeta package provides the @groupby macro, a thin wrapper over DataFrames' groupby function, which partitions a DataFrame into sub-tables by a categorical variable, similar to SQL's GROUP BY or pandas' groupby(). When the column name is stored in a variable rather than written literally, the plain groupby function is the simpler choice.
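Before the full plotting function, here is a minimal sketch of the split-apply-combine idea on toy data, using the plain `groupby`/`combine` functions from DataFrames:

```julia
using DataFrames, Statistics

df = DataFrame(size = ["Small", "Large", "Large", "Small"],
               hp   = [7, 59, 110, 9])

# groupby splits the table into one sub-table per category
gd = groupby(df, :size)

# combine applies a summary to each group and reassembles one table
stats = combine(gd, :hp => mean => :mean_hp)
# Small → 8.0, Large → 84.5
```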
function plot_distributions_by_category(df, categorical_col, numeric_cols;
title="Distributions by Category",
size=(1600, 600),
plot_type=:boxplot)
# Split DataFrame into sub-tables by category
grouped_df = groupby(df, categorical_col)  # plain groupby accepts a Symbol held in a variable
categories = [key[categorical_col] for key in keys(grouped_df)]
# Determine subplot grid layout (max 3 columns)
n_plots = length(numeric_cols)
n_cols = min(3, n_plots)
n_rows = ceil(Int, n_plots / n_cols)
plots_array = []
for (i, col) in enumerate(numeric_cols)
all_values = Float64[]
all_labels = String[]
for category in categories
category_data = grouped_df[(;
Dict(categorical_col => category)...)]
values = category_data[!, col]
clean_values = filter(!ismissing, values)
clean_values = Float64.(clean_values)
if !isempty(clean_values)
append!(all_values, clean_values)
append!(all_labels,
fill(string(category), length(clean_values)))
end
end
if plot_type == :violin
p = violin(all_labels, all_values,
title=string(col), legend=false)
elseif plot_type == :boxplot
p = boxplot(all_labels, all_values,
title=string(col), legend=false)
end
push!(plots_array, p)
end
plot(plots_array...,
layout=(n_rows, n_cols), size=size,
plot_title=title)
end
plot_distributions_by_category(df_monsters,
:size, [:hp, :speed, :strength])
Filtering and Subsetting
Real-world analysis rarely uses the entire dataset at once. You filter to answer specific questions; in our case, "Which Large monsters have surprisingly low HP?"
Are there Large monsters with low HP?
Using the @subset macro from DataFramesMeta, we can combine conditions. The dot
(.) prefix on operators means "apply element-wise"; Julia checks each row individually.
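A standalone illustration of dot broadcasting on plain vectors (toy values):

```julia
hp = [19, 135, 9]

# The dot applies the comparison to every element, yielding a Bool mask
mask = hp .< 20  # [true, false, true]

# Two masks combine element-wise with .&
sizes = ["Large", "Huge", "Large"]
mask2 = (hp .< 20) .& (sizes .== "Large")  # [true, false, true]
```

`@subset` builds exactly these masks behind the scenes and keeps the rows where every condition is `true`.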
using DataFramesMeta
# Filter: Large monsters with fewer than 20 HP
# The dot (.) broadcasts the comparison across all rows
low_hp_large = @subset(df_monsters,
:hp .< 20,
:size .== "Large")
| Name | Type | Alignment | AC | HP | Str | Speed | CR |
|---|---|---|---|---|---|---|---|
| Axe Beak | Beast | Unaligned | 11 | 19 | 14 | 50 | 1/4 |
| Camel | Beast | Unaligned | 9 | 15 | 16 | 50 | 1/8 |
| Constrictor Snake | Beast | Unaligned | 12 | 13 | 15 | 30 | 1/4 |
| Crocodile | Beast | Unaligned | 12 | 19 | 15 | 20 | 1/2 |
| Draft Horse | Beast | Unaligned | 10 | 19 | 18 | 40 | 1/4 |
| Elk | Beast | Unaligned | 10 | 13 | 16 | 50 | 1/4 |
| Giant Goat | Beast | Unaligned | 11 | 19 | 17 | 40 | 1/2 |
| Giant Lizard | Beast | Unaligned | 12 | 19 | 15 | 30 | 1/4 |
| Giant Owl | Beast | Neutral | 12 | 19 | 13 | 5 | 1/4 |
| Giant Sea Horse | Beast | Unaligned | 13 | 16 | 12 | 0 | 1/2 |
| Hippogriff | Monstrosity | Unaligned | 11 | 19 | 17 | 40 | 1 |
| Riding Horse | Beast | Unaligned | 10 | 13 | 16 | 60 | 1/4 |
| Warhorse | Beast | Unaligned | 11 | 19 | 18 | 60 | 1/2 |
Twelve of these 13 results are Beasts; the lone exception is the Hippogriff, a Monstrosity. All have CRs at or below 1. These are large-bodied but fragile creatures, useful for wildlife encounters or travel sequences where you want imposing visuals without deadly stakes.
Filtering by multiple categories
For more complex selections, say you are a DM looking for Small-to-Large Aberrations and Dragons:
# Define allowed values for each filter
monster_sizes = ["Small", "Medium", "Large"]
monster_types = ["Aberration", "Dragon"]
# .∈ broadcasts "is element of" row-wise
# Ref() prevents Julia from iterating over the array itself
monster_options = @subset(df_monsters,
:size .∈ Ref(monster_sizes),
:monster_type .∈ Ref(monster_types));
size(monster_options, 1)
26
This gives 26 candidates. Here are the first six:
| Name | Type | Alignment | AC | HP | Str | Speed | CR |
|---|---|---|---|---|---|---|---|
| Aboleth | Aberration | Lawful Evil | 17 | 135 | 21 | 10 | 10 |
| Black Dragon Wyrmling | Dragon | Chaotic Evil | 17 | 33 | 15 | 30 | 2 |
| Blue Dragon Wyrmling | Dragon | Lawful Evil | 17 | 52 | 17 | 30 | 3 |
| Brass Dragon Wyrmling | Dragon | Chaotic Good | 16 | 16 | 15 | 30 | 1 |
| Bronze Dragon Wyrmling | Dragon | Lawful Good | 17 | 32 | 17 | 30 | 2 |
| Chuul | Aberration | Chaotic Evil | 16 | 93 | 19 | 30 | 4 |
Random selection
When you have a filtered pool and want to pick one at random (for a random encounter table or to break decision paralysis):
# Simulate a dice roll to pick a random monster from the pool
monster_dice_roll = rand(1:size(monster_options, 1))
monster_options[monster_dice_roll, :]
| Name | Type | Alignment | AC | HP | Str | Speed | CR |
|---|---|---|---|---|---|---|---|
| Chuul | Aberration | Chaotic Evil | 16 | 93 | 19 | 30 | 4 |
Random selection is foundational in data science more broadly: train/test splits, bootstrap sampling, and randomized assignment all use this same operation.
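As a sketch of how the same primitive powers a train/test split (toy 10-row table, 80/20 split; `randperm` comes from the `Random` standard library):

```julia
using Random, DataFrames

Random.seed!(42)  # reproducible shuffling
df = DataFrame(x = 1:10)

# Shuffle row indices, then split 80/20
idx = randperm(nrow(df))
cut = floor(Int, 0.8 * nrow(df))
train = df[idx[1:cut], :]
test  = df[idx[cut+1:end], :]
(nrow(train), nrow(test))  # (8, 2)
```

Every row lands in exactly one of the two subsets, which is the property that makes held-out evaluation honest.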
Campaign-specific filtering with regex
Most encounter-building depends on the campaign. If you're running Curse of Strahd, you want
Undead creatures specifically. The @subset macro works with regular expressions via
occursin():
# Case-insensitive regex match on monster_type
rege = r"Undead"i
my_monster_selection = @subset(df_monsters,
occursin.(rege, :monster_type))
This returns 18 Undead monsters, from the Skeleton (CR 1/4) to the Lich (CR 21):
| Name | Type | Alignment | AC | HP | Str | Speed | CR |
|---|---|---|---|---|---|---|---|
| Ghast | Undead | Chaotic Evil | 13 | 36 | 16 | 30 | 2 |
| Ghost | Undead | Any alignment | 11 | 45 | 7 | 0 | 4 |
| Ghoul | Undead | Chaotic Evil | 12 | 22 | 13 | 30 | 1 |
| Lich | Undead | Any Evil | 17 | 135 | 11 | 30 | 21 |
| Minotaur Skeleton | Undead | Lawful Evil | 12 | 67 | 18 | 40 | 2 |
| Mummy | Undead | Lawful Evil | 11 | 58 | 16 | 20 | 3 |
| Mummy Lord | Undead | Lawful Evil | 17 | 97 | 18 | 20 | 15 |
| Ogre Zombie | Undead | Neutral Evil | 8 | 85 | 19 | 30 | 2 |
| Shadow | Undead | Chaotic Evil | 12 | 16 | 6 | 40 | 1/2 |
| Skeleton | Undead | Lawful Evil | 13 | 13 | 10 | 30 | 1/4 |
| Specter | Undead | Chaotic Evil | 12 | 22 | 1 | 0 | 1 |
| Vampire | Undead | Lawful Evil | 16 | 144 | 18 | 30 | 13 |
| Vampire Spawn | Undead | Neutral Evil | 15 | 82 | 16 | 30 | 5 |
| Warhorse Skeleton | Undead | Lawful Evil | 13 | 22 | 18 | 60 | 1/2 |
| Wight | Undead | Neutral Evil | 14 | 45 | 15 | 30 | 3 |
| Will-O'-Wisp | Undead | Chaotic Evil | 19 | 22 | 1 | 0 | 2 |
| Wraith | Undead | Neutral Evil | 13 | 67 | 6 | 0 | 5 |
| Zombie | Undead | Neutral Evil | 8 | 22 | 13 | 20 | 1/4 |
Having this pool organized and queryable lets you design a progression of encounters across a campaign like Curse of Strahd.
Balancing a Fight
When putting together a fight, we want to balance challenge against fun. One way to assess this is to look at the chance our players have of hitting a given monster on each roll. Keeping with the undead theme, let's calculate the distribution of success chances against these monsters to see whether our players are ready to take on the living dead!
A d20 roll produces a uniform distribution over the integers $\{1, 2, \ldots, 20\}$; each outcome has a $\frac{1}{20} = 5\%$ chance. To hit a monster, the player needs:
$$\text{d20 roll} + \text{modifiers} \geq \text{AC}$$

So the probability of hitting is:

$$P(\text{hit}) = \frac{\text{number of rolls that meet or exceed } (\text{AC} - \text{modifiers})}{20}$$

With two special cases from the D&D rules: a natural 20 always hits (minimum 5% chance regardless of AC), and a natural 1 always misses (maximum 95% chance regardless of modifiers).
This is structurally identical to any threshold-detection problem in statistics or engineering.
Calculating hit probabilities across Undead monsters
We proceed as follows:
- For each Undead monster, compute the probability of a hit under four scenarios: no modifier, a +5 modifier, and a two-player comparison (+2 vs. +7).
- Given each monster's AC, determine the chance that a player's d20 roll (plus modifier) meets or exceeds it.
- Plot the results as bar charts to see which monsters each player can realistically defeat!
# Hit probability for a d20 roll against a given AC
function calculate_d20_probability(threshold::Int, modifier::Int)
effective_threshold = threshold - modifier
# D&D rules: nat 20 always hits, nat 1 always misses
if effective_threshold > 20
return 0.05 # Natural 20 always hits
elseif effective_threshold <= 1
return 0.95 # Natural 1 always misses
else
# Count how many d20 faces meet the threshold
favorable_outcomes = 21 - effective_threshold
return favorable_outcomes / 20.0
end
end
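As a quick sanity check on the closed form, we can brute-force the same probability by enumerating all 20 faces (illustrative values: AC 15, +5 modifier):

```julia
# Brute-force check: count d20 faces that hit AC 15 with a +5 modifier,
# honoring the nat-20-always-hits / nat-1-always-misses rules
ac, mod = 15, 5
hits = count(roll -> roll == 20 || (roll != 1 && roll + mod >= ac), 1:20)
hits / 20  # 0.55 — rolls of 10 through 20 succeed (11 of 20 faces)
```

This agrees with `calculate_d20_probability(15, 5)` above.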
# Load data and filter to Undead only
df = CSV.read("cleaned_monsters_basic.csv", DataFrame)
filtered_df = filter(row -> row.monster_type == "Undead", df)
sort!(filtered_df, :ac) # Order by Armor Class for readable plots
# Four scenarios: no modifier, +5, and two players (+2 vs +7)
mod_0 = 0
mod_5 = 5
p1_mod = 2
p2_mod = 7
# Calculate hit probabilities (scalars broadcast over the AC vector automatically)
monster_names = filtered_df.name  # avoid reassigning Base's `names` function
chance_0 = calculate_d20_probability.(Int.(filtered_df.ac), mod_0)
chance_5 = calculate_d20_probability.(Int.(filtered_df.ac), mod_5)
chance_p1 = calculate_d20_probability.(Int.(filtered_df.ac), p1_mod)
chance_p2 = calculate_d20_probability.(Int.(filtered_df.ac), p2_mod)
# --- Three-panel dashboard ---
# Plot 1: No Modifier
p1 = bar(monster_names, chance_0,
title = "Base Success (Modifier: 0)",
color = :lightgrey,
xticks = :all, xrotation = 45,
ylims = (0, 1), legend = false)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash)
# Plot 2: Single Modifier (+5)
p2 = bar(monster_names, chance_5,
title = "Modified Success (Modifier: +5)",
color = :skyblue,
xticks = :all, xrotation = 45,
ylims = (0, 1), legend = false)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash)
# Plot 3: Player Comparison (legend on, so the two series are distinguishable)
p3 = bar(monster_names, [chance_p1 chance_p2],
title = "Player Comparison (+2 vs +7)",
label = ["Player 1 (+2)" "Player 2 (+7)"],
color = [:orange :purple],
fillalpha = 0.5,
ylabel = "Hit Probability",
xticks = :all, xrotation = 45,
ylims = (0, 1.1),
legend = :topleft)
hline!([0.5], color=:red, linewidth=2, linestyle=:dash, label="")
# Combine into a single 3-row figure
plot(p1, p2, p3,
layout = (3, 1),
size = (1200, 1200),
margin = 10Plots.mm)
The bottom panel is where it gets interesting for DMs. Player 1 (+2) struggles against AC 16+ monsters (Vampire, Lich), dropping to roughly a 30–35% hit chance, while Player 2 (+7) stays at or above 50% against everything except the AC 19 Will-O'-Wisp. This asymmetry matters: if you put a Vampire against this party, Player 1 will feel ineffective in direct combat, which might be frustrating, or might be a deliberate design choice that pushes them toward creative problem-solving. The data lets you make that choice intentionally rather than discovering it mid-session.
Summary
This notebook demonstrated a universal data analysis workflow applied to D&D monster statistics:
- Load and inspect your data to understand its structure and available features.
- Explore categorical variables (size, type) to see what categories exist and how they're distributed.
- Visualize distributions to compare numerical features (HP, speed, strength) across groups.
- Filter and subset to answer specific questions relevant to your use case.
- Apply quantitative reasoning, here basic probability, to inform decisions.
The tools and thinking transfer directly to any domain with tabular data: ecological surveys, clinical records, economic indicators, or sensor measurements. The key insight is the same in all of them: systematically exploring your data gives you access to options and patterns that intuition alone would miss. For a Dungeon Master, that means better-balanced encounters, more variety, and more confidence in design choices. For a researcher, it means better experimental design and more robust conclusions.
Appendix: Julia ↔ Python Quick Reference
If you're coming from Python with pandas and matplotlib, Julia will feel familiar. The syntax is clean and readable, but Julia compiles your code, which typically results in faster execution.
| Concept | Python (pandas) | Julia (DataFrames.jl) |
|---|---|---|
| Import library | `import pandas as pd` | `using DataFrames` |
| Read CSV | `df = pd.read_csv(...)` | `df = CSV.read(..., DataFrame)` |
| Access column | `df['column']` | `df[!, :column]` |
| Unique values | `df['col'].unique()` | `unique(df[!, :col])` |
| Row count | `df.shape[0]` | `size(df, 1)` |
| Filter rows | `df[df['hp'] < 20]` | `@subset(df, :hp .< 20)` |
| Group by | `df.groupby('col')` | `@groupby(df, :col)` |
| Random sample | `df.sample(1)` | `df[rand(1:nrow(df)), :]` |
| Bar plot | `df['col'].value_counts().plot(kind='bar')` | `bar(labels, counts)` |
| Boxplot | `sns.boxplot(x=..., y=...)` | `boxplot(labels, values)` |
The ! in df[!, :column] means "give me the actual column, not a copy." Think of
it as Julia being explicit about data access.
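A small demonstration of the difference on a toy table:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

view_col = df[!, :a]  # the actual column vector
copy_col = df[:, :a]  # an independent copy

view_col[1] = 99      # mutates the DataFrame
copy_col[2] = -1      # leaves the DataFrame untouched

df.a  # [99, 2, 3]
```

Use `!` when you want speed (no copying) or deliberate mutation; use `:` when you want a safe, independent copy to experiment on.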