Biblical kings and boxplots

When you read through the biblical books of Kings, you may have been struck by a phrase that repeats itself for every monarch:

In the Xth year of (king of kingdom B), (name of king) became king of (kingdom A). He reigned N years, and did (evil|good) in the sight of the Lord.

If you’ve read through these books several times, you will probably have noticed that the shorter reigns tend to belong to kings deemed to have done evil, with a record-breaking 3 months for Jehoahaz and Jehoiachin. Let’s see if there’s any relationship between reign duration and “goodness” of the king. First we prepare the data in a form suitable for analysis:

kings.data <- c("Name Deeds Reign
                David Good 40
                Solomon Good 40
                Rehoboam Evil 17
                Abijah Evil 3
                Asa Good 41
                Jehoshaphat Good 25
                Jehoram Evil 8
                Ahaziah Evil 1
                Joash Good 40
                Amaziah Good 29
                Azariah Good 52
                Jotham Good 16
                Ahaz Evil 16
                Hezekiah Good 29
                Manasseh Evil 55
                Amon Evil 2
                Josiah Good 31
                Jehoahaz Evil 0.25
                Jehoiakim Evil 11
                Jehoiachin Evil 0.25
                Zedekiah Evil 11")

kings <- read.table(textConnection(kings.data),
                    header = TRUE,
                    row.names = 1,
                    stringsAsFactors = TRUE)  # since R 4.0, strings are no longer factors by default

The kings data frame holds one row for each king. The row names are the names of the kings; the Deeds column is a two-level factor indicating whether the king’s reign was deemed good or evil. (We assume here that Solomon was a good guy in spite of what happened towards the end of his life.) The Reign column records the length of the reign in years, as given in the Bible, with fractional values for reigns shorter than one year.

We have 11 evil kings and 10 good kings:

> table(kings$Deeds)

Evil Good 
  11   10 

Here we compute the median reign duration depending on the rating of the deeds:

> with(kings, tapply(Reign, Deeds, median))
Evil Good 
 8.0 35.5 

There’s already a good indication that the length of the reign depends on the deeds. We can now plot the length of the reigns:

library(car)
Boxplot(Reign ~ Deeds,
        kings)

Duration of biblical reigns, depending on whether the Bible judges the king to have been a just or an evil one
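If the car package is not installed, base graphics produces a similar plot (without car’s automatic labelling of outliers); a minimal sketch, assuming the kings data frame built above:

```r
# Base-graphics alternative to car::Boxplot; outliers such as Manasseh
# are not labelled automatically here.
boxplot(Reign ~ Deeds, data = kings,
        ylab = "Reign (years)")
```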

This plot confirms our impression: “evil” kings tend to have shorter reigns than “good” kings, with the obvious exception of Manasseh, the same one of whom it was said

I’ve arranged for four kinds of punishment: death in battle, the
corpses dropped off by killer dogs, the rest picked clean by vultures,
the bones gnawed by hyenas. They’ll be a sight to see, a sight to
shock the whole world—and all because of Manasseh son of Hezekiah and
all he did in Jerusalem. (Jer 15:3-4)

So what does this all prove? Probably nothing. Plotting data is its own reward. 🙂

Book review: Advanced R

I would like to call this the best second book on R, except that I wouldn’t know what the first one would be. I learned R from classes and tutorials about 10 years ago, used it on my PhD and four articles, and use it today on a daily basis at work; yet only now, after reading this book, do I feel like I could possibly be called an R programmer rather than just a user.

The book deals with a variety of topics that are seldom discussed in the R tutorials you are likely to find freely available. Some are perhaps unnecessary in a book like this (Ch. 5 Style Guide), some could easily deserve an entire book (Ch. 7 OO field guide), but the chapters on Functions, Environments, the three chapters in Part II (Functional Programming) and the chapter on Non-standard evaluation are easily reasons enough to buy this book.

How many times, indeed, have you spent hours, frustrated, trying to write a function that would act as a wrapper around, say, lattice plotting functions that use non-standard evaluation? Or tried to call subset() from another function, only to see cryptic error messages? Now, for the first time, the solution is not only clear to me; I feel like I could also explain to a colleague why things work the way they do.
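To make the pitfall concrete, here is a minimal sketch (the function and variable names are my own) of what happens when subset() is called from inside another function:

```r
# subset() uses non-standard evaluation: it looks up names in the data
# frame and then in the caller's environment, so passing the condition
# through a function argument fails.
df <- data.frame(x = 1:5)

filter_df <- function(data, cond) {
  subset(data, cond)
}

subset(df, x > 3)          # works: rows where x > 3
try(filter_df(df, x > 3))  # fails: object 'x' not found
```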

R is incredibly powerful and dynamic and will, most of the time, do just what you expect it to do. But if you want to really understand what is going on under the hood, or if you want to write your own packages (whether for self-, internal-, or widespread use), you owe it to yourself to read this book.

Bayesian tanks

The frequentist vs bayesian debate has plagued the scientific community for almost a century now, yet most of the arguments I’ve seen seem to involve philosophical considerations instead of hard data.

Instead of letting the sun explode, I propose a simpler experiment to assess the performance of each approach.

The problem reads as follows (taken from Jaynes’s Probability Theory):

You are traveling on a night train; on awakening from sleep, you notice that the train is stopped at some unknown town, and all you can see is a taxicab with the number 27 on it. What is then your guess as to the number N of taxicabs in the town, which would in turn give a clue as to the size of the town?

In a different setting, this problem is also known as the German tank problem, where again the goal is to estimate the total size of a population from the serial numbers observed on a small sample of that population.

The frequentist and bayesian approaches yield completely different estimates for the number N. To see which approach gives the most satisfactory estimates, let’s generate a large number of such problems and look at the error distributions that arise from each approach.

n.runs <- 10000
N.max <- 1000
N <- sample.int(n = N.max, size = n.runs, replace = TRUE)
m <- sapply(N, sample, size = 1)

We run this experiment n.runs times. Each time we select a random population size N uniformly drawn between 1 and N.max. We draw a random number m between 1 and N, representing the serial number that is observed.

The frequentist approach gives the following formula for estimating $N$: $\hat{N} = 2m-1$. Intuitively, for $m$ drawn uniformly from $\{1, \dots, N\}$, the expected value of $m$ is $(N+1)/2$; the estimator $2m-1$ therefore has expected value $N$, which is where the ${}-1$ comes from: it makes the estimator unbiased.
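We can check this unbiasedness claim empirically (the exact value depends on the random seed, but it should land very close to the true population size):

```r
# For m uniform on 1..N, E[m] = (N + 1)/2, so E[2m - 1] = N:
# the -1 makes the estimator unbiased.
set.seed(1)
N0 <- 500
m0 <- sample.int(N0, size = 100000, replace = TRUE)
mean(2 * m0 - 1)   # close to 500
```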

The bayesian approach is more complex and requires one to provide an estimate for the maximum possible number of taxicabs. Let’s therefore assume that we know that the total number of cabs will not be larger than 1000. (The frequentist approach cannot use this information, but to make a fair comparison we will cap the frequentist estimate at 1000 if it is larger.)

Then the bayesian estimate is given by $\hat{N} = (N_\textrm{max} + 1 - m) / \log(N_\textrm{max} / (m - 1))$.
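Applied to the original taxicab problem (m = 27, and taking the same assumed upper bound of 1000 cabs), the two point estimates differ dramatically:

```r
# Point estimates for the taxicab problem with m = 27 and N.max = 1000.
m0 <- 27
N.max <- 1000
2 * m0 - 1                                  # frequentist: 53
(N.max + 1 - m0) / log(N.max / (m0 - 1))    # bayesian: about 267
```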

Putting it all together, we create a data frame with the errors found for both approaches:

frequentist <- pmin(m * 2 - 1, N.max) - N
bayesian <- (N.max + 1 - m) / log(N.max / (m - 1)) - N
errors <- rbind(data.frame(approach = "FREQ",
                           errors = frequentist),
                data.frame(approach = "BAYES",
                           errors = bayesian))

The mean square error for each approach is then given by:

> by(errors$errors^2, errors$approach, mean)
errors$approach: FREQ
[1] 73436.81
------------------------------------------------------------------------------- 
errors$approach: BAYES
[1] 44878.61
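The ratio of the two mean squared errors can be computed directly (the values vary slightly between simulation runs unless a seed is set):

```r
# Ratio of bayesian to frequentist mean squared error;
# about 0.6 in the run shown above.
mse <- by(errors$errors^2, errors$approach, mean)
mse[["BAYES"]] / mse[["FREQ"]]
```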

The bayesian approach yields a mean squared error about 40% lower than the frequentist approach. The errors can also be plotted:

library(lattice)
histogram(~ errors | approach, errors)

Taxicabs errors

Both error distributions are skewed towards negative values, meaning that both approaches tend to underestimate $N$. However, the bayesian errors have a tighter distribution around 0 than the frequentist ones.

The bottom line is that, given exactly the same information, the bayesian approach yields estimates whose squared error is about 60% that of the frequentist approach. For this particular problem, there is no question that the bayesian approach is the correct one.