Is The Ratio of Normal Variables Normal?

In Trustworthy Online Controlled Experiments I came across this quote, referring to a ratio metric $M = \frac{X}{Y}$:

Because $X$ and $Y$ are jointly bivariate normal in the limit, $M$, as the ratio of the two averages, is also normally distributed.

That’s only partially true. According to https://en.wikipedia.org/wiki/Ratio_distribution, the ratio of two uncorrelated noncentral normal variables $X = N(\mu_X, \sigma_X^2)$ and $Y = N(\mu_Y, \sigma_Y^2)$ is approximately normal, with mean $\mu_X / \mu_Y$ and variance approximately $\frac{\mu_X^2}{\mu_Y^2}\left( \frac{\sigma_X^2}{\mu_X^2} + \frac{\sigma_Y^2}{\mu_Y^2} \right)$, but only when $Y$ is unlikely to take negative values, say when $\mu_Y > 3 \sigma_Y$.
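For the record, here is a quick sketch of where that approximation comes from: a first-order Taylor (delta-method) expansion of $X/Y$ around $(\mu_X, \mu_Y)$ gives

$$\frac{X}{Y} \approx \frac{\mu_X}{\mu_Y} + \frac{X - \mu_X}{\mu_Y} - \frac{\mu_X}{\mu_Y^2}(Y - \mu_Y),$$

so that, for uncorrelated $X$ and $Y$,

$$\mathrm{Var}\!\left(\frac{X}{Y}\right) \approx \frac{\sigma_X^2}{\mu_Y^2} + \frac{\mu_X^2}{\mu_Y^4}\sigma_Y^2 = \frac{\mu_X^2}{\mu_Y^2}\left( \frac{\sigma_X^2}{\mu_X^2} + \frac{\sigma_Y^2}{\mu_Y^2} \right).$$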

As always, the best way to believe something is to see it yourself. Let’s generate some uncorrelated normal variables far from 0 and their ratio:

ux <- 100   # mean of the numerator
sdx <- 2    # standard deviation of the numerator
uy <- 50    # mean of the denominator, far from 0
sdy <- 0.5  # standard deviation of the denominator

X <- rnorm(1000, mean = ux, sd = sdx)
Y <- rnorm(1000, mean = uy, sd = sdy)
Z <- X / Y

Their ratio looks normal enough:

hist(Z)

Which is confirmed by a q-q plot:

qqnorm(Z)

What about the mean and variance?

mean(Z)
[1] 1.998794
ux / uy
[1] 2
var(Z)
[1] 0.001783404
ux^2 / uy^2 * (sdx^2 / ux^2 + sdy^2 / uy^2)
[1] 0.002

Both the mean and variance are very close to their theoretical values.

But what happens now when the denominator $Y$ has a mean close to 0?

ux <- 100
sdx <- 2
uy <- 10   # now only 5 standard deviations away from 0
sdy <- 2

X <- rnorm(1000, mean = ux, sd = sdx)
Y <- rnorm(1000, mean = uy, sd = sdy)
Z <- X / Y

Hard to call the resulting ratio normally distributed:

hist(Z)

Which is also clear with a q-q plot:

qqnorm(Z)

In other words, ratio metrics whose denominator is far from 0 will generally be close enough to a normal distribution for practical purposes. But when the denominator’s mean is, say, closer than 5 sigmas to 0, that assumption breaks down.

Controlling for covariates is not the same as “slicing”

To detect small effects in experiments you need to reduce the experimental noise as much as possible. You can do that by working with larger sample sizes, but that doesn’t scale well. A far better approach is to control for covariates that are correlated with your response.

I recently gave a talk at our company on the design of online experiments, and someone pointed out that our automated experiment analysis tool implemented “slicing”, that is, running separate analyses on subsets of the data. Wasn’t that the same thing as controlling for covariates?

Controlling for covariates means you include them in your statistical model. Running separate analyses means each of your sub-analyses has a smaller sample size; you may gain in precision because your response will be less variable in each subset, but you lose the benefits that come from using a larger sample size.

Let’s illustrate this with a simulation. Let’s say we wish to measure the impact of some treatment, whose effect is about 10 times smaller than the standard deviation of the error term:

mu <- 10  # some intercept
err <- 1  # standard deviation of the error
treat_effect <- 0.1  # the treatment effect to estimate

Let’s say we have a total of 1000 units in each arm of this two-sample experiment, and that they belong to 4 equal-sized groups labeled A, B, C, and D:

n <- 1000

predictor <- 
  data.frame(
    group = gl(4, n / 4, n * 2, labels = c('A', 'B', 'C', 'D')),
    treat = gl(2, n, labels = c('treat', 'control')))

Let’s simulate the response. For simplicity, let’s say that the group membership has an impact on the response equal to the treatment effect:

group_effect <- treat_effect

response <-
  with(
    predictor,
    mu + as.integer(group) * group_effect +
      (treat == 'treat') * treat_effect +
      rnorm(n * 2, sd = err))

df <- cbind(predictor, response)

summary(df)
 group       treat         response     
 A:500   treat  :1000   Min.   : 6.368  
 B:500   control:1000   1st Qu.: 9.607  
 C:500                  Median :10.335  
 D:500                  Mean   :10.325  
                        3rd Qu.:11.015  
                        Max.   :13.757  

The following plot shows how the response is distributed in each group. This is one of those instances where you need statistical models to detect effects that are hard to see in a plot:

ggplot2::ggplot(df, ggplot2::aes(x = group, y = response)) +
  ggplot2::geom_boxplot()

Fitting the full model yields the following confidence intervals:

mod_full <- lm(response ~ group + treat, df)
confint(mod_full)
                    2.5 %      97.5 %
(Intercept)  10.122391874 10.32334506
groupB        0.009174272  0.26336218
groupC        0.106049411  0.36023732
groupD        0.116794148  0.37098206
treatcontrol -0.193199676 -0.01346168

All coefficients are estimated correctly, and the width of the confidence interval of the treatment effect is about 0.18; the treatment effect is statistically significant. Recall that the error is taken to have a standard deviation of 1, and that $n=1000$ per arm, so we would expect the 95% confidence interval on the treatment effect to have width $2 \times 1.96 \times \sigma \times \sqrt{2/n}$, or about 0.18. We are not very far off.
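A quick sanity check of that back-of-the-envelope figure:

2 * 1.96 * 1 * sqrt(2 / 1000)  # sigma = 1, n = 1000 units per arm
[1] 0.1753077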

What happens now if, instead of controlling for the group, we “sliced” the analysis, i.e. we fit four separate models, one per group? On one hand we will have a smaller error than if we fitted a global model that did not control for the group covariate; on the other hand we will have fewer observations per group, which will hurt our confidence intervals. Let’s check:

confint(lm(response ~ treat, df, subset = group == 'A'))
                  2.5 %      97.5 %
(Intercept)  10.0482082 10.30293094
treatcontrol -0.2967421  0.06349024
confint(lm(response ~ treat, df, subset = group == 'B'))
                  2.5 %      97.5 %
(Intercept)  10.2324904 10.48601591
treatcontrol -0.3688051 -0.01026586
confint(lm(response ~ treat, df, subset = group == 'C'))
                 2.5 %      97.5 %
(Intercept)  10.303706 10.55182387
treatcontrol -0.309723  0.04116836
confint(lm(response ~ treat, df, subset = group == 'D'))
                  2.5 %      97.5 %
(Intercept)  10.4170354 10.66328177
treatcontrol -0.3825046 -0.03425957

The estimates of the treatment effect remain unbiased, but the confidence intervals are now about 0.35 wide, or $2 \times 1.96 \times \sigma \times \sqrt{2/(n/4)}$, which is what you would expect for a sample size four times smaller. That’s twice as wide as when fitting the whole dataset with a model that includes the group covariate. In fact, most of the groups now show results that are not statistically significant.
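And the same check for a single slice of $n/4 = 250$ units per arm:

2 * 1.96 * 1 * sqrt(2 / 250)  # sigma = 1, n/4 = 250 units per arm
[1] 0.3506155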

I’m all for automated experiment analysis tools; but when the goal is to detect small effects, I think there’s currently no substitute for a manual analysis by a trained statistician (which I am not). Increasing sample sizes can only take you so far; remember that confidence intervals shrink only as $1/\sqrt{n}$. It is almost always better to search for a set of covariates correlated with the response and include them in your statistical model. And that’s what controlling for a covariate means.

Feature standardization considered harmful

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described.

The same advice is frequently given for K-means clustering (see Do Clustering algorithms need feature scaling in the pre-processing stage?, Are mean normalization and feature scaling needed for k-means clustering?, and In cluster analysis should I scale (standardize) my data if variables are in the same units?), but there’s a great counter-example given in The Elements of Statistical Learning that I try to reproduce here.

Consider two point clouds ($n=50$ each), randomly drawn around two centers, each 3 units away from the origin:
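A minimal sketch of that setup (the exact centers are my assumption, not necessarily those used in the book):

set.seed(1)
n <- 50
cloud1 <- cbind(x = rnorm(n, mean = -3), y = rnorm(n))  # first cloud, centered at (-3, 0)
cloud2 <- cbind(x = rnorm(n, mean =  3), y = rnorm(n))  # second cloud, centered at (3, 0)
xy <- rbind(cloud1, cloud2)
plot(xy, col = rep(c("red", "blue"), each = n), asp = 1)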

Continue reading

No, you have not controlled for confounders

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”.

This approach is wrong. Very wrong. At least as wrong as that DIY electrical job you did last week: it looks all good and neat but you’ve made a critical mistake and there’s no way you can find out without killing yourself.

Or worse: you carry on thinking you’ve controlled for confounders when you haven’t.

I can’t fully explain why this is wrong (I’m not sure I understand it myself), but I can show you some examples demonstrating that it is. We’ll work through a few of them, comparing the results of a traditional regression with those of a couple of legit causal inference libraries. Since we use simulated data, we’ll also be able to compare with the “true” treatment effect.

Continue reading

A/B testing my resume

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there is opinion- or anecdote-based, with very little scientific backing.

Well, let’s fix that.

Being currently open to work, I thought this would be the right time to test this scientifically. I have two versions of my resume:

The purpose of a resume is to land you an interview, so we’ll track, for each resume, how many applications yield a call for an interview. Non-responses after one week are treated as failures. We’ll model the effectiveness of a resume as a binomial distribution: all other things being equal, we’ll assume all applications using the same resume type have the same probability ($p_1$ or $p_2$) of landing an interview. We’d like to estimate these probabilities, and decide whether one resume is more effective than the other.
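As a sketch of the kind of comparison this sets up, here is what a simple two-proportion test could look like in R; the counts below are made up for illustration only:

# Hypothetical counts: 40 applications per resume version,
# 6 and 11 of which led to an interview call
interviews   <- c(6, 11)
applications <- c(40, 40)
prop.test(interviews, applications)  # compares the two interview rates p1 and p2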

Continue reading

Monty Hall: a programmer’s explanation

I take it we’re all familiar with the infamous Monty Hall problem:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say C, which has a goat. He then says to you, “Do you want to pick door B?” Is it to your advantage to switch your choice?

The correct strategy consists in accepting the offer to switch doors. The expected gain when doing so is 2/3, vs 1/3 when choosing not to change doors (there are only two doors left closed, so one of them must have the car; hence the expected gains sum to 1).

https://xkcd.com/1282/

The mathematical proof is straightforward but most people (including me) have a hard time understanding intuitively what’s going on. There are plenty of explanations available online, but I’ve recently simulated the problem and come across one that I think is simpler and more elegant than anything I’ve seen so far. And I didn’t make it up; the simulation code did. Here’s how.

Classic Monty

Let’s simulate 10,000 rounds of the game. The car vector will hold the door which has the car behind it:

n <- 10000
car <- sample(c('A', 'B', 'C'), size = n, replace = TRUE)

Quick check that the door appears randomly assigned:

table(car)
## car
##    A    B    C 
## 3329 3310 3361

Let’s now simulate the door being opened by Monty. Without loss of generality, let’s assume you choose door A. Then if the car is behind door A, Monty will open door B or C at random; else he will open the door that has a goat.

door <- ifelse(
  car == 'A',
  sample(c('B', 'C'), size = n, replace = TRUE), # chosen at random 
  ifelse(car == 'B', 'C', 'B'))  # otherwise whichever door has a goat

Now the gain, when declining to switch, is given by the fraction of rounds in which the car is behind door A:

sum(car == 'A') / n
## [1] 0.3329

As expected, it is about 1/3.

The gain, when switching doors, is given by the fraction of rounds in which the unopened door has the car:

unopened <- ifelse(door == 'B', 'C', 'B')
sum(car == unopened) / n
## [1] 0.6671

But the car is either behind the unopened door or behind A; you win if the car is not behind A. So the last expression is equivalent to

sum(car != 'A') / n
## [1] 0.6671

And this gives exactly the same result. The probability of winning the car when declining to switch is the probability that the car is behind door A. The probability of winning when switching is just the probability that the car is not behind door A, i.e., 2/3. And that’s the simplest, most elegant explanation of the paradox I’ve ever found.

But there’s more. Notice that in our final expression, we had no need to know which door was opened. And this has interesting consequences for a variation of the game which I’ll call the deterministic Monty Hall.

Deterministic Monty

In the classic version of the game, Monty will select a door at random between B and C if the car happens to be behind A, the door chosen by the player. Let’s now suppose instead that when the car is behind A, Monty always chooses to open door B. Will that information change our strategy?

On one hand, we would feel that when Monty opens door B, the car will more likely be behind door A, and perhaps we should instead choose to stay. Well, let’s simulate this.

door_deterministic <- ifelse(
  car == 'A', 'B', # not chosen at random anymore
  ifelse(car == 'B', 'C', 'B'))

The expressions for the expected gains are very similar. If choosing to stay, the expected gain remains

sum(car == 'A') / n
## [1] 0.3329

While when choosing to switch, the gain becomes

unopened_deterministic <- ifelse(door_deterministic == 'B', 'C', 'B')
sum(car == unopened_deterministic) / n
## [1] 0.6671

But this last expression is again simply equal to

sum(car != 'A') / n
## [1] 0.6671

because no matter which strategy Monty used, the car still has to be behind A or the unopened door.

Now this was completely counterintuitive to me: surely, knowing Monty’s strategy should give you an edge? But let’s consider separately the two possibilities:

Monty opens B

Unlike in the classic game, this now happens whenever the car is behind A or C (2/3 of the cases), and among those cases the car is behind A exactly half the time. So you know the car is equally likely to be behind A or C, and switching doors will neither help nor hurt you. This scenario therefore contributes $2/3 \times 1/2 = 1/3$ to the total expected gain.

Monty opens C

This can only happen when the car is behind B, so you certainly benefit from switching doors. But this happens only 1/3 of the time, and this scenario contributes $1/3 \times 1 = 1/3$ to the total expected gain.

So in this variant of the game, your total expected gain remains exactly 2/3.
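We can double-check this case analysis against the simulation above, reusing the car and door_deterministic vectors (the exact figures will vary slightly from run to run):

mean(door_deterministic == 'B')              # ~ 2/3: Monty opens B whenever the car is behind A or C
mean(car[door_deterministic == 'B'] == 'C')  # ~ 1/2: switching after seeing B open is a coin flip
mean(door_deterministic == 'C')              # ~ 1/3: Monty opens C only when the car is behind B
mean(car[door_deterministic == 'C'] == 'B')  # = 1: switching after seeing C open always wins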

“Computational” proof

I’m not sure this would be accepted as a rigorous proof, but when I wrote the expression for the expected gain I always ended up with sum(car != 'A') / n, no matter what Monty’s strategy was. Or in plain English:

The expected gain when switching doors is the probability that the car was not behind A in the first place.

This, of course, is:

  • obvious
  • independent of Monty’s strategy
  • always equal to 2/3
  • possibly the most succinct explanation of Monty’s paradox I’ve come across

But it never hurts to double-check what the maths tell us.

Mathematical proof

Let’s work out the probability that the car should be behind door B when Monty opens door C.

$\DeclareMathOperator{\Open}{Open}$

\begin{align*}
\Pr(B \mid \Open(C)) &= \frac{\Pr(\Open(C) \mid B) \times \Pr(B)}{\Pr(\Open(C))} && \text{by Bayes} \\
&= \frac{1 \times \Pr(B)}{\Pr(\Open(C))} && \text{because $B \implies \Open(C)$} \\
&= \frac{1/3}{\Pr(\Open(C))} && \text{because each door is equally probable} \\
&= \frac{1}{3} \frac{1}{\Pr(\Open(C) \mid A) \times \Pr(A) + \Pr(\Open(C) \mid B) \times \Pr(B)} && \text{because $A$ or $B$ must be true when $C$ is open} \\
&= \frac{1}{\Pr(\Open(C) \mid A) + \Pr(\Open(C) \mid B)} && \text{because $\Pr(A) = \Pr(B) = 1/3$}, \\
\end{align*}

and similarly for $\Pr(C \mid \Open(B))$. This result does not depend on Monty’s strategy, but the final figures do:

\begin{align*}
\text{Random choice:} && \Pr(B \mid \Open(C)) &= 2/3 \\
&& \Pr(C \mid \Open(B)) &= 2/3 \\
\text{Deterministic:} && \Pr(B \mid \Open(C)) &= 1 \\
&& \Pr(C \mid \Open(B)) &= 1/2,
\end{align*}

but the expected gain when switching doors (without knowing Monty’s strategy) depends on $\Pr(\Open(B))$ and $\Pr(\Open(C))$. When Monty’s choice is random, $\Pr(\Open(B)) = \Pr(\Open(C)) = 1/2$, but when his choice is deterministic, we have $\Pr(\Open(B)) = 2/3$ and $\Pr(\Open(C)) = 1/3$. Either way, you can easily check that the expected gain is 2/3.
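Spelling out the check, the expected gain when switching is $\Pr(B \mid \Open(C)) \Pr(\Open(C)) + \Pr(C \mid \Open(B)) \Pr(\Open(B))$, which gives in both cases:

\begin{align*}
\text{Random choice:} && \tfrac{2}{3} \times \tfrac{1}{2} + \tfrac{2}{3} \times \tfrac{1}{2} &= \tfrac{2}{3} \\
\text{Deterministic:} && 1 \times \tfrac{1}{3} + \tfrac{1}{2} \times \tfrac{2}{3} &= \tfrac{2}{3}.
\end{align*}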

Still. What the computer program taught me was a much simpler way. All you had to do was observe that

$$\Pr(B \mid \Open(C)) = \Pr(\neg A \mid \Open(C)) = \Pr(\neg A) = 2/3,$$

since $A$ and $\Open(C)$ are independent in the random variant. In the deterministic case we’re not really interested in $\Pr(B \mid \Open(C))$ but in the probability that the car is not behind $A$, when all we know is that some door will be opened. But as the computer program tells us, these are independent, so again the probability of winning is simply $\Pr(\neg A) = 2/3.$

Machine Learning in R: Start with an End-to-End Test

As a data scientist, you will likely be asked one day to automate your analysis and port your models to production environments. When that happens you cross the blurry line between data science and software engineering, and become a machine learning engineer. I’d like to share a few tips we’re exploring at Expedia on how to make that transition as successful as possible.

Let’s first discuss testing, and let’s assume without loss of generality that you develop your machine learning application in R. Just like any other software system, your application needs to be thoroughly tested before being deployed. But how do you ensure that your application will perform as expected when dealing with real data? If you’ve structured your application as an R package you already know that you can write unit tests that will run automatically when you check your package; but how do you test the application as a whole?

Testing an application as a whole locally, known as integration testing, should normally not be part of your suite of unit tests: integration tests are frequently much slower than unit tests, and would break your regular build-test workflow. (Besides, if you forgive some pedantry, an integration test is not a unit test by definition, since it tests the whole system.)

Nevertheless, integration tests serve a useful purpose in your workflow. They’re often the last test you’ll run locally before releasing your application to more ‘serious’ testing. But that doesn’t mean you should write them last; in this article I’m going to argue that writing them first, before even writing a line of code, tends to produce a better, clearer, and more testable design.

Let’s illustrate with an example. You work for a travel agency whose CEO would like a system that predicts the survival probability of its customers, should their cruise ship hit an iceberg. You are told that the agency has extensive data on shipwreck survivors going back to 1912, when the Titanic sank, and that your application should apply machine learning to a dataset with the same structure as the famous Titanic survivor dataset, although you cannot access the full set directly.

You are relatively free to choose the output format of your application, because it will be picked up downstream by other teams. You decide to start with a CSV file for now.

Well, where to begin? From an earlier analysis (ideally documented in a vignette, a notebook, or another report) you know that a random forest is the best model for this problem. So it would make sense to verify whether your application indeed trains a random forest from the training data. How would we know that? The simplest way is to save the trained model along with the other output, and check that it is indeed a random forest, trained with the right number of training samples. And this is definitely something we know how to test.

I’m going to assume you have the basic structure of an R package in place, such as the one created by default by RStudio when you start a new project. We’ll call it app:

app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
└── man
    └── hello.Rd

Besides the testthat package, which has become the standard library for writing unit tests in R, I recommend using testthis, which facilitates creating and running integration tests:

usethis::use_testthat()
testthis::use_integration_tests()

This sets up the basic structure for running unit tests and other integration tests:

app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
├── man
│   └── hello.Rd
└── tests
    ├── testthat
    │   └── integration_tests
    └── testthat.R

Any test you create under tests/testthat/integration_tests will be run when you call testthis::test_integration(), but not when running your regular unit tests.

I like to begin with a failing test to make sure that everything is set up correctly. Let’s go ahead and write our end-to-end test, calling it tests/testthat/integration_tests/test-test_end-to-end.R:

# test-test_end-to-end.R
test_that("this test fails", {
  expect_true(FALSE)
})

As expected, we can now run the “integration” test and see it fail:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:2: failure: this test fails
FALSE isn't true.
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0

Good. Let’s now write test code as if a model had been serialized, and check its number of training samples.

# test-test_end-to-end.R
test_that("a model is output with the right number of training samples", {
  app::main()
})

Hold it right there. This test will, of course, not even run because the main() function doesn’t exist. Let’s create R/main.R with the following contents before we continue:

# main.R
main <- function() {}

The integration test now runs, so we can complete it. But first we will need some training data, which will be the Titanic survival dataset. We don’t want main() to read the data from a file, but from an interface that is as similar as possible to the actual production environment.

We’ve been told that the application will read from a PostgreSQL relational database. We could certainly have our test code set up a local PostgreSQL instance, and load it with the training dataset; this will sometimes be the right strategy, but in this case I think it is overkill. Instead, we will exploit the fact that most database systems can be abstracted away through ODBC drivers. R code that talks to PostgreSQL through its ODBC driver should work just as well against any other database system. In particular, we’ll use the in-memory SQLite database for integration testing.

So our integration test code will look like this, in pseudo-code:

# test-test_end-to-end.R
test_that("a model is output with the right number of training samples", {
  <load titanic training set>
  <copy it to SQLite>
  app::main()
  <load serialized model>
  <verify number of training samples>
})

Well, let’s do it. Create the tests/testthat/testdata-raw folder (or just call testthis::use_testdata_raw()), and copy there the train.csv dataset downloaded from https://www.kaggle.com/c/titanic/data, renamed to titanic.csv.

We’ll proceed with turning our pseudo-code into real code, refactoring as we go along:

# test-test_end-to-end.R
test_that("a model is output with the right number of training samples", {
  # Load Titanic training set
  titanic <- readr::read_csv("../testdata-raw/titanic.csv")
  # Copy it to SQLite
  con <- odbc::dbConnect(
    odbc::odbc(),
    driver = "SQLite",
    database = "file::memory:?cache=shared"
  )
  dplyr::copy_to(con, titanic, temporary = FALSE)
  # Call main()
  app::main()
  # Read serialized model
  model <- readRDS("output/model.rds")
})

Notice the file::memory:?cache=shared URI used to instantiate the in-memory SQLite database. If you simply pass the usual :memory: filename you won’t be able to connect to it from more than one place in your program; but we want the test code to populate the same database that will be accessed by the main application, so we must pass this special file::memory:?cache=shared URI.

Notice also the temporary = FALSE argument to dplyr::copy_to(). This is required for the table to be accessible from another connection than the current one.
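If you want to convince yourself that the shared in-memory database really is visible from a second connection, a quick check along these lines should work (just a sanity check, not part of the test):

con2 <- odbc::dbConnect(
  odbc::odbc(),
  driver = "SQLite",
  database = "file::memory:?cache=shared"
)
DBI::dbListTables(con2)  # should list the "titanic" table copied above
DBI::dbDisconnect(con2)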

Calling testthis::test_integration() should now fail:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1 1   | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:14: warning: a model is output with the right number of training samples
cannot open compressed file 'output/model.rds', probable reason 'No such file or directory'

test-test_end-to-end.R:14: error: a model is output with the right number of training samples
cannot open the connection
1: readRDS("output/model.rds") at /Users/dlindelof/Work/app/tests/testthat/integration_tests/test-test_end-to-end.R:14
2: gzfile(file, "rb")
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 1
Skipped:  0

It fails indeed, so we stop writing test code at this point and fix it. It fails because no model is serialized yet; we don’t want to overengineer at this point, so we fix it by instantiating a fake model and serializing it:

# main.R
main <- function() {
  model <- lm(~ 0)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

The tests pass now, so we keep on going:

# test-test_end-to-end.R
test_that("a model is output with the right number of training samples", {
  # Load Titanic training set
  titanic <- readr::read_csv("../testdata-raw/titanic.csv")
  # Copy it to SQLite
  con <- odbc::dbConnect(
    odbc::odbc(),
    driver = "SQLite",
    database = "file::memory:?cache=shared"
  )
  dplyr::copy_to(con, titanic, temporary = FALSE)
  # Call main()
  main()
  # Read serialized model
  model <- readRDS("output/model.rds")
  # Verify number of training samples
  expect_equal(length(model$fitted.values), nrow(titanic)) # (1)
})
# (1) check if the model has been trained with the right number of samples

There are several issues with that test, but it is complete (for now) and, as expected, fails:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
[1] 0 - 891 == -891
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0

To pass the test, the main() function needs to open an ODBC connection to SQLite, read the training dataset, train (some) model with it, and serialize the model. But we run into a problem: how does our application know where to find the database it should connect to? And how does it know which driver to load? A reasonable solution is to use the config package with a configuration file. When we run integration tests we’ll use the test settings, and when we run in production we’ll use the default settings.

So here’s a first attempt at config.yml:

# config.yml
test:
  driver: "SQLite"
  database: "file::memory:?cache=shared"

default:
  driver: "PostgreSQL"
  database: "TBD"

We should now be able to read the titanic dataset from main():

# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

Running the tests, we see that the dataset gets read without errors by main(), but the number of training samples in the model remains 0:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1     | test_end-to-end [0.1 s]
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
[1] 0 - 891 == -891
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
Duration: 0.1 s

OK:       0
Failed:   1
Warnings: 0
Skipped:  0

That is simply because we have not specified any target variable in our call to lm(). Let’s fix that:

# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(Survived ~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

And the tests now pass:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✔ |   1       | test_end-to-end

══ Results ════════════════════════════════════════
OK:       1
Failed:   0
Warnings: 0
Skipped:  0

Let’s now recap what we have achieved. We have an integration test case that verifies that a linear model (or any model, really, provided it exposes a fitted.values attribute) gets trained by our application, using a dataset read over an ODBC connection we provide. But more importantly, we have written an automated test that exercises the whole application, and it will keep doing so as we implement the missing pieces.

It doesn’t feel like we have accomplished much; after all, our model is certainly not the best one for the classification task we have at hand, and we don’t even train it in a sensible manner. But consider this: before you began the project, chances are the customer (or the boss) didn’t really know what they wanted. Now you are in a position to hold the following conversation:

you: We have decided that the first user story should prove that the application, when run, trains a model on the Titanic dataset. This is what this first integration test demonstrates.
boss: How good is that model?
you: Its accuracy is 62%, if that’s what you mean.
boss: Really? That’s pretty impressive for a first shot. How did you achieve that?
you: It predicts that everybody dies. And 62% of passengers in the training dataset did indeed die.
boss: (shifts uncomfortably) hm I see well I guess we should think of alternative metrics. We’d rather avoid false positives, even at the cost of some accuracy. Can you do something about it?

So by getting a woefully incomplete, but working, application out so soon, you’ve been able to help your stakeholders understand what really matters to them. Better yet, you are even in a position to write the next item on your backlog as a precise user story:

As a customer, I want to know if I'm going to die on this cruise
but a death prediction had better be right 90% of the time

Time to write a new integration test and our first unit tests.

Where to define S4 generics

You need to declare generic functions in S4 before you can define methods for them. If no definition exists you will see the following error:

> setClass("track", slots = c(x="numeric", y = "numeric"))
> setMethod("foo", signature(x = "track"))
Error in setMethod("foo", signature(x = "track"), definition = function(x) cat(x)) : 
  no existing definition for function ‘foo’

Generic functions are declared with the setGeneric() function, which must precede the call to setMethod():

> setClass("track", slots = c(x="numeric", y = "numeric"))
> setGeneric("foo", def = str)
[1] "foo"
> setMethod("foo", signature(x = "track"), definition = function(x) cat(x))

But when you develop an R package you may have several classes that define their own methods for the same generic function. So where to put the definition of the generic function? One solution is to put all your classes in the same source file, and have the call to setGeneric() precede all calls to setMethod(). This will work but maintainability dictates that each class should belong to its own source file, named R/<class name>.R.

Instead, you might consider placing all calls to setGeneric() in a source file guaranteed to be loaded before all other files. For example, you might call that file __generics.R, which will be loaded first because when R loads a package it reads the source files in alphabetical order. This will work too but there’s a more elegant way.

R will read the source files in alphabetical order unless a Collate field in the DESCRIPTION file says otherwise. That field lets you specify the order in which you want your source files to be loaded. If present, it must list all source files.
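For example (the file names here are hypothetical), the field might look like this:

Collate:
    'generics.R'
    'MyClass.R'
    'OtherClass.R'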

Maintaining such a field for more than about 5 source files can quickly become tedious. Fortunately, the roxygen2 package has a little-known feature that will generate the Collate field for you. If you use the @include tag anywhere in a file, roxygen will generate a Collate field covering all source files, ordered so that included files are loaded before the files that include them.

For example, if you have a file R/generics.R with the following declaration:

setGeneric("foo", ...)

and your MyClass needs to define that generic, use the following:

#' @include generics.R

MyClass <- setClass(...)

setMethod("foo", ...)

When you now run roxygen2 (and don’t forget this step, it won’t be done automatically for you) it will generate a Collate field with all your R code files, correctly sorted.

Connecting to SQL Server from R on a Mac with a Windows domain user

Connecting to an SQL Server instance as a Windows domain user is relatively straightforward when you run R on Windows, you have the right ODBC driver installed, and your network is set up properly. You normally don’t need to supply credentials, because the ODBC driver uses the built-in Windows authentication scheme. Assuming your odbcinst.ini file includes an entry for SQLServer, you typically just need the following:

con <- odbc::dbConnect(
  odbc::odbc(),                 
  Driver = "SQLServer",
  Server = "mysqlhost",
  Database = "mydbname",
  Port = 1433
)

But if you want to connect to SQL Server from a Mac, things are less simple.

Don’t bother installing the ODBC driver supplied by Microsoft; it just doesn’t work with a Windows domain user. No matter what I tried, I always got the following error message: Error: nanodbc/nanodbc.cpp:950: 28000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Login failed for user 'dlindelof'. I’ve tried setting the user id to domain\\username, I’ve tried passing an extra parameter DOMAIN, all to no avail.

As far as I could determine, it simply is not possible to connect to SQL Server with a domain user using the ODBC driver supplied by Microsoft. Instead, obtain the open-source FreeTDS driver. If you use Homebrew, this is done with brew install freetds. Once installed, you should find its library at /usr/local/lib/libtdsodbc.so. Edit your /usr/local/etc/odbcinst.ini file accordingly:

freetds                              = Installed
# ...
[freetds]
Driver = /usr/local/lib/libtdsodbc.so

You can, but don’t need to, also edit /usr/local/odbc.ini and /usr/local/etc/freetds.conf if you want to specify human-friendly aliases to specific database connections. I never used that.

You can now create a database source in R using the usual connection parameters, with the important gotcha that (unless you edit freetds.conf) you must specify the port number. The username must be prefixed with the domain and a backslash, written as a double backslash because backslashes must be escaped in R strings. Putting it all together, my connection call looks like this:

con <- odbc::dbConnect(
   odbc::odbc(),                 
   Driver = "freetds",
   Server = "mysqlhost",
   Database = "mydbname",
   uid = "domainname\\username",
   pwd = "somepassword",
   Port = 1433
 )

The opinionated estimator

You have been lied to. By me.

Killian is lying to you

I once taught a programming class and introduced my students to the notion of an unbiased estimator of the variance of a population. The problem can be stated as follows: given a set of observations $(x_1, x_2, \ldots, x_n)$, what can you say about the variance of the population from which this sample is drawn?

Classical textbooks, MathWorld, the Khan Academy, and Wikipedia all give you the formula for the so-called unbiased estimator of the population variance:

$$\hat{s}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

where $\bar{x}$ is the sample mean. The expected error of this estimator is zero:

$$E[\hat{s}^2 - \sigma^2] = 0$$

where $\sigma^2$ is the “true” population variance. Put another way, the expected value of this estimator is exactly the population variance:

$$E[\hat{s}^2] = \sigma^2$$

So far so good. The expected error is zero, therefore it’s the best estimator, right? This is what orthodox statistics (and teachers like me who don’t know better) will have you believe.

But Jaynes (Probability Theory) points out that in practical problems one does not care about the expected error of the variance estimate (or of any estimator, for that matter). What matters is how accurate the estimator is, i.e. how close it is to the true variance. And this calls for an estimator that minimises the expected squared error $E[(\hat{s}^2 - \sigma^2)^2]$. But we can also write this expected squared error as:

$$E[(\hat{s}^2 - \sigma^2)^2] = (E[\hat{s}^2] - \sigma^2)^2 + \mathrm{Var}(\hat{s}^2)$$

The expected squared error of our estimator is thus the sum of two terms: the square of the expected error, and the variance of the estimator. When following the cookbooks of orthodox statistics, only the first term is minimised and there is no guarantee that the total error is minimised.
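For completeness, this decomposition follows by adding and subtracting $E[\hat{s}^2]$ inside the square; the cross term vanishes because $E[\hat{s}^2 - E[\hat{s}^2]] = 0$:

\begin{align*}
E[(\hat{s}^2 - \sigma^2)^2] &= E[(\hat{s}^2 - E[\hat{s}^2] + E[\hat{s}^2] - \sigma^2)^2] \\
&= \mathrm{Var}(\hat{s}^2) + (E[\hat{s}^2] - \sigma^2)^2.
\end{align*}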

For samples drawn from a Gaussian distribution, Jaynes shows that an estimator that minimises the total (squared) error is

$$\hat{s}^2 = \frac{1}{n+1}\sum_{i=1}^n (x_i - \bar{x})^2$$

Notice that the $n-1$ denominator has been replaced with $n+1$. In a fit of fancifulness I’ll call this an opinionated estimator. Let’s test how well this estimator performs.

First we generate 1000 random sets of 10 samples with mean 0 and variance 25:

samples <- matrix(rnorm(10000, sd = 5), ncol = 10)

For each group of 10 samples, we estimate the population variance first with the canonical $n-1$ denominator. This is what R’s built-in var function will do, according to its documentation:

unbiased <- apply(samples, MARGIN = 1, var)

Next we estimate the population variance with the $n+1$ denominator. We take a little shortcut here by multiplying the unbiased estimator by $(n-1)/(n+1)$, but it makes no difference:

opinionated <- apply(samples, MARGIN = 1, function(x) var(x) * (length(x) - 1) / (length(x) + 1))

Finally we combine everything in one convenient data frame:

estimators <- rbind(data.frame(estimator = "Unbiased", estimate = unbiased),
                    data.frame(estimator = "Opinionated", estimate = opinionated))

library(lattice)  # histogram() and the panel functions come from lattice

histogram(~ estimate | estimator, estimators,
          panel = function(...) {
            panel.histogram(...)
            panel.abline(v = 25, col = "red", lwd = 5)
          })

Unbiased vs Opinionated

It’s a bit hard to tell visually which one is “better”. But let’s compute the average squared error for each estimator:

aggregate(estimate ~ estimator, estimators, function(x) mean((x - 25)^2))

##     estimator estimate
## 1    Unbiased 145.1007
## 2 Opinionated 115.5074

This shows clearly that the $n+1$ denominator yields a smaller total (squared) error than the so-called unbiased $n-1$ estimator, at least for samples drawn from a Gaussian distribution.
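These figures are also consistent with theory: for Gaussian samples the mean squared error of the unbiased estimator is $2\sigma^4/(n-1)$, while the $n+1$ variant achieves $2\sigma^4/(n+1)$ (standard results, quoted here rather than derived):

sigma2 <- 25  # true population variance
n <- 10       # samples per set
2 * sigma2^2 / (n - 1)  # theoretical MSE of the unbiased estimator
## [1] 138.8889
2 * sigma2^2 / (n + 1)  # theoretical MSE of the opinionated estimator
## [1] 113.6364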

So do your brain a favour and question everything I tell you. Including this post.