Feature standardization considered harmful

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described.
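
Concretely, standardization here just means z-scoring each feature. A minimal sketch (with made-up numbers) of what that looks like:

import numpy as np

X = np.array([[180.0, 75_000.0],   # e.g. height in cm, salary in CHF (made-up rows)
              [165.0, 52_000.0],
              [172.0, 61_000.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # each column now has mean 0 and sd 1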

The same advice is frequently given for K-means clustering (see Do Clustering algorithms need feature scaling in the pre-processing stage?, Are mean normalization and feature scaling needed for k-means clustering?, and In cluster analysis should I scale (standardize) my data if variables are in the same units?), but there’s a great counter-example given in The Elements of Statistical Learning that I try to reproduce here.

Consider two point clouds ($n=50$ each), randomly drawn around two centers, each 3 units away from the origin:

Continue reading

No, you have not controlled for confounders

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”.

This approach is wrong. Very wrong. At least as wrong as that DIY electrical job you did last week: it looks all good and neat but you’ve made a critical mistake and there’s no way you can find out without killing yourself.

Or worse: thinking you’ve controlled for confounders when you haven’t.

I can’t explain why this is wrong (I’m not sure I understand it myself), but I can show you examples proving that it is. We’ll work through a few of them, comparing the results of a traditional regression with those of a couple of legit causal inference libraries. Since we use simulated data, we’ll also be able to compare against the “true” treatment effect.
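
Here is a minimal sketch of one way the naive approach breaks (my own toy setup, not one of the examples from the post): when the treatment effect varies with the confounder and treatment assignment depends on that same confounder, the coefficient on the treatment indicator no longer estimates the average treatment effect, whereas a method that models the assignment mechanism (here, inverse-probability weighting with the known propensity) recovers it.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100_000
x = rng.uniform(0, 1, n)                      # confounder
p = 1 / (1 + np.exp(-(4 * x - 3)))            # treatment more likely for large x
t = rng.binomial(1, p)
y = t * (1 + 2 * x) + x + rng.normal(size=n)  # heterogeneous effect; true ATE = 1 + 2 * E[x] = 2

# "Controlling for confounders" by regressing on treatment and confounder alike
naive = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit()
print(naive.params[1])  # noticeably larger than the true ATE of 2

# Inverse-probability weighting with the (here, known) propensity recovers the ATE
print(np.mean(t * y / p - (1 - t) * y / (1 - p)))  # close to 2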

Continue reading

A/B testing my resume

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems to be based on opinion or anecdote, with very little science behind it.

Well, let’s fix that.

Being currently open to work, I thought this would be the right time to test this scientifically. I have two versions of my resume: a one-page version and a two-page one.

The purpose of a resume is to land you an interview, so we’ll track for each resume how many applications yield a call for an interview. Non-responses after one week are treated as failures. We’ll model the effectiveness of a resume as a binomial distribution: all other things being equal, we’ll assume all applications using the same resume type have the same probability ($p_1$ or $p_2$) of landing an interview. We’d like to estimate these probabilities, and decide if one resume is more effective than the other.
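
To give a flavour of the comparison, here is a minimal sketch with made-up counts (not my real numbers), using a beta-binomial model:

import numpy as np

rng = np.random.default_rng(0)
calls_1, sent_1 = 4, 25   # one-page resume (hypothetical counts)
calls_2, sent_2 = 7, 25   # two-page resume (hypothetical counts)

# With uniform Beta(1, 1) priors, each posterior is Beta(calls + 1, misses + 1)
p1 = rng.beta(calls_1 + 1, sent_1 - calls_1 + 1, size=100_000)
p2 = rng.beta(calls_2 + 1, sent_2 - calls_2 + 1, size=100_000)

print((p2 > p1).mean())  # posterior probability that resume 2 is more effective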

Continue reading

Unit testing SQL with PySpark

Machine-learning applications frequently feature SQL queries, which range from simple projections to complex aggregations over several join operations.

There doesn’t seem to be much guidance on how to verify that these queries are correct. All mainstream programming languages have embraced unit tests as the primary tool to verify the correctness of the language’s smallest building blocks—all, that is, except SQL.

And yet, SQL is a programming language and SQL queries are computer programs, which should be tested just like every other unit of the application.

I’m not responsible

All mainstream languages have libraries for writing unit tests: small computer programs that verify that each software module works as expected. But SQL poses a special challenge, as it can be difficult to use SQL to set up a test, execute it, and verify the output. SQL is a declarative language, usually embedded in a “host” programming language—a language in a language.

So to unit test SQL we need to use that host language to set up the data tables used in our queries, orchestrate the execution of the SQL queries, and verify the correctness of the results.
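
Here is a minimal sketch of what that looks like with PySpark and pytest; the table and query are hypothetical stand-ins, not taken from a real application:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    spark = SparkSession.builder.master("local[1]").appName("sql-tests").getOrCreate()
    yield spark
    spark.stop()

def test_revenue_per_customer(spark):
    # Set up the input table in the host language
    orders = spark.createDataFrame(
        [("alice", 10.0), ("alice", 5.0), ("bob", 7.0)],
        ["customer", "amount"],
    )
    orders.createOrReplaceTempView("orders")

    # Execute the SQL under test
    result = spark.sql("""
        SELECT customer, SUM(amount) AS revenue
        FROM orders
        GROUP BY customer
    """)

    # Verify the output
    assert {(r.customer, r.revenue) for r in result.collect()} == {("alice", 15.0), ("bob", 7.0)}

The pattern is always the same: create small DataFrames in the host language, register them as temporary views, run the SQL under test, and assert on the collected result.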

Continue reading

Scraping real estate for fun

Here’s a fun weekend project: scrape the real estate classifieds of the website of your choice, and do some analytics on the data. I did just that last weekend, using the Scrapy Python library for web scraping, which I then let loose on one of the major real estate classifieds websites in Switzerland (can’t tell you which one—not sure they would love me for it).
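
For the curious, a Scrapy spider for this kind of job is only a few lines. The URL and CSS selectors below are placeholders rather than the ones I actually used:

import scrapy

class ListingsSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["https://example.com/buy?page=1"]  # placeholder URL

    def parse(self, response):
        for ad in response.css("div.listing"):  # placeholder selectors throughout
            yield {
                "rooms": ad.css("span.rooms::text").get(),
                "area": ad.css("span.area::text").get(),
                "price": ad.css("span.price::text").get(),
                "city": ad.css("span.city::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Running it with scrapy runspider listings.py -o listings.csv dumps the scraped items straight to a CSV file.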

After about 10 minutes I had the data for 12’124 apartments or houses for sale across Switzerland, with room count, area, price, city, and canton.

I’ve imported the data in R, and log-transformed the room count, area, and price because of extreme skewness. Here’s the resulting scatterplot matrix, obtained with ggpairs():

There are a number of interesting features, even in this raw, unclean dataset:

  • there are about twice as many apartments for sale as houses
  • the room count comes in discrete values in steps of 0.5 (half rooms are frequently used for “smaller” rooms such as a small kitchen, a small hallway, etc)
  • the room count is highly correlated with area, as expected
  • the price is more correlated with the area than with the room count
  • there are several extreme outliers:
    • a property with 290 rooms (was a typo; the owner meant an area of 290 m2)
    • some properties with abnormally low area (one of them was a house with a listed room count of 1 and area of 1 m2—obviously didn’t bother to enter correct data)
    • and, more interestingly, several properties with abnormally low prices; the lowest-priced item is a 3.5-room, 80 m2 apartment in Fribourg priced at CHF 99.-.

Before we go any further, we’ll obviously have to clean up these faulty data points. There don’t seem to be many of them, so I’ll do that manually, and write a follow-up post if I find anything interesting.

Testing Scientific Software with Hypothesis

Writing unit tests for scientific software is challenging because frequently you don’t even know what the output should be. Unlike business software, which automates well-understood processes, here you cannot simply work your way through use case after use case, unit test after unit test. Your program is either correct or it isn’t, and you have no way of knowing.

If you cannot know if your program is correct, does it mean you cannot test it? No, but you’ll be using other techniques than the ones traditionally used for unit tests. In this piece we’ll look at two techniques you can use to test your software. These are:

  • comparing the output with a reference (an oracle)
  • verifying that the output satisfies certain properties.

And we’ll illustrate these two techniques through three examples:

  • the square root function (see the sketch just after this list)
  • the trigonometric functions
  • a simple Pandas operation
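
As a taste of the property-based approach, here is a minimal sketch using Hypothesis on math.sqrt (the examples in the post differ in the details):

import math
from hypothesis import given, strategies as st

@given(st.floats(min_value=0.0, max_value=1e12, allow_nan=False))
def test_squaring_the_square_root_recovers_the_input(x):
    root = math.sqrt(x)
    assert math.isclose(root * root, x, rel_tol=1e-9, abs_tol=1e-12)

@given(st.floats(min_value=1.0, max_value=1e12, allow_nan=False))
def test_sqrt_does_not_exceed_its_argument(x):
    assert math.sqrt(x) <= x

We never say what sqrt(x) should be for any particular x; we only state properties that any correct implementation must satisfy, and let Hypothesis hunt for counter-examples.
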
Continue reading

Monty Hall: a programmer’s explanation

I take it we’re all familiar with the infamous Monty Hall problem:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say C, which has a goat. He then says to you, “Do you want to pick door B?” Is it to your advantage to switch your choice?

The correct strategy is to accept the offer and switch doors. The expected gain (that is, the probability of winning the car) when doing so is 2/3, vs 1/3 when keeping the original door (only two doors remain closed, so one of them must hide the car; hence the two expected gains sum to 1).

https://xkcd.com/1282/

The mathematical proof is straightforward but most people (including me) have a hard time understanding intuitively what’s going on. There are plenty of explanations available online, but I’ve recently simulated the problem and come across one that I think is simpler and more elegant than anything I’ve seen so far. And I didn’t make it up; the simulation code did. Here’s how.

Classic Monty

Let’s simulate 10,000 rounds of the game. The car vector will hold the door which has the car behind it:

n <- 10000
car <- sample(c('A', 'B', 'C'), size = n, replace = TRUE)

Quick check that the door appears randomly assigned:

table(car)
## car
##    A    B    C 
## 3329 3310 3361

Let’s now simulate the door being opened by Monty. Without loss of generality, let’s assume you choose door A. Then if the car is behind door A, Monty will open door B or C at random; else he will open the door that has a goat.

door <- ifelse(
  car == 'A',
  sample(c('B', 'C'), size = n, replace = TRUE), # chosen at random 
  ifelse(car == 'B', 'C', 'B'))  # otherwise whichever door has a goat

Now the gain, when declining to switch, is given by how many cases have the car behind door A:

sum(car == 'A') / n
## [1] 0.3329

As expected, it is about 1/3.

The gain, when switching doors, is given by how many times the unopened door has the car:

unopened <- ifelse(door == 'B', 'C', 'B')
sum(car == unopened) / n
## [1] 0.6671

But the car is either behind the unopened door or behind A; you win if the car is not behind A. So the last expression is equivalent to

sum(car != 'A') / n
## [1] 0.6671

And this gives exactly the same result. The probability of winning the car when declining to switch is the probability that the car is behind door A. The probability of winning when switching is just the probability that the car is not behind door A, i.e., 2/3. And that’s the simplest, most elegant explanation of the paradox I’ve ever found.

But there’s more. Notice that in our final expression, we had no need to know which door was opened. And this has interesting consequences for a variation of the game which I’ll call the deterministic Monty Hall.

Deterministic Monty

In the classic version of the game, Monty will select a door at random between B and C if the car happens to be behind A, the door chosen by the player. Let’s now suppose instead that when the car is behind A, Monty always chooses to open door B. Will that information change our strategy?

Intuitively, we might feel that when Monty opens door B, the car is more likely to be behind door A, and that perhaps we should choose to stay instead. Well, let’s simulate this.

door_deterministic <- ifelse(
  car == 'A', 'B', # not chosen at random anymore
  ifelse(car == 'B', 'C', 'B'))

The expressions for the expected gains are very similar. When choosing to stay, the expected gain remains

sum(car == 'A') / n
## [1] 0.3329

While when choosing to switch, the gain becomes

unopened_deterministic <- ifelse(door_deterministic == 'B', 'C', 'B')
sum(car == unopened_deterministic) / n
## [1] 0.6671

But this last expression is again simply equal to

sum(car != 'A') / n
## [1] 0.6671

because no matter which strategy Monty used, the car still has to be behind A or the unopened door.

Now this was completely counterintuitive to me: surely, knowing Monty’s strategy should give you an edge? But let’s consider separately the two possibilities:

Monty opens B

Unlike in the classic game, this now happens whenever the car is behind A or C (2/3 of the cases). And among these cases, the car is behind A in exactly half of them. So the car is equally likely to be behind A or C, and switching doors will neither help nor hurt you. This scenario therefore contributes $2/3 \times 1/2 = 1/3$ to the total expected gain.

Monty opens C

This can only happen when the car is behind B, so you certainly benefit from switching doors. But this happens only 1/3 of the time, and this scenario contributes $1/3 \times 1 = 1/3$ to the total expected gain.

So in this variant of the game, your total expected gain remains exactly 2/3.

“Computational” proof

I’m not sure this would be accepted as a rigorous proof, but when I wrote the expression for the expected gain I always ended up with sum(car != 'A') / n, no matter what Monty’s strategy was. Or in plain English:

The expected gain when switching doors is the probability that the car was not behind A in the first place.

This, of course, is:

  • obvious
  • independent of Monty’s strategy
  • always equal to 2/3
  • possibly the most succinct explanation of Monty’s paradox I’ve come across

But it never hurts to double-check what the maths tell us.

Mathematical proof

Let’s work out the probability that the car is behind door B, given that Monty opens door C.

$\DeclareMathOperator{\Open}{Open}$

\begin{align*}
\Pr(B \mid \Open(C)) &= \frac{\Pr(\Open(C) \mid B) \times \Pr(B)}{\Pr(\Open(C))} && \text{by Bayes} \\
&= \frac{1 \times \Pr(B)}{\Pr(\Open(C))} && \text{because $B \implies \Open(C)$} \\
&= \frac{1/3}{\Pr(\Open(C))} && \text{because each door is equally probable} \\
&= \frac{1}{3} \frac{1}{\Pr(\Open(C) \mid A) \times \Pr(A) + \Pr(\Open(C) \mid B) \times \Pr(B)} && \text{because $A$ or $B$ must be true when $C$ is open} \\
&= \frac{1}{\Pr(\Open(C) \mid A) + \Pr(\Open(C) \mid B)} && \text{because $\Pr(A) = \Pr(B) = 1/3$}, \\
\end{align*}

and similarly for $\Pr(C \mid \Open(B))$. This result does not depend on Monty’s strategy, but the final figures do:

\begin{align*}
\text{Random choice:} && \Pr(B \mid \Open(C)) &= 2/3 \\
&& \Pr(C \mid \Open(B)) &= 2/3 \\
\text{Deterministic:} && \Pr(B \mid \Open(C)) &= 1 \\
&& \Pr(C \mid \Open(B)) &= 1/2,
\end{align*}

but the expected gain when switching doors (without knowing Monty’s strategy) depends on $\Pr(\Open(B))$ and $\Pr(\Open(C))$. When Monty’s choice is random, $\Pr(\Open(B)) = \Pr(\Open(C)) = 1/2$, but when his choice is deterministic, we have $\Pr(\Open(B)) = 2/3$ and $\Pr(\Open(C)) = 1/3$. Either way, you can easily check that the expected gain is 2/3.

Still. What the computer program taught me was a much simpler way. All you had to do was observe that

$$\Pr(B \mid \Open(C)) = \Pr(\neg A \mid \Open(C)) = \Pr(\neg A) = 2/3,$$

since $A$ and $\Open(C)$ are independent in the random variant. In the deterministic case we’re not really interested in $\Pr(B \mid \Open(C))$ but in the probability that the car is not behind $A$, given only that some door will be opened. But as the computer program tells us, the car’s location is independent of the fact that some door gets opened, so again the probability of winning is simply $\Pr(\neg A) = 2/3.$

Reading S3 data from a local PySpark session

For the impatient

To read data on S3 to a local PySpark dataframe using temporary security credentials, you need to:

  1. Download a Spark distribution bundled with Hadoop 3.x
  2. Build and install the pyspark package
  3. Tell PySpark to use the hadoop-aws library
  4. Configure the credentials

The problem

When you attempt to read S3 data from a local PySpark session for the first time, you will naturally try the following:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
foo = spark.read.parquet('s3a://<some_path_to_a_parquet_file>')

But running this yields an exception with a fairly long stacktrace, the first lines of which are shown here:

Py4JJavaError: An error occurred while calling o574.parquet.
 : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Solving this is, fortunately, trivial. You need the hadoop-aws library; the correct way to add it to PySpark’s classpath is to ensure the Spark property spark.jars.packages includes org.apache.hadoop:hadoop-aws:3.2.0. (Be sure to set the same version as your Hadoop version.)

xkcd 1742 Will It Work

(There’s some advice out there telling you to download those jar files manually and copy them to PySpark’s classpath. Don’t do that. Using the spark.jars.packages method ensures you also pull in any transitive dependencies of the hadoop-aws package, such as the AWS SDK. You don’t want to do that manually.)

So what’s this talk of Hadoop 3.x?

Spark 2.x ships with, at best, Hadoop 2.7. But Hadoop didn’t support all AWS authentication mechanisms until Hadoop 2.8. So if you need to access S3 locations protected by, say, temporary AWS credentials, you must use a Spark distribution with a more recent version of Hadoop.

It’s probably possible to combine a plain Spark distribution with a Hadoop distribution of your choice, but the easiest way is to just use Spark 3.x. However, there’s a catch: pyspark on PyPI provides Spark 3.x bundled with Hadoop 2.7. There’s work under way to also provide Hadoop 3.x, but until that’s done the easiest option is to download and build pyspark yourself.

Download Spark from their website, making sure to select a 3.x release built with Hadoop 3.x. Unzip the distribution, go to the python subdirectory, build the package, and install it:

cd spark-3.0.0-bin-hadoop3.2
cd python
python setup.py sdist
pip install dist/*.tar.gz

(Of course, do this in a virtual environment unless you know what you’re doing.)

With this out of the way you should be able to read any publicly available data on S3, but first you need to tell Hadoop to use the correct authentication provider. For public data you want org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.read.csv('s3a://noaa-ghcn-pds/csv/2020.csv', inferSchema=True)

After a while, this will give you a Spark dataframe representing one of the NOAA Global Historical Climatology Network Daily datasets.

Ok what about AWS credentials then?

That’s why you need Hadoop 3.x, which provides several authentication providers to choose from. For example, say your company uses temporary session credentials; then you need to use the org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider authentication provider. The name of that class must be given to Hadoop before you create your Spark session. The Hadoop documentation says you should set the fs.s3a.aws.credentials.provider property to the full class name, but how do you do that when instantiating the Spark session? There’s documentation out there that advises you to use the _jsc member of the SparkContext, e.g.

sc._jsc.hadoopConfiguration().set('fs.s3n.awsAccessKeyId',<access_key_id>)

But the leading underscore shows clearly that this is a bad idea. Instead, all Hadoop properties can be set while configuring the Spark Session by prefixing the property name with spark.hadoop:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', <access_key>)
conf.set('spark.hadoop.fs.s3a.secret.key', <secret_key>)
conf.set('spark.hadoop.fs.s3a.session.token', <token>)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

And you’ve got a Spark session ready to read from your confidential S3 location.

Oh and one more thing

The temporary session credentials are typically provided by a tool like aws_key_gen. Running that tool will create a file ~/.aws/credentials with the credentials needed by Hadoop to talk to S3, but surely you don’t want to copy/paste those credentials to your Python code. Instead you can also use aws_key_gen to set the right environment variables, for example with

eval `aws_key_gen shell`

before running your Python program. If you do so, you don’t even need to set the credentials in your code.

Probability’s not fair

I’m currently reviewing a book for a well-known tech publisher, in which the author discusses the assignment of probabilities to each face of a die, when the faces are not of equal size. In the absence of more information, he claims that this is impossible.

I beg to differ.

What follows will depend a lot on how you define a probability. As an aspiring Bayesian, I accept probabilities as reflecting our degree of belief in a particular outcome. Consequently, following Jaynes’s convention, all probabilities should be understood as being conditional on the sum of all your prior knowledge. In other words, an event $A$ cannot have a “naked” probability $\Pr(A)$; it can only have a conditional probability $\Pr(A\mid I)$, where $I$ stands for all your prior knowledge and beliefs.

So back to that die. Let’s first consider a similar, but simpler problem:

You’re given a coin, and told that the coin is loaded, but you’re not told which way. What is the probability of tossing heads?

Let’s define the amount of loading $\rho$ as the fraction of heads that will result as the number of tosses approaches infinity. We know that $0\leq\rho\leq 1$ and that $\rho \neq 0.5$.

What follows works even for this coin.

We may even have in mind a prior distribution for $\rho$; perhaps we suspect that most loaded coins have $\rho$ close to $0.5$ to avoid detection. But we won’t need that information, provided we accept that the coin is equally likely to be loaded either way. That is to say, our prior probability distribution function for $\rho$ satisfies $p(\rho) = p(1-\rho)$.

Then the probability of heads can be found from $\Pr(\text{Heads} \mid \rho, I) = \rho$ and from exploiting the symmetry of the problem:

\begin{align*}
\Pr(\text{Heads}\mid I) &= \int_0^1 \Pr(\text{Heads} \mid \rho, I) p(\rho) \mathrm{d}\rho \\
&= \int_0^1 \rho p(\rho) \mathrm{d}\rho \\
&= \left.\left[\int_0^1 \rho p(\rho) \mathrm{d}\rho + \int_0^1 \rho p(\rho) \mathrm{d}\rho\right] \middle/ 2 \right. \\
&= \left.\left[\int_0^1 \rho p(\rho) \mathrm{d}\rho + \int_0^1 \rho p(1 - \rho) \mathrm{d}\rho\right] \middle/ 2 \right. \\
&= \left.\left[\int_0^1 \rho p(\rho) \mathrm{d}\rho + \int_0^1 (1 - \rho) p(\rho) \mathrm{d}\rho\right] \middle/ 2 \right. \\
&= \left.\left[\int_0^1 p(\rho) \mathrm{d}\rho\right] \middle/ 2 \right. \\
&= 1 / 2 \\
\end{align*}

Of course, we could also have simply noted that by symmetry, $\Pr(\text{Heads} \mid I) = \Pr(\text{Tails} \mid I)$. And since $\Pr(\text{Heads} \mid I) + \Pr(\text{Tails} \mid I) = 1$, we get again that $\Pr(\text{Heads} \mid I) = 1 / 2$.
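
Here is a quick Monte Carlo check of that argument (my own sketch): draw the loading $\rho$ from any prior that is symmetric around $0.5$, then toss the coin once.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# A symmetric prior for rho: heavily loaded one way or the other, never fair
rho = np.where(rng.random(n) < 0.5, 0.9, 0.1)
heads = rng.random(n) < rho

print(heads.mean())  # ~ 0.5, even though every single coin is loaded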

This is the correct answer: the only probability one can assign to a toss of heads from a loaded coin, without more information, is $1/2$, even when we know that in the long run the number of heads will not be half the total number of tosses. It’s perhaps counterintuitive, but remember that in the Bayesian context a probability does not have the same interpretation as in a frequentist setting. By a similar reasoning, the probability of any given face of a die coming up, when we don’t know which way the die is biased, is $1/6$.

Machine Learning in R: Start with an End-to-End Test

As a data scientist, you will likely be asked one day to automate your analysis and port your models to production environments. When that happens you cross the blurry line between data science and software engineering, and become a machine learning engineer. I’d like to share a few tips we’re exploring at Expedia on how to make that transition as successful as possible.

Let’s first discuss testing, and let’s assume without loss of generality that you develop your machine learning application in R. Just like any other software system, your application needs to be thoroughly tested before being deployed. But how do you ensure that your application will perform as expected when dealing with real data? If you’ve structured your application as an R package you already know that you can write unit tests that will run automatically when you check your package; but how do you test the application as a whole?

Testing an application as a whole locally, known as integration testing, should normally not be part of your suite of unit tests: integration tests are frequently much slower than unit tests, and would break your regular build-test workflow. (Besides, if you’ll forgive some pedantry, an integration test is by definition not a unit test, since it tests the whole system.)

Nevertheless, integration tests serve a useful purpose in your workflow. They’re often the last test you’ll run locally before releasing your application to more ‘serious’ testing. But that doesn’t mean you should write them last; in this article I’m going to argue that writing them first, before even writing a line of code, tends to produce a better, clearer, and more testable design.

Let’s illustrate with an example. You work for a travel agency whose CEO would like a system that predicts the survival probability of its customers, should their cruise ship hit an iceberg. You are told that the agency has extensive data on shipwreck survivors going back to 1912, when the Titanic sank, and that your application should apply machine learning to a dataset with the same structure as the famous Titanic survivor data set, but that you cannot access the entire set directly.

You are relatively free to choose the output format of your application, because it will be picked up downstream by other teams. You decide to start with a CSV file for now.

Well, where to begin? From an earlier analysis (ideally documented in a vignette, a notebook, or another report) you know that a random forest is the best model for this problem. So it would make sense to verify that your application does indeed train a random forest from the training data. How would we know that? The simplest way is to save the trained model along with the other output, and check that it is indeed a random forest, trained with the right number of training samples. And this is definitely something we know how to test.

I’m going to assume you have the basic structure of an R package in place, such as the one created by default by RStudio when you start a new project. We’ll call it app:

app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
└── man
    └── hello.Rd

Besides the testthat package, which has become the standard library for writing unit tests in R, I recommend using testthis, which facilitates creating and running integration tests:

usethis::use_testthat()
testthis::use_integration_tests()

This sets up the basic structure for running unit tests as well as integration tests:

app
├── DESCRIPTION
├── NAMESPACE
├── R
│   └── hello.R
├── app.Rproj
├── man
│   └── hello.Rd
└── tests
    ├── testthat
    │   └── integration_tests
    └── testthat.R

Any test you create under tests/testthat/integration_tests will be run when you call testthis::test_integration(), but not when running your regular unit tests.

I like to begin with a failing test to make sure that everything is set up correctly. Let’s go ahead and write our end-to-end test, calling it tests/testthat/integrations_test-end_to_end.R:

# integrations_test-end_to_end.R
test_that("this test fails", {
  expect_true(FALSE)
})

As expected, we can now run the “integration” test and see it fail:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:2: failure: this test fails
FALSE isn't true.
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0

Good. Let’s now write test code as if a model had been serialized, and check its number of training samples.

# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  app::main()
})

Hold it right there. This test will, of course, not even run because the main() function doesn’t exist. Let’s create R/main.R with the following contents before we continue:

# main.R
main <- function() {}

The integration test now runs, so we can complete it. But before we can continue, we will need some training data, which will be the Titanic survival dataset. We don’t want main() to read the data from file, but from an interface that is as similar as possible to the actual environment.

We’ve been told that the application will read from a PostgreSQL relational database. We could certainly have our test code set up a local PostgreSQL instance, and load it with the training dataset; this will sometimes be the right strategy, but in this case I think it is overkill. Instead, we will exploit the fact that most database systems can be abstracted away through ODBC drivers. R code that talks to PostgreSQL through its ODBC driver should work just as well against any other database system. In particular, we’ll use the in-memory SQLite database for integration testing.

So our integration test code will look like this, in pseudo-code:

# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  <load titanic training set>
  <copy it to SQLite>
  app::main()
  <load serialized model>
  <verify number of training samples>
})

Well, let’s do it. Create the tests/testthat/testdata-raw folder (or just call testthis::use_testdata_raw()), and copy there the train.csv dataset downloaded from https://www.kaggle.com/c/titanic/data, renamed to titanic.csv.

We’ll proceed with turning our pseudo-code into real code, refactoring as we go along:

# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  # Load Titanic training set
  titanic <- readr::read_csv("../testdata-raw/titanic.csv")
  # Copy it to SQLite
  con <- odbc::dbConnect(
    odbc::odbc(),
    driver = "SQLite",
    database = "file::memory:?cache=shared"
  )
  dplyr::copy_to(con, titanic, temporary = FALSE)
  # Call main()
  app::main()
  # Read serialized model
  model <- readRDS("output/model.rds")
})

Notice the file::memory:?cache=shared URI used to instantiate the in-memory SQLite database. If you simply pass the usual :memory: filename you won’t be able to connect to it from more than one place in your program; but we want the test code to populate the same database that will be accessed by the main application, so we must pass this special file::memory:?cache=shared URI.

Notice also the temporary = FALSE argument to dplyr::copy_to(). This is required for the table to be accessible from another connection than the current one.

Calling testthis::test_integration() should now fail:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1 1   | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:14: warning: a model is output with the right number of training samples
cannot open compressed file 'output/model.rds', probable reason 'No such file or directory'

test-test_end-to-end.R:14: error: a model is output with the right number of training samples
cannot open the connection
1: readRDS("output/model.rds") at /Users/dlindelof/Work/app/tests/testthat/integration_tests/test-test_end-to-end.R:14
2: gzfile(file, "rb")
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 1
Skipped:  0

It fails indeed, so we stop writing test code at this point and fix it. It fails because no model is serialized yet; we don’t want to overengineer things at this point, so we fix that by instantiating a fake model and serializing it:

# main.R
main <- function() {
  model <- lm(~ 0)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

The tests pass now, so we keep on going:

# integrations_test-end_to_end.R
test_that("a model is output with the right number of training samples", {
  # Load Titanic training set
  titanic <- readr::read_csv("../testdata-raw/titanic.csv")
  # Copy it to SQLite
  con <- odbc::dbConnect(
    odbc::odbc(),
    driver = "SQLite",
    database = "file::memory:?cache=shared"
  )
  dplyr::copy_to(con, titanic, temporary = FALSE)
  # Call main()
  main()
  # Read serialized model
  model <- readRDS("output/model.rds")
  # Verify number of training samples
  expect_equal(length(model$fitted.values), nrow(titanic)) # (1)
})
# (1) check if the model has been trained with the right number of samples

There are several issues with that test, but it is complete (for now) and, as expected, fails:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1     | test_end-to-end
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
[1] 0 - 891 == -891
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
OK:       0
Failed:   1
Warnings: 0
Skipped:  0

To pass the test, the main() function needs to open an ODBC connection to SQLite, read the training dataset, train (some) model with it, and serialize the model. But we run into a problem: how does our application know where to find the database it should connect to? And how does it know which driver to load? A reasonable solution is to use the config package with a configuration file. When we run integration tests we’ll use the test settings, and when we run in production we’ll use the default settings.

So here’s a first attempt at config.yml:

# config.yml
test:
  driver: "SQLite"
  database: "file::memory:?cache=shared"

default:
  driver: "PostgreSQL"
  database: "TBD"

We should now be able to read the titanic dataset from main():

# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

Running the tests, we see that the dataset gets read without errors by main(), but the number of training samples in the model remains 0:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✖ |   0 1     | test_end-to-end [0.1 s]
───────────────────────────────────────────────────
test-test_end-to-end.R:16: failure: a model is output with the right number of training samples
length(model$fitted.values) not equal to nrow(titanic).
1/1 mismatches
[1] 0 - 891 == -891
───────────────────────────────────────────────────

══ Results ════════════════════════════════════════
Duration: 0.1 s

OK:       0
Failed:   1
Warnings: 0
Skipped:  0

That is simply because we have not specified any target variable in our call to lm(). Let’s fix that:

# main.R
main <- function(env = "default") {
  connection_params <- config::get(config = env)
  con <- do.call(odbc::dbConnect, c(odbc::odbc(), connection_params))
  train <- dplyr::tbl(con, "titanic") %>% dplyr::collect()
  model <- lm(Survived ~ 0, data = train)
  if (!dir.exists("output")) dir.create("output")
  saveRDS(model, "output/model.rds")
}

And the tests now pass:

> testthis::test_integration()
Loading app
✔ |  OK F W S | Context
✔ |   1       | test_end-to-end

══ Results ════════════════════════════════════════
OK:       1
Failed:   0
Warnings: 0
Skipped:  0

Let’s now recap what we have achieved. We have an integration test case that verifies that a linear model (or any model, really, provided it exposes a fitted.values attribute) gets trained by our application on a dataset read over an ODBC connection we provide. But more importantly, we have written an automated test that exercises the whole application, and will keep on doing so as we implement the missing pieces.

It doesn’t feel like we have accomplished much; after all, our model is certainly not the best one for the classification task we have at hand, and we don’t even train it in a sensible manner. But consider this: before you began the project, chances are the customer (or the boss) didn’t really know what they wanted. Now you are in a position to hold the following conversation:

you: We have decided that the first user story should prove that the application, when run, trains a model on the Titanic dataset. This is what this first integration test demonstrates.
boss: How good is that model?
you: Its accuracy is 62%, if that’s what you mean.
boss: Really? That’s pretty impressive for a first shot. How did you achieve that?
you: It predicts that everybody dies. And 62% of passengers in the training dataset did indeed die.
boss: (shifts uncomfortably) hm I see well I guess we should think of alternative metrics. We’d rather avoid false positives, even at the cost of some accuracy. Can you do something about it?

So by getting a woefully incomplete, but working, application out so soon, you’ve been able to help your stakeholders understand what really matters to them. Better yet, you are now in a position to write the next item on your backlog as a precise user story:

As a customer, I want to know if I'm going to die on this cruise
but a death prediction had better be right 90% of the time

Time to write a new integration test and our first unit tests.