How to set up a reverse SSH tunnel with Amazon Web Services

When the startup shut down, there were still dozens of netbooks out in the wild, collecting data on the residential houses fitted with our adaptive heating control algorithms and hopelessly trying to connect to a VPN server that no longer existed, in order to upload all that data to our now-defunct database. That’s a lot of data, sitting and growing on a lot of internet-connected devices.

Some of us came together and figured it might be possible to resume collecting that data and showcase the benefits of having our system installed in your house. The first problem was: how do we connect to these netbooks? And at near-zero cost?

Warning: hacks ahead.

We figured that step one would be to establish a reverse SSH tunnel to each of these netbooks. A reverse SSH tunnel is set up when an otherwise-inaccessible device (in our case, a netbook) connects to a publicly available SSH server, opens a port on that server, and forwards (“tunnels”) all incoming connections to that port back to the device. Short of setting up a proper VPN, this is the best way to connect to a device that isn’t exposed to the public internet.

To set up a reverse SSH tunnel you first need a publicly available machine running an SSH server that will accept reverse tunnels. The good news is that anyone can have one by signing up to Amazon Web Services (AWS) and going to the Elastic Compute Cloud (EC2) service:

Next you want to launch an instance:

You really want the smallest, freest possible machine here that runs Linux:

Make sure you have generated a key pair for this instance (and that you have saved the private key!) and that the machine accepts SSH from anywhere:

When you set up an SSH tunnel you will also need to make sure the EC2 instance accepts traffic on the ports that the tunnel will open. The port numbers are up to you; I created two tunnels, one on port 7030 and one on port 7040, so navigate to the settings of your instance’s security group and make sure the instance accepts TCP traffic on those ports:
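If you prefer the command line to the AWS console, the same thing can be done with the AWS CLI. This is just a sketch; the security group ID below is a placeholder for your own group’s ID:

# Open the tunnel ports on the instance's security group.
aws ec2 authorize-security-group-ingress --group-id <security-group-id> --protocol tcp --port 7030 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id <security-group-id> --protocol tcp --port 7040 --cidr 0.0.0.0/0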

That’s all on the server side. On the netbook side you need to do three things: 1) get the private key, 2) change the file permissions on the key, 3) establish the tunnel.

Getting the private key onto the netbook is entirely up to you. What I did, which is absolutely not recommended, was to place the private key neurobat.pem on the same web server that hosts this blog. I could then fetch the key with

wget --no-check-certificate davidlindelof.com/<path-to-key>

(Notice the --no-check-certificate argument. Those netbooks are hopelessly out of date and can no longer validate HTTPS certificates.)

Next you need to set the right permissions on the key, or SSH will refuse to use it:

chmod 400 <path-to-key>

And finally you can set up the tunnel, say on port 7000. Here -f sends SSH to the background, -N skips running a remote command, and -R :7000:localhost:22 asks the server to listen on port 7000 and forward any connection to it back to the device’s own SSH daemon on port 22:

ssh -i <path-to-key> -fN -R :7000:localhost:22 ec2-user@<ec2-ip-address>

If all went well you’ll now be able to ssh into the remote device by sshing to your EC2 instance on port 7000:

ssh <username-on-device>@<ec2-ip-address> -p 7000
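One hedged caveat I should add: by default OpenSSH binds remote-forwarded ports to the server’s loopback interface only, so if the command above cannot connect from your own machine, you may need to enable GatewayPorts on the EC2 instance (the service name below assumes a systemd-based distribution such as Amazon Linux 2):

# On the EC2 instance: let remote-forwarded ports bind on all interfaces,
# not just localhost, then restart the SSH daemon.
echo "GatewayPorts yes" | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd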

As an extra precaution you might also want to look into using the autossh program, which can detect connection drops and attempt to reconnect.
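I haven’t shown it here, but a minimal sketch of what that might look like, reusing the same key and port, follows; -M 0 turns off autossh’s separate monitoring port and relies on SSH’s own keep-alive options instead:

# Same tunnel as above, but supervised by autossh so it reconnects after drops.
autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
  -i <path-to-key> -fN -R :7000:localhost:22 ec2-user@<ec2-ip-address>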

Clunky? Sure. Hacky? You bet. Brittle? Oh my god. But it did the job and I can now work on doing things the “right” way, i.e. setting up a proper VPN solution, probably based on OpenVPN or something.

Deep silence or deep work

It’s Monday afternoon. It’s a holiday, but I have a couple of things from last week to catch up on. The rest of the family is either at holiday camp or taking a nap in the bedroom. I’m working from home. But the home is anything but silent.

I can hear the girls’ muffled chatting; from the sound of it they’re making up some story with their dolls. The village church bell just tolled a single note for the quarter past the hour. My phone’s notification just dinged, and in a rare moment of self-discipline I don’t pick it up. Some birds are chirping outside. The convection oven in the kitchen has had a malfunction for years and emits a beep every 10 seconds that I have learned to ignore. Occasionally a plane comes in overhead to land at Geneva’s airport; there’s only one runway, and depending on the direction of the wind, planes come in from the direction of our village. And on top of it all I hear some kind of very soft background whine; I usually don’t notice it, but it’s definitely there and I don’t know whether it comes from outside of me or from inside my head.

That’s a lot of noise. These are also the best working conditions I’ve ever experienced. Today I’ve chosen to deliberately notice all these sounds, and now I cannot unhear them.

Then there are the visual distractions. For the past three years I’ve been working from a corner of the living room, the rest of which fills my field of view, along with part of the kitchen.

These working conditions sound bad but they can be fixed. I usually set a screen between me and the rest of the living room, and almost always do my deep focus work wearing noise-canceling over-the-ear headphones, playing focus-friendly music. My family knows that when daddy wears the headphones, he is not to be disturbed unless there’s blood or fire. It mostly works.

Like many others, I used to work in an open-space office. Noise-wise and visual-distraction-wise, open-space offices are possibly better than working from home. On more than one occasion, visitors from abroad have been impressed by the museum-grade silence filling a Swiss open-space office. But open-space offices offer a richer set of options for not concentrating on your deep work. Entire days can go by: interruptions from colleagues, walks to the cafeteria, neighboring conversations to listen in on, more meetings than you should attend because you fear missing out. And the siren song of office perks, of course.

The choice is between perfect quiet filled with distractions, or constant information-free background sounds that you can learn to ignore with monk-like focus. I’ve tried it all and I know what works for me. Do you?

Is The Ratio of Normal Variables Normal?

In Trustworthy Online Controlled Experiments I came across this quote, referring to a ratio metric $M = \frac{X}{Y}$:

Because $X$ and $Y$ are jointly bivariate normal in the limit, $M$, as the ratio of the two averages, is also normally distributed.

That’s only partially true. According to https://en.wikipedia.org/wiki/Ratio_distribution, the ratio of two uncorrelated noncentral normal variables $X = N(\mu_X, \sigma_X^2)$ and $Y = N(\mu_Y, \sigma_Y^2)$ is approximately normal, with mean $\mu_X / \mu_Y$ and variance approximately $\frac{\mu_X^2}{\mu_Y^2}\left( \frac{\sigma_X^2}{\mu_X^2} + \frac{\sigma_Y^2}{\mu_Y^2} \right)$ — but only when $Y$ is unlikely to take negative values, say $\mu_Y > 3 \sigma_Y$.

As always, the best way to believe something is to see it yourself. Let’s generate some uncorrelated normal variables far from 0 and their ratio:

ux <- 100
sdx <- 2
uy <- 50
sdy <- 0.5

X <- rnorm(1000, mean = ux, sd = sdx)
Y <- rnorm(1000, mean = uy, sd = sdy)
Z <- X / Y

Their ratio looks normal enough:

hist(Z)

Which is confirmed by a q-q plot:

qqnorm(Z)
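If you’d rather not rely on eyeballing alone, a formal normality test is a cheap complement (this wasn’t part of the original check; shapiro.test ships with base R and accepts samples of up to 5000 values):

shapiro.test(Z)  # null hypothesis: Z was drawn from a normal distribution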

What about the mean and variance?

mean(Z)
[1] 1.998794
ux / uy
[1] 2
var(Z)
[1] 0.001783404
ux^2 / uy^2 * (sdx^2 / ux^2 + sdy^2 / uy^2)
[1] 0.002

Both the mean and variance are very close to their theoretical values.

But what happens now when the denominator $Y$ has a mean close to 0?

ux <- 100
sdx <- 2
uy <- 10
sdy <- 2

X <- rnorm(1000, mean = ux, sd = sdx)
Y <- rnorm(1000, mean = uy, sd = sdy)
Z <- X / Y

Hard to call the resulting ratio normally distributed:

hist(Z)

Which is also clear with a q-q plot:

qqnorm(Z)

In other words, ratio metrics whose denominator is far from 0 will generally be close enough to a normal distribution for practical purposes. But when the denominator’s mean is only a few sigmas from 0 (five, in this example), that assumption breaks down.

Working with that data scientist

In my current team we have decided to split the work into a number of workstreams, which are in effect subteams responsible for different aspects of the product. One workstream might be responsible for product instrumentation, another for improving the recommendation algorithms, another for the application’s look and feel. Each workstream has its own backlog and its own set of quarterly commitments, which map nicely onto quarterly OKRs.

Workstreams aren’t necessarily disjoint: the same person might contribute to more than one workstream. Indeed, for specialists (UX researchers, UI specialists, data science), that is almost the norm. As an aspiring data scientist myself, I contribute to several workstreams; I may entirely own a key result assigned to one workstream, or provide input (e.g. statistical advice or experiment sizing) to another.

We don’t do daily standups, not even among the software engineers. Instead we meet twice weekly for 30 minutes and review the current plans, update the board, and make sure no one is blocked.

We adopted this process early this year. The response from the team has been generally positive. Compared to a more traditional front-end vs back-end division of labour, the team has cited the following benefits:

  • tighter team cohesion
  • better understanding of what the others are working on
  • more productive team meetings
  • a greater sense of accomplishment

The main drawback with this system affects those of us in a more specialized role, such as UI, UX, or Data Science, who contribute to more than one workstream. We find ourselves compelled to attend the semi-weekly meetings of all the workstreams we are involved with, and never know which ones we can safely skip. On top of this I also have a weekly Data Science sync with the product manager.

At a recent retrospective we agreed to mitigate these issues as follows:

  • notes should be taken at all meetings, and the note-taker should remember to tag any team member who is absent but might need to know something important;
  • we will shorten the sync meetings to 15 minutes and consolidate them so that two workstreams can hold their syncs in the same half-hour (and sometimes the same room).

I can’t say this is the final, perfect way to embed a data scientist in a product team, but at least we have an adaptive process in place: a system to regularly iterate on our processes and give the team permission to adapt its working agreements.

Are you a specialist embedded in a product team mostly made up of software engineers? How do you interact with the rest of the team? I’d love to hear your story in the comments below.

Controlling for covariates is not the same as “slicing”

To detect small effects in experiments you need to reduce the experimental noise as much as possible. You can do that by working with larger sample sizes, but that doesn’t scale well. A far better approach is to control for covariates that are correlated with your response.

I recently gave a talk at our company on the design of online experiments, and someone pointed out that our automated experiment analysis tool implemented “slicing”, that is, running separate analyses on subsets of the data. Wasn’t that the same thing as controlling for covariates?

Controlling for covariates means including them in your statistical model. Running separate analyses means each sub-analysis has a smaller sample size; you may gain some precision because the response is less variable within each subset, but you lose the benefits of the larger sample size.

Let’s illustrate this with a simulation. Let’s say we wish to measure the impact of some treatment, whose effect is about 10 times smaller than the standard deviation of the error term:

mu <- 10  # some intercept
err <- 1  # standard deviation of the error
treat_effect <- 0.1  # the treatment effect to estimate

Let’s say we have a total of 1000 units in each arm of this two-sample experiment, and that they belong to 4 equal-sized groups labeled A, B, C, and D:

n <- 1000

predictor <- 
  data.frame(
    group = gl(4, n / 4, n * 2, labels = c('A', 'B', 'C', 'D')),
    treat = gl(2, n, labels = c('treat', 'control')))

Let’s simulate the response. For simplicity, let’s say that the group membership has an impact on the response equal to the treatment effect:

group_effect <- treat_effect

response <-
  with(
    predictor,
    mu + as.integer(group) * group_effect + (treat == 'treat') * treat_effect + rnorm(n * 2, sd = err))

df <- cbind(predictor, response)

summary(df)
 group       treat         response     
 A:500   treat  :1000   Min.   : 6.368  
 B:500   control:1000   1st Qu.: 9.607  
 C:500                  Median :10.335  
 D:500                  Mean   :10.325  
                        3rd Qu.:11.015  
                        Max.   :13.757  

The following plot shows how the response is distributed in each group. This is one of those instances where you need statistical models to detect effects that are hard to see in a plot:

ggplot2::ggplot(df, ggplot2::aes(x = group, y = response)) +
  ggplot2::geom_boxplot()

Fitting the full model yields the following confidence intervals:

mod_full <- lm(response ~ group + treat, df)
confint(mod_full)
                    2.5 %      97.5 %
(Intercept)  10.122391874 10.32334506
groupB        0.009174272  0.26336218
groupC        0.106049411  0.36023732
groupD        0.116794148  0.37098206
treatcontrol -0.193199676 -0.01346168

All coefficients are estimated correctly, and the width of the confidence interval for the treatment effect is about 0.18. The treatment effect is statistically significant. Recall that the error term has a standard deviation of 1 and that $n = 1000$ per arm, so we would expect the 95% confidence interval on the treatment effect to have width $2 \times 1.96 \times \sigma \times \sqrt{2/n}$, or about 0.18. We are not very far off.

What happens now if, instead of controlling for the group, we “slice” the analysis, i.e. fit four separate models, one per group? On one hand we will have a smaller error term than a global model that did not control for the group covariate; on the other hand we will have fewer observations per group, which will hurt our confidence intervals. Let’s check:

confint(lm(response ~ treat, df, subset = group == 'A'))
                  2.5 %      97.5 %
(Intercept)  10.0482082 10.30293094
treatcontrol -0.2967421  0.06349024
confint(lm(response ~ treat, df, subset = group == 'B'))
                  2.5 %      97.5 %
(Intercept)  10.2324904 10.48601591
treatcontrol -0.3688051 -0.01026586
confint(lm(response ~ treat, df, subset = group == 'C'))
                 2.5 %      97.5 %
(Intercept)  10.303706 10.55182387
treatcontrol -0.309723  0.04116836
confint(lm(response ~ treat, df, subset = group == 'D'))
                  2.5 %      97.5 %
(Intercept)  10.4170354 10.66328177
treatcontrol -0.3825046 -0.03425957

The estimates of the treatment effect remain unbiased, but the confidence intervals are now about 0.35 wide, or $2 \times 1.96 \times \sigma \times \sqrt{2/(n/4)}$, which is what you would expect from a sample size that’s four times smaller. That’s twice as wide as when fitting a single model on the whole data that includes the group covariate. In fact, in most of the groups the treatment effect is no longer statistically significant.
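As a quick sanity check on those widths (plain arithmetic, not part of the original analysis), with $\sigma = 1$ and $n = 1000$ per arm:

2 * 1.96 * sqrt(2 / 1000)  # full model, n per arm: ~0.175
2 * 1.96 * sqrt(2 / 250)   # one slice, n/4 per arm: ~0.351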

I’m all for automated experiment analysis tools; but when the goal is to detect small effects, I think there’s currently no substitute for a manual analysis by a trained statistician (which I am not). Increasing sample sizes can only take you so far; remember that confidence intervals shrink only as $1/\sqrt{n}$. It is almost always better to look for a set of covariates correlated with the response and include them in your statistical model. And that’s what controlling for a covariate means.

Getting into data science

A while back I had the pleasure to address a team of user experience researchers at YouTube, and I got asked for a few resources that could help someone pretty good at science, math, and programming who wanted to get into data science. Here’s the list I gave. These have worked for me in the past, with the caveat that I’m very partial towards books.

Absolute must-reads

An Introduction to Statistical Learning 
Python Data Science Handbook

Both are freely available, outstanding books that cover a LOT of ground. The former uses R and goes somewhat deeper into the theory, while the latter uses Python and is perhaps more practical, covering IPython, NumPy, and the scikit-learn ecosystem.

Great too

Learning Statistics with R

One of the clearest expositions of fundamental statistical concepts I’ve read. It’s also well written and avoids dry, lifeless prose; the author does a great job at discussing the pros and cons of each technique, and frequently gives templates on how to present the results. One of the most memorable passages was his/her (read the text to understand…) rant against the use of p-values AFTER looking at the data. Free book.

R for Data Science

Hadley Wickham’s companion book to the tidyverse. Essential reading if you’re into R and use the tidyverse. More oriented towards data manipulation and programming than actual statistical modeling. Free book.

For the brave

The Elements of Statistical Learning

The “grown-up” version of ISLR (mentioned above). Covers a lot of theoretical ground, including a great discussion of the bias-variance tradeoff so beloved of interviewers. That book taught me to stop blindly normalizing covariates before running clustering algorithms.

Regression Modeling Strategies

Harrell is to statistics what Wickham is to data manipulation: the opinionated author of some amazing R packages that do a better job than the ones provided in base R. It’s a very dry text though, and probably better read in conjunction with some explanatory blog posts. Furthermore, it can be difficult to find resources online because these packages are not as widely adopted as the tidyverse.

Summer reading

Data Science from Scratch

Joel Grus is amazing. In this book he shows how to code (and test!) many constructs used in Data Science, culminating with a pseudo-relational database.

Oh you think you know statistics?

Statistical Evidence
Causal Inference in Statistics: A Primer

I’m including these two books because I think reading them will make you a better statistician. The former is a short but mind-blowing read that will make you rethink every analysis you’ve ever done. The latter is the must-read text if you’re going to do any kind of causal inference.

Non-book resources

Machine Learning

Deep Learning

AI nanodegree

These are online courses I’ve taken and can wholeheartedly recommend, especially the first one, which covers most of the concepts used in DS/ML. The Deep Learning specialization is more oriented towards neural networks, while Udacity’s AI nanodegree probably has little to do with DS but is a great introduction to topics like building game-playing AIs or path-finding algorithms.

Am I missing something? Feel free to add your own recommendations in the comments below.

The law of total probability applied to a conditional probability

Dear future self,

I’ve just lost (again) about half an hour of my life trying to find a vaguely remembered formula that generalizes the law of total probability to the case of conditional probabilities. Here it is. You’re welcome.

So what is the probability of dying from a lightning strike if you’re an American who knows this statistic?

The law of total probability says that if you can decompose the set of possible events into disjoint subsets (say $B$ and $\overline{B}$), then (with obvious generalization to more than two subsets):

$$\Pr(A) = \Pr(A \mid B) \Pr(B) + \Pr(A \mid \overline{B}) \Pr(\overline{B})$$

But what if you’re dealing with $\Pr(A \mid C)$ instead of just $\Pr(A)$? What’s the formula for the law of total probability in that case? What you’re searching for can be found by googling for “total law probability conditional”:

$$\Pr(A \mid C) = \Pr(A \mid B, C) \Pr(B \mid C) + \Pr(A \mid \overline{B}, C) \Pr(\overline{B} \mid C) $$

There’s a great derivation here: https://math.stackexchange.com/questions/2377816/applying-law-of-total-probability-to-conditional-probability.
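In short, the derivation just partitions $A \cap C$ into its intersections with $B$ and $\overline{B}$, then applies the definition of conditional probability twice:

$$\Pr(A \mid C) = \frac{\Pr(A \cap B \cap C) + \Pr(A \cap \overline{B} \cap C)}{\Pr(C)} = \Pr(A \mid B, C) \Pr(B \mid C) + \Pr(A \mid \overline{B}, C) \Pr(\overline{B} \mid C)$$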

XKCD on Data Science

I’ve been collecting all XKCD comics related to Data Science and/or Statistics. Here they are, but if you think I’m missing any please let me know in the comments. Use at will in your data visualizations but remember to attribute. Sorted in reverse chronological order.

Effect Size
K-Means Clustering
Methodology Trial
Euler Diagrams
Data Point
Change in Slope
Proxy Variable
Health Data
Garbage Math
Selection Bias
Spacecraft Debris Odds Ratio
Control Group
Confounding Variables
Bayes’ Theorem
Slope Hypothesis Testing
Flawed Data
Error Types
Modified Bayes’ Theorem
Curve-Fitting
Machine Learning
Linear Regression
P-Values
t Distribution
Increased Risk
Seashell
Log Scale
Cell Phones
Significant
Conditional Risk
Correlation
Boyfriend

Quick note about bootstrapping

Cross-validation—the act of keeping a subset of data to measure the performance of a model trained on the rest of the data—never sounded right to me.

It just doesn’t feel optimal to hold back an arbitrary fraction of the data when you train your model. Oh, and then you’re also supposed to keep another fraction for validating the model. So one set for training, one set for testing (to find the best model structure), and one set for validating the model, i.e. measuring its performance. That’s throwing away quite a lot of data that could be used for training.

That’s why I was excited to learn that bootstrapping provides an alternative. Bootstrapping is an elegant way to maximize the use of the available data, typically when you want to estimate confidence intervals or any other statistic.

In “Applied Predictive Modeling”, the authors discuss resampling techniques, including bootstrapping and cross-validation (p. 72). They explain that bootstrap validation consists in building N models on bootstrapped data and estimating their performance on the out-of-bag samples, i.e. the samples not used to build each model.

I think that may be an error. I don’t have Efron’s seminal book on the bootstrap anymore but I’m pretty sure the accuracy was evaluated against the entire data set, not just the out-of-bag samples.

In “Regression Modeling Strategies”, Frank Harrell describes model validation with the bootstrap thus (emphasis mine):

With the “simple bootstrap” [178, p. 247], one repeatedly fits the model in a bootstrap sample and evaluates the performance of the model on the original sample. The estimate of the likely performance of the final model on future data is estimated by the average of all of the indexes computed on the original sample.

Frank Harrell, Regression Modeling Strategies
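To make the distinction concrete, here’s a minimal R sketch of that “simple bootstrap” validation, with a made-up dataset, a linear model, and R-squared as the performance index (all illustrative choices of mine, not from either book):

# Simulate a toy dataset (purely illustrative).
set.seed(42)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

# Performance index: R-squared of a model's predictions on a given dataset.
r_squared <- function(model, data) {
  pred <- predict(model, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

# Simple bootstrap: fit on a bootstrap sample, evaluate on the ORIGINAL sample,
# and average the index over many repetitions.
indexes <- replicate(200, {
  boot_rows <- sample(nrow(df), replace = TRUE)
  fit <- lm(y ~ x, data = df[boot_rows, ])
  r_squared(fit, df)
})

mean(indexes)  # estimate of the model's likely performance on future data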

The most under-rated programming books

Ask any programmer what their favourite programming book is, and their answer will be one of the usual suspects: Code Complete, The Pragmatic Programmer, or Design Patterns. And rightly so; these are outstanding, highly regarded works that belong on every programmer’s bookshelf. (If you’re just starting to build up your bookshelf, Jeff Atwood has some great recommendations.)

But once you get past the “essential” books you’ll find that there are many incredibly good programming books out there that people don’t talk much about, but which were essential in taking me to the next level in my professional growth.

Here’s a partial list of such books; I’m sure there are many others, so feel free to mention them in the comments.

Growing Object-Oriented Software, Guided by Tests
