David's blog

Err and err and err but less and less and less

R

Your Classifier Is Broken, But It Is Still Useful

When you run a binary classifier over a population you get an estimate of the proportion of positive cases in that population, known as the prevalence. But that estimate is biased, because no classifier is perfect. For example, if your classifier tells you that 20% of your cases are positive, but its precision […]

Is The Ratio of Normal Variables Normal?

In Trustworthy Online Controlled Experiments I came across this quote, referring to a ratio metric $M = \frac{X}{Y}$, which states that: Because $X$ and $Y$ are jointly bivariate normal in the limit, $M$, as the ratio of the two averages, is also normally distributed. That’s only partially true. According to https://en.wikipedia.org/wiki/Ratio_distribution, the ratio of two […]

Controlling for covariates is not the same as “slicing”

To detect small effects in experiments you need to reduce the experimental noise as much as possible. You can do it by working with larger sample sizes, but that doesn’t scale well. A far better approach is to control for covariates that are correlated with your response. I recently gave a talk at our company […]

Feature standardization considered harmful

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described. The same advice is frequently given […]

No, you have not controlled for confounders

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”. This approach is wrong. Very wrong. […]

A/B testing my resume

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems to be based on opinion or anecdote, with very little scientific basis. Well, let’s fix that. Being currently open to work, I thought this would be the right time to test this […]

Monty Hall: a programmer’s explanation

I take it we’re all familiar with the infamous Monty Hall problem: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say […]

Machine Learning in R: Start with an End-to-End Test

As a data scientist, you will likely be asked one day to automate your analysis and port your models to production environments. When that happens you cross the blurry line between data science and software engineering, and become a machine learning engineer. I’d like to share a few tips we’re exploring at Expedia on how […]

Where to define S4 generics

You need to declare generic functions in S4 before you can define methods for them. If no definition exists you will see the following error: Generic functions are declared with the setGeneric() function, which must precede the call to setMethod(): But when you develop an R package you may have several classes that define their […]

Connecting to SQL Server from R on a Mac with a Windows domain user

Connecting to an SQL Server instance as a Windows domain user is relatively straightforward when you run R on Windows, you have the right ODBC driver installed, and your network is set up properly. You normally don’t need to supply credentials, because the ODBC driver uses the built-in Windows authentication scheme. Assuming your odbcinst.ini file includes […]
