R - David's blog

How Are P-values Distributed Under The Null?

By David Lindelöf | Posted on January 22, 2025 | Posted in R | 2 Comments

I sometimes use this fun interview question for aspiring data scientists: How are p-values distributed assuming the null hypothesis is true? I’ve heard a lot of reasonable answers, including: All very reasonable and intuitive answers which I would probably, at some point, have given myself. They’re also all wrong. The (perhaps surprising) answer is that […]
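The question can also be explored empirically. The following is a minimal sketch, not the post's own code: it assumes a two-sample Welch t-test on standard normal data (so the null hypothesis holds by construction), and the function name `simulate_null_pvalues`, the sample size, and the number of replications are illustrative choices. For a test with a continuous test statistic, the simulated p-values should spread evenly over [0, 1].

```r
# Simulate p-values when the null hypothesis is true.
# Assumption: two-sample t-test on standard normal data; n_sims and n are
# arbitrary illustrative values, not taken from the post.
set.seed(42)

simulate_null_pvalues <- function(n_sims = 5000, n = 30) {
  # Both samples come from the same distribution, so the null is true
  replicate(n_sims, t.test(rnorm(n), rnorm(n))$p.value)
}

p_values <- simulate_null_pvalues()

# Share of p-values falling in each tenth of [0, 1]
decile_share <- table(cut(p_values, breaks = seq(0, 1, by = 0.1),
                          include.lowest = TRUE)) / length(p_values)
print(round(decile_share, 3))

# Share of simulations that would be declared significant at the 5% level
print(mean(p_values < 0.05))
```

Each tenth should hold roughly 10% of the p-values, and roughly 5% of them should fall below 0.05, which is what a significance level of 5% means.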

Your Classifier Is Broken, But It Is Still Useful

By David Lindelöf | Posted on January 8, 2025 | Posted in R | 2 Comments

When you run a binary classifier over a population you get an estimate of the proportion of true positives in that population. This is known as the prevalence. But that estimate is biased, because no classifier is perfect. For example, if your classifier tells you that you have 20% of positive cases, but its precision […]

Is The Ratio of Normal Variables Normal?

By David Lindelöf | Posted on May 3, 2023 | Posted in R | 4 Comments

In Trustworthy Online Controlled Experiments I came across this quote, referring to a ratio metric $M = \frac{X}{Y}$, which states that: Because $X$ and $Y$ are jointly bivariate normal in the limit, $M$, as the ratio of the two averages, is also normally distributed. That’s only partially true. According to https://en.wikipedia.org/wiki/Ratio_distribution, the ratio of two […]
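A quick simulation shows why the claim deserves scrutiny. This is a sketch of one textbook special case, not the post's own analysis: it assumes $X$ and $Y$ are independent standard normals (both centered at zero), in which case the ratio follows a standard Cauchy distribution rather than a normal one. The variable names and sample size are illustrative.

```r
# Ratio of two independent standard normal variables.
# Assumption: zero means and unit variances, chosen because this special
# case has a known answer (standard Cauchy); n is arbitrary.
set.seed(123)
n <- 100000
ratio <- rnorm(n) / rnorm(n)

# Empirical quartiles of the ratio next to those of a standard Cauchy
empirical_quartiles <- unname(quantile(ratio, c(0.25, 0.5, 0.75)))
print(empirical_quartiles)
print(qcauchy(c(0.25, 0.5, 0.75)))

# Heavy tails: share of draws beyond +/- 10, compared with the Cauchy value
tail_share <- mean(abs(ratio) > 10)
print(tail_share)
print(2 * pcauchy(-10))
```

The tails are far heavier than any normal distribution with comparable quartiles would allow. When the denominator's mean is far from zero relative to its spread, the ratio behaves much more like a normal variable, which is the sense in which the quoted statement can be partially true.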

Controlling for covariates is not the same as “slicing”

By David Lindelöf | Posted on April 5, 2023 | Posted in R

To detect small effects in experiments you need to reduce the experimental noise as much as possible. You can do it by working with larger sample sizes, but that doesn’t scale well. A far better approach consists in controlling for covariates that are correlated with your response. I recently gave a talk at our company […]

Feature standardization considered harmful

By David Lindelöf | Posted on June 11, 2021 | Posted in R | 1 Comment

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described. The same advice is frequently given […]

No, you have not controlled for confounders

By David Lindelöf | Posted on February 10, 2021 | Posted in R | 4 Comments

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”. This approach is wrong. Very wrong. […]

A/B testing my resume

By David Lindelöf | Posted on November 24, 2020 | Posted in R | 4 Comments

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems to be opinion- or anecdote-based, with very little scientific basis. Well, let’s fix that. Being currently open to work, I thought this would be the right time to test this […]

Monty Hall: a programmer’s explanation

By David Lindelöf | Posted on October 2, 2020 | Posted in R | 3 Comments

I take it we’re all familiar with the infamous Monty Hall problem: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say […]
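For readers who prefer to check the puzzle by brute force, here is a minimal simulation sketch. It is not the post's own explanation: the function name `play_monty_hall` and the number of games are illustrative, and it assumes the standard rules, where the host always opens a door that hides a goat and is not the contestant's pick.

```r
# Monte Carlo simulation of the Monty Hall game.
# Assumption: standard rules; the host never opens the picked door
# or the door hiding the car.
set.seed(1)

play_monty_hall <- function(switch_door) {
  doors <- c("A", "B", "C")
  car <- sample(doors, 1)
  pick <- sample(doors, 1)
  # Doors the host is allowed to open
  openable <- setdiff(doors, c(pick, car))
  opened <- openable[sample.int(length(openable), 1)]
  if (switch_door) {
    # Switch to the one remaining closed door
    pick <- setdiff(doors, c(pick, opened))
  }
  pick == car
}

n_games <- 10000
win_rate_stay <- mean(replicate(n_games, play_monty_hall(FALSE)))
win_rate_switch <- mean(replicate(n_games, play_monty_hall(TRUE)))

print(c(stay = win_rate_stay, switch = win_rate_switch))
```

Staying should win about one game in three and switching about two in three.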

Machine Learning in R: Start with an End-to-End Test

By David Lindelöf | Posted on November 13, 2019 | Posted in R

As a data scientist, you will likely be asked one day to automate your analysis and port your models to production environments. When that happens you cross the blurry line between data science and software engineering, and become a machine learning engineer. I’d like to share a few tips we’re exploring at Expedia on how […]

Where to define S4 generics

By David Lindelöf | Posted on August 9, 2019 | Posted in R

You need to declare generic functions in S4 before you can define methods for them. If no definition exists you will see the following error: Generic functions are declared with the setGeneric() function, which must precede the call to setMethod(): But when you develop an R package you may have several classes that define their […]
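The ordering requirement can be illustrated with a small sketch. The class `Circle` and the generics `area` and `perimeter` are hypothetical names invented for this example and do not come from the post. The first part declares a generic and then a method for it. The second part tries to define a method for a generic that was never declared and captures the resulting error instead of stopping.

```r
library(methods)

# Hypothetical class used only for illustration
setClass("Circle", representation(radius = "numeric"))

# 1. Declare the generic first ...
setGeneric("area", function(shape) standardGeneric("area"))

# 2. ... then define a method for it
setMethod("area", "Circle", function(shape) pi * shape@radius^2)

circle_area <- area(new("Circle", radius = 2))
print(circle_area)

# Defining a method for an undeclared generic fails; capture the condition
missing_generic <- tryCatch(
  setMethod("perimeter", "Circle", function(shape) 2 * pi * shape@radius),
  error = function(e) e
)
print(inherits(missing_generic, "error"))
print(conditionMessage(missing_generic))
```

In a package, this ordering has to hold at load time as well, so the file that calls `setGeneric()` must be sourced before any file that calls `setMethod()` on that generic.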
