David's blog

xkcd and Data Science

By David Lindelöf. Posted on February 20, 2023. Posted in Uncategorized. 1 Comment.

I’ve been collecting all xkcd comics related to Data Science and/or Statistics. Here they are, but if you think I’m missing any, please let me know in the comments. Use them at will in your data visualisations, but remember to attribute. They are sorted in reverse chronological order.

Quick note about bootstrapping

By David Lindelöf. Posted on February 6, 2023. Posted in Uncategorized.

Cross-validation—the act of keeping a subset of data to measure the performance of a model trained on the rest of the data—never sounded right to me. It just doesn’t feel optimal to retain an arbitrary fraction of the data when you train your model. Oh and then you’re also supposed to keep another fraction for […]
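
As a rough illustration of the hold-out idea described in this excerpt, here is a minimal Python sketch using only the standard library. The names `holdout_split` and `mean_squared_error`, the 20% test fraction, and the toy "model" (predicting the training mean) are all assumptions made for the example; they are not taken from the post.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction of them for evaluation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(prediction, targets):
    """Average squared difference between one constant prediction and the targets."""
    return sum((y - prediction) ** 2 for y in targets) / len(targets)


# Toy data: 100 target values (hypothetical, for illustration only).
targets = [float(i % 7) for i in range(100)]

train, test = holdout_split(targets)

# "Training" here just means computing the mean of the training targets.
model = sum(train) / len(train)

# Performance is measured only on data the model never saw.
held_out_error = mean_squared_error(model, test)
```

The held-out rows never influence the fitted value, which is exactly the cost the excerpt complains about: a fifth of the data is not used for training.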

The most under-rated programming books

By David Lindelöf. Posted on June 16, 2021. Posted in Uncategorized.

Ask any programmer what their favourite programming book is, and their answer will be one of the usual suspects: Code Complete, The Pragmatic Programmer, or Design Patterns. And rightly so; these are outstanding and highly regarded works that belong on every programmer’s bookshelf. (If you’re just starting to build up your bookshelf, Jeff Atwood has some […]

Feature standardization considered harmful

By David Lindelöf. Posted on June 11, 2021. Posted in R. 1 Comment.

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described. The same advice is frequently given […]
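
For readers unfamiliar with the term, the standardization mentioned in this excerpt usually means rescaling each feature to zero mean and unit standard deviation. The sketch below shows that transformation in plain Python; the function name `standardize` and the sample temperatures are hypothetical, and the example only illustrates the common practice, not the post's argument about when it is harmful.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a feature column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# The same temperatures expressed in two physical units (made-up values).
celsius = [10.0, 20.0, 30.0]
kelvin = [c + 273.15 for c in celsius]

# After standardization the choice of unit no longer matters.
z_celsius = standardize(celsius)
z_kelvin = standardize(kelvin)
```

Both standardized columns agree up to floating-point error, which is the sense in which the result no longer depends on the physical units.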

No, you have not controlled for confounders

By David Lindelöf. Posted on February 10, 2021. Posted in R. 4 Comments.

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”. This approach is wrong. Very wrong. […]

A/B testing my resume

By David Lindelöf. Posted on November 24, 2020. Posted in R. 4 Comments.

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems largely opinion- or anecdote-based, with very little scientific basis. Well, let’s fix that. Being currently open to work, I thought this would be the right time to test this […]

Unit testing SQL with PySpark

By David Lindelöf. Posted on November 16, 2020. Posted in Python. 3 Comments.

Machine-learning applications frequently feature SQL queries, which range from simple projections to complex aggregations over several join operations. There doesn’t seem to be much guidance on how to verify that these queries are correct. All mainstream programming languages have embraced unit tests as the primary tool to verify the correctness of the language’s smallest building […]

Scraping real estate for fun

By David Lindelöf. Posted on November 6, 2020. Posted in Uncategorized.

Here’s a fun weekend project: scrape the real estate classifieds of the website of your choice, and do some analytics on the data. I did just that last weekend, using the Scrapy Python library for web scraping, which I then let loose on one of the major real estate classifieds websites in Switzerland (can’t tell […]

Testing Scientific Software with Hypothesis

By David Lindelöf. Posted on October 28, 2020. Posted in Python.

Writing unit tests for scientific software is challenging because frequently you don’t even know what the output should be. Unlike business software, which automates well-understood processes, here you cannot simply work your way through use case after use case, unit test after unit test. Your program is either correct or it isn’t, and you have […]

Monty Hall: a programmer’s explanation

By David Lindelöf. Posted on October 2, 2020. Posted in R. 3 Comments.

I take it we’re all familiar with the infamous Monty Hall problem: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say […]
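
The excerpt stops before the question is posed, so the sketch below assumes the standard formulation: the host always opens a goat door other than the contestant's pick and then offers a switch. It is a small Python simulation, not the post's own code (the post is filed under R); the function names `play` and `win_rate` are made up for the example.

```python
import random


def play(switch, rng):
    """Play one round and return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host knows where the car is and opens a goat door
    # that is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning with or without switching."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
```

Under these assumptions the estimates should settle near 1/3 for staying and 2/3 for switching.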
