David's blog

Err and err and err but less and less and less

Author: David Lindelöf

Quick note about bootstrapping

Cross-validation (the act of keeping a subset of data to measure the performance of a model trained on the rest of the data) never sounded right to me. It just doesn’t feel optimal to retain an arbitrary fraction of the data when you train your model. Oh, and then you’re also supposed to keep another fraction for […]

The most under-rated programming books

Ask any programmer what their favourite programming book is, and their answer will be one of the usual suspects: Code Complete, The Pragmatic Programmer, or Design Patterns. And rightly so; these are outstanding and highly regarded works that belong on every programmer’s bookshelf. (If you’re just starting to build up your bookshelf, Jeff Atwood has some […]

Feature standardization considered harmful

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described. The same advice is frequently given […]

No, you have not controlled for confounders

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”. This approach is wrong. Very wrong. […]

A/B testing my resume

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems to be based on opinion or anecdote, with very little scientific basis. Well, let’s fix that. Being currently open to work, I thought this would be the right time to test this […]

Unit testing SQL with PySpark

Machine-learning applications frequently feature SQL queries, which range from simple projections to complex aggregations over several join operations. There doesn’t seem to be much guidance on how to verify that these queries are correct. All mainstream programming languages have embraced unit tests as the primary tool to verify the correctness of the language’s smallest building […]

Scraping real estate for fun

Here’s a fun weekend project: scrape the real estate classifieds of the website of your choice, and do some analytics on the data. I did just that last weekend, using the Scrapy Python library for web scraping, which I then let loose on one of the major real estate classifieds websites in Switzerland (can’t tell […]

Testing Scientific Software with Hypothesis

Writing unit tests for scientific software is challenging because frequently you don’t even know what the output should be. Unlike business software, which automates well-understood processes, here you cannot simply work your way through use case after use case, unit test after unit test. Your program is either correct or it isn’t, and you have […]

Monty Hall: a programmer’s explanation

I take it we’re all familiar with the infamous Monty Hall problem: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say […]

Reading S3 data from a local PySpark session

For the impatient: To read data on S3 into a local PySpark dataframe using temporary security credentials, you need to: (1) download a Spark distribution bundled with Hadoop 3.x; (2) build and install the pyspark package; (3) tell PySpark to use the hadoop-aws library; (4) configure the credentials. The problem: When you attempt to read S3 data from a local […]
