DB4ALL: reformatting the mess that the Internet has become

I always try very hard to keep my posts within the main topic of this blog, namely computers in the context of building automation and simulation. Occasionally I fail, as with today’s post.

I’d like to tell you about a software company co-founded by a friend and fellow Toastmaster of mine, David Portabella. The company’s name is [DB4ALL](http://www.db4all.com), and they specialize in software for retrieving structured data from the web.

(Disclaimer: I am not affiliated with this company. I have had the opportunity to play with their tool, which I sincerely think is a high-quality one, but I derive no remuneration from writing this piece.)

They’ve developed ‘Webminer’, a Java library for extracting data in a structured manner from any website. Suppose, for instance, that you need a relational database with the data from the [CIA World Factbook](https://www.cia.gov/library/publications/the-world-factbook/). That data, though in the public domain, cannot be obtained in the form of a relational database, but only by clicking around on the CIA website. But with ‘Webminer’, the smart guys at DB4ALL can write a custom application that knows how to navigate such a website, ‘scrape’ and ‘normalize’ its data, and save it to a relational database for you.

On [DB4ALL’s website](http://www.db4all.com) you will find references to [the two most popular datasets](http://db4all.com/databases/) that they’ve mined: the above-mentioned CIA World Factbook, and the SourceForge database of open-source projects. Having such data in relational form is invaluable for any researcher or marketing analyst. Suppose, for instance, that you want hard data on the popularity of different programming languages over time in open-source projects. Well, with these datasets you have all you need to get started.

This, for instance, is a screenshot of the SourceForge dataset opened in Excel:

All in all, if you need publicly available data from a website stored in a relational database form, you should definitely consider using [DB4ALL](http://www.db4all.com)’s services.

Software engineering best practices in academia

As you might know, my primary background stems from the field of
academia and research, but over the past years my interests have
focused increasingly on software engineering.

With the benefit of hindsight, it’s clear to me today that if I had
known what I know today about software, I would without doubt have
been a much, much more productive researcher and graduate
student. It’s simply not possible today to carry out research without
programming. And research itself, to be considered valuable, requires
exactly the same qualities demanded from modern software
engineering: repeatability, versioning, and safe explorations.

I’m convinced today that researchers would benefit if practicing
software engineers gave them some feedback on how they solve
these problems. And I’ve often pondered whether I should begin writing
on software engineering topics that I think could be relevant for
scientists and/or engineers, particularly in the academic field. It
could even form the basis for a series of blog posts.

I’d rather ask you, dear reader, for advice on this. **Would you like
me to begin a series of posts on software engineering topics relevant
to scientists and engineers in academia?** And if yes, which particular
subjects would you like to see me discuss?

I’m really, really looking forward to reading your comments on this matter.

Weird certificate verification error

I spent most of the day today debugging a very mysterious error we
encountered when trying to programmatically call a web service over SSL
from Java.

Here is the source code with which we managed to reliably reproduce
the error:

    import javax.net.SocketFactory;
    import javax.net.ssl.SSLSocketFactory;
    import java.io.*;
    import java.net.Socket;

    public class SimpleSSLTest {
        public static void main(String[] args) throws IOException {
            int port = 443;
            String hostname = "somehost.com";

            // The default factory validates the server certificate against
            // the configured trust store.
            SocketFactory socketFactory = SSLSocketFactory.getDefault();
            Socket socket = socketFactory.createSocket(hostname, port);
            try {
                InputStream in = socket.getInputStream();
                OutputStream out = socket.getOutputStream();

                // Send a minimal HTTP request; the SSL handshake happens on first I/O.
                PrintWriter pout = new PrintWriter(new BufferedWriter(new OutputStreamWriter(out)));
                pout.println("GET / HTTP/1.0");
                pout.println();
                pout.flush();

                // Print the response line by line.
                BufferedReader bin = new BufferedReader(new InputStreamReader(in));
                String inputLine;
                while ((inputLine = bin.readLine()) != null) {
                    System.out.println(inputLine);
                }
            } finally {
                socket.close(); // also closes the socket's streams
            }
        }
    }

The website, `somehost.com`, used an SSL certificate signed by our own
internal certificate authority. That authority’s certificate was
stored in a `cacerts` Java keystore. We ran this code from the command
line thus:

    $ java -Djavax.net.ssl.trustStore=cacerts -cp target/classes/ SimpleSSLTest

When we ran this, the application bombed with an exception, the root
cause of which reads as follows:

    Caused by: java.security.cert.CertPathValidatorException: CA key usage check failed: keyCertSign bit is not set
        at sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(PKIXMasterCertPathValidator.java:153)
        at sun.security.provider.certpath.PKIXCertPathValidator.doValidate(PKIXCertPathValidator.java:325)
        at sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(PKIXCertPathValidator.java:187)
        at java.security.cert.CertPathValidator.validate(CertPathValidator.java:267)
        at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:261)
        ... 22 more

We tried to wrap our heads around this problem the whole day and
could make neither head nor tail of it, especially as we didn’t get
this error at all when targeting another host, using another
certificate but signed by the same certificate authority.
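
For anyone chasing a similar message, one place to look is the KeyUsage extension of the certificates in the trust store, since the exception complains about the `keyCertSign` bit. The following is only a diagnostic sketch, not something we ran at the time: the class name `KeyUsageCheck`, the default path `cacerts`, and the password `changeit` (the conventional keystore default) are all assumptions. It lists every X.509 certificate in a keystore and reports whether `keyCertSign` (bit 5 of the array returned by `X509Certificate.getKeyUsage()`) is set.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.util.Enumeration;

public class KeyUsageCheck {
    // Index of the keyCertSign bit in the array returned by getKeyUsage().
    private static final int KEY_CERT_SIGN = 5;

    /** Summarizes a key usage array; null means there is no KeyUsage extension. */
    public static String describe(boolean[] keyUsage) {
        if (keyUsage == null) {
            return "no KeyUsage extension";
        }
        boolean set = keyUsage.length > KEY_CERT_SIGN && keyUsage[KEY_CERT_SIGN];
        return set ? "keyCertSign set" : "keyCertSign NOT set";
    }

    public static void main(String[] args) throws Exception {
        // Assumptions: keystore path and password are taken from the command
        // line, falling back to "cacerts" and the conventional "changeit".
        String path = args.length > 0 ? args[0] : "cacerts";
        char[] password = (args.length > 1 ? args[1] : "changeit").toCharArray();

        KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
        InputStream in = new FileInputStream(path);
        try {
            store.load(in, password);
        } finally {
            in.close();
        }

        for (Enumeration<String> aliases = store.aliases(); aliases.hasMoreElements();) {
            String alias = aliases.nextElement();
            Certificate cert = store.getCertificate(alias);
            if (cert instanceof X509Certificate) {
                X509Certificate x509 = (X509Certificate) cert;
                // getBasicConstraints() returns -1 when the certificate is not a CA.
                System.out.println(alias + ": " + describe(x509.getKeyUsage())
                        + ", basicConstraints=" + x509.getBasicConstraints());
            }
        }
    }
}
```

Run it as `java KeyUsageCheck cacerts changeit`. A CA certificate that carries a KeyUsage extension without `keyCertSign` would be a plausible trigger for a strict validator, though that remains a guess for our case.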

As a last resort, I thought of checking exactly which version of Java
we were using. It turned out we were using OpenJDK, the version that
replaced Sun’s version in Ubuntu 10.04. Running the same code with
Sun’s Java SDK solved the problem, but we can’t confidently state that
we understand what was wrong. Perhaps a bug in OpenJDK’s
implementation of JSSE. Who knows.
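
Since the culprit turned out to be the JVM itself, it can save time to print which runtime is actually executing the code before digging any deeper. A minimal sketch, using only standard system properties (the class name `JavaRuntimeInfo` is made up for illustration):

```java
public class JavaRuntimeInfo {
    /** Returns a one-line description of the running JVM. */
    public static String describe() {
        return System.getProperty("java.vm.name")
                + " " + System.getProperty("java.version")
                + " (" + System.getProperty("java.vendor") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running `java -version` from the same shell gives similar information, but calling this from inside the application rules out the case where a launcher script or build tool picks a different JVM than the one on the `PATH`.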

If you’ve run into the same problem, feel free to leave a comment. I’d
be interested to hear if (and how) you’ve solved it.