On the 23rd of October I will be presenting the paper "LOD Laundromat: A Uniform Way of Publishing Other People's Dirty Data" at this year's International Semantic Web Conference (ISWC).
The paper is about the LOD Laundromat, whose purpose is to clean and disseminate all Linked Open Data (LOD) in a standards-compliant way, in a single format and in a single location.
The LOD Washing Machine, which performs the data cleaning process, was written in SWI-Prolog 7 and makes use of the various Semweb libraries.
The LOD Laundromat has so far distilled over 13 billion triples from thousands of datasets. While loading so many datasets we came across many idiosyncrasies:
- missing archive headers,
- RDF/XML documents with multiple roots,
- RDFa embedded in HTML without version declaration,
- file extensions / HTTP Content-Type values that are not indicative of the serialization format used,
- undefined RDF prefixes,
- a great many syntax errors (e.g., unescaped newlines in literals in N-Triples).
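
To illustrate the point about unreliable file extensions and Content-Type values: one common workaround is to peek at the first part of a document and guess the serialization from its content. The following is only a minimal sketch in Python, not the LOD Washing Machine's actual SWI-Prolog code. The function name guess_rdf_format, the returned labels and the heuristics themselves are assumptions made for this example.

```python
import re

# One RDF term as it may appear in N-Triples/N-Quads: IRI, blank node, or literal.
TERM = r'(?:<[^<>\s]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^<>\s]*>)?)'
TRIPLE = re.compile(rf'{TERM}\s+{TERM}\s+{TERM}\s*\.')
QUAD = re.compile(rf'{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s*\.')
# Turtle/TriG directives, in both the '@prefix' and the SPARQL-style 'PREFIX' form.
DIRECTIVE = re.compile(r'(@prefix|@base|prefix|base)\s', re.IGNORECASE)


def guess_rdf_format(head: str) -> str:
    """Guess a serialization label from the first characters of a document.

    Hypothetical helper: it ignores the file extension and Content-Type
    entirely and looks only at the content it is given.
    """
    text = head.lstrip("\ufeff \t\r\n")  # drop a BOM and leading whitespace
    lower = text.lower()
    if "<rdf:rdf" in lower:
        return "rdfxml"
    if lower.startswith("<!doctype html") or "<html" in lower:
        return "html+rdfa"  # HTML that may carry embedded RDFa
    if lower.startswith("<?xml"):
        return "rdfxml"
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Decide on the first meaningful line only.
        if DIRECTIVE.match(line):
            return "turtle"
        if QUAD.fullmatch(line):
            return "nquads"
        if TRIPLE.fullmatch(line):
            return "ntriples"
        return "unknown"
    return "unknown"
```

A call such as guess_rdf_format(first_kilobyte) would then be tried before trusting the declared media type. A real implementation needs far more care: Turtle without directives, TriG, and documents that open with long comments all fall through to "unknown" in this sketch.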
Due to both the scale and the occurrence of such corner cases, we were able to detect several bugs, limitations and memory leaks in the Semweb libraries that would not have come up quickly during 'normal use' or by processing the limited set of W3C test cases.
I believe that with the LOD Laundromat we have shown that Semweb is now a very mature library collection. My thanks go to Jan and the other authors for making this project succeed!