The plenary meeting was preceded by a Linked Open Data Meetup with talks from Springer, fluid Operations, and several LOD2 partners: Universität Leipzig, University of Mannheim, the Semantic Web Company, and Wolters Kluwer Deutschland GmbH (WKD).
WKD gave a presentation on the content production pipeline of their legal publications and their experiences in incorporating LOD2 technologies for content enrichment. This is a very successful LOD2 use case and demonstrates the value of linked data for the information industry.
Springer gave a talk about their interest in linked data for enriching the Lecture Notes in Computer Science product. Conference proceedings could also be enhanced with structured metadata in RDF. I asked about nanopublications. The comment was that content authors might perceive nanopublications as an extra imposition. On the other hand, in the life sciences there is a lot of enthusiasm for the idea. We will see; in any case, biology will likely lead the way for nanopublications. I referred Aliaksandr Birukou of Springer to the company Euretos and its parent S&T in Delft, Netherlands, and to Barend Mons, scientific director of NBIC, the Netherlands Bioinformatics Centre. These are among the founding fathers of the Nano Republic, as they themselves put it.
Sebastian Hellmann gave a talk on efforts to set up the DBpedia Foundation as a not-for-profit organization, hopefully in the next 10 days, to aid in the sustainability and growth of the DBpedia project. The Foundation would identify stakeholders, their interests, and ways to generate income to further improve DBpedia. Planned areas of improvement include the development of high-availability, value-added DBpedia services with quality of service (QoS) agreements for enterprise users; additional tools in the DBpedia stack to support improved and cost-efficient data curation and internationalization; and improved documentation, tutorials, and support to speed uptake.
I had a word with Peter Haase of fluid Operations about the Optique project and their cloud management offerings. The claim is to do ontology-directed querying over thousands of terabytes of heterogeneous data. This turns out to be a full-force attempt at large-scale SQL federation with ontology-directed query rewriting for covering OWL 2 QL semantics. With Ian Horrocks of Oxford leading the ontology side, the matter is in good hands. Still, the matter is not without its problems. Simple lookups can be directed to the data, but if there are terabytes of it, it is more likely that aggregations are what is desired. Federated aggregation tends to move a lot of data. So the problems are as they ever were. However, if the analytics are already done and stored in the relational space, finding these based on ontologies is a worthwhile thing for streamlining end-user access to information.
The LOD2 plenary itself was structured in the usual way, covering the work packages in two parallel tracks.
On the database side, the final victory will be won by going to an adaptive schema for RDF. We brought the RDF penalty against relational down to a factor of 2.5 for common analytics-style queries, e.g., the Star Schema Benchmark. This is a comparison to Virtuoso SQL, which offers very high performance in this workload, over 2x the speed of column-store pioneer MonetDB and 300x that of MySQL. So this is where matters stand. To move them significantly forward, exploitation of structure for guiding physical storage will be needed. Also, the project still has to deliver the 500 Gtriple results. The experiments around Christmas at CWI support the possibility, but they are not final. Putting triples into tables when the triples in fact form table-shaped structures, which is the case most of the time, may turn out to be necessary for this. At the least, this will be a significant help.
Be that as it may, using a table schema for regularly shaped data, while preserving the RDF quad flexibility, would essentially abolish the RDF tax and bring the LOD2 project to a glorious conclusion in August.
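To make the idea of table-shaped triples concrete, here is a minimal sketch of one way to split a triple set into regular and irregular parts. It groups subjects by the set of predicates they carry and turns any sufficiently common shape into table rows, leaving everything else as plain triples. This is an illustration only, not how Virtuoso or the CWI experiments do it; the function name `split_by_shape`, the `min_subjects` threshold, and the rule that multi-valued predicates stay out of the tables are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def split_by_shape(triples, min_subjects=2):
    """Split triples into table rows (regular part) and leftover triples.

    Hypothetical helper for illustration: a subject is "table-shaped" when
    each of its predicates is single-valued and at least `min_subjects`
    subjects share exactly the same predicate set.
    """
    # Collect (predicate, object) pairs per subject, keeping input order.
    props = defaultdict(list)
    for s, p, o in triples:
        props[s].append((p, o))

    def shape(pairs):
        preds = [p for p, _ in pairs]
        # A repeated predicate means a multi-valued property: not a flat row.
        if len(set(preds)) != len(preds):
            return None
        return tuple(sorted(preds))

    shape_counts = Counter(shape(pairs) for pairs in props.values())

    tables = defaultdict(list)  # shape -> list of rows (subject, col1, col2, ...)
    leftover = []               # triples that stay in the generic quad store
    for s, pairs in props.items():
        sh = shape(pairs)
        if sh is not None and shape_counts[sh] >= min_subjects:
            row = dict(pairs)
            tables[sh].append((s,) + tuple(row[p] for p in sh))
        else:
            leftover.extend((s, p, o) for p, o in pairs)
    return dict(tables), leftover


triples = [
    ("a", "name", "A"), ("a", "age", 30),
    ("b", "age", 25), ("b", "name", "B"),
    ("c", "name", "C"), ("c", "email", "c@x"), ("c", "email", "c@y"),
]
tables, leftover = split_by_shape(triples)
```

In this toy input, subjects `a` and `b` share the shape `("age", "name")` and become rows of one table, while `c`, with its two `email` values, stays behind as triples. The flexibility is preserved because nothing is dropped: whatever does not fit a table remains queryable in the generic form.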
I took the poetic license to compare the data journey into RDF and back to the Egyptian myth of Osiris: The data gets shut in a silo, then gets cut into 14 pieces, and is subsequently thrown into the Nile (i.e., the LOD cloud, or the CKAN catalog). Grief-stricken Isis sees what has become of her love: She patiently reassembles the pieces, reconstructing Osiris in fact so well that he sires her a child, hawk-headed Horus, who proceeds to reclaim his father’s honor. (See, Isis means Intelligent Structured Information Storage.)
I had many interesting conversations with Chris Bizer about his research in data integration, working with the 150M HTML tables in the Common Crawl. The idea is to resolve references and combine data from the tables. Interestingly enough, the data model in these situations is basically triples, but these are generally not stored as RDF but in Lucene. This makes sense due to the string-matching nature of the task. There appears to be an opportunity in bringing together the state of the art in database, meaning the very highly optimized column store and vectored execution in Virtuoso, with the search-style workload found in instance matching and other data integration tasks. The promise goes in the direction of very fast ETL and subsequent discovery of structural commonalities and enrichment possibilities. This is also not infinitely far from the schema discovery that one may do in order to adaptively optimize storage based on the data.
Volha Bryl gave a very good overview of the Mannheim work in the data integration domain. For example, learning data fusion rules from examples of successful conflict resolution seems very promising. Learning text extraction rules from examples is also interesting. The problem of data integration is that the tasks are very heterogeneous, and therefore data integration suites have very large numbers of distinct tools. This is labor intensive, but there is progress in automation. An error-free, or near enough, data product remains a case-by-case matter requiring human curation, but automatic methods seem, based on Volha’s and Chris’s presentations, to be in the ballpark for statistics.
Giovanni Tummarello of Insight/SindiceTech, always the life of the party, presented his Solr-based relational faceted browser. The idea is to show and drill down by facets over a set of related tables; in the demo, this was investments, investment targets, and investors. You can look at the data from any of the points and restrict the search based on attributes of any. Well, this is what a database does, right? That is so, but the Sindice tool is on top of Solr and actually materializes joins into a document. This blows up the data but has all the things colocated so it can even run from disk. We also talked about the Knowledge Graph package Sindice offers on the Google cloud, this time a Virtuoso application.
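The trade-off of materializing joins into documents can be illustrated with a small sketch. The code below denormalizes three related tables into one flat document per investment and then counts facet values with a plain scan, with no join at query time. It is not the Sindice implementation and does not use Solr; the field names, the `materialize` and `facet_counts` helpers, and the sample rows are all made up for the example.

```python
def materialize(investments, investors, targets):
    """Join at index time: one flat document per investment row.

    Hypothetical helper; attributes of the investor and the target are
    copied into every document, which is what blows up the data size.
    """
    investor_by_id = {r["id"]: r for r in investors}
    target_by_id = {r["id"]: r for r in targets}
    docs = []
    for inv in investments:
        doc = {"amount": inv["amount"]}
        for k, v in investor_by_id[inv["investor_id"]].items():
            doc[f"investor_{k}"] = v
        for k, v in target_by_id[inv["target_id"]].items():
            doc[f"target_{k}"] = v
        docs.append(doc)
    return docs


def facet_counts(docs, field, **filters):
    """Count values of `field` over documents matching all `filters`."""
    counts = {}
    for d in docs:
        if all(d.get(k) == v for k, v in filters.items()):
            counts[d[field]] = counts.get(d[field], 0) + 1
    return counts


investors = [{"id": 1, "country": "IE"}, {"id": 2, "country": "DE"}]
targets = [{"id": 10, "sector": "energy"}, {"id": 11, "sector": "web"}]
investments = [
    {"investor_id": 1, "target_id": 10, "amount": 5},
    {"investor_id": 1, "target_id": 11, "amount": 7},
    {"investor_id": 2, "target_id": 11, "amount": 3},
]
docs = materialize(investments, investors, targets)
by_sector = facet_counts(docs, "target_sector", investor_country="IE")
```

Restricting by an investor attribute and faceting on a target attribute touches only one document at a time, which is why the colocated layout can run from disk. The price is that each investor and target attribute is repeated in every document that references it.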
We hope that negotiations between SindiceTech and Insight (formerly DERI) around open sourcing the SPARQL editor and other items come to a successful conclusion. The SPARQL editor especially would be of general interest to the RDF community. It is noteworthy that there is no SPARQL query builder in common use out there. Even OpenLink’s own open source iSPARQL has been largely (but not entirely!) overlooked and misunderstood, though it has been available as part of the OpenLink Ajax Toolkit for several years. OK, a query builder is useful when there is a schema. But if the schema is an SQL one, as will be the case if RDF is adaptively stored, then any SQL query builder can be applied to the regular portion of the data. 40 years of calendar time and millennia of person years have gone into making SQL front ends, and these will become applicable overnight; Virtuoso does speak SQL, as you may know.
I had the breakout session about the database work in LOD2. What will be done is clear enough, the execution side is very good, and our coverage of the infinite space of query optimization continues to grow. One more revolution for storage may come about, as suggested above. There is not very much to discuss, just to execute. So I used the time to explain how you run
SELECT SUM(l_extendedprice) FROM lineitem, part WHERE l_partkey = p_partkey AND p_name LIKE '%green%'
Simple query, right? Sure, but application guys or sem-heads generally have no clue about how these in fact need to be done. I have the probably foolish belief that a little understanding of database, especially in the RDF space which does get hit by every query optimization problem, would be helpful. At least one would know what goes wrong. So I explained to Giovanni, who is in fact a good geek, that this is a hash join, and with only a little prompting he suggested that you should also put a Bloom filter in front of the hash. Good. So in the bar after dinner I was told I ought to teach. Maybe. But the students would have to be very fast and motivated. Anyway, the take-home message is that the DBMS must figure it out. In the SQL space this is easier, and of course, if most of RDF reduces to this, then RDF too will be more predictable in this department.
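For readers who want to see what the hash join with a Bloom filter in front looks like, here is a minimal sketch of the query above over in-memory rows. It is not Virtuoso's execution engine, which is vectored and far more elaborate; the `BloomFilter` class, its sizes, the use of SHA-256 for hashing, and the function name `sum_green_parts` are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; bit count and hash count are illustrative only."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int serves as the bit array

    def _positions(self, key):
        # Derive n_hashes bit positions by salting the key with an index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def sum_green_parts(lineitem, part):
    """SUM(l_extendedprice) over lineitems whose part name contains 'green'.

    lineitem: iterable of (l_partkey, l_extendedprice)
    part:     iterable of (p_partkey, p_name)
    """
    # Build side: the selective side, i.e. only the parts passing the LIKE.
    green_keys = set()      # the hash table; a set suffices since we need no part columns
    bloom = BloomFilter()
    for p_partkey, p_name in part:
        if "green" in p_name:
            green_keys.add(p_partkey)
            bloom.add(p_partkey)

    # Probe side: scan the big table, rejecting most rows with the cheap filter.
    total = 0
    for l_partkey, l_extendedprice in lineitem:
        if not bloom.might_contain(l_partkey):
            continue  # certainly no match, skip the hash table lookup
        if l_partkey in green_keys:  # confirm, since Bloom filters give false positives
            total += l_extendedprice
    return total


part = [(1, "forest green"), (2, "red"), (3, "green tea")]
lineitem = [(1, 10), (2, 20), (3, 30), (1, 5), (4, 100)]
total = sum_green_parts(lineitem, part)
```

The point of the Bloom filter is that it is small enough to stay in cache, so most non-matching lineitem rows are dropped before the larger hash table is touched. The result stays exact because every row that passes the filter is still checked against the hash table.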
I talked with Martin Kaltenböck of the Semantic Web Company about his brilliant networking accomplishments around organizing the European Data Forum and other activities. Martin is a great ambassador and lobbyist for linked data across Europe. Great work, also in generating visibility for LOD2.
The EU in general, thanks in great part to Stefano Bertolo’s long-term push in this direction, is putting increasing emphasis on measuring progress in the research it funds. This is one of the messages from the LOD2 review also. Database is the domain of performance race par excellence; the matters on that side are well attended to by LDBC and, of course, the unimpeachably authoritative TPC, among others. In other domains, measurement is harder, as it involves a human-curated ground truth for any extraction, linking, or other integration. There is good work in both Mannheim and Leipzig in these areas, and I may at some point take a closer look, but for now it is appropriate to stick to core database.