
DBpediaQueries

Notes on specific DBpedia queries and related observations.

§Objective

DBpedia adds a new dimension to the data in Wikipedia by making it queryable as a database. Infoboxes hold structured data that becomes one component of DBpedia. A challenge is that the data in Wikipedia infoboxes is not always consistent, so queries may not extract all of the applicable data. My objective here is to investigate why this happens and whether some of it can be fixed in an automated fashion.

§Summary

Some general results from the notes below:

  • Some attributes are just rearranged versions of other attributes (see Birth and Death below), or use underscores instead of camel case.
  • Some attributes are ambiguous for dbpedia purposes (like dbpedia2:location for skyscrapers)
  • Some attributes exist in both singular and plural (e.g., body_style(s) for yago:car; name(s) for yago:novelist)
  • Some attributes include specific dimensions (see Units section below), while others are either dimensionless or are specified using xsd: datatypes

§General

One way to organize the search for synonyms is to start with the list of Wikipedia infobox templates (for example, here's a list of the Geography infobox templates).

DBpedia does not linkify internal Wikipedia references (text enclosed in [[...]]); they are left as raw markup. For example, at http://dbpedia.org/page/Meritus_Mandarin_Singapore, see the rows for p:architect and p:period.
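As a rough illustration of what linkifying could involve, the sketch below pulls the targets out of [[...]] markup and maps each one to a resource URI. The function names and the sample strings in the usage note are hypothetical, and the space-to-underscore mapping is an assumption about how resource names are formed, not a description of DBpedia's extractor.

```python
import re

# Matches [[target]] and [[target|label]]; the label part is optional.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def wiki_links(text):
    """Return (target, label) pairs for each [[...]] reference in text."""
    pairs = []
    for match in WIKILINK.finditer(text):
        target = match.group(1).strip()
        # When no label is given, the target doubles as the label.
        label = (match.group(2) or target).strip()
        pairs.append((target, label))
    return pairs

def to_resource_uri(target):
    """Assumed mapping: spaces in the link target become underscores."""
    return "http://dbpedia.org/resource/" + target.replace(" ", "_")
```

For a made-up value such as "[[1971]]-[[1973]]", wiki_links returns one pair per link, so a date range could be split into its two endpoints before dereferencing.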

A simple property finder lists the distinct properties of all subjects (or resources) within a given category. This should simplify finding aliases and semantic overlaps.

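SELECT DISTINCT ?prop WHERE {
  ?subj rdf:type <CATEGORY_URI> .   # placeholder: substitute the class or category URI
  ?subj ?prop [] .
  # str() turns the property IRI into text so that regex() can match it.
  # Assumes dbpedia2: expands to http://dbpedia.org/property/.
  FILTER regex(str(?prop), "^http://dbpedia.org/property/")
}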

Here's an example query for South American Countries.

§Buildings

The first set of observations here is for buildings like the Empire State Building.

Here are the properties of all objects that are yago:skyscrapers. It would be nice to relate properties that tend (for some definition of "tend") to have the same datatype.

The dbpedia2:location values of all yago:skyscrapers vary widely. This suggests the attribute is ambiguous for DBpedia's purposes (though not for Wikipedia's), and it would be difficult for DBpedia to parse reliably.

§ Date Ranges

The Buildings and Periods query correctly dereferences single date links to dbpedia, but many of the dates are ranges; in those cases, just the markup is shown (see the General section above). Here is the same Buildings and Periods query restricted to yago:skyscrapers.

§ Number of floors vs. stories

Building height by floors: the properties floors, floor_count, and height_stories are all used. Most values carry the correct integer annotation, but others add text to the base value, like "40 and 2 basements" for http://dbpedia.org/page/Meritus_Mandarin_Singapore, and thus are not parsed correctly.

In several places, extra information is packed into the infobox field, as in the example above. If we assert that the field should hold an integer (because most of the other results do), parsing the first integer out of the string fixes many of the problems, although the annotation still needs to be captured separately. For some results of this query, like "70 (North Tower); 63 (South Tower)s" for http://dbpedia.org/page/The_Sail_%40_Marina_Bay, this parsing only partly works: it recovers the first value and leaves the second in the leftover text.
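A minimal sketch of the "first integer plus annotation" idea, assuming the raw field value arrives as a plain string. The name parse_count is hypothetical, and treating everything other than the first run of digits as the annotation is one possible choice, not an established rule.

```python
import re

def parse_count(raw):
    """Split a field value into (first integer, leftover annotation).

    Returns (None, raw) when the value contains no digits at all.
    """
    match = re.search(r"\d+", raw)
    if match is None:
        return None, raw.strip()
    value = int(match.group())
    # Keep whatever surrounds the first integer so no information is lost.
    annotation = (raw[:match.start()] + raw[match.end():]).strip()
    return value, annotation
```

With the two values quoted above, parse_count("40 and 2 basements") yields 40 with "and 2 basements" as the annotation, while the two-tower value yields 70 and keeps the South Tower count inside the annotation, which is why the approach only partly works there.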

§ Units

Building height as a distance measure is expressed through several properties: dbpedia2:height, dbpedia2:height_feet, and dbpedia2:height_meters. Here is the Building height units query that shows the combinations.

The most appropriate approach seems to be to specify units in the object (e.g., "259"^^dbpedia:units/Meter). A suggestion for consolidating different units is to look for a property suffix that names a length (or other measurement) unit: _feet, _meters, etc. This rule assumes that these suffixes occur for no other reason.
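The suffix rule could be sketched as follows. The table holds only the two length suffixes named above, the function names are hypothetical, and the conversion to meters (1 foot = 0.3048 m) is an added convenience rather than part of the suggestion itself.

```python
# Suffix -> (unit name, factor to meters). Only the suffixes named in these notes.
LENGTH_SUFFIXES = {
    "_feet": ("foot", 0.3048),
    "_meters": ("meter", 1.0),
}

def split_unit(prop):
    """Return (base property, unit name, factor); unit and factor are None if no suffix matches."""
    for suffix, (unit, factor) in LENGTH_SUFFIXES.items():
        if prop.endswith(suffix):
            return prop[:-len(suffix)], unit, factor
    return prop, None, None

def to_meters(prop, value):
    """Return (base property, value in meters), or (prop, None) when the unit is unknown."""
    base, _unit, factor = split_unit(prop)
    if factor is None:
        return base, None
    return base, value * factor
```

A property without a recognized suffix, such as plain height, comes back unchanged with no unit, which leaves those values to be handled by some other rule.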

For area, at least the following properties can be identified: dbpedia2:area, dbpedia2:area_ft, dbpedia2:area_magnitude, and dbpedia2:area_total. There are also dbpedia2:area_land and dbpedia2:area_water (see the following query).

§Locations

Even within DBpedia, there are multiple ways buildings (and related entities) are associated with places. A location such as Manhattan shows building associations as either "city" or "location" properties. Also, the "headquarters" property for an organization (e.g., United Nations) should be related to a building.

§People

§Birth and Death

Querying locations such as Manhattan, or better yet the entire U.S., shows a variety of ways to refer to birth and death locations for people:

Birth: Origin, birth_place, birthplace, placebirth, place_birth, Birthplace, place_of_birth, PLACE_OF_BIRTH, placeofbirth
Death: death, deathplace, death_place, place_of_death, PLACE_OF_DEATH

This list of the properties of novelists shows multiple ways to refer to first name (First, Given, Given1(/Given2), foaf:givenname), last name (Last, Surname, Surname1(/Surname2), foaf:surname), and full name (Name, NAME, foaf:name).

Ideas to fix this in the general case:

  • downcase all property names
  • split property names on "_" into words, remove common words (like "of" in English), then look for equivalences between them.
  • assume a_of_b and b_a or ba are all equivalent property names
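The three ideas above can be combined into one grouping pass, sketched below. Everything here is an assumption made for illustration: the function names, the one-word stopword list, and in particular the trick of splitting run-together names (like placeofbirth) using words learned from the underscore-separated names in the same list.

```python
STOPWORDS = {"of"}

def words(prop):
    """Downcase, split on "_", and drop common words."""
    return [w for w in prop.lower().split("_") if w and w not in STOPWORDS]

def segment(token, vocab):
    """Split a run-together token into known words, or return None if it cannot be split."""
    if token == "":
        return []
    for word in vocab:
        if token.startswith(word):
            rest = segment(token[len(word):], vocab)
            if rest is not None:
                return [word] + rest
    return None

def group_aliases(props):
    """Group property names whose word sets match regardless of order."""
    # Learn the vocabulary from names that are already split by underscores.
    vocab = {w for p in props if len(words(p)) > 1 for w in words(p)}
    groups = {}
    for prop in props:
        ws = words(prop)
        if len(ws) == 1:
            parts = segment(ws[0], vocab | STOPWORDS)
            if parts is not None:
                ws = [w for w in parts if w not in STOPWORDS] or ws
        # Sorting makes a_of_b, b_a, and ba share one key.
        groups.setdefault(tuple(sorted(ws)), []).append(prop)
    return groups
```

Run on the birth list above, this puts every name except Origin under the key ("birth", "place"). Origin stays in its own group, since no rule based on spelling can tell that it means the same thing.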

§Sports Cars

Here are the properties used by members of the yago class sports_car, and a query on mid-engined vehicles.

Here are all yago:cars and their properties.

The body styles of cars (the first 200 at least) come from complex infoboxes. How should DBpedia interpret this?

§Locomotives

Here is a list of subjects that use the locomotive infobox template (Template:infobox_locomotive). The resulting properties of this collection of locomotives show few aliased/synonymous properties ("released" and "release_date" are the only pair I could find).

Alternatively, the list of yago:locomotives and the resulting properties show a wider variety of subjects and attributes.

§Aircraft

§Boeing 747

Observations from comparing the dbpedia test output with the Wikipedia infobox:

  • property/caption: the weighting algorithm misses here. For caption, you always want to capture the entire string.
  • property/firstFlight comes out as <http://dbpedia.org/resource/1969-02-09>. The infobox entry is simply 1969-02-09; it would be good to capture this as an xsd:date
  • property/unitCost is cut off at the first <br>
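For the firstFlight case, one possible post-processing step is to recognize resource URIs whose local name is a full ISO date and rewrite them as typed literals. This is a sketch under that assumption; the function name is hypothetical and the output is simply a literal written with the xsd: prefix.

```python
import re
from datetime import date

# A resource URI whose local name is exactly YYYY-MM-DD.
DATE_RESOURCE = re.compile(r"^http://dbpedia\.org/resource/(\d{4})-(\d{2})-(\d{2})$")

def as_xsd_date(uri):
    """Return an xsd:date typed literal for a date-shaped resource URI, else None."""
    match = DATE_RESOURCE.match(uri)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        value = date(year, month, day)  # rejects impossible dates such as 02-30
    except ValueError:
        return None
    return '"%s"^^xsd:date' % value.isoformat()
```

Validating through datetime.date keeps strings that merely look like dates from being converted, at the cost of not handling partial dates such as a bare year.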
Page last modified on August 02, 2007, at 12:55 AM