This week, I didn’t work much but performed the following tasks and observed the following things :
Entity mapping
Mapped URIs/URLs of articles to their labels(which are similar to the the article’s title) There are 2 such properties dbp:name and rdf:labels. Since I wasn’t able to see the property dbp:name on every article’s page in the dbpedia database, so went with rdf:labels instead.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?title
WHERE { <http://dbpedia.org/resource/Berlin_Wall>
rdfs:label ?title .
FILTER (lang(?title) = 'en')
Fig1: Berlin Wall dbp:name property
Fig2: Berlin Wall rdf:label property
Fig3: Python Programming language dbp:p property
Fig4: Python Programming language rdf:label property
But there were some articles which had very few or no properties.
Fig5: No property about title/label/name of article
Fig6: No property about title/label/name of article
Entity recognition and extraction
In our case, hyperlinks act as entities only, as stated in the goal of the problem statement: predicate resolution of wiki links among entities.
However, after scraping all the text of the article Berlin Wall, I used spaCy library to find the entities. I used to believe that in most cases named entity will be a hyperlink in the article but it was not fully true. It was found that many entities that were recognised by spaCy lib were also the hyperlinks, which we found by running the SPARQL query using dbo:wikiPageWikiLink, in the scraped text.
I extracted entities by 2 methods:
- by running sparql query -> returns hyperlinks present in an article -> which act as entities in our case.
- And by using spacy library.
For e.g.
Iron curtain was recognised as an entity when we ran the SPARQL query but was not recognised as an entity by spaCy lib.
August didn’t come when we ran the SPARQL query but spaCy lib recogonised it as an entity.
Fig7: SPARQL Query dbo:wikiPageWikiLink hyperlinks present in wikipedia article
Fig8: SpaCy library for entity extraction and recognition
I think that the entities which are not returned by the SPARQL query hold relatively less significance since we have to identify the relation between the articles(which are entities in our case)
Today I am looking into coreference resolution.