It's triples all the way down
Posted at 12:52
The meeting, hosted by our partner InfAI e. V., took place on the 14th to the
15th of December at the University of Leipzig.
A total of 29 attendees, representing 15 partners, discussed and reviewed the progress of all work packages in 2016 and planned the activities and workshops taking place in the next six months.
On the second day we talked about several societal challenge pilots, in fields such as agriculture (with AgroKnow), transport and security. It was the last plenary of this year, and we thank everybody for their work in 2016. Big Data Europe and our partners are looking forward to 2017.
The next Plenary Meeting will be hosted by VU Amsterdam and will take place in Amsterdam, in June 2017.
Posted at 23:59
Dear all,
The Smart Data Analytics (SDA) group / AKSW is very happy to announce SANSA 0.1 – the initial release of the Scalable Semantic Analytics Stack. SANSA combines distributed computing and semantic technologies to provide powerful machine learning, inference and querying capabilities for large knowledge graphs.
Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases
You can find the FAQ and usage examples at http://sansa-stack.net/faq/.
The following features are currently supported by SANSA:
We want to thank everyone who helped to create this release, in particular, the projects Big Data Europe, HOBBIT and SAKE.
Kind regards,
Posted at 14:41
Our paper, “LODStats: The Data Web Census Dataset”, won the award for Best Resources Paper at the recent ISWC 2016 conference in Kobe, Japan, the premier international forum for the Semantic Web and Linked Data community. The paper presents the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web.
Congrats to Ivan Ermilov, Jens Lehmann, Michael Martin and Sören Auer.
Please find the complete list of winners here.
Posted at 14:05

Knowledge graphs are structured representations of facts where nodes are real-world entities or events and edges are the associations between pairs of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques construct high-quality knowledge graphs but are expensive, time-consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs, but the extracted information can be of poor quality due to the presence of dubious facts.
An extracted fact is dubious if it is incorrect, inexact, or correct but lacking evidence. A fact might be dubious because of errors made by NLP extraction techniques, improper design of the internal components of the system, the choice of learning techniques (semi-supervised or unsupervised), the relatively poor quality of heuristics, or the syntactic complexity of the underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts in a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning, and entity linking.
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Paul McNamee (JHU), Partha Talukdar (IISc, India)
Posted at 02:25
Diego Moussallem will discuss the paper “Probabilistic Bag-Of-Hyperlinks Model for Entity Linking” by Octavian-Eugen Ganea et al., which was accepted at WWW 2016.
Abstract: Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referenced as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods.
Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web
Afterward, René Speck will present the paper “Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web” by Sebastian Krause et al., which was accepted at ISWC 2012.
Abstract: We present a large-scale relation extraction (RE) system which learns grammar-based RE rules from the Web by utilizing large numbers of relation instances as seed. Our goal is to obtain rule sets large enough to cover the actual range of linguistic variation, thus tackling the long-tail problem of real-world applications. A variant of distant supervision learns several relations in parallel, enabling a new method of rule filtering. The system detects both binary and n-ary relations. We target 39 relations from Freebase, for which 3M sentences extracted from 20M web pages serve as the basis for learning an average of 40K distinctive rules per relation. Employing an efficient dependency parser, the average run time for each relation is only 19 hours. We compare these rules with ones learned from local corpora of different sizes and demonstrate that the Web is indeed needed for a good coverage of linguistic variation.
This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
Posted at 11:30
In previous articles I have covered multiple ways to create training corpuses for unsupervised learning, and positive and negative training sets for supervised learning 1, 2, 3, using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets. Each of these corpuses or sets may yield different predictive powers depending on the task at hand.
So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:
Now we will introduce a third way to create a different kind of training corpus:
Aspects are aggregations of entities that are grouped according to characteristics other than their direct types. Aspects help to group related entities by situation, and not by identity nor definition. They are another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not all aspects relate to a given entity.
To continue with the musical domain, there exist two aspects of interest:
What we will do first is to query the KBpedia Knowledge Graph using the SPARQL query language to get the list of all of the KBpedia reference concepts that are related to the Music or the Genre aspects. Then, for each of these reference concepts, we will count the number of named entities that can be reached in the complete KBpedia structure.
prefix kko: <http://kbpedia.org/ontologies/kko#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix schema: <http://schema.org/>

select ?class (count(distinct ?entity) as ?nb)
from <http://dbpedia.org>
from <http://www.uspto.gov>
from <http://wikidata.org>
from <http://kbpedia.org/1.10/>
where
{
  ?entity dcterms:subject ?category .
  graph <http://kbpedia.org/1.10/>
  {
    {?category kko:hasMusicAspect ?class .}
    union
    {?category kko:hasGenre ?class .}
  }
}
group by ?class
order by desc(?nb)
| reference concept | nb |
|---|---|
| http://kbpedia.org/kko/rc/Album-CW | 128772 |
| http://kbpedia.org/kko/rc/Song-CW | 74886 |
| http://kbpedia.org/kko/rc/Music | 51006 |
| http://kbpedia.org/kko/rc/Single | 50661 |
| http://kbpedia.org/kko/rc/RecordCompany | 5695 |
| http://kbpedia.org/kko/rc/MusicalComposition | 5272 |
| http://kbpedia.org/kko/rc/MovieSoundtrack | 2919 |
| http://kbpedia.org/kko/rc/Lyric-WordsToSong | 2374 |
| http://kbpedia.org/kko/rc/Band-MusicGroup | 2185 |
| http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup | 2078 |
| http://kbpedia.org/kko/rc/Ensemble | 1438 |
| http://kbpedia.org/kko/rc/Orchestra | 1380 |
| http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup | 1335 |
| http://kbpedia.org/kko/rc/Choir | 754 |
| http://kbpedia.org/kko/rc/Concerto | 424 |
| http://kbpedia.org/kko/rc/Symphony | 299 |
| http://kbpedia.org/kko/rc/Singing | 154 |
Seventeen KBpedia reference concepts are related to the two aspects we want to focus on. The next step is to take these 17 reference concepts and create a new domain corpus with them. We will use the new version of KBpedia to create, by inference, the full set of reference concepts that will scope our domain.
Next we will try to use this information to create two totally different kinds of training corpuses:
The first training corpus we want to test is one that uses the linkage between KBpedia reference concepts and Wikipedia pages. The first step is to generate the domain training corpus with the 17 seed reference concepts and then to infer other related reference concepts.
(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Song-CW"
                       "http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Single"
                       "http://kbpedia.org/kko/rc/RecordCompany"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MovieSoundtrack"
                       "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                       "http://kbpedia.org/kko/rc/Band-MusicGroup"
                       "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Ensemble"
                       "http://kbpedia.org/kko/rc/Orchestra"
                       "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Choir"
                       "http://kbpedia.org/kko/rc/Symphony"
                       "http://kbpedia.org/kko/rc/Singing"
                       "http://kbpedia.org/kko/rc/Concerto"]
  kbpedia
  "resources/aspects-concept-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

(create-pruned-pages-dictionary-csv "resources/aspects-concept-corpus-dictionary.csv"
                                    "resources/aspects-concept-corpus-dictionary.pruned.csv"
                                    "resources/aspects-corpus-normalized/")
Once pruned, we end up with a domain that has 108 reference concepts, which will enable us to create models with 108 features. The next step is to create the actual semantic interpreter and the SVM models:
;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/aspects-concept-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-concept-pruned"
                            "resources/semantic-interpreters/aspects-concept-pruned/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-concept-pruned/"
                         :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.concept.pruned"
                 "resources/svm/aspects-concept-pruned/"
                 :weights nil :v nil :c 1 :algorithm :l2l2)
Then we have to evaluate this new model using the gold standard:
(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive: 28
False positive: 0
True negative: 923
False negative: 66
Precision: 1.0
Recall: 0.29787233
Accuracy: 0.93510324
F1: 0.45901638
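To make these numbers concrete, each metric follows directly from the four confusion-matrix counts above. Here is a small self-contained Clojure sketch that reproduces them (the helper functions are mine, for illustration only; they are not part of the cognonto-esa API):

;; Standard classification metrics from confusion-matrix counts.
;; Illustrative helpers; not part of cognonto-esa.
(defn precision [tp fp] (/ tp (+ tp fp)))
(defn recall    [tp fn'] (/ tp (+ tp fn')))
(defn accuracy  [tp fp tn fn'] (/ (+ tp tn) (+ tp fp tn fn')))
(defn f1        [p r] (/ (* 2 p r) (+ p r)))

(let [tp 28 fp 0 tn 923 fn' 66
      p (double (precision tp fp))
      r (double (recall tp fn'))]
  {:precision p                                 ;; => 1.0
   :recall    r                                 ;; => ~0.2979
   :accuracy  (double (accuracy tp fp tn fn'))  ;; => ~0.9351
   :f1        (f1 p r)})                        ;; => ~0.4590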
Now let’s try to find better hyperparameters using grid search:
(svm-grid-search "grid-search-aspects-concept-pruned-tests" "resources/svm/aspects-concept-pruned/" "resources/gold-standard-full.csv" :selection-metric :f1 :grid-parameters [{:c [1 2 4 16 256] :e [0.001 0.01 0.1] :algorithm [:l2l2] :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.84444445
:c 1
:e 0.001
:algorithm :l2l2
:weight 30}
After running the grid search with these initial broad-range values, we found a configuration that gives us 0.8444 for the F1 score. So far, this is the best score we have gotten to date for the full gold standard 2, 3. Let’s see all of the metrics for this configuration:
(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/" :weights {1 30.0} :v nil :c 1 :e 0.001 :algorithm :l2l2) (evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive: 76
False positive: 10
True negative: 913
False negative: 18
Precision: 0.88372093
Recall: 0.80851066
Accuracy: 0.972468
F1: 0.84444445
These results are also the best balance between precision and recall that we have gotten so far 2, 3. Better precision can be obtained if necessary, but only at the expense of lower recall.
Let’s take a look at the improvements we got compared to the previous training corpuses we had:
| metric | improvement |
|---|---|
| Precision | +4.16% |
| Recall | +35.72% |
| Accuracy | +2.06% |
| F1 | +20.63% |

This new training corpus based on the KBpedia aspects, after hyperparameter optimization, increased all the metrics we calculate. The most striking improvement is the recall, which improved by more than 35%.
The next training corpus we want to test is one that uses the linkage between KBpedia reference concepts and linked external vocabularies to get a series of linked named entities as the positive training set for each of the features of the model.
The first thing to do is to create the positive training set populated with named entities related to the reference concepts. We will get a random sample of ~50 named entities per reference concept:
(require '[cognonto-rdf.query :as query])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.string :as string])

(defn generate-domain-by-rc [rc domain-file nb]
  (with-open [out-file (io/writer domain-file :append true)]
    (doall
     (->> (query/select
           (str "prefix kko: <http://kbpedia.org/ontologies/kko#>
                 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                 prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                 prefix dcterms: <http://purl.org/dc/terms/>

                 select distinct ?entity
                 from <http://dbpedia.org>
                 from <http://www.uspto.gov>
                 from <http://wikidata.org>
                 from <http://kbpedia.org/1.10/>
                 where
                 {
                   ?entity dcterms:subject ?category .
                   graph <http://kbpedia.org/1.10/>
                   {
                     ?category ?aspectProperty <" rc "> .
                   }
                 }
                 ORDER BY RAND()
                 LIMIT " nb)
           kb-connection)
          (map (fn [entity]
                 (csv/write-csv out-file
                                [[(string/replace (:value (:entity entity)) "http://dbpedia.org/resource/" "")
                                  (string/replace rc "http://kbpedia.org/kko/rc/" "")]])))))))

(defn generate-domain-by-rcs [rcs domain-file nb-per-rc]
  (with-open [out-file (io/writer domain-file)]
    (csv/write-csv out-file [["wikipedia-page" "kbpedia-rc"]])
    (doseq [rc rcs]
      (generate-domain-by-rc rc domain-file nb-per-rc))))

(generate-domain-by-rcs ["http://kbpedia.org/kko/rc/" "http://kbpedia.org/kko/rc/Concerto" "http://kbpedia.org/kko/rc/DoubleAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Psychedelic" "http://kbpedia.org/kko/rc/MusicalComposition-Religious" "http://kbpedia.org/kko/rc/PunkMusic"
                         "http://kbpedia.org/kko/rc/BluesMusic" "http://kbpedia.org/kko/rc/HeavyMetalMusic" "http://kbpedia.org/kko/rc/PostPunkMusic"
                         "http://kbpedia.org/kko/rc/CountryRockMusic" "http://kbpedia.org/kko/rc/BarbershopQuartet-MusicGroup" "http://kbpedia.org/kko/rc/FolkMusic"
                         "http://kbpedia.org/kko/rc/Verse" "http://kbpedia.org/kko/rc/RockBand" "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                         "http://kbpedia.org/kko/rc/Refrain" "http://kbpedia.org/kko/rc/MusicalComposition-GangstaRap" "http://kbpedia.org/kko/rc/MusicalComposition-Klezmer"
                         "http://kbpedia.org/kko/rc/HouseMusic" "http://kbpedia.org/kko/rc/MusicalComposition-AlternativeCountry" "http://kbpedia.org/kko/rc/PsychedelicMusic"
                         "http://kbpedia.org/kko/rc/ReggaeMusic" "http://kbpedia.org/kko/rc/AlternativeRockBand" "http://kbpedia.org/kko/rc/AlternativeRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Trance" "http://kbpedia.org/kko/rc/Ensemble" "http://kbpedia.org/kko/rc/RhythmAndBluesMusic"
                         "http://kbpedia.org/kko/rc/NewAgeMusic" "http://kbpedia.org/kko/rc/RockabillyMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Blues"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Opera" "http://kbpedia.org/kko/rc/Choir" "http://kbpedia.org/kko/rc/SurfMusic"
                         "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup" "http://kbpedia.org/kko/rc/MusicalComposition-JazzRock" "http://kbpedia.org/kko/rc/MusicalComposition-Country"
                         "http://kbpedia.org/kko/rc/CountryMusic" "http://kbpedia.org/kko/rc/MusicalComposition-PopRock" "http://kbpedia.org/kko/rc/MusicalComposition-Romantic"
                         "http://kbpedia.org/kko/rc/Recitative" "http://kbpedia.org/kko/rc/Chorus" "http://kbpedia.org/kko/rc/FusionMusic"
                         "http://kbpedia.org/kko/rc/MovieSoundtrack" "http://kbpedia.org/kko/rc/GreatestHitsAlbum-CW" "http://kbpedia.org/kko/rc/MusicalComposition-Christian"
                         "http://kbpedia.org/kko/rc/ClassicalMusic-Baroque" "http://kbpedia.org/kko/rc/MusicalComposition-NewAge" "http://kbpedia.org/kko/rc/MusicalComposition-TraditionalPop"
                         "http://kbpedia.org/kko/rc/TranceMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Celtic" "http://kbpedia.org/kko/rc/LoungeMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Reggae" "http://kbpedia.org/kko/rc/MusicalComposition-Baroque" "http://kbpedia.org/kko/rc/Trio-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/Symphony" "http://kbpedia.org/kko/rc/MusicalComposition-RockAndRoll" "http://kbpedia.org/kko/rc/PopRockMusic"
                         "http://kbpedia.org/kko/rc/IndustrialMusic" "http://kbpedia.org/kko/rc/JazzMusic" "http://kbpedia.org/kko/rc/MusicalChord"
                         "http://kbpedia.org/kko/rc/ProgressiveRockMusic" "http://kbpedia.org/kko/rc/GothicMusic" "http://kbpedia.org/kko/rc/LiveAlbum-CW"
                         "http://kbpedia.org/kko/rc/NewWaveMusic" "http://kbpedia.org/kko/rc/NationalAnthem" "http://kbpedia.org/kko/rc/OldieSong"
                         "http://kbpedia.org/kko/rc/Song-Sung" "http://kbpedia.org/kko/rc/RockMusic" "http://kbpedia.org/kko/rc/Aria"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Disco" "http://kbpedia.org/kko/rc/GospelMusic" "http://kbpedia.org/kko/rc/BluegrassMusic"
                         "http://kbpedia.org/kko/rc/FolkRockMusic" "http://kbpedia.org/kko/rc/RockAndRollMusic" "http://kbpedia.org/kko/rc/Opera-CW"
                         "http://kbpedia.org/kko/rc/HitSong-CW" "http://kbpedia.org/kko/rc/Tune" "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/RapMusic" "http://kbpedia.org/kko/rc/RecordCompany" "http://kbpedia.org/kko/rc/MusicalComposition-ACappella"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Electronica" "http://kbpedia.org/kko/rc/Music" "http://kbpedia.org/kko/rc/GlamRockMusic"
                         "http://kbpedia.org/kko/rc/LoveSong" "http://kbpedia.org/kko/rc/MusicalComposition-Gothic" "http://kbpedia.org/kko/rc/MarchingBand"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Punk" "http://kbpedia.org/kko/rc/BluesRockMusic" "http://kbpedia.org/kko/rc/TechnoMusic"
                         "http://kbpedia.org/kko/rc/SoulMusic" "http://kbpedia.org/kko/rc/ChamberMusicComposition" "http://kbpedia.org/kko/rc/Requiem"
                         "http://kbpedia.org/kko/rc/MusicalComposition" "http://kbpedia.org/kko/rc/ElectronicMusic" "http://kbpedia.org/kko/rc/CompositionMovement"
                         "http://kbpedia.org/kko/rc/StringQuartet-MusicGroup" "http://kbpedia.org/kko/rc/Riff" "http://kbpedia.org/kko/rc/Anthem"
                         "http://kbpedia.org/kko/rc/HardRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-BluesRock" "http://kbpedia.org/kko/rc/MusicalComposition-Cyberpunk"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Industrial" "http://kbpedia.org/kko/rc/MusicalComposition-Funk" "http://kbpedia.org/kko/rc/Album-CW"
                         "http://kbpedia.org/kko/rc/HipHopMusic" "http://kbpedia.org/kko/rc/Single" "http://kbpedia.org/kko/rc/Singing"
                         "http://kbpedia.org/kko/rc/SwingMusic" "http://kbpedia.org/kko/rc/Song-CW" "http://kbpedia.org/kko/rc/SalsaMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Jazz" "http://kbpedia.org/kko/rc/ClassicalMusic" "http://kbpedia.org/kko/rc/MilitaryBand"
                         "http://kbpedia.org/kko/rc/SkaMusic" "http://kbpedia.org/kko/rc/Orchestra" "http://kbpedia.org/kko/rc/GrungeRockMusic"
                         "http://kbpedia.org/kko/rc/SouthernRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Ambient" "http://kbpedia.org/kko/rc/DiscoMusic"]
                        "resources/aspects-domain-corpus.csv"
                        50) ;; ~50 named entities per reference concept, per the text above
Next let’s create the actual positive training corpus and normalize it:
(cache-aspects-corpus "resources/aspects-entities-corpus.csv" "resources/aspects-corpus/")
(normalize-cached-corpus "resources/aspects-corpus/" "resources/aspects-corpus-normalized/")
We end up with 22 features for which we can get named entities from the KBpedia Knowledge Base. These will be the 22 features of our model. The complete positive training set has 799 documents in it.
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-entities-corpus-dictionary.pruned.csv") (build-semantic-interpreter "aspects-entities-pruned" "resources/semantic-interpreters/aspects-entities-pruned/" (distinct (concat (get-domain-pages) (get-general-pages)))) (build-svm-model-vectors "resources/svm/aspects-entities-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/") (train-svm-model "svm.aspects.entities.pruned" "resources/svm/aspects-entities-pruned/" :weights nil :v nil :c 1 :algorithm :l2l2)
Now let’s evaluate the model with default hyperparameters:
(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")
True positive: 9
False positive: 10
True negative: 913
False negative: 85
Precision: 0.47368422
Recall: 0.095744684
Accuracy: 0.906588
F1: 0.15929204
Now let’s try to improve this F1 score using grid search:
(svm-grid-search "grid-search-aspects-entities-pruned-tests" "resources/svm/aspects-entities-pruned/" "resources/gold-standard-full.csv" :selection-metric :f1 :grid-parameters [{:c [1 2 4 16 256] :e [0.001 0.01 0.1] :algorithm [:l2l2] :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.44052863
:c 4
:e 0.001
:algorithm :l2l2
:weight 15}
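To inspect the full set of metrics behind this best score, we can retrain with the hyperparameters the grid search selected and re-evaluate, mirroring what we did for the concept-based corpus earlier. This exact call sequence is not from the original run, but it only reuses the functions already shown above:

;; Retrain the linear SVM with the best hyperparameters found above
;; (c = 4, e = 0.001, positive-class weight = 15), then evaluate it
;; against the full gold standard.
(train-svm-model "svm.aspects.entities.pruned"
                 "resources/svm/aspects-entities-pruned/"
                 :weights {1 15.0} :v nil :c 4 :e 0.001 :algorithm :l2l2)

(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")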
We have been able to greatly improve the F1 score by tweaking the hyperparameters, but the results are still disappointing. There are multiple ways to automatically generate training corpuses, but not all of them are created equal. This is why having a pipeline that can automatically create the training corpuses, optimize the hyperparameters and evaluate the models is more than welcome, since this is where the bulk of a data scientist's time is spent when creating models.
After automatically creating multiple different positive and negative training sets, testing multiple learning methods and optimizing hyperparameters, we found the best training sets with the best learning method and the best hyperparameters to create an initial, optimal model that has an accuracy of 97.2%, a precision of 88.4%, a recall of 80.9% and an overall F1 measure of 84.4% on a gold standard created from real, random pieces of news from different general and specialized news sites.
The really interesting and innovative thing in this method is how a knowledge base of concepts and entities can be used to label positive and negative training sets to feed supervised learners, and how the learner can perform well on totally different input text data (in this case, news articles). The same is true when creating training corpuses for unsupervised learning 4.
The most wonderful thing from an operational standpoint is that all of this searching, testing and optimizing can be performed automatically by a computer. The only tasks required of a human are to define the scope of a domain and to manually label a gold standard for performance evaluation and hyperparameter optimization.
Posted at 11:14
In the first part of this series we found good hyperparameters for a single linear SVM classifier. In part 2, we will try another technique to improve the performance of the system: ensemble learning.
So far, we have reached 95% accuracy with some tweaking of the hyperparameters and the training corpuses, but the F1 score is still around ~70% on the full gold standard, which can be improved. There are also situations where precision should be nearly perfect (because false positives are really not acceptable) or where recall should be optimized.
Here we will try to improve this situation by using ensemble learning, which uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In our examples, each model will have a vote, and the weight of the vote will be equal for each model. We will use five different strategies to create the models that will belong to the ensemble:
Different strategies will be used depending on considerations such as: are the positive and negative training documents unbalanced? How many features does the model have? Let’s introduce each of these different strategies.
Note that in this article I am only creating ensembles with linear SVM learners. An ensemble can, however, be composed of multiple different kinds of learners, such as SVMs with non-linear kernels, decision trees, etc. To simplify this article, we will stick to a single linear SVM with multiple different training corpuses and features.
The idea behind bagging is to draw a subset of positive and negative training samples at random and with replacement. Each model of the ensemble will have a different training set, but some of the training samples may appear in multiple different training sets.
Asymmetric bagging has been proposed by Tao, Tang, Li and Wu 1 for cases where the number of positive training samples is largely unbalanced relative to the negative training samples. The idea is to create a subset of random (with replacement) negative training samples, while always keeping the full set of positive training samples.
The idea behind feature bagging is the same as bagging, but it works on the features of the model instead of the training sets. It attempts to reduce the correlation between estimators in an ensemble by training them on random samples of features instead of the entire feature set. A sketch of these three sampling strategies follows.
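To make the three sampling strategies concrete, here is a minimal Clojure sketch of each. The function names and data shapes (plain collections of document identifiers and feature names) are illustrative assumptions, not the actual ensemble code used in this series:

;; Illustrative sampling helpers; not part of the actual ensemble code.

(defn bootstrap-sample
  "Bagging: draw n samples at random, with replacement."
  [docs n]
  (repeatedly n #(rand-nth (vec docs))))

(defn asymmetric-bagging-sample
  "Asymmetric bagging: keep all positives, bootstrap the negatives."
  [positives negatives n-neg]
  {:positives positives
   :negatives (bootstrap-sample negatives n-neg)})

(defn random-subspace
  "Feature bagging / random subspace: pick n-features distinct features."
  [features n-features]
  (take n-features (shuffle (vec features))))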
Asymmetric bagging combined with the random subspace method has also been proposed by Tao, Tang, Li and Wu 1. The problems they had with their content-based image retrieval system are the same ones we have with these kinds of automatic training corpuses generated from a knowledge graph:
The third point is not immediately an issue for us (unless you have a domain with many more features than we had in our example), but it becomes one when we start using asymmetric bagging.
What we want to do here is to implement asymmetric bagging and the random subspace method to create a number of individual models. This method is called ABRS-SVM, which stands for Asymmetric Bagging Random Subspace Support Vector Machines.
The algorithm we will use is:
Bagging with feature bagging is the same as asymmetric bagging with the random subspace method, except that we use bagging instead of asymmetric bagging. (ABRS should be used if your positive training sample is severely unbalanced compared to your negative training sample; otherwise BRS should be used.)
We use the linear Support Vector Machine (SVM) as the learner for the ensemble. What we will be creating is a series of SVM models that differ depending on the ensemble method(s) we use to create the ensemble.
The first step is to create a structure where all the positive and negative training documents have their vector representation. Since this is the task that takes the most time in the whole process, we will calculate them using the (build-svm-model-vectors) function and serialize the structure on the file system. That way, to create the ensemble’s models, we will only have to load it from the file system without having to re-calculate it each time.
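The persistence code itself is not shown in this article, but serializing the structure can be as simple as writing it out as EDN and reading it back, which is consistent with how the svm-grid-search function shown elsewhere in this series reloads model.vectors. The helper names below are mine:

;; A minimal sketch of persisting the computed SVM vectors as EDN.
;; The path convention matches the "model.vectors" file read by
;; svm-grid-search; the helper names are illustrative assumptions.
(defn save-model-vectors [model-vectors path]
  (spit (str path "model.vectors") (pr-str model-vectors)))

(defn load-model-vectors [path]
  (read-string (slurp (str path "model.vectors"))))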
The goal is to create a set of X SVM classifiers, each of which uses a different model. The models can differ in their features or their training corpus. Each classifier will then try to classify an input text according to its own model. Finally, each classifier votes to determine whether that input text belongs, or not, to the domain.
There are four hyperparameters related to ensemble learning:
Other hyperparameters could include those of the linear SVM classifier, but in this example we will simply reuse the best parameters we found above. We now train the ensemble using the (train-ensemble-svm) function.
Once the ensemble is created and trained, we use the (classify-ensemble-text) function to classify an input text using the ensemble we created. That function takes two parameters: :mode, which is the ensemble’s mode, and :vote-acceptance-ratio, which defines the proportion of positive votes required for the ensemble to positively classify the input text. By default, the ratio is 50%, but if you want to optimize the precision of the ensemble, then you may want to increase that ratio to 70% or even 95%, as we will see below.
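The voting logic itself is straightforward. Here is a hedged sketch of what such a vote-based decision could look like (this is not the actual (classify-ensemble-text) implementation; classify-one stands in for whatever per-model classification function is used):

;; Illustrative majority-vote classification; not the actual
;; (classify-ensemble-text) implementation.
(defn classify-by-votes
  "Returns true when at least vote-acceptance-ratio of the models
   classify the text as belonging to the domain."
  [models classify-one text vote-acceptance-ratio]
  (let [votes (count (filter #(classify-one % text) models))]
    (>= (/ votes (count models)) vote-acceptance-ratio)))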
Finally the ensemble, configured with all its hyperparameters, will be evaluated using the (evaluate-ensemble) function, which is the same as the (evaluate-model) function but uses the ensemble instead of a single SVM model to classify all of the articles. As before, we will characterize the assignments in relation to the gold standard.
Let’s now train different ensembles to try to improve the performance of the system.
The current training corpus is highly unbalanced. This is why the first test we will do is to apply the asymmetric bagging strategy. With this strategy, each of the SVM classifiers uses the same positive training set, with the same number of positive documents. However, each of them takes a random sample of negative training documents (drawn with replacement).
(use 'cognonto-esa.core)
(use 'cognonto-esa.ensemble-svm)

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/domain-corpus-dictionary.pruned.csv")
(load-semantic-interpreter "base-pruned" "resources/semantic-interpreters/base-pruned/")

(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.ab.c2.w30"
                    "resources/ensemble-svm/base-pruned/"
                    :mode :ab
                    :weight {1 30.0} :c 2 :e 0.001
                    :nb-models 100 :nb-training-documents 3500)
Now let’s evaluate this ensemble with a vote acceptance ratio of 50%:
(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" "resources/gold-standard-full.csv" :mode :ab :vote-acceptance-ratio 0.50)
True positive: 48
False positive: 6
True negative: 917
False negative: 46
Precision: 0.8888889
Recall: 0.5106383
Accuracy: 0.9488692
F1: 0.6486486
Let’s increase the vote acceptance ratio to 90%:
(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" "resources/gold-standard-full.csv" :mode :ab :vote-acceptance-ratio 0.90)
True positive: 37
False positive: 2
True negative: 921
False negative: 57
Precision: 0.94871795
Recall: 0.39361703
Accuracy: 0.94198626
F1: 0.556391
In both cases, the precision increases considerably compared to the non-ensemble learning results. However, the recall drops at the same time, which drops the F1 score as well. Let’s now try the ABRS method.
The goal of the random subspace method is to select a random set of features. This means that each model will have its own feature set and will make predictions according to it. With the ABRS strategy, we will end up with highly different models, since none will have the same negative training sets nor the same features.
Here what we test is to define each classifier with 65 randomly chosen features out of 174, and to restrict the negative training corpus to 3500 randomly selected documents. We then choose to create 300 models to try to get a really heterogeneous population of models.
(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.abrs.c2.w30"
                    "resources/ensemble-svm/base-pruned/"
                    :mode :abrs
                    :weight {1 30.0} :c 2 :e 0.001
                    :nb-models 300 :nb-features 65 :nb-training-documents 3500)
(evaluate-ensemble "ensemble.base.pruned.abrs.c2.w30" "resources/gold-standard-full.csv" :mode :abrs :vote-acceptance-ratio 0.50)
True positive: 41
False positive: 3
True negative: 920
False negative: 53
Precision: 0.9318182
Recall: 0.43617022
Accuracy: 0.9449361
F1: 0.59420294
For these features and training sets, using the ABRS method did not improve on the AB method we tried above.
This use case shows three totally different ways to use the KBpedia Knowledge Graph to automatically create positive and negative training sets. We demonstrated how the full process can be automated where the only requirement is to get a list of seed KBpedia reference concepts.
We also quantified the impact of using new versions of KBpedia, and how different strategies, techniques or algorithms can have different impacts on the prediction models.
Creating prediction models using supervised machine learning algorithms (currently the bulk of the learners in use) has two global steps:
Unfortunately, today, given the manual efforts required by the first step, the overwhelming portion of time and budget is spent there to create a prediction model. By automating much of this process, Cognonto and KBpedia substantially reduce this effort. Time and budget can now be re-directed to the second step of “dialing in” the learners, where the real payoff occurs.
Further, as we also demonstrated, once we automate this process of labeling and creating reference standards, we can also automate the testing and optimization of multiple different kinds of prediction algorithms, hyperparameter configurations, etc. In short, for both steps, KBpedia provides significant reductions in the time and effort needed to get to the desired results.
Posted at 11:05
In my previous blog post, Create a Domain Text Classifier Using Cognonto, I explained how one can use the KBpedia Knowledge Graph to automatically create positive and negative training corpuses for different machine learning tasks. I explained how SVM classifiers could be trained and used to check if an input text belongs to the defined domain or not.
This article is the first of a two-part series. In this first part I will extend this idea to explain how the KBpedia Knowledge Graph can be used, along with other machine learning techniques, to cope with different situations and use cases. I will cover the concepts of feature selection, hyperparameter optimization, and ensemble learning (in part 2 of this series). The emphasis here is on the testing and refining of machine learners, versus the set-up and configuration times that dominate other approaches.
Depending on the domain of interest, and depending on the required precision or recall, different strategies and techniques can lead to better predictions. More often than not, multiple different training corpuses, learners and hyperparameters need to be tested before ending up with the initial best possible prediction model. This is why I will strongly emphasize the fact that the KBpedia Knowledge Graph and Cognonto can be used to fully automate the creation of a wide range of different training corpuses, to create models, to optimize their hyperparameters, and to evaluate those models.
For this article, I will use the latest version of the KBpedia Knowledge Graph (version 1.10), which we just released. A knowledge graph such as KBpedia is not static. It constantly evolves, gets fixed, and improves. New concepts are created, deprecated concepts are removed, new linkages to external data sources are created, etc. This growth means that any of these changes can have a [positive] impact on the creation of the positive and negative training sets. Applications based on KBpedia should be tested against any new knowledge graph that is released to see if their models improve. Better concepts, better structure, and more linkages will often lead to better training sets as well.
Such growth in KBpedia is also why automating, and more importantly testing, this process is crucial. Upon the release of major new versions we are able to automate all of these steps to see the final impacts of upgrading the knowledge graph:
Because each of these steps belongs to an automated workflow, we can easily check the impact of updating the KBpedia Knowledge Graph on our models.
A new step I am adding in this current use case is to use a reasoner to reason over the KBpedia Knowledge Graph. The reasoner is used when we define the scope of the domain to classify. We will browse the knowledge graph to see which seed reference concepts we should add to the scope. Then we will use a reasoner to extend the models to include any new sub-classes relevant to the scope of the domain. This means that we may add further specific features to the final model.
Recall that our prior use case used Music as its domain scope. The first step is to use the new KBpedia version 1.10, along with a reasoner, to create the full scope of this updated Music domain. The result of using this new version and a reasoner is that we now end up with 196 features (reference documents) instead of 64. This also means that we will have 196 documents in our positive training set if we only use the Wikipedia pages linked to these reference concepts (and not their related named entities).
(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Musician"
                       "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
                       "http://kbpedia.org/kko/rc/MusicalInstrument"
                       "http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Album-IBO"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MusicalText"
                       "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
                       "http://kbpedia.org/kko/rc/MusicalPerformer"]
  kbpedia
  "resources/domain-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)
The next step is to create the actual training corpuses: the general one and the domain one. We have to load the dictionaries we created in the previous step, and then locally cache and normalize the corpuses. Remember that the normalization steps are:
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv") (cache-corpus) (normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
Because we never have enough instances in our gold standards to test against, let’s create a third one, this time adding a music-related news feed that will add more positive examples to the gold standard.
(defn create-gold-standard-from-feeds [name]
  (let [feeds ["http://www.music-news.com/rss/UK/news"
               "http://rss.cbc.ca/lineup/topstories.xml" "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml" "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml" "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml" "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml" "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml" "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml" "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml" "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml" "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml" "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews" "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews" "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews" "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews" "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews" "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews" "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]
    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          ;; write one row per feed item; the "class" column is left
          ;; empty so it can be labeled manually afterward
          (csv/write-csv out-file [["" (:title item) (:link item)]]))))))
This routine creates this third gold standard. Remember, we use the gold standard to evaluate different methods and models to classify an input text to see if it belongs to the domain or not.
For each piece of news aggregated that way, I manually determined whether the candidate document belongs to the domain or not. This task can be tricky, and requires a clear understanding of the proper scope for the domain. In this example, I consider an article to belong to the music domain if it mentions music concepts such as musical albums, songs, or multiple music-related topics. If a singer is mentioned in an article only because he broke up with his girlfriend, without further mention of anything related to music, I won’t tag it as being part of the domain.
[However, under a different interpretation of what should be in the domain wherein any mention of a singer qualifies, then we could extend the classification process to include named entities (the singer) extraction to help properly classify those articles. This revised scope is not used in this article, but it does indicate how your exact domain needs should inform such scoping decisions.]
You can download this new third gold standard from here.
Now that we have updated the training corpuses using the updated scope of the domain compared to the previous tests, let’s analyze the impact of using a new version of KBpedia and of using a reasoner to increase the number of features in our model. Let’s run our automatic process to evaluate the new models. The remaining steps that need to be run are:
Note: to see the full explanation of how the ESA and SVM classifiers work, please refer to the Create a Domain Text Classifier Using Cognonto article for more background information.
;; Load positive and negative training corpuses
(load-dictionaries "resources/general-corpus-dictionary.csv"
                   "resources/domain-corpus-dictionary.csv")

;; Build the ESA semantic interpreter
(build-semantic-interpreter "base"
                            "resources/semantic-interpreters/base/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base/"
                         :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0} :v nil :c 1 :algorithm :l2l2)
Let’s evaluate this model using our three gold standards:
(evaluate-model "svm.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive: 21
False positive: 3
True negative: 306
False negative: 6
Precision: 0.875
Recall: 0.7777778
Accuracy: 0.97321427
F1: 0.8235294
The performance changes relative to the previous results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +10.33% |
| Recall | -12.16% |
| Accuracy | +0.31% |
| F1 | +0.26% |

The results for the second gold standard are:
(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive: 16
False positive: 3
True negative: 317
False negative: 9
Precision: 0.84210527
Recall: 0.64
Accuracy: 0.9652174
F1: 0.72727275
The performance changes relative to the previous results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +6.18% |
| Recall | -29.35% |
| Accuracy | -1.19% |
| F1 | -14.63% |

What we can say is that the new scope for the domain greatly improved the precision of the model. This happens because the new model is probably more complex and better scoped, which leads it to be more selective. However, because of this the recall of the model suffers, since some of the positive cases of our gold standard are now classified as negative, which creates new false negatives. As you can see, there is almost always a tradeoff between precision and recall. You could reach 100% precision by getting only a single result right, but then the recall would suffer greatly: with the 25 positive cases of this gold standard, one correct positive and no false positives gives a precision of 1.0 but a recall of only 0.04. This is why the F1 score is important, since it is a weighted average (specifically, the harmonic mean) of the precision and the recall.
Now let’s look at the results of our new gold standard:
(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive: 28
False positive: 3
True negative: 355
False negative: 22
Precision: 0.9032258
Recall: 0.56
Accuracy: 0.9387255
F1: 0.69135803
Again, with this new gold standard, we can see the same pattern: the precision is pretty good, but the recall is not that great, since about half of the positive cases did not get noticed by the model.
Now, what could we do to try to improve this situation? The next thing we will investigate is to use feature selection and pruning.
A new method that we will investigate to try to improve the performance of the models is called feature selection. As its name says, what we are doing is selecting specific features to create our prediction model. The idea here is that not all features are created equal, and different features may have different (positive or negative) impacts on the model.
In our specific use case, we want to do feature selection using a pruning technique. What we will do is count the number of tokens for each of our features, i.e. for each of the Wikipedia pages related to these features. If the number of tokens in an article is too small (below 100), then we will drop that feature.
[Note: feature selection is a complex topic; other options and nuances are not further discussed here.]
The idea here is not to give undue importance to a feature for which we lack proper positive documents in the training corpus. Depending on the feature, it may, or may not, have an impact on the overall model’s performance.
Pruning the general and domain-specific dictionaries is really simple. We only have to read the current dictionaries, read each of the documents mentioned in the dictionary from the cache, calculate the number of tokens in each, and then keep or drop them depending on whether they meet a certain threshold. Finally we write a new dictionary with the pruned features and documents:
(defn create-pruned-pages-dictionary-csv
  [dictionary-file pruned-file normalized-corpus-folder & {:keys [min-tokens]
                                                           :or {min-tokens 100}}]
  (let [dictionary (rest
                    (with-open [in-file (io/reader dictionary-file)]
                      (doall
                       (csv/read-csv in-file))))]
    (with-open [out-file (io/writer pruned-file)]
      (csv/write-csv out-file
                     (->> dictionary
                          (mapv (fn [[title rc]]
                                  (when (.exists (io/as-file (str normalized-corpus-folder title ".txt")))
                                    (when (> (->> (slurp (str normalized-corpus-folder title ".txt"))
                                                  tokenize
                                                  count)
                                             min-tokens)
                                      [[title rc]]))))
                          (apply concat)
                          (into []))))))
Then we can prune the general and domain specific dictionaries using this simple function:
(create-pruned-pages-dictionary-csv "resources/general-corpus-dictionary.csv"
                                    "resources/general-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)

(create-pruned-pages-dictionary-csv "resources/domain-corpus-dictionary.csv"
                                    "resources/domain-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)
As a result of this specific pruning approach, the number of features drops from 197 to 175.
Now that the training corpuses have been pruned, let’s load them and then evaluate their performance on the gold standards.
;; Load positive and negative pruned training corpuses
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/domain-corpus-dictionary.pruned.csv")

;; Build the ESA semantic interpreter
(build-semantic-interpreter "base"
                            "resources/semantic-interpreters/base-pruned/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base-pruned/"
                         :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base-pruned/"
                 :weights {1 50.0} :v nil :c 1 :algorithm :l2l2)
Let’s evaluate this model using our three gold standards:
(evaluate-model "svm.pruned.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive: 21
False positive: 2
True negative: 307
False negative: 6
Precision: 0.9130435
Recall: 0.7777778
Accuracy: 0.97619045
F1: 0.84000003
The performance changes relative to the initial results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +18.75% |
| Recall | -12.08% |
| Accuracy | +0.61% |
| F1 | +2.26% |

In this case, compared with the previous results (non-pruned with KBpedia 1.10), we improved the precision without decreasing the recall, which is the ultimate goal. This means that the F1 score increased by 2.26% just by pruning, for this gold standard.
The results for the second gold standard are:
(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive: 16
False positive: 3
True negative: 317
False negative: 9
Precision: 0.84210527
Recall: 0.64
Accuracy: 0.9652174
F1: 0.72727275
The performance changes relative to the previous results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +6.18% |
| Recall | -29.35% |
| Accuracy | -1.19% |
| F1 | -14.63% |

In this case, the results are identical to the non-pruned ones with KBpedia 1.10; pruning did not change anything. Considering the relatively small size of the gold standard, this is to be expected, since the model also did not drastically change.
Now let’s look at the results of our new gold standard:
(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive: 27
False positive: 7
True negative: 351
False negative: 23
Precision: 0.7941176
Recall: 0.54
Accuracy: 0.9264706
F1: 0.64285713
Now let’s check how these compare to the non-pruned version of the training corpus:
| metric | change |
|---|---|
| Precision | -12.08% |
| Recall | -3.7% |
| Accuracy | -1.31% |
| F1 | -7.02% |

Both false positives and false negatives increased with this change, which also led to a decrease in the overall metrics. What happened?
Different things may have happened, in fact. Maybe the new set of features is not optimal, or maybe the hyperparameters of the SVM classifier are off. This is what we will try to figure out using two new methods to continue improving our model: hyperparameter optimization using grid search, and ensemble learning.
Hyperparameters are parameters that are not learned by the estimators. They are a kind of configuration option for an algorithm. In the case of a linear SVM, the hyperparameters are C, epsilon, the weight and the algorithm used. Hyperparameter optimization is the task of trying to find the right parameter values in order to optimize the performance of the model.
There are multiple different strategies that we can use to try to find the best values for these hyperparameters, but the one we will use is called the grid search, which exhaustively searches across a manually defined subset of possible hyperparameter values.
The grid search function we want to define will enable us to specify the algorithm(s), the weight(s), C and the stopping tolerance. Then we will want the grid search to keep the hyperparameters that optimize the score of the metric we want to focus on. We also have to specify the gold standard we want to use to evaluate the performance of the different models.
Here is the function that implements that grid search algorithm:
(defn svm-grid-search
  [name model-path gold-standard & {:keys [grid-parameters selection-metric]
                                    :or {grid-parameters [{:c [1 2 4 16 256]
                                                           :e [0.001 0.01 0.1]
                                                           :algorithm [:l2l2]
                                                           :weight [1 15 30]}]
                                         selection-metric :f1}}]
  (let [best (atom {:gold-standard gold-standard
                    :selection-metric selection-metric
                    :score 0.0
                    :c nil
                    :e nil
                    :algorithm nil
                    :weight nil})
        model-vectors (read-string (slurp (str model-path "model.vectors")))]
    (doseq [parameters grid-parameters]
      (doseq [algo (:algorithm parameters)]
        (doseq [weight (:weight parameters)]
          (doseq [e (:e parameters)]
            (doseq [c (:c parameters)]
              (train-svm-model name model-path
                               :weights {1 (double weight)}
                               :v nil :c c :e e :algorithm algo
                               :model-vectors model-vectors)
              (let [results (evaluate-model name gold-standard :output false)]
                (println "Algorithm:" algo)
                (println "C:" c)
                (println "Epsilon:" e)
                (println "Weight:" weight)
                (println selection-metric ":" (get results selection-metric))
                (println)
                (when (> (get results selection-metric) (:score @best))
                  (reset! best {:gold-standard gold-standard
                                :selection-metric selection-metric
                                :score (get results selection-metric)
                                :c c
                                :e e
                                :algorithm algo
                                :weight weight}))))))))
    @best))
The possible algorithms are:

- :l2lr_primal
- :l2l2
- :l2l2_primal
- :l2l1
- :multi
- :l1l2_primal
- :l1lr
- :l2lr

To simplify things a little bit for this task, we will merge the three gold standards we have into one. We will use that gold standard moving forward. The merged gold standard can be downloaded from here. We now have a single gold standard with 1017 manually vetted web pages.
Now that we have a new consolidated gold standard, let’s calculate the performance of the models with pruned and non-pruned training corpuses. This will become the new basis for comparing the subsequent results in this article. The metrics when the training corpuses are pruned:
True positive: 56
False positive: 10
True negative: 913
False negative: 38
Precision: 0.8484849
Recall: 0.59574467
Accuracy: 0.95280236
F1: 0.7
Now, let’s run the grid search that will try to optimize the F1 score of the model using the pruned training corpuses and the full gold standard:
(svm-grid-search "grid-search-base-pruned-tests" "resources/svm/base-pruned/" "resources/gold-standard-full.csv" :selection-metric :f1 :grid-parameters [{:c [1 2 4 16 256] :e [0.001 0.01 0.1] :algorithm [:l2l2] :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.7096774
:c 2
:e 0.001
:algorithm :l2l2
:weight 30}
With a simple subset of the possible hyperparameter space, we found that by increasing the C parameter to 2 we could improve the F1 score on the gold standard by 1.37%. It is not a huge gain, but it is still an appreciable gain given the minimal effort invested so far (basically: waiting for the grid search to finish). Subsequently we could tweak the subset of parameters to try to improve a little further. Let’s try with c = [1.5, 2, 2.5] and weight = [30, 40]. Let’s also check other algorithms, such as L2-regularized L1-loss support vector regression (dual), as sketched below.
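Such a refined second pass could look like the following call. This exact invocation is not from the original article; the values come straight from the paragraph above, and I am assuming :l2l1 is the keyword for the L2-regularized L1-loss variant from the algorithm list shown earlier:

;; Hypothetical second-pass grid search narrowing in on the best region
;; found by the first pass; assumes :l2l1 maps to the L2-regularized
;; L1-loss algorithm mentioned above.
(svm-grid-search "grid-search-base-pruned-refined"
                 "resources/svm/base-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1.5 2 2.5]
                                    :e [0.001 0.01]
                                    :algorithm [:l2l2 :l2l1]
                                    :weight [30 40]}])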
The goal here is to configure the initial grid search with general parameters with a wide range of possible values. Then subsequently we can use that tool to fine tune some of the parameters that were returning good results. In any case, the more computer power and time you have, the more tests you will be able to perform.
Posted at 11:00
The Linked Data Lexicography for High-End Language Technology (LDL4HELTA) project was started in cooperation between Semantic Web Company (SWC) and K Dictionaries. LDL4HELTA combines lexicography and Language Technology with semantic technologies and Linked (Open) Data mechanisms and technologies. One of the implementation steps of the project is to create a language graph from the dictionary data.
The input data, described further below, is a Spanish dictionary core translated into multiple languages and available in XML format. This data should be triplified (that is, converted to RDF, the Resource Description Framework) for several purposes, including enriching it with external resources. The triplified data needs to comply with Semantic Web principles.
To get from a dictionary’s XML format to its triples, I learned that you must have a model. One piece of the sketched model, representing two Spanish words which have senses that relate to each other, is presented in Figure 1.
This sketched model first needs to be created by a linguist who understands both the language complexity and Semantic Web principles. The extensive model [1] was developed at the Ontology Engineering Group of the Universidad Politécnica de Madrid (UPM).
Language is very complex. With this we all agree! How complex it really is, is probably often underestimated, especially when you need to model all its details and triplify it.
So why is the task so complex?
To start with, the XML structure is complex in itself, as it contains nested structures. Each word constitutes an entry. One single entry can contain information about:
Entries can have predefined values, which can recur, but their fields can also have so-called free values, which can vary too. Such fields are:
As mentioned above, in order to triplify a dictionary one needs to have a clearly defined model. Usually, when modelling Linked Data or just RDF, it is important to make use of existing models and schemas to enable easier and more efficient use and integration. One well-known lexicon model is Lemon. Lemon covers many of our dictionary needs, but not all of them. We also started using the Ontolex model, which is much more complex and is considered to be the evolution of Lemon. However, some pieces of information were still missing, so we created an additional ontology to cover all missing corners and catch the specific details that did not overlap with the Ontolex model (such as the free values).
An additional level of complexity was the need to identify exactly the missing pieces in the Ontolex model and its modules and to create the part for the missing information. This was part of creating the dictionary’s model, which we called ontolexKD.
As a developer you never sit down to think about all the senses or meanings or translations of a word (except if you specialize in linguistics), so just to understand the complexity was a revelation for me. And still, each dictionary contains information that is specific to it and which needs to be identified and understood.
The process used in order to do the mapping consists of several steps. Imagine this as a processing pipeline which manipulates the XML data. UnifiedViews is an ETL tool, specialized in the management of RDF data, in which you can configure your own processing pipeline. One of its use cases is to triplify different data formats. I used it to map XML to RDF and upload it into a triple store. Of course this particular task can also be achieved with other such tools or methods for that matter. In UnifiedViews the processing pipeline resembles what appears in Figure 2.
The pipeline is composed out of data processing units (DPUs) which communicate iteratively. In a left-to-right order the process in Figure 2 represents:
Basically the XML is transformed using XSLT.
Complexity also increases through the URIs (Uniform Resource Identifiers) that are needed for mapping the information in the dictionary, because with Linked Data any resource should have a clearly identified and persistent identifier! The start was to represent a single word (headword) under a desired namespace and build on it to associate it with its part of speech, grammatical number, grammatical gender, definition, translation – just to begin with.
The base URIs follow the best practices recommended in the ISA study on persistent URIs, following the pattern: http://{domain}/{type}/{concept}/{reference}.
An example of such URIs for the forms of a headword is:
These two URIs represent the singular masculine and singular feminine forms of the Spanish word entendedor.
If the dictionary contains two different adjectival endings, as with entendedor which has different endings for the feminine and masculine forms (entendedora and entendedor), and they are not explicitly mentioned in the dictionary, then we use numbers in the URI to describe them. If the gender were explicitly mentioned, the URIs would be:
In addition, we should consider that the aim of triplifying the XML was for all these headwords with senses, forms and translations to connect and be identified and linked following Semantic Web principles. The actual overlap and linking of the dictionary resources remains open. A second step for improving the triplification and mapping similar entries, if possible at all, still needs to be carried out. As an example, let’s take two dictionaries: a German one, which contains translations into English, and an English one, which contains translations into German. We get the following translations:
Bank – bank – German to English
bank – Bank – English to German
The URI of the translation from German to English was designed to look like:
And the translation from English to German would be:
In this case both represent the same translation but have different URIs because they were generated from different dictionaries (mind the translation order). These should be mapped so as to represent the same concept, theoretically, or should they not?
The word Bank in German can mean either a bench or a bank in English. When I translate both English senses back into German I get again the word Bank, but I cannot be sure which sense I translate unless the sense id is in the URI, hence the SE00006110 and SE00006116. It is important to keep the order of translation (target-source) but later map the fact that both translations refer to the same sense, same concept. This is difficult to establish automatically. It is hard even for a human sometimes.
One of the last steps of complexity was to develop a generic XSLT which can triplify all the different languages of this dictionary series and store the complete data in a triple store. The question remains: is the design of such a universal XSLT possible while taking into account the differences in languages or the differences in dictionaries?
From the point of view of enabling the dictionary to benefit from Semantic Web principles, the task at hand is not yet complete. The linguist is probably the first one who can conceptualize how to do this.
As a next step we will improve the Linked Data created so far and bring it to the status of a good linked language graph by enriching the RDF data with additional information, such as the history of a term or additional grammatical information etc.
References:
[1] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda, and G. Aguado-de Cea, “Modelling multilingual lexicographic resources for the web of data: the K Dictionaries case,” in Proc. of the GLOBALEX’16 workshop at LREC 2016, Portorož, Slovenia, May 2016.
Posted at 12:07
Hello Community! We are very pleased to announce that our paper “Radon – Rapid Discovery of Topological Relations” was accepted for presentation at the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), which will be held February 4–9 at the Hilton San Francisco, San Francisco, California, USA.
In more detail, we will present the following paper: “Radon – Rapid Discovery of Topological Relations” by Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, and Axel-Cyrille Ngonga Ngomo.
Abstract. Datasets containing geo-spatial resources are increasingly being represented according to the Linked Data principles. Several time-efficient approaches for discovering links between RDF resources have been developed over the last years. However, the time-efficient discovery of topological relations between geospatial resources has been paid little attention to. We address this research gap by presenting Radon, a novel approach for the rapid computation of topological relations between geo-spatial resources. Our approach uses a sparse tiling index in combination with minimum bounding boxes to reduce the computation time of topological relations. Our evaluation of Radon’s runtime on 45 datasets and in more than 800 experiments shows that it outperforms the state of the art by up to 3 orders of magnitude while maintaining an F-measure of 100%. Moreover, our experiments suggest that Radon scales up well when implemented in parallel.
Acknowledgments
This work is implemented in the link discovery framework LIMES and has been supported
by the European Union’s H2020 research and innovation action
HOBBIT (GA no. 688227) as
well as the BMWI Project GEISER (project no.
01MD16014).
Posted at 13:48
Posted at 15:09
Posted at 23:59
Posted at 12:39
Holiday season is nearly upon us. Donating to a charity is an alternative form of gift giving that shows you care, whilst directing your money towards helping those that need it. There are a lot of great and deserving causes you can support, and I’m certainly not going to tell you where you should donate your money.
But I’ve been thinking about the various ways in which I
can support projects that I care about. There are a lot of them as
it turns out. And it occurred to me that I could ask friends
and family who might want to buy me a gift to donate to them
instead. It’ll save me getting yet another scarf, pair
of socks, or (shudder) a
Posted at 19:23
Open data is data that anyone can access, use and share.
Open data is the result of several processes. The most obvious one is the release process that results in data being made available for reuse and sharing.
But there are other processes that may take place before that open data is made available: collecting and curating a dataset; running it through quality checks; or ensuring that data has been properly anonymised.
There are also processes that happen after data has been published. Providing support to users, for example. Or dealing with error reports or service issues with an API or portal.
Some processes are also continuous. Engaging with re-users is something that is best done on an ongoing basis. Re-users can help you decide which datasets to release and when. They can also give you feedback on ways to improve how your data is published. Or how it can be connected and enriched against other sources.
Collectively these processes define the practice of open data.
The practice of open data covers much more than the technical details of helping someone else access your data. It covers a whole range of organisational activities.
Releasing open data can be really easy. But developing your open data practice can take time. It can involve other changes in your organisation, such as creating a more open approach to data sharing. Or getting better at data governance and management.
The extent to which you develop an open data practice depends on how important open data is to your organisation. Is it part of your core strategy or just something you’re doing on a more limited basis?
The breadth and depth of the practice of open data is surprising to many people. The learning process is best experienced. Going through the process of opening a dataset, however small, provides useful insight that can help identify where further learning is needed.
One aspect of the practice of open data involves
understanding what data can be open, what can be shared and what
must stay closed. Moving data along
Posted at 19:52
The Cognonto demo is powered by an extensive knowledge graph called the KBpedia Knowledge Graph, as organized according to the KBpedia Knowledge Ontology (KKO). KBpedia is used for all kinds of tasks, some of which are demonstrated by the Cognonto use cases. KBpedia powers dataset linkage and mapping tools, machine learning training workflows, entity and concept extractions, category and topic tagging, etc.
The KBpedia Knowledge Graph is a structure of more than 39,000 reference concepts linked to 6 major knowledge bases and 20 popular ontologies in use across the Web. Unlike other knowledge graphs that analyze big corpuses of text to extract “concepts” (n-grams) and their co-occurrences, KBpedia has been created, is curated, is linked, and evolves using humans for the final vetting steps. KBpedia and its build process thus form a semi-automatic system.
The challenge with such a project is to be able to grow and refine (add or remove relations) within the structure without creating unknown conceptual issues. The sheer combinatorial scope of KBpedia means it is not possible for a human to fully understand the impact of adding or removing a relation on its entire structure. There is simply too much complexity in the interaction amongst the reference concepts (and their different kinds of relations) within the KBpedia Knowledge Graph.
What I discuss in this article is how Cognonto creates and then constantly evolves the KBpedia Knowledge Graph. In parallel with creating KBpedia over the years, we have also needed to develop our own build processes and tools to make sure that every time something changes in KBpedia’s structure, it remains satisfiable and coherent.
As you may experience for yourself with the Knowledge Graph browser, the KBpedia structure is linked to multiple external sources of information. Each of these sources (six major knowledge bases and another 20 ontologies) has its own world view. Each of these sources uses its own concepts to organize its own structure.
What the KBpedia Knowledge Graph does is to merge all these different world views (and their associated instances and entities) into a coherent whole. One of the purposes of the KBpedia Knowledge Graph is to act as a scaffolding for integrating still further external sources, specifically in the knowledge domains relevant to specific clients.
One inherent characteristic of these knowledge sources is that they are constantly changing. Some may be updated only occasionally, others every year, others every few months, others every few weeks, or whatever. In the cases of Wikipedia and Wikidata, two of the most important contributors to KBpedia, thousands of changes occur daily. This dynamism of knowledge sources is an important fact since every time a source is changed, it may mean that its world view may have changed as well. Any of these changes can have an impact on KBpedia and the linkages we have to that external source.
Because of this dynamic environment, we have to constantly regenerate the KBpedia Knowledge Graph, and we constantly have to make sure that any changes in its structure, or in the structure of the sources linked to it, don’t make it unsatisfiable or incoherent.
It is for these reasons that we developed an extensive knowledge graph building process that includes a series of tests that are run every time the knowledge graph gets modified. Each new build is verified to be still satisfiable and coherent.
The KBpedia Knowledge Graph build process has been developed over the years to create a robust workflow that enables us to regenerate KBpedia every time something changes in it, while ensuring that no new issues are introduced along the way. Our build process also calculates a series of statistics and metrics that enable us to follow its evolution.
The process works as follows:
It is important that we be able to do these builds and tests rapidly, so that we can release new versions rapidly. Remember, all changes to the KBpedia Knowledge Graph are manually vetted.
To accomplish this aim we actually build KBpedia from a set of fairly straightforward input files (for easy inspection and modification). We can completely rebuild all of KBpedia in less than two hours. About 45 minutes are required for building the overall structure and applying the satisfiability and coherency tests. The typology aspects of KBpedia and their tests add another hour or so to complete the build. The rapidity of the build cycle means we can test and refine nearly in real time, useful when we are changing or refining big chunks of the structure.
Building the KBpedia Knowledge Graph is like M.C. Escher’s hands drawing themselves. Because of the synergy between the Knowledge Graph reference concepts, its upper structure, its typologies and its numerous links to external sources, any addition in one of these areas can lead to improvements in other areas of the knowledge graph. These improvements are informed by analyzing the metrics, statistics, and possible errors logged by the build process.
The Knowledge Graph is constantly evolving, self-healing and expanding. This is why the build process, and more importantly its tests, are crucial to make sure that new issues are not introduced every time something changes within the structure.
To illustrate these points, let’s dig a little deeper into the KBpedia Knowledge Graph build process.
The KBpedia Knowledge Graph is built from a few relatively straightforward assignment files serialized in CSV. Each file has its purpose in the build process and is encoded using UTF-8 for internationalization purposes. KBpedia is just a set of simple indexes serialized as CSV files that can easily be exchanged, updated and re-processed.
The process is 100% repeatable and testable. If issues are found in the future that require a new step or a new test, the pipeline can easily be improved by plugging the new step or test into it. In fact, the current pipeline is the incremental result of years of working this process. I’m sure we will add more steps still as time goes on.
The process is also semi-automatic. Certain tests may cause the process to completely fail. If such a failure happens, then the immediate actions required are output to different log files. If the process does complete, then all kinds of log files and statistics about the KBpedia Knowledge Graph structure are written to the file system. Once completed, the human operator can easily check these logs and update the input files to improve anything found while analyzing the output files.
Building KBpedia is really an iterative process. The graph is often generated hundreds of times before a new version is released.
The core and most important test in the process is the satisfiability test that is run once the KBpedia Knowledge Graph is generated. An unsatisfiable class is a class that does not “satisfy” (is inconsistent with) the structure of the knowledge graph. In KBpedia, what needs to be satisfied are the disjoint assertions that exist at the upper level of the knowledge graph. If an assertion between two reference concepts (like a sub-class-of or an equivalent-to relationship) leads to an unsatisfiable disjoint assertion, then an error is raised and the issue will need to be fixed by the human operator.
Here is an example of an unsatisfiable class. In this example, someone wants to say that a musical group (kbpedia:MusicPerformanceOrganization) is a sub-class-of a musician (kbpedia:Musician). This new assertion is obviously an error (a musical group is an organization, while a musician is an individual person), but the human operator didn’t notice it when he created the new relationship between the two reference concepts.
So, how does the build process catch such errors? Here is how:
Because the two classes belong to two disjoint super classes, the KBpedia generator finds this issue and returns an error along with a logging report that explains why the new assertion makes the structure unsatisfiable. This testing and audit report is pretty powerful (and essential) for maintaining the integrity of the knowledge graph.
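To make the idea concrete, here is a minimal Clojure sketch of the kind of disjointness check involved (illustrative only, not Cognonto’s actual build code; the super-classes-of helper, the disjoint pairs and the class names are assumptions):

(def disjoint-pairs
  ;; Super classes asserted to be mutually disjoint (illustrative).
  #{#{:kbpedia/Organization :kbpedia/Person}})

(defn violates-disjointness?
  "Checks whether asserting `sub` sub-class-of `super` would place the
   concept under two super classes declared disjoint. `super-classes-of`
   is an assumed helper returning the set of all transitive super
   classes of a concept."
  [super-classes-of sub super]
  (let [ancestors-of-sub (conj (super-classes-of sub) sub)
        ancestors-of-super (conj (super-classes-of super) super)]
    (boolean
     (some (fn [pair]
             (let [[a b] (vec pair)]
               (or (and (contains? ancestors-of-sub a) (contains? ancestors-of-super b))
                   (and (contains? ancestors-of-sub b) (contains? ancestors-of-super a)))))
           disjoint-pairs))))

;; E.g., with MusicPerformanceOrganization under Organization and Musician
;; under Person, the faulty sub-class-of assertion is flagged:
;; (violates-disjointness? supers :kbpedia/MusicPerformanceOrganization :kbpedia/Musician)
;; => true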
The satisfiability testing of external concepts linked to KBpedia is performed in two steps:
This second step is essential to make sure that any external concept we link to KBpedia is linked properly and does not trigger any linking errors. In fact, we are trying to minimize the number of errors using the unsatisfiability testing. The process of checking whether external concepts linked to the KBpedia Knowledge Graph satisfy the structure is the same. If their inclusion leads to such an issue, then it means that the links are the issue, since we know that the KBpedia core structure is satisfiable (that was established in the previous step). Once detected, the linkage error(s) will be reviewed and fixed by the human operator and the structure will be regenerated. In the early phases of a new build, these fixes are accumulated and processed in batches. At the end of a new build, only one or a few errors remain to be corrected.
Another important test is to make sure that the KBpedia Knowledge Graph is fully connected. We don’t want to have islands of concepts in the graph; we want to make sure that every concept is reachable using sub-class-of, super-class-of or equivalent-to relationships. If the build process detects that some concepts are disconnected from the graph, then new relationships will need to be created to reconnect them. These “orphan” tests ensure the integrity and completeness of the overall graph structure.
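A sketch of such an orphan test (illustrative only; it assumes the graph is available as a map from each concept to the set of concepts it is directly related to):

(require '[clojure.set :as set])

(defn orphan-concepts
  "Returns the concepts unreachable from `root` by following sub-class-of,
   super-class-of or equivalent-to edges. `neighbors` maps a concept to
   the set of directly related concepts."
  [neighbors root all-concepts]
  (loop [frontier #{root}
         seen #{root}]
    (if (empty? frontier)
      (set/difference all-concepts seen)
      (let [discovered (set/difference
                        (reduce set/union #{} (map #(get neighbors % #{}) frontier))
                        seen)]
        (recur discovered (set/union seen discovered))))))

;; Any non-empty result means the graph is disconnected and new
;; relationships are needed to reconnect the listed concepts.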
What is a typology? As stated by Merriam Webster, a
typology is “a system used for putting things into groups
according to how they are similar.” The KBpedia typologies, of
which there are about 80, are the classification of types that are
closely related, which we term Super Types. Three example
Super Types are People, Activities and Products. The Super Types
are found in the upper reaches of the KBpedia Knowledge Graph. (See
further
this article by Mike Bergman describing the upper structure of
KBpedia and its relation to the typologies.) Thousands of
disjointedness assertions have been defined between
individual Super Types to other Super Types. These assertions
enforce the fact that the reference concepts related to a Super
Type A are not similar to the reference concepts
related to, say, Super Type B.
These disjointedness assertions are a major factor in how we can rapidly slice-and-dice the KBpedia knowledge space to rapidly create training corpuses and positive and negative training sets for machine learning. These same disjointedness relationships are what we use to make sure that the KBpedia Knowledge Graph structure is satisfiable and coherent.
Another use of the typologies is to have a general overview of the knowledge graph. Each typology is a kind of lens that shows different parts of the knowledge graph. The build process creates a log of each of the typologies with all the reference concepts that belong to it. Similarly, the build process also creates a mini-ontology for each typology that can be inspected in an ontology editor. We use these outputs to more easily assess the various structures within KBpedia and to find possible conceptual issues as part of our manual vetting before final approvals.
Creating, maintaining and evolving a knowledge graph the size of KBpedia is a non-trivial task. It is also a task that must be done frequently and rapidly whenever the underlying nature of KBpedia’s constituent knowledge bases dynamically changes. These demands require a robust build process with multiple logic and consistency tests. At every step we have to make sure that the entire structure is satisfiable and coherent. Fortunately, after development over a number of years, we now have processes in place that are battle tested and can continue to be expanded as the KBpedia Knowledge Graph constantly evolves.
Posted at 19:57
When I’m discussing business models around open data I regularly
refer to a few different examples. Not all of these have well
developed case studies, so I thought I’d start trying to capture
them here. In this first write-up I’m going to look at
Posted at 22:28
[work in progress – I’m updating it gradually]
Posted at 16:11
As of last month
Posted at 19:39
One of the topics that most interests me at the moment is how we design systems and organisations that contribute to the creation and maintenance of the open data commons.
This is more than a purely academic interest. If we can understand the characteristics of successful open data projects like Open Street Map or Musicbrainz then we could replicate them in other areas. My hope is that we may be able to define a useful tool-kit of organisational and technical design patterns that make it more likely for other similar projects to proceed. These patterns might also give us a way to evaluate and improve other existing systems.
A lot of the current discussion around this topic is going on
under the “
Posted at 19:03
Posted at 16:49
A common task required by systems that automatically analyze text is to classify an input text into one or multiple classes. A model needs to be created to scope the class (what belongs to it and what does not) and then a classification algorithm uses this model to classify an input text.
Multiple classification algorithms exist to perform such a task: Support Vector Machine (SVM), K-Nearest Neighbours (KNN), C4.5 and others. What is hard with any such text classification task is not so much how to use these algorithms: they are generally easy to configure and use once implemented in a programming language. The hard – and time-consuming – part is to create a sound training corpus that will properly define the class you want to predict. Further, the steps required to create such a training corpus must be duplicated for each class you want to predict.
Since creating the training corpus is what is time consuming, this is where Cognonto provides its advantages.
In this article, we will show you how Cognonto’s KBpedia Knowledge Graph can be used to automatically generate training corpuses that are used to generate classification models. First, we define (scope) a domain with one or multiple KBpedia reference concepts. Second, we aggregate the training corpus for that domain using the KBpedia Knowledge Graph and its linkages to external public datasets that are then used to populate the training corpus of the domain. Third, we use the Explicit Semantic Analysis (ESA) algorithm to create a vectorial representation of the training corpus. Fourth, we create a model using (in this use case) an SVM classifier. Finally, we predict if an input text belongs to the class (scoped domain) or not.
This use case can be used in any workflow that needs to pre-process any set of input texts where the objective is to classify relevant ones into a defined domain.
Unlike more traditional topic taggers where topics are tagged in an input text with weights provided for each of them, we will see how it is possible to use the semantic interpreter to tag main concepts related to an input text even if the surface form of the topic is not mentioned in the text. We accomplish this by leveraging ESA’s semantic interpreter.
In this article, two concepts are at the center of everything:
what I call the general domain and the specific
domain(s). What I call the general domain can
be seen as the set of all specific domains. It includes the set of
classes that generally define common things of the World. What we
call a specific domain is one or multiple classes that
scope a domain of interest. A specific domain is a
subset of classes of the general domain.
In Cognonto, the general domain is defined by all the ~39,000 KBpedia reference concepts. A specific domain is any subset of the ~39,000 KBpedia reference concepts that adequately scopes a domain of interest.
The purpose of this use case is to show how we can determine if an input text belongs to a specific domain of interest. What we have to do is to create two training corpuses: one that defines the general domain, and one that defines the specific domain. However, how do we go about defining these corpuses? One way would be to do this manually, but it would take an awful lot of time to do.
This is the crux of the matter: we will generate the general domain corpus and specific domain ones automatically using the KBpedia Knowledge Graph and all of its linkages to external public datasets. The time and resources thus saved from creating the training corpuses can be spent testing different classification algorithms, tweaking their parameters, evaluating them, etc.
What is so powerful in leveraging the KBpedia Knowledge Graph in this manner is that we can generate training sets for all kinds of domains of interest automatically.
The first step we have to do is to define the training corpuses that we will use to create the semantic interpreter and the SVM classification models. We have to create the general domain training corpus and the domain specific training corpus. The example domain I have chosen for this use case is scoped by the ideas of Music, Musicians, Music Records, Musical Groups, Musical Instruments, etc.
The general training corpus is quite easy to create. The only thing I have to do is to query the KBpedia Knowledge Graph to get all the Wikipedia pages linked to all the KBpedia reference concepts. These pages will become the general training corpus.
Note that in this article I will only use the linkages to the Wikipedia dataset, but I could also use any other datasets that are linked to the KBpedia Knowledge Graph in exactly the same way. Here is how we aggregate all the documents that will belong to a training corpus:
Note all I need do is to use the KBpedia structure, query it, and then write the general corpus into a CSV file. This CSV file will be used later for most of the subsequent tasks.
(define-general-corpus "resources/kbpedia_reference_concepts_linkage.n3" "resources/general-corpus-dictionary.csv")
The next step is to define the training corpus of the specific domain for this use case: the music domain. To do so, I need merely search KBpedia to find all the reference concepts I am interested in that will scope my music domain. These domain-specific KBpedia reference concepts will be the features of the SVM models we will test below.
What the define-domain-corpus function does below
is simply to query KBpedia to get all the Wikipedia articles
related to these concepts, their sub-classes and to create the
training corpus from them.
In this article we only define a binary classifier. However, if we would want to create a multi-class classifier then we would have to define multiple specific domain training corpuses exactly the same way. The only time we would have to spend is to search KBpedia (using the Cognonto user interface) to find the reference concepts we want to use to scope the domains we want to define. We will show how quickly this can be done with impressive results in a later use case.
(define-domain-corpus
  ["http://kbpedia.org/kko/rc/Music"
   "http://kbpedia.org/kko/rc/Musician"
   "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
   "http://kbpedia.org/kko/rc/MusicalInstrument"
   "http://kbpedia.org/kko/rc/Album-CW"
   "http://kbpedia.org/kko/rc/Album-IBO"
   "http://kbpedia.org/kko/rc/MusicalComposition"
   "http://kbpedia.org/kko/rc/MusicalText"
   "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
   "http://kbpedia.org/kko/rc/MusicalPerformer"]
  "resources/kbpedia_reference_concepts_linkage.n3"
  "resources/domain-corpus-dictionary.csv")
Once the training corpuses are defined, we want to cache them locally to be able to play with them, without having to re-download them from the Web or re-create them each time.
(cache-corpus)
The cache is composed of 24,374 Wikipedia pages, which is about
2G of raw data. However, we have some more processing
to perform on the raw Wikipedia pages since what we ultimately want
is a set of relevant tokens (words) that will be used to calculate
the value of the features of our model using the ESA semantic
interpreter. Since we may want to experiment with different
normalization rules, what we do is to re-write each document of the
corpus in another folder that we will be able to re-create as
required if the normalization rules change in the future. We can
quickly re-process these input files and save them in separate
folders for testing and comparative purposes.
The normalization steps performed by this function are to:
Normalization steps could be dropped or others included, but these are the standard ones Cognonto applies in its baseline configuration.
(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
After cleaning, the size of the cache is now 208M
(instead of the initial 2G for the raw web pages).
Note that unlike what is discussed in the original ESA research papers by Evgeniy Gabrilovich, we are not pruning any pages (the ones with less than X number of tokens, etc.). This could be done at a subsequent tweaking step, which our results below indicate is not really necessary.
Now that the training corpuses are created, we can build the semantic interpreter to create the vectors that will be used to train the SVM classifier.
What we want to do is to classify (determine) if an input text belongs to a class as defined by a domain. The relatedness of the input text is based on how closely the specific domain corpus is related to the general one. This classification is performed with classifiers like SVM, KNN and C4.5. However, each of these algorithms needs some kind of numerical vector upon which the actual classifier can model and classify the candidate input text. Creating this numeric vector is the job of the ESA Semantic Interpreter.
Let’s dive a little further into the Semantic Interpreter to understand how it operates. Note that you can skip the next section and continue with the following one.
The Semantic Interpreter is a process that maps fragments of natural language into a weighted sequence of text concepts ordered by their relevance to the input.
Each concept in the domain is accompanied by a document from the KBpedia Knowledge Graph, which acts as its representative term set to capture the idea (meaning) of the concept. The overall corpus is based on the combined documents from KBpedia that match the slice retrieved from the knowledge graph based on the domain query(ies).
The corpus is composed of $N$ concepts that come from the domain ontology associated with KBpedia Knowledge Base documents. We build a sparse matrix $T$ where each of the $N$ columns corresponds to a concept and where each of the rows corresponds to a word that occurs in the related entity documents $d_1, \ldots, d_N$. The matrix entry $T[i,j]$ is the TF-IDF value of the word $w_i$ in document $d_j$.
The TF-IDF value of a given term is calculated as:

$$T[i,j] = \mathrm{tf}(w_i, d_j) \cdot \log\frac{N}{\mathrm{df}(w_i)}$$

where $|d_j|$ is the number of words in the document $d_j$, where the term frequency is defined as:

$$\mathrm{tf}(w_i, d_j) = \frac{\mathrm{count}(w_i, d_j)}{|d_j|}$$

and where the document frequency $\mathrm{df}(w_i)$ is the number of documents where the term $w_i$ appears.
Unlike the standard ESA system, pruning is not performed on the matrix to remove the least-related concepts for any given word. We are not doing the pruning because the ontologies are highly domain specific, as opposed to really broad and general vocabularies. However, with a different mix of training text, and depending on the use case, the standard ESA model may benefit from pruning the matrix.
Once the matrix is created, we perform cosine normalization on each column:

$$T'[i,j] = \frac{T[i,j]}{\sqrt{\sum_{k} T[k,j]^2}}$$

where $T[i,j]$ is the TF-IDF weight of the word $w_i$ in the concept document $d_j$, and where $\sqrt{\sum_{k} T[k,j]^2}$ is the square root of the sum of the squared TF-IDF weights of each word in document $d_j$. This normalization removes, or at least lowers, the effect of the length of the input documents.
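To make the two formulas above concrete, here is a minimal Clojure sketch (illustrative only, not the Cognonto implementation) that builds cosine-normalized TF-IDF columns from tokenized documents:

(defn tf-idf-matrix
  "Builds a map {doc-id {word weight}} of cosine-normalized TF-IDF
   weights from a map {doc-id [token ...]} of tokenized documents."
  [docs]
  (let [n (count docs)
        ;; document frequency: number of docs in which each word appears
        df (reduce (fn [acc tokens]
                     (reduce (fn [m w] (update m w (fnil inc 0))) acc (distinct tokens)))
                   {}
                   (vals docs))
        ;; raw TF-IDF values, one column per document
        cols (into {}
                   (for [[doc-id tokens] docs]
                     [doc-id
                      (let [len (count tokens)]
                        (into {}
                              (for [[w c] (frequencies tokens)]
                                [w (* (/ c (double len))
                                      (Math/log (/ n (double (df w)))))])))]))
        ;; cosine normalization of a column
        normalize (fn [col]
                    (let [l2 (Math/sqrt (reduce + 0.0 (map #(* % %) (vals col))))]
                      (if (zero? l2)
                        col
                        (into {} (for [[w v] col] [w (/ v l2)])))))]
    (into {} (for [[doc-id col] cols] [doc-id (normalize col)]))))

;; e.g. (tf-idf-matrix {"doc1" ["music" "band" "music"] "doc2" ["open" "data"]})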
The first semantic interpreter we will create is composed of the
general corpus which has 24,374 Wikipedia pages and the music
domain-specific corpus composed of 62 Wikipedia pages. The 62
Wikipedia pages that compose the music domain corpus come from the
selected KBpedia reference concepts and their sub-classes that we
defined in the Define The Specific Domain Training
Corpus section above.
(load-dictionaries "resources/general-corpus-dictionary.csv"
                   "resources/domain-corpus-dictionary--base.csv")

(build-semantic-interpreter "base"
                            "resources/semantic-interpreters/base/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))
Before building the SVM classifier, we have to create a gold standard that we will use to evaluate the performance of the models we will test. What I did is to aggregate a list of news feeds from the CBC and from Reuters and then crawl each of them to get the news items they contained. Once aggregated in a spreadsheet, I manually classified each item. The result is a gold standard of 336 news pages which were classified as being related to the music domain or not. It can be downloaded from here.
Subsequently, three days later, I re-crawled the same feeds to create a second gold standard that has 345 news pages. It can be downloaded from here. I will use both to evaluate the different SVM models we will create below. (I created the two standards because of some internal tests and statistics we are compiling.)
Both gold standards got created this way:
(defn create-gold-standard-from-feeds [name]
  (let [feeds ["http://rss.cbc.ca/lineup/topstories.xml"
               "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml"
               "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml"
               "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml"
               "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml"
               "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml"
               "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml"
               "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml"
               "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml"
               "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml"
               "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews"
               "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews"
               "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews"
               "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews"
               "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews"
               "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews"
               "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]
    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          ;; the class column is left empty; it gets filled in manually
          (csv/write-csv out-file [["" (:title item) (:link item)]] :append true))))))
Each of the different models we will test in the next sections will be evaluated using the following function:
(defn evaluate-model [evaluation-no gold-standard-file]
  (let [gold-standard (rest (with-open [in-file (io/reader gold-standard-file)]
                              (doall (csv/read-csv in-file))))
        true-positive (atom 0)
        false-positive (atom 0)
        true-negative (atom 0)
        false-negative (atom 0)]
    (with-open [out-file (io/writer (str "resources/evaluate-" evaluation-no ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [[class title url] gold-standard]
        ;; cache each gold standard page locally before classifying it
        (when-not (.exists (io/as-file (str "resources/gold-standard-cache/" (md5 url))))
          (spit (str "resources/gold-standard-cache/" (md5 url)) (slurp url)))
        (let [predicted-class (classify-text (-> (slurp (str "resources/gold-standard-cache/" (md5 url)))
                                                 defluff-content))]
          (println predicted-class " :: " title)
          (csv/write-csv out-file [[predicted-class title url]] :append true)
          (when (and (= class "1") (= predicted-class 1.0)) (swap! true-positive inc))
          (when (and (= class "0") (= predicted-class 1.0)) (swap! false-positive inc))
          (when (and (= class "0") (= predicted-class 0.0)) (swap! true-negative inc))
          (when (and (= class "1") (= predicted-class 0.0)) (swap! false-negative inc))))
      (println "True positive: " @true-positive)
      (println "False positive: " @false-positive)
      (println "True negative: " @true-negative)
      (println "False negative: " @false-negative)
      (println)
      (let [precision (float (/ @true-positive (+ @true-positive @false-positive)))
            recall (float (/ @true-positive (+ @true-positive @false-negative)))]
        (println "Precision: " precision)
        (println "Recall: " recall)
        (println "Accuracy: " (float (/ (+ @true-positive @true-negative)
                                        (+ @true-positive @false-negative @false-positive @true-negative))))
        (println "F1: " (float (* 2 (/ (* precision recall) (+ precision recall)))))))))
What this function does is to calculate the number of
true-positive, false-positive,
true-negative and false-negative scores
within the gold standard by applying the current model, and then to
calculate the precision, recall,
accuracy and F1 metrics. You can read
more about how binary classifiers can be evaluated from
here.
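For reference, the four metrics are computed from these counts as:

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$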
Now that we have numeric vector representations of the music domain, and a way to evaluate the quality of the models we will be creating, we can create and evaluate our prediction models.
The classification algorithm I chose to use for this article is the Support Vector Machine (SVM). I use the Java port of the LIBLINEAR library. Let’s create the first SVM model:
(build-svm-model-vectors "resources/svm/base/")

(train-svm-model "svm.w0" "resources/svm/base/"
                 :weights nil :v nil :c 1 :algorithm :l2l2)
This initial model is created using a training set composed of 24,311 documents that don’t belong to the class (the music specific domain) and 62 documents that do belong to that class.
Now, let’s evaluate how this initial model performs against the two gold standards:
(evaluate-model "w0" "resources/gold-standard-1.csv" )
True positive: 5
False positive: 0
True negative: 310
False negative: 21

Precision: 1.0
Recall: 0.1923077
Accuracy: 0.9375
F1: 0.32258064
(evaluate-model "w0" "resources/gold-standard-2.csv" )
True positive: 2
False positive: 1
True negative: 319
False negative: 23

Precision: 0.6666667
Recall: 0.08
Accuracy: 0.93043476
F1: 0.14285713
Well, this first run looks really poor! The issue here is a common one with how the SVM classifier is being used. Ideally, the number of documents that belong to the class and the number that do not should be about the same. However, because of the way we defined the music specific domain, and because of the way we created the training corpuses, we ended up with two really unbalanced sets of training documents: 24,311 that don’t belong to the class and only 62 that do. That is why we are getting these kinds of poor results.
What can we do from here? We have two possibilities:
Let’s test both options. We will initially play with the weights to see how much we can improve the current situation.
What we will do now is to create a series of models that differ in the weight we assign to the positive class in the SVM process.
(train-svm-model "svm.w10" "resources/svm/base/" :weights {1 10.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w10" "resources/gold-standard-1.csv")
True positive: 17
False positive: 1
True negative: 309
False negative: 9

Precision: 0.9444444
Recall: 0.65384614
Accuracy: 0.9702381
F1: 0.77272725
(evaluate-model "w10" "resources/gold-standard-2.csv")
True positive: 15
False positive: 2
True negative: 318
False negative: 10

Precision: 0.88235295
Recall: 0.6
Accuracy: 0.9652174
F1: 0.71428573
This is already a clear improvement for both gold standards. Let’s see if we continue to see improvements if we continue to increase the weight.
(train-svm-model "svm.w25" "resources/svm/base/" :weights {1 25.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w25" "resources/gold-standard-1.csv")
True positive: 20
False positive: 3
True negative: 307
False negative: 6

Precision: 0.8695652
Recall: 0.7692308
Accuracy: 0.97321427
F1: 0.8163265
(evaluate-model "w25" "resources/gold-standard-2.csv")
True positive: 21
False positive: 5
True negative: 315
False negative: 4

Precision: 0.8076923
Recall: 0.84
Accuracy: 0.973913
F1: 0.82352936
The general metrics continued to improve. By increasing the weight, the precision dropped a little bit, but the recall improved quite a bit. The overall F1 score significantly improved. Let’s see with the weight at 50.
(train-svm-model "svm.w50" "resources/svm/base/" :weights {1 50.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w50" "resources/gold-standard-1.csv")
True positive: 23
False positive: 7
True negative: 303
False negative: 3

Precision: 0.76666665
Recall: 0.88461536
Accuracy: 0.9702381
F1: 0.82142854
(evaluate-model "w50" "resources/gold-standard-2.csv")
True positive: 23
False positive: 6
True negative: 314
False negative: 2

Precision: 0.79310346
Recall: 0.92
Accuracy: 0.9768116
F1: 0.8518519
The trend continues: a decline in precision, an increase in recall, and a better overall F1 score in both cases. Let’s try with a weight of 200.
(train-svm-model "svm.w200" "resources/svm/base/" :weights {1 200.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w200" "resources/gold-standard-1.csv")
True positive: 23
False positive: 7
True negative: 303
False negative: 3

Precision: 0.76666665
Recall: 0.88461536
Accuracy: 0.9702381
F1: 0.82142854
(evaluate-model "w200" "resources/gold-standard-2.csv")
True positive: 23
False positive: 6
True negative: 314
False negative: 2

Precision: 0.79310346
Recall: 0.92
Accuracy: 0.9768116
F1: 0.8518519
Results are the same; it looks like increasing the weight beyond a certain point no longer adds to the predictive power. However, the goal of this article is not to be an SVM parametrization tutorial. Many other tests could be done, such as testing different values for the different SVM parameters like the C parameter and others.
Now let’s see if we can improve the performance of the model even more by adding new documents that belong to the class we want to define in the SVM model. The idea of adding documents is good, but how may we quickly process thousands of new documents that belong to that class? Easy: we will use the KBpedia Knowledge Graph and its linkage to entities that exist in the KBpedia Knowledge Base to get thousands of new documents highly related to the music domain we are defining.
Here is how we will proceed. See how we use the
type relationship between the classes and their
individuals:
The millions of completely typed instances in KBpedia enable us to retrieve such large training sets efficiently and quickly.
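Conceptually, gathering those new training documents amounts to a query of the following shape (a sketch only: the kko:wikipediaPage predicate is an assumed placeholder, not the actual KBpedia vocabulary):

;; Hypothetical SPARQL shape for retrieving the Wikipedia pages of all
;; entities typed with a given domain reference concept.
(def album-entity-pages-query
  "SELECT ?page
   WHERE {
     ?entity rdf:type <http://kbpedia.org/kko/rc/Album-CW> .
     ?entity kko:wikipediaPage ?page .
   }")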
To extend the music domain model, I added about 5,000 album, musician and band documents using the relationship querying strategy outlined in the figure above. What I did is just to add 3 new features, but with thousands of new training documents in the corpus.
What I had to do was to:
(extend-domain-pages-with-entities)

(cache-corpus)
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary--extended.csv") (build-semantic-interpreter "domain-extended" "resources/semantic-interpreters/domain-extended/" (distinct (concat (get-domain-pages) (get-general-pages)))) (build-svm-model-vectors "resources/svm/domain-extended/")
Just like we did for the first series of tests, we will now create different SVM models and evaluate them. Since we now have a nearly balanced training corpus, we will test much smaller weights (no weights at first, then a weight of 2).
(train-svm-model "svm.w0" "resources/svm/domain-extended/" :weights nil :v nil :c 1 :algorithm :l2l2) (evaluate-model "w0" "resources/gold-standard-1.csv")
True positive: 20
False positive: 12
True negative: 298
False negative: 6

Precision: 0.625
Recall: 0.7692308
Accuracy: 0.9464286
F1: 0.6896552
(evaluate-model "w0" "resources/gold-standard-2.csv")
True positive: 18
False positive: 17
True negative: 303
False negative: 7

Precision: 0.51428574
Recall: 0.72
Accuracy: 0.93043476
F1: 0.6
As we can see, the model is scoring much better than the previous one when the weight is zero. However, it is not as good as the previous one when weights are modified. Let’s see if we can benefit from increasing the weight for this new training set:
(train-svm-model "svm.w2" "resources/svm/domain-extended/" :weights {1 2.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w2" "resources/gold-standard-1.csv")
True positive: 21
False positive: 23
True negative: 287
False negative: 5

Precision: 0.47727272
Recall: 0.8076923
Accuracy: 0.9166667
F1: 0.59999996
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive: 20
False positive: 33
True negative: 287
False negative: 5

Precision: 0.3773585
Recall: 0.8
Accuracy: 0.8898551
F1: 0.51282054
Overall the models seem worse with weight 2; let’s try with weight 5:
(train-svm-model "svm.w5" "resources/svm/domain-extended/" :weights {1 5.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w5" "resources/gold-standard-1.csv")
True positive: 25
False positive: 52
True negative: 258
False negative: 1

Precision: 0.32467532
Recall: 0.96153843
Accuracy: 0.8422619
F1: 0.4854369
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive: 23
False positive: 62
True negative: 258
False negative: 2

Precision: 0.27058825
Recall: 0.92
Accuracy: 0.81449276
F1: 0.41818184
The performances are just getting worse, but this makes sense. Now that the training set is balanced, many more tokens participate in the semantic interpreter, and so in the vectors generated by it and used by the SVM. If we increase the weight of a balanced training set, then this intuitively should re-unbalance it and worsen performance. This is what is apparently happening.
Re-balancing the training set using this strategy does not appear to improve the prediction model, at least not for this domain and not for these SVM parameters.
So far, we have been able to test different kind of strategies to create different training corpuses, to select different features, etc. We have been able to do this within a day, mostly waiting for the desktop computer to build the semantic interpreter and the vectors for the training sets. It has been possible thanks to the KBpedia Knowledge Graph that enabled us to easily and automatically slice-and-dice the knowledge structure to perform all these tests quickly and efficiently.
There are other things we could do to continue to improve the prediction model, such as manually selecting features returned by KBpedia. Then we could test different parameters of the SVM classifier, etc. However, such tweaks are the possible topics of later use cases.
Let me add a few additional words about multiclass classification. As we saw, we can easily define domains by selecting one or multiple KBpedia reference concepts and all of their sub-classes. This general process enables us to scope any domain we want to cover. Then we can use the KBpedia Knowledge Graph’s relationship with external data sources to create the training corpus for the scoped domain. Finally, we can use SVM as a binary classifier to determine if an input text belongs to the domain or not. However, what if we want to classify an input text with more than one domain?
This can easily be done by using the one-vs-rest (also called one-vs-all) multiclass classification strategy. The only thing we have to do is to define multiple domains of interest, and then to create an SVM model for each of them. As noted above, this effort is almost solely one of posing one or more queries to KBpedia for a given domain. Finally, to predict if an input text belongs to any of the domain models we defined, we need to apply an SVM option (like LIBLINEAR) that already implements multi-class SVM classification.
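A small sketch of what that could look like in this setting (illustrative: the domain list and the classify-text-with-model helper, returning 1.0 for membership, are assumptions):

;; One-vs-rest prediction over several binary domain models.
(def domain-models
  ;; Illustrative domain -> model path mapping.
  {:music      "resources/svm/music/"
   :sports     "resources/svm/sports/"
   :technology "resources/svm/technology/"})

(defn classify-multiclass
  "Returns the set of domains whose binary SVM model accepts `text`.
   `classify-text-with-model` is an assumed helper that loads the model
   at the given path and returns 1.0 or 0.0 for the input text."
  [text]
  (->> domain-models
       (keep (fn [[domain model-path]]
               (when (= 1.0 (classify-text-with-model model-path text))
                 domain)))
       (set)))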
In this article, we tested multiple different strategies to create a good prediction model using SVM to classify input texts into a music-related class. We tested unbalanced training corpuses, balanced training corpuses, different sets of features, etc. Some of these tests improved the prediction model; others made it worse. The key point to remember is that any machine learning effort requires bounding, labeling, testing and refining multiple parameters in order to obtain the best results. Use of the KBpedia Knowledge Graph and its linkage to external public datasets enables Cognonto to do these previously lengthy and time-consuming tasks quickly and efficiently.
Within a few hours, we created a classifier with an accuracy of about 97% that determines whether an input text belongs to the music domain or not. We demonstrated how we can create such classifiers more-or-less automatically, using the KBpedia Knowledge Graph to define the scope of the domain and to classify new text into that domain based on relevant KBpedia reference concepts. Finally, we noted how we may create multi-class classifiers using exactly the same mechanisms.
Posted at 00:49
Here’s how to make a presence robot with Chromium 51, WebRTC, Raspberry Pi 3 and EasyRTC. It’s actually very easy, especially now that Chromium 51 comes with Raspbian Jessie, although it’s taken me a long time to find the exact incantation.
If you’re going to use it for real, I’d suggest using the
Posted at 21:53
For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.
XYZ is “open” because:
I gather that at
Posted at 14:51
In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)
To define the scope of those standards, let’s try and answer two questions.
Question 1: What are the various activities that we might want to carry out around an open dataset?
Question 2: What are the various activities that we might want to carry out around an open data catalogue?
Now, based on that quick review: which of these areas of functionality are covered by existing standards?
Posted at 14:22
Marvin Frommhold will
discuss the paper “Version Control for RDF Triple Stores”
by Steve Cassidy and James Ballantine which forms the foundation of
his own work regarding versioning for RDF.
Abstract: RDF, the core data format for the Semantic Web, is increasingly being deployed both from automated sources and via human authoring either directly or through tools that generate RDF output. As individuals build up large amounts of RDF data and as groups begin to collaborate on authoring knowledge stores in RDF, the need for some kind of version management becomes apparent. While there are many version control systems available for program source code and even for XML data, the use of version control for RDF data is not a widely explored area. This paper examines an existing version control system for program source code, Darcs, which is grounded in a semi-formal theory of patches, and proposes an adaptation to directly manage versions of an RDF triple store.
NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation
Afterwards, Diego Esteves will present the paper “NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation” by Mena B. Habib and Maurice van Keulen, which was accepted at ACL 2015.
Abstract: In this demo paper, we present NEED4Tweet, a Twitterbot for named entity extraction (NEE) and disambiguation (NED) for Tweets. The straightforward application of state-of-the-art extraction and disambiguation approaches on informal text widely used in Tweets, typically results in significantly degraded performance due to the lack of formal structure; the lack of sufficient context required; and the seldom entities involved. In this paper, we introduce a novel framework that copes with the introduced challenges. We rely on contextual and semantic features more than syntactic features which are less informative. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language.
This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
Posted at 07:55
Dear all,
the LIMES Dev team is happy to announce LIMES 1.0.0.
LIMES, the Link Discovery Framework for Metric Spaces, is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. Our approaches facilitate different approximation techniques to compute estimates of the similarity between instances. These estimates are then used to filter out a large amount of those instance pairs that do not satisfy the mapping conditions. By these means, LIMES can reduce the number of comparisons needed during the mapping process by several orders of magnitude. The approaches implemented in LIMES include the original LIMES algorithm for edit distances, HR3, HYPPO and ORCHID.
Additionally, LIMES supports HELIOS, the first planning technique for link discovery, which minimizes the overall execution time of a link specification without any loss of completeness. Moreover, LIMES implements supervised and unsupervised machine-learning algorithms for finding accurate link specifications. The algorithms implemented here include the supervised, active and unsupervised versions of EAGLE and WOMBAT.
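As a rough illustration of the underlying metric-space idea (a sketch only, not LIMES code): in any metric space the triangle inequality gives a cheap lower bound on the distance between two instances via an already-computed exemplar point, which lets many pairs be discarded without computing their actual distance.

;; Triangle-inequality filtering in a metric space: since
;; d(x, y) >= |d(x, e) - d(e, y)| for any exemplar e, a pair whose lower
;; bound already exceeds the threshold cannot match and is skipped.
(defn candidate-pairs
  [dist exemplar xs ys threshold]
  (let [dx (into {} (map (fn [x] [x (dist x exemplar)]) xs))
        dy (into {} (map (fn [y] [y (dist exemplar y)]) ys))]
    (for [x xs
          y ys
          :when (<= (Math/abs (double (- (dx x) (dy y)))) threshold)]
      [x y])))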
Website: http://aksw.org/Projects/LIMES.html
Download: https://github.com/AKSW/LIMES-dev/releases/tag/1.0.0
GitHub: https://github.com/AKSW/LIMES-dev
User manual: http://aksw.github.io/LIMES-dev/user_manual/
Developer manual: http://aksw.github.io/LIMES-dev/developer_manual/
What is new in LIMES 1.0.0:
We would like to thank everyone who helped to create this release. We also acknowledge the support of the SAKE and HOBBIT projects.
Kind regards,
Posted at 09:38
Dear all,
the Smart Data Analytics group at AKSW is happy to announce DL-Learner 1.3.
DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.
Website: http://dl-learner.org
GitHub page: https://github.com/AKSW/DL-Learner
Download: https://github.com/AKSW/DL-Learner/releases
ChangeLog: http://dl-learner.org/development/changelog/
DL-Learner is used for data analysis tasks within other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern-based and evolutionary techniques for learning on structured data. For a practical example, see http://dl-learner.org/community/carcinogenesis/. It also offers a plugin for Protégé, which can give suggestions for axioms to add.
In the current release, we added a large number of new algorithms and features. For instance, DL-Learner supports terminological decision tree learning, it integrates the LEAP and EDGE systems as well as the BUNDLE probabilistic OWL reasoner. We migrated the system to Java 8, Jena 3, OWL API 4.2 and Spring 4.3. We want to point to some related efforts here:
We want to thank everyone who helped to create this release, in particular we want to thank Giuseppe Cota, who visited the core developer team and significantly improved DL-Learner. We also acknowledge support by the recently started SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as the Big Data Europe and HOBBIT projects.
Kind regards,
Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin
Posted at 19:41