It's triples all the way down
Posted at 12:52
The meeting, hosted by our partner InfAI e. V., took place on the 14th to the
15th of December at the University of Leipzig.
A total of 29 attendees, representing 15 partners, discussed and reviewed the progress of all work packages in 2016 and planned the activities and workshops taking place in the next six months.
On the second day we talked about several societal challenge pilots, in fields such as agriculture (with AgroKnow), transport and security. It was the last plenary of this year, and we thank everybody for their work in 2016. Big Data Europe and our partners are looking forward to 2017.
The next Plenary Meeting will be hosted by VU Amsterdam and will take place in Amsterdam, in June 2017.
Posted at 23:59
Dear all,
The Smart Data Analytics (SDA) group / AKSW is very happy to announce SANSA 0.1 – the initial release of the Scalable Semantic Analytics Stack. SANSA combines distributed computing and semantic technologies to provide powerful machine learning, inference and querying capabilities for large knowledge graphs.
Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases
You can find the FAQ and usage examples at http://sansa-stack.net/faq/.
The following features are currently supported by SANSA:
We want to thank everyone who helped to create this release, in particular, the projects Big Data Europe, HOBBIT and SAKE.
Kind regards,
Posted at 14:41
Our paper, “LODStats: The Data Web Census Dataset”, won the award for Best Resources Paper at the recent ISWC 2016 conference in Kobe, Japan, the premier international forum for the Semantic Web and Linked Data community. The paper presents the LODStats dataset, which provides a comprehensive picture of the current state of a significant part of the Data Web.
Congrats to Ivan Ermilov, Jens Lehmann, Michael Martin and Sören Auer.
Please find the complete list of winners here.
Posted at 14:05

Knowledge graphs are structured representations of facts where nodes are real-world entities or events and edges are the associations between pairs of entities. Knowledge graphs can be constructed using automatic or manual techniques. Manual techniques construct high-quality knowledge graphs but are expensive, time-consuming and not scalable. Hence, automatic information extraction techniques are used to create scalable knowledge graphs, but the extracted information can be of poor quality due to the presence of dubious facts.
An extracted fact is dubious if it is incorrect, inexact, or correct but lacking evidence. A fact might be dubious because of errors made by NLP extraction techniques, improper design of the internal components of the system, the choice of learning techniques (semi-supervised or unsupervised), the relatively poor quality of heuristics, or the syntactic complexity of the underlying text. A preliminary analysis of several knowledge extraction systems (CMU’s NELL and JHU’s KELVIN) and observations from the literature suggest that dubious facts can be identified, diagnosed and managed. In this dissertation, I will explore approaches to identify and repair such dubious facts in a knowledge graph using several complementary approaches, including linguistic analysis, common sense reasoning, and entity linking.
Committee: Drs. Tim Finin (Chair), Anupam Joshi, Tim Oates, Paul McNamee (JHU), Partha Talukdar (IISc, India)
Posted at 02:25
Diego Moussallem will discuss the paper “Probabilistic Bag-Of-Hyperlinks Model for Entity Linking” by Octavian-Eugen Ganea et al., which was accepted at WWW 2016.
Abstract: Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referenced as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e., linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods.
Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web
Afterward, René Speck will present the paper “Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web” by Sebastian Krause et al., which was accepted at ISWC 2012.
Abstract: We present a large-scale relation extraction (RE) system which learns grammar-based RE rules from the Web by utilizing large numbers of relation instances as seed. Our goal is to obtain rule sets large enough to cover the actual range of linguistic variation, thus tackling the long-tail problem of real-world applications. A variant of distant supervision learns several relations in parallel, enabling a new method of rule filtering. The system detects both binary and n-ary relations. We target 39 relations from Freebase, for which 3M sentences extracted from 20M web pages serve as the basis for learning an average of 40K distinctive rules per relation. Employing an efficient dependency parser, the average run time for each relation is only 19 hours. We compare these rules with ones learned from local corpora of different sizes and demonstrate that the Web is indeed needed for a good coverage of linguistic variation.
This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
Posted at 11:30
In previous articles I have covered multiple ways to create training corpuses for unsupervised learning, and positive and negative training sets for supervised learning 1, 2, 3, using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets. Each of these corpuses or sets may yield different predictive powers depending on the task at hand.
So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:
Now we will introduce a third way to create a different kind of training corpus:
Aspects are aggregations of entities that are grouped according to characteristics other than their direct types. Aspects help to group related entities by situation, and not by identity nor definition. They are another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not all aspects relate to a given entity.
To continue with the musical domain, there exist two aspects of interest:
What we will do first is to query the KBpedia Knowledge Graph using the SPARQL query language to get the list of all of the KBpedia reference concepts that are related to the Music or the Genre aspects. Then, for each of these reference concepts, we will count the number of named entities that can be reached in the complete KBpedia structure.
prefix kko: <http://kbpedia.org/ontologies/kko#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix schema: <http://schema.org/>

select ?class (count(distinct ?entity) as ?nb)
from <http://dbpedia.org>
from <http://www.uspto.gov>
from <http://wikidata.org>
from <http://kbpedia.org/1.10/>
where
{
  ?entity dcterms:subject ?category .
  graph <http://kbpedia.org/1.10/>
  {
    {?category kko:hasMusicAspect ?class .}
    union
    {?category kko:hasGenre ?class .}
  }
}
group by ?class
order by desc(?nb)
| reference concept | nb |
|---|---|
| http://kbpedia.org/kko/rc/Album-CW | 128772 |
| http://kbpedia.org/kko/rc/Song-CW | 74886 |
| http://kbpedia.org/kko/rc/Music | 51006 |
| http://kbpedia.org/kko/rc/Single | 50661 |
| http://kbpedia.org/kko/rc/RecordCompany | 5695 |
| http://kbpedia.org/kko/rc/MusicalComposition | 5272 |
| http://kbpedia.org/kko/rc/MovieSoundtrack | 2919 |
| http://kbpedia.org/kko/rc/Lyric-WordsToSong | 2374 |
| http://kbpedia.org/kko/rc/Band-MusicGroup | 2185 |
| http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup | 2078 |
| http://kbpedia.org/kko/rc/Ensemble | 1438 |
| http://kbpedia.org/kko/rc/Orchestra | 1380 |
| http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup | 1335 |
| http://kbpedia.org/kko/rc/Choir | 754 |
| http://kbpedia.org/kko/rc/Concerto | 424 |
| http://kbpedia.org/kko/rc/Symphony | 299 |
| http://kbpedia.org/kko/rc/Singing | 154 |
Seventeen KBpedia reference concepts are related to the two aspects we want to focus on. The next step is to take these 17 reference concepts and create a new domain corpus with them. We will use the new version of KBpedia to create, by inference, the full set of reference concepts that will scope our domain.
Next we will try to use this information to create two totally different kinds of training corpuses:
The first training corpus we want to test is one that uses the linkage between KBpedia reference concepts and Wikipedia pages. The first step is to generate the domain training corpus with the 17 seed reference concepts and then to infer other related reference concepts.
(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Song-CW"
                       "http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Single"
                       "http://kbpedia.org/kko/rc/RecordCompany"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MovieSoundtrack"
                       "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                       "http://kbpedia.org/kko/rc/Band-MusicGroup"
                       "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Ensemble"
                       "http://kbpedia.org/kko/rc/Orchestra"
                       "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Choir"
                       "http://kbpedia.org/kko/rc/Symphony"
                       "http://kbpedia.org/kko/rc/Singing"
                       "http://kbpedia.org/kko/rc/Concerto"]
  kbpedia
  "resources/aspects-concept-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

(create-pruned-pages-dictionary-csv "resources/aspects-concept-corpus-dictionary.csv"
                                    "resources/aspects-concept-corpus-dictionary.pruned.csv"
                                    "resources/aspects-corpus-normalized/")
Once pruned, we end up with a domain that has 108 reference concepts, which will enable us to create models with 108 features. The next step is to create the actual semantic interpreter and the SVM models:
;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/aspects-concept-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-concept-pruned"
                            "resources/semantic-interpreters/aspects-concept-pruned/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-concept-pruned/"
                         :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.concept.pruned"
                 "resources/svm/aspects-concept-pruned/"
                 :weights nil :v nil :c 1 :algorithm :l2l2)
Then we have to evaluate this new model using the gold standard:
(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive: 28
False positive: 0
True negative: 923
False negative: 66
Precision: 1.0
Recall: 0.29787233
Accuracy: 0.93510324
F1: 0.45901638
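To make these numbers concrete, each metric follows directly from the four confusion-matrix counts above. Here is a small self-contained Clojure sketch that reproduces them (the helper functions are mine, for illustration only; they are not part of the cognonto-esa API):

;; Standard classification metrics from confusion-matrix counts.
;; Illustrative helpers; not part of cognonto-esa.
(defn precision [tp fp] (/ tp (+ tp fp)))
(defn recall    [tp fn'] (/ tp (+ tp fn')))
(defn accuracy  [tp fp tn fn'] (/ (+ tp tn) (+ tp fp tn fn')))
(defn f1        [p r] (/ (* 2 p r) (+ p r)))

(let [tp 28 fp 0 tn 923 fn' 66
      p (double (precision tp fp))
      r (double (recall tp fn'))]
  {:precision p                                 ;; => 1.0
   :recall    r                                 ;; => ~0.2979
   :accuracy  (double (accuracy tp fp tn fn'))  ;; => ~0.9351
   :f1        (f1 p r)})                        ;; => ~0.4590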
Now let’s try to find better hyperparameters using grid search:
(svm-grid-search "grid-search-aspects-concept-pruned-tests" "resources/svm/aspects-concept-pruned/" "resources/gold-standard-full.csv" :selection-metric :f1 :grid-parameters [{:c [1 2 4 16 256] :e [0.001 0.01 0.1] :algorithm [:l2l2] :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.84444445
:c 1
:e 0.001
:algorithm :l2l2
:weight 30}
After running the grid search with these initial broad-range values, we found a configuration that gives us 0.8444 for the F1 score. So far, this is the best score we have gotten to date for the full gold standard 2, 3. Let’s see all of the metrics for this configuration:
(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/" :weights {1 30.0} :v nil :c 1 :e 0.001 :algorithm :l2l2) (evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive: 76
False positive: 10
True negative: 913
False negative: 18
Precision: 0.88372093
Recall: 0.80851066
Accuracy: 0.972468
F1: 0.84444445
These results are also the best balance between precision and recall that we have gotten so far 2, 3. Better precision can be obtained if necessary, but only at the expense of lower recall.
Let’s take a look at the improvements we got compared to the previous training corpuses we had:
| metric | improvement |
|---|---|
| Precision | +4.16% |
| Recall | +35.72% |
| Accuracy | +2.06% |
| F1 | +20.63% |

This new training corpus based on the KBpedia aspects, after hyperparameter optimization, increased all the metrics we calculate. The most striking improvement is the recall, which improved by more than 35%.
The next training corpus we want to test is one that uses the linkage between KBpedia reference concepts and linked external vocabularies to get a series of linked named entities as the positive training set for each of the features of the model.
The first thing to do is to create the positive training set populated with named entities related to the reference concepts. We will get a random sample of ~50 named entities per reference concept:
(require '[cognonto-rdf.query :as query])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.string :as string])

(defn generate-domain-by-rc [rc domain-file nb]
  (with-open [out-file (io/writer domain-file :append true)]
    (doall
     (->> (query/select
           (str "prefix kko: <http://kbpedia.org/ontologies/kko#>
                 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                 prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                 prefix dcterms: <http://purl.org/dc/terms/>

                 select distinct ?entity
                 from <http://dbpedia.org>
                 from <http://www.uspto.gov>
                 from <http://wikidata.org>
                 from <http://kbpedia.org/1.10/>
                 where
                 {
                   ?entity dcterms:subject ?category .
                   graph <http://kbpedia.org/1.10/>
                   {
                     ?category ?aspectProperty <" rc "> .
                   }
                 }
                 ORDER BY RAND()
                 LIMIT " nb)
           kb-connection)
          (map (fn [entity]
                 (csv/write-csv out-file
                                [[(string/replace (:value (:entity entity)) "http://dbpedia.org/resource/" "")
                                  (string/replace rc "http://kbpedia.org/kko/rc/" "")]])))))))

(defn generate-domain-by-rcs [rcs domain-file nb-per-rc]
  (with-open [out-file (io/writer domain-file)]
    (csv/write-csv out-file [["wikipedia-page" "kbpedia-rc"]])
    (doseq [rc rcs]
      (generate-domain-by-rc rc domain-file nb-per-rc))))

(generate-domain-by-rcs ["http://kbpedia.org/kko/rc/" "http://kbpedia.org/kko/rc/Concerto" "http://kbpedia.org/kko/rc/DoubleAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Psychedelic" "http://kbpedia.org/kko/rc/MusicalComposition-Religious" "http://kbpedia.org/kko/rc/PunkMusic"
                         "http://kbpedia.org/kko/rc/BluesMusic" "http://kbpedia.org/kko/rc/HeavyMetalMusic" "http://kbpedia.org/kko/rc/PostPunkMusic"
                         "http://kbpedia.org/kko/rc/CountryRockMusic" "http://kbpedia.org/kko/rc/BarbershopQuartet-MusicGroup" "http://kbpedia.org/kko/rc/FolkMusic"
                         "http://kbpedia.org/kko/rc/Verse" "http://kbpedia.org/kko/rc/RockBand" "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                         "http://kbpedia.org/kko/rc/Refrain" "http://kbpedia.org/kko/rc/MusicalComposition-GangstaRap" "http://kbpedia.org/kko/rc/MusicalComposition-Klezmer"
                         "http://kbpedia.org/kko/rc/HouseMusic" "http://kbpedia.org/kko/rc/MusicalComposition-AlternativeCountry" "http://kbpedia.org/kko/rc/PsychedelicMusic"
                         "http://kbpedia.org/kko/rc/ReggaeMusic" "http://kbpedia.org/kko/rc/AlternativeRockBand" "http://kbpedia.org/kko/rc/AlternativeRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Trance" "http://kbpedia.org/kko/rc/Ensemble" "http://kbpedia.org/kko/rc/RhythmAndBluesMusic"
                         "http://kbpedia.org/kko/rc/NewAgeMusic" "http://kbpedia.org/kko/rc/RockabillyMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Blues"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Opera" "http://kbpedia.org/kko/rc/Choir" "http://kbpedia.org/kko/rc/SurfMusic"
                         "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup" "http://kbpedia.org/kko/rc/MusicalComposition-JazzRock" "http://kbpedia.org/kko/rc/MusicalComposition-Country"
                         "http://kbpedia.org/kko/rc/CountryMusic" "http://kbpedia.org/kko/rc/MusicalComposition-PopRock" "http://kbpedia.org/kko/rc/MusicalComposition-Romantic"
                         "http://kbpedia.org/kko/rc/Recitative" "http://kbpedia.org/kko/rc/Chorus" "http://kbpedia.org/kko/rc/FusionMusic"
                         "http://kbpedia.org/kko/rc/MovieSoundtrack" "http://kbpedia.org/kko/rc/GreatestHitsAlbum-CW" "http://kbpedia.org/kko/rc/MusicalComposition-Christian"
                         "http://kbpedia.org/kko/rc/ClassicalMusic-Baroque" "http://kbpedia.org/kko/rc/MusicalComposition-NewAge" "http://kbpedia.org/kko/rc/MusicalComposition-TraditionalPop"
                         "http://kbpedia.org/kko/rc/TranceMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Celtic" "http://kbpedia.org/kko/rc/LoungeMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Reggae" "http://kbpedia.org/kko/rc/MusicalComposition-Baroque" "http://kbpedia.org/kko/rc/Trio-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/Symphony" "http://kbpedia.org/kko/rc/MusicalComposition-RockAndRoll" "http://kbpedia.org/kko/rc/PopRockMusic"
                         "http://kbpedia.org/kko/rc/IndustrialMusic" "http://kbpedia.org/kko/rc/JazzMusic" "http://kbpedia.org/kko/rc/MusicalChord"
                         "http://kbpedia.org/kko/rc/ProgressiveRockMusic" "http://kbpedia.org/kko/rc/GothicMusic" "http://kbpedia.org/kko/rc/LiveAlbum-CW"
                         "http://kbpedia.org/kko/rc/NewWaveMusic" "http://kbpedia.org/kko/rc/NationalAnthem" "http://kbpedia.org/kko/rc/OldieSong"
                         "http://kbpedia.org/kko/rc/Song-Sung" "http://kbpedia.org/kko/rc/RockMusic" "http://kbpedia.org/kko/rc/Aria"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Disco" "http://kbpedia.org/kko/rc/GospelMusic" "http://kbpedia.org/kko/rc/BluegrassMusic"
                         "http://kbpedia.org/kko/rc/FolkRockMusic" "http://kbpedia.org/kko/rc/RockAndRollMusic" "http://kbpedia.org/kko/rc/Opera-CW"
                         "http://kbpedia.org/kko/rc/HitSong-CW" "http://kbpedia.org/kko/rc/Tune" "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/RapMusic" "http://kbpedia.org/kko/rc/RecordCompany" "http://kbpedia.org/kko/rc/MusicalComposition-ACappella"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Electronica" "http://kbpedia.org/kko/rc/Music" "http://kbpedia.org/kko/rc/GlamRockMusic"
                         "http://kbpedia.org/kko/rc/LoveSong" "http://kbpedia.org/kko/rc/MusicalComposition-Gothic" "http://kbpedia.org/kko/rc/MarchingBand"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Punk" "http://kbpedia.org/kko/rc/BluesRockMusic" "http://kbpedia.org/kko/rc/TechnoMusic"
                         "http://kbpedia.org/kko/rc/SoulMusic" "http://kbpedia.org/kko/rc/ChamberMusicComposition" "http://kbpedia.org/kko/rc/Requiem"
                         "http://kbpedia.org/kko/rc/MusicalComposition" "http://kbpedia.org/kko/rc/ElectronicMusic" "http://kbpedia.org/kko/rc/CompositionMovement"
                         "http://kbpedia.org/kko/rc/StringQuartet-MusicGroup" "http://kbpedia.org/kko/rc/Riff" "http://kbpedia.org/kko/rc/Anthem"
                         "http://kbpedia.org/kko/rc/HardRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-BluesRock" "http://kbpedia.org/kko/rc/MusicalComposition-Cyberpunk"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Industrial" "http://kbpedia.org/kko/rc/MusicalComposition-Funk" "http://kbpedia.org/kko/rc/Album-CW"
                         "http://kbpedia.org/kko/rc/HipHopMusic" "http://kbpedia.org/kko/rc/Single" "http://kbpedia.org/kko/rc/Singing"
                         "http://kbpedia.org/kko/rc/SwingMusic" "http://kbpedia.org/kko/rc/Song-CW" "http://kbpedia.org/kko/rc/SalsaMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Jazz" "http://kbpedia.org/kko/rc/ClassicalMusic" "http://kbpedia.org/kko/rc/MilitaryBand"
                         "http://kbpedia.org/kko/rc/SkaMusic" "http://kbpedia.org/kko/rc/Orchestra" "http://kbpedia.org/kko/rc/GrungeRockMusic"
                         "http://kbpedia.org/kko/rc/SouthernRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Ambient" "http://kbpedia.org/kko/rc/DiscoMusic"]
                        "resources/aspects-domain-corpus.csv"
                        50) ;; ~50 named entities per reference concept, per the text above
Next let’s create the actual positive training corpus and normalize it:
(cache-aspects-corpus "resources/aspects-entities-corpus.csv" "resources/aspects-corpus/")
(normalize-cached-corpus "resources/aspects-corpus/" "resources/aspects-corpus-normalized/")
We end up with 22 features for which we can get named entities from the KBpedia Knowledge Base. These will be the 22 features of our model. The complete positive training set has 799 documents in it.
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-entities-corpus-dictionary.pruned.csv") (build-semantic-interpreter "aspects-entities-pruned" "resources/semantic-interpreters/aspects-entities-pruned/" (distinct (concat (get-domain-pages) (get-general-pages)))) (build-svm-model-vectors "resources/svm/aspects-entities-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/") (train-svm-model "svm.aspects.entities.pruned" "resources/svm/aspects-entities-pruned/" :weights nil :v nil :c 1 :algorithm :l2l2)
Now let’s evaluate the model with default hyperparameters:
(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")
True positive: 9
False positive: 10
True negative: 913
False negative: 85
Precision: 0.47368422
Recall: 0.095744684
Accuracy: 0.906588
F1: 0.15929204
Now let’s try to improve this F1 score using grid search:
(svm-grid-search "grid-search-aspects-entities-pruned-tests" "resources/svm/aspects-entities-pruned/" "resources/gold-standard-full.csv" :selection-metric :f1 :grid-parameters [{:c [1 2 4 16 256] :e [0.001 0.01 0.1] :algorithm [:l2l2] :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.44052863
:c 4
:e 0.001
:algorithm :l2l2
:weight 15}
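To inspect the full set of metrics behind this best score, we can retrain with the hyperparameters the grid search selected and re-evaluate, mirroring what we did for the concept-based corpus earlier. This exact call sequence is not from the original run, but it only reuses the functions already shown above:

;; Retrain the linear SVM with the best hyperparameters found above
;; (c = 4, e = 0.001, positive-class weight = 15), then evaluate it
;; against the full gold standard.
(train-svm-model "svm.aspects.entities.pruned"
                 "resources/svm/aspects-entities-pruned/"
                 :weights {1 15.0} :v nil :c 4 :e 0.001 :algorithm :l2l2)

(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")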
We have been able to greatly improve the F1 score by tweaking the hyperparameters, but the results are still disappointing. There are multiple ways to automatically generate training corpuses, but not all of them are created equal. This is why having a pipeline that can automatically create the training corpuses, optimize the hyperparameters and evaluate the models is more than welcome, since this is where the bulk of a data scientist's time is spent when creating models.
After automatically creating multiple different positive and negative training sets, testing multiple learning methods and optimizing hyperparameters, we found the best training sets with the best learning method and the best hyperparameters to create an initial, optimal model that has an accuracy of 97.2%, a precision of 88.4%, a recall of 80.9% and an overall F1 measure of 84.4% on a gold standard created from real, random pieces of news from different general and specialized news sites.
The really interesting and innovative thing in this method is how a knowledge base of concepts and entities can be used to label positive and negative training sets to feed supervised learners, and how the learner can perform well on totally different input text data (in this case, news articles). The same is true when creating training corpuses for unsupervised learning 4.
The most wonderful thing from an operational standpoint is that all of this searching, testing and optimizing can be performed automatically by a computer. The only tasks required of a human are to define the scope of a domain and to manually label a gold standard for performance evaluation and hyperparameter optimization.
Posted at 11:14
In the first part of this series we found good hyperparameters for a single linear SVM classifier. In part 2, we will try another technique to improve the performance of the system: ensemble learning.
So far, we have reached 95% accuracy with some tweaking of the hyperparameters and the training corpuses, but the F1 score is still around ~70% on the full gold standard, which can be improved. There are also situations where precision should be nearly perfect (because false positives are really not acceptable) or where recall should be optimized.
Here we will try to improve this situation by using ensemble learning, which uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In our examples, each model will have a vote, and the weight of the vote will be equal for each model. We will use five different strategies to create the models that will belong to the ensemble:
Different strategies will be used depending on considerations such as: are the positive and negative training documents unbalanced? How many features does the model have? Let’s introduce each of these different strategies.
Note that in this article I am only creating ensembles with linear SVM learners. An ensemble can, however, be composed of multiple different kinds of learners, such as SVMs with non-linear kernels, decision trees, etc. To simplify this article, we will stick to a single linear SVM with multiple different training corpuses and features.
The idea behind bagging is to draw a subset of positive and negative training samples at random and with replacement. Each model of the ensemble will have a different training set, but some of the training samples may appear in multiple different training sets.
Asymmetric bagging has been proposed by Tao, Tang, Li and Wu 1 for cases where the number of positive training samples is largely unbalanced relative to the negative training samples. The idea is to create a subset of random (with replacement) negative training samples, while always keeping the full set of positive training samples.
The idea behind feature bagging is the same as bagging, but it works on the features of the model instead of the training sets. It attempts to reduce the correlation between estimators in an ensemble by training them on random samples of features instead of the entire feature set. A sketch of these three sampling strategies follows.
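To make the three sampling strategies concrete, here is a minimal Clojure sketch of each. The function names and data shapes (plain collections of document identifiers and feature names) are illustrative assumptions, not the actual ensemble code used in this series:

;; Illustrative sampling helpers; not part of the actual ensemble code.

(defn bootstrap-sample
  "Bagging: draw n samples at random, with replacement."
  [docs n]
  (repeatedly n #(rand-nth (vec docs))))

(defn asymmetric-bagging-sample
  "Asymmetric bagging: keep all positives, bootstrap the negatives."
  [positives negatives n-neg]
  {:positives positives
   :negatives (bootstrap-sample negatives n-neg)})

(defn random-subspace
  "Feature bagging / random subspace: pick n-features distinct features."
  [features n-features]
  (take n-features (shuffle (vec features))))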
Asymmetric bagging combined with the random subspace method has also been proposed by Tao, Tang, Li and Wu 1. The problems they had with their content-based image retrieval system are the same ones we have with these kinds of automatic training corpuses generated from a knowledge graph:
The third point is not immediately an issue for us (unless you have a domain with many more features than we had in our example), but it becomes one when we start using asymmetric bagging.
What we want to do here is to implement asymmetric bagging and the random subspace method to create a number of individual models. This method is called ABRS-SVM, which stands for Asymmetric Bagging Random Subspace Support Vector Machines.
The algorithm we will use is:
Bagging with feature bagging is the same as asymmetric bagging with the random subspace method, except that we use bagging instead of asymmetric bagging. (ABRS should be used if your positive training sample is severely unbalanced compared to your negative training sample; otherwise BRS should be used.)
We use the linear Support Vector Machine (SVM) as the learner for the ensemble. What we will be creating is a series of SVM models that differ depending on the ensemble method(s) we use to create the ensemble.
The first step is to create a structure where all the positive and negative training documents have their vector representation. Since this is the task that takes the most time in the whole process, we will calculate them using the (build-svm-model-vectors) function and serialize the structure on the file system. That way, to create the ensemble’s models, we will only have to load it from the file system without having to re-calculate it each time.
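The persistence code itself is not shown in this article, but serializing the structure can be as simple as writing it out as EDN and reading it back, which is consistent with how the svm-grid-search function shown elsewhere in this series reloads model.vectors. The helper names below are mine:

;; A minimal sketch of persisting the computed SVM vectors as EDN.
;; The path convention matches the "model.vectors" file read by
;; svm-grid-search; the helper names are illustrative assumptions.
(defn save-model-vectors [model-vectors path]
  (spit (str path "model.vectors") (pr-str model-vectors)))

(defn load-model-vectors [path]
  (read-string (slurp (str path "model.vectors"))))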
The goal is to create a set of X SVM classifiers, each of which uses a different model. The models can differ in their features or their training corpus. Each classifier will then try to classify an input text according to its own model. Finally, each classifier votes to determine whether that input text belongs, or not, to the domain.
There are four hyperparameters related to ensemble learning:
Other hyperparameters could include those of the linear SVM classifier, but in this example we will simply reuse the best parameters we found above. We now train the ensemble using the (train-ensemble-svm) function.
Once the ensemble is created and trained, we use the (classify-ensemble-text) function to classify an input text using the ensemble we created. That function takes two parameters: :mode, which is the ensemble’s mode, and :vote-acceptance-ratio, which defines the proportion of positive votes required for the ensemble to positively classify the input text. By default, the ratio is 50%, but if you want to optimize the precision of the ensemble, then you may want to increase that ratio to 70% or even 95%, as we will see below.
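The voting logic itself is straightforward. Here is a hedged sketch of what such a vote-based decision could look like (this is not the actual (classify-ensemble-text) implementation; classify-one stands in for whatever per-model classification function is used):

;; Illustrative majority-vote classification; not the actual
;; (classify-ensemble-text) implementation.
(defn classify-by-votes
  "Returns true when at least vote-acceptance-ratio of the models
   classify the text as belonging to the domain."
  [models classify-one text vote-acceptance-ratio]
  (let [votes (count (filter #(classify-one % text) models))]
    (>= (/ votes (count models)) vote-acceptance-ratio)))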
Finally the ensemble, configured with all its hyperparameters, will be evaluated using the (evaluate-ensemble) function, which is the same as the (evaluate-model) function but uses the ensemble instead of a single SVM model to classify all of the articles. As before, we will characterize the assignments in relation to the gold standard.
Let’s now train different ensembles to try to improve the performance of the system.
The current training corpus is highly unbalanced. This is why the first test we will do is to apply the asymmetric bagging strategy. With this strategy, each of the SVM classifiers uses the same positive training set, with the same number of positive documents. However, each of them takes a random sample of negative training documents (drawn with replacement).
(use 'cognonto-esa.core)
(use 'cognonto-esa.ensemble-svm)

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/domain-corpus-dictionary.pruned.csv")
(load-semantic-interpreter "base-pruned" "resources/semantic-interpreters/base-pruned/")

(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.ab.c2.w30"
                    "resources/ensemble-svm/base-pruned/"
                    :mode :ab
                    :weight {1 30.0} :c 2 :e 0.001
                    :nb-models 100 :nb-training-documents 3500)
Now let’s evaluate this ensemble with a vote acceptance ratio of 50%:
(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" "resources/gold-standard-full.csv" :mode :ab :vote-acceptance-ratio 0.50)
True positive: 48
False positive: 6
True negative: 917
False negative: 46
Precision: 0.8888889
Recall: 0.5106383
Accuracy: 0.9488692
F1: 0.6486486
Let’s increase the vote acceptance ratio to 90%:
(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" "resources/gold-standard-full.csv" :mode :ab :vote-acceptance-ratio 0.90)
True positive: 37
False positive: 2
True negative: 921
False negative: 57
Precision: 0.94871795
Recall: 0.39361703
Accuracy: 0.94198626
F1: 0.556391
In both cases, the precision increases considerably compared to the non-ensemble learning results. However, the recall drops at the same time, which drops the F1 score as well. Let’s now try the ABRS method.
The goal of the random subspace method is to select a random set of features. This means that each model will have its own feature set and will make predictions according to it. With the ABRS strategy, we will end up with highly different models, since none will have the same negative training sets nor the same features.
Here what we test is to define each classifier with 65 randomly chosen features out of 174, and to restrict the negative training corpus to 3500 randomly selected documents. We then choose to create 300 models to try to get a really heterogeneous population of models.
(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.abrs.c2.w30"
                    "resources/ensemble-svm/base-pruned/"
                    :mode :abrs
                    :weight {1 30.0} :c 2 :e 0.001
                    :nb-models 300 :nb-features 65 :nb-training-documents 3500)
(evaluate-ensemble "ensemble.base.pruned.abrs.c2.w30" "resources/gold-standard-full.csv" :mode :abrs :vote-acceptance-ratio 0.50)
True positive: 41
False positive: 3
True negative: 920
False negative: 53
Precision: 0.9318182
Recall: 0.43617022
Accuracy: 0.9449361
F1: 0.59420294
For these features and training sets, using the ABRS method did not improve on the AB method we tried above.
This use case shows three totally different ways to use the KBpedia Knowledge Graph to automatically create positive and negative training sets. We demonstrated how the full process can be automated where the only requirement is to get a list of seed KBpedia reference concepts.
We also quantified the impact of using new versions of KBpedia, and how different strategies, techniques or algorithms can have different impacts on the prediction models.
Creating prediction models using supervised machine learning algorithms (currently the bulk of the learners in use) has two global steps:
Unfortunately, today, given the manual efforts required by the first step, the overwhelming portion of time and budget is spent there to create a prediction model. By automating much of this process, Cognonto and KBpedia substantially reduce this effort. Time and budget can now be re-directed to the second step of “dialing in” the learners, where the real payoff occurs.
Further, as we also demonstrated, once we automate this process of labeling and creating reference standards, we can also automate the testing and optimization of multiple different kinds of prediction algorithms, hyperparameter configurations, etc. In short, for both steps, KBpedia provides significant reductions in the time and effort needed to get to the desired results.
Posted at 11:05
In my previous blog post, Create a Domain Text Classifier Using Cognonto, I explained how one can use the KBpedia Knowledge Graph to automatically create positive and negative training corpuses for different machine learning tasks. I explained how SVM classifiers could be trained and used to check if an input text belongs to the defined domain or not.
This article is the first of a two-part series. In this first part I will extend this idea to explain how the KBpedia Knowledge Graph can be used, along with other machine learning techniques, to cope with different situations and use cases. I will cover the concepts of feature selection, hyperparameter optimization, and ensemble learning (in part 2 of this series). The emphasis here is on the testing and refining of machine learners, versus the set-up and configuration times that dominate other approaches.
Depending on the domain of interest, and depending on the required precision or recall, different strategies and techniques can lead to better predictions. More often than not, multiple different training corpuses, learners and hyperparameters need to be tested before ending up with the initial best possible prediction model. This is why I will strongly emphasize the fact that the KBpedia Knowledge Graph and Cognonto can be used to fully automate the creation of a wide range of different training corpuses, to create models, to optimize their hyperparameters, and to evaluate those models.
For this article, I will use the latest version of the KBpedia Knowledge Graph (version 1.10), which we just released. A knowledge graph such as KBpedia is not static. It constantly evolves, gets fixed, and improves. New concepts are created, deprecated concepts are removed, new linkages to external data sources are created, etc. This growth means that any of these changes can have a [positive] impact on the creation of the positive and negative training sets. Applications based on KBpedia should be tested against any new knowledge graph that is released to see if their models improve. Better concepts, better structure, and more linkages will often lead to better training sets as well.
Such growth in KBpedia is also why automating, and more importantly testing, this process is crucial. Upon the release of major new versions we are able to automate all of these steps to see the final impacts of upgrading the knowledge graph:
Because each of these steps belongs to an automated workflow, we can easily check the impact of updating the KBpedia Knowledge Graph on our models.
A new step I am adding in this current use case is to use a reasoner to reason over the KBpedia Knowledge Graph. The reasoner is used when we define the scope of the domain to classify. We will browse the knowledge graph to see which seed reference concepts we should add to the scope. Then we will use a reasoner to extend the models to include any new sub-classes relevant to the scope of the domain. This means that we may add further specific features to the final model.
Recall that our prior use case used Music as its domain scope. The first step is to use the new KBpedia version 1.10, along with a reasoner, to create the full scope of this updated Music domain. The result of using this new version and a reasoner is that we now end up with 196 features (reference documents) instead of 64. This also means that we will have 196 documents in our positive training set if we only use the Wikipedia pages linked to these reference concepts (and not their related named entities).
(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Musician"
                       "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
                       "http://kbpedia.org/kko/rc/MusicalInstrument"
                       "http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Album-IBO"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MusicalText"
                       "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
                       "http://kbpedia.org/kko/rc/MusicalPerformer"]
  kbpedia
  "resources/domain-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)
The next step is to create the actual training corpuses: the general one and the domain one. We have to load the dictionaries we created in the previous step, and then locally cache and normalize the corpuses. Remember that the normalization steps are:
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv") (cache-corpus) (normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
Because we never have enough instances in our gold standards to test against, let’s create a third one, this time adding a music-related news feed that will add more positive examples to the gold standard.
(defn create-gold-standard-from-feeds [name]
  (let [feeds ["http://www.music-news.com/rss/UK/news"
               "http://rss.cbc.ca/lineup/topstories.xml" "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml" "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml" "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml" "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml" "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml" "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml" "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml" "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml" "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml" "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews" "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews" "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews" "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews" "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews" "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews" "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]
    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          ;; write one row per feed item; the "class" column is left
          ;; empty so it can be labeled manually afterward
          (csv/write-csv out-file [["" (:title item) (:link item)]]))))))
This routine creates this third gold standard. Remember, we use the gold standard to evaluate different methods and models to classify an input text to see if it belongs to the domain or not.
For each piece of news aggregated that way, I manually determined whether the candidate document belongs to the domain or not. This task can be tricky, and requires a clear understanding of the proper scope for the domain. In this example, I consider an article to belong to the music domain if it mentions music concepts such as musical albums, songs, or multiple music-related topics. If a singer is mentioned in an article only because he broke up with his girlfriend, without further mention of anything related to music, I won’t tag it as being part of the domain.
[However, under a different interpretation of what should be in the domain wherein any mention of a singer qualifies, then we could extend the classification process to include named entities (the singer) extraction to help properly classify those articles. This revised scope is not used in this article, but it does indicate how your exact domain needs should inform such scoping decisions.]
You can download this new third gold standard from here.
Now that we have updated the training corpuses using the updated scope of the domain compared to the previous tests, let’s analyze the impact of using a new version of KBpedia and of using a reasoner to increase the number of features in our model. Let’s run our automatic process to evaluate the new models. The remaining steps that need to be run are:
Note: to see the full explanation of how the ESA and SVM classifiers work, please refer to the Create a Domain Text Classifier Using Cognonto article for more background information.
;; Load positive and negative training corpuses
(load-dictionaries "resources/general-corpus-dictionary.csv"
                   "resources/domain-corpus-dictionary.csv")

;; Build the ESA semantic interpreter
(build-semantic-interpreter "base"
                            "resources/semantic-interpreters/base/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base/"
                         :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0} :v nil :c 1 :algorithm :l2l2)
Let’s evaluate this model using our three gold standards:
(evaluate-model "svm.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive: 21
False positive: 3
True negative: 306
False negative: 6
Precision: 0.875
Recall: 0.7777778
Accuracy: 0.97321427
F1: 0.8235294
The performance changes relative to the previous results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +10.33% |
| Recall | -12.16% |
| Accuracy | +0.31% |
| F1 | +0.26% |

The results for the second gold standard are:
(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive: 16
False positive: 3
True negative: 317
False negative: 9
Precision: 0.84210527
Recall: 0.64
Accuracy: 0.9652174
F1: 0.72727275
The performance changes relative to the previous results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +6.18% |
| Recall | -29.35% |
| Accuracy | -1.19% |
| F1 | -14.63% |

What we can say is that the new scope for the domain greatly improved the precision of the model. This happens because the new model is probably more complex and better scoped, which leads it to be more selective. However, because of this the recall of the model suffers, since some of the positive cases of our gold standard are now classified as negative, which creates new false negatives. As you can see, there is almost always a tradeoff between precision and recall. You could reach 100% precision by getting only a single result right, but then the recall would suffer greatly: with the 25 positive cases of this gold standard, one correct positive and no false positives gives a precision of 1.0 but a recall of only 0.04. This is why the F1 score is important, since it is a weighted average (specifically, the harmonic mean) of the precision and the recall.
Now let’s look at the results of our new gold standard:
(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive: 28
False positive: 3
True negative: 355
False negative: 22
Precision: 0.9032258
Recall: 0.56
Accuracy: 0.9387255
F1: 0.69135803
Again, with this new gold standard, we can see the same pattern: the precision is pretty good, but the recall is not that great, since about half of the positive cases did not get noticed by the model.
Now, what could we do to try to improve this situation? The next thing we will investigate is to use feature selection and pruning.
A new method that we will investigate to try to improve the performance of the models is called feature selection. As its name says, what we are doing is selecting specific features to create our prediction model. The idea here is that not all features are created equal, and different features may have different (positive or negative) impacts on the model.
In our specific use case, we want to do feature selection using a pruning technique. What we will do is count the number of tokens for each of our features, i.e. for each of the Wikipedia pages related to these features. If the number of tokens in an article is too small (below 100), then we will drop that feature.
[Note: feature selection is a complex topic; other options and nuances are not further discussed here.]
The idea here is not to give undue importance to a feature for which we lack proper positive documents in the training corpus. Depending on the feature, it may, or may not, have an impact on the overall model’s performance.
Pruning the general and domain-specific dictionaries is really simple. We only have to read the current dictionaries, read each of the documents mentioned in the dictionary from the cache, calculate the number of tokens in each, and then keep or drop them depending on whether they meet a certain threshold. Finally we write a new dictionary with the pruned features and documents:
(defn create-pruned-pages-dictionary-csv
  [dictionary-file pruned-file normalized-corpus-folder & {:keys [min-tokens]
                                                           :or {min-tokens 100}}]
  (let [dictionary (rest
                    (with-open [in-file (io/reader dictionary-file)]
                      (doall
                       (csv/read-csv in-file))))]
    (with-open [out-file (io/writer pruned-file)]
      (csv/write-csv out-file
                     (->> dictionary
                          (mapv (fn [[title rc]]
                                  (when (.exists (io/as-file (str normalized-corpus-folder title ".txt")))
                                    (when (> (->> (slurp (str normalized-corpus-folder title ".txt"))
                                                  tokenize
                                                  count)
                                             min-tokens)
                                      [[title rc]]))))
                          (apply concat)
                          (into []))))))
Then we can prune the general and domain specific dictionaries using this simple function:
(create-pruned-pages-dictionary-csv "resources/general-corpus-dictionary.csv"
                                    "resources/general-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)

(create-pruned-pages-dictionary-csv "resources/domain-corpus-dictionary.csv"
                                    "resources/domain-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)
As a result of this specific pruning approach, the number of features drops from 197 to 175.
Now that the training corpuses have been pruned, let’s load them and then evaluate their performance on the gold standards.
;; Load positive and negative pruned training corpuses
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/domain-corpus-dictionary.pruned.csv")

;; Build the ESA semantic interpreter
(build-semantic-interpreter "base"
                            "resources/semantic-interpreters/base-pruned/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base-pruned/"
                         :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base-pruned/"
                 :weights {1 50.0} :v nil :c 1 :algorithm :l2l2)
Let’s evaluate this model using our three gold standards:
(evaluate-model "svm.pruned.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive: 21
False positive: 2
True negative: 307
False negative: 6
Precision: 0.9130435
Recall: 0.7777778
Accuracy: 0.97619045
F1: 0.84000003
The performance changes relative to the initial results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +18.75% |
| Recall | -12.08% |
| Accuracy | +0.61% |
| F1 | +2.26% |

In this case, compared with the previous results (non-pruned with KBpedia 1.10), we improved the precision without decreasing the recall, which is the ultimate goal. This means that the F1 score increased by 2.26% just by pruning, for this gold standard.
The results for the second gold standard are:
(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive: 16
False positive: 3
True negative: 317
False negative: 9
Precision: 0.84210527
Recall: 0.64
Accuracy: 0.9652174
F1: 0.72727275
The performance changes relative to the previous results (using KBpedia 1.02) are:

| metric | change |
|---|---|
| Precision | +6.18% |
| Recall | -29.35% |
| Accuracy | -1.19% |
| F1 | -14.63% |

In this case, the results are identical to the non-pruned ones with KBpedia 1.10; pruning did not change anything. Considering the relatively small size of the gold standard, this is to be expected, since the model also did not drastically change.
Now let’s look at the results of our new gold standard:
(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive: 27
False positive: 7
True negative: 351
False negative: 23
Precision: 0.7941176
Recall: 0.54
Accuracy: 0.9264706
F1: 0.64285713
Now let’s check how these compare to the non-pruned version of the training corpus:
| metric | change |
|---|---|
| Precision | -12.08% |
| Recall | -3.7% |
| Accuracy | -1.31% |
| F1 | -7.02% |

Both false positives and false negatives increased with this change, which also led to a decrease in the overall metrics. What happened?
Different things may have happened, in fact. Maybe the new set of features is not optimal, or maybe the hyperparameters of the SVM classifier are off. This is what we will try to figure out using two new methods to continue improving our model: hyperparameter optimization using grid search, and ensemble learning.
Hyperparameters are parameters that are not learned by the estimators. They are a kind of configuration option for an algorithm. In the case of a linear SVM, the hyperparameters are C, epsilon, the weight and the algorithm used. Hyperparameter optimization is the task of trying to find the right parameter values in order to optimize the performance of the model.
There are multiple different strategies that we can use to try to find the best values for these hyperparameters, but the one we will use is called the grid search, which exhaustively searches across a manually defined subset of possible hyperparameter values.
The grid search function we want to define will enable us to specify the algorithm(s), the weight(s), C and the stopping tolerance. Then we will want the grid search to keep the hyperparameters that optimize the score of the metric we want to focus on. We also have to specify the gold standard we want to use to evaluate the performance of the different models.
Here is the function that implements that grid search algorithm:
(defn svm-grid-search
  [name model-path gold-standard & {:keys [grid-parameters selection-metric]
                                    :or {grid-parameters [{:c [1 2 4 16 256]
                                                           :e [0.001 0.01 0.1]
                                                           :algorithm [:l2l2]
                                                           :weight [1 15 30]}]
                                         selection-metric :f1}}]
  (let [best (atom {:gold-standard gold-standard
                    :selection-metric selection-metric
                    :score 0.0
                    :c nil
                    :e nil
                    :algorithm nil
                    :weight nil})
        model-vectors (read-string (slurp (str model-path "model.vectors")))]
    (doseq [parameters grid-parameters]
      (doseq [algo (:algorithm parameters)]
        (doseq [weight (:weight parameters)]
          (doseq [e (:e parameters)]
            (doseq [c (:c parameters)]
              (train-svm-model name model-path
                               :weights {1 (double weight)}
                               :v nil :c c :e e :algorithm algo
                               :model-vectors model-vectors)
              (let [results (evaluate-model name gold-standard :output false)]
                (println "Algorithm:" algo)
                (println "C:" c)
                (println "Epsilon:" e)
                (println "Weight:" weight)
                (println selection-metric ":" (get results selection-metric))
                (println)
                (when (> (get results selection-metric) (:score @best))
                  (reset! best {:gold-standard gold-standard
                                :selection-metric selection-metric
                                :score (get results selection-metric)
                                :c c
                                :e e
                                :algorithm algo
                                :weight weight}))))))))
    @best))
The possible algorithms are:

- :l2lr_primal
- :l2l2
- :l2l2_primal
- :l2l1
- :multi
- :l1l2_primal
- :l1lr
- :l2lr

To simplify things a little bit for this task, we will merge the three gold standards we have into one. We will use that gold standard moving forward. The merged gold standard can be downloaded from here. We now have a single gold standard with 1017 manually vetted web pages.
Now that we have a new consolidated gold standard, let’s calculate the performance of the models with pruned and non-pruned training corpuses. This will become the new basis for comparing the subsequent results in this article. The metrics when the training corpuses are pruned:
True positive: 56
False positive: 10
True negative: 913
False negative: 38
Precision: 0.8484849
Recall: 0.59574467
Accuracy: 0.95280236
F1: 0.7
Now, let’s run the grid search that will try to optimize the F1 score of the model using the pruned training corpuses and the full gold standard:
(svm-grid-search "grid-search-base-pruned-tests" "resources/svm/base-pruned/" "resources/gold-standard-full.csv" :selection-metric :f1 :grid-parameters [{:c [1 2 4 16 256] :e [0.001 0.01 0.1] :algorithm [:l2l2] :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
:selection-metric :f1
:score 0.7096774
:c 2
:e 0.001
:algorithm :l2l2
:weight 30}
With a simple subset of the possible hyperparameter space, we found that by increasing the C parameter to 2 we could improve the F1 score on the gold standard by 1.37%. It is not a huge gain, but it is still an appreciable gain given the minimal effort invested so far (basically: waiting for the grid search to finish). Subsequently we could tweak the subset of parameters to try to improve a little further. Let’s try with c = [1.5, 2, 2.5] and weight = [30, 40]. Let’s also check other algorithms, such as L2-regularized L1-loss support vector regression (dual), as sketched below.
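Such a refined second pass could look like the following call. This exact invocation is not from the original article; the values come straight from the paragraph above, and I am assuming :l2l1 is the keyword for the L2-regularized L1-loss variant from the algorithm list shown earlier:

;; Hypothetical second-pass grid search narrowing in on the best region
;; found by the first pass; assumes :l2l1 maps to the L2-regularized
;; L1-loss algorithm mentioned above.
(svm-grid-search "grid-search-base-pruned-refined"
                 "resources/svm/base-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1.5 2 2.5]
                                    :e [0.001 0.01]
                                    :algorithm [:l2l2 :l2l1]
                                    :weight [30 40]}])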
The goal here is to configure the initial grid search with general parameters with a wide range of possible values. Then subsequently we can use that tool to fine tune some of the parameters that were returning good results. In any case, the more computer power and time you have, the more tests you will be able to perform.
Posted at 11:00
The Linked Data Lexicography for High-End Language Technology (LDL4HELTA) project was started in cooperation between Semantic Web Company (SWC) and K Dictionaries. LDL4HELTA combines lexicography and Language Technology with semantic technologies and Linked (Open) Data mechanisms and technologies. One of the implementation steps of the project is to create a language graph from the dictionary data.
The input data, described further below, is a Spanish dictionary core translated into multiple languages and available in XML format. This data should be triplified (that is, converted to RDF, the Resource Description Framework) for several purposes, including enriching it with external resources. The triplified data needs to comply with Semantic Web principles.
To get from a dictionary’s XML format to its triples, I learned that you must have a model. One piece of the sketched model, representing two Spanish words which have senses that relate to each other, is presented in Figure 1.
This sketched model first needs to be created by a linguist who understands both the language complexity and Semantic Web principles. The extensive model [1] was developed at the Ontology Engineering Group of the Universidad Politécnica de Madrid (UPM).
Language is very complex. With this we all agree! How complex it really is, is probably often underestimated, especially when you need to model all its details and triplify it.
So why is the task so complex?
To start with, the XML structure is complex in itself, as it contains nested structures. Each word constitutes an entry. One single entry can contain information about:
Entries can have predefined values, which can recur, but their fields can also have so-called free values, which can vary too. Such fields are:
As mentioned above, in order to triplify a dictionary one needs to have a clearly defined model. Usually, when modelling Linked Data or just RDF, it is important to make use of existing models and schemas to enable easier and more efficient use and integration. One well-known lexicon model is Lemon. Lemon covers many of our dictionary needs, but not all of them. We also started using the Ontolex model, which is much more complex and is considered to be the evolution of Lemon. However, some pieces of information were still missing, so we created an additional ontology to cover all missing corners and catch the specific details that did not overlap with the Ontolex model (such as the free values).
An additional level of complexity was the need to identify exactly the missing pieces in the Ontolex model and its modules and to create the part for the missing information. This was part of creating the dictionary’s model, which we called ontolexKD.
As a developer you never sit down to think about all the senses or meanings or translations of a word (except if you specialize in linguistics), so just to understand the complexity was a revelation for me. And still, each dictionary contains information that is specific to it and which needs to be identified and understood.
The process used in order to do the mapping consists of several steps. Imagine this as a processing pipeline which manipulates the XML data. UnifiedViews is an ETL tool, specialized in the management of RDF data, in which you can configure your own processing pipeline. One of its use cases is to triplify different data formats. I used it to map XML to RDF and upload it into a triple store. Of course this particular task can also be achieved with other such tools or methods for that matter. In UnifiedViews the processing pipeline resembles what appears in Figure 2.
The pipeline is composed out of data processing units (DPUs) which communicate iteratively. In a left-to-right order the process in Figure 2 represents:
Basically the XML is transformed using XSLT.
Complexity also increases through the URIs (Uniform Resource Identifiers) that are needed for mapping the information in the dictionary, because with Linked Data any resource should have a clearly identified and persistent identifier! The start was to represent a single word (headword) under a desired namespace and build on it to associate it with its part of speech, grammatical number, grammatical gender, definition, translation – just to begin with.
The base URIs follow the best practices recommended in the ISA study on persistent URIs, following the pattern: http://{domain}/{type}/{concept}/{reference}.
An example of such URIs for the forms of a headword is:
These two URIs represent the singular masculine and singular feminine forms of the Spanish word entendedor.
If the dictionary contains two different adjectival endings, as with entendedor which has different endings for the feminine and masculine forms (entendedora and entendedor), and they are not explicitly mentioned in the dictionary, then we use numbers in the URI to describe them. If the gender were explicitly mentioned, the URIs would be:
In addition, we should consider that the aim of triplifying the XML was for all these headwords with senses, forms and translations to connect and be identified and linked following Semantic Web principles. The actual overlap and linking of the dictionary resources remains open. A second step for improving the triplification and mapping similar entries, if possible at all, still needs to be carried out. As an example, let’s take two dictionaries: a German one, which contains translations into English, and an English one, which contains translations into German. We get the following translations:
Bank – bank – German to English
bank – Bank – English to German
The URI of the translation from German to English was designed to look like:
And the translation from English to German would be:
In this case both represent the same translation but have different URIs because they were generated from different dictionaries (mind the translation order). These should be mapped so as to represent the same concept, theoretically, or should they not?
The word Bank in German can mean either a bench or a bank in English. When I translate both English senses back into German I get again the word Bank, but I cannot be sure which sense I translate unless the sense id is in the URI, hence the SE00006110 and SE00006116. It is important to keep the order of translation (target-source) but later map the fact that both translations refer to the same sense, same concept. This is difficult to establish automatically. It is hard even for a human sometimes.
One of the last steps of complexity was to develop a generic XSLT which can triplify all the different languages of this dictionary series and store the complete data in a triple store. The question remains: is the design of such a universal XSLT possible while taking into account the differences in languages or the differences in dictionaries?
From the point of view of enabling the dictionary to benefit from Semantic Web principles, the task at hand is not yet complete. The linguist is probably the first one who can conceptualize how to do this.
As a next step we will improve the Linked Data created so far and bring it to the status of a good linked language graph by enriching the RDF data with additional information, such as the history of a term or additional grammatical information etc.
References:
[1] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda, and G. Aguado-de Cea, “Modelling multilingual lexicographic resources for the web of data: the K Dictionaries case,” in Proc. of the GLOBALEX’16 workshop at LREC 2016, Portorož, Slovenia, May 2016.
Posted at 12:07
Hello Community! We are very pleased to announce that our paper “Radon – Rapid Discovery of Topological Relations” was accepted for presentation at the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), which will be held February 4–9 at the Hilton San Francisco, San Francisco, California, USA.
In more detail, we will present the following paper: “Radon – Rapid Discovery of Topological Relations” by Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, and Axel-Cyrille Ngonga Ngomo.
Abstract. Datasets containing geo-spatial resources are increasingly being represented according to the Linked Data principles. Several time-efficient approaches for discovering links between RDF resources have been developed over the last years. However, the time-efficient discovery of topological relations between geospatial resources has been paid little attention to. We address this research gap by presenting Radon, a novel approach for the rapid computation of topological relations between geo-spatial resources. Our approach uses a sparse tiling index in combination with minimum bounding boxes to reduce the computation time of topological relations. Our evaluation of Radon’s runtime on 45 datasets and in more than 800 experiments shows that it outperforms the state of the art by up to 3 orders of magnitude while maintaining an F-measure of 100%. Moreover, our experiments suggest that Radon scales up well when implemented in parallel.
Acknowledgments
This work is implemented in the link discovery framework LIMES and has been supported
by the European Union’s H2020 research and innovation action
HOBBIT (GA no. 688227) as
well as the BMWI Project GEISER (project no.
01MD16014).
Posted at 13:48
Posted at 15:09
Posted at 23:59
Posted at 12:39
Holiday season is nearly upon us. Donating to a charity is an alternative form of gift giving that shows you care, whilst directing your money towards helping those that need it. There are a lot of great and deserving causes you can support, and I’m certainly not going to tell you where you should donate your money.
But I’ve been thinking about the various ways in which I
can support projects that I care about. There are a lot of them as
it turns out. And it occurred to me that I could ask friends
and family who might want to buy me a gift to donate to them
instead. It’ll save me getting yet another scarf, pair
of socks, or (shudder) a
Posted at 19:23
Open data is data that anyone can access, use and share.
Open data is the result of several processes. The most obvious one is the release process that results in data being made available for reuse and sharing.
But there are other processes that may take place before that open data is made available: collecting and curating a dataset; running it through quality checks; or ensuring that data has been properly anonymised.
There are also processes that happen after data has been published. Providing support to users, for example. Or dealing with error reports or service issues with an API or portal.
Some processes are also continuous. Engaging with re-users is something that is best done on an ongoing basis. Re-users can help you decide which datasets to release and when. They can also give you feedback on ways to improve how your data is published. Or how it can be connected and enriched against other sources.
Collectively these processes define the practice of open data.
The practice of open data covers much more than the technical details of helping someone else access your data. It covers a whole range of organisational activities.
Releasing open data can be really easy. But developing your open data practice can take time. It can involve other changes in your organisation, such as creating a more open approach to data sharing. Or getting better at data governance and management.
The extent to which you develop an open data practice depends on how important open data is to your organisation. Is it part of your core strategy or just something you’re doing on a more limited basis?
The breadth and depth of the practice of open data is surprising to many people. The learning process is best experienced. Going through the process of opening a dataset, however small, provides useful insight that can help identify where further learning is needed.
One aspect of the practice of open data involves
understanding what data can be open, what can be shared and what
must stay closed. Moving data along
Posted at 19:52
The Cognonto demo is powered by an extensive knowledge graph called the KBpedia Knowledge Graph, as organized according to the KBpedia Knowledge Ontology (KKO). KBpedia is used for all kinds of tasks, some of which are demonstrated by the Cognonto use cases. KBpedia powers dataset linkage and mapping tools, machine learning training workflows, entity and concept extractions, category and topic tagging, etc.
The KBpedia Knowledge Graph is a structure of more than 39,000 reference concepts linked to 6 major knowledge bases and 20 popular ontologies in use across the Web. Unlike other knowledge graphs that analyze big corpuses of text to extract “concepts” (n-grams) and their co-occurrences, KBpedia has been created, is curated, is linked, and evolves using humans for the final vetting steps. KBpedia and its build process thus form a semi-automatic system.
The challenge with such a project is to be able to grow and refine (add or remove relations) within the structure without creating unknown conceptual issues. The sheer combinatorial scope of KBpedia means it is not possible for a human to fully understand the impact of adding or removing a relation on its entire structure. There is simply too much complexity in the interaction amongst the reference concepts (and their different kinds of relations) within the KBpedia Knowledge Graph.
What I discuss in this article is how Cognonto creates and then constantly evolves the KBpedia Knowledge Graph. In parallel with creating KBpedia over the years, we have also needed to develop our own build processes and tools to make sure that every time something changes in KBpedia’s structure, it remains satisfiable and coherent.
As you may experience for yourself with the Knowledge Graph browser, the KBpedia structure is linked to multiple external sources of information. Each of these sources (six major knowledge bases and another 20 ontologies) has its own world view. Each of these sources uses its own concepts to organize its own structure.
What the KBpedia Knowledge Graph does is to merge all these different world views (and their associated instances and entities) into a coherent whole. One of the purposes of the KBpedia Knowledge Graph is to act as a scaffolding for integrating still further external sources, specifically in the knowledge domains relevant to specific clients.
One inherent characteristic of these knowledge sources is that they are constantly changing. Some may be updated only occasionally, others every year, others every few months, others every few weeks, or whatever. In the cases of Wikipedia and Wikidata, two of the most important contributors to KBpedia, thousands of changes occur daily. This dynamism of knowledge sources is an important fact since every time a source is changed, it may mean that its world view may have changed as well. Any of these changes can have an impact on KBpedia and the linkages we have to that external source.
Because of this dynamic environment, we have to constantly regenerate the KBpedia Knowledge Graph, and we constantly have to make sure that any changes in its structure, or in the structure of the sources linked to it, don’t make it unsatisfiable or incoherent.
It is for these reasons that we developed an extensive knowledge graph building process that includes a series of tests that are run every time the knowledge graph gets modified. Each new build is verified to be still satisfiable and coherent.
The KBpedia Knowledge Graph build process has been developed over the years to create a robust workflow that enables us to regenerate KBpedia every time something changes in it, while ensuring that no new issues are introduced along the way. Our build process also calculates a series of statistics and metrics that enable us to follow its evolution.
The process works as follows:
It is important that we be able to do these builds and tests rapidly, so that we can release new versions rapidly. Remember, all changes to the KBpedia Knowledge Graph are manually vetted.
To accomplish this aim we actually build KBpedia from a set of fairly straightforward input files (for easy inspection and modification). We can completely rebuild all of KBpedia in less than two hours. About 45 minutes are required for building the overall structure and applying the satisfiability and coherency tests. The typology aspects of KBpedia and their tests add another hour or so to complete the build. The rapidity of the build cycle means we can test and refine nearly in real time, useful when we are changing or refining big chunks of the structure.
Building the KBpedia Knowledge Graph is like M.C. Escher’s hands drawing themselves. Because of the synergy between the Knowledge Graph reference concepts, its upper structure, its typologies and its numerous links to external sources, any addition in one of these areas can lead to improvements in other areas of the knowledge graph. These improvements are informed by analyzing the metrics, statistics, and possible errors logged by the build process.
The Knowledge Graph is constantly evolving, self-healing and expanding. This is why the build process, and more importantly its tests, are crucial to make sure that new issues are not introduced every time something changes within the structure.
To illustrate these points, let’s dig a little deeper into the KBpedia Knowledge Graph build process.
The KBpedia Knowledge Graph is built from a few relatively straightforward assignment files serialized in CSV. Each file has its purpose in the build process and is encoded using UTF-8 for internationalization purposes. KBpedia is just a set of simple indexes serialized as CSV files that can easily be exchanged, updated and re-processed.
The process is 100% repeatable and testable. If issues are found in the future that require a new step or a new test, the pipeline can easily be improved by plugging the new step or test into it. In fact, the current pipeline is the incremental result of years of working this process. I’m sure we will add more steps still as time goes on.
The process is also semi-automatic. Certain tests may cause the process to completely fail. If such a failure happens, then the immediate actions required are output to different log files. If the process does complete, then all kinds of log files and statistics about the KBpedia Knowledge Graph structure are written to the file system. Once completed, the human operator can easily check these logs and update the input files to improve anything found while analyzing the output files.
Building KBpedia is really an iterative process. The graph is often generated hundreds of times before a new version is released.
The core and most important test in the process is the satisfiability test that is run once the KBpedia Knowledge Graph is generated. An unsatisfiable class is a class that does not “satisfy” (is inconsistent with) the structure of the knowledge graph. In KBpedia, what needs to be satisfied are the disjoint assertions that exist at the upper level of the knowledge graph. If an assertion between two reference concepts (like a sub-class-of or an equivalent-to relationship) leads to an unsatisfiable disjoint assertion, then an error is raised and the issue will need to be fixed by the human operator.
Here is an example of an unsatisfiable class. In this example, someone wants to say that a musical group (kbpedia:MusicPerformanceOrganization) is a sub-class-of a musician (kbpedia:Musician). This new assertion is obviously an error (a musical group is an organization, while a musician is an individual person), but the human operator didn’t notice it when he created the new relationship between the two reference concepts.
So, how does the build process catch such errors? Here is how:
Because the two classes belong to two disjoint super classes, the KBpedia generator finds this issue and returns an error along with a logging report that explains why the new assertion makes the structure unsatisfiable. This testing and audit report is pretty powerful (and essential) for maintaining the integrity of the knowledge graph.
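To make the idea concrete, here is a minimal Clojure sketch of the kind of disjointness check involved (illustrative only, not Cognonto’s actual build code; the super-classes-of helper, the disjoint pairs and the class names are assumptions):

(def disjoint-pairs
  ;; Super classes asserted to be mutually disjoint (illustrative).
  #{#{:kbpedia/Organization :kbpedia/Person}})

(defn violates-disjointness?
  "Checks whether asserting `sub` sub-class-of `super` would place the
   concept under two super classes declared disjoint. `super-classes-of`
   is an assumed helper returning the set of all transitive super
   classes of a concept."
  [super-classes-of sub super]
  (let [ancestors-of-sub (conj (super-classes-of sub) sub)
        ancestors-of-super (conj (super-classes-of super) super)]
    (boolean
     (some (fn [pair]
             (let [[a b] (vec pair)]
               (or (and (contains? ancestors-of-sub a) (contains? ancestors-of-super b))
                   (and (contains? ancestors-of-sub b) (contains? ancestors-of-super a)))))
           disjoint-pairs))))

;; E.g., with MusicPerformanceOrganization under Organization and Musician
;; under Person, the faulty sub-class-of assertion is flagged:
;; (violates-disjointness? supers :kbpedia/MusicPerformanceOrganization :kbpedia/Musician)
;; => true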
The satisfiability testing of external concepts linked to KBpedia is performed in two steps:
This second step is essential to make sure that any external concept we link to KBpedia is linked properly and does not trigger any linking errors. In fact, we are trying to minimize the number of errors using the unsatisfiability testing. The process of checking whether external concepts linked to the KBpedia Knowledge Graph satisfy the structure is the same. If their inclusion leads to such an issue, then it means that the links are the issue, since we know that the KBpedia core structure is satisfiable (that was established in the previous step). Once detected, the linkage error(s) will be reviewed and fixed by the human operator and the structure will be regenerated. In the early phases of a new build, these fixes are accumulated and processed in batches. At the end of a new build, only one or a few errors remain to be corrected.
Another important test is to make sure that the KBpedia Knowledge Graph is fully connected. We don’t want to have islands of concepts in the graph; we want to make sure that every concept is reachable using sub-class-of, super-class-of or equivalent-to relationships. If the build process detects that some concepts are disconnected from the graph, then new relationships will need to be created to reconnect them. These “orphan” tests ensure the integrity and completeness of the overall graph structure.
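A sketch of such an orphan test (illustrative only; it assumes the graph is available as a map from each concept to the set of concepts it is directly related to):

(require '[clojure.set :as set])

(defn orphan-concepts
  "Returns the concepts unreachable from `root` by following sub-class-of,
   super-class-of or equivalent-to edges. `neighbors` maps a concept to
   the set of directly related concepts."
  [neighbors root all-concepts]
  (loop [frontier #{root}
         seen #{root}]
    (if (empty? frontier)
      (set/difference all-concepts seen)
      (let [discovered (set/difference
                        (reduce set/union #{} (map #(get neighbors % #{}) frontier))
                        seen)]
        (recur discovered (set/union seen discovered))))))

;; Any non-empty result means the graph is disconnected and new
;; relationships are needed to reconnect the listed concepts.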
What is a typology? As stated by Merriam Webster, a
typology is “a system used for putting things into groups
according to how they are similar.” The KBpedia typologies, of
which there are about 80, are the classification of types that are
closely related, which we term Super Types. Three example
Super Types are People, Activities and Products. The Super Types
are found in the upper reaches of the KBpedia Knowledge Graph. (See
further
this article by Mike Bergman describing the upper structure of
KBpedia and its relation to the typologies.) Thousands of
disjointedness assertions have been defined between
individual Super Types to other Super Types. These assertions
enforce the fact that the reference concepts related to a Super
Type A are not similar to the reference concepts
related to, say, Super Type B.
These disjointedness assertions are a major factor in how we can rapidly slice-and-dice the KBpedia knowledge space to rapidly create training corpuses and positive and negative training sets for machine learning. These same disjointedness relationships are what we use to make sure that the KBpedia Knowledge Graph structure is satisfiable and coherent.
Another use of the typologies is to have a general overview of the knowledge graph. Each typology is a kind of lens that shows different parts of the knowledge graph. The build process creates a log of each of the typologies with all the reference concepts that belong to it. Similarly, the build process also creates a mini-ontology for each typology that can be inspected in an ontology editor. We use these outputs to more easily assess the various structures within KBpedia and to find possible conceptual issues as part of our manual vetting before final approvals.
Creating, maintaining and evolving a knowledge graph the size of KBpedia is a non-trivial task. It is also a task that must be done frequently and rapidly whenever the underlying nature of KBpedia’s constituent knowledge bases dynamically changes. These demands require a robust build process with multiple logic and consistency tests. At every step we have to make sure that the entire structure is satisfiable and coherent. Fortunately, after development over a number of years, we now have processes in place that are battle tested and can continue to be expanded as the KBpedia Knowledge Graph constantly evolves.
Posted at 19:57
When I’m discussing business models around open data I regularly
refer to a few different examples. Not all of these have well
developed case studies, so I thought I’d start trying to capture
them here. In this first write-up I’m going to look at
Posted at 22:28
[work in progress – I’m updating it gradually]
Posted at 16:11
As of last month
Posted at 19:39
One of the topics that most interests me at the moment is how we design systems and organisations that contribute to the creation and maintenance of the open data commons.
This is more than a purely academic interest. If we can understand the characteristics of successful open data projects like Open Street Map or Musicbrainz then we could replicate them in other areas. My hope is that we may be able to define a useful tool-kit of organisational and technical design patterns that make it more likely for other similar projects to proceed. These patterns might also give us a way to evaluate and improve other existing systems.
A lot of the current discussion around this topic is going on
under the “
Posted at 19:03
Posted at 16:49
A common task required by systems that automatically analyze text is to classify an input text into one or multiple classes. A model needs to be created to scope the class (what belongs to it and what does not) and then a classification algorithm uses this model to classify an input text.
Multiple classification algorithms exist to perform such a task: Support Vector Machine (SVM), K-Nearest Neighbours (KNN), C4.5 and others. What is hard with any such text classification task is not so much how to use these algorithms: they are generally easy to configure and use once implemented in a programming language. The hard – and time-consuming – part is to create a sound training corpus that will properly define the class you want to predict. Further, the steps required to create such a training corpus must be duplicated for each class you want to predict.
Since creating the training corpus is what is time consuming, this is where Cognonto provides its advantages.
In this article, we will show you how Cognonto’s KBpedia Knowledge Graph can be used to automatically generate training corpuses that are used to generate classification models. First, we define (scope) a domain with one or multiple KBpedia reference concepts. Second, we aggregate the training corpus for that domain using the KBpedia Knowledge Graph and its linkages to external public datasets that are then used to populate the training corpus of the domain. Third, we use the Explicit Semantic Analysis (ESA) algorithm to create a vectorial representation of the training corpus. Fourth, we create a model using (in this use case) an SVM classifier. Finally, we predict if an input text belongs to the class (scoped domain) or not.
This use case can be used in any workflow that needs to pre-process any set of input texts where the objective is to classify relevant ones into a defined domain.
Unlike more traditional topic taggers where topics are tagged in an input text with weights provided for each of them, we will see how it is possible to use the semantic interpreter to tag main concepts related to an input text even if the surface form of the topic is not mentioned in the text. We accomplish this by leveraging ESA’s semantic interpreter.
In this article, two concepts are at the center of everything:
what I call the general domain and the specific
domain(s). What I call the general domain can
be seen as the set of all specific domains. It includes the set of
classes that generally define common things of the World. What we
call a specific domain is one or multiple classes that
scope a domain of interest. A specific domain is a
subset of classes of the general domain.
In Cognonto, the general domain is defined by all the ~39,000 KBpedia reference concepts. A specific domain is any subset of the ~39,000 KBpedia reference concepts that adequately scopes a domain of interest.
The purpose of this use case is to show how we can determine if an input text belongs to a specific domain of interest. What we have to do is to create two training corpuses: one that defines the general domain, and one that defines the specific domain. However, how do we go about defining these corpuses? One way would be to do this manually, but it would take an awful lot of time to do.
This is the crux of the matter: we will generate the general domain corpus and specific domain ones automatically using the KBpedia Knowledge Graph and all of its linkages to external public datasets. The time and resources thus saved from creating the training corpuses can be spent testing different classification algorithms, tweaking their parameters, evaluating them, etc.
What is so powerful in leveraging the KBpedia Knowledge Graph in this manner is that we can generate training sets for all kinds of domains of interest automatically.
The first step we have to do is to define the training corpuses that we will use to create the semantic interpreter and the SVM classification models. We have to create the general domain training corpus and the domain specific training corpus. The example domain I have chosen for this use case is scoped by the ideas of Music, Musicians, Music Records, Musical Groups, Musical Instruments, etc.
The general training corpus is quite easy to create. The only thing I have to do is to query the KBpedia Knowledge Graph to get all the Wikipedia pages linked to all the KBpedia reference concepts. These pages will become the general training corpus.
Note that in this article I will only use the linkages to the Wikipedia dataset, but I could also use any other datasets that are linked to the KBpedia Knowledge Graph in exactly the same way. Here is how we aggregate all the documents that will belong to a training corpus:
Note all I need do is to use the KBpedia structure, query it, and then write the general corpus into a CSV file. This CSV file will be used later for most of the subsequent tasks.
(define-general-corpus "resources/kbpedia_reference_concepts_linkage.n3" "resources/general-corpus-dictionary.csv")
The next step is to define the training corpus of the specific domain for this use case: the music domain. To do so, I need merely search KBpedia to find all the reference concepts I am interested in that will scope my music domain. These domain-specific KBpedia reference concepts will be the features of the SVM models we will test below.
What the define-domain-corpus function does below
is simply to query KBpedia to get all the Wikipedia articles
related to these concepts, their sub-classes and to create the
training corpus from them.
In this article we only define a binary classifier. However, if we would want to create a multi-class classifier then we would have to define multiple specific domain training corpuses exactly the same way. The only time we would have to spend is to search KBpedia (using the Cognonto user interface) to find the reference concepts we want to use to scope the domains we want to define. We will show how quickly this can be done with impressive results in a later use case.
(define-domain-corpus
  ["http://kbpedia.org/kko/rc/Music"
   "http://kbpedia.org/kko/rc/Musician"
   "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
   "http://kbpedia.org/kko/rc/MusicalInstrument"
   "http://kbpedia.org/kko/rc/Album-CW"
   "http://kbpedia.org/kko/rc/Album-IBO"
   "http://kbpedia.org/kko/rc/MusicalComposition"
   "http://kbpedia.org/kko/rc/MusicalText"
   "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
   "http://kbpedia.org/kko/rc/MusicalPerformer"]
  "resources/kbpedia_reference_concepts_linkage.n3"
  "resources/domain-corpus-dictionary.csv")
Once the training corpuses are defined, we want to cache them locally to be able to play with them, without having to re-download them from the Web or re-create them each time.
(cache-corpus)
The cache is composed of 24,374 Wikipedia pages, which is about
2G of raw data. However, we have some more processing
to perform on the raw Wikipedia pages since what we ultimately want
is a set of relevant tokens (words) that will be used to calculate
the value of the features of our model using the ESA semantic
interpreter. Since we may want to experiment with different
normalization rules, what we do is to re-write each document of the
corpus in another folder that we will be able to re-create as
required if the normalization rules change in the future. We can
quickly re-process these input files and save them in separate
folders for testing and comparative purposes.
The normalization steps performed by this function are to:
Normalization steps could be dropped or others included, but these are the standard ones Cognonto applies in its baseline configuration.
(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")
After cleaning, the size of the cache is now 208M
(instead of the initial 2G for the raw web pages).
Note that unlike what is discussed in the original ESA research papers by Evgeniy Gabrilovich, we are not pruning any pages (the ones with less than X number of tokens, etc.). This could be done at a subsequent tweaking step, which our results below indicate is not really necessary.
Now that the training corpuses are created, we can build the semantic interpreter to create the vectors that will be used to train the SVM classifier.
What we want to do is to classify (determine) if an input text belongs to a class as defined by a domain. The relatedness of the input text is based on how closely the specific domain corpus is related to the general one. This classification is performed with classifiers like SVM, KNN and C4.5. However, each of these algorithms needs some kind of numerical vector upon which the actual classifier can model and classify the candidate input text. Creating this numeric vector is the job of the ESA Semantic Interpreter.
Let’s dive a little further into the Semantic Interpreter to understand how it operates. Note that you can skip the next section and continue with the following one.
The Semantic Interpreter is a process that maps fragments of natural language into a weighted sequence of text concepts ordered by their relevance to the input.
Each concept in the domain is accompanied by a document from the KBpedia Knowledge Graph, which acts as its representative term set to capture the idea (meaning) of the concept. The overall corpus is based on the combined documents from KBpedia that match the slice retrieved from the knowledge graph based on the domain query(ies).
The corpus is composed of $N$ concepts that come from the domain ontology associated with KBpedia Knowledge Base documents. We build a sparse matrix $T$ where each of the $N$ columns corresponds to a concept and where each of the rows corresponds to a word that occurs in the related entity documents $d_1, \ldots, d_N$. The matrix entry $T[i,j]$ is the TF-IDF value of the word $w_i$ in document $d_j$.
The TF-IDF value of a given term is calculated as:

$$T[i,j] = \mathrm{tf}(w_i, d_j) \cdot \log\frac{N}{\mathrm{df}(w_i)}$$

where $|d_j|$ is the number of words in the document $d_j$, where the term frequency is defined as:

$$\mathrm{tf}(w_i, d_j) = \frac{\mathrm{count}(w_i, d_j)}{|d_j|}$$

and where the document frequency $\mathrm{df}(w_i)$ is the number of documents where the term $w_i$ appears.
Unlike the standard ESA system, pruning is not performed on the matrix to remove the least-related concepts for any given word. We are not doing the pruning because the ontologies are highly domain specific, as opposed to really broad and general vocabularies. However, with a different mix of training text, and depending on the use case, the standard ESA model may benefit from pruning the matrix.
Once the matrix is created, we perform cosine normalization on each column:

$$T'[i,j] = \frac{T[i,j]}{\sqrt{\sum_{k} T[k,j]^2}}$$

where $T[i,j]$ is the TF-IDF weight of the word $w_i$ in the concept document $d_j$, and where $\sqrt{\sum_{k} T[k,j]^2}$ is the square root of the sum of the squared TF-IDF weights of each word in document $d_j$. This normalization removes, or at least lowers, the effect of the length of the input documents.
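To make the two formulas above concrete, here is a minimal Clojure sketch (illustrative only, not the Cognonto implementation) that builds cosine-normalized TF-IDF columns from tokenized documents:

(defn tf-idf-matrix
  "Builds a map {doc-id {word weight}} of cosine-normalized TF-IDF
   weights from a map {doc-id [token ...]} of tokenized documents."
  [docs]
  (let [n (count docs)
        ;; document frequency: number of docs in which each word appears
        df (reduce (fn [acc tokens]
                     (reduce (fn [m w] (update m w (fnil inc 0))) acc (distinct tokens)))
                   {}
                   (vals docs))
        ;; raw TF-IDF values, one column per document
        cols (into {}
                   (for [[doc-id tokens] docs]
                     [doc-id
                      (let [len (count tokens)]
                        (into {}
                              (for [[w c] (frequencies tokens)]
                                [w (* (/ c (double len))
                                      (Math/log (/ n (double (df w)))))])))]))
        ;; cosine normalization of a column
        normalize (fn [col]
                    (let [l2 (Math/sqrt (reduce + 0.0 (map #(* % %) (vals col))))]
                      (if (zero? l2)
                        col
                        (into {} (for [[w v] col] [w (/ v l2)])))))]
    (into {} (for [[doc-id col] cols] [doc-id (normalize col)]))))

;; e.g. (tf-idf-matrix {"doc1" ["music" "band" "music"] "doc2" ["open" "data"]})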
The first semantic interpreter we will create is composed of the
general corpus which has 24,374 Wikipedia pages and the music
domain-specific corpus composed of 62 Wikipedia pages. The 62
Wikipedia pages that compose the music domain corpus come from the
selected KBpedia reference concepts and their sub-classes that we
defined in the Define The Specific Domain Training
Corpus section above.
(load-dictionaries "resources/general-corpus-dictionary.csv"
                   "resources/domain-corpus-dictionary--base.csv")

(build-semantic-interpreter "base"
                            "resources/semantic-interpreters/base/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))
Before building the SVM classifier, we have to create a gold standard that we will use to evaluate the performance of the models we will test. What I did is to aggregate a list of news feeds from the CBC and from Reuters and then crawl each of them to get the news items they contained. Once aggregated in a spreadsheet, I manually classified each item. The result is a gold standard of 336 news pages which were classified as being related to the music domain or not. It can be downloaded from here.
Subsequently, three days later, I re-crawled the same feeds to create a second gold standard that has 345 news pages. It can be downloaded from here. I will use both to evaluate the different SVM models we will create below. (I created the two standards because of some internal tests and statistics we are compiling.)
Both gold standards got created this way:
(defn create-gold-standard-from-feeds [name]
  (let [feeds ["http://rss.cbc.ca/lineup/topstories.xml"
               "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml"
               "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml"
               "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml"
               "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml"
               "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml"
               "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml"
               "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml"
               "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml"
               "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml"
               "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews"
               "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews"
               "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews"
               "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews"
               "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews"
               "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews"
               "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]
    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          ;; the class column is left empty; it gets filled in manually
          (csv/write-csv out-file [["" (:title item) (:link item)]] :append true))))))
Each of the different models we will test in the next sections will be evaluated using the following function:
(defn evaluate-model [evaluation-no gold-standard-file]
  (let [gold-standard (rest (with-open [in-file (io/reader gold-standard-file)]
                              (doall (csv/read-csv in-file))))
        true-positive (atom 0)
        false-positive (atom 0)
        true-negative (atom 0)
        false-negative (atom 0)]
    (with-open [out-file (io/writer (str "resources/evaluate-" evaluation-no ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [[class title url] gold-standard]
        ;; cache each gold standard page locally before classifying it
        (when-not (.exists (io/as-file (str "resources/gold-standard-cache/" (md5 url))))
          (spit (str "resources/gold-standard-cache/" (md5 url)) (slurp url)))
        (let [predicted-class (classify-text (-> (slurp (str "resources/gold-standard-cache/" (md5 url)))
                                                 defluff-content))]
          (println predicted-class " :: " title)
          (csv/write-csv out-file [[predicted-class title url]] :append true)
          (when (and (= class "1") (= predicted-class 1.0)) (swap! true-positive inc))
          (when (and (= class "0") (= predicted-class 1.0)) (swap! false-positive inc))
          (when (and (= class "0") (= predicted-class 0.0)) (swap! true-negative inc))
          (when (and (= class "1") (= predicted-class 0.0)) (swap! false-negative inc))))
      (println "True positive: " @true-positive)
      (println "False positive: " @false-positive)
      (println "True negative: " @true-negative)
      (println "False negative: " @false-negative)
      (println)
      (let [precision (float (/ @true-positive (+ @true-positive @false-positive)))
            recall (float (/ @true-positive (+ @true-positive @false-negative)))]
        (println "Precision: " precision)
        (println "Recall: " recall)
        (println "Accuracy: " (float (/ (+ @true-positive @true-negative)
                                        (+ @true-positive @false-negative @false-positive @true-negative))))
        (println "F1: " (float (* 2 (/ (* precision recall) (+ precision recall)))))))))
What this function does is to calculate the number of
true-positive, false-positive,
true-negative and false-negative scores
within the gold standard by applying the current model, and then to
calculate the precision, recall,
accuracy and F1 metrics. You can read
more about how binary classifiers can be evaluated from
here.
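For reference, the four metrics are computed from these counts as:

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$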
Now that we have numeric vector representations of the music domain, and a way to evaluate the quality of the models we will be creating, we can create and evaluate our prediction models.
The classification algorithm I chose to use for this article is the Support Vector Machine (SVM). I use the Java port of the LIBLINEAR library. Let’s create the first SVM model:
(build-svm-model-vectors "resources/svm/base/")

(train-svm-model "svm.w0" "resources/svm/base/"
                 :weights nil :v nil :c 1 :algorithm :l2l2)
This initial model is created using a training set composed of 24,311 documents that don’t belong to the class (the music specific domain) and 62 documents that do belong to that class.
Now, let’s evaluate how this initial model performs against the two gold standards:
(evaluate-model "w0" "resources/gold-standard-1.csv" )
True positive: 5
False positive: 0
True negative: 310
False negative: 21

Precision: 1.0
Recall: 0.1923077
Accuracy: 0.9375
F1: 0.32258064
(evaluate-model "w0" "resources/gold-standard-2.csv" )
True positive: 2
False positive: 1
True negative: 319
False negative: 23

Precision: 0.6666667
Recall: 0.08
Accuracy: 0.93043476
F1: 0.14285713
Well, this first run looks really poor! The issue here is a common one with how the SVM classifier is being used. Ideally, the number of documents that belong to the class and the number that do not should be about the same. However, because of the way we defined the music specific domain, and because of the way we created the training corpuses, we ended up with two really unbalanced sets of training documents: 24,311 that don’t belong to the class and only 62 that do. That is why we are getting these kinds of poor results.
What can we do from here? We have two possibilities:
Let’s test both options. We will initially play with the weights to see how much we can improve the current situation.
What we will do now is to create a series of models that differ in the weight we assign to the positive class in the SVM process.
(train-svm-model "svm.w10" "resources/svm/base/" :weights {1 10.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w10" "resources/gold-standard-1.csv")
True positive: 17
False positive: 1
True negative: 309
False negative: 9

Precision: 0.9444444
Recall: 0.65384614
Accuracy: 0.9702381
F1: 0.77272725
(evaluate-model "w10" "resources/gold-standard-2.csv")
True positive: 15
False positive: 2
True negative: 318
False negative: 10

Precision: 0.88235295
Recall: 0.6
Accuracy: 0.9652174
F1: 0.71428573
This is already a clear improvement for both gold standards. Let’s see if we continue to see improvements if we continue to increase the weight.
(train-svm-model "svm.w25" "resources/svm/base/" :weights {1 25.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w25" "resources/gold-standard-1.csv")
True positive: 20
False positive: 3
True negative: 307
False negative: 6

Precision: 0.8695652
Recall: 0.7692308
Accuracy: 0.97321427
F1: 0.8163265
(evaluate-model "w25" "resources/gold-standard-2.csv")
True positive: 21
False positive: 5
True negative: 315
False negative: 4

Precision: 0.8076923
Recall: 0.84
Accuracy: 0.973913
F1: 0.82352936
The general metrics continued to improve. By increasing the weight, the precision dropped a little bit, but the recall improved quite a bit. The overall F1 score significantly improved. Let’s see with the weight at 50.
(train-svm-model "svm.w50" "resources/svm/base/" :weights {1 50.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w50" "resources/gold-standard-1.csv")
True positive: 23
False positive: 7
True negative: 303
False negative: 3

Precision: 0.76666665
Recall: 0.88461536
Accuracy: 0.9702381
F1: 0.82142854
(evaluate-model "w50" "resources/gold-standard-2.csv")
True positive: 23
False positive: 6
True negative: 314
False negative: 2

Precision: 0.79310346
Recall: 0.92
Accuracy: 0.9768116
F1: 0.8518519
The trend continues: a decline in precision, an increase in recall, and a better overall F1 score in both cases. Let’s try with a weight of 200.
(train-svm-model "svm.w200" "resources/svm/base/" :weights {1 200.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w200" "resources/gold-standard-1.csv")
True positive: 23
False positive: 7
True negative: 303
False negative: 3

Precision: 0.76666665
Recall: 0.88461536
Accuracy: 0.9702381
F1: 0.82142854
(evaluate-model "w200" "resources/gold-standard-2.csv")
True positive: 23
False positive: 6
True negative: 314
False negative: 2

Precision: 0.79310346
Recall: 0.92
Accuracy: 0.9768116
F1: 0.8518519
Results are the same; it looks like increasing the weight beyond a certain point no longer adds to the predictive power. However, the goal of this article is not to be an SVM parametrization tutorial. Many other tests could be done, such as testing different values for the different SVM parameters like the C parameter and others.
Now let’s see if we can improve the performance of the model even more by adding new documents that belong to the class we want to define in the SVM model. The idea of adding documents is good, but how may we quickly process thousands of new documents that belong to that class? Easy: we will use the KBpedia Knowledge Graph and its linkage to entities that exist in the KBpedia Knowledge Base to get thousands of new documents highly related to the music domain we are defining.
Here is how we will proceed. See how we use the
type relationship between the classes and their
individuals:
The millions of completely typed instances in KBpedia enable us to retrieve such large training sets efficiently and quickly.
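Conceptually, gathering those new training documents amounts to a query of the following shape (a sketch only: the kko:wikipediaPage predicate is an assumed placeholder, not the actual KBpedia vocabulary):

;; Hypothetical SPARQL shape for retrieving the Wikipedia pages of all
;; entities typed with a given domain reference concept.
(def album-entity-pages-query
  "SELECT ?page
   WHERE {
     ?entity rdf:type <http://kbpedia.org/kko/rc/Album-CW> .
     ?entity kko:wikipediaPage ?page .
   }")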
To extend the music domain model, I added about 5,000 album, musician and band documents using the relationship querying strategy outlined in the figure above. What I did is just to add 3 new features, but with thousands of new training documents in the corpus.
What I had to do was to:
(extend-domain-pages-with-entities)

(cache-corpus)
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary--extended.csv") (build-semantic-interpreter "domain-extended" "resources/semantic-interpreters/domain-extended/" (distinct (concat (get-domain-pages) (get-general-pages)))) (build-svm-model-vectors "resources/svm/domain-extended/")
Just like we did for the first series of tests, we will now create different SVM models and evaluate them. Since we now have a nearly balanced training corpus, we will test much smaller weights (no weights at first, then a weight of 2).
(train-svm-model "svm.w0" "resources/svm/domain-extended/" :weights nil :v nil :c 1 :algorithm :l2l2) (evaluate-model "w0" "resources/gold-standard-1.csv")
True positive: 20
False positive: 12
True negative: 298
False negative: 6

Precision: 0.625
Recall: 0.7692308
Accuracy: 0.9464286
F1: 0.6896552
(evaluate-model "w0" "resources/gold-standard-2.csv")
True positive: 18
False positive: 17
True negative: 303
False negative: 7

Precision: 0.51428574
Recall: 0.72
Accuracy: 0.93043476
F1: 0.6
As we can see, the model is scoring much better than the previous one when the weight is zero. However, it is not as good as the previous one when weights are modified. Let’s see if we can benefit from increasing the weight for this new training set:
(train-svm-model "svm.w2" "resources/svm/domain-extended/" :weights {1 2.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w2" "resources/gold-standard-1.csv")
True positive: 21
False positive: 23
True negative: 287
False negative: 5

Precision: 0.47727272
Recall: 0.8076923
Accuracy: 0.9166667
F1: 0.59999996
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive: 20
False positive: 33
True negative: 287
False negative: 5

Precision: 0.3773585
Recall: 0.8
Accuracy: 0.8898551
F1: 0.51282054
Overall the models seem worse with weight 2; let’s try with weight 5:
(train-svm-model "svm.w5" "resources/svm/domain-extended/" :weights {1 5.0} :v nil :c 1 :algorithm :l2l2) (evaluate-model "w5" "resources/gold-standard-1.csv")
True positive: 25
False positive: 52
True negative: 258
False negative: 1

Precision: 0.32467532
Recall: 0.96153843
Accuracy: 0.8422619
F1: 0.4854369
(evaluate-model "w2" "resources/gold-standard-2.csv")
True positive: 23
False positive: 62
True negative: 258
False negative: 2

Precision: 0.27058825
Recall: 0.92
Accuracy: 0.81449276
F1: 0.41818184
The performances are just getting worse, but this makes sense. Now that the training set is balanced, many more tokens participate in the semantic interpreter, and so in the vectors generated by it and used by the SVM. If we increase the weight of a balanced training set, then this intuitively should re-unbalance it and worsen performance. This is what is apparently happening.
Re-balancing the training set using this strategy does not appear to improve the prediction model, at least not for this domain and not for these SVM parameters.
So far, we have been able to test different kind of strategies to create different training corpuses, to select different features, etc. We have been able to do this within a day, mostly waiting for the desktop computer to build the semantic interpreter and the vectors for the training sets. It has been possible thanks to the KBpedia Knowledge Graph that enabled us to easily and automatically slice-and-dice the knowledge structure to perform all these tests quickly and efficiently.
There are other things we could do to continue to improve the prediction model, such as manually selecting features returned by KBpedia. Then we could test different parameters of the SVM classifier, etc. However, such tweaks are the possible topics of later use cases.
Let me add a few additional words about multiclass classification. As we saw, we can easily define domains by selecting one or multiple KBpedia reference concepts and all of their sub-classes. This general process enables us to scope any domain we want to cover. Then we can use the KBpedia Knowledge Graph’s relationship with external data sources to create the training corpus for the scoped domain. Finally, we can use SVM as a binary classifier to determine if an input text belongs to the domain or not. However, what if we want to classify an input text with more than one domain?
This can easily be done by using the one-vs-rest (also called one-vs-all) multiclass classification strategy. The only thing we have to do is to define multiple domains of interest, and then to create an SVM model for each of them. As noted above, this effort is almost solely one of posing one or more queries to KBpedia for a given domain. Finally, to predict if an input text belongs to any of the domain models we defined, we need to apply an SVM option (like LIBLINEAR) that already implements multi-class SVM classification.
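A small sketch of what that could look like in this setting (illustrative: the domain list and the classify-text-with-model helper, returning 1.0 for membership, are assumptions):

;; One-vs-rest prediction over several binary domain models.
(def domain-models
  ;; Illustrative domain -> model path mapping.
  {:music      "resources/svm/music/"
   :sports     "resources/svm/sports/"
   :technology "resources/svm/technology/"})

(defn classify-multiclass
  "Returns the set of domains whose binary SVM model accepts `text`.
   `classify-text-with-model` is an assumed helper that loads the model
   at the given path and returns 1.0 or 0.0 for the input text."
  [text]
  (->> domain-models
       (keep (fn [[domain model-path]]
               (when (= 1.0 (classify-text-with-model model-path text))
                 domain)))
       (set)))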
In this article, we tested multiple different strategies to create a good prediction model using SVM to classify input texts into a music-related class. We tested unbalanced training corpuses, balanced training corpuses, different sets of features, etc. Some of these tests improved the prediction model; others made it worse. The key point to remember is that any machine learning effort requires bounding, labeling, testing and refining multiple parameters in order to obtain the best results. Use of the KBpedia Knowledge Graph and its linkage to external public datasets enables Cognonto to do these previously lengthy and time-consuming tasks quickly and efficiently.
Within a few hours, we created a classifier with an accuracy of about 97% that determines whether an input text belongs to the music domain or not. We demonstrated how we can create such classifiers more-or-less automatically, using the KBpedia Knowledge Graph to define the scope of the domain and to classify new text into that domain based on relevant KBpedia reference concepts. Finally, we noted how we may create multi-class classifiers using exactly the same mechanisms.
Posted at 00:49
Here’s how to make a presence robot with Chromium 51, WebRTC, Raspberry Pi 3 and EasyRTC. It’s actually very easy, especially now that Chromium 51 comes with Raspbian Jessie, although it’s taken me a long time to find the exact incantation.
If you’re going to use it for real, I’d suggest using the
Posted at 21:53
For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.
XYZ is “open” because:
I gather that at
Posted at 14:51
In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)
To define the scope of those standards, let’s try and answer two questions.
Question 1: What are the various activities that we might want to carry out around an open dataset?
Question 2: What are the various activities that we might want to carry out around an open data catalogue?
Now, based on that quick review: which of these areas of functionality are covered by existing standards?
Posted at 14:22
Marvin Frommhold will
discuss the paper “Version Control for RDF Triple Stores”
by Steve Cassidy and James Ballantine which forms the foundation of
his own work regarding versioning for RDF.
Abstract: RDF, the core data format for the Semantic Web, is increasingly being deployed both from automated sources and via human authoring either directly or through tools that generate RDF output. As individuals build up large amounts of RDF data and as groups begin to collaborate on authoring knowledge stores in RDF, the need for some kind of version management becomes apparent. While there are many version control systems available for program source code and even for XML data, the use of version control for RDF data is not a widely explored area. This paper examines an existing version control system for program source code, Darcs, which is grounded in a semi-formal theory of patches, and proposes an adaptation to directly manage versions of an RDF triple store.
NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation
Afterwards, Diego Esteves will present the paper “NEED4Tweet: A Twitterbot for Tweets Named Entity Extraction and Disambiguation” by Mena B. Habib and Maurice van Keulen, which was accepted at ACL 2015.
Abstract: In this demo paper, we present NEED4Tweet, a Twitterbot for named entity extraction (NEE) and disambiguation (NED) for Tweets. The straightforward application of state-of-the-art extraction and disambiguation approaches on informal text widely used in Tweets, typically results in significantly degraded performance due to the lack of formal structure; the lack of sufficient context required; and the seldom entities involved. In this paper, we introduce a novel framework that copes with the introduced challenges. We rely on contextual and semantic features more than syntactic features which are less informative. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language.
This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
Posted at 07:55
Dear all,
the LIMES Dev team is happy to announce LIMES 1.0.0.
LIMES, the Link Discovery Framework for Metric Spaces, is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. Our approaches facilitate different approximation techniques to compute estimates of the similarity between instances. These estimates are then used to filter out a large amount of those instance pairs that do not satisfy the mapping conditions. By these means, LIMES can reduce the number of comparisons needed during the mapping process by several orders of magnitude. The approaches implemented in LIMES include the original LIMES algorithm for edit distances, HR3, HYPPO and ORCHID.
Additionally, LIMES supports HELIOS, the first planning technique for link discovery, which minimizes the overall execution time of a link specification without any loss of completeness. Moreover, LIMES implements supervised and unsupervised machine-learning algorithms for finding accurate link specifications. The algorithms implemented here include the supervised, active and unsupervised versions of EAGLE and WOMBAT.
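As a rough illustration of the underlying metric-space idea (a sketch only, not LIMES code): in any metric space the triangle inequality gives a cheap lower bound on the distance between two instances via an already-computed exemplar point, which lets many pairs be discarded without computing their actual distance.

;; Triangle-inequality filtering in a metric space: since
;; d(x, y) >= |d(x, e) - d(e, y)| for any exemplar e, a pair whose lower
;; bound already exceeds the threshold cannot match and is skipped.
(defn candidate-pairs
  [dist exemplar xs ys threshold]
  (let [dx (into {} (map (fn [x] [x (dist x exemplar)]) xs))
        dy (into {} (map (fn [y] [y (dist exemplar y)]) ys))]
    (for [x xs
          y ys
          :when (<= (Math/abs (double (- (dx x) (dy y)))) threshold)]
      [x y])))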
Website: http://aksw.org/Projects/LIMES.html
Download: https://github.com/AKSW/LIMES-dev/releases/tag/1.0.0
GitHub: https://github.com/AKSW/LIMES-dev
User manual: http://aksw.github.io/LIMES-dev/user_manual/
Developer manual: http://aksw.github.io/LIMES-dev/developer_manual/
What is new in LIMES 1.0.0:
We would like to thank everyone who helped to create this release. We also acknowledge the support of the SAKE and HOBBIT projects.
Kind regards,
Posted at 09:38
Dear all,
the Smart Data Analytics group at AKSW is happy to announce DL-Learner 1.3.
DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.
Website: http://dl-learner.org
GitHub page: https://github.com/AKSW/DL-Learner
Download: https://github.com/AKSW/DL-Learner/releases
ChangeLog: http://dl-learner.org/development/changelog/
DL-Learner is used for data analysis tasks within other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern-based and evolutionary techniques for learning on structured data. For a practical example, see http://dl-learner.org/community/carcinogenesis/. It also offers a plugin for Protégé, which can give suggestions for axioms to add.
In the current release, we added a large number of new algorithms and features. For instance, DL-Learner supports terminological decision tree learning, it integrates the LEAP and EDGE systems as well as the BUNDLE probabilistic OWL reasoner. We migrated the system to Java 8, Jena 3, OWL API 4.2 and Spring 4.3. We want to point to some related efforts here:
We want to thank everyone who helped to create this release, in particular we want to thank Giuseppe Cota, who visited the core developer team and significantly improved DL-Learner. We also acknowledge support by the recently started SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as the Big Data Europe and HOBBIT projects.
Kind regards,
Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin
Posted at 19:41