User: Dimitris Kontokostas. Created: 5/19/14 9:21 AM
Extracts the link texts used to refer to a page by means of internal links.
Extracts links from concepts to categories using the SKOS vocabulary.
Extracts links to corresponding Articles in Wikipedia.
This extractor extracts all templates that exist in an article.
This extractor extracts all templates that exist in an article. This data can be used for Wikipedia administrative tasks.
Will be thrown by WriterDestination to signal failed Quads (i.e.
Will be thrown by WriterDestination to signal failed Quads (i.e. Bad Iri)
Extracts labels for Categories.
This extractor extracts citation data from articles. To bootstrap this, it is based on the InfoboxExtractor.
Experimental extractor that extracts infobox facts that have a citation on the same line and places the citation in the context.
TODO: change the syntax on the mappings wiki to allow an arbitrary number of template properties.
Extract KML files from the Commons and links them to the original document.
Extract KML files from the Commons and links them to the original document. There are only 160 KML files on the Commons right now, but some day there might be more.
These are currently used as overlays, documented at https://commons.wikimedia.org/wiki/Commons:Geocoding/Overlay
Links non-commons DBpedia resources to their DBpedia Commons counterpart using owl:sameAs.
Links non-commons DBpedia resources to their DBpedia Commons counterpart using owl:sameAs. This requires the Wikipedia page to contain a {{Commons}} template.
Example http://en.wikipedia.org/wiki/Eurasian_blue_tit: Page contains node: {{Commons|Cyanistes caeruleus}}
Produces triple: <dbr:Eurasian_blue_tit> <owl:sameAs> <dbpedia-commons:Cyanistes_caeruleus>.
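A minimal, self-contained Scala sketch of this idea. It is not the framework's actual API: the regex-based template scan, the IRI building and all helper names are illustrative assumptions (the real extractor works on parsed template nodes and handles encoding and namespaces properly).

object CommonsLinkSketch {
  // Matches {{Commons|Some title}} anywhere in the wiki text (simplified stand-in
  // for walking the parsed template nodes).
  private val CommonsTemplate = """\{\{Commons\|([^}|]+)\}\}""".r

  // Returns a (subject, owl:sameAs, object) triple if the page uses {{Commons|...}}.
  def extract(pageTitle: String, wikiText: String): Option[(String, String, String)] =
    CommonsTemplate.findFirstMatchIn(wikiText).map { m =>
      val target = m.group(1).trim.replace(' ', '_')
      (s"http://dbpedia.org/resource/${pageTitle.replace(' ', '_')}",
       "http://www.w3.org/2002/07/owl#sameAs",
       s"http://commons.dbpedia.org/resource/$target")
    }

  def main(args: Array[String]): Unit =
    println(extract("Eurasian blue tit", "... {{Commons|Cyanistes caeruleus}} ..."))
}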
TODO: generic type may not be optimal.
TODO: generic type may not be optimal.
Used to map information that is only contained in the infobox template name, for example
Used to map information that is only contained in the infobox template name, for example
en:Infobox_Australian_Road {{TemplateMapping | mapToClass = Road | mappings = {{ConstantMapping | ontologyProperty = country | value = Australia }} ... }}
Created by IntelliJ IDEA.
Created by IntelliJ IDEA. User: Mohamed Morsey Date: 11/29/11 Time: 5:49 PM Extracts the information that describes the contributor (editor) of a Wikipedia page, such as his username, and his ID.
Links DBpedia Commons resources to their counterparts in other DBpedia languages (only en, de and fr) using owl:sameAs.
Links DBpedia Commons resources to their counterparts in other DBpedia languages (only en, de and fr) using owl:sameAs. This requires the Wikimedia page to contain a {{VN}} template.
Example http://commons.wikimedia.org/wiki/Cyanistes_caeruleus: Page contains node: {{VN |de=Blaumeise |en=Blue Tit |fr=Mésange bleue }}
Produces triple: <dbpedia-commons:Cyanistes_caeruleus> <owl:sameAs> <dbr:Eurasian_blue_tit>. <dbpedia-commons:Cyanistes_caeruleus> <owl:sameAs> <dbpedia-de:Blaumeise>. <dbpedia-commons:Cyanistes_caeruleus> <owl:sameAs> <dbpedia-fr:Mésange_bleue>
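A rough sketch of the per-language case, assuming the {{VN}} parameters can be read as simple key=value pairs. The regexes, namespace map and helper names are illustrative only, and redirect resolution (e.g. Blue Tit to Eurasian blue tit) is not handled here.

object VNLinkSketch {
  // Language editions the extractor links to, per the description above.
  private val targets = Map(
    "en" -> "http://dbpedia.org/resource/",
    "de" -> "http://de.dbpedia.org/resource/",
    "fr" -> "http://fr.dbpedia.org/resource/")

  private val VNTemplate = """\{\{VN([^}]*)\}\}""".r
  private val Param      = """\|\s*(\w+)\s*=\s*([^|}]+)""".r

  def extract(commonsTitle: String, wikiText: String): Seq[(String, String, String)] =
    VNTemplate.findFirstMatchIn(wikiText).toSeq.flatMap { m =>
      Param.findAllMatchIn(m.group(1)).toSeq.collect {
        case p if targets.contains(p.group(1)) =>
          (s"http://commons.dbpedia.org/resource/${commonsTitle.replace(' ', '_')}",
           "http://www.w3.org/2002/07/owl#sameAs",
           targets(p.group(1)) + p.group(2).trim.replace(' ', '_'))
      }
    }

  def main(args: Array[String]): Unit =
    extract("Cyanistes caeruleus", "{{VN |de=Blaumeise |en=Blue Tit |fr=Mésange bleue }}").foreach(println)
}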
Extracts disambiguation links.
Extracts links to external web pages.
TODO: generic type may not be optimal.
Identifies the type of a File page.
Extract images from galleries.
Extract images from galleries. I'm not sure what the best RDF representation of this will be, but for now we'll start with:
The gallery tag is documented at https://en.wikipedia.org/wiki/Help:Gallery_tag
Extracts the grammatical gender of people using a heuristic.
Extracts geo-coordinates.
Extracts geo-coordinates.
Extracts links to the official homepage of an instance.
Combines the raw infobox and mappings extractors and tries to split the triples of the raw infobox extractor into triples that were mapped by the mappings extractor and triples that were not mapped.
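A bare-bones sketch of the splitting step under one simplifying assumption: raw triples and mapped triples can be matched on a shared "source infobox property" key. The types and names below are stand-ins, not the framework's Quad model.

object HybridSplitSketch {
  // A bare-bones triple: (subject, predicate, object).
  type Triple = (String, String, String)

  // Splits raw infobox triples into those whose source infobox property was also
  // handled by the mappings extractor, and those that were not.
  def split(raw: Map[String, Triple], mappedKeys: Set[String]): (Seq[Triple], Seq[Triple]) = {
    val (mapped, unmapped) = raw.partition { case (sourceProperty, _) => mappedKeys(sourceProperty) }
    (mapped.values.toSeq, unmapped.values.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val raw = Map(
      "birth_date" -> (("dbr:Foo", "dbp:birthDate", "1980-01-01")),
      "some_field" -> (("dbr:Foo", "dbp:someField", "bar")))
    val (mapped, rest) = split(raw, mappedKeys = Set("birth_date"))
    println(s"mapped: $mapped")
    println(s"unmapped: $rest")
  }
}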
Extracts image annotations created using the Image Annotator gadget (https://commons.wikimedia.org/wiki/Help:Gadget-ImageAnnotator)
Extracts image annotations created using the Image Annotator gadget (https://commons.wikimedia.org/wiki/Help:Gadget-ImageAnnotator)
The RDF produced uses the W3C Media Fragments 1.0 to identify parts of an image: http://www.w3.org/TR/2012/REC-media-frags-20120925/
Reworked Image Extractor
This extractor extracts all properties from all infoboxes.
This extractor extracts all properties from all infoboxes. Extracted information is represented using properties in the http://xx.dbpedia.org/property/ namespace (where xx is the language code). The names of these properties directly reflect the name of the Wikipedia infobox property. Property names are not cleaned or merged. Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset. The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like “June 2009” to the XML Schema format “2009-06”. You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipedia properties and you are prepared to accept relatively noisy data.
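A small sketch illustrating the two points above: how the raw property IRI is derived from the infobox key, and the "June 2009" to "2009-06" style of clean-up. The helpers are simplified assumptions, not the extractor's real parsing code.

object InfoboxPropertySketch {
  // Builds the raw property IRI for a given language and infobox key,
  // e.g. ("en", "birth date") -> http://en.dbpedia.org/property/birth_date
  def propertyIri(lang: String, infoboxKey: String): String =
    s"http://$lang.dbpedia.org/property/" + infoboxKey.trim.replace(' ', '_')

  // Minimal value clean-up in the spirit of the description above:
  // "June 2009" -> "2009-06" (xsd:gYearMonth).
  private val months = List("January","February","March","April","May","June",
                            "July","August","September","October","November","December")
  def normalizeMonthYear(value: String): Option[String] = {
    val MonthYear = """(\w+)\s+(\d{4})""".r
    value.trim match {
      case MonthYear(m, y) if months.contains(m) => Some(f"$y-${months.indexOf(m) + 1}%02d")
      case _ => None
    }
  }

  def main(args: Array[String]): Unit = {
    println(propertyIri("en", "birth date"))  // http://en.dbpedia.org/property/birth_date
    println(normalizeMonthYear("June 2009"))  // Some(2009-06)
  }
}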
Extracts template variables from template pages (see http://en.wikipedia.org/wiki/Help:Template#Handling_parameters)
Extracts interwiki links
Extractors are mappings that extract data from a JsonNode.
Extractors are mappings that extract data from a JsonNode. Necessary to get some type safety in CompositeExtractor: Class[_ <: Extractor] can be checked at runtime, but Class[_ <: Mapping[PageNode]] can not.
User: hadyelsahar Date: 11/19/13 Time: 12:43 PM
User: hadyelsahar Date: 11/19/13 Time: 12:43 PM
JsonParseExtractor as explained in the design: https://f.cloud.github.com/assets/607468/363286/1f8da62c-a1ff-11e2-99c3-bb5136accc07.png
Sends the page to the JsonParser. If the parser returns None, nothing is done; if the page is parsed correctly, the JsonNode is passed to the next-level extractors.
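A sketch of that flow with stand-in types (Page, JsonNode and the toy parser below are illustrative, not the framework's classes):

object JsonParseFlowSketch {
  case class Page(title: String, body: String)
  case class JsonNode(fields: Map[String, String])
  type Quad = String

  // Parses the page body; None means "not JSON / could not be parsed".
  def parseJson(page: Page): Option[JsonNode] =
    if (page.body.trim.startsWith("{")) Some(JsonNode(Map("raw" -> page.body))) else None

  // The composite step described above: if parsing fails, emit nothing;
  // otherwise hand the JsonNode to every downstream extractor.
  def extract(page: Page, extractors: Seq[JsonNode => Seq[Quad]]): Seq[Quad] =
    parseJson(page) match {
      case None       => Seq.empty
      case Some(node) => extractors.flatMap(_(node))
    }

  def main(args: Array[String]): Unit = {
    val labelExtractor: JsonNode => Seq[Quad] = n => Seq(s"label quad from ${n.fields.size} field(s)")
    println(extract(Page("Q1", """{"labels": {}}"""), Seq(labelExtractor)))
    println(extract(Page("Foo", "not json"), Seq(labelExtractor)))
  }
}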
Extracts labels to articles based on their title.
Extracts structured data based on hand-generated mappings of Wikipedia infoboxes to the DBpedia ontology.
Extracts all media files of a Wikipedia page.
Extracts all media files of a Wikipedia page, constructs a thumbnail image from them, and links to the resources in DBpedia Commons.
FIXME: we're sometimes dealing with encoded links, sometimes with decoded links. It's quite a mess.
Created by IntelliJ IDEA.
Created by IntelliJ IDEA. User: Mohamed Morsey Date: 9/13/11 Time: 9:03 PM Extracts a page's meta-information, e.g. editlink, revisionlink, ....
Extracts page abstracts which are not yet extracted.
Extracts page abstracts which are not yet extracted. For each page which is a candidate for extraction
From now on we use MobileFrontend for MW < 2.21 and TextExtracts for MW > 2.22. The patched MW instance is no longer needed except for minor customizations in LocalSettings.php. TODO: we need to adapt the TextExtracts extension to accept custom wikicode syntax. TextExtracts now uses the article entry and extracts the abstract. The rationale for the new extension is that we will not need to load all articles into MySQL, just the templates. At the moment, setting up the patched MW takes longer than loading all articles into MySQL :) so, even this way it's way better and cleaner ;) We leave the old code commented since we might re-use it soon.
Extracts page html.
Extracts page html.
Based on AbstractExtractor; the major difference is the parameter apiParametersFormat = "action=parse&prop=text&section=0&format=xml&page=%s"
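A tiny sketch of how that parameter string would be turned into an API call; the endpoint URL and the helper are examples, not values taken from the framework's configuration.

object AbstractApiCallSketch {
  // The parameter string from the description above; %s is replaced with the page title.
  val apiParametersFormat = "action=parse&prop=text&section=0&format=xml&page=%s"

  // Builds the full MediaWiki API URL for one page title.
  def buildUrl(apiEndpoint: String, title: String): String = {
    val encoded = java.net.URLEncoder.encode(title, "UTF-8")
    s"$apiEndpoint?${apiParametersFormat.format(encoded)}"
  }

  def main(args: Array[String]): Unit =
    println(buildUrl("https://en.wikipedia.org/w/api.php", "Eurasian blue tit"))
}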
This class produces all NIF-related datasets for the abstract, as well as the short- and long-abstracts datasets, where the long abstract is the nif:isString attribute of the NIF instance representing the abstract section of a wiki page.
We are going to use this method for generating the abstracts from release 2016-10 onwards. It will be expanded to cover the whole wiki page in the future.
Extracts page ids of articles, e.g.
Extracts page ids of articles, e.g. <http://dbpedia.org/resource/Foo> <http://dbpedia.org/ontology/wikiPageID> "123456"^^<xsd:integer> .
Extracts internal links between DBpedia instances from the internal page links between Wikipedia articles.
Extracts internal links between DBpedia instances from the internal page links between Wikipedia articles. The page links might be useful for structural analysis, data mining or for ranking DBpedia instances using Page Rank or similar algorithms.
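A simplified, self-contained sketch of the page-link extraction. The regex is a stand-in for the real wikitext parser, and namespace filtering, redirects etc. are ignored; dbo:wikiPageWikiLink is the property named in the DBpedia ontology.

object PageLinksSketch {
  // Turns internal wiki links like [[Target]] or [[Target|label]] into
  // dbo:wikiPageWikiLink triples.
  private val InternalLink = """\[\[([^\]|#]+)[^\]]*\]\]""".r

  def extract(pageTitle: String, wikiText: String): Seq[(String, String, String)] = {
    val subject = s"http://dbpedia.org/resource/${pageTitle.replace(' ', '_')}"
    InternalLink.findAllMatchIn(wikiText).map { m =>
      val target = m.group(1).trim.replace(' ', '_')
      (subject, "http://dbpedia.org/ontology/wikiPageWikiLink", s"http://dbpedia.org/resource/$target")
    }.toSeq
  }

  def main(args: Array[String]): Unit =
    extract("Foo", "Foo links to [[Bar]] and [[Baz|that other page]].").foreach(println)
}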
Extractors are mappings that extract data from a PageNode.
Extractors are mappings that extract data from a PageNode. Necessary to get some type safety in CompositeExtractor: Class[_ <: Extractor] can be checked at runtime, but Class[_ <: Mapping[PageNode]] can not.
Extracts information about persons (date and place of birth etc.) from the English and German Wikipedia, represented using the FOAF vocabulary.
Extracts PND (Personennamendatei) data about a person.
Extracts PND (Personennamendatei) data about a person. PND is published by the German National Library. For each person there is a record with name, birth and occupation connected with a unique identifier, the PND number. TODO: also use http://en.wikipedia.org/wiki/Template:Authority_control and other templates.
Marker trait for mappings which map one or more properties of a specific class.
Marker trait for mappings which map one or more properties of a specific class. Necessary to make PropertyMappings distinguishable from other Mapping[TemplateNode] types.
Extracts links to the article revision that the data was extracted from, e.g.
Extracts links to the article revision that the data was extracted from, e.g. <http://dbpedia.org/resource/Foo> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Foo?oldid=123456> .
Extracts redirect links between Articles in Wikipedia.
Holds the redirects between wiki pages. At the moment, only redirects between Templates are considered.
Extracts revision ids of articles, e.g.
Extracts revision ids of articles, e.g. <http://dbpedia.org/resource/Foo> <http://dbpedia.org/ontology/wikiPageRevisionID> "123456"^^<xsd:integer> .
Extracts information about which concept is a category and how categories are related using the SKOS Vocabulary.
Extracts template variables from template pages (see http://en.wikipedia.org/wiki/Help:Template#Handling_parameters)
Relies on Cat main templates.
Relies on Cat main templates. Goes over all categories and extracts the DBpedia resources that are the main subject of each category. We use this to infer that a resource is a Topical Concept (see the sketch after the TODOs below).
TODO only do that for resources that have no other ontology type, in post-processing
TODO check if templates Cat_exp, Cat_main_section, and Cat_more also apply
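A minimal sketch of the inference, assuming the category page carries a {{Cat main|...}} template that can be matched textually; the regex, IRI building and use of dbo:TopicalConcept as the target class are simplifications of what the real extractor does.

object TopicalConceptSketch {
  // Matches {{Cat main|Some article}} on a category page.
  private val CatMain = """\{\{\s*Cat main\s*\|([^}|]+)\}\}""".r

  // For a category page that declares a main article, infer that the article's
  // resource is a Topical Concept.
  def extract(categoryTitle: String, wikiText: String): Option[(String, String, String)] =
    CatMain.findFirstMatchIn(wikiText).map { m =>
      val resource = s"http://dbpedia.org/resource/${m.group(1).trim.replace(' ', '_')}"
      (resource,
       "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
       "http://dbpedia.org/ontology/TopicalConcept")
    }

  def main(args: Array[String]): Unit =
    println(extract("Category:Blue tits", "{{Cat main|Eurasian blue tit}}"))
}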
Extracts sameAs links for resources with themselves.
Extracts sameAs links for resources with themselves. Only makes sense when serialization is configured such that subjects are IRIs and objects are URIs (or vice versa).
Extractors are mappings that extract data from a WikiPage.
Extractors are mappings that extract data from a WikiPage. Necessary to get some type safety in CompositeExtractor: Class[_ <: Extractor] can be checked at runtime, but Class[_ <: Mapping[PageNode]] can not.
Extracts the number of characters in a Wikipedia page.
Extracts the number of external links to DBpedia instances from the internal page links between Wikipedia articles.
Extracts the number of external links to DBpedia instances from the internal page links between Wikipedia articles. The out degree might be useful for structural analysis, data mining or for ranking DBpedia instances using PageRank or similar algorithms. The in degree cannot be calculated at extraction time but only with a post-processing step over the PageLinks dataset.
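A short sketch contrasting the two degrees described above: the out degree is countable from a single page's text at extraction time, while the in degree is aggregated afterwards from the full page-links dataset. The link regex and data shapes are simplified assumptions.

object DegreeSketch {
  private val InternalLink = """\[\[[^\]]+\]\]""".r

  // Out degree: number of internal links on one page, countable at extraction time.
  def outDegree(wikiText: String): Int = InternalLink.findAllMatchIn(wikiText).size

  // In degree: aggregated in a post-processing step from (source, target) page-link pairs.
  def inDegrees(pageLinks: Seq[(String, String)]): Map[String, Int] =
    pageLinks.groupBy(_._2).map { case (target, links) => target -> links.size }

  def main(args: Array[String]): Unit = {
    println(outDegree("Links to [[Bar]] and [[Baz]]."))                     // 2
    println(inDegrees(Seq("dbr:Foo" -> "dbr:Bar", "dbr:Baz" -> "dbr:Bar"))) // Map(dbr:Bar -> 2)
  }
}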
User: hadyelsahar Date: 11/19/13 Time: 12:43 PM
User: hadyelsahar Date: 11/19/13 Time: 12:43 PM
ParseExtractors as explained in the design: https://f.cloud.github.com/assets/607468/363286/1f8da62c-a1ff-11e2-99c3-bb5136accc07.png
Sends the page to the SimpleWikiParser. If the parser returns None, nothing is done; if the page is parsed correctly, the PageNode is passed to the next-level extractors.
Created by ali on 7/29/14.
Created by ali on 7/29/14. Extracts alias triples from Wikidata sources in the form of <http://wikidata.dbpedia.org/resource/Q446> <http://dbpedia.org/ontology/alias> "alias"@lang .
Created by ali on 7/29/14.
Created by ali on 7/29/14. Extracts description triples from Wikidata sources in the form of <http://wikidata.dbpedia.org/resource/Q139> <http://dbpedia.org/ontology/description> "description"@lang .
Extracts label triples from Wikidata sources in the form of http://data.dbpedia.org/Q64 rdfs:label "new York"@fr and http://data.dbpedia.org/Q64 rdfs:label "new York City"@en .
An extractor that extracts mappings between Wikidata URIs and the corresponding Wikidata URIs inside DBpedia, in the form of <http://wikidata.dbpedia.org/resource/Q18> owl:sameAs <http://www.wikidata.org/entity/Q18> .
Created by ali on 2/28/15.
Created by ali on 2/28/15. Wikidata property pages' aliases, descriptions, labels and statements are extracted. The wikidata namespace is used for property pages.
Aliases are extracted in the form of wikidata:P102 dbo:alias "political party, party"@en .
Descriptions are extracted in the form of wikidata:P102 dbo:description "the political party of which this politician is or has been a member"@en .
Labels are extracted in the form of wikidata:P102 rdfs:label "member of political party"@en .
Statements are extracted in the form of wikidata:P102 wikidata:P1646 wikidata:P580 .
Created by ali on 10/26/14.
Created by ali on 10/26/14. This extractor maps Wikidata statements to the DBpedia ontology, e.g. wd:Q64 dbo:primeMinister wd:Q8863 .
In order to extract n-ary relations, mapped statements are reified. For reification, a unique statement URI is created. Mapped statements are reified in the form of wd:Q64_P6_Q8863 rdf:type rdf:Statement . wd:Q64_P6_Q8863 rdf:subject wd:Q64 . wd:Q64_P6_Q8863 rdf:predicate dbo:primeMinister . wd:Q64_P6_Q8863 rdf:object wd:Q8863 .
Qualifiers use the same statement URIs and are mapped in the form of wd:Q64_P6_Q8863 dbo:startDate "2001-6-16"^^xsd:date . wd:Q64_P6_Q8863 dbo:endDate "2014-12-11"^^xsd:date .
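A compact sketch of the reification scheme described above, using the statement URI pattern wd:Q64_P6_Q8863 from the example; the triple representation and helper names are simplified stand-ins for the framework's Quad model.

object ReifiedStatementSketch {
  val rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  val wd  = "http://wikidata.dbpedia.org/resource/"

  // Builds the unique statement URI and the four reification triples for one
  // mapped statement, e.g. Q64 / dbo:primeMinister / Q8863 -> wd:Q64_P6_Q8863 ...
  def reify(subjectQ: String, propertyP: String, mappedPredicate: String, objectQ: String): Seq[(String, String, String)] = {
    val stmt = s"$wd${subjectQ}_${propertyP}_$objectQ"
    Seq(
      (stmt, rdf + "type",      rdf + "Statement"),
      (stmt, rdf + "subject",   wd + subjectQ),
      (stmt, rdf + "predicate", mappedPredicate),
      (stmt, rdf + "object",    wd + objectQ))
  }

  // Qualifiers reuse the same statement URI, e.g. start/end dates.
  def qualifier(stmtUri: String, predicate: String, value: String): (String, String, String) =
    (stmtUri, predicate, value)

  def main(args: Array[String]): Unit = {
    reify("Q64", "P6", "http://dbpedia.org/ontology/primeMinister", "Q8863").foreach(println)
    println(qualifier(wd + "Q64_P6_Q8863", "http://dbpedia.org/ontology/startDate", "2001-6-16"))
  }
}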
Created by ali on 10/26/14.
Created by ali on 10/26/14. Raw Wikidata statements are extracted in the form of wd:Q64 wikidata:P6 wd:Q8863 .
In order to extract n-ary relations, statements are reified. For reification, a unique statement URI is created. Statements are reified in the form of wd:Q64_P6_Q8863 rdf:type rdf:Statement . wd:Q64_P6_Q8863 rdf:subject wd:Q64 . wd:Q64_P6_Q8863 rdf:predicate wikidata:P6 . wd:Q64_P6_Q8863 rdf:object wd:Q8863 .
Qualifiers use the same statement URIs and are extracted in the form of wd:Q64_P6_Q8863 wikidata:P580 "2001-6-16"^^xsd:date . wd:Q64_P6_Q8863 wikidata:P582 "2014-12-11"^^xsd:date .
Created by ali on 10/26/14.
Created by ali on 10/26/14. Wikidata statements' references are extracted in the form of wd:Q76_P140_V39759 dbo:reference "http://www.christianitytoday.com/ct/2008/januaryweb-only/104-32.0.html?start=2"^^xsd:string .
An extractor that extracts sameAs data from DBpedia-Wikidata in the form of <http://wikidata.dbpedia.org/resource/Q18> owl:sameAs <http://dbpedia.org/resource/London> , <http://wikidata.dbpedia.org/resource/Q18> owl:sameAs <http://fr.dbpedia.org/resource/London> , <http://wikidata.dbpedia.org/resource/Q18> owl:sameAs <http://co.dbpedia.org/resource/London> .
Extracts wiki texts like abstracts or sections in html.
Extracts wiki texts like abstracts or sections in html. NOTE: This class is not only used for abstract extraction but for extracting the wiki text of the whole page. The NifAbstractExtractor extends this class. All configurations are now outsourced to //extraction-framework/core/src/main/resources/mediawikiconfig.json; change the 'publicParams' entries for tweaking endpoint and time parameters.
From now on we use MobileFrontend for MW < 2.21 and TextExtracts for MW > 2.22. The patched MW instance is no longer needed except for minor customizations in LocalSettings.php. TextExtracts now uses the article entry and extracts the abstract. The rationale for the new extension is that we will not need to load all articles into MySQL, just the templates. At the moment, setting up the patched MW takes longer than loading all articles into MySQL :) so, even this way it's way better and cleaner ;) We leave the old code commented since we might re-use it soon.
(Since version 2016-10) replaced by NifExtractor.scala: which will extract the whole page content including the abstract
Extracts the first image of a Wikipedia page.
Extracts the first image of a Wikipedia page. Constructs a thumbnail from it, as well as the full-size image.
FIXME: we're sometimes dealing with encoded links, sometimes with decoded links. It's quite a mess.
(Since version 2017-08) replaced by ImageExtractorNew
Creates new extractors.
Creates new extractors.
Creates new extractors.
Loads the mappings from the configuration and builds a MappingExtractor instance.
Loads the mappings from the configuration and builds a MappingExtractor instance. This should be replaced by a general loader later on, which loads the mapping objects based on the grammar (which can be defined using annotations)
Loads redirects from a cache file or source of Wiki pages.
Loads redirects from a cache file or source of Wiki pages. At the moment, only redirects between Templates are considered
Extracts the link texts used to refer to a page by means of internal links. This data provides one part of the input for the surface forms dataset.