Logs read bytes.
Counts read bytes, sends number to callback function.
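A minimal sketch of such a byte-counting stream, assuming it wraps another stream and reports the running total to the callback. The class name CountingStreamSketch and the callback type Long => Unit are hypothetical, not taken from the original code.

```scala
import java.io.{ByteArrayInputStream, FilterInputStream, InputStream}

// Hypothetical name: counts bytes as they are read and passes the running total to `callback`.
class CountingStreamSketch(underlying: InputStream, callback: Long => Unit)
    extends FilterInputStream(underlying) {

  private var count = 0L

  // Single-byte read: count one byte unless the end of the stream was reached.
  override def read(): Int = {
    val b = super.read()
    if (b != -1) { count += 1; callback(count) }
    b
  }

  // Bulk read: count however many bytes were actually read.
  // FilterInputStream.read(Array[Byte]) delegates to this method, so it is covered too.
  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    val n = super.read(buf, off, len)
    if (n > 0) { count += n; callback(count) }
    n
  }
}
```

A logging variant would simply pass a callback that writes the count to a log.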
Created by Chile on 1/30/2017.
Created by Chile on 11/3/2016.
Allows common handling of java.io.File and java.nio.file.Path
Recursively iterates through a directory and calls a user-defined function on each file.
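The recursive iteration described above could look roughly like the following sketch. The object name DirWalkSketch, the method name foreachFile and the sorting by file name are assumptions for illustration; the original may traverse in a different order.

```scala
import java.io.File

object DirWalkSketch {
  // Calls `visit` on every non-directory entry below `dir`, descending into subdirectories.
  def foreachFile(dir: File)(visit: File => Unit): Unit = {
    // listFiles() returns null if `dir` is not a directory or cannot be read.
    val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
    // Sorting only makes the order deterministic for this sketch.
    for (child <- children.sortBy(_.getName)) {
      if (child.isDirectory) foreachFile(child)(visit)
      else visit(child)
    }
  }
}
```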
if the given base could not be found
Helps to find files and directories in a directory structure as used by the Wikipedia dump download site, for example baseDir/enwiki/20120403/enwiki-20120403-pages-articles.xml.bz2
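As an illustration of that directory layout, the sketch below builds the path of a dump file from its parts. The object name DumpPathSketch and the parameter names are hypothetical; only the layout baseDir/wiki/date/wiki-date-suffix is taken from the example above.

```scala
import java.io.File

object DumpPathSketch {
  // Example values: wikiName = "enwiki", date = "20120403", suffix = "pages-articles.xml.bz2"
  def dumpFile(baseDir: File, wikiName: String, date: String, suffix: String): File = {
    val wikiDir = new File(baseDir, wikiName) // baseDir/enwiki
    val dateDir = new File(wikiDir, date)     // baseDir/enwiki/20120403
    new File(dateDir, wikiName + "-" + date + "-" + suffix)
  }
}
```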
TODO: wikiNameSuffix doesn't belong here, it should be part of the Language class (which should be renamed to WikiCode or so)
Represents a MediaWiki instance and the language used on it. Initially, this class was only used for xx.wikipedia.org instances, but now we also use it for mappings.dbpedia.org and www.wikidata.org. For each language, there is only one instance of this class. TODO: rename this class to WikiCode or so, distinguish between enwiki / enwiktionary etc.
The MediaWiki API connector.
This class provides the necessary attributes to record either a successful or failed extraction
Defines additional methods on Files, which are missing in the standard library.
Wrapper class for StartElement so we can use attr and getAttr.
Defines additional methods on strings, which are missing in the standard library.
Provides the same flexibility as RichFile for web resources. No output stream is available!
Created by Chile on 1/30/2017.
Resolves transitive relations in a graph and removes cycles.
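One possible reading of this, sketched under a strong assumption: every node has at most one outgoing link (as with redirects), so the graph can be held in a Map. Chains such as a -> b -> c are resolved to a -> c, and any node whose chain runs into a cycle is dropped. The name ResolveChainsSketch is hypothetical, and the original class may handle general graphs differently.

```scala
object ResolveChainsSketch {
  // Maps every key to the end of its chain; keys whose chain enters a cycle are dropped.
  def resolve(links: Map[String, String]): Map[String, String] = {
    links.keys.flatMap { start =>
      val seen = scala.collection.mutable.Set(start)
      var current = links(start)
      var cyclic = false
      // Follow the chain until it ends or revisits a node.
      while (!cyclic && links.contains(current)) {
        if (!seen.add(current)) cyclic = true
        else current = links(current)
      }
      if (cyclic) None else Some(start -> current)
    }.toMap
  }
}
```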
Escapes a Unicode string according to Turtle / N-Triples format. TODO: allow StringBuilder to be null, create one if necessary.
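A rough sketch of such an escaping routine follows. It assumes the conservative rule set of the original N-Triples format: backslash, double quote, tab, newline and carriage return get short escapes, and everything outside printable ASCII is written as \uXXXX or \UXXXXXXXX. The name NTriplesEscapeSketch is hypothetical, and the real method may escape a different set of characters (Turtle, for example, also allows raw UTF-8).

```scala
object NTriplesEscapeSketch {
  // Upper-case hex representation of `cp`, left-padded with zeros to `width` digits.
  private def hex(cp: Int, width: Int): String = {
    val s = Integer.toHexString(cp).toUpperCase
    ("0" * (width - s.length)) + s
  }

  // Appends the escaped form of `input` to `sb` and returns `sb`.
  def escape(sb: java.lang.StringBuilder, input: String): java.lang.StringBuilder = {
    var i = 0
    while (i < input.length) {
      // Iterate by code point so that surrogate pairs become a single \U escape.
      val cp = input.codePointAt(i)
      i += Character.charCount(cp)
      if (cp == '\\') sb.append("\\\\")
      else if (cp == '"') sb.append("\\\"")
      else if (cp == '\n') sb.append("\\n")
      else if (cp == '\r') sb.append("\\r")
      else if (cp == '\t') sb.append("\\t")
      else if (cp >= 0x20 && cp < 0x7F) sb.append(cp.toChar)
      else if (cp <= 0xFFFF) sb.append('\\').append('u').append(hex(cp, 4))
      else sb.append('\\').append('U').append(hex(cp, 8))
    }
    sb
  }
}
```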
Executes queries to the MediaWiki API.
TODO: replace this class by code adapted from WikiDownloader.
Calls a Wikipedia URL, handles redirects to a different language version, processes the response.
Reads result of the api.php query above.
Downloads all pages for a given list of namespaces from api.php and transforms them into the format of the dump files (because XMLSource understands that format).
TODO: extend this class a bit and replace the XML-handling code in WikiApi.
Information about a Wikipedia.
Reads result of the api.php query above.
Note: we use linked sets and maps to preserve order. Scala currently has no immutable linked collections, so we use mutable ones (which should also improve performance). Calling .toMap to make them immutable would destroy the order, so we simply return them, but as an immutable interface. Malicious users could still downcast and mutate. Meh.
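The pattern described in this note can be sketched as follows: build the result in a mutable LinkedHashMap, which keeps insertion order, and return it typed as the read-only scala.collection.Map. The object name OrderedResultSketch and the method namespaces are hypothetical. As the note says, a caller could still downcast and mutate the result.

```scala
import scala.collection.mutable

object OrderedResultSketch {
  // Returns an insertion-ordered map through the read-only scala.collection.Map interface.
  def namespaces(pairs: Seq[(Int, String)]): scala.collection.Map[Int, String] = {
    val result = mutable.LinkedHashMap[Int, String]()
    for ((code, name) <- pairs) result(code) = name
    // No .toMap here: converting to an immutable Map would not preserve the order.
    result
  }
}
```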
A simple fixed size thread-pool.
TODO: If a worker thread dies because of an uncaught exception, it just goes away and we may not fully use all CPUs. Maybe we should start a new worker thread? Or use a thread pool that does that for us? On the other hand, what about worker.init() and worker.destroy()? We probably don't want to call them twice. No, I guess it's better to let the thread die. Users can always catch Throwable in their implementation of Worker.process().
FIXME: If all worker threads die because of uncaught exceptions, the master thread will probably still add tasks to the queue and block forever. When a worker thread dies, it should count down the number of live threads and if none are left interrupt the master thread if it is blocking in process(). But what if there are multiple master threads? Ough. We need more ways to communicate between masters and workers...
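To make the design discussed above concrete, here is a minimal sketch of a fixed-size pool in which worker threads take tasks from a bounded queue. All names (FixedPoolSketch, start, process, stop) are assumptions for illustration and not the original API. The sketch shares the limitations noted above: a worker that throws simply dies, and if all workers die, process() can block forever once the queue is full.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical sketch: `threads` workers take tasks from a queue of at most `queueLength` entries.
class FixedPoolSketch[T](threads: Int, queueLength: Int)(work: T => Unit) {

  // Some(task) is a real task, None is a stop marker for one worker.
  private val queue = new ArrayBlockingQueue[Option[T]](queueLength)

  private val workers = (1 to threads).map { _ =>
    new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          queue.take() match {
            case Some(task) => work(task) // an uncaught exception here kills this worker
            case None       => running = false
          }
        }
      }
    })
  }

  def start(): Unit = workers.foreach(_.start())

  // Called by the master thread; blocks while the queue is full.
  def process(task: T): Unit = queue.put(Some(task))

  // Sends one stop marker per worker, then waits for all workers to finish.
  def stop(): Unit = {
    workers.foreach(_ => queue.put(None))
    workers.foreach(_.join())
  }
}
```

Because the queue is first-in, first-out, every task submitted before stop() is taken before any stop marker is seen.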
Wraps an XMLEventReader in a fluent API.
Wraps an XMLEventWriter in a fluent API.
Various utils for loading extractors. I don't like this so much, but it's the only way to reuse extraction configuration code in multiple modules (dump / server). User: Dimitris Kontokostas. Created: 5/19/14 11:06 AM.
TODO: modify the bzip code such that there are no run-time dependencies on commons-compress. Users should be able to use .gz files without having commons-compress on the classpath. Even better, look for several different bzip2 implementations on the classpath...
Created by aditya on 6/21/16.
Download mapping pages for all namespaces from http://mappings.dbpedia.org/ and transform them into the format of the dump files (because XMLSource understands that format).
Download ontology classes and properties from http://mappings.dbpedia.org/ and transform them into the format of the dump files (because XMLSource understands that format).
Download ontology classes and properties from http://mappings.dbpedia.org/ and transform them into the format of the dump files (because XMLSource understands that format). Also save the result as OWL.
This class requires the java.nio.file package, which is available since JDK 7.
If you want to compile and run DBpedia with an earlier JDK version, delete or blank these two files:
core/src/main/scala/org/dbpedia/extraction/util/RichPath.scala
dump/src/main/scala/org/dbpedia/extraction/dump/clean/Clean.scala
The launchers 'purge-download' and 'purge-extract' in the dump/ module won't work, but they are not vitally necessary.
Defines additional methods on strings, which are missing in the standard library.
Helper methods to escape / unescape Turtle / N-Triples.
TODO: most of these methods could be much more efficient - they should only create a StringBuffer if the input actually needs to be changed. Otherwise, they should simply return the input string. See StringUtils.escape.
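The optimization suggested in this TODO could be sketched like this: scan the input, allocate a builder only when the first character that needs changing is found, and otherwise return the input string itself. The object LazyEscapeSketch and its space-to-underscore rule are purely illustrative and not one of the original helper methods.

```scala
object LazyEscapeSketch {
  // Replaces spaces with underscores, allocating a builder only if a space is found.
  def replaceSpaces(input: String): String = {
    var sb: java.lang.StringBuilder = null
    var i = 0
    while (i < input.length) {
      val c = input.charAt(i)
      if (c == ' ') {
        // First change: copy the unchanged prefix into a new builder.
        if (sb == null) sb = new java.lang.StringBuilder(input.length).append(input, 0, i)
        sb.append('_')
      } else if (sb != null) {
        sb.append(c)
      }
      i += 1
    }
    // Nothing changed: return the original string without copying.
    if (sb == null) input else sb.toString
  }
}
```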
Helper methods to create WikiInfo objects.
Contains several utility functions related to WikiText.
Created by ali on 2/1/15.
Provides the overall state of a worker
Constants for workers.
Logs read bytes. Meant to be used with CountingInputStream.