instead of a defined output dataset name, one can specify a name extension that is appended to the input dataset name (e.g. '-transitive' -> instance-types-transitive)
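A minimal sketch of how such an extension could be applied (the helper name extendDatasetName is hypothetical, not part of the framework):

  // Sketch: derive an output dataset name by appending an extension to the input name.
  def extendDatasetName(inputDataset: String, extension: String): String =
    inputDataset + extension

  // extendDatasetName("instance-types", "-transitive") == "instance-types-transitive"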
The version string of the DBpedia release being extracted
base-dir gives either an absolute or a relative path to where all data is stored. Normally, the wiki dumps are downloaded here and the extracted data is saved next to them; the created folder structure is {{lang}}wiki/$date
DEV NOTE: 1. this must stay lazy, as it might not be used or creatable in the SPARK extraction; 2. Download.scala in core does the creation
DEFAULT ./wikidumps
TODO rename dumpDir to baseDir
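A minimal sketch of the folder layout described above, assuming the per-language dump directory resolves to base-dir/{{lang}}wiki/$date (the helper name dumpDirectory is hypothetical):

  import java.io.File

  // Sketch: resolve the per-language dump directory under base-dir.
  def dumpDirectory(baseDir: File, wikiCode: String, date: String): File =
    new File(baseDir, s"${wikiCode}wiki/$date")

  // dumpDirectory(new File("./wikidumps"), "en", "20230601") -> ./wikidumps/enwiki/20230601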
the extractor classes to be used when extracting the XML dumps
An array of input dataset names (e.g. 'instance-types' or 'mappingbased-literals'), separated by ','
the suffix of the files representing the input dataset (usually a combination of the RDF serialization extension and the compression used, e.g. '.ttl.bz2' for Turtle triples compressed with bzip2)
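A minimal sketch of how the dataset names and the suffix combine into the expected input file names (the helper inputFileNames is hypothetical):

  // Sketch: append the input suffix to every input dataset name.
  def inputFileNames(datasets: Seq[String], suffix: String): Seq[String] =
    datasets.map(_ + suffix)

  // inputFileNames(Seq("instance-types", "mappingbased-literals"), ".ttl.bz2")
  //   == Seq("instance-types.ttl.bz2", "mappingbased-literals.ttl.bz2")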
determines whether 1. the download has to be completed and, if so, 2. looks for the download-complete file
- the language for which to check
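A minimal sketch of such a check, assuming the marker file sits in the language's dump folder (the file name and layout used here are assumptions):

  import java.io.File

  // Sketch: return true if the download-complete marker exists for the given language and date.
  def downloadComplete(baseDir: File, wikiCode: String, date: String): Boolean =
    new File(baseDir, s"${wikiCode}wiki/$date/download-complete").exists()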
An array of languages, specified by the exact enumeration of language wiki codes (e.g. en,de,fr...), by article count ranges ('10000-20000', or '10000-' for all wiki languages with at least that many articles), by '@mappings' or '@chapters' when only mapping/chapter languages are of concern, by '@downloaded' if all downloaded languages (containing the download.complete file) are to be processed, or by '@abstracts' to only process languages which provide human-readable abstracts (thus not 'wikidata' and the like)
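A simplified sketch of how such a setting could be interpreted; the '@' keywords are resolved against wiki metadata in the framework and are only hinted at here, and selectLanguages is a hypothetical helper:

  // Sketch: interpret one language setting (explicit codes, article count range, or @keyword).
  def selectLanguages(setting: String, articleCounts: Map[String, Int]): Seq[String] =
    setting.trim match {
      case kw if kw.startsWith("@") =>
        sys.error(s"keyword $kw is resolved against wiki metadata (not shown here)")
      case range if range.matches("""\d+-\d*""") =>
        val parts = range.split("-", -1)
        val min = parts(0).toInt
        val max = if (parts(1).isEmpty) Int.MaxValue else parts(1).toInt
        articleCounts.collect { case (code, n) if n >= min && n <= max => code }.toSeq
      case codes =>
        codes.split(",").map(_.trim).toSeq
    }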
The directory where all log files will be stored
Local mappings files, downloaded for speed and reproducibility. Note: this is lazy to defer initialization until actually called (e.g. this class is not used directly in the distributed extraction framework; DistConfig.ExtractionConfig extends Config and overrides this val to null because it is not needed)
The namespaces to load, as defined by the languages in use (see languages)
Local ontology file, downloaded for speed and reproducibility. Note: this is lazy to defer initialization until actually called (e.g. this class is not used directly in the distributed extraction framework; DistConfig.ExtractionConfig extends Config and overrides this val to null because it is not needed)
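A minimal sketch of why these values are lazy; the class and file names below are placeholders standing in for Config and DistConfig.ExtractionConfig:

  import java.io.File

  class BaseConfig {
    // Deferred: the (potentially expensive) download/lookup only happens on first access.
    lazy val ontologyFile: File = new File("ontology.xml") // placeholder path
  }

  class DistributedConfig extends BaseConfig {
    // A subclass that never needs the local file can override it without triggering the download.
    override lazy val ontologyFile: File = null
  }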
A dataset name for the output file generated (e.g. 'instance-types' or 'mappingbased-literals')
Same as inputSuffix, but for the output dataset
Number of parallel processes allowed. Depends on the number of cores, the type of disk, and the I/O speed
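A minimal sketch of bounding the work by the configured value; the value and wiring below are assumptions:

  import java.util.concurrent.Executors
  import scala.concurrent.ExecutionContext

  // Sketch: derive a bounded thread pool from the configured number of parallel processes.
  val parallelProcesses: Int = 4 // e.g. read from the config; tune for cores, disk and I/O
  implicit val extractionContext: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelProcesses))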
before processing a given language, check if the download.complete file is present
TODO experimental, ignore for now
Normally extraction jobs are run sequentially (one language after the other), but for some jobs it makes sense to run them in parallel. This should only be used if a single extraction job does not take up the available computing power.
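A minimal sketch of sequential versus parallel job execution; runJob and runAll are hypothetical stand-ins for the framework's job runner:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global

  def runJob(wikiCode: String): Unit = () // stand-in for one full extraction job

  // Sketch: run one job per language, either one after the other or concurrently.
  def runAll(languages: Seq[String], inParallel: Boolean): Unit =
    if (inParallel)
      Await.result(Future.sequence(languages.map(l => Future(runJob(l)))), Duration.Inf)
    else
      languages.foreach(runJob)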
If set, extraction summaries are forwarded via the Slack API, displaying messages on a dedicated channel. Related properties:
* the URL of the Slack webhook to be used
* the username under which all messages are posted (has to be registered for this webhook?)
* the threshold of extracted pages over which a summary of the current extraction is posted
* the threshold of exceptions over which an exception report is posted
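A minimal sketch of posting such a summary to a Slack incoming webhook; the helper and payload fields are placeholders, not the framework's actual reporting code:

  import java.net.{HttpURLConnection, URL}
  import java.nio.charset.StandardCharsets

  // Sketch: POST a JSON message to the configured webhook URL under the configured username.
  // Note: no JSON escaping is done here; illustration only.
  def postToSlack(webhookUrl: String, username: String, text: String): Int = {
    val payload = s"""{"username": "$username", "text": "$text"}"""
    val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
    conn.getResponseCode
  }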
get all universal properties and check whether there is an override in the provided config file
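A minimal sketch of that layering with java.util.Properties (the loading details and paths are assumptions):

  import java.io.FileReader
  import java.util.Properties

  // Sketch: load the universal defaults first, then let the job-specific file override duplicates.
  def loadConfig(universalPath: String, jobSpecificPath: String): Properties = {
    val props = new Properties()
    props.load(new FileReader(universalPath))
    props.load(new FileReader(jobSpecificPath))
    props
  }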
Documentation of config values
TODO: universal.properties is loaded and then overwritten by the job-specific config; however, we are working on removing universal properties by setting (and documenting) sensible default values here, which CAN be overwritten in a job-specific config
Guideline:
* Use Java/Scaladoc always
* Parameters (lazy val) MUST be documented in the following manner:
TODO @Fabian please:
* go through universal properties and other configs and move all comments here
* after removing, place a comment in the property file referring to
* set default values according to universal.properties
* try to FOLLOW THE GUIDELINES above, add TODO if unclear
* if possible, move all def functions to ConfigUtils
* check the classes using the params for validation checks and move them here