The goal of the challenge is to have an unsupervisedly trained parser produce parses that approximate the "expected" English parses as closely as possible, using cleaned Gutenberg Children corpus data as input and Link Grammar English parses in three forms as references.
Input:
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
(this is the "cleaned" Gutenberg Children corpus data, tokenized with Link Grammar English tokenization rules)
References:
- http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/LG5.5.1/capital/parses/
(the above is the "bronze standard" - the corpus above parsed with the Link Grammar English dictionary, with tokenization done in a slightly different way, which can be ignored when comparing results)
- http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_fullyParsed.ull
(the above is the "silver standard" - the previous parses gathered in one file, keeping only the sentences that are 100% parsed with the Link Grammar English dictionary and contain no direct speech fragments)
- http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_manual.ull
(the above is the "gold standard" - 200+ sentences randomly selected from the previous parses and reviewed by a human, with the links validated)
Requirements:
- The unsupervisedly trained parser should be trained on the input corpus using the same tokenization, assuming that a space is the word separator and a double linefeed is the sentence separator.
- The unsupervisedly trained parser should be trained on a per-sentence basis, with no mutual influence between adjacent sentences.
- The output parses for each of the reference files should have file names identical to those in the reference data.
- Lower/upper case should be ignored, as the evaluation process ignores case.
- If the parser provides parses in "phrase structure grammar" (PSG) structure (linking words as well as compound phrases, like http://demo.chaoticlanguage.com/), as opposed to "link grammar" structure (linking only words), the PSG parses should be converted to "link grammar" structure.
- Sample code for writing parses in the ULL format used by the reference parses is provided as follows:
- Links from LEFT-WALL in the expected parses may be ignored and need not be produced, because links from LEFT-WALL and links to the ending period are not involved in the evaluation of the results.
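The tokenization and evaluation rules above can be illustrated with a minimal Python sketch. It is not the official reader or evaluator. The function names `read_sentences` and `comparable_links` are hypothetical, and the link representation (pairs of 1-based token positions, with position 0 standing for LEFT-WALL) is an assumption made for illustration only.

```python
import re


def read_sentences(text):
    """Split corpus text into sentences, each a list of tokens.

    Sentences are separated by a double linefeed (blank line);
    tokens are split on whitespace, which covers the single-space
    word separator named in the requirements.
    """
    sentences = []
    for block in re.split(r"\n\s*\n", text.strip()):
        tokens = block.split()
        if tokens:  # skip empty blocks
            sentences.append(tokens)
    return sentences


def comparable_links(tokens, links):
    """Reduce a parse to the links that matter for comparison.

    tokens: words of one sentence (without LEFT-WALL).
    links:  iterable of (i, j) token positions; ASSUMPTION: positions
            are 1-based and 0 denotes LEFT-WALL.

    Links touching LEFT-WALL or the sentence-final period are dropped,
    and words are lowercased because case is ignored in evaluation.
    """
    last = len(tokens)
    ends_with_period = bool(tokens) and tokens[-1] == "."
    kept = set()
    for i, j in links:
        if 0 in (i, j):
            continue  # link from LEFT-WALL
        if ends_with_period and last in (i, j):
            continue  # link to the ending period
        kept.add((i, tokens[i - 1].lower(), j, tokens[j - 1].lower()))
    return kept
```

For example, `read_sentences("the dog runs .\n\na cat sleeps .")` yields two token lists, and passing the first one with the links `[(0, 3), (1, 2), (2, 3), (3, 4)]` to `comparable_links` keeps only the two word-to-word links.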
Other information: