Unsupervised Parser Challenge for Gutenberg Children corpus

The goal of the challenge is to have a parser trained without supervision produce parses that approximate the "expected" English parses as closely as possible, using the cleaned Gutenberg Children corpus as input and Link Grammar English parses in three forms as references.

Input:
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
(this is the "cleaned" Gutenberg Children corpus, tokenized with the Link Grammar English tokenization rules)

References:
1.  http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/LG5.5.1/capital/parses/
(the above is the "bronze standard": the input corpus parsed with the Link Grammar English dictionary; its tokenization differs slightly, which can be ignored when comparing results)
2. http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_fullyParsed.ull
(the above is the "silver standard": sentence parses selected from the previous parses and gathered in one file, where every sentence is 100% parsed with the Link Grammar English dictionary and contains no direct speech fragments)
3. http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_manual.ull
(the above is the "gold standard": 200+ sentences randomly selected from the previous parses and reviewed by a human, with the links validated)

Requirements: 
1. The parser should be trained without supervision on the input corpus, following the same tokenization and assuming that a space is the word separator and a double linefeed is the sentence separator.
2. The parser should be trained on a per-sentence basis, with no mutual impact from adjacent sentences.
3. The output parses for each of the reference files should have file names identical to those in the reference data.
4. Lower/capital case can be ignored, as the evaluation process ignores case.
5. If the parser produces parses in "phrase structure grammar" (PSG) form (linking words as well as compound phrases, like http://demo.chaoticlanguage.com/), as opposed to "link grammar" form (linking only words), those parses should **_somehow_** be converted to the "link grammar" structure.
6. Sample code for writing parses in the **ULL** format used by the reference parses is available here:
- Scheme: https://github.com/singnet/learn/blob/1b7220f066866e9ada13c96376ab7f87ee53a1aa/run-poc/redefine-mst-parser.scm#L148
- Java: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gram/main/LexStructor.java#L548 
7. Links from LEFT-WALL in the expected parses may be ignored and need not be produced, because links from LEFT-WALL and links to the ending period are not involved in the evaluation of the results.
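
Requirements 1, 4 and 7 can be illustrated with a minimal Python sketch. It is not the official evaluation code. The function names and the in-memory link representation, a tuple `(left_index, left_word, right_index, right_word)`, are hypothetical and chosen only for illustration. It also assumes that the wall token contains the text `LEFT-WALL` and that the ending period appears as the token `.`; both should be checked against the reference files.

```python
def split_sentences(text):
    """Split corpus text into sentences, each a list of tokens.

    A double linefeed separates sentences and whitespace separates words.
    A single linefeed inside a sentence is treated like a space here,
    which is an assumption of this sketch.
    """
    text = text.replace("\r\n", "\n")
    sentences = []
    for block in text.split("\n\n"):
        tokens = block.split()
        if tokens:  # skip empty blocks produced by extra blank lines
            sentences.append(tokens)
    return sentences


def comparable_links(links):
    """Reduce a parse to the set of links that would take part in a comparison.

    Each link is a hypothetical tuple:
    (left_index, left_word, right_index, right_word).
    Words are lower-cased (case is ignored), and links that touch
    LEFT-WALL or a period token are dropped. Dropping every link that
    touches "." is a simplification of "links to the ending period".
    """
    kept = set()
    for left_index, left_word, right_index, right_word in links:
        left, right = left_word.lower(), right_word.lower()
        if "left-wall" in left or "left-wall" in right:
            continue
        if left == "." or right == ".":
            continue
        kept.add((left_index, left, right_index, right))
    return kept


if __name__ == "__main__":
    corpus = "the cat sat .\n\nit ran .\n"
    print(split_sentences(corpus))

    parse = [
        (0, "###LEFT-WALL###", 2, "cat"),
        (1, "The", 2, "cat"),
        (2, "cat", 3, "sat"),
        (3, "sat", 4, "."),
    ]
    print(sorted(comparable_links(parse)))
```

Applying `comparable_links` to both the produced parse and the reference parse of the same sentence gives two sets that can be compared directly, for example by intersection, without the result depending on case, LEFT-WALL links, or period links.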

Other information:
- Sample parser code in Scheme: https://github.com/singnet/learn
- Sample parser code in Java: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gram/main/LexStructor.java#L649

