This is the first release of the SimPA corpus. It contains:
- 1,100 original sentences
- 3,300 lexical simplifications (3 for each original sentence)
- 1,100 syntactic simplifications (1 for each original sentence)
The lexical and syntactic simplifications were produced in two steps. First, volunteers lexically simplified the original sentences. Then, a subset of the lexically simplified sentences was syntactically simplified, also by volunteers.
This corpus is divided into five files:
- ls.original: original sentences before lexical simplification (this file contains repetitions - each original is repeated three times)
- ls.simplified: lexical simplifications for each entry of ls.original
- ss.original: original sentences before lexical and syntactic simplification
- ss.ls-simplified: lexically simplified sentences used as input for the syntactic simplification task
- ss.simplified: syntactic simplifications for each entry of ss.ls-simplified
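The file descriptions above suggest that the files are parallel, so a loader can be sketched as follows. This is a minimal sketch, not an official loader: it assumes each file is UTF-8 plain text with one sentence per line, and that line i of one file corresponds to line i of its counterpart. The function names `read_lines`, `load_parallel`, and `group_by_original` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def read_lines(path):
    """Return the lines of a text file without trailing newlines.

    Assumption: the file is UTF-8 with one sentence per line.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(source_path, target_path):
    """Pair line i of source_path with line i of target_path.

    Assumption: the two files are line-aligned. A length mismatch
    raises ValueError instead of silently truncating.
    """
    sources = read_lines(source_path)
    targets = read_lines(target_path)
    if len(sources) != len(targets):
        raise ValueError(
            f"{source_path} has {len(sources)} lines, "
            f"{target_path} has {len(targets)}"
        )
    return list(zip(sources, targets))


def group_by_original(pairs):
    """Collect all simplifications that share the same original sentence.

    Useful for ls.original, where each original is repeated three times.
    """
    grouped = defaultdict(list)
    for original, simplified in pairs:
        grouped[original].append(simplified)
    return dict(grouped)
```

Under these assumptions, `group_by_original(load_parallel("ls.original", "ls.simplified"))` would map each original sentence to its lexical simplifications, and `load_parallel("ss.ls-simplified", "ss.simplified")` would yield the input and output pairs of the syntactic simplification task. If the alignment assumption does not hold for your copy of the corpus, the length check is the first place that would show it.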
Citing SimPA
Carolina Scarton, Gustavo Henrique Paetzold and Lucia Specia (2018): SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain. To appear in Proceedings of LREC 2018, Miyazaki, Japan.