The code `tokval = " ".join(tokval.split())` has the effect of normalizing whitespace. Is this normal for a benchmark intended to measure next-token prediction? Should whitespace be kept to measure the model's ability to predict the next token in "normal" code? Here is the location in the pre-processing script:
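To illustrate the concern, here is a minimal sketch of what that expression does: `" ".join(s.split())` collapses every run of whitespace (spaces, tabs, newlines) into a single space, so any distinctive spacing inside a token value is lost before training. The sample strings below are illustrative, not taken from the dataset.

```python
# " ".join(s.split()) collapses any run of whitespace into a single space.
samples = [
    "x  =   1",            # multiple spaces between characters
    "a\tstring\nliteral",  # tabs and newlines inside a string token
]
for tokval in samples:
    normalized = " ".join(tokval.split())
    print(repr(tokval), "->", repr(normalized))
```

Both samples come out with plain single spaces, so a model evaluated on the normalized data never has to predict the original spacing.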
https://github.com/microsoft/CodeXGLUE/blob/ac74a62802a0dd159b3258c78a2df8ad36cdf2b9/Code-Code/CodeCompletion-token/dataset/py150/preprocess.py#L53C17-L53C50
"Line level code completion task shares the train/dev dataset with token level completion" so it might have more impact there - giving overly optimistic results..
Maybe a dedicated token should be used in the pre-processing to distinguish between spaces that merely separate tokens and spaces that are part of the structure of the code?
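One possible direction for such a distinction (a hedged sketch, not a claim about what CodeXGLUE does): Python's standard `tokenize` module already emits explicit structural tokens such as `NEWLINE`, `INDENT`, and `DEDENT`, so structural whitespace could be preserved as tokens rather than lost to normalization.

```python
# Sketch: tokenize a snippet and show that indentation and line breaks
# surface as explicit NEWLINE / INDENT / DEDENT tokens.
import io
import tokenize

src = "def f():\n    x = 1\n    return x\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
```

A pre-processing step that kept these structural tokens would let the benchmark score a model on predicting code layout as well as code content.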