```python
dic[d[1]]=(d[0],d[1],d[2],list(set(dic[d[1]][3]+d[3])),list(set(dic[d[1]][4]+d[4])))
```
The order of the result of `list(set())` is not stable across runs under some circumstances: for strings, Python randomizes hashes per process unless `PYTHONHASHSEED` is fixed. It can be easily reproduced by running `python -c 'print(list(set(["fa", "dsa", "dsa", "w"])))'` several times.
In some cases, the code snippet above makes the DFG returned by `get_data_flow()` differ between runs, which causes varying CodeBLEU scores (specifically `dataflow_match_score`):
```python
dataflow_match_score = dataflow_match.corpus_dataflow_match(references, hypothesis, args.lang)

print('ngram match: {0}, weighted ngram match: {1}, syntax_match: {2}, dataflow_match: {3}'.\
    format(ngram_match_score, weighted_ngram_match_score, syntax_match_score, dataflow_match_score))
```
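To check the ordering claim above without relying on luck, here is a minimal sketch that runs the same one-liner in fresh interpreters under different `PYTHONHASHSEED` values. The helper name `run_with_seed` and the seed range are my own choices for illustration, not part of CodeXGLUE.

```python
import ast
import os
import subprocess
import sys

SNIPPET = 'print(list(set(["fa", "dsa", "dsa", "w"])))'


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    # The child prints a list literal; parse it back into a list.
    return ast.literal_eval(result.stdout.strip())


orders = [run_with_seed(seed) for seed in range(10)]

# The elements are always the same; only their order may differ by seed.
print({tuple(order) for order in orders})

# A fixed seed gives the same order on every run.
print(run_with_seed(0) == run_with_seed(0))
```

If this holds on your machine too, exporting a fixed `PYTHONHASHSEED` before running the evaluation should work as a stopgap for repeatable scores, although it does not remove the order dependence itself.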
I have compared `get_data_flow()` in CodeXGLUE with the one in GraphCodeBERT and found that the GraphCodeBERT version has no "merge nodes" step:
https://github.com/microsoft/CodeBERT/blob/ac04c77ca7cda9dc757dc8b4360e358731c8708e/GraphCodeBERT/codesearch/run.py#L68-L104
The CodeXGLUE version, which does merge nodes, is:
```python
def get_data_flow(code, parser):
    try:
        tree = parser[0].parse(bytes(code,'utf8'))
        root_node = tree.root_node
        tokens_index=tree_to_token_index(root_node)
        code=code.split('\n')
        code_tokens=[index_to_code_token(x,code) for x in tokens_index]
        index_to_code={}
        for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
            index_to_code[index]=(idx,code)
        try:
            DFG,_=parser[1](root_node,index_to_code,{})
        except:
            DFG=[]
        DFG=sorted(DFG,key=lambda x:x[1])
        indexs=set()
        for d in DFG:
            if len(d[-1])!=0:
                indexs.add(d[1])
            for x in d[-1]:
                indexs.add(x)
        new_DFG=[]
        for d in DFG:
            if d[1] in indexs:
                new_DFG.append(d)
        codes=code_tokens
        dfg=new_DFG
    except:
        codes=code.split()
        dfg=[]
    #merge nodes
    dic={}
    for d in dfg:
        if d[1] not in dic:
            dic[d[1]]=d
        else:
            dic[d[1]]=(d[0],d[1],d[2],list(set(dic[d[1]][3]+d[3])),list(set(dic[d[1]][4]+d[4])))
    DFG=[]
    for d in dic:
        DFG.append(dic[d])
    dfg=DFG
    return dfg
```
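One possible way to make the "merge nodes" step deterministic is to drop duplicates while keeping first-seen order instead of going through a `set`. The sketch below is only a suggestion, not a patch from the maintainers. The function name `merge_nodes` and the example entries are made up for illustration, and it assumes each DFG entry is a tuple whose fourth and fifth items are lists, as in the function above.

```python
def merge_nodes(dfg):
    """Merge DFG entries that share the same token index (d[1]).

    Mirrors the "merge nodes" loop, but removes duplicates with
    dict.fromkeys, which keeps first-seen order, instead of
    list(set(...)), whose order for strings can change between runs.
    """
    merged = {}
    for d in dfg:
        if d[1] not in merged:
            merged[d[1]] = d
        else:
            prev = merged[d[1]]
            merged[d[1]] = (
                d[0], d[1], d[2],
                list(dict.fromkeys(prev[3] + d[3])),
                list(dict.fromkeys(prev[4] + d[4])),
            )
    return list(merged.values())


# Made-up entries: (token, index, relation, parent tokens, parent indexes)
example = [
    ("x", 5, "comesFrom", ["b", "a"], [3, 1]),
    ("x", 5, "comesFrom", ["a", "c"], [1, 4]),
    ("y", 7, "computedFrom", ["x"], [5]),
]

print(merge_nodes(example))
# [('x', 5, 'comesFrom', ['b', 'a', 'c'], [3, 1, 4]), ('y', 7, 'computedFrom', ['x'], [5])]
```

Using `sorted(set(...))` in place of `list(set(...))` would also give a stable result. Either way the merged lists no longer depend on the hash seed, so the same input should produce the same DFG on every run.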
My reference and candidate are:
```python
candidate = \
'''
throws IOException {
int read = super.read(b, off, len);
if (read > 0) {
bytesRead.incrementAndGet();
}
return read;
}
'''
reference = \
'''
throws IOException {
// Obey InputStream contract.
checkPositionIndexes(off, off + len, b.length);
if (len == 0) {
return 0;
}
// The rest of this method implements the process described by the CharsetEncoder javadoc.
int totalBytesRead = 0;
boolean doneEncoding = endOfInput;
DRAINING:
while (true) {
// We stay in draining mode until there are no bytes left in the output buffer. Then we go
// back to encoding/flushing.
if (draining) {
to
'''
```
I was wondering if #104 ran into the same problem.
Thank you for your reply! @JiyangZhang @Imagist-Shuo @celbree
Code referenced above (at 6744a7f):
- CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/dataflow_match.py, line 100 (the `list(set())` merge)
- CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/calc_code_bleu.py, lines 64 to 67 (the `dataflow_match_score` computation and print)
- CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/dataflow_match.py, lines 64 to 105 (`get_data_flow()`)