Randomness in CodeBLEU computation #152

@iCSawyer

dic[d[1]]=(d[0],d[1],d[2],list(set(dic[d[1]][3]+d[3])),list(set(dic[d[1]][4]+d[4])))

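The order of the list returned by list(set(...)) is not deterministic. For strings it can change from one interpreter run to the next, because Python randomizes string hashes per process unless PYTHONHASHSEED is fixed. It can be easily reproduced by running python -c 'print(list(set(["fa", "dsa", "dsa", "w"])))' several times.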
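To make the reproduction less dependent on luck, here is a minimal sketch that runs the same one-liner in fresh interpreters with different PYTHONHASHSEED values. The helper name run_with_seed is hypothetical and only for illustration; it is not part of CodeBLEU.

```python
import ast
import os
import subprocess
import sys

# The same expression as in the one-liner above.
SNIPPET = 'print(list(set(["fa", "dsa", "dsa", "w"])))'


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed PYTHONHASHSEED.

    Returns the printed list, parsed back into a Python list.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return ast.literal_eval(result.stdout.strip())


if __name__ == "__main__":
    # Different seeds may print the elements in different orders;
    # the same seed always reproduces the same order.
    for seed in range(5):
        print(seed, run_with_seed(seed))
```

With a fixed seed the order is reproducible, which suggests that setting PYTHONHASHSEED is a possible workaround when comparing scores across runs, although it does not remove the underlying dependence on set ordering.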


In some cases, the code snippet above makes get_data_flow() return a different DFG from run to run, which in turn causes the CodeBLEU score to vary (specifically dataflow_match_score):

dataflow_match_score = dataflow_match.corpus_dataflow_match(references, hypothesis, args.lang)
print('ngram match: {0}, weighted ngram match: {1}, syntax_match: {2}, dataflow_match: {3}'.\
format(ngram_match_score, weighted_ngram_match_score, syntax_match_score, dataflow_match_score))


I compared this function with its counterpart in GraphCodeBERT and found that the GraphCodeBERT version has no "merge nodes" step.

https://github.com/microsoft/CodeBERT/blob/ac04c77ca7cda9dc757dc8b4360e358731c8708e/GraphCodeBERT/codesearch/run.py#L68-L104

def get_data_flow(code, parser):
    try:
        tree = parser[0].parse(bytes(code,'utf8'))
        root_node = tree.root_node
        tokens_index=tree_to_token_index(root_node)
        code=code.split('\n')
        code_tokens=[index_to_code_token(x,code) for x in tokens_index]
        index_to_code={}
        for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
            index_to_code[index]=(idx,code)
        try:
            DFG,_=parser[1](root_node,index_to_code,{})
        except:
            DFG=[]
        DFG=sorted(DFG,key=lambda x:x[1])
        indexs=set()
        for d in DFG:
            if len(d[-1])!=0:
                indexs.add(d[1])
            for x in d[-1]:
                indexs.add(x)
        new_DFG=[]
        for d in DFG:
            if d[1] in indexs:
                new_DFG.append(d)
        codes=code_tokens
        dfg=new_DFG
    except:
        codes=code.split()
        dfg=[]
    #merge nodes
    dic={}
    for d in dfg:
        if d[1] not in dic:
            dic[d[1]]=d
        else:
            dic[d[1]]=(d[0],d[1],d[2],list(set(dic[d[1]][3]+d[3])),list(set(dic[d[1]][4]+d[4])))
    DFG=[]
    for d in dic:
        DFG.append(dic[d])
    dfg=DFG
    return dfg
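
One possible way to make the "merge nodes" step deterministic is to deduplicate while keeping first-seen order instead of going through set(). The sketch below is only an illustration and not the maintainers' fix: merge_unique and merge_nodes are hypothetical names, and it assumes each DFG entry is a 5-tuple whose elements at positions 3 and 4 are lists, as the indexing in the function above suggests. Whether downstream dataflow matching needs one particular order has not been checked here.

```python
def merge_unique(first, second):
    """Concatenate two lists and drop duplicates, keeping first-seen order.

    dict.fromkeys preserves insertion order (Python 3.7+), so the result
    does not depend on hash values, unlike list(set(...)).
    """
    return list(dict.fromkeys(list(first) + list(second)))


def merge_nodes(dfg):
    """Merge DFG entries that share the same index (element 1).

    Assumption: every entry is a 5-tuple and entries 3 and 4 are lists.
    """
    merged = {}
    for d in dfg:
        key = d[1]
        if key not in merged:
            merged[key] = d
        else:
            prev = merged[key]
            merged[key] = (
                d[0],
                d[1],
                d[2],
                merge_unique(prev[3], d[3]),
                merge_unique(prev[4], d[4]),
            )
    return list(merged.values())


if __name__ == "__main__":
    # Made-up entries, purely for illustration.
    example = [
        ("read", 1, "comesFrom", ["b", "off"], [3, 4]),
        ("read", 1, "comesFrom", ["off", "len"], [4, 5]),
    ]
    print(merge_nodes(example))
```

As in the original line, the two lists are deduplicated independently; the only change is that the resulting order follows the input order rather than the hash order. Using sorted(set(...)) would be another deterministic option.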


My reference and candidate are:

  candidate = \
  '''
  throws IOException {
      int read = super.read(b, off, len);
      if (read > 0) {
          bytesRead.incrementAndGet();
      }
      return read;
  }
  '''
  reference = \
  '''
  throws IOException {
      // Obey InputStream contract.
      checkPositionIndexes(off, off + len, b.length);
      if (len == 0) {
      return 0;
      }

      // The rest of this method implements the process described by the CharsetEncoder javadoc.
      int totalBytesRead = 0;
      boolean doneEncoding = endOfInput;

      DRAINING:
      while (true) {
      // We stay in draining mode until there are no bytes left in the output buffer. Then we go
      // back to encoding/flushing.
      if (draining) {
          to
  '''

I was wondering if #104 ran into the same problem.

Thank you for your reply! @JiyangZhang @Imagist-Shuo @celbree
