De-duplication

This is the implementation of BLADE (Enhancing Silent Compiler Bug De-duplication via Deep Semantic Analysis).

Introduction

The compiler bug duplication problem (where many test failures are caused by the same compiler bug) can lead to huge waste of time and resource in diagnosing test failures produced by compiler testing. It is particularly challenging with regard to the silent compiler bugs that do not produce any error messages. We propose a new method named BLADE for solving silent compiler bugs de-duplication problem in a black-box manner.

In general, BLADE consists of three main components as shown in the figure :

failure-relevant semantic information learning, which employs intermediate representation learning to learn failure-relevant semantic information by determining an auxiliary objective for fine-tuning the pre-trained code representation model.
failure-relevant semantic information extraction, which empolys a model interpretation method to extract vectorized representations from the fine-tuned model as failure-relevant semantic information for de-duplication.
failure-triggering test program prioritization, which adopts the FPF algorithm to prioritize the given set of failure-trigging test programs to be de-duplicated according to thei failure-relevant semantic information.

Data Preparation

We have released our failure-triggering test programs and transformed failure-free ones data at data.zip and put it in the folder './data'. Additionally, the corresponding bug version of every failure-triggering test program is showed in file {compiler_version}-bugs.txt.

Failure-relevant Semantic Information Learning

Run the following command to get the fine-tuned auxiliary classification model.

$ python classification.py

Before that you can log in to Hugging Face to choose a pre-trained code representation model that you prefer and download. We use UniXCoder.

Note that we have released the parameters of the classification model trained on all four datasets. If you want to skip the training phase, you can use the parameters we trained directly. You can find them at model.

Failure-relevant Semantic Information Extraction

Run the following command to get the distance matrix.

$ python get_distance.py

We have re-implement the other four techniques based on our four datasets. If you want to skip this step, in directory 'distances', you can find the distance matrixs on four datasets and calculcated by five techniques : 'BLADE', 'Tamer', 'Trans', 'D3', 'D3-prog' with their numpy files.

Failure-triggering Test Program Prioritization

Run the following command to get the final result.

$ python prioritization.py

The figures will be saved and you can check it.

Additional results

Measuring effectiveness in capturing causal relationships.

It's hard to directly measure the accuracy of captured causal relationships, since the root causes of bugs are described in natural languages within bug reports while the failure-relevant features extracted from test programs are represented as semantic vectors (typical outputs of code representation models). There is no automatic methods measuring their matching degrees. But we manually analyzed some cases to demonstrate the accuracy.

During the fitting process, we meticulously monitored the dynamics of attention for each token. Tokens that exhibited heightened attention were visually marked in a vibrant shade of red, signifying an increase in attention. Conversely, tokens that experienced diminished attention were visually represented in a stark black hue, indicating a decrease in attention.

Case No.1

The fundamental root cause of the bug, which was triggered by the execution of this program, stems from three key factors:

Firstly, the definition of the structure itself, characterized by the presence of two integer member variables.
Secondly, the distinct initial values assigned to the global variables a and b.
And finally, the occurrence of a cross-assignment statement located on lines 05 and 06, contributing to the manifestation of the bug.

When the -O2 optimization level was applied, an erroneous optimization occurred leading to the incorrect elimination of line 5. This optimization was triggered after the cross-assignment of structure type objects a and b, both containing two member variables.

The heatmap representation of the attention diagram elucidates that:

The model accurately detects the presence of a particular structure type defined with two member variables at line 01.
Subsequently, the model identifies a discrepancy in the initialization of a and b, specifically observing that the values of the second member variables differ. This distinction is visually depicted as 0 appearing in black while 1 is highlighted in red within {0,1} at line 02.
The model exhibits heightened focus on the cross-assignment operation between a and b as indicated by line 05 and line 06.

In essence, through the meticulous analysis of attention patterns, the model comprehends the failure-relevant semantics associated with the cross-assignment of variables a and b within two distinct structure types, featuring divergent values for the second member variable. This alignment between attention and the root cause effectively validates the efficacy of the failure-relevant semantics extracted by BLADE.

Case No.2

This example triggers the same bug as case 1, and the root cause of the bug was identical in both cases.

Furthermore, by examining the heatmap of attention, it becomes evident that the model assigns a similar level of attention to this example as it did to case 1. This suggests that the model recognizes the shared patterns and relevance between the two cases, reinforcing the consistency of its attention mechanism.

Case No.3

First and foremost, to ensure clarity of understanding, it is imperative to acknowledge a prerequisite knowledge in the realm of the C language. By default, the char type is considered signed within this language. Consequently, when comparing a char type with an unsigned char type, implicit conversion takes place, wherein the unsigned char is converted to a char.

The root cause of the bug, which was triggered by the execution of this program, can be attributed to the following factors:

The presence of a global variable, denoted as a.
Declaration of an unsigned char type variable, b, assigned a value of 254. It is important to note that this value exceeds 127, and when compared to a char variable, it should be converted to -2 using the formula 254 - 266 = -2.
A char type variable, denoted as c.
At least one of the variables b and c is declared as const.
The comparison between b and c takes place, and under the -O2 optimization level, the value of b fails to be correctly converted to -2 but instead retains the value 254. 6l Finally, the result of the comparison is assigned to the variable a.

Regarding the heatmap changes in attention, it is evident that:

The model discovers the declaration of a, but does not pay attention to its type.( int is black but a is red in line 01)
The model focuses on the whole declaration statement of b, including its type and the initial value assigned to it 254, which is likely to indicate that the model has noticed the semantics that b is greater than 127. (line 02)
The model finds the declaration of c but ignores its initial value. Combining the previous line, it can be seen that the model knows this semantics, that is, it only needs to consider whether the assignment of the unsigned char type is greater than 127. (black 0 in line 03) (red 254 in line 02)
The model focuses on const. (line 03)
The model pays more attention to the execution of the d function with a comparison between b and c. (line 07)
Through the red = e, model knows the comparision result of b and c is assigned to the gloval variable a in function d. (line 04)

To summarize, a comprehensive analysis of attention reveals the failure-relevant semantics discerned by the model. Specifically, it identifies the disparity that arises when comparing an unsigned char type variable, surpassing the value of 127, with a char type variable, while incorporating the presence of at least one of them as const. Notably, this discrepancy manifests differently across various optimization levels. This alignment between attention patterns and the underlying root cause substantiates the efficacy of the failure-relevant semantics extracted by BLADE, thus providing compelling evidence in support of its effectiveness.

Case No.4

It was determined that this example encountered the same bug as case 3, with the root cause being identical.

Despite the if-statements in this program, its logical structure aligns with that of case 3. The model doesn't pay much attention to the if-statement. It primarily focuses on the definitions and comparisons involving char and const unsigned char and the value of unsigned char. This demonstrates the model's accurate identification of the genuine failure-relevant semantics within this set of examples.

Case No.5

Let us embark on the analysis of a more intricate example. The bug triggered by this program can be traced back to its root cause, which consists of the following elements:

The presence of two char type pointers, c and e, alongside a general pointer, d.
Within the initial code block, the value pointed to by the address stored in d is altered to 1.
Declaration of a char type variable, g, without assigning it a specific value.
Numerous complex factors contribute to the optimization process, such as the potential assignment of e or the utilization of the address of g. Ultimately, under the -O2 optimization level, g's address is erroneously assigned by d, leading to g assuming a value of 1.

By scrutinizing the heatmap of attention, we can extract the following insights:

The model identifies the declaration of a, b, the pointer d, and the char type of the variable c. This spans lines 01 to 03.
In the initial code block, the model exhibits attention solely towards the value alteration within the address pointed to by d, inadvertently overlooking other relevant aspects. This is indicated by the black coloration of f in lines 06 and 07.
The model places considerable emphasis on the entire declaration statement of the char variable g, as evidenced by the uniform red shading in line 10. This observation may signify the model's recognition of the varying values assigned to g across different optimization levels, with g's value ultimately influencing the program's output.
The model identifies additional assignment statements that play a pivotal role in triggering the optimization process, such as the potential utilization of g's address in line 13 and the possibility of altering the value of the global char pointer, e. Furthermore, line 14 directly influences the final output.

In summary, through a meticulous examination of attention patterns, the model discerns the failure-relevant semantics, which revolve around an uninitialized char type variable occupying the address of a global pointer that points to another modified value. These findings align remarkably well with the root cause, thereby substantiating the model's exceptional capability to elucidate failure-relevant semantics.

Wasted effort in all four datasets

We extend Table 2 in our paper to all four datasets, showing BLADE outperforms black-box techniques and achieves competable effectiveness in all four datasets in terms of wasted effort.

Name		Name	Last commit message	Last commit date
Latest commit History 158 Commits
data		data
distances		distances
save_pictures		save_pictures
saved_modelsman		saved_modelsman
README.md		README.md
classification.py		classification.py
gcc430-bugs.txt		gcc430-bugs.txt
gcc440-bugs.txt		gcc440-bugs.txt
gcc450-bugs.txt		gcc450-bugs.txt
get_distance.py		get_distance.py
llvm280-bugs.txt		llvm280-bugs.txt
model.py		model.py
prioritization.py		prioritization.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

De-duplication

Introduction

Data Preparation

Failure-relevant Semantic Information Learning

Failure-relevant Semantic Information Extraction

Failure-triggering Test Program Prioritization

Additional results

Measuring effectiveness in capturing causal relationships.

Wasted effort in all four datasets

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

De-duplication

Introduction

Data Preparation

Failure-relevant Semantic Information Learning

Failure-relevant Semantic Information Extraction

Failure-triggering Test Program Prioritization

Additional results

Measuring effectiveness in capturing causal relationships.

Wasted effort in all four datasets

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages