Commit f332ede — first commit (850 files changed, 84,604 additions, 0 deletions)
.gitattributes (2 additions)

```
*.tar.gz filter=lfs diff=lfs merge=lfs -text
repos/python_repos.zip filter=lfs diff=lfs merge=lfs -text
```

README.md (75 additions)

# 📂 RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation

📄 Here, we anonymously provide the data, automation scripts, prompt templates, and experimental results of RepoTransBench.

> In this paper, we introduce a real-world benchmark for repository-level code translation.

**📦 Repository Dataset:** Download the repository dataset from [RepositoryDataset](https://drive.google.com/file/d/1-BwolLb8MY0dLJnBhYQakTv6lmA8qSLh/view?usp=sharing) and run `tar -zxvf python_repos.tar.gz` to extract it into the `./repos` directory.

**🔬 Experimental Results:** Download the experimental results from [ExperimentalResults](https://drive.google.com/file/d/1muVM3cWMceJqRo1FJQQDmvhzqY_vFHq6/view?usp=sharing) and run `tar -zxvf experiment_results.tar.gz` to extract the result files.

**🔧 Research Questions:** The research question results and the corresponding scripts are available in the `./RQ` directory.

---
### Translation Level

![Translation Level](asset/TranslationLevel.png)

### Translation Performance

| **Model** | **Success@1** | **Success@2** | **Success@3** | **Build@1** | **Build@2** | **Build@3** | **APR** |
| :------------------------------ | :-----------: | :-----------: | :-----------: | :---------: | :---------: | :---------: | :-------: |
| Llama-3.1-8B-Inst | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Llama-3.1-70B-Inst | 1.33% | 2.33% | 3.00% | 2.67% | 4.33% | 6.00% | 1.30% |
| Llama-3.1-405B-Inst | 2.67% | 3.33% | 4.00% | 5.67% | 8.00% | 10.00% | 4.70% |
| DeepSeek-V2.5 | 3.00% | 4.67% | 6.00% | 12.00% | 17.00% | 20.00% | 6.20% |
| GPT-3.5-Turbo | 0.67% | 1.00% | 1.00% | 2.33% | 4.00% | 5.00% | 1.10% |
| GPT-4 | 2.33% | 3.33% | 4.00% | 4.33% | 7.00% | 9.00% | 2.00% |
| GPT-4o | 4.00% | 6.33% | 8.00% | 9.00% | 14.67% | 19.00% | 6.40% |
| Claude-3.5-Sonnet | 7.33% | 10.33% | 12.00% | 28.33% | 37.67% | 42.00% | 16.50% |
| CodeLlama-34B-Inst | 0.00% | 0.00% | 0.00% | 0.37% | 0.67% | 1.00% | 0.00% |
| Codestral-22B | 2.08% | 3.33% | 5.00% | 5.90% | 8.33% | 12.00% | 2.60% |
| DeepSeek-Coder-V2-Inst | 4.86% | 6.33% | 7.00% | 16.84% | 20.33% | 24.00% | 8.40% |
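The Success@k and Build@k columns report, per repository, whether at least one of k sampled attempts succeeds. As a hedged sketch (this is the standard unbiased pass@k estimator, not necessarily the exact aggregation the benchmark scripts use), with n attempts of which c succeed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n trials
    (c of which succeeded) is a success: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than k samples: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 rounds, 1 success: pass@2 = 1 - C(2,2)/C(3,2) = 1 - 1/3 ≈ 0.667
print(pass_at_k(3, 1, 2))
```

For n = 3 this is equivalent to averaging the pairwise unions of the three rounds, which is how `RQ/RQ1/RQ1_calc.py` computes Pass@2.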
### Debugging Performance

![Debug Results](asset/DebugResults.png)

---
**⚠️ If you want to reproduce the results from scratch, please follow these steps:**

**🛠️ Set-Up:** Download the Docker container from [Docker4RepoTransBench](https://drive.google.com/file/d/1q4LpOMn-XQfMXrU0GxsJItZNQ6shuGTr/view?usp=sharing) and load it to construct your Docker environment.

## 🚀 Evaluation

The evaluation commands are as follows; we provide examples for GPT-4o:

```bash
# Translation and debugging
python main.py \
    --enable_translate \
    --model_name 'GPT-4o' \
    --enable_debug \
    --debug_mode 'filter'
```

```bash
# Translation only
python main.py \
    --enable_translate \
    --model_name 'GPT-4o'
```

```bash
# Debugging only
# --history_time: the history time of the translation results to debug
python main.py \
    --model_name 'GPT-4o' \
    --enable_history '' \
    --history_time '' \
    --enable_debug \
    --debug_mode 'filter'
```

RQ/RQ1/RQ1_Results.xlsx (42.6 KB, binary file not shown)

RQ/RQ1/RQ1_calc.py (211 additions)

```python
import os
import re
import json

import pandas as pd


def parse_test_results(file_path):
    """
    Parse each txt file to determine if execution was successful, if compilation passed, and the number of passed tests.
    :param file_path: Path to the txt file
    :return: Execution success (boolean), compilation success (boolean), number of passed tests (integer), total number of tests (integer)
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # The second line of the file records the execution status
    execution_status = lines[1].strip() if len(lines) > 1 else "Failure"
    is_successful = (execution_status == "Success")

    # Determine if compilation passed and parse the pass counts
    is_compiled = False
    total_tests = 0
    passed_tests = 0

    for line in lines:
        if "T E S T S" in line:
            is_compiled = True
        match = re.search(r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)", line)
        if match:
            run_tests = int(match.group(1))
            failures = int(match.group(2))
            errors = int(match.group(3))
            passed_tests += run_tests - (failures + errors)
            total_tests += run_tests

    return is_successful, is_compiled, passed_tests, total_tests
```
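As a quick sanity check on the surefire summary regex used in `parse_test_results` (the sample line below is illustrative, not taken from real results):

```python
import re

line = "Tests run: 12, Failures: 1, Errors: 0, Skipped: 2"
match = re.search(r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)", line)
assert match is not None
run_tests, failures, errors, skipped = map(int, match.groups())

# As in the function above: passed = run - (failures + errors),
# so skipped tests are counted as passed here
print(run_tests - (failures + errors))  # 11
```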
```python
def get_repo_java_name(repo_path):
    """
    Generate the Java repo name based on repo_path from info_raw.jsonl
    :param repo_path: repo_path string
    :return: Generated Java path name
    """
    # Strip '-', '_', '.', capitalize each path segment, and append 'Java'
    return ''.join([item.replace('-', '').replace('_', '').replace('.', '').capitalize()
                    for item in repo_path.split('/')]) + 'Java'
```
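A couple of illustrative inputs for this name mangling (the repository paths are hypothetical; note that `str.capitalize()` also lowercases the rest of each segment). The one-liner is repeated here so the check is self-contained:

```python
def get_repo_java_name(repo_path):
    # Same mangling as above
    return ''.join([item.replace('-', '').replace('_', '').replace('.', '').capitalize()
                    for item in repo_path.split('/')]) + 'Java'

print(get_repo_java_name('psf/requests'))              # PsfRequestsJava
print(get_repo_java_name('pallets/flask-sqlalchemy'))  # PalletsFlasksqlalchemyJava
```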
```python
def process_execution_results(base_dir, repo_java_name):
    """
    Recursively process execution results in the specified folder, returning statistics for multiple rounds and models.
    :param base_dir: The root path of the current model folder
    :param repo_java_name: The generated repo Java name
    :return: Execution success and pass percentage results for each model and round
    """
    result = {}

    for model_folder in os.listdir(base_dir):
        model_path = os.path.join(base_dir, model_folder)
        if os.path.isdir(model_path):
            for round_folder in os.listdir(model_path):
                round_path = os.path.join(model_path, round_folder)
                if round_folder.startswith("round") and os.path.isdir(round_path):
                    exec_results_path = os.path.join(round_path, "exec_results", repo_java_name)
                    if os.path.exists(exec_results_path):
                        for file_name in os.listdir(exec_results_path):
                            if file_name.endswith("_.txt"):
                                txt_file = os.path.join(exec_results_path, file_name)
                                is_successful, is_compiled, passed_tests, total_tests = parse_test_results(txt_file)
                                column_key = (model_folder, round_folder)
                                if column_key not in result:
                                    result[column_key] = {"success_rate": 0, "pass_percentage": "0/0", "compiled": 0, "test_pass_rate": []}

                                # Update success rate, compilation rate, and pass percentage
                                result[column_key]["success_rate"] = 1 if is_successful and passed_tests == total_tests else 0
                                result[column_key]["pass_percentage"] = f"{passed_tests}/{total_tests}" if total_tests > 0 else "0/0"
                                result[column_key]["compiled"] = 1 if is_compiled else 0

                                # If compiled, record the test pass rate
                                if is_compiled and total_tests > 0:
                                    result[column_key]["test_pass_rate"].append(passed_tests / total_tests)
                                else:
                                    # If not compiled, the test pass rate is 0
                                    result[column_key]["test_pass_rate"].append(0)

    return result
```
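Reconstructed from the `os.path.join` calls above, the traversal expects a layout like the one sketched below (the model, round, and repo names are illustrative, not taken from the actual results archive):

```python
import os

# experiment_results/
# ├── GPT-4o/                      <- one folder per model
# │   ├── round_1/
# │   │   └── exec_results/
# │   │       └── PsfRequestsJava/
# │   │           └── output_.txt  <- only files ending in '_.txt' are parsed
# │   ├── round_2/
# │   └── round_3/
# └── ...

# The path the traversal joins together for one result file:
path = os.path.join("experiment_results", "GPT-4o", "round_1",
                    "exec_results", "PsfRequestsJava", "output_.txt")
print(path)
```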
```python
def calculate_model_summary(success_df, compiled_df, test_pass_rate_df):
    """
    Calculate summary information for each model: per-round pass counts, the number
    of repos that pass in at least one round, and averages over the three rounds.
    :param success_df: DataFrame for complete pass rate
    :param compiled_df: DataFrame for compilation pass rate
    :param test_pass_rate_df: DataFrame for test pass rate
    :return: DataFrame containing summary information
    """
    summary = {}

    # Get all models (level 0 of the (model, round) column MultiIndex)
    models = success_df.columns.get_level_values(0).unique()

    for model in models:
        model_summary = {}
        # Count the number of passes for each round (pass / compiled / test_pass_rate)
        for round_ in success_df[model].columns:
            model_summary[(round_, 'Executable')] = f"{success_df[model][round_].sum()}/100"
            model_summary[(round_, 'Compilable')] = f"{compiled_df[model][round_].sum()}/100"

            # Average test pass rate; the denominator is the 100 repos
            if test_pass_rate_df[model][round_].notna().sum() > 0:
                average_test_pass_rate = test_pass_rate_df[model][round_].sum() * 100 / 100
                model_summary[(round_, 'Average Test Pass Rate')] = f"{average_test_pass_rate:.1f}%"
            else:
                model_summary[(round_, 'Average Test Pass Rate')] = "N/A"

        # The average over the three rounds gives Pass@1
        model_summary[('Pass@1', 'Executable')] = f"{(success_df[model].mean(axis=1).mean() * 100):.2f}%"
        model_summary[('Pass@1', 'Compilable')] = f"{(compiled_df[model].mean(axis=1).mean() * 100):.2f}%"

        # Pass@2: take the union of each of the three possible round pairs and
        # average over the pairs (with 100 repos, the count equals a percentage)
        pass_at_2_success = (
            (success_df[model][['round_1', 'round_2']].sum(axis=1) > 0).sum() +
            (success_df[model][['round_1', 'round_3']].sum(axis=1) > 0).sum() +
            (success_df[model][['round_2', 'round_3']].sum(axis=1) > 0).sum()
        ) / 3
        model_summary[('Pass@2', 'Executable')] = f"{pass_at_2_success:.2f}%"
        pass_at_2_compiled = (
            (compiled_df[model][['round_1', 'round_2']].sum(axis=1) > 0).sum() +
            (compiled_df[model][['round_1', 'round_3']].sum(axis=1) > 0).sum() +
            (compiled_df[model][['round_2', 'round_3']].sum(axis=1) > 0).sum()
        ) / 3
        model_summary[('Pass@2', 'Compilable')] = f"{pass_at_2_compiled:.2f}%"

        # Pass@3: a repo counts if any of the three rounds passes
        model_summary[('Pass@3', 'Executable')] = f"{(success_df[model].sum(axis=1) > 0).sum()}%"
        model_summary[('Pass@3', 'Compilable')] = f"{(compiled_df[model].sum(axis=1) > 0).sum()}%"

        # Average test pass rate over the three rounds
        average_overall_test_pass_rate = test_pass_rate_df[model].sum().mean() * 100 / 100
        model_summary[('Average Test Pass Rate', 'Average Test Pass Rate')] = f"{average_overall_test_pass_rate:.1f}%" if not pd.isna(average_overall_test_pass_rate) else "N/A"

        summary[model] = model_summary

    # Create a DataFrame with multi-level column indexing
    summary_df = pd.DataFrame(summary).T
    summary_df.columns = pd.MultiIndex.from_tuples(summary_df.columns)

    return summary_df
```
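The Pass@2 computation above averages, over the three possible pairs of rounds, the number of repos that pass in at least one round of the pair. A tiny worked example (the per-repo outcomes are made up):

```python
# 1 = the repo passed in that round; four hypothetical repos
rounds = {
    "round_1": [1, 0, 0, 0],
    "round_2": [0, 1, 0, 0],
    "round_3": [0, 0, 0, 1],
}

def union_count(a, b):
    # Number of repos that pass in at least one of the two rounds
    return sum(1 for x, y in zip(a, b) if x or y)

pairs = [("round_1", "round_2"), ("round_1", "round_3"), ("round_2", "round_3")]
pass_at_2 = sum(union_count(rounds[a], rounds[b]) for a, b in pairs) / 3
print(pass_at_2)  # 2.0 — each pair covers exactly two distinct passing repos
```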
```python
def main():
    base_dir = '../../experiment_results'
    info_file = '../../repos/info_raw.jsonl'
    results = {}

    # Read the info_raw.jsonl file
    with open(info_file, 'r', encoding='utf-8') as f:
        for line in f:
            repo_info = json.loads(line)
            repo_java_name = get_repo_java_name(repo_info['repo_path'])
            results[repo_java_name] = process_execution_results(base_dir, repo_java_name)

    # Prepare results for writing to Excel
    success_data = {}
    percentage_data = {}
    compiled_data = {}
    test_pass_rate_data = {}

    for repo_name, result_dict in results.items():
        for (model, round_), result in result_dict.items():
            if (model, round_) not in success_data:
                success_data[(model, round_)] = {}
                percentage_data[(model, round_)] = {}
                compiled_data[(model, round_)] = {}
                test_pass_rate_data[(model, round_)] = {}
            success_data[(model, round_)][repo_name] = result["success_rate"]
            percentage_data[(model, round_)][repo_name] = result["pass_percentage"]
            compiled_data[(model, round_)][repo_name] = result["compiled"]

            # Record the test case pass rate; non-compilable repos count as 0
            if result["test_pass_rate"]:
                test_pass_rate_data[(model, round_)][repo_name] = sum(result["test_pass_rate"]) / len(result["test_pass_rate"])
            else:
                test_pass_rate_data[(model, round_)][repo_name] = 0

    # Create DataFrames with (model, round) MultiIndex columns
    success_df = pd.DataFrame(success_data).sort_index(axis=1)
    percentage_df = pd.DataFrame(percentage_data).sort_index(axis=1)
    compiled_df = pd.DataFrame(compiled_data).sort_index(axis=1)
    test_pass_rate_df = pd.DataFrame(test_pass_rate_data).sort_index(axis=1)

    # Calculate model summary information
    summary_df = calculate_model_summary(success_df, compiled_df, test_pass_rate_df)

    # Write to Excel
    with pd.ExcelWriter('RQ1_Results.xlsx') as writer:
        success_df.to_excel(writer, sheet_name='Success Rate')
        percentage_df.to_excel(writer, sheet_name='Pass Percentage')
        compiled_df.to_excel(writer, sheet_name='Compiled Rate')
        test_pass_rate_df.to_excel(writer, sheet_name='Test Pass Rate')
        summary_df.to_excel(writer, sheet_name='Model Summary')

    print("Analysis results have been written to RQ1_Results.xlsx")


if __name__ == "__main__":
    main()
```
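`main` builds its DataFrames from dicts keyed by `(model, round)` tuples; pandas turns the tuple keys into a two-level column `MultiIndex`, which is what lets `calculate_model_summary` index with `df[model][round_]` and `columns.get_level_values(0)`. A minimal illustration (the model and repo names are hypothetical):

```python
import pandas as pd

data = {
    ("GPT-4o", "round_1"): {"RepoA": 1, "RepoB": 0},
    ("GPT-4o", "round_2"): {"RepoA": 0, "RepoB": 1},
}
df = pd.DataFrame(data).sort_index(axis=1)

# Tuple keys become a (model, round) column MultiIndex
assert list(df.columns.get_level_values(0).unique()) == ["GPT-4o"]
assert df["GPT-4o"]["round_1"]["RepoA"] == 1
```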

RQ/RQ2/README.md (5 additions)

**`RQ2_gen_json.py`:** Run this script to obtain the statistics in JSON format.

**`RQ2_calc.py`:** Run this script to obtain the statistics results.

**`RQ2_figure.py`:** Run this script to draw a heat map of the RQ2 results.

RQ/RQ2/RQ2.pdf (37.1 KB, binary file not shown)

RQ/RQ2/RQ2.xlsx (8.72 KB, binary file not shown)
