Dear Authors,
I hope you are doing well.
I am currently trying to reproduce the reported results for Vlaser-8B. So far, my EB-Habitat result is relatively close to the reported number, but my EB-ALFRED performance is much lower than expected.
For reference, the reported and reproduced results are:
Reported
- EB-ALFRED average success rate: 0.50
- EB-Habitat success rate: 0.40
Reproduced (mine)
- EB-ALFRED average success rate: 0.10
- EB-Habitat success rate: 0.42
My setup is as follows:
- Cluster: SLURM + Singularity (H100)
- Model: OpenGVLab/Vlaser-8B
- Code: official repository, with only launcher/runtime adaptations for SLURM/Singularity (mainly display, environment, and path handling)
For EB-ALFRED, the detailed task_success results are:
- base: 0.16
- common_sense: 0.14
- complex_instruction: 0.12
- visual_appearance: 0.10
- spatial: 0.08
- long_horizon: 0.00
- mean over the 6 subsets: 0.10
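For clarity, the 0.10 average above is the unweighted mean of the six subset success rates (a minimal sketch; the dictionary below just restates the numbers listed above):

```python
# Unweighted mean over the six EB-ALFRED subsets reported above
subset_success = {
    "base": 0.16,
    "common_sense": 0.14,
    "complex_instruction": 0.12,
    "visual_appearance": 0.10,
    "spatial": 0.08,
    "long_horizon": 0.00,
}
mean_success = sum(subset_success.values()) / len(subset_success)
print(round(mean_success, 2))  # 0.1
```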
Other aggregate statistics (mean over 6 subsets) are:
- task_progress: 0.1678
- num_invalid_actions: 8.68
- planner_output_error: 0.46
Would you be able to share the evaluation setup used for the published Vlaser-8B EB-ALFRED result, so I can pinpoint where my configuration diverges?
Thank you!