Description
While following the simai-simulation usage tutorial, I ran this command:
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 8 -psn 1
The topology file was generated successfully:
asw_switch_num: 8
psw_switch_num: 1
Creating Topology of totally 1 segment(s), totally 1 pod(s).
Spectrum-X_8g_8gps_400Gbps_H100
However, when I ran the simai-ns3 simulation with this command:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./Spectrum-X_8g_8gps_400Gbps_H100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
a segmentation fault (core dumped) occurred:
maxRtt=150 maxBdp=54000
Segmentation fault (core dumped)
I tried recompiling the ns3 environment with sudo:
sudo ./scripts/build.sh -c ns3
The problem still occurred.
I then tried running the simai-ns3 simulation with sudo:
sudo AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./Spectrum-X_8g_8gps_400Gbps_H100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
This produced a different problem, with the simulation ending abnormally:
maxRtt=150 maxBdp=54000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
total nodes: 9
Success in opening workload file
model_parallel_NPU_group is 8
checkpoints layers are:
layers initiating fwd_in_bckwd are:
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
pp_commize:0
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 2 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_9.csv
CSV path and filename: ./ncclFlowModel_EndToEnd.csv
simulator run
chunk size is: 16777216 , size is: 16777216 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Illegal instruction
The output files "ncclFlowModel_detailed.csv" and "ncclFlowModel_EndToEnd.csv" are empty (0 bytes).
Is this an environment problem related to the Docker version?
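
In case it helps with triage, here is a minimal Python sketch (not part of SimAI) that gathers basic environment details and the sizes of the output files. The function name collect_env_report is hypothetical, the /.dockerenv check is only a common heuristic for detecting a Docker container, and the file paths are the ones printed in the log above; adjust them if your run writes elsewhere.

```python
import os
import platform


def collect_env_report(output_files):
    """Return basic environment facts plus the size of each output file."""
    report = {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Heuristic only: /.dockerenv usually exists inside Docker containers.
        "in_docker": os.path.exists("/.dockerenv"),
        "file_sizes": {},
    }
    for path in output_files:
        # None means the file is missing; 0 means it exists but is empty.
        if os.path.exists(path):
            report["file_sizes"][path] = os.path.getsize(path)
        else:
            report["file_sizes"][path] = None
    return report


if __name__ == "__main__":
    # Paths as printed in the simulator log; treat them as assumptions.
    files = ["./ncclFlowModel_detailed_9.csv", "./ncclFlowModel_EndToEnd.csv"]
    for key, value in collect_env_report(files).items():
        print(f"{key}: {value}")
```

Running it from the directory where the simulator was launched prints one line per field, which can be pasted into the issue alongside the Docker and compiler versions.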