ns3 simulation ends with an "Illegal instruction" error #125

Open
@alienpj

While following the SimAI simulation usage tutorial, I ran this command:
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 8 -psn 1
The topology file was generated successfully:
asw_switch_num: 8
psw_switch_num: 1
Creating Topology of totally 1 segment(s), totally 1 pod(s).
Spectrum-X_8g_8gps_400Gbps_H100

However, when I ran the simai-ns3 simulation with this command:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./Spectrum-X_8g_8gps_400Gbps_H100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf

a segmentation fault (core dumped) occurred:
ps_H100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
maxRtt=150 maxBdp=54000
Segmentation fault (core dumped)

I tried rebuilding the ns3 environment with sudo:
sudo ./scripts/build.sh -c ns3
but the problem persisted.

Then I tried running the simai-ns3 simulation with sudo:
sudo AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./Spectrum-X_8g_8gps_400Gbps_H100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
This produced a different problem: the simulation ended abnormally with "Illegal instruction":
maxRtt=150 maxBdp=54000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
ring of node 0, id: 0 dimension: local total nodes in ring: 9 index in ring: 0 offset: 1total nodes in ring: 9
total nodes: 9
Success in opening workload file
model_parallel_NPU_group is 8
checkpoints layers are:
layers initiating fwd_in_bckwd are:
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 1 index in ring: 0 offset: 8total nodes in ring: 1
pp_commize:0
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 2 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_9.csv
CSV path and filename: ./ncclFlowModel_EndToEnd.csv
simulator run
chunk size is: 16777216 , size is: 16777216 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Illegal instruction

The output files "ncclFlowModel_detailed.csv" and "ncclFlowModel_EndToEnd.csv" are empty (0 bytes).
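
To tell the two failures apart without relying on the shell's message, the simulator can be launched from a small wrapper that reports which signal ended the process and which output files came out empty. This is a minimal sketch, not part of SimAI: `describe_exit`, `run_and_describe`, and `empty_outputs` are hypothetical helper names, and it assumes a POSIX system, where Python's `subprocess` reports death by signal as a negative return code.

```python
import os
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by " + name
    return "exited with status %d" % returncode


def run_and_describe(cmd, extra_env=None):
    """Run cmd with extra environment variables and describe how it ended."""
    env = dict(os.environ, **(extra_env or {}))
    proc = subprocess.run(cmd, env=env)
    return describe_exit(proc.returncode)


def empty_outputs(paths):
    """Return the paths that are missing or zero bytes long."""
    return [p for p in paths
            if not os.path.exists(p) or os.path.getsize(p) == 0]


if __name__ == "__main__":
    # Example call, using the command line from this report:
    # run_and_describe(
    #     ["./bin/SimAI_simulator", "-t", "16",
    #      "-w", "./example/microAllReduce.txt",
    #      "-n", "./Spectrum-X_8g_8gps_400Gbps_H100",
    #      "-c", "astra-sim-alibabacloud/inputs/config/SimAI.conf"],
    #     {"AS_SEND_LAT": "3", "AS_NVLS_ENABLE": "1"},
    # )
    print(describe_exit(-signal.SIGILL))
    print(empty_outputs(["./ncclFlowModel_EndToEnd.csv"]))
```

A segmentation fault would show up as `killed by SIGSEGV` and the "Illegal instruction" case as `killed by SIGILL`, which makes it easier to confirm that the two runs really fail in different ways.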

Is this an environment problem related to the Docker version?
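
One way to probe the environment question: "Illegal instruction" is what the shell prints when a process receives SIGILL, and a common cause is a binary that uses CPU instruction-set extensions the current machine does not provide (for example, a build made on a different host). Whether that applies here is unverified. The sketch below only lists which feature flags from a chosen set are absent on a Linux host; `parse_cpu_flags` and `missing_flags` are hypothetical helpers, and the flag names checked (`avx`, `avx2`, `avx512f`) are an illustrative assumption, not a known requirement of SimAI or ns3.

```python
def parse_cpu_flags(cpuinfo_text):
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if sep and key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return flags


def missing_flags(cpuinfo_text, wanted):
    """Return the wanted flags that the CPU does not advertise, sorted."""
    have = parse_cpu_flags(cpuinfo_text)
    return sorted(f for f in wanted if f not in have)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    # Assumed flag set for illustration only.
    print(missing_flags(text, ["avx", "avx2", "avx512f"]))
```

Comparing this output between the machine where the binary was built and the machine (or container host) where it runs would show whether the two CPUs differ in the features checked.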
