
Communication Nodes Incorrectly Marked as COMP_NODE #172

Open
lurw2000 opened this issue Dec 6, 2024 · 0 comments

lurw2000 commented Dec 6, 2024

I have followed the user guide and converted a PyTorch ET + Kineto trace into a Chakra ET. However, communication nodes appear to be incorrectly marked as computation nodes. For example:

{
  "id": "20",
  "name": "nccl:broadcast",
  "type": "COMP_NODE",
  "ctrlDeps": [
    "19"
  ],
  "dataDeps": [
    "19"
  ],
  "inputs": {
    "values": "[[21, 22, 0, 5, 8, 'cuda:7']]",
    "shapes": "[[5]]",
    "types": "['Tensor(long int)']"
  },
  "outputs": {
    "values": "[]",
    "shapes": "[]",
    "types": "[]"
  },
  "attr": [
    {
      "name": "rf_id",
      "int64Val": "19"
    },
    {
      "name": "fw_parent",
      "int64Val": "0"
    },
    {
      "name": "seq_id",
      "int64Val": "-1"
    },
    {
      "name": "scope",
      "int64Val": "7"
    },
    {
      "name": "tid",
      "int64Val": "1"
    },
    {
      "name": "fw_tid",
      "int64Val": "0"
    },
    {
      "name": "op_schema",
      "stringVal": ""
    },
    {
      "name": "is_cpu_op",
      "boolVal": true
    },
    {
      "name": "stream",
      "int64Val": "0"
    }
  ]
}

At https://github.com/mlcommons/chakra/blob/main/src/converter/pytorch_converter.py#L341

if json_node.is_gpu_op():
    if "ncclDevKernel_SendRecv" in json_node.name:
        parent_node = json_node_map[json_node.parent]
        keyword = (
            json_node_map[parent_node.parent].name
            if parent_node.name == "record_param_comms"
            else parent_node.name
        )
        if "send" in keyword:
            return COMM_SEND_NODE
        if "recv" in keyword:
            return COMM_RECV_NODE
    if "ncclKernel" in json_node.name or "ncclDevKernel" in json_node.name:
        return COMM_COLL_NODE
    return COMP_NODE

It seems that a node must first be identified as a GPU node before it can be classified as a communication node. However, the logic of json_node.is_gpu_op() is confusing.
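To make the fall-through concrete, here is a minimal, self-contained sketch (FakeNode and node_type are stand-ins I wrote for illustration, not the real converter classes) showing that when is_gpu_op() returns False, the nccl-name checks are never reached and the node is classified as COMP_NODE:

```python
# Stand-in node-type constants (the real converter uses protobuf enums).
COMP_NODE, COMM_COLL_NODE = "COMP_NODE", "COMM_COLL_NODE"


class FakeNode:
    """Illustrative stand-in for PyTorchNode with only the fields used here."""

    def __init__(self, name, cat=None):
        self.name = name
        self.cat = cat

    def is_gpu_op(self):
        # Same test as in pytorch_node.py: GPU-ness is inferred solely
        # from the presence of a "cat" attribute.
        return self.cat is not None


def node_type(node):
    # Simplified version of the dispatch in pytorch_converter.py:
    # the nccl-name checks are only reached for GPU ops.
    if node.is_gpu_op():
        if "ncclKernel" in node.name or "ncclDevKernel" in node.name:
            return COMM_COLL_NODE
    return COMP_NODE


# If "cat" was dropped during linking, even an nccl op falls through:
print(node_type(FakeNode("nccl:broadcast")))                         # COMP_NODE
print(node_type(FakeNode("ncclDevKernel_Broadcast", cat="kernel")))  # COMM_COLL_NODE
```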
At https://github.com/mlcommons/chakra/blob/main/src/converter/pytorch_node.py#L149

def is_gpu_op(self) -> bool:
    """
    Check if the node is a GPU operator.
    
    Returns
        bool: True if the node is a GPU operator, False otherwise.
    """
    return self.cat is not None

However, it seems that the "cat" attribute is dropped when the PyTorch ET and Kineto trace are linked, so none of the nodes are ever identified as GPU nodes, and consequently every node, including the communication ones, is marked as COMP_NODE.

I am not sure which part deviates from the expected behavior, but the logic of json_node.is_gpu_op() seems fragile to me.
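As a possible direction (just a sketch, not a tested patch; the function name and prefix list are my assumptions), classifying communication ops by operator name rather than relying on the "cat" attribute surviving the linking step would avoid this failure mode:

```python
# Prefixes seen on nccl operators in the traces above; this list is an
# assumption based on the example node names, not an exhaustive set.
COMM_NAME_PREFIXES = ("nccl:", "ncclKernel", "ncclDevKernel")


def looks_like_comm_op(name: str) -> bool:
    """Name-based heuristic: treat nccl-prefixed operators as communication
    even when the "cat" attribute was dropped during trace linking."""
    return name.startswith(COMM_NAME_PREFIXES)


print(looks_like_comm_op("nccl:broadcast"))  # True even without "cat"
print(looks_like_comm_op("aten::mm"))        # False for compute ops
```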
