First PSI execution after deployment fails #189

Closed
bzzbzz7 opened this issue Jan 3, 2025 · 21 comments

Comments

@bzzbzz7

bzzbzz7 commented Jan 3, 2025

Issue Type

Running

Have you searched for existing documents and issues?

Yes

OS Platform and Distribution

centos 8

All_in_one Version

v1.11.0

Module type

secretpad

Module version

secretpad:0.12.0b0

What happened and what you expected to happen.

The first PSI execution failed with a create container error.
Deployed in P2P mode; alice initiated the project, and alice's pod creation failed.

Log output.

bash-5.2# kubectl get kj -A
NAMESPACE      NAME   STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
bob            wgjf                                                    
cross-domain   wgjf   4m54s                        4m53s               Running
bash-5.2# 
bash-5.2# 
bash-5.2# kubectl -n cross-domain get kt
NAME                   STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
wgjf-zispqgwa-node-3   5m16s                        2m18s               Pending
bash-5.2# 
bash-5.2# kubectl -n cross-domain describe kt wgjf-zispqgwa-node-3
Name:         wgjf-zispqgwa-node-3
Namespace:    cross-domain
Labels:       kuscia.secretflow/controller=kuscia-job
              kuscia.secretflow/job-uid=383d2f6a-cbed-4679-85f7-d5bc5e580026
Annotations:  kuscia.secretflow/initiator: alice
              kuscia.secretflow/interconn-bfia-parties: 
              kuscia.secretflow/interconn-kuscia-parties: bob
              kuscia.secretflow/interconn-self-parties: alice
              kuscia.secretflow/job-id: wgjf
              kuscia.secretflow/self-cluster-as-initiator: true
              kuscia.secretflow/self-cluster-as-participant: true
              kuscia.secretflow/task-alias: wgjf-zispqgwa-node-3
API Version:  kuscia.secretflow/v1alpha1
Kind:         KusciaTask
Metadata:
  Creation Timestamp:  2025-01-03T02:30:13Z
  Generation:          1
  Managed Fields:
    API Version:  kuscia.secretflow/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kuscia.secretflow/initiator:
          f:kuscia.secretflow/interconn-bfia-parties:
          f:kuscia.secretflow/interconn-kuscia-parties:
          f:kuscia.secretflow/interconn-self-parties:
          f:kuscia.secretflow/job-id:
          f:kuscia.secretflow/self-cluster-as-initiator:
          f:kuscia.secretflow/self-cluster-as-participant:
          f:kuscia.secretflow/task-alias:
        f:labels:
          .:
          f:kuscia.secretflow/controller:
          f:kuscia.secretflow/job-uid:
        f:ownerReferences:
          .:
          k:{"uid":"383d2f6a-cbed-4679-85f7-d5bc5e580026"}:
      f:spec:
        .:
        f:initiator:
        f:parties:
        f:scheduleConfig:
        f:taskInputConfig:
    Manager:      kuscia
    Operation:    Update
    Time:         2025-01-03T02:30:13Z
    API Version:  kuscia.secretflow/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:allocatedPorts:
        f:conditions:
        f:lastReconcileTime:
        f:partyTaskStatus:
        f:phase:
        f:podStatuses:
          .:
          f:alice/wgjf-zispqgwa-node-3-0:
            .:
            f:createTime:
            f:message:
            f:namespace:
            f:nodeName:
            f:podName:
            f:podPhase:
            f:reason:
            f:startTime:
        f:serviceStatuses:
          .:
          f:alice/wgjf-zispqgwa-node-3-0-fed:
            .:
            f:createTime:
            f:namespace:
            f:portName:
            f:portNumber:
            f:readyTime:
            f:scope:
            f:serviceName:
          f:alice/wgjf-zispqgwa-node-3-0-global:
            .:
            f:createTime:
            f:namespace:
            f:portName:
            f:portNumber:
            f:readyTime:
            f:scope:
            f:serviceName:
          f:alice/wgjf-zispqgwa-node-3-0-inference:
            .:
            f:createTime:
            f:namespace:
            f:portName:
            f:portNumber:
            f:readyTime:
            f:scope:
            f:serviceName:
          f:alice/wgjf-zispqgwa-node-3-0-spu:
            .:
            f:createTime:
            f:namespace:
            f:portName:
            f:portNumber:
            f:readyTime:
            f:scope:
            f:serviceName:
        f:startTime:
    Manager:      kuscia
    Operation:    Update
    Subresource:  status
    Time:         2025-01-03T02:33:11Z
  Owner References:
    API Version:           kuscia.secretflow/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  KusciaJob
    Name:                  wgjf
    UID:                   383d2f6a-cbed-4679-85f7-d5bc5e580026
  Resource Version:        4375
  UID:                     7bd5c437-eeaf-48c0-b99b-4687af6efe28
Spec:
  Initiator:  alice
  Parties:
    App Image Ref:  secretflow-image
    Domain ID:      bob
    Template:
      Spec:
    App Image Ref:  secretflow-image
    Domain ID:      alice
    Template:
      Spec:
  Schedule Config:
  Task Input Config:  {
  "sf_datasource_config": {
    "bob": {
      "id": "default-data-source"
    },
    "alice": {
      "id": "default-data-source"
    }
  },
  "sf_cluster_desc": {
    "parties": ["bob", "alice"],
    "devices": [{
      "name": "spu",
      "type": "spu",
      "parties": ["bob", "alice"],
      "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
    }, {
      "name": "heu",
      "type": "heu",
      "parties": ["bob", "alice"],
      "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
    }],
    "ray_fed_config": {
      "cross_silo_comm_backend": "brpc_link"
    }
  },
  "sf_node_eval_param": {
    "domain": "data_prep",
    "name": "psi",
    "version": "1.0.0",
    "attr_paths": ["input/input_ds1/keys", "input/input_ds2/keys", "protocol", "sort_result", "receiver_parties", "allow_empty_result", "join_type", "input_ds1_keys_duplicated", "input_ds2_keys_duplicated"],
    "attrs": [{
      "is_na": false,
      "ss": ["id1"]
    }, {
      "is_na": false,
      "ss": ["id2"]
    }, {
      "is_na": false,
      "s": "PROTOCOL_RR22"
    }, {
      "b": true,
      "is_na": false
    }, {
      "is_na": false,
      "ss": ["alice", "bob"]
    }, {
      "is_na": true
    }, {
      "is_na": false,
      "s": "inner_join"
    }, {
      "b": true,
      "is_na": false
    }, {
      "b": true,
      "is_na": false
    }],
    "inputs": [{
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "line_count": "-1"
      },
      "data_refs": [{
        "uri": "alice_1462036507.csv",
        "party": "alice",
        "format": "csv"
      }]
    }, {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "line_count": "-1"
      },
      "data_refs": [{
        "uri": "bob_1005531837.csv",
        "party": "bob",
        "format": "csv"
      }]
    }],
    "checkpoint_uri": "ckwgjf-zispqgwa-node-3-output-0"
  },
  "sf_output_uris": ["wgjf_zispqgwa_node_3_output_0", "wgjf_zispqgwa_node_3_output_1"],
  "sf_input_ids": ["waeprdtm", "tcjoqqyl"],
  "sf_input_partitions_spec": ["", ""],
  "sf_output_ids": ["wgjf-zispqgwa-node-3-output-0", "wgjf-zispqgwa-node-3-output-1"],
  "table_attrs": [{
    "table_id": "waeprdtm",
    "column_attrs": [{
      "col_name": "id1",
      "col_type": "feature"
    }, {
      "col_name": "age",
      "col_type": "feature"
    }, {
      "col_name": "education",
      "col_type": "feature"
    }, {
      "col_name": "default",
      "col_type": "feature"
    }, {
      "col_name": "balance",
      "col_type": "feature"
    }, {
      "col_name": "housing",
      "col_type": "feature"
    }, {
      "col_name": "loan",
      "col_type": "feature"
    }, {
      "col_name": "day",
      "col_type": "feature"
    }, {
      "col_name": "duration",
      "col_type": "feature"
    }, {
      "col_name": "campaign",
      "col_type": "feature"
    }, {
      "col_name": "pdays",
      "col_type": "feature"
    }, {
      "col_name": "previous",
      "col_type": "feature"
    }, {
      "col_name": "job_blue-collar",
      "col_type": "feature"
    }, {
      "col_name": "job_entrepreneur",
      "col_type": "feature"
    }, {
      "col_name": "job_housemaid",
      "col_type": "feature"
    }, {
      "col_name": "job_management",
      "col_type": "feature"
    }, {
      "col_name": "job_retired",
      "col_type": "feature"
    }, {
      "col_name": "job_self-employed",
      "col_type": "feature"
    }, {
      "col_name": "job_services",
      "col_type": "feature"
    }, {
      "col_name": "job_student",
      "col_type": "feature"
    }, {
      "col_name": "job_technician",
      "col_type": "feature"
    }, {
      "col_name": "job_unemployed",
      "col_type": "feature"
    }, {
      "col_name": "marital_divorced",
      "col_type": "feature"
    }, {
      "col_name": "marital_married",
      "col_type": "feature"
    }, {
      "col_name": "marital_single",
      "col_type": "feature"
    }]
  }, {
    "table_id": "tcjoqqyl",
    "column_attrs": [{
      "col_name": "id2",
      "col_type": "feature"
    }, {
      "col_name": "contact_cellular",
      "col_type": "feature"
    }, {
      "col_name": "contact_telephone",
      "col_type": "feature"
    }, {
      "col_name": "contact_unknown",
      "col_type": "feature"
    }, {
      "col_name": "month_apr",
      "col_type": "feature"
    }, {
      "col_name": "month_aug",
      "col_type": "feature"
    }, {
      "col_name": "month_dec",
      "col_type": "feature"
    }, {
      "col_name": "month_feb",
      "col_type": "feature"
    }, {
      "col_name": "month_jan",
      "col_type": "feature"
    }, {
      "col_name": "month_jul",
      "col_type": "feature"
    }, {
      "col_name": "month_jun",
      "col_type": "feature"
    }, {
      "col_name": "month_mar",
      "col_type": "feature"
    }, {
      "col_name": "month_may",
      "col_type": "feature"
    }, {
      "col_name": "month_nov",
      "col_type": "feature"
    }, {
      "col_name": "month_oct",
      "col_type": "feature"
    }, {
      "col_name": "month_sep",
      "col_type": "feature"
    }, {
      "col_name": "poutcome_failure",
      "col_type": "feature"
    }, {
      "col_name": "poutcome_other",
      "col_type": "feature"
    }, {
      "col_name": "poutcome_success",
      "col_type": "feature"
    }, {
      "col_name": "poutcome_unknown",
      "col_type": "feature"
    }, {
      "col_name": "y",
      "col_type": "feature"
    }]
  }]
}
Status:
  Allocated Ports:
    Domain ID:  alice
    Named Port:
      wgjf-zispqgwa-node-3-0/client-server:   23831
      wgjf-zispqgwa-node-3-0/fed:             23834
      wgjf-zispqgwa-node-3-0/global:          23835
      wgjf-zispqgwa-node-3-0/inference:       23832
      wgjf-zispqgwa-node-3-0/node-manager:    23836
      wgjf-zispqgwa-node-3-0/object-manager:  23830
      wgjf-zispqgwa-node-3-0/spu:             23833
  Conditions:
    Last Transition Time:  2025-01-03T02:30:13Z
    Status:                True
    Type:                  PortsAllocated
    Last Transition Time:  2025-01-03T02:30:14Z
    Status:                True
    Type:                  ResourceCreated
  Last Reconcile Time:     2025-01-03T02:33:11Z
  Party Task Status:
    Domain ID:  bob
    Phase:      Pending
    Domain ID:  alice
    Phase:      Pending
  Phase:        Pending
  Pod Statuses:
    alice/wgjf-zispqgwa-node-3-0:
      Create Time:  2025-01-03T02:30:14Z
      Message:      container[secretflow] waiting state reason: "CreateContainerError", message: "context deadline exceeded"
      Namespace:    alice
      Node Name:    root-kuscia-autonomy-alice-localhost-localdomain
      Pod Name:     wgjf-zispqgwa-node-3-0
      Pod Phase:    Pending
      Reason:       CreateContainerError
      Start Time:   2025-01-03T02:30:15Z
  Service Statuses:
    alice/wgjf-zispqgwa-node-3-0-fed:
      Create Time:   2025-01-03T02:30:14Z
      Namespace:     alice
      Port Name:     fed
      Port Number:   23834
      Ready Time:    2025-01-03T02:32:21Z
      Scope:         Cluster
      Service Name:  wgjf-zispqgwa-node-3-0-fed
    alice/wgjf-zispqgwa-node-3-0-global:
      Create Time:   2025-01-03T02:30:14Z
      Namespace:     alice
      Port Name:     global
      Port Number:   23835
      Ready Time:    2025-01-03T02:32:21Z
      Scope:         Domain
      Service Name:  wgjf-zispqgwa-node-3-0-global
    alice/wgjf-zispqgwa-node-3-0-inference:
      Create Time:   2025-01-03T02:30:14Z
      Namespace:     alice
      Port Name:     inference
      Port Number:   23832
      Ready Time:    2025-01-03T02:32:21Z
      Scope:         Cluster
      Service Name:  wgjf-zispqgwa-node-3-0-inference
    alice/wgjf-zispqgwa-node-3-0-spu:
      Create Time:   2025-01-03T02:30:14Z
      Namespace:     alice
      Port Name:     spu
      Port Number:   23833
      Ready Time:    2025-01-03T02:32:21Z
      Scope:         Cluster
      Service Name:  wgjf-zispqgwa-node-3-0-spu
  Start Time:        2025-01-03T02:30:13Z
Events:              <none>
bash-5.2# 
bash-5.2# 
bash-5.2# kubectl -n alice get po
NAME                               READY   STATUS                 RESTARTS   AGE
dataproxy-alice-6d8899569d-g9t5k   1/1     Running                0          37m
wgjf-zispqgwa-node-3-0             0/1     CreateContainerError   0          6m14s
bash-5.2# 
bash-5.2# 
bash-5.2# kubectl -n alice describe po wgjf-zispqgwa-node-3-0
Name:             wgjf-zispqgwa-node-3-0
Namespace:        alice
Priority:         0
Service Account:  default
Node:             root-kuscia-autonomy-alice-localhost-localdomain/172.20.0.2
Start Time:       Fri, 03 Jan 2025 10:30:15 +0800
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/pod-identity=7bd5c437-eeaf-48c0-b99b-4687af6efe28-0
                  kuscia.secretflow/pod-role=
                  kuscia.secretflow/task-resource-group-uid=39675f44-5793-4494-bf64-277ea6ab7fea
                  kuscia.secretflow/task-resource-uid=cd1e6f2a-05d3-486e-806f-c498fdab023c
                  kuscia.secretflow/task-uid=7bd5c437-eeaf-48c0-b99b-4687af6efe28
Annotations:      kuscia.secretflow/config-template-value-cm-name: wgjf-zispqgwa-node-3-kuscia-gen-conf
                  kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/initiator: alice
                  kuscia.secretflow/task-id: wgjf-zispqgwa-node-3
                  kuscia.secretflow/task-resource: wgjf-zispqgwa-node-3-d0d2eb4ad63a
                  kuscia.secretflow/task-resource-group: wgjf-zispqgwa-node-3
                  kuscia.secretflow/taskresource-reserving-timestamp: 2025-01-03T10:30:15+08:00
Status:           Pending
IP:               10.88.0.3
IPs:
  IP:  10.88.0.3
Containers:
  secretflow:
    Container ID:  
    Image:         secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    Image ID:      
    Ports:         23833/TCP, 23834/TCP, 23835/TCP, 23836/TCP, 23830/TCP, 23831/TCP, 23832/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      sh
    Args:
      -c
      python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Environment:
      KUSCIA_PORT_GLOBAL_NUMBER:          23835
      KUSCIA_PORT_NODE_MANAGER_NUMBER:    23836
      KUSCIA_PORT_OBJECT_MANAGER_NUMBER:  23830
      KUSCIA_PORT_CLIENT_SERVER_NUMBER:   23831
      KUSCIA_PORT_INFERENCE_NUMBER:       23832
      KUSCIA_PORT_SPU_NUMBER:             23833
      KUSCIA_PORT_FED_NUMBER:             23834
    Mounts:
      ./kuscia/task-config.conf from config-template (rw,path="task-config.conf")
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-template:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        wgjf-zispqgwa-node-3-configtemplate
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=alice
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From              Message
  ----     ------             ----                   ----              -------
  Warning  FailedScheduling   6m26s                  kuscia-scheduler  0/1 nodes are available: waiting for task resource. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling., can not find related task resource.
  Normal   Scheduled          6m24s                  kuscia-scheduler  Successfully assigned alice/wgjf-zispqgwa-node-3-0 to root-kuscia-autonomy-alice-localhost-localdomain
  Warning  Failed             4m20s                  Agent             Error: context deadline exceeded
  Warning  MissingClusterDNS  4m19s (x3 over 6m25s)  Agent             pod: "wgjf-zispqgwa-node-3-0_alice(8eec2f78-1cb8-449c-bc01-4a48fa44be67)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
  Normal   Pulled             4m19s (x2 over 6m20s)  Agent             Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1" already present on machine
  Warning  Failed             4m17s                  Agent             Error: failed to reserve container name "secretflow_wgjf-zispqgwa-node-3-0_alice_8eec2f78-1cb8-449c-bc01-4a48fa44be67_0": name "secretflow_wgjf-zispqgwa-node-3-0_alice_8eec2f78-1cb8-449c-bc01-4a48fa44be67_0" is reserved for "4dc0b5734c4effa344e54c539a7f3b4a6be24013da99f6c8d9e144be44fdeef1"
bash-5.2#
@wangzul

wangzul commented Jan 3, 2025

From the logs, the job in question is wgjf.

Run kubectl get pod -a to check the pods related to wgjf, then provide the output of kubectl get pod -n xx {name} -oyaml.

@wangzul

wangzul commented Jan 6, 2025

> From the logs, the job in question is wgjf.
>
> Run kubectl get pod -a to check the pods related to wgjf, then provide the output of kubectl get pod -n xx {name} -oyaml.

Run kubectl get pod -a to see whether any pod exists; if one does, run kubectl get pod -n xx {name} -oyaml to inspect it.
If there is no pod, check the job itself with kubectl get kj -n xx {name} -oyaml.
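For reference, a minimal sketch of these checks, filling in the wgjf job from the logs above (the alice namespace and the exact pod name are assumptions based on the earlier kubectl output):

  # list pods in all namespaces and look for the wgjf task pod
  kubectl get pod -A
  # dump the failing pod as YAML (namespace and pod name assumed from the output above)
  kubectl get pod -n alice wgjf-zispqgwa-node-3-0 -oyaml
  # if no pod exists, inspect the KusciaJob itself
  kubectl get kj -n cross-domain wgjf -oyaml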

@bzzbzz7
Author

bzzbzz7 commented Jan 6, 2025

bash-5.2# kubectl get po -A
NAMESPACE   NAME                               READY   STATUS                 RESTARTS   AGE
alice       dataproxy-alice-7656f68568-45wqh   1/1     Running                0          5h34m
alice       vhau-rjcgwwqd-node-3-0             0/1     CreateContainerError   0          4m4s
bash-5.2# 
bash-5.2# 
bash-5.2# kubectl -n alice get po -oyaml vhau-rjcgwwqd-node-3-0
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kuscia.secretflow/config-template-value-cm-name: vhau-rjcgwwqd-node-3-kuscia-gen-conf
    kuscia.secretflow/config-template-volumes: config-template
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/task-id: vhau-rjcgwwqd-node-3
    kuscia.secretflow/task-resource: vhau-rjcgwwqd-node-3-c818a8c67546
    kuscia.secretflow/task-resource-group: vhau-rjcgwwqd-node-3
    kuscia.secretflow/taskresource-reserving-timestamp: "2025-01-06T15:41:36+08:00"
  creationTimestamp: "2025-01-06T07:41:34Z"
  labels:
    kuscia.secretflow/communication-role-client: "true"
    kuscia.secretflow/communication-role-server: "true"
    kuscia.secretflow/controller: kusciatask
    kuscia.secretflow/pod-identity: 48c00ab4-b3da-46d5-a354-87cdb022a4f9-0
    kuscia.secretflow/pod-role: ""
    kuscia.secretflow/task-resource-group-uid: 0f83d772-4550-4e42-a90f-4e97305e47ba
    kuscia.secretflow/task-resource-uid: 00eca8d7-66e8-48c6-81f7-8b4a98fae2a4
    kuscia.secretflow/task-uid: 48c00ab4-b3da-46d5-a354-87cdb022a4f9
  name: vhau-rjcgwwqd-node-3-0
  namespace: alice
  resourceVersion: "32140"
  uid: 9974d265-8e53-4497-9ffb-fe35f364aafb
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - -c
    - python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    command:
    - sh
    env:
    - name: KUSCIA_PORT_FED_NUMBER
      value: "26557"
    - name: KUSCIA_PORT_GLOBAL_NUMBER
      value: "26558"
    - name: KUSCIA_PORT_NODE_MANAGER_NUMBER
      value: "26552"
    - name: KUSCIA_PORT_OBJECT_MANAGER_NUMBER
      value: "26553"
    - name: KUSCIA_PORT_CLIENT_SERVER_NUMBER
      value: "26554"
    - name: KUSCIA_PORT_INFERENCE_NUMBER
      value: "26555"
    - name: KUSCIA_PORT_SPU_NUMBER
      value: "26556"
    image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    imagePullPolicy: IfNotPresent
    name: secretflow
    ports:
    - containerPort: 26556
      name: spu
      protocol: TCP
    - containerPort: 26557
      name: fed
      protocol: TCP
    - containerPort: 26558
      name: global
      protocol: TCP
    - containerPort: 26552
      name: node-manager
      protocol: TCP
    - containerPort: 26553
      name: object-manager
      protocol: TCP
    - containerPort: 26554
      name: client-server
      protocol: TCP
    - containerPort: 26555
      name: inference
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: ./kuscia/task-config.conf
      name: config-template
      subPath: task-config.conf
    workingDir: /app
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: root-kuscia-autonomy-alice-localhost-localdomain
  nodeSelector:
    kuscia.secretflow/namespace: alice
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: kuscia-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: kuscia.secretflow/agent
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: vhau-rjcgwwqd-node-3-configtemplate
    name: config-template
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    message: 'containers with unready status: [secretflow]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    message: 'containers with unready status: [secretflow]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    imageID: ""
    lastState: {}
    name: secretflow
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: context deadline exceeded
        reason: CreateContainerError
  hostIP: 172.23.0.2
  phase: Pending
  podIP: 10.88.0.3
  podIPs:
  - ip: 10.88.0.3
  startTime: "2025-01-06T07:41:36Z"
bash-5.2# 
bash-5.2# kubectl -n alice describe po vhau-rjcgwwqd-node-3-0
Name:             vhau-rjcgwwqd-node-3-0
Namespace:        alice
Priority:         0
Service Account:  default
Node:             root-kuscia-autonomy-alice-localhost-localdomain/172.23.0.2
Start Time:       Mon, 06 Jan 2025 15:41:36 +0800
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/pod-identity=48c00ab4-b3da-46d5-a354-87cdb022a4f9-0
                  kuscia.secretflow/pod-role=
                  kuscia.secretflow/task-resource-group-uid=0f83d772-4550-4e42-a90f-4e97305e47ba
                  kuscia.secretflow/task-resource-uid=00eca8d7-66e8-48c6-81f7-8b4a98fae2a4
                  kuscia.secretflow/task-uid=48c00ab4-b3da-46d5-a354-87cdb022a4f9
Annotations:      kuscia.secretflow/config-template-value-cm-name: vhau-rjcgwwqd-node-3-kuscia-gen-conf
                  kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/initiator: alice
                  kuscia.secretflow/task-id: vhau-rjcgwwqd-node-3
                  kuscia.secretflow/task-resource: vhau-rjcgwwqd-node-3-c818a8c67546
                  kuscia.secretflow/task-resource-group: vhau-rjcgwwqd-node-3
                  kuscia.secretflow/taskresource-reserving-timestamp: 2025-01-06T15:41:36+08:00
Status:           Pending
IP:               10.88.0.3
IPs:
  IP:  10.88.0.3
Containers:
  secretflow:
    Container ID:  
    Image:         secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    Image ID:      
    Ports:         26556/TCP, 26557/TCP, 26558/TCP, 26552/TCP, 26553/TCP, 26554/TCP, 26555/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      sh
    Args:
      -c
      python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Environment:
      KUSCIA_PORT_FED_NUMBER:             26557
      KUSCIA_PORT_GLOBAL_NUMBER:          26558
      KUSCIA_PORT_NODE_MANAGER_NUMBER:    26552
      KUSCIA_PORT_OBJECT_MANAGER_NUMBER:  26553
      KUSCIA_PORT_CLIENT_SERVER_NUMBER:   26554
      KUSCIA_PORT_INFERENCE_NUMBER:       26555
      KUSCIA_PORT_SPU_NUMBER:             26556
    Mounts:
      ./kuscia/task-config.conf from config-template (rw,path="task-config.conf")
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-template:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        vhau-rjcgwwqd-node-3-configtemplate
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=alice
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From              Message
  ----     ------             ----                   ----              -------
  Warning  FailedScheduling   4m38s                  kuscia-scheduler  0/1 nodes are available: waiting for task resource. can not find related task resource, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   Scheduled          4m36s                  kuscia-scheduler  Successfully assigned alice/vhau-rjcgwwqd-node-3-0 to root-kuscia-autonomy-alice-localhost-localdomain
  Warning  Failed             2m30s                  Agent             Error: context deadline exceeded
  Warning  MissingClusterDNS  2m29s (x3 over 4m36s)  Agent             pod: "vhau-rjcgwwqd-node-3-0_alice(9974d265-8e53-4497-9ffb-fe35f364aafb)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
  Normal   Pulled             2m29s (x2 over 4m31s)  Agent             Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1" already present on machine
  Warning  Failed             2m26s                  Agent             Error: failed to reserve container name "secretflow_vhau-rjcgwwqd-node-3-0_alice_9974d265-8e53-4497-9ffb-fe35f364aafb_0": name "secretflow_vhau-rjcgwwqd-node-3-0_alice_9974d265-8e53-4497-9ffb-fe35f364aafb_0" is reserved for "d5c8e3ba17d0ba4c520fd9d0ab4f934797c666033bb2a19f322557cb265322b1"
bash-5.2# 

@wangzul
Copy link

wangzul commented Jan 6, 2025

Could you provide the following (a command sketch follows this list):

  1. Memory and disk usage on the host machine.
  2. The output of kuscia image ls run inside the container.
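A rough sketch of how that information could be gathered (the container name root-kuscia-autonomy-alice is an assumption; it is the alice node container name that appears elsewhere in this thread):

  # on the host: memory and disk usage
  free -h
  df -h
  # inside the kuscia container: list the images the Kuscia agent knows about
  docker exec -it root-kuscia-autonomy-alice kuscia image ls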

@bzzbzz7
Author

bzzbzz7 commented Jan 6, 2025

[root@localhost scripts]# free -lh
              total        used        free      shared  buff/cache   available
Mem:           31Gi        18Gi       1.9Gi        83Mi        10Gi        11Gi
Low:           31Gi        29Gi       1.9Gi
High:            0B          0B          0B
Swap:         5.0Gi       1.7Gi       3.3Gi
[root@localhost scripts]# df -lh
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs              16G     0   16G   0% /dev
tmpfs                 16G     0   16G   0% /dev/shm
tmpfs                 16G   34M   16G   1% /run
tmpfs                 16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/cl-root   44G   20G   25G  44% /
/dev/vda1           1014M  242M  773M  24% /boot
/dev/vdb1            200G   60G  141G  30% /data
tmpfs                3.2G  1.2M  3.2G   1% /run/user/42
overlay              200G   60G  141G  30% /data/docker-data/overlay2/f179caa4e8d6c7655564bcdc85caf156f241d47f2f3a5ce75efbdcd78d71426d/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/4ee2668dfa9c76ae84361198bcf6c516446271f0d91a2d573d38ebd1e9e61d0e/merged
tmpfs                3.2G     0  3.2G   0% /run/user/0
overlay              200G   60G  141G  30% /data/docker-data/overlay2/d0cf73fe63a8b9c9b874f6e24a0e0b661f828fa601477b822a691019c4ecf3b4/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/c2c48b8e5ccd98d463a8ba7968d044c7eeb8779c1b295b7c74fd541897d88694/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/77aa506483a9f183241e2722047fa44ea1298374fe69f414d4211b8abe8b95ef/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/46cdaba7bdbbae0c73cf3d4922e623ec66fa09308f282d7059ca2434613084d7/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/4900be96241907a0377820c9e03476cd269aed83bfad42bf3acaa26d45bd5cbd/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/9772ab390165051c85be67209ee9b944416d3a3be8789b0451a6f630fa8ea738/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/89a3c4cee946437be81d1a1f31e4b60ca52fbabac8c72bd1cb82dbc4ac89ce88/merged
overlay              200G   60G  141G  30% /data/docker-data/overlay2/c40b205a9d395b630bfe0b24766e45a5cf9aa51c06358251ef6385787c77700e/merged
[root@localhost scripts]# 

bash-5.2# kuscia image ls
IMAGE                                                                                TAG                 IMAGE ID            SIZE
docker.io/secretflow/pause                                                           3.6                 6270bb605e12e       686kB
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/dataproxy                 0.3.0b0             08fc3dc328398       800MB
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia                    0.13.0b0            6f09d700eea57       1.1GB
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/scql                      0.9.2b1             a39a73ed2bfa3       582MB
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8   1.11.0b1            9bcb2e7a4264f       1.7GB
secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/serving-anolis8           0.8.0b0             8d9de8f6e1457       451MB
bash-5.2# 

@wangzul

wangzul commented Jan 6, 2025

That looks fine. Please run kubectl get kj -n cross-domain wgjf -oyaml.

@bzzbzz7
Author

bzzbzz7 commented Jan 6, 2025

wgjf is an old job; the newer job is vhau:

bash-5.2# kubectl get kj -n cross-domain vhau -oyaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaJob
metadata:
  annotations:
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/interconn-kuscia-parties: carol
    kuscia.secretflow/interconn-self-parties: alice
    kuscia.secretflow/self-cluster-as-initiator: "true"
  creationTimestamp: "2025-01-06T07:41:33Z"
  generation: 1
  name: vhau
  namespace: cross-domain
  resourceVersion: "31855"
  uid: 6e7d27d2-58bc-463b-b159-958c4c2cb278
spec:
  initiator: alice
  maxParallelism: 1
  scheduleMode: BestEffort
  tasks:
  - alias: vhau-rjcgwwqd-node-3
    appImage: secretflow-image
    parties:
    - domainID: carol
    - domainID: alice
    taskID: vhau-rjcgwwqd-node-3
    taskInputConfig: |-
      {
        "sf_datasource_config": {
          "carol": {
            "id": "default-data-source"
          },
          "alice": {
            "id": "default-data-source"
          }
        },
        "sf_cluster_desc": {
          "parties": ["carol", "alice"],
          "devices": [{
            "name": "spu",
            "type": "spu",
            "parties": ["carol", "alice"],
            "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
          }, {
            "name": "heu",
            "type": "heu",
            "parties": ["carol", "alice"],
            "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
          }],
          "ray_fed_config": {
            "cross_silo_comm_backend": "brpc_link"
          }
        },
        "sf_node_eval_param": {
          "domain": "data_prep",
          "name": "psi",
          "version": "1.0.0",
          "attr_paths": ["input/input_ds1/keys", "input/input_ds2/keys", "protocol", "sort_result", "receiver_parties", "allow_empty_result", "join_type", "input_ds1_keys_duplicated", "input_ds2_keys_duplicated"],
          "attrs": [{
            "is_na": false,
            "ss": ["id1"]
          }, {
            "is_na": false,
            "ss": ["id2"]
          }, {
            "is_na": false,
            "s": "PROTOCOL_RR22"
          }, {
            "b": true,
            "is_na": false
          }, {
            "is_na": false,
            "ss": ["alice", "carol"]
          }, {
            "is_na": true
          }, {
            "is_na": false,
            "s": "inner_join"
          }, {
            "b": true,
            "is_na": false
          }, {
            "b": true,
            "is_na": false
          }],
          "inputs": [{
            "type": "sf.table.individual",
            "meta": {
              "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
              "line_count": "-1"
            },
            "data_refs": [{
              "uri": "alice_2108767103.csv",
              "party": "alice",
              "format": "csv"
            }]
          }, {
            "type": "sf.table.individual",
            "meta": {
              "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
              "line_count": "-1"
            },
            "data_refs": [{
              "uri": "bob_1650137662.csv",
              "party": "carol",
              "format": "csv"
            }]
          }],
          "checkpoint_uri": "ckvhau-rjcgwwqd-node-3-output-0"
        },
        "sf_output_uris": ["vhau_rjcgwwqd_node_3_output_0", "vhau_rjcgwwqd_node_3_output_1"],
        "sf_input_ids": ["rvdyopqn", "fmaalvjk"],
        "sf_input_partitions_spec": ["", ""],
        "sf_output_ids": ["vhau-rjcgwwqd-node-3-output-0", "vhau-rjcgwwqd-node-3-output-1"],
        "table_attrs": [{
          "table_id": "fmaalvjk",
          "column_attrs": [{
            "col_name": "id2",
            "col_type": "feature"
          }, {
            "col_name": "contact_cellular",
            "col_type": "feature"
          }, {
            "col_name": "contact_telephone",
            "col_type": "feature"
          }, {
            "col_name": "contact_unknown",
            "col_type": "feature"
          }, {
            "col_name": "month_apr",
            "col_type": "feature"
          }, {
            "col_name": "month_aug",
            "col_type": "feature"
          }, {
            "col_name": "month_dec",
            "col_type": "feature"
          }, {
            "col_name": "month_feb",
            "col_type": "feature"
          }, {
            "col_name": "month_jan",
            "col_type": "feature"
          }, {
            "col_name": "month_jul",
            "col_type": "feature"
          }, {
            "col_name": "month_jun",
            "col_type": "feature"
          }, {
            "col_name": "month_mar",
            "col_type": "feature"
          }, {
            "col_name": "month_may",
            "col_type": "feature"
          }, {
            "col_name": "month_nov",
            "col_type": "feature"
          }, {
            "col_name": "month_oct",
            "col_type": "feature"
          }, {
            "col_name": "month_sep",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_failure",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_other",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_success",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_unknown",
            "col_type": "feature"
          }, {
            "col_name": "y",
            "col_type": "feature"
          }]
        }, {
          "table_id": "rvdyopqn",
          "column_attrs": [{
            "col_name": "id1",
            "col_type": "feature"
          }, {
            "col_name": "age",
            "col_type": "feature"
          }, {
            "col_name": "education",
            "col_type": "feature"
          }, {
            "col_name": "default",
            "col_type": "feature"
          }, {
            "col_name": "balance",
            "col_type": "feature"
          }, {
            "col_name": "housing",
            "col_type": "feature"
          }, {
            "col_name": "loan",
            "col_type": "feature"
          }, {
            "col_name": "day",
            "col_type": "feature"
          }, {
            "col_name": "duration",
            "col_type": "feature"
          }, {
            "col_name": "campaign",
            "col_type": "feature"
          }, {
            "col_name": "pdays",
            "col_type": "feature"
          }, {
            "col_name": "previous",
            "col_type": "feature"
          }, {
            "col_name": "job_blue-collar",
            "col_type": "feature"
          }, {
            "col_name": "job_entrepreneur",
            "col_type": "feature"
          }, {
            "col_name": "job_housemaid",
            "col_type": "feature"
          }, {
            "col_name": "job_management",
            "col_type": "feature"
          }, {
            "col_name": "job_retired",
            "col_type": "feature"
          }, {
            "col_name": "job_self-employed",
            "col_type": "feature"
          }, {
            "col_name": "job_services",
            "col_type": "feature"
          }, {
            "col_name": "job_student",
            "col_type": "feature"
          }, {
            "col_name": "job_technician",
            "col_type": "feature"
          }, {
            "col_name": "job_unemployed",
            "col_type": "feature"
          }, {
            "col_name": "marital_divorced",
            "col_type": "feature"
          }, {
            "col_name": "marital_married",
            "col_type": "feature"
          }, {
            "col_name": "marital_single",
            "col_type": "feature"
          }]
        }]
      }
    tolerable: false
  - alias: vhau-rjcgwwqd-node-4
    appImage: secretflow-image
    dependencies:
    - vhau-rjcgwwqd-node-3
    parties:
    - domainID: carol
    - domainID: alice
    taskID: vhau-rjcgwwqd-node-4
    taskInputConfig: |-
      {
        "sf_datasource_config": {
          "carol": {
            "id": "default-data-source"
          },
          "alice": {
            "id": "default-data-source"
          }
        },
        "sf_cluster_desc": {
          "parties": ["carol", "alice"],
          "devices": [{
            "name": "spu",
            "type": "spu",
            "parties": ["carol", "alice"],
            "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
          }, {
            "name": "heu",
            "type": "heu",
            "parties": ["carol", "alice"],
            "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
          }],
          "ray_fed_config": {
            "cross_silo_comm_backend": "brpc_link"
          }
        },
        "sf_node_eval_param": {
          "domain": "stats",
          "name": "table_statistics",
          "version": "1.0.0",
          "attr_paths": ["input/input_ds/features"],
          "attrs": [{
            "is_na": false,
            "ss": ["age"]
          }],
          "checkpoint_uri": "ckvhau-rjcgwwqd-node-4-output-0"
        },
        "sf_output_uris": ["vhau_rjcgwwqd_node_4_output_0"],
        "sf_input_ids": ["vhau-rjcgwwqd-node-3-output-0"],
        "sf_input_partitions_spec": [""],
        "sf_output_ids": ["vhau-rjcgwwqd-node-4-output-0"],
        "table_attrs": [{
          "table_id": "fmaalvjk",
          "column_attrs": [{
            "col_name": "id2",
            "col_type": "feature"
          }, {
            "col_name": "contact_cellular",
            "col_type": "feature"
          }, {
            "col_name": "contact_telephone",
            "col_type": "feature"
          }, {
            "col_name": "contact_unknown",
            "col_type": "feature"
          }, {
            "col_name": "month_apr",
            "col_type": "feature"
          }, {
            "col_name": "month_aug",
            "col_type": "feature"
          }, {
            "col_name": "month_dec",
            "col_type": "feature"
          }, {
            "col_name": "month_feb",
            "col_type": "feature"
          }, {
            "col_name": "month_jan",
            "col_type": "feature"
          }, {
            "col_name": "month_jul",
            "col_type": "feature"
          }, {
            "col_name": "month_jun",
            "col_type": "feature"
          }, {
            "col_name": "month_mar",
            "col_type": "feature"
          }, {
            "col_name": "month_may",
            "col_type": "feature"
          }, {
            "col_name": "month_nov",
            "col_type": "feature"
          }, {
            "col_name": "month_oct",
            "col_type": "feature"
          }, {
            "col_name": "month_sep",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_failure",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_other",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_success",
            "col_type": "feature"
          }, {
            "col_name": "poutcome_unknown",
            "col_type": "feature"
          }, {
            "col_name": "y",
            "col_type": "feature"
          }]
        }, {
          "table_id": "rvdyopqn",
          "column_attrs": [{
            "col_name": "id1",
            "col_type": "feature"
          }, {
            "col_name": "age",
            "col_type": "feature"
          }, {
            "col_name": "education",
            "col_type": "feature"
          }, {
            "col_name": "default",
            "col_type": "feature"
          }, {
            "col_name": "balance",
            "col_type": "feature"
          }, {
            "col_name": "housing",
            "col_type": "feature"
          }, {
            "col_name": "loan",
            "col_type": "feature"
          }, {
            "col_name": "day",
            "col_type": "feature"
          }, {
            "col_name": "duration",
            "col_type": "feature"
          }, {
            "col_name": "campaign",
            "col_type": "feature"
          }, {
            "col_name": "pdays",
            "col_type": "feature"
          }, {
            "col_name": "previous",
            "col_type": "feature"
          }, {
            "col_name": "job_blue-collar",
            "col_type": "feature"
          }, {
            "col_name": "job_entrepreneur",
            "col_type": "feature"
          }, {
            "col_name": "job_housemaid",
            "col_type": "feature"
          }, {
            "col_name": "job_management",
            "col_type": "feature"
          }, {
            "col_name": "job_retired",
            "col_type": "feature"
          }, {
            "col_name": "job_self-employed",
            "col_type": "feature"
          }, {
            "col_name": "job_services",
            "col_type": "feature"
          }, {
            "col_name": "job_student",
            "col_type": "feature"
          }, {
            "col_name": "job_technician",
            "col_type": "feature"
          }, {
            "col_name": "job_unemployed",
            "col_type": "feature"
          }, {
            "col_name": "marital_divorced",
            "col_type": "feature"
          }, {
            "col_name": "marital_married",
            "col_type": "feature"
          }, {
            "col_name": "marital_single",
            "col_type": "feature"
          }]
        }]
      }
    tolerable: false
status:
  approveStatus:
    alice: JobAccepted
    carol: JobAccepted
  conditions:
  - lastTransitionTime: "2025-01-06T07:41:33Z"
    status: "True"
    type: JobValidated
  lastReconcileTime: "2025-01-06T07:41:34Z"
  phase: Running
  stageStatus:
    alice: JobCreateStageSucceeded
    carol: JobCreateStageSucceeded
  startTime: "2025-01-06T07:41:33Z"
  taskStatus:
    vhau-rjcgwwqd-node-3: Pending
bash-5.2#

@bzzbzz7
Author

bzzbzz7 commented Jan 6, 2025

bash-5.2# kubectl get po -A
NAMESPACE   NAME                               READY   STATUS                 RESTARTS   AGE
alice       dataproxy-alice-7656f68568-45wqh   1/1     Running                0          5h55m
alice       vhau-rjcgwwqd-node-3-0             0/1     CreateContainerError   0          25m
alice       mvpx-rjcgwwqd-node-3-0             0/1     CreateContainerError   0          18m
bash-5.2# 
bash-5.2# 
bash-5.2# kubectl get kt -A
NAMESPACE      NAME                   STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
cross-domain   mvpx-rjcgwwqd-node-3   18m                          3m11s               Pending
cross-domain   vhau-rjcgwwqd-node-3   25m                          3m11s               Pending
bash-5.2# 
bash-5.2# kubectl get kj -A
NAMESPACE      NAME   STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
carol          vhau                                                    
cross-domain   vhau   25m                          25m                 Running
carol          mvpx                                                    
cross-domain   mvpx   18m                          18m                 Running
bash-5.2# 
bash-5.2# 

@bzzbzz7
Author

bzzbzz7 commented Jan 6, 2025

[screenshot attachment]

@wangzul

wangzul commented Jan 6, 2025

  1. Take another look at this output: kubectl get pod -n alice vhau-rjcgwwqd-node-3 -oyaml
  2. Check the Docker status/configuration on the host where the failing service runs.
  3. Run kubectl get node and check the node details (a command sketch follows this list).
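A sketch of the requested checks, interpreting item 2 as docker stats on the host (pod and node names are taken from earlier output in this thread):

  # 1. full YAML of the failing pod
  kubectl get pod -n alice vhau-rjcgwwqd-node-3-0 -oyaml
  # 2. container resource usage on the host
  docker stats --no-stream
  # 3. node details
  kubectl get node
  kubectl describe node root-kuscia-autonomy-alice-localhost-localdomain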

@bzzbzz7
Author

bzzbzz7 commented Jan 6, 2025

bash-5.2# kubectl get pod -n alice vhau-rjcgwwqd-node-3-0 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kuscia.secretflow/config-template-value-cm-name: vhau-rjcgwwqd-node-3-kuscia-gen-conf
    kuscia.secretflow/config-template-volumes: config-template
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/task-id: vhau-rjcgwwqd-node-3
    kuscia.secretflow/task-resource: vhau-rjcgwwqd-node-3-c818a8c67546
    kuscia.secretflow/task-resource-group: vhau-rjcgwwqd-node-3
    kuscia.secretflow/taskresource-reserving-timestamp: "2025-01-06T15:41:36+08:00"
  creationTimestamp: "2025-01-06T07:41:34Z"
  labels:
    kuscia.secretflow/communication-role-client: "true"
    kuscia.secretflow/communication-role-server: "true"
    kuscia.secretflow/controller: kusciatask
    kuscia.secretflow/pod-identity: 48c00ab4-b3da-46d5-a354-87cdb022a4f9-0
    kuscia.secretflow/pod-role: ""
    kuscia.secretflow/task-resource-group-uid: 0f83d772-4550-4e42-a90f-4e97305e47ba
    kuscia.secretflow/task-resource-uid: 00eca8d7-66e8-48c6-81f7-8b4a98fae2a4
    kuscia.secretflow/task-uid: 48c00ab4-b3da-46d5-a354-87cdb022a4f9
  name: vhau-rjcgwwqd-node-3-0
  namespace: alice
  resourceVersion: "32140"
  uid: 9974d265-8e53-4497-9ffb-fe35f364aafb
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - -c
    - python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    command:
    - sh
    env:
    - name: KUSCIA_PORT_FED_NUMBER
      value: "26557"
    - name: KUSCIA_PORT_GLOBAL_NUMBER
      value: "26558"
    - name: KUSCIA_PORT_NODE_MANAGER_NUMBER
      value: "26552"
    - name: KUSCIA_PORT_OBJECT_MANAGER_NUMBER
      value: "26553"
    - name: KUSCIA_PORT_CLIENT_SERVER_NUMBER
      value: "26554"
    - name: KUSCIA_PORT_INFERENCE_NUMBER
      value: "26555"
    - name: KUSCIA_PORT_SPU_NUMBER
      value: "26556"
    image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    imagePullPolicy: IfNotPresent
    name: secretflow
    ports:
    - containerPort: 26556
      name: spu
      protocol: TCP
    - containerPort: 26557
      name: fed
      protocol: TCP
    - containerPort: 26558
      name: global
      protocol: TCP
    - containerPort: 26552
      name: node-manager
      protocol: TCP
    - containerPort: 26553
      name: object-manager
      protocol: TCP
    - containerPort: 26554
      name: client-server
      protocol: TCP
    - containerPort: 26555
      name: inference
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: ./kuscia/task-config.conf
      name: config-template
      subPath: task-config.conf
    workingDir: /app
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: root-kuscia-autonomy-alice-localhost-localdomain
  nodeSelector:
    kuscia.secretflow/namespace: alice
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: kuscia-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: kuscia.secretflow/agent
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: vhau-rjcgwwqd-node-3-configtemplate
    name: config-template
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    message: 'containers with unready status: [secretflow]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    message: 'containers with unready status: [secretflow]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T07:41:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    imageID: ""
    lastState: {}
    name: secretflow
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: context deadline exceeded
        reason: CreateContainerError
  hostIP: 172.23.0.2
  phase: Pending
  podIP: 10.88.0.3
  podIPs:
  - ip: 10.88.0.3
  startTime: "2025-01-06T07:41:36Z"
bash-5.2# 



bash-5.2# kubectl get node
NAME                                               STATUS   ROLES   AGE     VERSION
root-kuscia-autonomy-alice-localhost-localdomain   Ready    agent   6h20m   v0.13.0b0
bash-5.2# 
bash-5.2# kubectl describe node root-kuscia-autonomy-alice-localhost-localdomain
Name:               root-kuscia-autonomy-alice-localhost-localdomain
Roles:              agent
Labels:             beta.kubernetes.io/arch=x86_64
                    beta.kubernetes.io/os=linux
                    domain=alice
                    kubernetes.io/apiVersion=0.26.6
                    kubernetes.io/arch=x86_64
                    kubernetes.io/hostname=root-kuscia-autonomy-alice-localhost-localdomain
                    kubernetes.io/os=linux
                    kubernetes.io/role=agent
                    kuscia.secretflow/namespace=alice
                    kuscia.secretflow/runtime=runc
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Mon, 06 Jan 2025 10:03:44 +0800
Taints:             kuscia.secretflow/agent=v1:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  root-kuscia-autonomy-alice-localhost-localdomain
  AcquireTime:     <unset>
  RenewTime:       Mon, 06 Jan 2025 16:24:27 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                                                          Message
  ----                 ------  -----------------                 ------------------                ------                                                          -------
  NetworkUnavailable   False   Mon, 06 Jan 2025 10:03:44 +0800   Mon, 06 Jan 2025 10:03:44 +0800   RouteCreated                                                    RouteController created a route
  PIDPressure          False   Mon, 06 Jan 2025 10:03:44 +0800   Mon, 06 Jan 2025 10:03:44 +0800   AgentHasSufficientPID                                           Agent has sufficient PID available
  MemoryPressure       False   Mon, 06 Jan 2025 16:23:45 +0800   Mon, 06 Jan 2025 10:03:44 +0800   AgentHasSufficientMemory                                        Agent has sufficient memory available, total=31.2GB, available=7.7GB
  DiskPressure         False   Mon, 06 Jan 2025 16:23:45 +0800   Mon, 06 Jan 2025 10:03:44 +0800   AgentHasNoDiskPressure                                          Agent has no disk pressure. @agent_volume(/home/kuscia/var/storage/data): space=19.1GB/44.0GB(43.4%) inode=212.2k/23.1M(0.9%)
  OutOfDisk            False   Mon, 06 Jan 2025 16:23:45 +0800   Mon, 06 Jan 2025 10:03:44 +0800   AgentHasSufficientDisk                                          Agent has sufficient disk space available. @agent_volume: free_space=24.9GB, free_inode=22.9M
  Kernel-Params        False   Mon, 06 Jan 2025 16:23:45 +0800   Mon, 06 Jan 2025 10:03:44 +0800   Kernel parameters not satisfy kuscia recommended requirements   tcp_max_syn_backlog=1024[ERR];somaxconn=128[ERR];tcp_retries2=15[ERR];tcp_slow_start_after_idle=1[ERR];tcp_tw_reuse=2[OK];file-max=3257792[OK]
  Ready                True    Mon, 06 Jan 2025 16:23:45 +0800   Mon, 06 Jan 2025 10:03:44 +0800   AgentReady                                                      Agent is ready
Addresses:
  InternalIP:  172.23.0.2
Capacity:
  cpu:      16
  memory:   6Gi
  pods:     500
  storage:  209611780Ki
Allocatable:
  cpu:      16
  memory:   5644Mi
  pods:     500
  storage:  172094668Ki
System Info:
  Machine ID:                 367e8547-a38d-4166-b569-eca3953594fe
  System UUID:                
  Boot ID:                    1735177200-1736129024180980325
  Kernel Version:             4.18.0-305.3.1.el8.x86_64
  OS Image:                   docker://linux/anolis:23 (guest)
  Operating System:           linux
  Architecture:               x86_64
  Container Runtime Version:  
  Kubelet Version:            v0.13.0b0
  Kube-Proxy Version:         
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                ------------  ----------  ---------------  -------------  ---
  alice                       dataproxy-alice-7656f68568-45wqh    0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h13m
  alice                       vhau-rjcgwwqd-node-3-0              0 (0%)        0 (0%)      0 (0%)           0 (0%)         43m
  alice                       mvpx-rjcgwwqd-node-3-0              0 (0%)        0 (0%)      0 (0%)           0 (0%)         35m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  storage            0         0
Events:              <none>
bash-5.2# 



[root@localhost scripts]# docker stats
CONTAINER ID   NAME                                   CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
ef053fa1b4a6   maxkb                                  423.16%   4.012GiB / 31.21GiB   12.85%    19.6MB / 163MB    87.5MB / 3.99MB   378
b410b4f8cd77   pgsql                                  4.08%     155MiB / 2GiB         7.57%     168MB / 19.7MB    8.49MB / 465MB    24
3d0ebe302ccc   root-kuscia-autonomy-secretpad-carol   1.22%     2.476GiB / 4GiB       61.91%    5.06MB / 7.52MB   32.3MB / 12.7MB   156
20e7a54fc852   root-kuscia-autonomy-secretpad-bob     0.84%     2.374GiB / 4GiB       59.35%    4.77MB / 1.06MB   17.8MB / 7.26MB   148
a08650adf696   root-kuscia-autonomy-secretpad-alice   0.94%     2.672GiB / 4GiB       66.79%    9.5MB / 3.62MB    39MB / 51MB       164
c8a08cdd0f1f   root-kuscia-autonomy-carol             14.93%    1.478GiB / 6GiB       24.63%    3.25MB / 3.4MB    1.82GB / 4.6GB    435
12868d960c75   root-kuscia-autonomy-bob               11.17%    816.9MiB / 6GiB       13.30%    2.17MB / 2.19MB   37.4MB / 4.59GB   217
bc8179756c04   root-kuscia-autonomy-alice             8.79%     2.19GiB / 6GiB        36.49%    5.15MB / 5.94MB   1.84GB / 4.08GB   260
10cea780f496   secretnote-sf-sim-alice-1              31.21%    3.501GiB / 31.21GiB   11.22%    12.4MB / 22.2MB   598MB / 5.98GB    1553
102f5b07e4f7   secretnote-sf-sim-bob-1                29.32%    3.641GiB / 31.21GiB   11.67%    1.92MB / 23.8MB   220MB / 5.97GB    1553

@wangzul commented Jan 6, 2025

Try running a new task, then grab the latest Kuscia log and take a look.
Log path: /home/kuscia/var/logs/kuscia.log
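A minimal sketch of pulling that log from the host, assuming the alice node runs in the container named root-kuscia-autonomy-alice (as listed in the docker stats output above):

# show the last 200 lines of the Kuscia log inside the alice node container
docker exec root-kuscia-autonomy-alice tail -n 200 /home/kuscia/var/logs/kuscia.log

# or copy the whole log file to the host so it can be attached to this issue
docker cp root-kuscia-autonomy-alice:/home/kuscia/var/logs/kuscia.log ./kuscia-alice.log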

@wangzul commented Jan 6, 2025

Alternatively, try stopping a few Docker containers that are not in use for now (to free up some memory), and then run the PSI task again.
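For example, a sketch assuming the two SecretNote simulation containers from the docker stats output above are not needed right now:

# stop containers that are not needed for the PSI run to free memory
docker stop secretnote-sf-sim-alice-1 secretnote-sf-sim-bob-1

# check how much memory is left before rerunning the task
docker stats --no-stream
free -h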

@bzzbzz7 (Author) commented Jan 6, 2025

After stopping two containers, the PSI task ran successfully. Why is that? Was memory insufficient? Can the resource-isolation problem between the different containers be solved by setting request/limit resource quotas?
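(For reference, per-container memory limits can be changed on a running Docker container; this is only a sketch, with the container name taken from the docker stats output above and the 8g value purely illustrative:)

# raise the memory limit of the alice kuscia container (example value)
docker update --memory 8g --memory-swap 8g root-kuscia-autonomy-alice

# verify the new limit (reported in bytes)
docker inspect --format '{{.HostConfig.Memory}}' root-kuscia-autonomy-alice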

@wangzul commented Jan 6, 2025

Is the node that initiated the task the one reporting the error, or is the error reported on the partner side? I need to confirm this.

@bzzbzz7 (Author) commented Jan 7, 2025

The task-initiating node is the one reporting the error. The partner node's pod is running normally, although both parties are on the same host machine, just in different containers.

@bzzbzz7 (Author) commented Jan 9, 2025

I redeployed, and once again the first PSI run got stuck.

bash-5.2# kubectl -n alice describe po  hily-nekbntww-node-3-0
Name:             hily-nekbntww-node-3-0
Namespace:        alice
Priority:         0
Service Account:  default
Node:             root-kuscia-autonomy-alice-localhost-localdomain/172.26.0.2
Start Time:       Thu, 09 Jan 2025 14:00:17 +0800
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/pod-identity=818d63d7-d90e-477d-a8a1-b82699245088-0
                  kuscia.secretflow/pod-role=
                  kuscia.secretflow/task-resource-group-uid=1f19648a-5528-487f-a7d2-8d1a40054662
                  kuscia.secretflow/task-resource-uid=70678ac5-4a05-4f03-82a2-60724db91001
                  kuscia.secretflow/task-uid=818d63d7-d90e-477d-a8a1-b82699245088
Annotations:      kuscia.secretflow/config-template-value-cm-name: hily-nekbntww-node-3-kuscia-gen-conf
                  kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/initiator: alice
                  kuscia.secretflow/task-id: hily-nekbntww-node-3
                  kuscia.secretflow/task-resource: hily-nekbntww-node-3-6d20f2b88a3e
                  kuscia.secretflow/task-resource-group: hily-nekbntww-node-3
                  kuscia.secretflow/taskresource-reserving-timestamp: 2025-01-09T14:00:17+08:00
Status:           Pending
IP:               10.88.0.3
IPs:
  IP:  10.88.0.3
Containers:
  secretflow:
    Container ID:  
    Image:         secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    Image ID:      
    Ports:         21105/TCP, 21099/TCP, 21100/TCP, 21101/TCP, 21102/TCP, 21103/TCP, 21104/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      sh
    Args:
      -c
      python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Environment:
      KUSCIA_PORT_GLOBAL_NUMBER:          21100
      KUSCIA_PORT_NODE_MANAGER_NUMBER:    21101
      KUSCIA_PORT_OBJECT_MANAGER_NUMBER:  21102
      KUSCIA_PORT_CLIENT_SERVER_NUMBER:   21103
      KUSCIA_PORT_INFERENCE_NUMBER:       21104
      KUSCIA_PORT_SPU_NUMBER:             21105
      KUSCIA_PORT_FED_NUMBER:             21099
    Mounts:
      ./kuscia/task-config.conf from config-template (rw,path="task-config.conf")
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-template:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        hily-nekbntww-node-3-configtemplate
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=alice
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                  From              Message
  ----     ------             ----                 ----              -------
  Warning  FailedScheduling   2m29s                kuscia-scheduler  0/1 nodes are available: waiting for task resource. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling., can not find related task resource.
  Normal   Scheduled          2m27s                kuscia-scheduler  Successfully assigned alice/hily-nekbntww-node-3-0 to root-kuscia-autonomy-alice-localhost-localdomain
  Warning  Failed             21s                  Agent             Error: context deadline exceeded
  Warning  MissingClusterDNS  21s (x3 over 2m27s)  Agent             pod: "hily-nekbntww-node-3-0_alice(fe291298-f48b-4feb-8184-d9a3a4f4a21e)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
  Normal   Pulled             20s (x2 over 2m22s)  Agent             Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1" already present on machine
  Warning  Failed             19s                  Agent             Error: failed to reserve container name "secretflow_hily-nekbntww-node-3-0_alice_fe291298-f48b-4feb-8184-d9a3a4f4a21e_0": name "secretflow_hily-nekbntww-node-3-0_alice_fe291298-f48b-4feb-8184-d9a3a4f4a21e_0" is reserved for "74db8ea50487de725314f17ef8da5b46575ecaf0cb5747cc5fc61d4db7b563a3"
bash-5.2# kubectl get po -A
NAMESPACE   NAME                             READY   STATUS                 RESTARTS   AGE
bob         dataproxy-bob-768fc7888f-2tmxx   1/1     Running                0          20m
bob         hily-nekbntww-node-3-0           0/1     CreateContainerError   0          6m
bash-5.2# 
bash-5.2# 
bash-5.2# kubectl -n bob describe po hily-nekbntww-node-3-0
Name:             hily-nekbntww-node-3-0
Namespace:        bob
Priority:         0
Service Account:  default
Node:             root-kuscia-autonomy-bob-localhost-localdomain/172.26.0.3
Start Time:       Thu, 09 Jan 2025 14:00:17 +0800
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/pod-identity=f5919e56-faa8-4afd-a028-da5cfe94b9b0-0
                  kuscia.secretflow/pod-role=
                  kuscia.secretflow/task-resource-group-uid=3ab499e9-4d60-4be9-b327-f419b5da10e6
                  kuscia.secretflow/task-resource-uid=8c44f7f1-07a3-4d1a-9802-21d87abca9ef
                  kuscia.secretflow/task-uid=f5919e56-faa8-4afd-a028-da5cfe94b9b0
Annotations:      kuscia.secretflow/config-template-value-cm-name: hily-nekbntww-node-3-kuscia-gen-conf
                  kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/initiator: alice
                  kuscia.secretflow/task-id: hily-nekbntww-node-3
                  kuscia.secretflow/task-resource: hily-nekbntww-node-3-bc63c95dcf15
                  kuscia.secretflow/task-resource-group: hily-nekbntww-node-3
                  kuscia.secretflow/taskresource-reserving-timestamp: 2025-01-09T14:05:15+08:00
Status:           Pending
IP:               10.88.0.3
IPs:
  IP:  10.88.0.3
Containers:
  secretflow:
    Container ID:  
    Image:         secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1
    Image ID:      
    Ports:         32562/TCP, 32563/TCP, 32557/TCP, 32558/TCP, 32559/TCP, 32560/TCP, 32561/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      sh
    Args:
      -c
      python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Environment:
      KUSCIA_PORT_CLIENT_SERVER_NUMBER:   32560
      KUSCIA_PORT_INFERENCE_NUMBER:       32561
      KUSCIA_PORT_SPU_NUMBER:             32562
      KUSCIA_PORT_FED_NUMBER:             32563
      KUSCIA_PORT_GLOBAL_NUMBER:          32557
      KUSCIA_PORT_NODE_MANAGER_NUMBER:    32558
      KUSCIA_PORT_OBJECT_MANAGER_NUMBER:  32559
    Mounts:
      ./kuscia/task-config.conf from config-template (rw,path="task-config.conf")
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-template:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        hily-nekbntww-node-3-configtemplate
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=bob
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                  From              Message
  ----     ------             ----                 ----              -------
  Warning  FailedScheduling   6m13s                kuscia-scheduler  0/1 nodes are available: waiting for task resource. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling., can not find related task resource.
  Normal   Scheduled          6m10s                kuscia-scheduler  Successfully assigned bob/hily-nekbntww-node-3-0 to root-kuscia-autonomy-bob-localhost-localdomain
  Warning  Failed             4m5s                 Agent             Error: context deadline exceeded
  Warning  Failed             4m3s                 Agent             Error: failed to reserve container name "secretflow_hily-nekbntww-node-3-0_bob_d6d5795b-ccad-4399-932b-028a1c41c33d_0": name "secretflow_hily-nekbntww-node-3-0_bob_d6d5795b-ccad-4399-932b-028a1c41c33d_0" is reserved for "279331dfd39282c2e1f3d786d74f4f1bef26e5a2e2a0cccaf29e154bba2a8720"
  Warning  MissingClusterDNS  73s (x4 over 6m11s)  Agent             pod: "hily-nekbntww-node-3-0_bob(d6d5795b-ccad-4399-932b-028a1c41c33d)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
  Normal   Pulled             73s (x3 over 6m6s)   Agent             Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.11.0b1" already present on machine
bash-5.2# 
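The "failed to reserve container name ... is reserved for ..." event usually means an earlier create attempt timed out ("context deadline exceeded") and left a half-created container behind in the CRI runtime, so the name is still held. A sketch of how that could be inspected and cleaned up, assuming crictl is available inside the Kuscia autonomy container (runc runtime):

# list CRI containers left over from the failed create attempts
crictl ps -a | grep hily-nekbntww-node-3

# remove the stale container so its name can be reused (ID taken from the output above)
crictl rm <container-id>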

@wangzul commented Jan 9, 2025

Did the deployment mode change after the redeployment?
Is this a single-machine or multi-machine deployment now?
Does the system have enough resources to run the task? I need you to provide this information.

@bzzbzz7 (Author) commented Jan 9, 2025

The deployment mode hasn't changed: still P2P, single-machine all-in-one deployment.

[root@localhost ~]# free -lh
              total        used        free      shared  buff/cache   available
Mem:           31Gi       8.8Gi       5.2Gi        39Mi        17Gi        21Gi
Low:           31Gi        25Gi       5.2Gi
High:            0B          0B          0B
Swap:         5.0Gi       155Mi       4.8Gi
[root@localhost ~]# 
[root@localhost ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs              16G     0   16G   0% /dev
tmpfs                 16G     0   16G   0% /dev/shm
tmpfs                 16G   42M   16G   1% /run
tmpfs                 16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/cl-root   44G   20G   25G  44% /
/dev/vda1           1014M  242M  773M  24% /boot
/dev/vdb1            200G   50G  151G  25% /data
tmpfs                3.2G  1.2M  3.2G   1% /run/user/42
tmpfs                3.2G     0  3.2G   0% /run/user/0
overlay              200G   50G  151G  25% /data/docker-data/overlay2/000c9ff5dcfcc1df58177e6cb1cbe483f679edbc70aad05854cce6855d3d4e40/merged
overlay              200G   50G  151G  25% /data/docker-data/overlay2/4923f1b0b8c4a6a6250d99aef612e53f154cf5e48cca8ecaee4fbc868b2e1eae/merged
overlay              200G   50G  151G  25% /data/docker-data/overlay2/62acca1993d463cf01d74f71d20ecffa54fdfc8949fedd6f984ccb6a9733776d/merged
overlay              200G   50G  151G  25% /data/docker-data/overlay2/cf8536a080e12403d93448103754444ca5cc3a68993556c3adc041b98b607fb9/merged


[root@localhost ~]# docker stats
CONTAINER ID   NAME                                   CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O         PIDS
4ea68aedf82e   root-kuscia-autonomy-secretpad-bob     0.76%     2.433GiB / 4GiB     60.82%    5.48MB / 2.37MB   57.3kB / 19.6MB   160
a0dd37967afc   root-kuscia-autonomy-secretpad-alice   0.91%     2.458GiB / 4GiB     61.46%    6.81MB / 2.98MB   2.46MB / 24.2MB   155
9359692bf0bc   root-kuscia-autonomy-bob               26.71%    2.938GiB / 6GiB     48.96%    3.64MB / 2.36MB   390MB / 3.78GB    343
731b76b522c9   root-kuscia-autonomy-alice             17.05%    2.428GiB / 6GiB     40.47%    2.23MB / 3.87MB   449MB / 4.98GB    261

@wangzul commented Jan 10, 2025

(quoting the free -lh and docker stats output from the previous comment)

              total        used        free      shared  buff/cache   available
Mem:           31Gi       8.8Gi       5.2Gi        39Mi        17Gi        21Gi

  1. The error feedback still points to a resource problem. I can see your buff/cache is holding about 17 GB; could you try freeing up some space and then run the task again (see the sketch below)?
  2. If the same problem persists, please provide kuscia.log (/home/kuscia/var/logs/kuscia.log).
  3. Are the pods on both parties' nodes in CreateContainerError?
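A minimal sketch of item 1 on the host (requires root; dropping caches only discards clean page cache, and treating cache pressure as the cause here is an assumption):

# flush dirty pages to disk, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches

# check memory again before rerunning the PSI task
free -lh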

@wangzul commented Jan 24, 2025

Since there has been no reply for a long time, we will close this issue for now. If you have other questions later, feel free to reach out again!

@wangzul closed this as completed Jan 24, 2025