
hapi supports pir #68344

Merged
merged 10 commits into from
Sep 27, 2024

Conversation

@zbt78 (Contributor) commented Sep 20, 2024

PR Category

Execute Infrastructure

PR Types

New features

Description

Adapt hapi to PIR


In hapi, you first construct a network and then pass it as an argument into the Model class, which is used for training, inference, and so on. The network's parameters are created before the network is passed into hapi's Model. Internally, hapi keeps three programs (train, test, and eval), all cloned from default_main_program.

However, the current PIR program clone mechanism has a problem with cloning parameters: after a PIR program is cloned, each parameter is indeed cloned into the new program, but the parameter operands used by the new program's ops still reference the values in the old program, which breaks the executor at run time. This PR works around the problem by reassigning the network's parameters to the parameters cloned into the new program, and then rebuilding the program.

A concrete reproduction:

import paddle
import paddle.nn as nn

class AddLayer(nn.Layer):
    def __init__(self):
        super().__init__()
        self.weight = self.create_parameter(
            shape=[1, 3],
        )

    def forward(self, x):
        return paddle.add(x, self.weight)

net = AddLayer()
print(paddle.pir.core.default_main_program())
train = paddle.pir.core.default_main_program().clone()
startup = paddle.base.Program()
with paddle.base.program_guard(train, startup):
    x = paddle.static.data(name='x', shape=[1, 3], dtype='float32')
    y = net(x)
    print(train)

Output:

{
    (%0_8121790596710454734) = "builtin.parameter" () {is_distributed:[false],is_parameter:[true],need_clip:[true],parameter_name:"add_layer_0.w_0",persistable:[true],stop_gradient:[false],trainable:[true]} : () -> builtin.tensor<1x3xf32>
}
 
{
    (%0_8906774651776825509) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"x",place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[1,3],stop_gradient:[true]} : () -> builtin.tensor<1x3xf32>
    (%1_8906774651776825509) = "builtin.parameter" () {is_distributed:[false],is_parameter:[true],need_clip:[true],parameter_name:"add_layer_0.w_0",persistable:[true],stop_gradient:[false],trainable:[true]} : () -> builtin.tensor<1x3xf32>
    (%2_8906774651776825509) = "pd_op.add" (%0_8906774651776825509, %3_8121790596710454734) {stop_gradient:[false],struct_name:"/AddLayer/"} : (builtin.tensor<1x3xf32>, builtin.tensor<1x3xf32>) -> builtin.tensor<1x3xf32>
}

The long numeric suffixes in the printed result are program ids. Notice that the second operand of pd_op.add (%3_8121790596710454734) still belongs to the original program rather than the clone.
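The mismatch above, and the workaround of rebinding parameters to their clones, can be illustrated with a minimal pure-Python model. All classes and function names here are hypothetical stand-ins for illustration, not Paddle's real PIR data structures:

```python
# Toy model of the PIR clone issue: cloning re-creates the parameters
# inside the new program, but op operands keep pointing at the old
# program's values.

class Value:
    def __init__(self, name, program_id):
        self.name = name
        self.program_id = program_id

class Program:
    _next_id = 0

    def __init__(self):
        Program._next_id += 1
        self.id = Program._next_id
        self.parameters = {}  # name -> Value
        self.ops = []         # (op_name, [operand Values])

    def add_parameter(self, name):
        v = Value(name, self.id)
        self.parameters[name] = v
        return v

    def clone(self):
        new = Program()
        for name in self.parameters:
            new.add_parameter(name)  # parameters are re-created in the clone...
        new.ops = [(op, list(operands)) for op, operands in self.ops]
        return new                   # ...but operands still reference self

def rebind_parameters(program):
    """The workaround: remap every op operand that names a parameter
    to the clone's own parameter of the same name."""
    for _, operands in program.ops:
        for i, v in enumerate(operands):
            if v.name in program.parameters:
                operands[i] = program.parameters[v.name]

main = Program()
w = main.add_parameter("add_layer_0.w_0")
main.ops.append(("pd_op.add", [w]))

train = main.clone()
stale = train.ops[0][1][0]
print(stale.program_id == train.id)  # False: operand still in the old program

rebind_parameters(train)
fixed = train.ops[0][1][0]
print(fixed.program_id == train.id)  # True after rebinding
```

This mirrors the fix in spirit only: the actual PR reassigns the network's parameters before the program is rebuilt, rather than patching operands in place.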


Explanations of the newly added interfaces:

GetAllParameterValues: returns all parameters of the current program.

set_is_test_attr: some ops behave differently in training and inference mode and use the is_test attribute to distinguish the two; this interface modifies the is_test attribute of such ops in training mode.
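The behavior of these two interfaces can be sketched over a toy op list. The names and data shapes below are hypothetical; the real implementations walk the PIR program's ops in C++:

```python
# Hypothetical sketches of the two new interfaces, modeled on a plain
# list of (op_name, attrs) pairs rather than Paddle's real PIR structures.

def get_all_parameter_values(ops):
    """Collect the result of every builtin.parameter op, mirroring
    what GetAllParameterValues returns for a program."""
    return [attrs["result"] for name, attrs in ops if name == "builtin.parameter"]

def set_is_test_attr(ops, is_test):
    """Flip the is_test attribute on ops that carry it, so ops such as
    dropout or batch_norm switch between train and eval behavior."""
    for _, attrs in ops:
        if "is_test" in attrs:
            attrs["is_test"] = is_test

program = [
    ("builtin.parameter", {"result": "add_layer_0.w_0"}),
    ("pd_op.dropout", {"is_test": False}),
    ("pd_op.add", {}),
]

print(get_all_parameter_values(program))  # ['add_layer_0.w_0']
set_is_test_attr(program, True)
print(program[1][1]["is_test"])           # True
```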

@CLAassistant commented Sep 20, 2024

CLA assistant check
All committers have signed the CLA.

@paddle-bot paddle-bot bot added the contributor External developers label Sep 20, 2024
@wanghuancoder (Contributor) left a comment:

LGTM

@XiaoguangHu01 (Contributor) left a comment:

LGTM

@wanghuancoder wanghuancoder merged commit 613df5c into PaddlePaddle:develop Sep 27, 2024
26 of 27 checks passed