[Need Help] Integrated GPU - Shared memory with CPU #314

Closed
Hideman85 opened this issue Sep 12, 2024 · 1 comment
Hideman85 commented Sep 12, 2024

I'm currently learning The Forge and I'm looking for some help with this topic, because I'm getting really confused right now.


I would like to run a compute shader on my iGPU and take advantage of the memory it shares with the CPU (read/write in the same memory space, without any transfers).

So right now I'm trying a simple example: a compute shader that doubles each float in my array/buffer.

My shader, double.comp.fsl:
RES(RWBuffer(float), myData, UPDATE_FREQ_NONE, b0, binding=0);

// Main compute shader function
NUM_THREADS(8, 8, 1)
void CS_MAIN(SV_GroupThreadID(uint3) inGroupId, SV_GroupID(uint3) groupId)
{
    INIT_MAIN;

    myData[inGroupId.x] *= 2.0; // Simple operation: double each float

    RETURN();
}
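
As a side note, here is the dispatch-size arithmetic I'm assuming on the CPU side. This is only a sketch with my own constant names (NB_ELEMENTS, THREADS_PER_GROUP, GROUP_COUNT), and the comment also flags a caveat about the indexing used above.

#include <cstdint>

// Sketch: how many thread groups are needed to cover NB_ELEMENTS floats,
// assuming one thread per element.
constexpr uint32_t NB_ELEMENTS       = 100;   // floats in the buffer
constexpr uint32_t THREADS_PER_GROUP = 8 * 8; // matches NUM_THREADS(8, 8, 1)

// Round up so a partial group still covers the tail of the buffer.
constexpr uint32_t GROUP_COUNT = (NB_ELEMENTS + THREADS_PER_GROUP - 1) / THREADS_PER_GROUP;

// Caveat: as written the shader indexes myData[inGroupId.x], which only reaches
// elements 0-7; a flattened index such as
//     groupId.x * 64 + inGroupId.y * 8 + inGroupId.x
// would be needed for every thread to touch a distinct element.
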
I'm able to find my integrated GPU:
Renderer *pRenderer = nullptr;
Renderer *pCompute = nullptr;

bool MyApp::Init() {
        RendererContextDesc contextSettings = {};
        RendererContext* pContext = NULL;
        initRendererContext(GetName(), &contextSettings, &pContext);

        RendererDesc settings = {};

        // Need one GPU for rendering and one for compute to simplify
        if (pContext && pContext->mGpuCount >= 2)
        {
            uint32_t queueFamilyCount = 0;
            VkPhysicalDeviceMemoryProperties memProperties;
            auto SHARED_FLAG = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

            int bestGpuIndex = -1;
            int bestProfile = -1;

            struct IntegratedComputeGPU { int idx; uint32_t mem;};
            std::vector<IntegratedComputeGPU> gComputeGPUs = {};

            for (int i = 0; i < pContext->mGpuCount; i++) {
                auto profile = pContext->mGpus[i].mSettings.mGpuVendorPreset.mPresetLevel;
                if (profile > bestProfile) {
                    std::string str = "======> GPU " + std::to_string(i) + " profile " + std::to_string(profile);
                    LOGF(LogLevel::eINFO, str.c_str());
                    bestProfile = profile;
                    bestGpuIndex = i;
                }

                auto device = pContext->mGpus[i].mVk.pGpu;
                auto& props = pContext->mGpus[i].mVk.mGpuProperties.properties;

                if (props.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU) {
                    vkGetPhysicalDeviceQueueFamilyProperties(device, &queueFamilyCount, NULL);
                    std::vector<VkQueueFamilyProperties> queueFamilies(queueFamilyCount);
                    vkGetPhysicalDeviceQueueFamilyProperties(device, &queueFamilyCount, queueFamilies.data());

                    for (VkQueueFamilyProperties& queueFamily : queueFamilies) {
                        if (queueFamily.queueFlags & VK_QUEUE_COMPUTE_BIT) {
                            vkGetPhysicalDeviceMemoryProperties(device, &memProperties);

                            for (uint32_t j = 0; j < memProperties.memoryTypeCount; j++) {
                                // Require both DEVICE_LOCAL and HOST_VISIBLE (a plain & would match either bit alone)
                                if ((memProperties.memoryTypes[j].propertyFlags & SHARED_FLAG) == SHARED_FLAG) {
                                    gComputeGPUs.push_back({i, props.limits.maxComputeSharedMemorySize});
                                    break;
                                }
                            }
                            break;
                        }
                    }
                }
            }

            if (gComputeGPUs.size() > 0) {
                int bestComputeIndex = -1;
                uint32_t bestComputeMem = 0;
                for (auto &gpu : gComputeGPUs) {
                    if (gpu.idx != bestGpuIndex && gpu.mem > bestComputeMem) {
                        bestComputeIndex = gpu.idx;
                        bestComputeMem = gpu.mem;
                    }
                }

                // We have everything we need
                if (bestComputeIndex != -1) {
                    std::string str = "======> Compute GPU ";
                    str.append(pContext->mGpus[bestComputeIndex].mVk.mGpuProperties.properties.deviceName);
                    LOGF(LogLevel::eINFO, str.c_str());
                    str = "======> Graphic GPU ";
                    str.append(pContext->mGpus[bestGpuIndex].mVk.mGpuProperties.properties.deviceName);
                    LOGF(LogLevel::eINFO, str.c_str());

                    settings.pContext = pContext;

                    //  First the render gpu
                    settings.mGpuMode = GPU_MODE_SINGLE;
                    settings.mGpuIndex = bestGpuIndex;
                    initRenderer(GetName(), &settings, &pRenderer);
                    if (!pRenderer) return false;

                    //  Second compute one
                    settings.mGpuMode = GPU_MODE_UNLINKED;
                    settings.mGpuIndex = bestComputeIndex;
                    initRenderer(GetName(), &settings, &pCompute);
                    if (!pCompute) return false;
                }
            }
        }

        // Default init
        if (!pRenderer) {
            LOGF(LogLevel::eINFO, "======> Fallback to single GPU");
            initRenderer(GetName(), &settings, &pRenderer);
            if (!pRenderer) return false;
        }
        
        if (pCompute) addBuffer();

        return true;
}
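
As a side note, here is a stricter variant of the memory-type check above that I'm considering. It's my own helper (not a The Forge API) and also requires HOST_COHERENT, so CPU reads of data the GPU wrote wouldn't need an explicit invalidate.

#include <vulkan/vulkan.h>

// My own helper (sketch): true if the device exposes a memory type that is
// DEVICE_LOCAL + HOST_VISIBLE + HOST_COHERENT at the same time.
static bool HasCoherentSharedMemoryType(const VkPhysicalDeviceMemoryProperties& memProps)
{
    const VkMemoryPropertyFlags wanted = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
                                         VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                                         VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
    for (uint32_t j = 0; j < memProps.memoryTypeCount; ++j)
    {
        if ((memProps.memoryTypes[j].propertyFlags & wanted) == wanted)
            return true;
    }
    return false;
}
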
Shader, RootSignature, Pipeline, all good
void Compute::AddShaders() {
    ShaderLoadDesc desc = {};
    desc.mStages[0].pFileName = "double.comp";
    addShader(pCompute, &desc, &pComputeShader);
}

void Compute::RemoveShaders() {
    removeShader(pCompute, pComputeShader);
}

void Compute::AddRootSignatures() {
    RootSignatureDesc desc = { &pComputeShader, 1 };
    addRootSignature(pCompute, &desc, &pRootSignature);
};

void Compute::RemoveRootSignatures() {
    removeRootSignature(pCompute, pRootSignature);
}

void Compute::AddPipelines() {
    PipelineDesc pipelineDesc = {};
    pipelineDesc.pName = "ComputePipeline";
    pipelineDesc.mType = PIPELINE_TYPE_COMPUTE;
    ComputePipelineDesc& computePipelineSettings = pipelineDesc.mComputeDesc;
    computePipelineSettings.pShaderProgram = pComputeShader;
    computePipelineSettings.pRootSignature = pRootSignature;
    addPipeline(pCompute, &pipelineDesc, &pPipeline);
};

void Compute::RemovePipelines() {
    removePipeline(pCompute, pPipeline);
}
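
For context, here is a rough sketch of how I intend to bind the buffer and record the dispatch on the compute renderer. All names below (pComputeQueue, pComputeCmdPool, pComputeCmd, pDescriptorSet, RecordAndSubmitCompute) are mine, pComputeBuffer is the shared buffer discussed further down, GROUP_COUNT comes from the dispatch-size sketch above, and the exact struct fields may differ between The Forge versions, so please treat this as an outline rather than verified code.

Queue*         pComputeQueue   = nullptr;
CmdPool*       pComputeCmdPool = nullptr;
Cmd*           pComputeCmd     = nullptr;
DescriptorSet* pDescriptorSet  = nullptr;

void RecordAndSubmitCompute()
{
    // Compute queue created from the unlinked compute renderer.
    QueueDesc queueDesc = {};
    queueDesc.mType = QUEUE_TYPE_COMPUTE;
    addQueue(pCompute, &queueDesc, &pComputeQueue);

    CmdPoolDesc poolDesc = {};
    poolDesc.pQueue = pComputeQueue;
    addCmdPool(pCompute, &poolDesc, &pComputeCmdPool);

    CmdDesc cmdDesc = {};
    cmdDesc.pPool = pComputeCmdPool;
    addCmd(pCompute, &cmdDesc, &pComputeCmd);

    // One descriptor set pointing the shader's "myData" RWBuffer at our buffer.
    DescriptorSetDesc setDesc = { pRootSignature, DESCRIPTOR_UPDATE_FREQ_NONE, 1 };
    addDescriptorSet(pCompute, &setDesc, &pDescriptorSet);

    DescriptorData params[1] = {};
    params[0].pName = "myData"; // must match the RES() name in double.comp.fsl
    params[0].ppBuffers = &pComputeBuffer;
    updateDescriptorSet(pCompute, 0, pDescriptorSet, 1, params);

    // Record a single dispatch covering the whole buffer.
    beginCmd(pComputeCmd);
    cmdBindPipeline(pComputeCmd, pPipeline);
    cmdBindDescriptorSet(pComputeCmd, 0, pDescriptorSet);
    cmdDispatch(pComputeCmd, GROUP_COUNT, 1, 1);
    endCmd(pComputeCmd);

    // Submit and simply wait for the queue to go idle before reading on the CPU.
    QueueSubmitDesc submitDesc = {};
    submitDesc.mCmdCount = 1;
    submitDesc.ppCmds = &pComputeCmd;
    queueSubmit(pComputeQueue, &submitDesc);
    waitQueueIdle(pComputeQueue);
}
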

Now the part that I think I'm getting wrong: I try to create a GPU buffer backed by existing memory 🤔

addBuffer()
// Taken from the The Forge renderer
DECLARE_RENDERER_FUNCTION(void, addBuffer, Renderer* pCompute, const BufferDesc* pDesc, Buffer** pp_buffer)
DECLARE_RENDERER_FUNCTION(void, removeBuffer, Renderer* pCompute, Buffer* pBuffer)

Buffer* buff = nullptr;
std::vector<float> bufferData(100, 1.f); // renamed so it doesn't clash with the Buffer* "buff" above
uint64_t totalSize = 100 * sizeof(float);

void addBuffer() {
    BufferLoadDesc bDesc = {};
    bDesc.mDesc.mDescriptors = DESCRIPTOR_TYPE_VERTEX_BUFFER | (DESCRIPTOR_TYPE_BUFFER_RAW | DESCRIPTOR_TYPE_RW_BUFFER_RAW);
    bDesc.mDesc.mSize = totalSize;
    bDesc.pData = bufferData.data();

    ResourceSizeAlign rsa = {};
    getResourceSizeAlign(&bDesc, &rsa);
    ResourceHeap* pHeap;

    ResourceHeapDesc desc = {};
    desc.mDescriptors = DESCRIPTOR_TYPE_BUFFER | (DESCRIPTOR_TYPE_BUFFER_RAW | DESCRIPTOR_TYPE_RW_BUFFER_RAW);
    desc.mFlags = RESOURCE_HEAP_FLAG_SHARED;

    desc.mAlignment = rsa.mAlignment;
    desc.mSize = totalSize;
    addResourceHeap(pCompute, &desc, &pHeap);

    ResourcePlacement placement{pHeap};

    BufferDesc buffDesc = {};
    buffDesc.pName = "SharedBuffer";
    buffDesc.mFlags = BUFFER_CREATION_FLAG_HOST_VISIBLE | BUFFER_CREATION_FLAG_HOST_COHERENT;
    buffDesc.mSize = totalSize;
    buffDesc.pPlacement = &placement;
    buffDesc.mFormat = TinyImageFormat_R32_SFLOAT;
    buffDesc.mDescriptors = DESCRIPTOR_TYPE_RW_BUFFER_RAW;
    buffDesc.mNodeIndex = pCompute->mUnlinkedRendererIndex;
    addBuffer(pCompute, &buffDesc, &buff);
}

I would really appreciate some help getting a simple example working 🙏 Thanks in advance 🙏

Hideman85 (Author) commented:

In the end, I found the right way to do it, as follows:

SyncToken token = {};
BufferLoadDesc desc = {};
desc.mDesc.mDescriptors = DESCRIPTOR_TYPE_RW_BUFFER_RAW;
desc.mDesc.mFlags = BUFFER_CREATION_FLAG_PERSISTENT_MAP_BIT | BUFFER_CREATION_FLAG_HOST_VISIBLE | BUFFER_CREATION_FLAG_HOST_COHERENT;
desc.mDesc.mMemoryUsage = RESOURCE_MEMORY_USAGE_GPU_TO_CPU;
desc.mDesc.mStartState = RESOURCE_STATE_SHADER_RESOURCE;
desc.mDesc.mFormat = TinyImageFormat_R32_SFLOAT;
desc.mDesc.mSize = NB_ELEMENTS * sizeof(float);
desc.mDesc.mElementCount = NB_ELEMENTS;
desc.mDesc.mStructStride = sizeof(float);
desc.mDesc.mNodeIndex = pCompute->mUnlinkedRendererIndex;
desc.ppBuffer = &pComputeBuffer;
addResource(&desc, &token);
waitForToken(&token);
float* data = (float*)pComputeBuffer->pCpuMappedAddress;

The rest is already above.
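
For completeness, here is a small sketch of how I use the persistently mapped pointer around the dispatch. RecordAndSubmitCompute is the hypothetical helper sketched earlier; since the memory is HOST_COHERENT no explicit flush/invalidate should be needed, but the compute submission still has to complete before reading.

// Sketch: fill the shared buffer from the CPU, run the compute dispatch,
// then read the doubled values back through the same mapped pointer.
void RunDoubleTest()
{
    float* data = (float*)pComputeBuffer->pCpuMappedAddress;

    // CPU write, directly visible to the iGPU (no staging/upload needed).
    for (uint32_t i = 0; i < NB_ELEMENTS; ++i)
        data[i] = 1.0f;

    // Hypothetical helper from the sketch above: records the dispatch,
    // submits it on the compute queue and waits for it to finish.
    RecordAndSubmitCompute();

    // CPU read of the shader's results: every element should now be 2.0f.
    for (uint32_t i = 0; i < NB_ELEMENTS; ++i)
    {
        if (data[i] != 2.0f)
            LOGF(LogLevel::eERROR, "Unexpected value at index %u: %f", i, data[i]);
    }
}
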
