[REVIEW] Add tfidf bm25 #2353

Open
wants to merge 123 commits into
base: branch-25.02

jperez999

This PR adds support for TF-IDF and BM25 preprocessing of sparse matrices. It does not require the user to work within the confines of a COO or CSR matrix; it only requires the data triplets (row, column, value). With this information, we can preprocess the values accordingly. I'm putting this up to get eyes on it and to confirm it is going in the right direction, or to adjust if it is not.

Unit tests are still required for these features.
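
For readers unfamiliar with the two weightings, here is a minimal host-side sketch of how TF-IDF and BM25 can be computed directly from (row, column, value) triplets. It uses the textbook formulas, which may differ from what this PR implements. The `Triplets` struct, the function names, the reading of rows as documents and columns as terms, and the default `k1` and `b` values are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the triplets; not a RAFT type.
struct Triplets {
  std::vector<int> rows;      // assumed: document id of each entry
  std::vector<int> cols;      // assumed: term id of each entry
  std::vector<float> values;  // assumed: raw term count of each entry
};

// Per-term document frequency: number of entries that reference each column.
// Assumes there are no duplicate (row, col) pairs.
std::vector<int> document_frequency(const Triplets& t, int n_terms)
{
  std::vector<int> df(n_terms, 0);
  for (int c : t.cols) df[c] += 1;
  return df;
}

// Textbook TF-IDF: tf * log(N / df).
std::vector<float> tfidf_weights(const Triplets& t, int n_docs, int n_terms)
{
  const std::vector<int> df = document_frequency(t, n_terms);
  std::vector<float> out(t.values.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    const float idf = std::log(static_cast<float>(n_docs) / static_cast<float>(df[t.cols[i]]));
    out[i] = t.values[i] * idf;
  }
  return out;
}

// Textbook Okapi BM25 with free parameters k1 and b.
std::vector<float> bm25_weights(const Triplets& t, int n_docs, int n_terms,
                                float k1 = 1.6f, float b = 0.75f)
{
  const std::vector<int> df = document_frequency(t, n_terms);
  std::vector<float> doc_len(n_docs, 0.0f);  // sum of counts per document
  float total_len = 0.0f;
  for (std::size_t i = 0; i < t.values.size(); ++i) {
    doc_len[t.rows[i]] += t.values[i];
    total_len += t.values[i];
  }
  const float avg_len = total_len / static_cast<float>(n_docs);

  std::vector<float> out(t.values.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    const float tf   = t.values[i];
    const float d    = static_cast<float>(df[t.cols[i]]);
    const float idf  = std::log((static_cast<float>(n_docs) - d + 0.5f) / (d + 0.5f) + 1.0f);
    const float norm = k1 * (1.0f - b + b * doc_len[t.rows[i]] / avg_len);
    out[i] = idf * tf * (k1 + 1.0f) / (tf + norm);
  }
  return out;
}
```

Both functions return one weight per input triplet, in the same order, so the result can replace the values array without touching the row and column arrays.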

ajschmidt8 and others added 30 commits July 14, 2020 17:05
[skip ci] Update master references for main branch
[HOTFIX] Remove `-g` from cython compile commands
Our `devel` Docker containers need to be switched to using `conda` compilers to resolve a linking error. `raft` is in those containers, but hasn't yet been built with `conda` compilers. This PR addresses that.

These changes won't cleanly merge into `branch-22.08` unfortunately due to the changes in rapidsai#641, but we can address that another time.

Authors:
   - AJ Schmidt (https://github.com/ajschmidt8)
   - Corey J. Nolet (https://github.com/cjnolet)
   - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
   - Corey J. Nolet (https://github.com/cjnolet)
@shwina I'm going to apologize ahead of time for this, but I was trying to forward-merge your branch 22.10 locally to create a new PR from it and I accidentally pushed to your remote branch. I cherry-picked the commits over to a new branch for the hotfix.

Authors:
   - Bradley Dice (https://github.com/bdice)
   - Ashwin Srinath (https://github.com/shwina)

Approvers:
   - Ray Douglass (https://github.com/raydouglass)
[RELEASE] raft v22.12.01 [skip-gpuci]
@github-actions github-actions bot removed the python label Oct 30, 2024
SparseKNNInputs<value_idx, value_t> params;
};

const std::vector<SparseKNNInputs<int, float>> inputs_i32_f = {
Member

@rhdong rhdong Oct 30, 2024

Would it be possible to add some test cases that generate random CSR matrices instead of hardcoding them? Just a suggestion.

Member

Oh boy, I thought this had been updated/fixed. We definitely don't want to be hardcoding these.

Author

So, those hardcoded values come from the test that was already in place. Take a look at what is currently available in main: https://github.com/rapidsai/raft/blob/branch-24.12/cpp/test/sparse/neighbors/brute_force.cu. I felt it was acceptable to keep using those hardcoded values, because the point of these tests is not to verify that brute_force works correctly; it is to check that the new interfaces I created for COO and CSR work correctly. If you want me to change this, that is fine, but I think it means I would need to create a kernel for it, and I don't believe that is the goal of this PR. Let me know if you think I need to make this change in this PR @cjnolet @rhdong

Member

It's up to you. Brute force (and the other neighbor algorithms) has been moved to cuVS, and the test cases there may be more solid. Just for your reference: https://github.com/rapidsai/cuvs/blob/branch-24.12/cpp/test/neighbors/brute_force.cu#L469

@@ -103,4 +106,171 @@ void brute_force_knn(const value_idx* idxIndptr,
metricArg);
}

/**
* Search the sparse kNN for the k-nearest neighbors of a set of sparse query vectors
* using some distance implementation
Member

We should add doc comments for the template parameters.

float metricArg = 0)
{
cudaStream_t stream = raft::resource::get_cuda_stream(handle);

Member

@rhdong rhdong Oct 30, 2024

Maybe we could add a check for zero-size idx and query inputs, though this should rarely happen. (Note that the following code includes size() - 1 logic.)

Author

@rhdong do you think I should raise an error or just return before performing bfknn?

Member

That depends on the intended semantics: if zero-size input is a normal case, return directly and leave the outputs unchanged; otherwise, you could use RAFT_EXPECTS to notify the caller.
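
The two options discussed here can be sketched on the host as follows. This is a standalone illustration, not the PR's code: `expects`, `csr_n_rows`, and `has_rows` are hypothetical names, and `expects` is only a stand-in so that the sketch compiles without RAFT (in RAFT itself the check would presumably go through RAFT_EXPECTS, as suggested above).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an argument check that notifies the caller.
inline void expects(bool condition, const char* message)
{
  if (!condition) { throw std::invalid_argument(message); }
}

// Option 1: fail loudly. A CSR indptr array has n_rows + 1 entries, so an
// empty array is invalid. Without the guard, size() - 1 on an empty vector
// wraps around to a very large unsigned value.
std::size_t csr_n_rows(const std::vector<int>& indptr)
{
  expects(!indptr.empty(), "indptr must contain at least one element");
  return indptr.size() - 1;
}

// Option 2: treat empty input as a no-op. The caller returns early when this
// is false and leaves the outputs untouched.
bool has_rows(const std::vector<int>& indptr) { return indptr.size() > 1; }
```

Which option fits depends on whether an empty index or query set is a legitimate input for callers of the sparse kNN interface.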

auto host_matrix = raft::make_host_matrix<T2, int64_t>(handle, num_rows, num_cols);
raft::copy(host_matrix.data_handle(), device_matrix.data_handle(), device_matrix.size(), stream);

for (int i = 0; i < elements_size; i++) {
Member

I'm unclear on the primary objective of this logic, but just a heads-up: unless you explicitly sync on the stream before this line, we can't assume host_matrix holds the same values as device_matrix.

Author

What is happening here is that I am loading the incoming host data into a dense matrix. So we represent the COO arrays as a dense matrix, and then I copy that dense matrix from host memory to GPU memory. Before that line, I expect both the host and device matrices to be zero-filled. I did it this way to use raft APIs as much as possible. A for loop over host memory did not seem like the most efficient way to fill a matrix.
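
As a point of reference, the host-side step described above (representing COO arrays as a zero-filled dense matrix) could look like the sketch below. It is plain C++ with a hypothetical function name and a row-major layout chosen for illustration; the transfer to the GPU and the stream synchronization raised in the review are deliberately not shown.

```cpp
#include <cstddef>
#include <vector>

// Scatter COO triplets into a zero-filled, row-major dense matrix on the host.
// Entries not named by any triplet stay at zero.
std::vector<float> coo_to_dense(const std::vector<int>& rows,
                                const std::vector<int>& cols,
                                const std::vector<float>& vals,
                                int n_rows,
                                int n_cols)
{
  std::vector<float> dense(static_cast<std::size_t>(n_rows) * n_cols, 0.0f);
  for (std::size_t i = 0; i < vals.size(); ++i) {
    const std::size_t offset = static_cast<std::size_t>(rows[i]) * n_cols + cols[i];
    dense[offset] = vals[i];
  }
  return dense;
}
```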

* @param csr_in: Input CSR matrix
* @param values_out: Output values array
*/
template <typename T1, typename T2, typename IdxT>
Member

The T1, T2 might be a bit unclear; feel free to rename them to something more meaningful if you prefer.

using SparsePreprocessBm25Csr = SparsePreprocessCSR<float, int>;
TEST_P(SparsePreprocessBm25Csr, Result) { Run(true); }

const std::vector<SparsePreprocessInputs<float, int>> sparse_preprocess_inputs = {
Member

To be conservative and ensure that there are no surprises after merging, it is best to add some use cases for larger matrices.

Author

Larger random matrices are difficult to guarantee, because rmat currently creates many duplicates during edge creation, which results in far fewer edges than anticipated. I think in its current form it would be misleading. I can definitely pass much bigger parameters to RMAT, but I don't think the end result will be what we expect. We first need a function that creates an RMAT, removes duplicates, and keeps looping through this logic until we have the desired number of edges with no duplicates. This is outside the purview of this PR, IMO. How do you feel about it @cjnolet?
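
The generate, deduplicate, and repeat loop described above can be sketched on the host like this. Note the substitution: uniform random sampling stands in for RMAT, so this is not the RMAT generator, and the function name and signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <utility>
#include <vector>

// Draw random (row, col) pairs until target_nnz distinct pairs exist.
// Uniform sampling is used here as a stand-in for RMAT edge generation.
std::vector<std::pair<int, int>> unique_random_edges(int n_rows,
                                                     int n_cols,
                                                     std::size_t target_nnz,
                                                     std::uint32_t seed)
{
  const std::size_t capacity =
    static_cast<std::size_t>(n_rows) * static_cast<std::size_t>(n_cols);
  if (capacity == 0) { return {}; }
  // A matrix cannot hold more distinct entries than it has cells.
  if (target_nnz > capacity) { target_nnz = capacity; }

  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> row_dist(0, n_rows - 1);
  std::uniform_int_distribution<int> col_dist(0, n_cols - 1);

  // std::set drops duplicates on insert and keeps pairs ordered by (row, col).
  std::set<std::pair<int, int>> seen;
  while (seen.size() < target_nnz) {
    seen.insert({row_dist(gen), col_dist(gen)});
  }
  return std::vector<std::pair<int, int>>(seen.begin(), seen.end());
}
```

Because the pairs come out sorted by row and then column, the result can be fed to a sorted-COO-to-CSR conversion without a separate sort. The loop slows down as the requested density approaches a full matrix, since most draws are then duplicates.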

values_nnz.view(),
num_rows);
auto rows_csr = raft::make_device_vector<Index_, int64_t>(handle, non_dupe_nnz_count);
raft::sparse::convert::sorted_coo_to_csr(
Author

With larger matrices this seems to be failing with invalid writes:

========= COMPUTE-SANITIZER
Running main() from /raid/workspace/raft/cpp/build/_deps/gtest-src/googletest/src/gtest_main.cc
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SparsePreprocessCSR/SparsePreprocessTfidfCsr
[ RUN      ] SparsePreprocessCSR/SparsePreprocessTfidfCsr.Result/0
========= Invalid __global__ write of size 4 bytes
=========     at void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void>>::Policy900, thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int *>, int, int>(T2, T3, T4, int, T5, T6, T7)+0x1b30
=========     by thread (8,0,0) in block (0,0,0)
=========     Address 0x7c6486200e20 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7c6486200e00 of size 32 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2b76ef]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x15bc3]
=========                in /opt/conda/envs/rapids_raft/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel_ptsz [0x54aa0]
=========                in /opt/conda/envs/rapids_raft/lib/libcudart.so.12
=========     Host Frame:void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void> >::Policy900, thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int, int>(thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int) [0x2f10a]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:cudaError thrust::cuda_cub::launcher::triple_chevron::doit_host<void (*)(thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int), thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int>(void (*)(thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int), thrust::device_ptr<int> const&, thrust::device_ptr<int> const&, cub::CUB_200500_890_NS::ScanTileState<int, true> const&, int const&, thrust::plus<void> const&, cub::CUB_200500_890_NS::detail::InputValue<int, int*> const&, int const&) const [clone .isra.0] [0x180bb]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:thrust::device_ptr<int> thrust::cuda_cub::detail::exclusive_scan_n_impl<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, thrust::device_ptr<int>, long, thrust::device_ptr<int>, int, thrust::plus<void> >(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >&, thrust::device_ptr<int>, long, thrust::device_ptr<int>, int, thrust::plus<void>) [clone .isra.0] [0x292d8]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:void raft::sparse::convert::detail::sorted_coo_to_csr<int>(int const*, int, int*, int, CUstream_st*) [0x5dd30]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:raft::sparse::SparsePreprocessCSR<float, int>::Run(bool) [0x6a8cc]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [0x9403d]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::Test::Run() [0x94310]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::TestInfo::Run() [0x946d6]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::TestSuite::Run() [0x94e13]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::internal::UnitTestImpl::RunAllTests() [0x9a88d]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::UnitTest::Run() [0x949d8]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:main [0x1774e]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame: [0x29d8f]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x29e3f]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x177b1]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
========= 

Member

@rhdong rhdong Dec 11, 2024

I noticed this part of the output, which might help in resolving the error (maybe we can check whether the allocated memory size is correct and whether the indices contain any invalid values):

========= Invalid __global__ write of size 4 bytes
=========     at void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void>>::Policy900, thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int *>, int, int>(T2, T3, T4, int, T5, T6, T7)+0x1b30
=========     by thread (8,0,0) in block (0,0,0)
=========     Address 0x7c6486200e20 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7c6486200e00 of size 32 bytes
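
Reading the addresses in this excerpt: 0x7c6486200e20 minus 0x7c6486200e00 is 0x20, i.e. 32 bytes, so the 4-byte write by thread 8 lands at element index 8 of a buffer that holds only eight `int` values. One possible explanation, not confirmed anywhere in this thread, is an output buffer whose length was taken from the nonzero count while the scan runs over the number of rows. The host-side sketch below shows the textbook relationship between the two sizes; the function name is hypothetical, and the buffer-size contract of `sorted_coo_to_csr` itself should be checked against its documentation.

```cpp
#include <cstddef>
#include <vector>

// Build CSR row offsets from row-sorted COO row indices.
// The output length depends only on n_rows (n_rows + 1 entries here),
// never on the number of nonzeros.
std::vector<int> sorted_coo_rows_to_indptr(const std::vector<int>& sorted_rows, int n_rows)
{
  std::vector<int> indptr(static_cast<std::size_t>(n_rows) + 1, 0);
  // Count the entries in each row, shifted by one slot.
  for (int r : sorted_rows) { indptr[r + 1] += 1; }
  // Running sum turns the counts into offsets.
  for (int r = 0; r < n_rows; ++r) { indptr[r + 1] += indptr[r]; }
  return indptr;
}
```

With three nonzeros spread over four rows, for example, the offsets array has five entries, so sizing it by the nonzero count would be too small.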


raft::util::create_dataset<Index_, Type_f>(
handle, rows.view(), columns.view(), values.view(), 5, params.n_rows, params.n_cols);
int non_dupe_nnz_count = raft::util::get_dupe_mask_count<Index_, Type_f>(
Member

Declaring the non_dupe_nnz_count as int64_t might be safer since it is used as int64_t in the following code.

@cjnolet cjnolet changed the base branch from branch-24.12 to branch-25.02 December 11, 2024 22:50
Status: In Progress
10 participants