Implementation with Java containers #11

Open: wants to merge 2 commits into base: main


@maykov maykov commented Dec 8, 2021

    I'm assuming that each table contains n rows. The problem description doesn't say anything about
    table sizes. In the real world, it's possible to implement multiple strategies and then choose among them
    depending on the data characteristics.

    The join condition a < b+c has two sides. The left side contains n rows while the right side contains n*n rows.
    I'm using the following approach:

    1. Compute t2 join t3 in RAM (store only {b+c, y*z}). Sort by b+c, iterate from the top, and keep a running
    sum. Replace y*z with sum(y*z) over all b, c such that b+c > the given b+c.

    2. Read t1 and store it in a hashmap with a as the key. For each value of a, look up the b+c values that are
    greater than a and sum the precomputed x*y*z. Store these sums in the hashtable along with the row number of
    the first row that has that value of a.

    3. Find the top 10 in the hashtable using a sorted container. Use the row number from the original table to break ties.
    
    Since this approach stores the entire t2 join t3 result in RAM and sorts it, the space complexity is O(n*n).
    The time complexity is O(n*n*log n).
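
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the row layouts (`t1` rows as `{a, x}`, `t2` rows as `{b, y}`, `t3` rows as `{c, z}`), the class name `SuffixSumJoin`, and the method names are all hypothetical. It also assumes that the running sum includes the current entry and that the lookup finds the first b+c strictly greater than a. Ties are broken by first-seen order of a via a `LinkedHashMap` and a stable sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SuffixSumJoin {
    /** Returns up to k entries {a, total}, largest total first; ties keep first-seen order of a. */
    static List<double[]> topK(double[][] t1, double[][] t2, double[][] t3, int k) {
        // Step 1: materialize t2 x t3 as {b+c, y*z} and sort ascending by b+c.
        double[][] pairs = new double[t2.length * t3.length][];
        int p = 0;
        for (double[] r2 : t2) {
            for (double[] r3 : t3) {
                pairs[p++] = new double[] {r2[0] + r3[0], r2[1] * r3[1]};
            }
        }
        Arrays.sort(pairs, Comparator.comparingDouble(r -> r[0]));

        // Walk from the largest b+c down; suffix[i] = sum of y*z over pairs[i..end].
        double[] keys = new double[pairs.length];
        double[] suffix = new double[pairs.length + 1];
        for (int i = pairs.length - 1; i >= 0; i--) {
            keys[i] = pairs[i][0];
            suffix[i] = suffix[i + 1] + pairs[i][1];
        }

        // Step 2: group t1 by a (summing x), preserving first-seen order of a.
        Map<Double, Double> xByA = new LinkedHashMap<>();
        for (double[] r1 : t1) {
            xByA.merge(r1[0], r1[1], Double::sum);
        }
        List<double[]> totals = new ArrayList<>();
        for (Map.Entry<Double, Double> e : xByA.entrySet()) {
            int first = firstGreater(keys, e.getKey());
            totals.add(new double[] {e.getKey(), e.getValue() * suffix[first]});
        }

        // Step 3: stable sort by total, descending, so ties keep first-seen order.
        totals.sort((l, r) -> Double.compare(r[1], l[1]));
        return totals.subList(0, Math.min(k, totals.size()));
    }

    /** Index of the first key strictly greater than a, or keys.length if there is none. */
    static int firstGreater(double[] keys, double a) {
        int lo = 0;
        int hi = keys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid] > a) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }
}
```

The `pairs` array is what drives the O(n*n) memory use: it holds one entry per (t2 row, t3 row) combination.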

    More optimization opportunities:
    I use standard Java containers here to store objects which contain either 2 doubles or a double and an integer.
    This incurs a significant cost in object pointers and pointer indirection. It is also not
    good for CPU cache locality. We should look into using a different language or maybe some sort of native
    memory allocation technique.

    I'm pretty sure many operations here could be parallelized, maybe by using the streams library. It could also
    be done manually.
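
As one possible illustration of the parallelization idea, the sketch below builds and sorts the {b+c, y*z} pairs with a parallel `IntStream` and `Arrays.parallelSort`. The class name `ParallelPairs`, the method name, and the row layouts (`t2` rows as `{b, y}`, `t3` rows as `{c, z}`) are assumptions made for this example. Whether it is actually faster would need to be measured.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

final class ParallelPairs {
    /** Builds {b+c, y*z} for every (t2 row, t3 row) combination, sorted ascending by b+c. */
    static double[][] sortedPairs(double[][] t2, double[][] t3) {
        int m = t3.length;
        double[][] pairs = new double[t2.length * m][];
        // Each index i writes only its own slot, so the parallel fill needs no locking.
        IntStream.range(0, pairs.length).parallel().forEach(i -> {
            double[] r2 = t2[i / m];
            double[] r3 = t3[i % m];
            pairs[i] = new double[] {r2[0] + r3[0], r2[1] * r3[1]};
        });
        // parallelSort splits the array and sorts the pieces on the common fork/join pool.
        Arrays.parallelSort(pairs, Comparator.comparingDouble(r -> r[0]));
        return pairs;
    }
}
```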

@avavilau
Owner

@maykov can you think of a faster solution that does not suffer from potential OOM?

@maykov
Author

maykov commented Dec 12, 2021

Sorry, I didn't see this message.

Here is one potential way:

Read t1 into RAM and sort by a. Read t2 into RAM. While reading t3, update t1 in RAM with SUM(y*z) for all a < b+c.
This approach won't require O(n*n) RAM, only O(n). However, execution time will be O(n*n*n). (For each b+c, update all a < b+c.)
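
A rough sketch of this O(n)-memory idea, assuming `t1` rows are `{a, x}`, `t2` rows are `{b, y}`, and `t3` is streamed as `{c, z}` rows. The class name `StreamingJoin` and the method name are hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;

final class StreamingJoin {
    /**
     * Returns rows {a, x, sumYZ} sorted ascending by a, where sumYZ is the sum of y*z
     * over all (t2 row, t3 row) combinations with a < b+c.
     */
    static double[][] accumulate(double[][] t1, double[][] t2, Iterable<double[]> t3) {
        double[][] rows = new double[t1.length][];
        for (int i = 0; i < t1.length; i++) {
            rows[i] = new double[] {t1[i][0], t1[i][1], 0.0};
        }
        Arrays.sort(rows, Comparator.comparingDouble(r -> r[0]));

        // t3 is consumed one row at a time and never held in full.
        for (double[] r3 : t3) {
            for (double[] r2 : t2) {
                double key = r2[0] + r3[0];
                double yz = r2[1] * r3[1];
                // rows are sorted by a, so the rows with a < key form a prefix.
                for (int i = 0; i < rows.length && rows[i][0] < key; i++) {
                    rows[i][2] += yz;
                }
            }
        }
        return rows;
    }
}
```

The per-row total would then be `x * sumYZ`. The innermost loop is what makes the running time O(n*n*n): every one of the n*n combinations may touch up to n rows of t1.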

Another idea could be to replace all ArrayLists and such with double[][].

Should I implement any of these?

So far, I'm not seeing a way to reduce the runtime to less than O(n*n*log(n)). This is because the join is on a < b+c, but the top 10 is on sum(x*y*z). Is there a way? I will keep thinking.

@avavilau
Owner

@maykov any progress?

@maykov
Author

maykov commented Dec 21, 2021

Hi @avavilau, I need to brainstorm with you before going forward with a solution. Which approach would work best?

  • Replace all ArrayLists and such with double[][].
  • Do not create t2 join t3 in RAM, but read t1 and t2, then update t1 with sum(y*z) while reading t3.

2 participants