Implementation with Java containers #11

maykov · 2021-12-08T04:57:55Z

    I'm assuming that each table contains n rows. The problem description doesn't say anything about
    table sizes. In the real world, it's possible to implement multiple strategies and then apply them depending on the data
    characteristics.

    The join condition is a < b+c and has two sides. The left side contains n rows, while the right side contains n*n rows.
    I'm using the following approach:

    1. Compute t2 join t3 in RAM. (Store only {b+c, y*z}.) Sort by b+c, iterate from the top and keep the running
    sum. Replace y*z with sum(y*z) over all b,c such that b+c > the given b+c

    2. Read t1 and store it in a hashmap with a as the key. For each value of a, look up the b+c values which are greater than a and
    sum the precomputed xyz. Store these sums in the hashtable along with the row number of the first occurrence of that value of a

    3. Find top 10 in the hashtable using a sorted container. Use row number from the original table to break ties.
    
    Since this approach stores the entire t2 join t3 result in RAM and sorts it, the space complexity is O(n*n). The time
    complexity is O(n*n*log n)

    More optimization opportunities:
    I use standard Java containers here to store objects which contain either 2 doubles or a double and an integer.
    This incurs a pretty significant cost in object pointers and pointer indirection. It is also not
    good for the CPU cache. We should look into using a different language or maybe some sort of native
    memory allocation technique

    I'm pretty sure many operations here could be parallelized, maybe by using the streams library. It could also
    be done manually.
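Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the code from the commits. It assumes each table is a `double[][]` with rows `{a, x}`, `{b, y}` and `{c, z}`, that the result is grouped by a, and that a sorted list stands in for the sorted container. The names `JoinSketch`, `topK` and `upperBound` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // First index i with sorted[i] > a (array sorted ascending); sorted.length if none.
    static int upperBound(double[] sorted, double a) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] > a) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    // Returns up to k rows {a, sum(x*y*z)}, largest sum first, ties broken by first row number of a.
    static double[][] topK(double[][] t1, double[][] t2, double[][] t3, int k) {
        // Step 1: materialize t2 join t3 as {b+c, y*z}. This is the O(n*n) memory part.
        int m = t2.length * t3.length;
        double[][] pairs = new double[m][];
        int p = 0;
        for (double[] r2 : t2) {
            for (double[] r3 : t3) {
                pairs[p++] = new double[] {r2[0] + r3[0], r2[1] * r3[1]};
            }
        }
        Arrays.sort(pairs, Comparator.comparingDouble(pr -> pr[0]));
        // Walk from the top: suffix[i] = sum of y*z over pairs[i..m-1], suffix[m] = 0.
        double[] keys = new double[m];
        double[] suffix = new double[m + 1];
        for (int i = m - 1; i >= 0; i--) {
            keys[i] = pairs[i][0];
            suffix[i] = suffix[i + 1] + pairs[i][1];
        }
        // Step 2: group t1 by a, keeping sum(x) and the first row number for each a.
        Map<Double, double[]> byA = new LinkedHashMap<>();
        for (int row = 0; row < t1.length; row++) {
            final int firstRow = row;
            byA.computeIfAbsent(t1[row][0], key -> new double[] {0.0, firstRow})[0] += t1[row][1];
        }
        List<double[]> results = new ArrayList<>(); // rows {a, total, firstRow}
        for (Map.Entry<Double, double[]> e : byA.entrySet()) {
            double a = e.getKey();
            double total = e.getValue()[0] * suffix[upperBound(keys, a)];
            results.add(new double[] {a, total, e.getValue()[1]});
        }
        // Step 3: order by total descending, then by first row number.
        results.sort(Comparator.comparingDouble((double[] res) -> res[1]).reversed()
                .thenComparingDouble(res -> res[2]));
        int n = Math.min(k, results.size());
        double[][] top = new double[n][];
        for (int i = 0; i < n; i++) {
            top[i] = new double[] {results.get(i)[0], results.get(i)[1]};
        }
        return top;
    }
}
```

For example, with t1 = {{1,2},{5,1},{1,3}}, t2 = {{1,1},{2,2}} and t3 = {{0,1},{3,10}}, the sketch returns {1, 160} followed by {5, 0}.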

avavilau · 2021-12-10T10:23:59Z

@maykov can you think of a faster solution that does not suffer from a potential OOM?

maykov · 2021-12-12T07:35:38Z

Sorry, I didn't see this message.

Here is one potential way:

Read t1 into RAM and sort by a. Read t2 into RAM. While reading t3, update t1 in RAM with SUM(y*z) for all a such that a<b+c.
This approach won't require O(n*n) RAM, only O(n). However, execution time will be O(n*n*n). (For each b+c, update all a<b+c)
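A rough sketch of this O(n)-memory variant, assuming t3 is the table that is streamed row by row and that rows are laid out as `{a, x}`, `{b, y}` and `{c, z}`. The class and method names are hypothetical, and the grouping and top 10 steps are left out.

```java
import java.util.Arrays;
import java.util.Comparator;

class StreamingJoinSketch {
    // Returns rows {a, x, sum(y*z)} sorted by a, where the sum covers all (b, c) pairs with a < b+c.
    static double[][] accumulate(double[][] t1, double[][] t2, Iterable<double[]> t3Rows) {
        // Copy t1 with an extra accumulator column and sort it by a: O(n) memory.
        double[][] rows = new double[t1.length][];
        for (int i = 0; i < t1.length; i++) {
            rows[i] = new double[] {t1[i][0], t1[i][1], 0.0};
        }
        Arrays.sort(rows, Comparator.comparingDouble(r -> r[0]));
        // t3 is consumed one row at a time and never stored.
        for (double[] r3 : t3Rows) {
            for (double[] r2 : t2) {
                double key = r2[0] + r3[0]; // b + c
                double yz = r2[1] * r3[1];  // y * z
                // rows is sorted by a, so the scan can stop at the first a >= b+c.
                // This inner loop is what makes the total time O(n*n*n).
                for (int i = 0; i < rows.length && rows[i][0] < key; i++) {
                    rows[i][2] += yz;
                }
            }
        }
        return rows;
    }
}
```

Each output row still needs x multiplied by its accumulated sum(y*z) before grouping by a and picking the top 10.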

Another idea could be to replace all ArrayLists and such with double[][].
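To illustrate what that could look like, here is a small sketch that copies boxed columns into one primitive `double[]` per column, so the values sit contiguously in memory without a wrapper object per element. The column layout (`cols[0]` for b+c, `cols[1]` for y*z) and the names are assumptions for the example, not what the PR code does.

```java
import java.util.List;

class PrimitiveColumnsSketch {
    // Copies two boxed columns into primitive arrays: cols[0] = keys, cols[1] = values.
    static double[][] toColumns(List<Double> keys, List<Double> values) {
        double[][] cols = new double[2][];
        cols[0] = keys.stream().mapToDouble(Double::doubleValue).toArray();
        cols[1] = values.stream().mapToDouble(Double::doubleValue).toArray();
        return cols;
    }

    // Sums the values whose key is greater than a by scanning the primitive arrays directly.
    static double sumWhereKeyGreater(double[][] cols, double a) {
        double sum = 0.0;
        for (int i = 0; i < cols[0].length; i++) {
            if (cols[0][i] > a) {
                sum += cols[1][i];
            }
        }
        return sum;
    }
}
```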

Should I implement any of these?

So far, I'm not seeing a way to reduce runtime to less than O(n*n*log(n)). This is due to the fact that the join is on a<b+c, but the top 10 is on sum(x*y*z). Is there a way? I will keep thinking.

avavilau · 2021-12-16T19:57:50Z

@maykov any progress?

maykov · 2021-12-21T19:52:09Z

Hi @avavilau, I need to brainstorm with you before going forward with a solution. Which approach should work best:

1. Replace all ArrayLists and such with double[][]?
2. Do not create t2 join t3 in RAM, but read t1 and t2 and update t1 with sum(y*z) while reading t3.

Alexey Maykov added 2 commits December 7, 2021 18:54

nitial impl

833b9cb

added comment

a076c01
