Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft flow PR #127

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from
Draft

Draft flow PR #127

wants to merge 8 commits into from

Conversation

lyriccoder
Copy link
Member

@lyriccoder lyriccoder commented Feb 4, 2021

  1. d6tflow can't run in parallel child tasks

  2. it can save results in a task only 1 time (child task can be only run synchronously). So if the task saves smth, it is completed. We have to collect all results from child tasks and then pass large collection to the next task. And do it every time from task to task

  3. Since we have to pass large collection we will keep in memory files and it's metadata.
    We can keep only metadata of all files, and then open each files in a separate thread and then find EM, Target method and it's invocation. But it is very ugly and take too much time

  4. I have to use joblib since ProcessPool cannot serialize some AST objects

You can run program as python3 main.py --dir /mnt/d/temp/dataset/01/ --output output

@lyriccoder lyriccoder marked this pull request as draft February 4, 2021 14:44
Copy link
Member

@KatGarmash KatGarmash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please remove all the commented code?

if not files_without_tests:
raise Exception("No java files were found")

full_dataset_folder = Path(self.dir_to_save) / 'full_dataset'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long method. the part with directory handling could be extracted, for example

for filename in tqdm(files_without_tests):
try:
text = next(result)
if text:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

under what condition can a text be None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When result is empty, empty string, when the file consists only of comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when can a result be empty, beside the case when soure code is empty ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we pass empty file, i believe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u leave a comment about it, maybe?

csv = self.inputLoad()['csv']
rows = [x for _, x in csv.iterrows()]

with Parallel(n_jobs=2, require='sharedmem') as parallel:
Copy link
Member

@KatGarmash KatGarmash Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why n_jobs=2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just a template, we can fix in future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u use self.system_cores_qty, like u did in the preprocess task?


with Parallel(n_jobs=2, require='sharedmem') as parallel:
results = parallel((delayed(self._find_EMs)(a) for a in rows))
self.save({"data": [x for x in results if x]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the type of x?

Copy link
Member Author

@lyriccoder lyriccoder Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result of _find_EMs function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u type-annotate the _find_EMs function then?

def _find_EMs(self, row):
result_dict = {}
try:
ast = AST.build_from_javalang(build_ast(row['original_filename']))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long method. this whole routine about extracting method declrations could be put into a separate method

target_node = ast.get_subtree(method_node)
for method_invoked in target_node.get_proxy_nodes(
ASTNodeType.METHOD_INVOCATION):
extracted_m_decl = method_declarations.get(method_invoked.member, [])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is extracted_m_decl ? what is its type?

dir_to_save = d6tflow.Parameter()
system_cores_qty = d6tflow.IntParameter()

def _find_EMs(self, row):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u rename row to a bit more informative? what's in the row?

dir_to_save=args.output,
system_cores_qty=args.jobs
))
data = TaskFindEM(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task TaskFindEM doesn't look like the final step. or is it?
where do u write the final dataset to disk?

Copy link
Member

@KatGarmash KatGarmash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see inline comments

@lyriccoder lyriccoder requested a review from acheshkov February 8, 2021 07:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants