download internal hyperlinks #498

Open
AlirezaMalih opened this issue May 15, 2024 · 9 comments

Comments

@AlirezaMalih

Is there an access point or button to download all the internal hyperlinks as GEXF files at once?
For instance, I have 100 URLs to crawl, and for each of them I have to manually download its own GEXF file, which contains its internal hyperlinks. Is there a way to download the internal hyperlinks of all 100 URLs at once?
Thanks :)

@Yomguithereal
Member

I don't remember the UI being able to perform this, no. But this command line tool here (namely the hyphe dump command) should be able to do so: https://github.com/medialab/minet/blob/master/docs/cli.md#dump

@AlirezaMalih
Author

Thanks :) I tried the minet hyphe dump command and downloaded the CSV file, but it is not as complete as the GEXF export. I would have to convert it to GEXF myself, and the result would not match the original format, especially when I want to use it with NetworkX. Can I download the data in GEXF format?

@Yomguithereal
Member

> Can I download the data in GEXF format?

I don't think you can. But the CSV file should be very easy to load as a graph using networkx with some data wrangling. Otherwise, you can also try this tool: https://medialab.github.io/table2net/
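
As a rough illustration of the "data wrangling" mentioned above, here is a minimal sketch that turns an edge-list CSV into a GEXF file with networkx. The column names `source` and `target` and the sample URLs are assumptions made for this example, not the actual layout of the minet dump; the sketch only helps if the CSV really contains one row per link.

```python
import csv
import io


def edges_from_csv(text, source_col="source", target_col="target"):
    """Return (source, target) pairs from CSV text, skipping incomplete rows."""
    reader = csv.DictReader(io.StringIO(text))
    edges = []
    for row in reader:
        # Column names are assumptions; adapt them to the real CSV header.
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if source and target:
            edges.append((source, target))
    return edges


def write_gexf(edges, path):
    """Build a directed graph from the pairs and save it as GEXF."""
    # Imported here so the CSV helper above works without networkx installed.
    import networkx as nx

    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    nx.write_gexf(graph, path)
    return graph


# Made-up sample data; the last row has no target and is skipped.
sample = """source,target
https://example.org/,https://example.org/about
https://example.org/about,https://example.org/contact
https://example.org/contact,
"""

edges = edges_from_csv(sample)
print(edges)
```

Calling `write_gexf(edges, "pages.gexf")` would then produce a file that `networkx.read_gexf` or Gephi can open.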

@AlirezaMalih
Author

Merci. I tried to load the CSV with NetworkX, but the main problem is that the CSV file contains the source links that were crawled but not the targets, i.e. the links going from one web page to other web pages. So it only holds the web pages as nodes, without the paths or targets between them. Do you know how I can download them or include them in the CSV file? Thanks :)

@boogheta
Member

boogheta commented May 21, 2024

Hello there, I'm not sure I understand exactly what you need: are you trying to build a graph of links between webpages rather than websites?
If that's the case, you should build a corpus with the default creation rule set to page in the Settings when creating the corpus; the network should then be made of what you're looking for.
Otherwise, Hyphe is made to aggregate links between web entities, which are groups of webpages. Getting the detailed links between all the pages is possible but not trivial, and would require you to call the API manually.

@AlirezaMalih
Author

Thank you. What I'm after is the internal hyperlink network of each crawled website. For instance, for a website I crawled, the network of internal links within that website can be downloaded manually as a GEXF file. The internal hyperlinks can only be downloaded this way one website at a time, and I need a way to get these GEXF files without going through them one by one, because I have too many webpages to crawl and downloading each GEXF file takes time. Merci :)

@AlirezaMalih
Author

It might be clearer to put it this way:

I can't download the internal hyperlinks with minet either. I tried setting ignore_internal_links= True, but it didn't work. I'm trying to download the internal hyperlinks and nodes in GEXF format, but for a large amount of data I can't do this manually.

@AlirezaMalih
Author

AlirezaMalih commented May 23, 2024

I mean the data under '.../webentityPagesNetwork/...': I can't download all of it at once. I can only get the CSV for the nodes (pages), not the hyperlinks between them.

@boogheta
Member

boogheta commented May 23, 2024

I repeat my question: what is your methodological need? Using Hyphe to work with inner links is like using an anvil instead of a hammer. If you're looking to build a network of webpages in general, Hyphe is not the right tool for this. There are ways to collect that data with it, but it would be way more straightforward for you to just do it with other tools (minet has great crawl and link extraction tools for this, for instance; issuecrawler or socscibot might also be good options with graphical interfaces).

However, if your needs relate to the aggregation of those webpages into groups such as Hyphe's webentities, then Hyphe might be suitable, but in that case I'm not sure what you expect from gathering all the detailed links page by page, and I'd be curious to understand.
In any case, the only way to do that with Hyphe would be to use the API directly by calling the store.paginate_webentity_pagelinks_network route and setting the include_external_links option to true. But that's not really easy to do, and you'll have to write code for it. There are Python examples within the bin directory, or you can use the test_client.py script to call the API from the command line.
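
To give an idea of what "calling the API directly" for many web entities could look like, here is a minimal sketch that prepares one request body per web entity and posts it as JSON. Only the route name comes from this thread. The JSON-RPC style body, the argument list, the web entity IDs, and the API URL are all assumptions for illustration and should be checked against Hyphe's API documentation and the scripts in the bin directory.

```python
import json
import urllib.request

# Route name taken from the discussion above.
METHOD = "store.paginate_webentity_pagelinks_network"


def build_payload(params):
    """Wrap a parameter list in a JSON-RPC style request body (assumed shape)."""
    return {"method": METHOD, "params": list(params)}


def post_json(api_url, payload):
    """POST the payload as JSON to api_url and return the decoded response."""
    request = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical web entity IDs; the real ones come from your own corpus.
webentity_ids = [1, 2, 3]

# One request body per web entity. The argument list is a placeholder: the
# real order and names (corpus, include_external_links, paging token, ...)
# must be taken from the API documentation.
payloads = [build_payload([weid]) for weid in webentity_ids]
print(json.dumps(payloads[0]))
```

Each payload could then be sent with `post_json(api_url, payload)`, where `api_url` is the address of your own Hyphe API. Since the route is paginated, the responses would presumably have to be followed page by page before the links can be assembled into a graph.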
