Improve SEO to enable GigaDB datasets to be found by Google/Bing/Baidu searches #514

only1chunts · 2020-10-06T17:38:13Z

User Story

As a Website User
I want to find GigaDB content through search engines
So that I can conveniently find GigaDB content relevant to my needs

Acceptance Criteria

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Bing
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Bing
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Baidu
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Baidu
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Yandex
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Yandex
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Additional Info

Product Backlog Item Ready Checklist

Business value is clearly articulated
Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
Dependencies are identified and no external dependencies would block this item from being completed
At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
This item is estimated and small enough to comfortably be completed in one sprint
Acceptance criteria are clear and testable
Performance criteria, if any, are defined and testable
The Scrum team understands how to demonstrate this item at the sprint review

Product Backlog Item Done Checklist

Code is complete
Automated tests related to the changes are implemented and passing
All automated test suites are passing locally
Code is refactored to best practices and coding standards
Documentation is updated as needed
A Pull Request has been created and review requested
Pull Request is reviewed and approved
The item has been merged to the develop branch
All automated test suites are passing on continuous Integration pipeline and item is ready to release

Is your feature request related to a problem? Please describe.
Historically, GigaDB datasets appeared in Google search results, but recently they have stopped appearing.
This can be seen by searching Google for any word with the restriction "site:www.gigadb.org", e.g. genome site:www.gigadb.org

Describe the solution you'd like
The solution will probably require the use of schema.org. According to #73 this has been implemented on the home page, but it needs to be extended to the dataset pages (and all other gigadb.org pages).

only1chunts · 2020-11-06T12:04:47Z

We might also need to check whether any of the bot-blocker work that was done to prevent bots crawling the FTP server has accidentally affected gigadb.org too.

only1chunts · 2020-11-10T16:07:15Z

It appears that the dataset pages contain these lines in the HTML:

<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">

Could that be causing the lack of indexing in Google?
To see this I used the rich-results checker tool, e.g.:
https://search.google.com/test/rich-results?utm_campaign=sdtt&utm_medium=url&id=rkYdN_AWFjpXO_JUaPx9Gg
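The check described above can also be scripted against a page's HTML. The following is only a minimal sketch using Python's standard library; the names `RobotsMetaParser` and `blocks_indexing` are made up for illustration and are not part of the GigaDB codebase.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            tokens = {t.strip().lower() for t in attr.get("content", "").split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def blocks_indexing(html: str) -> bool:
    """Return True if any robots/googlebot meta tag asks crawlers not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return any("noindex" in tokens or "none" in tokens for tokens in parser.directives.values())


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex"><meta name="googlebot" content="noindex"></head>'
    print(blocks_indexing(page))
```

Running the same function over the HTML of a public dataset page would show whether the tags are still being emitted.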

only1chunts · 2021-01-19T16:17:35Z

The rich-results checker link in the previous comment shows the metadata is being made available but with a number of issues:

Page loading issue: Not all page resources could be loaded. This can affect how Google sees and understands your page. Fix availability problems for any resources that can affect how Google understands your page.
ERROR - Invalid object type for field "license"
Warning - Missing field "creator" (optional)
Warning - Invalid value type for field "license" (optional)
Warning - Missing field "encodingFormat" (optional)
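As a rough illustration of what the checker seems to be asking for, the sketch below builds a schema.org Dataset object in which "license" is a plain URL, "creator" is present, and each download carries an "encodingFormat". The function name `dataset_jsonld` and all example values (title, author names, file URL, the CC0 licence URL) are assumptions for illustration, not taken from the GigaDB implementation.

```python
import json


def dataset_jsonld(name, description, url, license_url, creators, downloads):
    """Build a schema.org Dataset as a dict ready to be serialised as JSON-LD.

    creators  -- list of author names
    downloads -- list of (content_url, encoding_format) pairs
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # A URL string (or a CreativeWork object) is the expected type for license.
        "license": license_url,
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "distribution": [
            {"@type": "DataDownload", "contentUrl": content_url, "encodingFormat": fmt}
            for content_url, fmt in downloads
        ],
    }


if __name__ == "__main__":
    # All values below are placeholders.
    doc = dataset_jsonld(
        name="Example dataset title",
        description="Supporting data for an example manuscript.",
        url="https://gigadb.org/dataset/100001",
        license_url="https://creativecommons.org/publicdomain/zero/1.0/",
        creators=["A. Author", "B. Author"],
        downloads=[("https://example.org/files/readme.txt", "text/plain")],
    )
    print('<script type="application/ld+json">')
    print(json.dumps(doc, indent=2))
    print("</script>")
```

The output would still need to be run through the rich-results checker to confirm that the error and warnings listed above are resolved.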


only1chunts · 2024-05-28T19:47:15Z

@ChrisArmit makes a good point when testing the release:

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

-- This works

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

-- This does not work

In fact, when you add the word "gigadb" to the search term in Google, it does find the correct GigaDB dataset but shows it like this:

[screenshot]

If you click the "Learn why" button it takes you here:
https://support.google.com/webmasters/answer/7489871?hl=en

luistoptal · 2024-06-04T01:43:26Z

@only1chunts The robots.txt is disallowing all crawlers, including Google's bots, from reading any page of the site.

Suggested change:

Replace the content of robots.txt with this:

User-agent: *
Allow: /

Prevent bots from indexing admin pages by adding further Disallow rules for them to robots.txt.
Remove <meta name="robots" content="noindex, nofollow"><meta name="googlebot" content="noindex, nofollow"> from public pages.
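A proposed robots.txt can be sanity-checked before deployment with Python's `urllib.robotparser`. This is a sketch under assumptions: the `/admin/` prefix and the dataset URL are hypothetical examples, and the Disallow rule is listed before `Allow: /` because this parser applies the first matching rule, whereas Google uses the most specific match.

```python
from urllib.robotparser import RobotFileParser

# Current behaviour as described above: everything is blocked.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Proposed rules; "/admin/" is an assumed path for the admin pages.
PROPOSED = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    dataset_url = "https://gigadb.org/dataset/100001"  # hypothetical public page
    admin_url = "https://gigadb.org/admin/users"       # hypothetical admin page
    for label, rules in (("block all", BLOCK_ALL), ("proposed", PROPOSED)):
        print(label, can_crawl(rules, "Googlebot", dataset_url), can_crawl(rules, "Googlebot", admin_url))
```

Note that robots.txt only controls crawling; the noindex meta tags on public pages would still need to be removed separately.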

rija · 2024-06-04T15:18:20Z


Interesting.
Up until the switchover, we disallowed search engine indexing on non-live environments because we don't want non-live data to be public.
After the switchover, we didn't change that setting, but we saw in the logs that search engine bots were indexing the new website anyway, so we thought they didn't respect the no-indexing directives.
In any case, the directive needs to be reversed, but on live only. We still don't want our dev and staging environments to appear in search engines. I think it's better that I create a specific ticket for switching the indexing directive based on environment (as it's not as trivial as it appears).
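One possible shape for the environment switch is sketched below. It is only an illustration: the function name `robots_meta_tags` and the environment label "live" are assumptions, and the real change is expected to be more involved, as noted above. Any environment other than live, including an unknown one, keeps the noindex tags so that a misconfiguration fails safe.

```python
NOINDEX_TAGS = (
    '<meta name="robots" content="noindex, nofollow">\n'
    '<meta name="googlebot" content="noindex, nofollow">'
)


def robots_meta_tags(environment: str) -> str:
    """Return the robots meta tags to render in the page head for an environment.

    Only the live environment is left indexable; dev, staging and any
    unrecognised value get the noindex tags.
    """
    if environment.strip().lower() == "live":
        return ""
    return NOINDEX_TAGS


if __name__ == "__main__":
    for env in ("dev", "staging", "live"):
        print(env, repr(robots_meta_tags(env)))
```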

Another issue that affects SEO is that search engines still know gigadb.org as an http site and don't see https://gigadb.org as the same site, because we don't yet have an explicit redirection from http to https (there's issue #1799 for that, but it is not yet implemented).
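Once a redirect exists, its behaviour could be verified with a small check like the one below. This is a sketch only: the helper names are invented, the dataset path is a placeholder, and the status code and Location header are assumed to come from whatever HTTP client is used to request the http URL without following redirects.

```python
from urllib.parse import urlsplit, urlunsplit


def https_equivalent(url: str) -> str:
    """Return the same URL with the scheme forced to https."""
    parts = urlsplit(url)
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


def is_permanent_https_redirect(requested_url: str, status: int, location: str) -> bool:
    """Check that a response permanently redirects an http URL to its https twin.

    301 and 308 are the permanent redirect codes; temporary redirects
    (302, 307) are rejected here because they do not tell search engines
    that the https URL has replaced the http one.
    """
    return status in (301, 308) and location == https_equivalent(requested_url)


if __name__ == "__main__":
    url = "http://gigadb.org/dataset/100001"  # placeholder path
    print(is_permanent_https_redirect(url, 301, "https://gigadb.org/dataset/100001"))
    print(is_permanent_https_redirect(url, 302, "https://gigadb.org/dataset/100001"))
```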

Finally, we may have to manually re-index our website on the various search engines, using their dashboards and search tools, to have better control over how we show up (and maybe to ensure that the old http site is no longer indexed).

rija · 2025-02-10T09:48:37Z

I created #2190 as a follow-up to my comment above.

only1chunts added the enhancement label Oct 6, 2020

rija added asa:WebsiteUser backlog:Story labels May 14, 2021

rija mentioned this issue Apr 15, 2024

issues moving between sample and file view after clicking through some pages of files #1783

Closed

rija added freelance frontend labels Apr 15, 2024

luistoptal mentioned this issue Apr 17, 2024

add canonical URL to dataset pages #1796

Merged

rija added a commit that referenced this issue May 21, 2024

Feat: Add canonical URL to dataset pages (Merge pull request #1796)

d33656e

Add canonical URL tag to dataset pages Refs: #514

rija removed freelance frontend labels Jun 11, 2024

only1chunts modified the milestones: Y. Increase FAIRness, SEO Dec 19, 2024

only1chunts added this to Backlog: GigaDB Database Jan 27, 2025

only1chunts mentioned this issue Jan 27, 2025

implement schema.org metadata for every dataset #2182

Open

14 tasks

rija closed this as completed Feb 10, 2025

rija added backlog:Epic and removed backlog:Story labels Feb 10, 2025

rija reopened this Feb 10, 2025

rija removed this from Backlog: GigaDB Database Feb 10, 2025

rija added this to Backlog: GigaDB Database Feb 10, 2025

rija moved this to Ready in Backlog: GigaDB Database Feb 10, 2025
