Improve SEO to enable GigaDB datasets to be found by Google/Bing/Baidu searches #514

Open
17 tasks
only1chunts opened this issue Oct 6, 2020 · 7 comments

Comments

@only1chunts
Member

only1chunts commented Oct 6, 2020

User Story

As a Website User
I want to find GigaDB content through search engines
So that I can conveniently find GigaDB content relevant to my needs

Acceptance Criteria

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Bing
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Bing
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Baidu
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Baidu
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Yandex
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Yandex
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Additional Info

Product Backlog Item Ready Checklist

  • Business value is clearly articulated
  • Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
  • Dependencies are identified and no external dependencies would block this item from being completed
  • At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
  • This item is estimated and small enough to comfortably be completed in one sprint
  • Acceptance criteria are clear and testable
  • Performance criteria, if any, are defined and testable
  • The Scrum team understands how to demonstrate this item at the sprint review

Product Backlog Item Done Checklist

  • Code is complete
  • Automated tests related to the changes are implemented and passing
  • All automated test suites are passing locally
  • Code is refactored to best practices and coding standards
  • Documentation is updated as needed
  • A Pull Request has been created and review requested
  • Pull Request is reviewed and approved
  • The item has been merged to the develop branch
  • All automated test suites are passing on the continuous integration pipeline and the item is ready to release

Is your feature request related to a problem? Please describe.
Historically, GigaDB datasets have appeared in Google search results, but recently they have stopped appearing.
This can be seen by searching Google for ANY word with the restriction "site:www.gigadb.org", e.g. genome site:www.gigadb.org

Describe the solution you'd like
It is probable that the solution will require the use of schema.org. According to #73, this has been implemented on the home page, but it needs to be extended to the dataset pages (and all other gigadb.org pages).

@only1chunts
Member Author

We may also need to check whether any of the bot-blocker work that was done to prevent bots crawling the FTP server has accidentally affected gigadb.org too.

@only1chunts
Member Author

only1chunts commented Nov 10, 2020

It appears that the dataset pages contain these lines in the HTML:

<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">

Could that be causing the lack of indexing in Google?
To see this, I used the rich-results checker tool, e.g.:
https://search.google.com/test/rich-results?utm_campaign=sdtt&utm_medium=url&id=rkYdN_AWFjpXO_JUaPx9Gg
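
As a way to check this without the rich-results tool, here is a minimal Python sketch that scans a page's HTML for robots meta directives. The class and function names (`RobotsMetaParser`, `blocks_indexing`) are hypothetical and not part of the GigaDB codebase; the sample HTML is just the two lines quoted above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercase directive tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            tokens = {t.strip().lower() for t in attr.get("content", "").split(",")}
            self.directives.setdefault(name, set()).update(t for t in tokens if t)


def blocks_indexing(html: str) -> bool:
    """Return True if any robots meta tag asks search engines not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return any(
        "noindex" in tokens or "none" in tokens
        for tokens in parser.directives.values()
    )


sample = (
    '<meta name="robots" content="noindex">'
    '<meta name="googlebot" content="noindex">'
)
print(blocks_indexing(sample))
```

Running the same function over the HTML of a few live dataset pages would show whether the tags are present everywhere or only on some templates.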

@only1chunts
Member Author

The rich-results checker link in the previous comment shows the metadata is being made available but with a number of issues:

Page loading issue: Not all page resources could be loaded. This can affect how Google sees and understands your page. Fix availability problems for any resources that can affect how Google understands your page.
ERROR - Invalid object type for field "license"
Warning - Missing field "creator" (optional)
Warning - Invalid value type for field "license" (optional)
Warning - Missing field "encodingFormat" (optional)
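
To illustrate what would address these messages, here is a hedged Python sketch that builds schema.org Dataset markup with `license` given as a URL, a `creator` list, and an `encodingFormat` on the distribution. The function name `dataset_jsonld` and all example values (URLs, author name, format) are placeholders and assumptions, not real GigaDB metadata or the markup the site currently emits.

```python
import json


def dataset_jsonld(name, description, url, license_url, creators,
                   download_url, encoding_format):
    """Build a schema.org Dataset description as a plain dict (JSON-LD)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # A URL (or a CreativeWork object) is a valid type for "license".
        "license": license_url,
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": encoding_format,
            }
        ],
    }


# Placeholder values for illustration only.
doc = dataset_jsonld(
    name="A molecular map of lung neuroendocrine neoplasms",
    description="Placeholder description of the dataset.",
    url="https://example.org/dataset/placeholder",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    creators=["Example Author"],
    download_url="https://example.org/files/placeholder.tsv",
    encoding_format="text/tab-separated-values",
)

# The result would be embedded in the page inside
# <script type="application/ld+json"> ... </script>.
print(json.dumps(doc, indent=2))
```

The resulting JSON could be pasted into the rich-results checker as a code snippet to confirm whether the error and warnings go away before changing the page templates.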

@only1chunts
Member Author

@ChrisArmit makes a good point when testing the release:

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page

Result: this works

Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page

Result: this does not work

In fact, when you add the word "gigadb" to the search term in Google, it does find the correct GigaDB dataset, but shows it like this:
image

If you click the "Learn why" button, it takes you here:
https://support.google.com/webmasters/answer/7489871?hl=en

@luistoptal
Collaborator

@only1chunts The robots.txt is disallowing all crawlers, including Googlebot, from reading any page of the site.

Suggested change:

  • replace content of robots.txt with this:
User-agent: *
Allow: /
  • prevent bots from indexing admin pages by adding further Disallow rules for them to robots.txt

  • remove <meta name="robots" content="noindex, nofollow"><meta name="googlebot" content="noindex, nofollow"> from public pages
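
The effect of the suggested rules can be checked offline with Python's standard-library `urllib.robotparser`; a minimal sketch follows. The `/admin/` path, the dataset URL, and the `BLOCK_ALL` text are assumptions for illustration (the actual current robots.txt and admin routes are not shown in this thread). The `Disallow` line is placed before `Allow: /` because Python's parser applies the first matching rule, whereas RFC 9309 crawlers use the longest match; with this ordering both interpretations agree.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Assumed shape of a robots.txt that blocks every crawler.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Suggested replacement, plus a hypothetical admin exclusion.
PROPOSED = "User-agent: *\nDisallow: /admin/\nAllow: /\n"

dataset_url = "https://gigadb.org/dataset/100001"   # hypothetical URL
admin_url = "https://gigadb.org/admin/settings"     # hypothetical URL

print(can_crawl(BLOCK_ALL, dataset_url))
print(can_crawl(PROPOSED, dataset_url))
print(can_crawl(PROPOSED, admin_url))
```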

@rija
Contributor

rija commented Jun 4, 2024

@only1chunts The robots.txt is disallowing all crawlers, including Googlebot, from reading any page of the site.

Suggested change:

  • replace content of robots.txt with this:
User-agent: *
Allow: /
  • prevent bots from indexing admin pages by adding further Disallow rules for them to robots.txt
  • remove <meta name="robots" content="noindex, nofollow"><meta name="googlebot" content="noindex, nofollow"> from public pages

Interesting.
Up until the switchover, we disallowed search engine indexing on non-live environments, as we don't want non-live data to be public.
After the switchover, we didn't change that setting, but we saw in the logs that search engine bots were indexing the new website anyway, so we thought they didn't respect the no-indexing directives.
In any case, the directive needs to be reversed, but on live only. We still don't want our dev and staging environments to appear on search engines. I think it's better that I create a specific ticket for switching the indexing directive based on environment (as it's not as trivial as it appears).
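
A rough Python sketch of what switching the directive by environment could look like is below. It only illustrates the idea; the environment names ("live", "staging", "dev"), the function names, and the `/admin/` path are all assumptions, and the real change would live in the site's own configuration and templates.

```python
def robots_txt_for(environment: str) -> str:
    """Return the robots.txt body for a deployment environment."""
    if environment == "live":
        # Public site: allow crawling, except a hypothetical admin area.
        return "User-agent: *\nDisallow: /admin/\nAllow: /\n"
    # Any non-live environment stays closed to crawlers.
    return "User-agent: *\nDisallow: /\n"


def robots_meta_for(environment: str) -> str:
    """Return the robots meta tag to render in <head>, or '' for none."""
    if environment == "live":
        return ""
    return '<meta name="robots" content="noindex, nofollow">'


for env in ("live", "staging", "dev"):
    print(env, repr(robots_txt_for(env)), repr(robots_meta_for(env)))
```

Defaulting every unrecognised environment name to the closed configuration means a misconfigured deployment fails towards not being indexed.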

Another issue that affects SEO is that search engines still know gigadb.org as an http site and don't see https://gigadb.org as the same site, because we don't yet have explicit redirection from http to https (there's issue #1799 for that, but it is not yet implemented).
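
For illustration, the redirect rule amounts to the following mapping, sketched here in Python. This is not the implementation planned in #1799 (which would more likely be web-server or load-balancer configuration); the function name `https_redirect` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def https_redirect(url: str):
    """Return (301, https_url) for an http URL, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "http":
        return None
    target = urlunsplit(
        ("https", parts.netloc, parts.path or "/", parts.query, parts.fragment)
    )
    # A permanent (301) redirect signals that the https URL replaces the http one.
    return 301, target


print(https_redirect("http://gigadb.org/dataset/100001"))  # hypothetical URL
print(https_redirect("https://gigadb.org/"))
```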

Finally, we may have to manually re-index our website on the various search engines, using their dashboards and search tools, to have better control over how we show up (and maybe to ensure that the old http site is no longer indexed).

@rija
Contributor

rija commented Feb 10, 2025

I created #2190 as a follow-up to my comment above.
