Search: Return Results for URLs #5936

Open
4 tasks done
StephenB87 opened this issue Aug 31, 2023 · 15 comments
Labels
change request Issue requests a new feature or improvement

Comments

@StephenB87

Context

A user should be able to search for a URL and get results that match that URL. For example, if I search for https://example1.abc.com/, the results should match the entire URL. Currently, with the default settings for the search plugin, they do not:

Screenshot 2023-08-31 at 1 40 08 PM

Description

When searching for a URL, the returned results should match the entire URL string. For example, if I search for https://example1.abc.com/, the results should match the entire URL, not the separate fragments https, example1, abc, and com. See the screenshot above for context.

Related links

Use Cases

The site search plugin could be enhanced to return matching results for URLs. Many organizations have documentation that contains multiple URLs. Someone might not know what a URL is for, but if they can search for it, they can find the relevant documentation.

Visuals

No response

@squidfunk added the needs investigation label Aug 31, 2023
@MaximilianKohler
Contributor

MaximilianKohler commented Sep 1, 2023

Yeah, I've been needing this as well. #4384 (comment)

I figured out a few things:

Separator (docs) https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/?h=separator#special-characters

Their default is:

plugins:
  - search:
      separator: '[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

This part: (?!\b)(?=[A-Z][a-z]) results in no results for PubPeer, only pubpeer. So I removed it.

This part: \.(?!\d) results in pubpeer.com returning all results for all words with com in them. So I removed it.

Some parts of this are needed to return any results for https://example.com:

[\s\-,:!=\[\]()"`/]+

But it also returns results for all instances of https.

Removing / gives results for https://example.com but incomplete results for example.com.

Removing : screws it up completely. So that's where I'm stuck at.
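
The effect of each alternative can be checked by applying the separator with `String.prototype.split`. The following is a minimal sketch, not the plugin's actual code: the `tokenize` helper and the lowercasing step are assumptions made for illustration, and the reduced separators are simply the default with the parts discussed above removed.

```typescript
// Default separator as quoted above, written as a JavaScript regular expression.
const defaultSeparator = /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/;

// Same separator without the camel-case and dot alternatives.
const reducedSeparator = /[\s\-,:!=\[\]()"\/]+|&[lg]t;/;

// Reduced separator with "/" also removed from the character class.
const noSlashSeparator = /[\s\-,:!=\[\]()"]+|&[lg]t;/;

// Hypothetical helper: split on the separator, drop empty strings, lowercase.
function tokenize(text: string, separator: RegExp): string[] {
  return text
    .split(separator)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

// Camel-case alternative splits the name into two tokens: ["pub", "peer"]
const pubPeerTokens = tokenize("PubPeer", defaultSeparator);

// Dot alternative separates the TLD: ["pubpeer", "com"]
const domainTokens = tokenize("pubpeer.com", defaultSeparator);

// Default separator breaks the URL apart: ["https", "example", "com"]
const urlTokens = tokenize("https://example.com", defaultSeparator);

// Without the dot alternative the host survives: ["https", "example.com"]
const urlReducedTokens = tokenize("https://example.com", reducedSeparator);

// Without "/" the host keeps its leading slashes: ["https", "//example.com"]
const urlNoSlashTokens = tokenize("https://example.com", noSlashSeparator);
```

The last result is consistent with the observation above: once / is removed, the indexed token is //example.com, so a query for example.com alone no longer lines up with it exactly.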

@squidfunk
Owner

Thanks for suggesting. I'm not sure that many users need this, but we can definitely let it sit here for a while to collect some feedback. We might consider shipping it with our new search functionality.

@squidfunk added the change request label and removed the needs investigation label Sep 2, 2023
@MaximilianKohler
Contributor

@squidfunk does it need to be a big deal? Can't it be something as "simple" as changing options/code like I listed above?

@squidfunk
Owner

squidfunk commented Sep 2, 2023

I don't think so. From what you write in #5936 (comment), you want to match:

  1. example.com
  2. https://example.com/foo/bar

I don't think you can achieve it by just changing the search separator, because currently you can have either 1. or 2., but not both. You will also match instances of https, which are then incorporated into the ranking of documents, and match URLs with other domains. It's just how lunr.js currently works. What you essentially want is for exact matches to rank higher than (or outcompete) partial matches, plus span rankings, i.e., words that occur together should rank higher than those that don't. I've fiddled around with lunr.js for a long time, and judging from what I know about its architecture, it's just not possible.

This is one of the reasons why I'm currently rewriting search from scratch.
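
The ranking problem described above can be illustrated without lunr.js. The sketch below is a hypothetical scoring scheme, not lunr.js's algorithm: `overlapScore` simply counts shared tokens, and `boostedScore` adds an arbitrary bonus of 10 when the document contains the whole query string. The separator is a simplified assumption as well.

```typescript
// Simplified separator (assumption): splits on common punctuation and on dots.
const separator = /[\s\-,:!=\[\]()"\/]+|\.(?!\d)/;

// Lowercase, split, and drop empty strings.
function tokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(separator)
    .filter((token) => token.length > 0);
}

// Number of query tokens that also occur in the document.
function overlapScore(query: string, doc: string): number {
  const docTokens = tokens(doc);
  return tokens(query).filter((token) => docTokens.indexOf(token) !== -1).length;
}

// Token overlap plus an arbitrary bonus when the full query appears verbatim.
function boostedScore(query: string, doc: string): number {
  const exact = doc.toLowerCase().indexOf(query.toLowerCase()) !== -1;
  return overlapScore(query, doc) + (exact ? 10 : 0);
}

const query = "https://example.com";
const matchingDoc = "See https://example.com/foo/bar for details";
const otherDomainDoc = "Visit https://other.com today";

// Token overlap alone: the unrelated domain still scores 2 of 3.
const plainMatching = overlapScore(query, matchingDoc);   // 3
const plainOther = overlapScore(query, otherDomainDoc);   // 2

// With the exact-match bonus the intended document clearly wins.
const boostedMatching = boostedScore(query, matchingDoc); // 13
const boostedOther = boostedScore(query, otherDomainDoc); // 2
```

With token overlap alone, the document about another domain scores almost as high as the intended one because it shares https and com. Only the exact-match bonus separates them.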

@squidfunk
Owner

squidfunk commented Oct 30, 2023

Thanks again for suggesting. While I'm not sure whether the general case of adding URLs to indexes is something many users want, I'm confident that I found a good design that allows for defining different separators for different fields:

const config: Config<Document> = {
  schemas: [
    {
      kind: "term",
      data: {
        separator: whitespace, // "foo bar baz" -> "foo", "bar", "baz"
        fields: [
          { name: "foo", from: ({ foo }) => foo },
          { name: "baz", from: ({ bar }) => bar?.baz }
        ]
      }
    },
    {
      kind: "term",
      data: {
        separator: none, // "http://example.com" -> "http://example.com"
        fields: [
          { name: "url", from: ({ url }) => url }
        ]
      }
    }
  ]
}

This will also allow indexing the same terms in different ways, including camel- and Pascal-case terms, e.g. MkDocs as mkdocs as well as mk + docs, which was asked for a few times. I hope to finish the new implementation soon; I'm working hard on it!
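
As a rough illustration of indexing one term in several ways, the sketch below emits both the whole lowercased term and its camel-case parts. The `indexVariants` helper is hypothetical and is not part of the new search implementation; the camel-case pattern is the alternative from the current default separator.

```typescript
// Zero-width split point before an uppercase letter followed by a lowercase one.
const camelCaseBoundary = /(?!\b)(?=[A-Z][a-z])/;

// Hypothetical helper: the whole term plus its camel-case parts, deduplicated.
function indexVariants(term: string): string[] {
  const whole = term.toLowerCase();
  const parts = term.split(camelCaseBoundary).map((part) => part.toLowerCase());
  const all = [whole].concat(parts);
  // Keep only the first occurrence of each non-empty variant.
  return all.filter((variant, i) => variant.length > 0 && all.indexOf(variant) === i);
}

const mkDocsVariants = indexVariants("MkDocs"); // ["mkdocs", "mk", "docs"]
const plainVariants = indexVariants("search");  // ["search"]
```

A term without an inner capital yields a single variant, so plain words are not duplicated in the index.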

@squidfunk
Owner

A further thing that came to my mind: the approach I mentioned above will work when URLs are provided as metadata, i.e., as a separate field, but not if they are contained in the text. No tokenization separator can cover that. However, we could add a preprocessing step that extracts URLs from the text and indexes them accordingly. I'm not sure we will offer this functionality from the start, but as long as you specify URLs as metadata on documents (or use the actual URL of the document for indexing), it should work in the first iteration of the new search.
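
Such a preprocessing step could look roughly like the sketch below. Everything here is an assumption for illustration: the URL pattern is deliberately simple and will miss edge cases, and `extractUrls` and `indexTerms` are hypothetical helpers, not functions of the new search.

```typescript
// Deliberately simple URL pattern (assumption): http(s) scheme up to whitespace
// or a closing quote or bracket.
const urlPattern = /https?:\/\/[^\s<>"')\]]+/g;

// Collect URLs and strip sentence punctuation that trails them.
function extractUrls(text: string): string[] {
  const matches = text.match(urlPattern) || [];
  return matches.map((url) => url.replace(/[.,;:!?]+$/, ""));
}

// Index each URL as one term, then tokenize the remaining text into words.
function indexTerms(text: string): string[] {
  const urls = extractUrls(text);
  const words = text
    .replace(urlPattern, " ")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0);
  return urls.concat(words);
}

// ["https://example.com/foo/bar", "see", "then", "stop"]
const terms = indexTerms("See https://example.com/foo/bar. Then stop");
```

Because the URL is removed from the text before word tokenization, it is indexed only as a whole. Indexing its parts too would be a separate design choice.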

@squidfunk
Owner

Please see the announcement in #6307.

@squidfunk
Owner

I invite you to try the 2nd search research preview – I think this should solve the issue at hand:

screenshot-localhost-3000-1700492546631

If you add characters that occur in URLs to the separator, URLs will be tokenized as well, but that should not matter, since they should now still be found when they match correctly. Additionally, tokenizing gives you the opportunity to search for path parts:

screenshot-localhost-3000-1700492676032

@squidfunk
Owner

@StephenB87 @MaximilianKohler did either of you check out the research preview? Does it improve results?

@MaximilianKohler
Contributor

I have been following your progress in these issues but I haven't installed any beta/preview versions. Your screenshots look like the new versions are a full solution though.

@squidfunk
Owner

squidfunk commented Feb 17, 2024

It would be good to receive some feedback, especially on #5936 (comment), which I got no reaction to. It's one thing to raise feature requests or report things that can be improved, but it's another to provide feedback on the solutions we propose to address those issues 😉 Otherwise, when we release the new version and it does not fully solve what was requested here, it might be too late. That's why we try to get early feedback.

@MaximilianKohler
Contributor

I didn't quite understand that comment. Do you mean "naked link" (https://squidfunk.github.io/mkdocs-material) vs contained in text?

Naked links would be the most important for me, but contained-in-text would be great if possible.

I didn't try any of the PRs/betas/tests because I'm not too familiar with switching back and forth between them and the master branch, and I think I read something about it being difficult or problematic. Oh, it might have been this #6372, including your note to not use it in production.

Looking at the installation docs I'm not too sure how switching back and forth works. I'm a novice, on Windows, and managed to get the master installed with pip, but I don't understand where it's been installed, how installing a PR would work or replace or conflict with the existing install, where the PR would be and how to use it, etc.

That reminds me, I was curious how you made the front page with this and couldn't find the source. Here's why I was curious. I posted that to a few other places but didn't find an answer.

@squidfunk
Owner

> I didn't try any of the PRs/betas/tests because I'm not too familiar with switching back and forth between them and the master branch, and I think I read something about it being difficult or problematic. Oh, it might have been this #6372, including your note to not use it in production.

I added instructions in the research preview:

pip install git+https://github.com/squidfunk/mkdocs-material.git@spike/search-preview-2

Additionally, somebody asked how to switch back in the same issue:

pip install mkdocs-material

Regardless, we'll keep working on this. If the solution we come up with doesn't entirely meet your requirements, you can always customize or fork the theme to get it exactly to your taste ☺️ I was just hoping for feedback, and think it is a fair ask, given that I try to solve the problem reported here, but I get that installing branches might be too much of an ask.

@MaximilianKohler
Contributor

> I added instructions in the research preview (#6372):

That's just one command. I realize that command fetches the PR but it doesn't answer my other questions.

> Additionally, somebody asked how to switch back (#6372 (comment)) in the same issue:

Yes, I saw. It doesn't answer my other questions.

@kamilkrzyskow
Collaborator

kamilkrzyskow commented Mar 9, 2024

@MaximilianKohler

> I'm not too sure how switching back and forth works. I'm a novice, on Windows, and managed to get the master installed with pip, but I don't understand where it's been installed, how installing a PR would work or replace or conflict with the existing install, where the PR would be and how to use it, etc.

Installing a package with the same name should override the previous one. If pip detects that there is no need to install that version and skips it, you can add the --force-reinstall flag at the end of the command.
You can also use pip show mkdocs-material to see the Location where it's currently installed.
You can (and should) use a Virtual Environment to separate the package installations:
https://docs.python.org/3/tutorial/venv.html
This is mentioned in the reproduction guide:
https://squidfunk.github.io/mkdocs-material/guides/creating-a-reproduction/

You can also watch a whole guide on how to set up a development environment if you want to be extra thorough:
https://www.youtube.com/@coreyms/search?query=visual%20code (not affiliated with him, nor did I use those guides, but he provides Mac and Windows guides and from the key points it seems detailed enough, despite being 4 years old the guides shouldn't be too outdated ✌️)

Most of the questions you asked are one or two Google or ChatGPT searches away, and I'm not sure they are in scope of the Material theme's documentation.

> That reminds me, I was curious how you made the front page with this and couldn't find the source. Here's why I was curious. I posted that to a few other places but didn't find an answer.

Searching the discussions board (including closed discussions) for parallax or landing page will lead you to answers:

The landing page is a custom addition for the theme's documentation; the source code can be viewed by sponsors with access to Insiders. The images are under a special licence and can't be reused, IIRC. It's not a supported feature, just a customization, so there are no easy configuration settings for it.

EDIT: Fixup for the "It's not a supported feature". The custom home page is supported, with custom templates, and an example is provided in the community version:
