Commit

Merge pull request osm-search#3542 from lonvia/remove-legacy-tokenizer
Remove legacy tokenizer
lonvia authored Sep 24, 2024
2 parents e92e03e + f960a9b commit d856788
Showing 53 changed files with 59 additions and 4,339 deletions.
4 changes: 2 additions & 2 deletions .pylintrc
@@ -13,10 +13,10 @@ ignored-classes=NominatimArgs,closing
# 'too-many-ancestors' is triggered already by deriving from UserDict
# 'not-context-manager' disabled because it causes false positives once
# typed Python is enabled. See also https://github.com/PyCQA/pylint/issues/5273
disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use,not-context-manager,use-dict-literal,chained-comparison,attribute-defined-outside-init,too-many-boolean-expressions,contextmanager-generator-missing-cleanup
disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use,not-context-manager,use-dict-literal,chained-comparison,attribute-defined-outside-init,too-many-boolean-expressions,contextmanager-generator-missing-cleanup,too-many-positional-arguments

good-names=i,j,x,y,m,t,fd,db,cc,x1,x2,y1,y2,pt,k,v,nr

[DESIGN]

max-returns=7
max-returns=7
14 changes: 0 additions & 14 deletions CMakeLists.txt
@@ -44,7 +44,6 @@ endif()

set(BUILD_IMPORTER on CACHE BOOL "Build everything for importing/updating the database")
set(BUILD_API on CACHE BOOL "Build everything for the API server")
set(BUILD_MODULE off CACHE BOOL "Build PostgreSQL module for legacy tokenizer")
set(BUILD_TESTS on CACHE BOOL "Build test suite")
set(BUILD_OSM2PGSQL on CACHE BOOL "Build osm2pgsql (expert only)")
set(INSTALL_MUNIN_PLUGINS on CACHE BOOL "Install Munin plugins for supervising Nominatim")
@@ -139,14 +138,6 @@ if (BUILD_TESTS)
endif()
endif()

#-----------------------------------------------------------------------------
# Postgres module
#-----------------------------------------------------------------------------

if (BUILD_MODULE)
add_subdirectory(module)
endif()

#-----------------------------------------------------------------------------
# Installation
#-----------------------------------------------------------------------------
@@ -195,11 +186,6 @@ if (BUILD_OSM2PGSQL)
endif()
endif()

if (BUILD_MODULE)
install(PROGRAMS ${PROJECT_BINARY_DIR}/module/nominatim.so
DESTINATION ${NOMINATIM_LIBDIR}/module)
endif()

install(FILES settings/env.defaults
settings/address-levels.json
settings/phrase-settings.json
3 changes: 1 addition & 2 deletions CONTRIBUTING.md
@@ -61,8 +61,7 @@ pylint3 --extension-pkg-whitelist=osmium nominatim
Before submitting a pull request make sure that the tests pass:

```
cd build
make test
make tests
```

## Releases
94 changes: 10 additions & 84 deletions docs/admin/Advanced-Installations.md
@@ -131,76 +131,13 @@ script ([Geofabrik](https://download.geofabrik.de)) provides daily updates.

## Using an external PostgreSQL database

You can install Nominatim using a database that runs on a different server when
you have physical access to the file system on the other server. Nominatim
uses a custom normalization library that needs to be made accessible to the
PostgreSQL server. This section explains how to set up the normalization
library.

!!! note
The external module is only needed when using the legacy tokenizer.
If you have chosen the ICU tokenizer, then you can ignore this section
and follow the standard import documentation.

### Option 1: Compiling the library on the database server

The most sure way to get a working library is to compile it on the database
server. From the prerequisites you need at least cmake, gcc and the
PostgreSQL server package.

Clone or unpack the Nominatim source code, enter the source directory and
create and enter a build directory.

```sh
cd Nominatim
mkdir build
cd build
```

Now configure cmake to only build the PostgreSQL module and build it:

```
cmake -DBUILD_IMPORTER=off -DBUILD_API=off -DBUILD_TESTS=off -DBUILD_DOCS=off -DBUILD_OSM2PGSQL=off ..
make
```

When done, you find the normalization library in `build/module/nominatim.so`.
Copy it to a place where it is readable and executable by the PostgreSQL server
process.

### Option 2: Compiling the library on the import machine

You can also compile the normalization library on the machine from where you
run the import.

!!! important
You can only do this when the database server and the import machine have
the same architecture and run the same version of Linux. Otherwise there is
no guarantee that the compiled library is compatible with the PostgreSQL
server running on the database server.

Make sure that the PostgreSQL server package is installed on the machine
**with the same version as on the database server**. You do not need to install
the PostgreSQL server itself.

Download and compile Nominatim as per standard instructions. Once done, you find
the normalization library in `build/module/nominatim.so`. Copy the file to
the database server at a location where it is readable and executable by the
PostgreSQL server process.

### Running the import

On the client side you now need to configure the import to point to the
correct location of the library **on the database server**. Add the following
line to your `.env` file:

```
NOMINATIM_DATABASE_MODULE_PATH="<directory on the database server where nominatim.so resides>"
```

Now change the `NOMINATIM_DATABASE_DSN` to point to your remote server and continue
to follow the [standard instructions for importing](Import.md).
You can install Nominatim using a database that runs on a different server.
Simply point the configuration variable `NOMINATIM_DATABASE_DSN` to the
server and follow the standard import documentation.

The import will be faster if it is run directly on the database
machine. You can easily switch to a different machine for the query frontend
after the import.
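
As a rough illustration (not part of the commit), a remote setup might use a `.env` entry like the one below. It assumes the `pgsql:` DSN format with semicolon-separated parameters described in the Nominatim settings documentation. The host name, port, user and password are hypothetical placeholders.

```
# Hypothetical values: replace host, port, user and password with your own.
NOMINATIM_DATABASE_DSN="pgsql:dbname=nominatim;host=db.example.org;port=5432;user=nominatim;password=change-me"
```

Check the parameter names against the settings documentation of the installed Nominatim version before relying on this sketch.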

## Moving the database to another machine

@@ -225,20 +162,9 @@ target machine.
data updates but the resulting database is only about a third of the size
of a full database.

Next install Nominatim on the target machine by following the standard installation
instructions. Again, make sure to use the same version as the source machine.
Next install nominatim-api on the target machine by following the standard
installation instructions. Again, make sure to use the same version as the
source machine.

Create a project directory on your destination machine and set up the `.env`
file to match the configuration on the source machine. Finally run

nominatim refresh --website

to make sure that the local installation of Nominatim will be used.

If you are using the legacy tokenizer you might also have to switch to the
PostgreSQL module that was compiled on your target machine. If you get errors
that PostgreSQL cannot find or access `nominatim.so` then rerun

nominatim refresh --functions

on the target machine to update the location of the module.
file to match the configuration on the source machine. That's all.
12 changes: 0 additions & 12 deletions docs/admin/Installation.md
@@ -178,18 +178,6 @@ make
sudo make install
```

!!! warning
The default installation no longer compiles the PostgreSQL module that
is needed for the legacy tokenizer from older Nominatim versions. If you
are upgrading an older database or want to run the
[legacy tokenizer](../customize/Tokenizers.md#legacy-tokenizer) for
some other reason, you need to enable the PostgreSQL module via
cmake: `cmake -DBUILD_MODULE=on ../Nominatim`. To compile the module
you need to have the server development headers for PostgreSQL installed.
On Ubuntu/Debian run: `sudo apt install postgresql-server-dev-<postgresql version>`
The legacy tokenizer is deprecated and will be removed in Nominatim 5.0


Nominatim installs itself into `/usr/local` per default. To choose a different
installation directory add `-DCMAKE_INSTALL_PREFIX=<install root>` to the
cmake command. Make sure that the `bin` directory is available in your path
53 changes: 0 additions & 53 deletions docs/customize/Settings.md
@@ -64,26 +64,6 @@ Nominatim grants minimal rights to this user to all tables that are needed
for running geocoding queries.


#### NOMINATIM_DATABASE_MODULE_PATH

| Summary | |
| -------------- | --------------------------------------------------- |
| **Description:** | Directory where to find the PostgreSQL server module |
| **Format:** | path |
| **Default:** | _empty_ (use `<project_directory>/module`) |
| **After Changes:** | run `nominatim refresh --functions` |
| **Comment:** | Legacy tokenizer only |

Defines the directory in which the PostgreSQL server module `nominatim.so`
is stored. The directory and module must be accessible by the PostgreSQL
server.

For information on how to use this setting when working with external databases,
see [Advanced Installations](../admin/Advanced-Installations.md).

The option is only used by the Legacy tokenizer and ignored otherwise.


#### NOMINATIM_TOKENIZER

| Summary | |
@@ -114,20 +94,6 @@ on the file format.
If a relative path is given, then the file is searched first relative to the
project directory and then in the global settings directory.

#### NOMINATIM_MAX_WORD_FREQUENCY

| Summary | |
| -------------- | --------------------------------------------------- |
| **Description:** | Number of occurrences before a word is considered frequent |
| **Format:** | int |
| **Default:** | 50000 |
| **After Changes:** | cannot be changed after import |
| **Comment:** | Legacy tokenizer only |

The word frequency count is used by the Legacy tokenizer to automatically
identify _stop words_. Any partial term that occurs more often than the
value defined in this setting is effectively ignored during search.


#### NOMINATIM_LIMIT_REINDEXING

@@ -162,25 +128,6 @@ codes, to restrict import to a subset of languages.
Currently only affects the initial import of country names and special phrases.


#### NOMINATIM_TERM_NORMALIZATION

| Summary | |
| -------------- | --------------------------------------------------- |
| **Description:** | Rules for normalizing terms for comparisons |
| **Format:** | string: semicolon-separated list of ICU rules |
| **Default:** | :: NFD (); [[:Nonspacing Mark:] [:Cf:]] >; :: lower (); [[:Punctuation:][:Space:]]+ > ' '; :: NFC (); |
| **Comment:** | Legacy tokenizer only |

[Special phrases](Special-Phrases.md) have stricter matching requirements than
normal search terms. They must appear exactly in the query after this term
normalization has been applied.

Only has an effect on the Legacy tokenizer. For the ICU tokenizer the rules
defined in the
[normalization section](Tokenizers.md#normalization-and-transliteration)
will be used.


#### NOMINATIM_USE_US_TIGER_DATA

| Summary | |
47 changes: 0 additions & 47 deletions docs/customize/Tokenizers.md
@@ -15,53 +15,6 @@ they can be configured.
chosen tokenizer is very limited as well. See the comments in each tokenizer
section.

## Legacy tokenizer

!!! danger
The Legacy tokenizer is deprecated and will be removed in Nominatim 5.0.
If you still use a database with the legacy tokenizer, you must reimport
it using the ICU tokenizer below.

The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special Postgresql module to normalize names and queries.
This tokenizer is automatically installed and used when upgrading an older
database. It should not be used for new installations anymore.

### Compiling the PostgreSQL module

The tokenizer needs a special C module for PostgreSQL which is not compiled
by default. If you need the legacy tokenizer, compile Nominatim as follows:

```
mkdir build
cd build
cmake -DBUILD_MODULE=on
make
```

### Enabling the tokenizer

To enable the tokenizer add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=legacy
```

The Postgresql module for the tokenizer is available in the `module` directory
and also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with

```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```

This is in particular useful when the database runs on a different server.
See [Advanced installations](../admin/Advanced-Installations.md#using-an-external-postgresql-database) for details.

There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.

## ICU tokenizer

The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
2 changes: 0 additions & 2 deletions docs/develop/Testing.md
@@ -72,8 +72,6 @@ The tests can be configured with a set of environment variables (`behave -D key=
* `DB_PORT` - (optional) port of database on host
* `DB_USER` - (optional) username of database login
* `DB_PASS` - (optional) password for database login
* `SERVER_MODULE_PATH` - (optional) path on the Postgres server to Nominatim
module shared library file (only needed for legacy tokenizer)
* `REMOVE_TEMPLATE` - if true, the template and API database will not be reused
during the next run. Reusing the base templates speeds
up tests considerably but might lead to outdated errors