Use Elasticsearch 8 (#1309)
* Use Elasticsearch 8.8
* Add note about _es_id
* elasticCluster isn't actually ever used by the ES client, so remove it
* Update ES instructions in README
* ES index name: whelk_dev -> libris_local
* Increase number of shards and replicas

Changes necessary for XL to work with Elasticsearch 8. The main change is that ES 8 has security (and TLS) enabled by default, so we now have to authenticate with a username and password over HTTPS when talking to the ES cluster (certificate generation and the like is handled automatically by ES when it starts for the first time).

The devops part has already been merged.

How to tune things for search speed is a separate issue that we can look at more once we have QA running in the new ES cluster. The increased number_of_shards / number_of_replicas is just a starting point based on our current usage in the main cluster and current recommendations (and anyway we should make number_of_shards and number_of_replicas configurable, but that's a devops thing).
andersju authored Sep 18, 2023
1 parent f90a508 commit 63f7a20
Showing 10 changed files with 92 additions and 45 deletions.
33 changes: 17 additions & 16 deletions README.md
@@ -46,7 +46,7 @@ Related external repositories:

## Dependencies

The instructions below assume an Ubuntu 20.04 system (Debian should be identical), but should work
The instructions below assume an Ubuntu 22.04 system (Debian should be identical), but should work
for e.g. Fedora/CentOS/RHEL with minor adjustments.

1. [Gradle](http://gradle.org/)
@@ -55,15 +55,10 @@ for e.g. Fedora/CentOS/RHEL with minor adjustments.
[gradle wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.html)
to automatically get the specified version of Gradle and Groovy.

2. [Elasticsearch](http://elasticsearch.org/) (version 7.x)
2. [Elasticsearch](http://elasticsearch.org/) (version 8.x)

[Download Elasticsearch](https://www.elastic.co/downloads/elasticsearch-oss)
(for Ubuntu/Debian, select "Install with apt-get"; before importing the Elasticsearch
PGP key you might have to do `sudo apt install gnupg` if you're running a minimal distribution.)

**NOTE:**
* We use the elasticsearch-oss version.
* The [ICU Analysis plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/7.12/analysis-icu.html) (`icu-analysis`) must be installed; see "Setting up Elasticsearch" below.
[Download Elasticsearch](https://www.elastic.co/downloads/elasticsearch)
For Ubuntu/Debian, select "apt-get" and follow the instructions.

3. [PostgreSQL](https://www.postgresql.org/) (version 14.2 or later)

@@ -159,8 +154,11 @@ whelk_dev=> \q

### Setting up Elasticsearch

Edit `/etc/elasticsearch/elasticsearch.yml`. Uncomment `cluster.name` and set it to something unique
on the network. This name is later specified when you configure the XL system.
Assuming Elasticsearch is already running, first set the password of the `elastic` user:

```
printf "elastic\nelastic" | sudo /usr/share/elasticsearch/bin/elasticsearch-reset-password -b -i -u elastic
```

Next, install the ICU Analysis plugin:

@@ -174,8 +172,14 @@ Finally, (re)start Elasticsearch:
sudo systemctl restart elasticsearch
```

(To adjust the JVM heap size for Elasticsearch, edit `/etc/elasticsearch/jvm.options` and then restart
Elasticsearch.)
To adjust the JVM heap size for Elasticsearch, edit `/etc/elasticsearch/jvm.options`
and then restart Elasticsearch. In a local development environment, you might want to
add the following to prevent Elasticsearch from hogging memory:

```
-Xms2g
-Xmx2g
```

### Configuring secrets

@@ -184,9 +188,6 @@ Use `librisxl/secret.properties.in` as a starting point:
```
cd librisxl
cp secret.properties.in secret.properties
# In secret.properties, set:
# - elasticCluster to whatever you set cluster.name to in the Elasticsearch configuration above.
vim secret.properties
# Make sure libris.kb.se.localhost points to 127.0.0.1
echo '127.0.0.1 libris.kb.se.localhost' | sudo tee -a /etc/hosts
```
3 changes: 1 addition & 2 deletions gui-whelktool/cli_run_local.sh
@@ -3,5 +3,4 @@
set -uex

username=$(whoami)

java -DsecretBaseUri=http://libris.kb.se.localhost:5000/ -DsecretSqlUrl=jdbc:postgresql://$username:_XL_PASSWORD_@localhost/whelk_dev -DsecretElasticHost=localhost -DsecretElasticCluster=elasticsearch_$username -DsecretElasticIndex=whelk_dev -DsecretApplicationId=https://libris.kb.se/ -DsecretSystemContextUri=https://id.kb.se/sys/context/kbv -DsecretLocales=sv,en -DsecretTimezone=Europe/Stockholm -jar build/libs/gui-whelktool.jar
java -DsecretBaseUri=http://libris.kb.se.localhost:5000/ -DsecretSqlUrl=jdbc:postgresql://$username:_XL_PASSWORD_@localhost/whelk_dev -DsecretElasticHost=localhost -DsecretElasticCluster=elasticsearch_$username -DsecretElasticIndex=libris_local -DsecretApplicationId=https://libris.kb.se/ -DsecretSystemContextUri=https://id.kb.se/sys/context/kbv -DsecretLocales=sv,en -DsecretTimezone=Europe/Stockholm -jar build/libs/gui-whelktool.jar
2 changes: 0 additions & 2 deletions gui-whelktool/src/main/java/whelk/gui/RunPanel.java
@@ -90,8 +90,6 @@ public void run()
System.getProperty("secretBaseUri") + "\n" +
"elasticHost = " +
System.getProperty("secretElasticHost") + "\n" +
"elasticCluster = " +
System.getProperty("secretElasticCluster") + "\n" +
"elasticIndex = " +
System.getProperty("secretElasticIndex") + "\n" +
"applicationId = " +
7 changes: 5 additions & 2 deletions librisxl-tools/elasticsearch/libris_config.json
@@ -7,8 +7,8 @@
"limit": 100000
}
},
"number_of_shards": 6,
"number_of_replicas": 1
"number_of_shards": 15,
"number_of_replicas": 2
},
"index.query.default_field": "_all",
"analysis": {
@@ -310,6 +310,9 @@
"type": "text",
"store": false,
"analyzer": "softmatcher"
},
"_es_id": {
"type": "keyword"
}
},
"date_detection": false,
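To put the new `number_of_shards` / `number_of_replicas` values in perspective, here is a small arithmetic sketch. It relies only on the general Elasticsearch rule that each primary shard gets `number_of_replicas` copies; the class and method names are hypothetical and not part of this commit.

```java
// Sketch only: counts the shard copies implied by a pair of index settings.
final class ShardMath {
    /** Total shard copies = primaries * (1 primary + replicas per primary). */
    static int totalShardCopies(int numberOfShards, int numberOfReplicas) {
        return numberOfShards * (1 + numberOfReplicas);
    }

    public static void main(String[] args) {
        // Old settings: 6 primaries, 1 replica each.
        System.out.println("before: " + totalShardCopies(6, 1));
        // New settings: 15 primaries, 2 replicas each.
        System.out.println("after:  " + totalShardCopies(15, 2));
    }
}
```

The index goes from 12 to 45 shard copies in total, which is worth keeping in mind on a single-node development setup where replicas cannot be allocated.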
1 change: 0 additions & 1 deletion oaipmh/secret_dev_test.properties
@@ -1,6 +1,5 @@
baseUri = https://libris-dev.kb.se/
sqlUrl = jdbc:postgresql://whelk:[email protected]/whelk
elasticHost = es01-dev.libris.kb.se
elasticCluster = lxdev_elasticsearch
elasticIndex = libris
oauth2verifyurl = https://bibdb-stg.libris.kb.se/api/o/verify
6 changes: 3 additions & 3 deletions secret.properties.in
@@ -12,9 +12,9 @@ sqlUrl = jdbc:postgresql://whelk:whelk@localhost/whelk_dev
sqlMaxPoolSize = 4

elasticHost = localhost:9200
# elasticCluster should match the value of cluster.name in elasticsearch.yml
elasticCluster = <something unique>
elasticIndex = whelk_dev
elasticIndex = libris_local
elasticUser = elastic
elasticPassword = elastic

oauth2verifyurl = https://login-dev.libris.kb.se/oauth/verify
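As a rough illustration of how the new `elasticUser` and `elasticPassword` keys could be read and validated, the sketch below parses properties text with `java.util.Properties` and fails early when a key is missing. The `SecretPropertiesSketch` class and its `require` helper are hypothetical; the commit itself simply calls `props.getProperty(...)` in the `ElasticSearch` constructor.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: not part of the commit.
final class SecretPropertiesSketch {
    /** Parses text in the same key = value format as secret.properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Hypothetical helper: returns the trimmed value or fails if it is missing or empty. */
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing property: " + key);
        }
        return value.trim();
    }
}
```

Failing on a missing `elasticUser` or `elasticPassword` at startup would probably be easier to diagnose than a 401 response from Elasticsearch later on.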

52 changes: 46 additions & 6 deletions whelk-core/src/main/groovy/whelk/component/ElasticClient.groovy
@@ -11,24 +11,36 @@ import io.github.resilience4j.retry.Retry
import io.github.resilience4j.retry.RetryConfig
import io.github.resilience4j.retry.RetryRegistry
import io.prometheus.client.CollectorRegistry
import org.apache.http.Header
import org.apache.http.HttpEntity
import org.apache.http.HttpHeaders
import org.apache.http.HttpResponse
import org.apache.http.client.HttpClient
import org.apache.http.client.config.RequestConfig
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.methods.HttpPost
import org.apache.http.client.methods.HttpPut
import org.apache.http.client.methods.HttpRequestBase
import org.apache.http.config.Registry
import org.apache.http.config.RegistryBuilder
import org.apache.http.conn.HttpClientConnectionManager
import org.apache.http.conn.socket.ConnectionSocketFactory
import org.apache.http.conn.socket.PlainConnectionSocketFactory
import org.apache.http.conn.ssl.NoopHostnameVerifier
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy
import org.apache.http.entity.ContentType
import org.apache.http.entity.StringEntity
import org.apache.http.impl.client.CloseableHttpClient
import org.apache.http.impl.client.HttpClients
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager
import org.apache.http.message.BasicHeader
import org.apache.http.ssl.SSLContexts
import org.apache.http.util.EntityUtils
import whelk.exception.ElasticIOException
import whelk.exception.UnexpectedHttpStatusException

import javax.net.ssl.SSLContext
import java.time.Duration
import java.util.function.Function

@@ -62,8 +74,16 @@ class ElasticClient {
RetryRegistry retryRegistry = RetryRegistry.ofDefaults()
Retry globalRetry

static ElasticClient withDefaultHttpClient(List<String> elasticHosts) {
HttpClientConnectionManager cm = new PoolingHttpClientConnectionManager()
static ElasticClient withDefaultHttpClient(List<String> elasticHosts, String elasticUser, String elasticPassword) {
TrustStrategy acceptingTrustStrategy = (cert, authType) -> true
SSLContext sslContext = SSLContexts.custom().loadTrustMaterial(null, acceptingTrustStrategy).build()
SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE)
Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory> create()
.register("https", sslsf)
.register("http", new PlainConnectionSocketFactory())
.build()

HttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(registry)
cm.setMaxTotal(CONNECTION_POOL_SIZE)
cm.setDefaultMaxPerRoute(MAX_CONNECTIONS_PER_HOST)
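The `TrustStrategy` in this hunk returns `true` for every certificate and is paired with `NoopHostnameVerifier`, so the client accepts any server certificate, including the self-signed one Elasticsearch generates on first start. To show what that amounts to, here is a JDK-only sketch of an equivalent trust-all `SSLContext`, written without Apache HttpClient. The class and method names are hypothetical, and disabling verification like this is presumably only appropriate in development or on a trusted network.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Sketch only: a plain-JDK counterpart to the trust-all setup in the commit.
final class TrustAllSketch {
    /** Builds an SSLContext whose trust manager performs no certificate checks. */
    static SSLContext trustAllContext() {
        TrustManager[] trustAll = {
            new X509TrustManager() {
                @Override
                public void checkClientTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: every client certificate is accepted.
                }

                @Override
                public void checkServerTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: every server certificate is accepted.
                }

                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            }
        };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, trustAll, null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create SSLContext", e);
        }
    }
}
```

The connection is still encrypted, but the client no longer verifies who it is talking to. Loading the cluster's CA certificate into a trust store would restore that check.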

@@ -73,16 +93,30 @@ class ElasticClient {
.setSocketTimeout(READ_TIMEOUT_MS)
.build()

String auth = elasticUser + ":" + elasticPassword
Header authHeader = new BasicHeader(HttpHeaders.AUTHORIZATION, "Basic " + auth.bytes.encodeBase64().toString())
List<Header> headers = List.of(authHeader)

CloseableHttpClient httpClient = HttpClients.custom()
.setSSLSocketFactory(sslsf)
.setConnectionManager(cm)
.setDefaultRequestConfig(requestConfig)
.setDefaultHeaders(headers)
.build()

return new ElasticClient(httpClient, elasticHosts, true)
}

static ElasticClient withBulkHttpClient(List<String> elasticHosts) {
HttpClientConnectionManager cm = new PoolingHttpClientConnectionManager()
static ElasticClient withBulkHttpClient(List<String> elasticHosts, String elasticUser, String elasticPassword) {
TrustStrategy acceptingTrustStrategy = (cert, authType) -> true
SSLContext sslContext = SSLContexts.custom().loadTrustMaterial(null, acceptingTrustStrategy).build()
SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE)
Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory> create()
.register("https", sslsf)
.register("http", new PlainConnectionSocketFactory())
.build()

HttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(registry)
cm.setMaxTotal(CONNECTION_POOL_SIZE)
cm.setDefaultMaxPerRoute(MAX_CONNECTIONS_PER_HOST)

@@ -91,11 +125,17 @@ class ElasticClient {
.setSocketTimeout(BATCH_READ_TIMEOUT_MS)
.build()

String auth = elasticUser + ":" + elasticPassword
Header authHeader = new BasicHeader(HttpHeaders.AUTHORIZATION, "Basic " + auth.bytes.encodeBase64().toString())
List<Header> headers = List.of(authHeader)

CloseableHttpClient httpClient = HttpClients.custom()
.setSSLSocketFactory(sslsf)
.setConnectionManager(cm)
.setDefaultRequestConfig(requestConfig)
.setDefaultHeaders(headers)
.build()

return new ElasticClient(httpClient, elasticHosts, false)
}
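Both factory methods build the same `Authorization` header from `elasticUser` and `elasticPassword`. A minimal Java sketch of that HTTP Basic encoding, using only `java.util.Base64`, is given here; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch only: mirrors the header construction in the commit using plain JDK classes.
final class BasicAuthSketch {
    /** Returns the value of an HTTP Basic Authorization header for the given credentials. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        byte[] raw = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        System.out.println("Authorization: " + basicAuthHeader("elastic", "elastic"));
    }
}
```

Unlike Groovy's `auth.bytes`, which uses the platform default charset, this sketch fixes the encoding to UTF-8 so that non-ASCII passwords encode the same way on every machine. Note that Base64 is an encoding, not encryption, so the header is only protected by the TLS connection.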

30 changes: 19 additions & 11 deletions whelk-core/src/main/groovy/whelk/component/ElasticSearch.groovy
@@ -46,7 +46,8 @@ class ElasticSearch {

String defaultIndex = null
private List<String> elasticHosts
private String elasticCluster
private String elasticUser
private String elasticPassword
private ElasticClient client
private ElasticClient bulkClient
private boolean isPitApiAvailable = false
@@ -56,18 +57,20 @@ class ElasticSearch {
ElasticSearch(Properties props) {
this(
props.getProperty("elasticHost"),
props.getProperty("elasticCluster"),
props.getProperty("elasticIndex")
props.getProperty("elasticIndex"),
props.getProperty("elasticUser"),
props.getProperty("elasticPassword")
)
}

ElasticSearch(String elasticHost, String elasticCluster, String elasticIndex) {
ElasticSearch(String elasticHost, String elasticIndex, String elasticUser, String elasticPassword) {
this.elasticHosts = getElasticHosts(elasticHost)
this.elasticCluster = elasticCluster
this.defaultIndex = elasticIndex
this.elasticUser = elasticUser
this.elasticPassword = elasticPassword

client = ElasticClient.withDefaultHttpClient(elasticHosts)
bulkClient = ElasticClient.withBulkHttpClient(elasticHosts)
client = ElasticClient.withDefaultHttpClient(elasticHosts, elasticUser, elasticPassword)
bulkClient = ElasticClient.withBulkHttpClient(elasticHosts, elasticUser, elasticPassword)

new Timer("ElasticIndexingRetries", true).schedule(new TimerTask() {
void run() {
@@ -119,7 +122,7 @@ class ElasticSearch {
host = host.trim()
if (!host.contains(":"))
host += ":9200"
hosts.add("http://" + host)
hosts.add("https://" + host)
}
return hosts
}
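The host normalization in this hunk can be sketched in plain Java as follows. The hunk shows only the loop body, so the comma-separated input format is an assumption, as are the class and method names; the trimming, default port and `https://` prefix follow the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: approximates getElasticHosts after the switch to HTTPS.
final class ElasticHostsSketch {
    /** Turns "hostA, hostB:9201" into a list of https:// base URLs. */
    static List<String> toHttpsUrls(String elasticHost) {
        List<String> hosts = new ArrayList<>();
        // Assumption: multiple hosts are separated by commas.
        for (String host : elasticHost.split(",")) {
            host = host.trim();
            if (host.isEmpty()) {
                continue; // ignore stray commas
            }
            if (!host.contains(":")) {
                host += ":9200"; // default Elasticsearch HTTP port
            }
            hosts.add("https://" + host);
        }
        return hosts;
    }
}
```

Because the scheme is now hard-coded to `https://`, a cluster with security disabled (plain HTTP) would presumably no longer be reachable without a further change.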
@@ -390,6 +393,12 @@ class ElasticSearch {
}
}

// In ES up until 7.8 we could use the _id field for aggregations and sorting, but it was discouraged
// for performance reasons. In 7.9 such use was deprecated, and since 8.x it's no longer supported, so
// we follow the advice and use a separate field.
// (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/mapping-id-field.html).
framed["_es_id"] = toElasticId(copy.getShortId())

if (log.isTraceEnabled()) {
log.trace("Framed data: ${framed}")
}
@@ -525,7 +534,7 @@ class ElasticSearch {
Map query = [
'bool': ['should': t1 + t2 ]
]

Scroll<String> ids = new DefaultScroll(query)
try {
ids.hasNext()
@@ -612,8 +621,7 @@ class ElasticSearch {
private abstract class Scroll<T> implements Iterator<T> {
final int FETCH_SIZE = 500

// TODO: change to _shard_doc when we upgrade to ES 7.12+
protected final List SORT = [['_id': 'asc']]
protected final List SORT = [['_es_id': 'asc']]
protected final List FILTER_PATH = ['took', 'hits.hits.sort', 'pit_id', 'hits.total.value']

Iterator<T> fetchedItems
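To illustrate why a sortable keyword field is needed here, the sketch below builds a paged search body sorted on `_es_id`. It assumes paging is done with Elasticsearch's `search_after` parameter, which this hunk does not show; the class and method names are hypothetical, while the page size of 500 and the sort clause come from the diff.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: builds a request body as a Map, to be serialized to JSON by the caller.
final class EsIdSortSketch {
    static final int FETCH_SIZE = 500;

    /**
     * Builds the body for one page. Pass null for the first page, and the
     * sort value of the last hit of the previous page for later pages.
     */
    static Map<String, Object> pageRequest(String lastSeenEsId) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("size", FETCH_SIZE);
        // Sorting on the dedicated keyword field instead of _id.
        body.put("sort", List.of(Map.of("_es_id", "asc")));
        if (lastSeenEsId != null) {
            // Assumption: resume after the last document seen so far.
            body.put("search_after", List.of(lastSeenEsId));
        }
        return body;
    }
}
```

A stable, unique sort key is what makes this kind of paging deterministic, which is presumably why `_es_id` mirrors the document id.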
2 changes: 1 addition & 1 deletion whelk-core/src/main/groovy/whelk/search/ElasticFind.groovy
@@ -96,7 +96,7 @@ class ElasticFind {
p.put("_offset", [Integer.toString(offset)] as String[])
p.put("_limit", [Integer.toString(PAGE_SIZE)] as String[])

p.putIfAbsent("_sort", ["_id"] as String[])
p.putIfAbsent("_sort", ["_es_id"] as String[])

return p
}
1 change: 0 additions & 1 deletion whelktool/src/main/groovy/datatool/WhelkTool.groovy
@@ -663,7 +663,6 @@ class WhelkTool {
if (whelk.elastic) {
log " ElasticSearch:"
log " hosts: ${whelk.elastic.elasticHosts}"
log " cluster: ${whelk.elastic.elasticCluster}"
log " index: ${whelk.elastic.defaultIndex}"
}
log "Using script: $scriptFile"