
Different behaviors for start index in primary and replica #116

Open
sarthakn7 opened this issue Jun 9, 2020 · 4 comments

@sarthakn7
Contributor

sarthakn7 commented Jun 9, 2020

Here are the results when nrtsearch is started with restored state and start_index is called:

  1. Primary: start_index fails with an "index not saved or committed" message in the exception (correction: the actual error is "no segments file found"). A subsequent start_index with restore also fails, because the directories were already created.
  2. Replica: start_index succeeds and the index is started with 0 segments. The replica also did not appear to retrieve the segments from the primary afterwards.
@umeshdangat
Member

@sarthakn7 I was not able to reproduce either of the behaviors you mention above.

These are the steps I used to try to reproduce the scenarios you mention:

Replica
Start JVM
JAVA_OPTS="-Xms16g -Xmx16g -Xss256k -XX:+UseG1GC -XX:+UseCompressedOops -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5006" ./build/install/nrtsearch/bin/lucene-server ~/scratch/nrtsearch/talk_v1/nrtsearch_generic_replica.yaml

Start index in restore mode as replica
curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/talk/startIndexReplicaRestore.json

Stop index
curl -XPOST localhost:9900/v1/stop_index -d '{"indexName": "talk_v1"}'

Start index without restore
curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/talk/startIndexReplica.json

Logs (also verified using v1/indices):
Jun 09, 2020 1:27:46 PM com.yelp.nrtsearch.server.grpc.LuceneServer$LuceneServerImpl startIndex
INFO: StartIndexHandler returned maxDoc: 22562973
numDocs: 22562973

Primary
Start JVM
JAVA_OPTS="-Xms16g -Xmx16g -Xss256k -XX:+UseG1GC -XX:+UseCompressedOops -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5006" ./build/install/nrtsearch/bin/lucene-server ~/scratch/platypus/talk_v1/nrtsearch_generic_primary.yaml

Start index in restore mode as primary
curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/lists_v1/startIndexPrimaryRestore.json

Stop index
curl -XPOST localhost:9900/v1/stop_index -d '{"indexName": "talk_v1"}'

Start index in non-restore mode as primary
curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/lists_v1/startIndexPrimary.json

Logs:
{"maxDoc":22562973,"numDocs":22562973,"segments":"StandardDirectoryReader(segments_2:501397:nrt

@sarthakn7
Contributor Author

sarthakn7 commented Jun 9, 2020

More detailed steps to reproduce:

  1. Delete all state, index and archiver directories
  2. Start JVM with restoreState: true

For primary:
3. Start index without restore: fails with an "index not saved or committed" message in the exception (correction: the actual error is "no segments file found")
4. Start index with restore: fails with a "directory already present" exception

For replica:
3. Start index without restore: works fine, and the index is started with 0 segments

@umeshdangat
Copy link
Member

Step 1 ensures that we delete all local state and index data.
Step 2 ensures we get the state back (the names of indexes previously backed up/committed).

Step 3, for both primary and replica, is bad input, since we are essentially saying "I have my previous state; use it and start the indexes I know of." At that point we assume the index dir is present.

  • Primary tries to create an IndexWriter and fails because there is no segments file (note: this is still not the error you report above).
  • Replica does not try to create an IndexWriter, so it simply creates the stub dirs (which the primary also does before it fails to create the IndexWriter).

Stack trace on failure to create the IndexWriter (for primary step 3 above):

Jun 09, 2020 2:48:54 PM com.yelp.nrtsearch.server.luceneserver.StartIndexHandler handle
SEVERE: Cannot start IndexState/ShardState
org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MMapDirectory@/nail/home/umesh/nrtsearch/primary_index/talk_v1/shard0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@32a9cbe8): files: [write.lock]
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:841)
        at com.yelp.nrtsearch.server.luceneserver.ShardState.startPrimary(ShardState.java:654)
        at com.yelp.nrtsearch.server.luceneserver.StartIndexHandler.handle(StartIndexHandler.java:91)
        at com.yelp.nrtsearch.server.grpc.LuceneServer$LuceneServerImpl.startIndex(LuceneServer.java:363)
        at com.yelp.nrtsearch.server.grpc.LuceneServerGrpc$MethodHandlers.invoke(LuceneServerGrpc.java:2352)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)
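The exception above is thrown from the IndexWriter constructor because the shard's index directory holds only write.lock and no segments* commit file. As a rough, stdlib-only illustration of that condition (not nrtsearch or Lucene code; `SegmentsFileCheck` and `hasSegmentsFile` are hypothetical names), a check might look like the sketch below. Within Lucene itself, `DirectoryReader.indexExists(Directory)` answers the same question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper mirroring the "no segments* file found" condition; not nrtsearch code. */
public class SegmentsFileCheck {

  /** Returns true if indexDir exists and contains a file whose name starts with "segments". */
  static boolean hasSegmentsFile(Path indexDir) throws IOException {
    if (!Files.isDirectory(indexDir)) {
      return false;
    }
    try (Stream<Path> files = Files.list(indexDir)) {
      return files
          .map(p -> p.getFileName().toString())
          .anyMatch(name -> name.startsWith("segments"));
    }
  }

  public static void main(String[] args) throws IOException {
    // Recreate the situation from the log: an index dir holding only write.lock.
    Path indexDir = Files.createTempDirectory("shard0-index");
    Files.createFile(indexDir.resolve("write.lock"));
    System.out.println("segments present: " + hasSegmentsFile(indexDir)); // false

    // After a restore (or a first commit) a segments_N file would exist.
    Files.createFile(indexDir.resolve("segments_2"));
    System.out.println("segments present: " + hasSegmentsFile(indexDir)); // true
  }
}
```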

So I think we can treat this as bad user input. That is, if we

  • already have a state dir
  • and do not have an index dir,

then the only valid start_index operation in this state is a restore, and any other start_index (without restore) should be rejected earlier. @sarthakn7 let me know if this approach makes sense and I'll code it up.
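The proposed rule can be sketched as a precondition check that runs before any stub directories are created, so a later restore does not trip over "directory already present". This is a minimal sketch under assumptions: `StartIndexPrecheck`, `validate`, and the `stateDir`/`indexDir`/`restore` parameters are hypothetical, and the real start_index handler may locate these directories differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the proposed start_index precondition; not nrtsearch code. */
public class StartIndexPrecheck {

  /**
   * Rejects a non-restore start_index when state exists but the index directory
   * does not. Intended to run before any directories are created.
   */
  static void validate(Path stateDir, Path indexDir, boolean restore) {
    boolean haveState = Files.isDirectory(stateDir);
    boolean haveIndex = Files.isDirectory(indexDir);
    if (haveState && !haveIndex && !restore) {
      throw new IllegalStateException(
          "index dir " + indexDir + " is missing but state exists; "
              + "only start_index with restore is allowed");
    }
  }

  public static void main(String[] args) throws Exception {
    // Assumed layout for illustration: a state dir exists, the index dir was deleted.
    Path root = Files.createTempDirectory("precheck");
    Path stateDir = Files.createDirectory(root.resolve("state"));
    Path indexDir = root.resolve("index");

    validate(stateDir, indexDir, true); // restore: accepted
    try {
      validate(stateDir, indexDir, false); // no restore and no index dir: rejected
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The check is role-agnostic on purpose: the thread treats step 3 as bad input for both primary and replica, so the same rejection would apply before either one creates directories.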

@sarthakn7
Contributor Author

@umeshdangat yes that makes sense 👍
