[ML] macOS: Controller process sometimes terminated with SIGKILL #2429

Open
davidkyle opened this issue Jan 4, 2023 · 2 comments

davidkyle commented Jan 4, 2023

Some developers working in the Elasticsearch repository have reported intermittent crashes of the machine learning controller process when running a locally built Elasticsearch. The usual symptom is that Elasticsearch fails to start and the log contains this message:

[ERROR][o.e.b.Elasticsearch      ] [runTask-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: Failure running machine learning native code. This could be due to running on an unsupported OS or distribution, missing OS libraries, or a problem with the temp directory. To bypass this problem by running Elasticsearch without machine learning functionality set [xpack.ml.enabled: false].

A crash report can be found in the macOS Console app.

Path:                /Users/USER/*/controller.app/Contents/MacOS/controller
Identifier:          co.elastic.ml-cpp.controller
Version:             8.7.0
Code Type:           ARM-64 (Native)

Exception Type:  EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
Exception Subtype: UNKNOWN_0x32 at 0x000000010249c000
Exception Codes: 0x0000000000000032, 0x000000010249c000
VM Region Info: 0x10249c000 is in 0x10249c000-0x1024b0000;  bytes after start: 0  bytes before end: 81919
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  mapped file                 10249c000-1024b0000    [   80K] r-x/r-x SM=COW  ...t_id=787f689f
      mapped file                 1024b0000-1024b4000    [   16K] rw-/rw- SM=COW  ...t_id=787f689f
Exception Note:  EXC_CORPSE_NOTIFY
Termination Reason: CODESIGNING 2 

The error has been observed on Apple silicon only (so far).

Reproducing

It has not been possible to reproduce the problem reliably, but once it occurs a crash report can be generated by running controller --help from the build output in the Elasticsearch repository:

<ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller --help
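
When the kernel kills the process, nothing is printed and only the exit status shows it. A POSIX shell reports a SIGKILL-terminated child as status 137 (128 + 9), so a check might look like the sketch below. The function name was_sigkilled is made up for illustration, and a program could in principle exit with 137 by itself, so treat a match as a hint rather than proof.

```shell
# Hypothetical helper: run a command and succeed (exit 0) only if the
# command ended with status 137, which is how a POSIX shell reports a
# child that was terminated by SIGKILL (128 + signal number 9).
was_sigkilled() {
    "$@" >/dev/null 2>&1
    [ $? -eq 137 ]
}
```

For example, was_sigkilled followed by the controller path above and --help would succeed only when the controller was killed on start.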

Running the app from a different location works?!

Copying the app to a folder in the home directory and running the copy does not result in a crash:

cd ~/Desktop
cp -r <ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app .
controller.app/Contents/MacOS/controller --help

Possible Causes

macOS Quarantine

No.

The downloaded controller.app does not have the quarantine attribute set:
find . -xattrname com.apple.quarantine returns nothing.

Security Policy

No.
After disabling security assessment with sudo spctl --global-disable, the controller app still crashes.

cd <ES_REPO>
sudo spctl --global-disable
sudo spctl --assess -vv ./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller

./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller: accepted
override=security disabled

echo $?
0

When security is enabled, spctl --assess returns the same message as codesign --verify:

cd <ES_REPO>
sudo spctl --assess -vv ./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller

./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller: code has no resources but signature indicates they must be present

echo $?
1

Code Signing

Maybe.

The crash report indicates that code signing is involved:

Exception Type:  EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))

and

Termination Reason: CODESIGNING 2 
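
To tell this failure apart from other controller crashes, a saved report can be searched for the two markers quoted above. This is a minimal sketch that assumes the report has been saved as plain text in the layout shown in Console; the function name is_codesign_kill is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) only if the plain-text crash report
# given as $1 contains both code-signing markers quoted in this issue.
is_codesign_kill() {
    grep -q 'Code Signature Invalid' "$1" &&
        grep -q 'Termination Reason: *CODESIGNING' "$1"
}
```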

Verifying the signature returns an error message:

cd <ES_REPO>
codesign -d --verify --verbose=4 ./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app


./distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app/Contents/MacOS/controller: code has no resources but signature indicates they must be present

It is not clear, however, whether that error is fatal.

Workarounds

In the commands below, replace elasticsearch-8.7.0-SNAPSHOT with your version.

  • Deleting the bundled app from the local build and rebuilding is the most reliable fix:
cd <ES_REPO>
rm -rf distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app
./gradlew run
  • Re-signing the app with an ad hoc signature works for some developers:
codesign --force --deep --sign - <ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app
  • If all else fails, restart the machine.

davidkyle commented

The error reappeared at random on my development machine (perhaps the ES build system downloaded new binaries?).

I was able to test re-signing the app with an ad hoc signature and found that it does stop the app from being killed on start:

codesign --force --deep --sign - <ES_REPO>/distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.7.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app

--sign - means use an ad hoc identity. This is a workaround for a local development machine only.

From the codesign man page:

If identity is the single letter "-" (dash), ad-hoc signing is performed.
Ad-hoc signing does not use an identity at all, and identifies exactly
one instance of code. Significant restrictions apply to the use of ad-hoc
signed code; consult documentation before using this.
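
Because ad hoc signing replaces the existing signature, it may be preferable to re-sign only when verification fails. The sketch below does that with the codesign flags already used in this issue. The function name resign_if_invalid is hypothetical, and it assumes that a failing codesign --verify is a reasonable trigger, which has not been confirmed here.

```shell
# Hypothetical helper for a local development machine only: re-sign the
# app bundle given as $1 with an ad hoc signature, but only when
# codesign --verify reports a problem with the current signature.
resign_if_invalid() {
    if codesign --verify "$1" 2>/dev/null; then
        echo "signature ok: $1"
    else
        echo "re-signing ad hoc: $1"
        codesign --force --deep --sign - "$1"
    fi
}
```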

davidkyle commented

Almost exactly one year later, this problem has returned.

The easiest way to resolve it is still to delete controller.app and rebuild:

cd <ES_REPO>
rm -rf distribution/archives/darwin-aarch64-tar/build/install/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/darwin-aarch64/controller.app
./gradlew run
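
Since the version directory changes between releases (8.7.0-SNAPSHOT earlier, 8.13.0-SNAPSHOT here), the bundle can also be located without hard-coding the version. A minimal sketch, assuming the build output always sits under distribution/archives as in the paths above; the function name find_controller_apps is made up for illustration.

```shell
# Hypothetical helper: print every controller.app directory under the
# build output of the Elasticsearch checkout given as $1, whatever the
# version directory is called. -prune stops find from descending into
# each bundle once it has been matched.
find_controller_apps() {
    find "$1/distribution/archives" -type d -name controller.app -prune 2>/dev/null
}
```

Each path it prints can then be removed with rm -rf before running ./gradlew run.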
