New arkcase-tika pod #84
base: develop
Changes from all commits
f473b8c
e885d00
e3f93e7
2eb58ce
43aaf40
f4ce5b9
22201da
e08dd2a
249e9c9
8219ee3
bdf5bfa
e59436a
dd21319
d40e4af
2b8a143
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -44,3 +44,6 @@ dependencies: | |
|
|
||
| - name: zookeeper | ||
| version: ~0.10.0-0 | ||
|
|
||
| - name: tika | ||
| version: ~0.10.0-0 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| apiVersion: v2 | ||
| name: tika | ||
| version: 0.10.0-0 | ||
| appVersion: "3.2.3" | ||
| description: A Helm chart for Apache Tika Server as used by ArkCase | ||
| type: application | ||
| dependencies: | ||
| - name: common | ||
| version: ~0.10.0-0 | ||
| repository: "https://arkcase.github.io/ark_helm_charts" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| supported: false |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,140 @@ | ||
| <?xml version="1.0" encoding="UTF-8"?> | ||
| <properties> | ||
| <service-loader initializableProblemHandler="ignore"/> | ||
|
|
||
| <parsers> | ||
| <parser class="org.apache.tika.parser.DefaultParser"> | ||
| <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/> | ||
| </parser> | ||
| <parser class="org.apache.tika.parser.DefaultParser"> | ||
| <parser-exclude class="org.apache.tika.parser.mp4.MP4Parser"/> | ||
| <parser-exclude class="org.apache.tika.parser.mp3.Mp3Parser"/> | ||
| <parser-exclude class="org.apache.tika.parser.audio.AudioParser"/> | ||
| </parser> | ||
| <parser class="com.armedia.acm.tika.parser.EnhancedMp4Parser"/> | ||
| <parser class="com.armedia.acm.tika.parser.EnhancedMp3Parser"/> | ||
| <parser class="com.armedia.acm.tika.parser.EnhancedAudioParser"/> | ||
| </parsers> | ||
|
|
||
| <metadataFilters> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter"> | ||
| <contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes> | ||
| <extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides> | ||
| </metadataFilter> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/> | ||
|
Comment on lines +20 to +25
Contributor: I see
Contributor (Author): Yes, this is how we extend Tika Server capabilities without modifying Tika's source code.
Contributor: Then we need to add those JARs' Nexus/Maven coordinates to the container's build process. Can you provide the list of them or create an MR for the container?
Contributor (Author): It's already created, but I need arkcase-tika merged first; until then, the JAR is not available on Nexus.
Contributor: You can still add the coordinates to the Dockerfile, commented out, along with the RUN commands to fetch them. The point here is to start getting everything lined up. Furthermore, it might actually behoove us to keep those JARs separate from ArkCase, since they're not specifically dependent on ArkCase to begin with ... are they?
Contributor (Author): Yes, it's a separate small module outside of ArkCase that contains only the custom parsing logic and filters. I have tested this locally using the updated ark_tika Docker image (including the arkcase-tika.jar) along with the Helm changes and the new Tika pod. The pod is up and running, the configuration is being picked up correctly, and all custom parsers/filters are properly registered. At the moment, I'm having trouble running the core pod with the latest develop branch. Once I get it running, I'll be able to fully confirm that the ArkCase-Tika integration works as expected. |
||
| </metadataFilters> | ||
|
|
||
| <server> | ||
| <params> | ||
| <!-- which port to start the server on. If you specify a range, | ||
| e.g. 9995-9998, TikaServerCli will start four forked servers, | ||
| one at each port. You can also specify multiple forked servers | ||
| via a comma-delimited value: 9995,9997. --> | ||
| <!-- | ||
| <port>9998</port> | ||
| <host>localhost</host> | ||
| --> | ||
|
|
||
| <!-- if specified, this will be the id that is used in the | ||
| /status endpoint and elsewhere. If an id is specified | ||
| and more than one forked processes are invoked, each process | ||
| will have an id followed by the port, e.g my_id-9998. If a | ||
| forked server has to restart, it will maintain its original id. | ||
| If not specified, a UUID will be generated. | ||
| --> | ||
| <id>${POD_NAME}</id> | ||
|
|
||
| <!-- Origin URL for cors requests. Set to '*' if you | ||
| want to allow all CORS requests. Leave blank or remove element | ||
| if you do not want to enable CORS. --> | ||
| <cors>*</cors> | ||
|
|
||
| <!-- which digests to calculate, comma delimited (e.g. md5,sha256); | ||
| optionally specify encoding followed by a colon (e.g. "sha1:32"). | ||
| Can be empty if you don't want to calculate a digest --> | ||
| <digest>sha256</digest> | ||
|
|
||
| <!-- how much to read to memory during the digest phase before | ||
| spooling to disc...only if digest is selected --> | ||
| <!-- Start off with exactly 1GiB --> | ||
| <digestMarkLimit>1073741824</digestMarkLimit> | ||
|
|
||
| <!-- request URI log level 'debug' or 'info' --> | ||
| <logLevel>${LOG_LEVEL}</logLevel> | ||
|
|
||
| <!-- whether or not to return the stacktrace in the data returned | ||
| to the user when a parse exception happens--> | ||
| <returnStackTrace>false</returnStackTrace> | ||
|
|
||
| <!-- If set to 'true', this runs tika server "in process" | ||
| in the legacy 1.x mode. | ||
| This means that the server will be susceptible to infinite loops | ||
| and crashes. | ||
| If set to 'false', the server will spawn a forked | ||
| process and restart the forked process on catastrophic failures | ||
| (this was called -spawnChild mode in 1.x). | ||
| noFork=false is the default in 2.x --> | ||
| <noFork>false</noFork> | ||
|
|
||
| <!-- maximum time to allow per parse before shutting down and restarting | ||
| the forked parser. Not allowed if noFork=true. --> | ||
| <taskTimeoutMillis>300000</taskTimeoutMillis> | ||
|
|
||
| <!-- maximum amount of time to wait for a forked process to | ||
| start up. Not allowed if noFork=true. --> | ||
| <maxForkedStartupMillis>120000</maxForkedStartupMillis> | ||
|
|
||
| <!-- maximum number of times to allow a specific forked process | ||
| to be restarted. | ||
| Not allowed if noFork=true. --> | ||
| <maxRestarts>-1</maxRestarts> | ||
|
|
||
| <!-- maximum files to parse per forked process before | ||
| restarting the forked process to clear potential | ||
| memory leaks. | ||
| Not allowed if noFork=true. --> | ||
| <maxFiles>100000</maxFiles> | ||
|
|
||
| <!-- if you want to specify a specific javaPath for | ||
| the forked process. This path should end | ||
| the application 'java', e.g. /my/special-java/java | ||
| Not allowed if noFork=true. --> | ||
| <javaPath>java</javaPath> | ||
|
|
||
| <!-- jvm args to use in the forked process --> | ||
| <forkedJvmArgs> | ||
| <arg>-Xms1g</arg> | ||
| <arg>-Xmx1g</arg> | ||
| </forkedJvmArgs> | ||
|
|
||
| <!-- this must be set to true for any handler that uses a fetcher or emitter. | ||
| These pipes features are inherently unsecure because the client has the | ||
| same read/write access as the tika-server process. Implementers must secure | ||
| Tika server so that only their clients can reach it. A byproduct of | ||
| setting this to true is that the /status endpoint is turned on --> | ||
| <enableUnsecureFeatures>true</enableUnsecureFeatures> | ||
|
|
||
| <!-- you can optionally select specific endpoints to turn on/load. This can | ||
| improve resource usage and decrease your attack surface. If you want to | ||
| access the status endpoint, specify it here or set unsecureFeatures to true --> | ||
| <!-- | ||
| <endpoints> | ||
| <endpoint>status</endpoint> | ||
| <endpoint>rmeta</endpoint> | ||
| </endpoints> | ||
| --> | ||
| </params> | ||
| <tlsConfig> | ||
| <params> | ||
| <active>true</active> | ||
| <keyStoreType>PKCS12</keyStoreType> | ||
| <keyStoreFile>${KEYSTORE}</keyStoreFile> | ||
| <keyStorePassword>${KEYSTORE_PASSWORD}</keyStorePassword> | ||
| <trustStoreType>PKCS12</trustStoreType> | ||
| <trustStoreFile>${TRUSTSTORE}</trustStoreFile> | ||
| <trustStorePassword>${TRUSTSTORE_PASSWORD}</trustStorePassword> | ||
| </params> | ||
| </tlsConfig> | ||
| </server> | ||
| </properties> | ||
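
Reviewer note: the `contentTypeFixes` and `extensionOverrides` values in the config above are comma-delimited lists of `from:to` pairs. The sketch below shows one plausible reading of those strings. It is not the code of `ContentTypeNormalizationFilter` (which is not part of this diff); the function names `parse_pairs` and `normalize`, the lowercasing, and the error handling are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def parse_pairs(spec: str) -> Dict[str, str]:
    """Split 'a:b,c:d' into {'a': 'b', 'c': 'd'}, skipping blank entries.

    Assumption: entries are separated by commas and each entry is 'from:to'.
    """
    pairs: Dict[str, str] = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        source, sep, target = entry.partition(":")
        if not sep or not source.strip() or not target.strip():
            raise ValueError(f"malformed mapping entry: {entry!r}")
        pairs[source.strip().lower()] = target.strip().lower()
    return pairs


# Values copied verbatim from the tika-config.xml in this PR.
CONTENT_TYPE_FIXES = parse_pairs(
    "audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,"
    "audio/x-flac:audio/flac,video/quicktime:video/mp4"
)
EXTENSION_OVERRIDES = parse_pairs(".mpga:.mp3,.qt:.mp4")


def normalize(content_type: str, extension: str) -> Tuple[str, str]:
    """Return the (content type, extension) pair after applying both maps.

    Values without a mapping are returned unchanged.
    """
    return (
        CONTENT_TYPE_FIXES.get(content_type.lower(), content_type),
        EXTENSION_OVERRIDES.get(extension.lower(), extension),
    )
```

Under this reading, `normalize("audio/mpeg", ".mpga")` yields `("audio/mp3", ".mp3")`, while a type with no entry, such as `application/pdf`, passes through untouched.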
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| network: | ||
| enabled: true | ||
| template: | ||
| mode: any | ||
| initialDelay: 10 | ||
| delay: 10 | ||
| timeout: 10 | ||
| attempts: 60 | ||
| mode: all | ||
| dependencies: | ||
| acme: | ||
| url: "@env:ACME_URL" | ||
|
|
||
| settings: | ||
| # We specifically don't consume from ACME b/c those are handled directly | ||
| # by the ACME templates, and there's no need to address them beyond that | ||
| # acme: {} | ||
| # | ||
| # We specifically don't consume from ZooKeeper b/c those are handled directly | ||
| # by the Clustering templates, and there's no need to address them beyond that | ||
| # zookeeper: {} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| {{- if and (include "arkcase.subsystem.enabled" $) (not (include "arkcase.subsystem.external" $)) }} | ||
| {{- $cluster := (include "arkcase.cluster" $ | fromYaml) }} | ||
| {{- $replicas := ($cluster.replicas | int) -}} | ||
| --- | ||
| apiVersion: apps/v1 | ||
| kind: Deployment | ||
| metadata: | ||
| name: {{ include "arkcase.fullname" $ | quote }} | ||
| namespace: {{ $.Release.Namespace | quote }} | ||
| labels: {{- include "arkcase.labels.service" $ | nindent 4 }} | ||
| {{- with ($.Values.labels).common }} | ||
| {{- toYaml . | nindent 4 }} | ||
| {{- end }} | ||
| {{- with ($.Values.annotations).common }} | ||
| annotations: {{- toYaml . | nindent 4 }} | ||
| {{- end }} | ||
| spec: | ||
| replicas: {{ $replicas }} | ||
| selector: &labelSelector | ||
| matchLabels: {{- include "arkcase.labels.matchLabels" $ | nindent 6 }} | ||
| strategy: | ||
| type: Recreate | ||
| template: | ||
| metadata: | ||
| name: {{ include "arkcase.fullname" $ | quote }} | ||
| namespace: {{ $.Release.Namespace | quote }} | ||
| labels: {{- include "arkcase.labels.service" $ | nindent 8 }} | ||
| {{- include "arkcase.labels.deploys" "tika" | nindent 8 }} | ||
| {{- with ($.Values.labels).common }} | ||
| {{- toYaml . | nindent 8 }} | ||
| {{- end }} | ||
| annotations: | ||
| # NB: Both these annotation values must be of type "string" | ||
| prometheus.io/scrape: "true" | ||
| prometheus.io/port: "9100" | ||
| {{- with ($.Values.annotations).common }} | ||
| {{- toYaml . | nindent 8 }} | ||
| {{- end }} | ||
| spec: | ||
| affinity: | ||
| podAntiAffinity: | ||
| {{- if $cluster.onePerHost }} | ||
| requiredDuringSchedulingIgnoredDuringExecution: | ||
| - topologyKey: "kubernetes.io/hostname" | ||
| namespaces: [ {{ $.Release.Namespace | quote }} ] | ||
| labelSelector: | ||
| <<: *labelSelector | ||
| {{- else }} | ||
| preferredDuringSchedulingIgnoredDuringExecution: | ||
| - weight: 1 | ||
| podAffinityTerm: | ||
| topologyKey: "kubernetes.io/hostname" | ||
| namespaces: [ {{ $.Release.Namespace | quote }} ] | ||
| labelSelector: | ||
| <<: *labelSelector | ||
| {{- end }} | ||
| {{- include "arkcase.image.pullSecrets" $ | nindent 6 }} | ||
| {{- with $.Values.hostAliases }} | ||
| hostAliases: {{- toYaml . | nindent 8 }} | ||
| {{- end }} | ||
| {{- if $.Values.schedulerName }} | ||
| schedulerName: {{ $.Values.schedulerName | quote }} | ||
| {{- end }} | ||
| securityContext: {{- include "arkcase.securityContext" $ | nindent 8 }} | ||
| terminationGracePeriodSeconds: 15 | ||
| initContainers: | ||
| - name: init-set-permissions | ||
| {{- include "arkcase.image" (dict "ctx" $ "name" "setperm" "repository" "arkcase/setperm") | nindent 10 }} | ||
| env: {{- include "arkcase.tools.baseEnv" $ | nindent 12 }} | ||
| - name: TEMP_DIR | ||
| value: &tempDir "/app/temp" | ||
| - name: JOBS | ||
| value: |- | ||
| jobs: | ||
| - ownership: {{ coalesce ($.Values.persistence).ownership "1999:1999" | quote }} | ||
| permissions: "u=rwX,g=rX,o=" | ||
| flags: [ "recurse", "forced", "create", "changes" ] | ||
| targets: [ "$(TEMP_DIR)" ] | ||
| volumeMounts: | ||
| - name: &tempVol "temp" | ||
| mountPath: *tempDir | ||
| containers: | ||
| - name: tika | ||
| {{- include "arkcase.image" $ | nindent 10 }} | ||
| env: {{- include "arkcase.tools.baseEnv" $ | nindent 12 }} | ||
| {{- include "arkcase.acme.env" $ | nindent 12 }} | ||
| {{- include "arkcase.alt-java" $ | nindent 12 }} | ||
| - name: ACME_KEYSTORE_WITH_TRUSTS | ||
| value: "true" | ||
| - name: TEMP_DIR | ||
| value: *tempDir | ||
| {{- if $.Values.env }} | ||
| {{- $.Values.env | toYaml | nindent 12 }} | ||
| {{- end }} | ||
| {{- include "arkcase.subsystem.ports" $ | nindent 10 }} | ||
| command: [ "/entrypoint" ] | ||
| resources: {{- include "arkcase.resources" $ | nindent 12 }} | ||
| securityContext: {{- include "arkcase.securityContext" (dict "ctx" $ "container" "tika") | nindent 12 }} | ||
| volumeMounts: | ||
| - name: *tempVol | ||
| mountPath: *tempDir | ||
| {{- include "arkcase.acme.volumeMount" $ | nindent 12 }} | ||
| {{- include "arkcase.file-resource.volumeMount" (dict "ctx" $ "mountPath" "/entrypoint") | nindent 12 }} | ||
| {{- include "arkcase.file-resource.volumeMount" (dict "ctx" $ "mountPath" "/app/conf/tika-config.xml" "subPath" "tika-config.xml") | nindent 12 }} | ||
| volumes: | ||
| - name: *tempVol | ||
| emptyDir: | ||
| sizeLimit: {{ ((.Values.persistence).volumeSize).temp | default "2Gi" }} | ||
| {{- include "arkcase.acme.volume" $ | nindent 8 }} | ||
| {{- include "arkcase.file-resource.volumes" $ | nindent 8 }} | ||
| {{- end }} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| {{- include "arkcase.file-resources" $ -}} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| {{- if not (include "arkcase.subsystem.external" $) -}} | ||
| {{- include "arkcase.subsystem.service" $ -}} | ||
| {{- end -}} |
Non-executable files don't need permissions specified. That extension was added to the scripts specifically for executable files.

done