1 change: 1 addition & 0 deletions README.md
@@ -15,6 +15,7 @@ Specifically, the stack is comprised of the following separate components, each
- [PostgreSQL](https://www.postgresql.org/)/[MariaDB](https://mariadb.org/), for database storage (only one instance is needed)
- [Pentaho](https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html) (for reporting services)
- [Alfresco](https://www.alfresco.com/), for content storage services
- [Tika](https://tika.apache.org/), for document metadata and text extraction services

In particular, Pentaho and Alfresco are offered in both Enterprise and Community editions. The edition deployed is automatically selected by the framework, by way of detecting the presence of the [required license data](#licenses) in the configuration at deployment time.

1 change: 1 addition & 0 deletions doc/README.md
@@ -38,6 +38,7 @@ Specifically, the stack is comprised of the following separate components, each
- [Pentaho](https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html) (for reporting services)
- [Alfresco](https://www.alfresco.com/), for content storage services
- [Minio](https://min.io/), for content storage services in S3-compatible mode
- [Tika](https://tika.apache.org/), for document metadata and text extraction services

In particular, Pentaho and Alfresco are offered in both Enterprise and Community editions. The edition deployed is automatically selected by the framework, by way of detecting the presence of the [required license data](docs/Licenses.md) in the configuration at deployment time.

3 changes: 3 additions & 0 deletions src/app/Chart.yaml
@@ -44,3 +44,6 @@ dependencies:

- name: zookeeper
version: ~0.10.0-0

- name: tika
version: ~0.10.0-0
10 changes: 10 additions & 0 deletions src/app/charts/tika/Chart.yaml
@@ -0,0 +1,10 @@
apiVersion: v2
name: tika
version: 0.10.0-0
appVersion: "3.2.3"
description: A Helm chart for Apache Tika Server as used by ArkCase
type: application
dependencies:
- name: common
version: ~0.10.0-0
repository: "https://arkcase.github.io/ark_helm_charts"
1 change: 1 addition & 0 deletions src/app/charts/tika/clustering.yaml
@@ -0,0 +1 @@
supported: false
140 changes: 140 additions & 0 deletions src/app/charts/tika/files/config/txt/tika-config.xml
Contributor: Non-executable files don't require permissions being specified; that extension was added to the scripts specifically for executable files.

Contributor (author): Done.

@@ -0,0 +1,140 @@
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<service-loader initializableProblemHandler="ignore"/>

<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
</parser>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.mp4.MP4Parser"/>
<parser-exclude class="org.apache.tika.parser.mp3.Mp3Parser"/>
<parser-exclude class="org.apache.tika.parser.audio.AudioParser"/>
</parser>
<parser class="com.armedia.acm.tika.parser.EnhancedMp4Parser"/>
<parser class="com.armedia.acm.tika.parser.EnhancedMp3Parser"/>
<parser class="com.armedia.acm.tika.parser.EnhancedAudioParser"/>
</parsers>

<metadataFilters>
<metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter">
<contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes>
<extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides>
</metadataFilter>
<metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/>
<metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/>
Comment on lines +20 to +25

Contributor: I see com.armedia.acm.* ... do we have to add JARs of our own to extend the Tika container?

Contributor (author): Yes, this is how we extend Tika Server capabilities without modifying Tika's source code. We provide our custom parser implementations as a separate JAR, load them through the container's classpath, and then activate/configure them via tika-config.xml. This follows Tika's recommended extension mechanism.

Contributor: Then we need to add those JARs' Nexus/Maven coordinates to the container's build process. Can you provide the list for them, or create an MR for the container?

Contributor (author): It's already created, but I need arkcase-tika merged; until then, the JAR is not available on Nexus. ArkCase/ark_tika#1

Contributor: You can still add the coordinates to the Dockerfile, commented out ... same with the RUN commands to fetch them. The point here is to start getting everything lined up. Furthermore, it might actually behoove us to have those JARs be separate from ArkCase, since they're not specifically dependent on ArkCase to begin with ... are they?

Contributor (author): Yes, it's a separate small module outside of ArkCase that contains only the custom parsing logic and filters.

I have tested this locally using the updated ark_tika Docker image (including arkcase-tika.jar) along with the Helm changes and the new Tika pod. The pod is up and running, the configuration is picked up correctly, and all custom parsers/filters are properly registered.

At the moment, I'm experiencing issues running the core pod with the latest develop branch. Once I get it running, I'll be able to fully confirm that the ArkCase–Tika integration works as expected.

</metadataFilters>

<server>
<params>
<!-- which port to start the server on. If you specify a range,
e.g. 9995-9998, TikaServerCli will start four forked servers,
one at each port. You can also specify multiple forked servers
via a comma-delimited value: 9995,9997. -->
<!--
<port>9998</port>
<host>localhost</host>
-->

<!-- if specified, this will be the id that is used in the
/status endpoint and elsewhere. If an id is specified
and more than one forked processes are invoked, each process
will have an id followed by the port, e.g my_id-9998. If a
forked server has to restart, it will maintain its original id.
If not specified, a UUID will be generated.
-->
<id>${POD_NAME}</id>

<!-- Origin URL for cors requests. Set to '*' if you
want to allow all CORS requests. Leave blank or remove element
if you do not want to enable CORS. -->
<cors>*</cors>

<!-- which digests to calculate, comma delimited (e.g. md5,sha256);
optionally specify encoding followed by a colon (e.g. "sha1:32").
Can be empty if you don't want to calculate a digest -->
<digest>sha256</digest>

<!-- how much to read to memory during the digest phase before
spooling to disc...only if digest is selected -->
<!-- Start off with exactly 1GiB -->
<digestMarkLimit>1073741824</digestMarkLimit>

<!-- request URI log level 'debug' or 'info' -->
<logLevel>${LOG_LEVEL}</logLevel>

<!-- whether or not to return the stacktrace in the data returned
to the user when a parse exception happens-->
<returnStackTrace>false</returnStackTrace>

<!-- If set to 'true', this runs tika server "in process"
in the legacy 1.x mode.
This means that the server will be susceptible to infinite loops
and crashes.
If set to 'false', the server will spawn a forked
process and restart the forked process on catastrophic failures
(this was called -spawnChild mode in 1.x).
noFork=false is the default in 2.x -->
<noFork>false</noFork>

<!-- maximum time to allow per parse before shutting down and restarting
the forked parser. Not allowed if noFork=true. -->
<taskTimeoutMillis>300000</taskTimeoutMillis>

<!-- maximum amount of time to wait for a forked process to
start up. Not allowed if noFork=true. -->
<maxForkedStartupMillis>120000</maxForkedStartupMillis>

<!-- maximum number of times to allow a specific forked process
to be restarted.
Not allowed if noFork=true. -->
<maxRestarts>-1</maxRestarts>

<!-- maximum files to parse per forked process before
restarting the forked process to clear potential
memory leaks.
Not allowed if noFork=true. -->
<maxFiles>100000</maxFiles>

<!-- if you want to specify a specific javaPath for
the forked process. This path should end
the application 'java', e.g. /my/special-java/java
Not allowed if noFork=true. -->
<javaPath>java</javaPath>

<!-- jvm args to use in the forked process -->
<forkedJvmArgs>
<arg>-Xms1g</arg>
<arg>-Xmx1g</arg>
</forkedJvmArgs>

<!-- this must be set to true for any handler that uses a fetcher or emitter.
These pipes features are inherently unsecure because the client has the
same read/write access as the tika-server process. Implementers must secure
Tika server so that only their clients can reach it. A byproduct of
setting this to true is that the /status endpoint is turned on -->
<enableUnsecureFeatures>true</enableUnsecureFeatures>

<!-- you can optionally select specific endpoints to turn on/load. This can
improve resource usage and decrease your attack surface. If you want to
access the status endpoint, specify it here or set unsecureFeatures to true -->
<!--
<endpoints>
<endpoint>status</endpoint>
<endpoint>rmeta</endpoint>
</endpoints>
-->
</params>
<tlsConfig>
<params>
<active>true</active>
<keyStoreType>PKCS12</keyStoreType>
<keyStoreFile>${KEYSTORE}</keyStoreFile>
<keyStorePassword>${KEYSTORE_PASSWORD}</keyStorePassword>
<trustStoreType>PKCS12</trustStoreType>
<trustStoreFile>${TRUSTSTORE}</trustStoreFile>
<trustStorePassword>${TRUSTSTORE_PASSWORD}</trustStorePassword>
</params>
</tlsConfig>
</server>
</properties>
21 changes: 21 additions & 0 deletions src/app/charts/tika/subsys-deps.yaml
@@ -0,0 +1,21 @@
network:
enabled: true
template:
mode: any
initialDelay: 10
delay: 10
timeout: 10
attempts: 60
mode: all
dependencies:
acme:
url: "@env:ACME_URL"

settings:
# We specifically don't consume from ACME b/c those are handled directly
# by the ACME templates, and there's no need to address them beyond that
# acme: {}
#
# We specifically don't consume from ZooKeeper b/c those are handled directly
# by the Clustering templates, and there's no need to address them beyond that
# zookeeper: {}
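Assuming the retry fields above mean "wait `initialDelay` before the first probe, then make up to `attempts` probes, each bounded by `timeout` and separated by `delay`" (an assumption about the framework's semantics, not something the chart documents here), the worst-case wait for the ACME dependency can be estimated:

```python
# Values from subsys-deps.yaml above.
initial_delay = 10   # seconds before the first probe
delay = 10           # seconds between consecutive probes
timeout = 10         # seconds allowed per probe
attempts = 60        # maximum number of probes

# Worst case: every probe runs to its full timeout, with the full delay
# between each pair of probes.
worst_case = initial_delay + attempts * timeout + (attempts - 1) * delay
print(worst_case)  # 1200 seconds, i.e. 20 minutes
```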
111 changes: 111 additions & 0 deletions src/app/charts/tika/templates/ark-tika.yaml
@@ -0,0 +1,111 @@
{{- if and (include "arkcase.subsystem.enabled" $) (not (include "arkcase.subsystem.external" $)) }}
{{- $cluster := (include "arkcase.cluster" $ | fromYaml) }}
{{- $replicas := ($cluster.replicas | int) -}}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "arkcase.fullname" $ | quote }}
namespace: {{ $.Release.Namespace | quote }}
labels: {{- include "arkcase.labels.service" $ | nindent 4 }}
{{- with ($.Values.labels).common }}
{{- toYaml . | nindent 4 }}
{{- end }}
{{- with ($.Values.annotations).common }}
annotations: {{- toYaml . | nindent 4 }}
{{- end }}
spec:
replicas: {{ $replicas }}
selector: &labelSelector
matchLabels: {{- include "arkcase.labels.matchLabels" $ | nindent 6 }}
strategy:
type: Recreate
template:
metadata:
name: {{ include "arkcase.fullname" $ | quote }}
namespace: {{ $.Release.Namespace | quote }}
labels: {{- include "arkcase.labels.service" $ | nindent 8 }}
{{- include "arkcase.labels.deploys" "tika" | nindent 8 }}
{{- with ($.Values.labels).common }}
{{- toYaml . | nindent 8 }}
{{- end }}
annotations:
# NB: Both these annotation values must be of type "string"
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
{{- with ($.Values.annotations).common }}
{{- toYaml . | nindent 8 }}
{{- end }}
spec:
affinity:
podAntiAffinity:
{{- if $cluster.onePerHost }}
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: "kubernetes.io/hostname"
namespaces: [ {{ $.Release.Namespace | quote }} ]
labelSelector:
<<: *labelSelector
{{- else }}
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
podAffinityTerm:
topologyKey: "kubernetes.io/hostname"
namespaces: [ {{ $.Release.Namespace | quote }} ]
labelSelector:
<<: *labelSelector
{{- end }}
{{- include "arkcase.image.pullSecrets" $ | nindent 6 }}
{{- with $.Values.hostAliases }}
hostAliases: {{- toYaml . | nindent 8 }}
{{- end }}
{{- if $.Values.schedulerName }}
schedulerName: {{ $.Values.schedulerName | quote }}
{{- end }}
securityContext: {{- include "arkcase.securityContext" $ | nindent 8 }}
terminationGracePeriodSeconds: 15
initContainers:
- name: init-set-permissions
{{- include "arkcase.image" (dict "ctx" $ "name" "setperm" "repository" "arkcase/setperm") | nindent 10 }}
env: {{- include "arkcase.tools.baseEnv" $ | nindent 12 }}
- name: TEMP_DIR
value: &tempDir "/app/temp"
- name: JOBS
value: |-
jobs:
- ownership: {{ coalesce ($.Values.persistence).ownership "1999:1999" | quote }}
permissions: "u=rwX,g=rX,o="
flags: [ "recurse", "forced", "create", "changes" ]
targets: [ "$(TEMP_DIR)" ]
volumeMounts:
- name: &tempVol "temp"
mountPath: *tempDir
containers:
- name: tika
{{- include "arkcase.image" $ | nindent 10 }}
env: {{- include "arkcase.tools.baseEnv" $ | nindent 12 }}
{{- include "arkcase.acme.env" $ | nindent 12 }}
{{- include "arkcase.alt-java" $ | nindent 12 }}
- name: ACME_KEYSTORE_WITH_TRUSTS
value: "true"
- name: TEMP_DIR
value: *tempDir
{{- if $.Values.env }}
{{- $.Values.env | toYaml | nindent 12 }}
{{- end }}
{{- include "arkcase.subsystem.ports" $ | nindent 10 }}
command: [ "/entrypoint" ]
resources: {{- include "arkcase.resources" $ | nindent 12 }}
securityContext: {{- include "arkcase.securityContext" (dict "ctx" $ "container" "tika") | nindent 12 }}
volumeMounts:
- name: *tempVol
mountPath: *tempDir
{{- include "arkcase.acme.volumeMount" $ | nindent 12 }}
{{- include "arkcase.file-resource.volumeMount" (dict "ctx" $ "mountPath" "/entrypoint") | nindent 12 }}
{{- include "arkcase.file-resource.volumeMount" (dict "ctx" $ "mountPath" "/app/conf/tika-config.xml" "subPath" "tika-config.xml") | nindent 12 }}
volumes:
- name: *tempVol
emptyDir:
sizeLimit: {{ ((.Values.persistence).volumeSize).temp | default "2Gi" }}
{{- include "arkcase.acme.volume" $ | nindent 8 }}
{{- include "arkcase.file-resource.volumes" $ | nindent 8 }}
{{- end }}
1 change: 1 addition & 0 deletions src/app/charts/tika/templates/file-resources.yaml
@@ -0,0 +1 @@
{{- include "arkcase.file-resources" $ -}}
3 changes: 3 additions & 0 deletions src/app/charts/tika/templates/service.yaml
@@ -0,0 +1,3 @@
{{- if not (include "arkcase.subsystem.external" $) -}}
{{- include "arkcase.subsystem.service" $ -}}
{{- end -}}
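Once the service above is deployed, clients reach Tika over its REST API, typically a `PUT` of the raw document bytes to `/tika` (plain-text extraction) or `/rmeta` (metadata). A minimal sketch of that request shape, exercised against a local stand-in server rather than a real Tika pod (the hostname, port, and TLS settings of the deployed service come from the chart and are not assumed here):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubTika(BaseHTTPRequestHandler):
    """Stand-in for Tika's PUT /tika endpoint: reports the bytes received."""
    def do_PUT(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"extracted: %d bytes" % len(body))
    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubTika)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The request shape Tika server expects: PUT the raw bytes, set Content-Type
# to the document's type (or omit it and let Tika detect), and use Accept to
# choose the output format.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/tika",
    data=b"%PDF-1.4 fake document bytes",
    method="PUT",
    headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
)
with urllib.request.urlopen(req) as resp:
    text = resp.read().decode()

server.shutdown()
print(text)  # -> extracted: 28 bytes
```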