Skip to content

New arkcase-tika pod#84

Open
cherepnalkovski wants to merge 15 commits intodevelopfrom
tika-setup
Open

New arkcase-tika pod#84
cherepnalkovski wants to merge 15 commits intodevelopfrom
tika-setup

Conversation

@cherepnalkovski
Copy link
Contributor

No description provided.

@cherepnalkovski cherepnalkovski changed the base branch from main to develop February 6, 2026 13:12
doc/README.md Outdated
- [Pentaho](https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html) (for reporting services)
- [Alfresco](https://www.alfresco.com/), for content storage services
- [Minio](https://min.io/), for content storage services in S3-compatible mode
- [Tika](https://tika.apache.org/), for document metadata and text extraction services
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the indentation to maintain consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like Idea is messing this up, I will fix this trough the UI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-executable files don't require permissions being specified. That extension was added to the scripts specifically for executable files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +20 to +25
<metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter">
<contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes>
<extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides>
</metadataFilter>
<metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/>
<metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see com.armedia.acm.* ... do we have to add JARs of our own to extend the Tika container?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is how we extend Tika Server capabilities without modifying Tika’s source code.
We provide our custom parser implementations as a separate JAR and load them through the container’s classpath, and then activate/configure them via tika-config.xml.
This follows Tika’s recommended extension mechanism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to add those JAR's Nexus/Maven coordinates to the container's build process. Can you provide the list for them or create an MR for the container?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already created, but I need the arkcase-tika merged, until then, the jar is not available on nexus.
ArkCase/ark_tika#1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still add the coordinates to the Dockerfile, commented out ... same with the RUN commands to fetch them. The point here is to start getting everything lined up.

Furthermore, it might actually behoove us to have those JARs be separate from ArkCase since they're not specifically dependent on ArkCase to begin with ... are they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it’s a separate small module outside of ArkCase that contains only the custom parsing logic and filters.

I have tested this locally using the updated ark_tika Docker image (including the arkcase-tika.jar) along with the Helm changes and the new Tika pod. The pod is up and running, the configuration is being picked up correctly, and all custom parsers/filters are properly registered.

At the moment, I’m experiencing issues running the core pod with the latest develop branch. Once I get it running, I’ll be able to fully confirm that the ArkCase–Tika integration is working as expected.

Comment on lines 21 to 27
strategy:
type: {{ $.Values.updateStrategy }}
{{- if (eq $.Values.updateStrategy "RollingUpdate") }}
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can stay as-is for now, but we'll probably want to use a HorizontalPodAutoscaler for this Deployment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, let's just replace this with:

strategy:
  type: Recreate

We don't really need to do a rolling update since these are meant to be stateless.

Comment on lines 99 to 103
- name: JAVA_OPTS
value: >-
-Xmx1536m
-Xms512m
-XX:+UseG1GC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to use the percentage-based stuff so we can then control how much memory the pod is allocated via its resource allocations ... look for what's done for other pods like messaging or core

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1,10 @@
apiVersion: v2
name: tika
version: 0.10.0-2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version can remain as 0.10.0-0 since it's the first release of this subchart.

Comment on lines 85 to 94
persistence:
enabled: true
volumeSize:
init: "1Gi"
data: "16Gi"
volumes:
data:
capacity: "16Gi"
accessModes:
- ReadWriteOnce
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Tika requires persistence, so this can be removed or substituted with:

persistence:
  enabled: false

Comment on lines 115 to 118
httpGet:
scheme: HTTP
path: "/tika"
port: *tika
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheme should be HTTPS, b/c we should be using SSL everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Tika Server 3.x does not support native HTTPS. I tried configuring it via both CLI flags and tika-config.xml, but the server doesn't recognize any SSL/keystore parameters.
How do we handle services that only support HTTP? Can you help with this one ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see they have just committed a change to improve the TLS docs: https://www.mail-archive.com/commits@tika.apache.org/msg20561.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, but the fix is planned for version 4.0.0 😢
It’s estimated to be released in January, however it’s still not available and there haven’t been any updates so far.
For now, we’ll probably need to use a proxy as a workaround.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148639291#TikaServer-SSL(Beta)

This is already available since Tika 2.4.0. I'll test this and figure this out. Most likely the challenge had to do with access to the keystore(s).

Copy link
Contributor Author

@cherepnalkovski cherepnalkovski Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drivera-armedia did you had a chance to look at it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second service declaration is unnecessary. The reason some pods have two service declarations is b/c they need to be aware of their cluster members. Tika is not a cluster per-se, and thus doesn't need to know how many other Tika pods are available.

Thus, the above service declaration is sufficient, and this one is unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

README.md Outdated
- [PostgreSQL](https://www.postgresql.org/)/[MariaDB](https://mariadb.org/), for database storage (only one instance is needed)
- [Pentaho](https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html) (for reporting services)
- [Alfresco](https://www.alfresco.com/), for content storage services
- [Tika](https://tika.apache.org/), for document metadata and text extraction services
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, please fix the indentation to maintain consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +20 to +25
<metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter">
<contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes>
<extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides>
</metadataFilter>
<metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/>
<metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to add those JAR's Nexus/Maven coordinates to the container's build process. Can you provide the list for them or create an MR for the container?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no specific helpers needed, let's remove this file.

Comment on lines 21 to 27
strategy:
type: {{ $.Values.updateStrategy }}
{{- if (eq $.Values.updateStrategy "RollingUpdate") }}
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, let's just replace this with:

strategy:
  type: Recreate

We don't really need to do a rolling update since these are meant to be stateless.

Comment on lines 77 to 78
- name: INIT_DIR
value: &initDir "/app/init"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an init volume? This is meant to be a stateless pod - what initialization does it need that can't be done as part of the container's boot sequence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we don't need it. It's removed, and DATA_DIR is replaced wiht TEMP_DIR

- ownership: {{ coalesce ($.Values.persistence).ownership "1999:1999" | quote }}
permissions: "u=rwX,g=rX,o="
flags: [ "recurse", "forced", "create", "changes" ]
targets: [ "$(DATA_DIR)", "$(INIT_DIR)" ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above re: init volume

Comment on lines 89 to 90
- name: &initVol "init"
mountPath: *initDir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above re: init volume

Comment on lines 109 to 110
- name: *initVol
mountPath: *initDir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above re: init volume

Comment on lines 115 to 117
- name: *initVol
emptyDir:
sizeLimit: 1Gi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above re: init volume

@ArkCase ArkCase deleted a comment from jenkins-armedia Feb 9, 2026
Comment on lines +20 to +25
<metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter">
<contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes>
<extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides>
</metadataFilter>
<metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/>
<metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still add the coordinates to the Dockerfile, commented out ... same with the RUN commands to fetch them. The point here is to start getting everything lined up.

Furthermore, it might actually behoove us to have those JARs be separate from ArkCase since they're not specifically dependent on ArkCase to begin with ... are they?

Comment on lines 115 to 118
httpGet:
scheme: HTTP
path: "/tika"
port: *tika
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148639291#TikaServer-SSL(Beta)

This is already available since Tika 2.4.0. I'll test this and figure this out. Most likely the challenge had to do with access to the keystore(s).

@@ -0,0 +1,3 @@
{{- if not (include "arkcase.subsystem.external" $) -}}
{{- include "arkcase.subsystem.service" (dict "ctx" $ "dnsOnly" true) -}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants