Conversation
186364b to
9a9c5f5
Compare
9a9c5f5 to
f473b8c
Compare
doc/README.md
Outdated
| - [Pentaho](https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html) (for reporting services) | ||
| - [Alfresco](https://www.alfresco.com/), for content storage services | ||
| - [Minio](https://min.io/), for content storage services in S3-compatible mode | ||
| - [Tika](https://tika.apache.org/), for document metadata and text extraction services |
There was a problem hiding this comment.
Please fix the indentation to maintain consistency
There was a problem hiding this comment.
looks like Idea is messing this up, I will fix this trough the UI
There was a problem hiding this comment.
Non-executable files don't require permissions being specified. That extension was added to the scripts specifically for executable files.
| <metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter"> | ||
| <contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes> | ||
| <extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides> | ||
| </metadataFilter> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/> |
There was a problem hiding this comment.
I see com.armedia.acm.* ... do we have to add JARs of our own to extend the Tika container?
There was a problem hiding this comment.
Yes, this is how we extend Tika Server capabilities without modifying Tika’s source code.
We provide our custom parser implementations as a separate JAR and load them through the container’s classpath, and then activate/configure them via tika-config.xml.
This follows Tika’s recommended extension mechanism.
There was a problem hiding this comment.
Then we need to add those JAR's Nexus/Maven coordinates to the container's build process. Can you provide the list for them or create an MR for the container?
There was a problem hiding this comment.
It's already created, but I need the arkcase-tika merged, until then, the jar is not available on nexus.
ArkCase/ark_tika#1
There was a problem hiding this comment.
You can still add the coordinates to the Dockerfile, commented out ... same with the RUN commands to fetch them. The point here is to start getting everything lined up.
Furthermore, it might actually behoove us to have those JARs be separate from ArkCase since they're not specifically dependent on ArkCase to begin with ... are they?
There was a problem hiding this comment.
Yes, it’s a separate small module outside of ArkCase that contains only the custom parsing logic and filters.
I have tested this locally using the updated ark_tika Docker image (including the arkcase-tika.jar) along with the Helm changes and the new Tika pod. The pod is up and running, the configuration is being picked up correctly, and all custom parsers/filters are properly registered.
At the moment, I’m experiencing issues running the core pod with the latest develop branch. Once I get it running, I’ll be able to fully confirm that the ArkCase–Tika integration is working as expected.
| strategy: | ||
| type: {{ $.Values.updateStrategy }} | ||
| {{- if (eq $.Values.updateStrategy "RollingUpdate") }} | ||
| rollingUpdate: | ||
| maxSurge: 1 | ||
| maxUnavailable: 0 | ||
| {{- end }} |
There was a problem hiding this comment.
This can stay as-is for now, but we'll probably want to use a HorizontalPodAutoscaler for this Deployment
There was a problem hiding this comment.
On second thought, let's just replace this with:
strategy:
type: RecreateWe don't really need to do a rolling update since these are meant to be stateless.
| - name: JAVA_OPTS | ||
| value: >- | ||
| -Xmx1536m | ||
| -Xms512m | ||
| -XX:+UseG1GC |
There was a problem hiding this comment.
We should be able to use the percentage-based stuff so we can then control how much memory the pod is allocated via its resource allocations ... look for what's done for other pods like messaging or core
src/app/charts/tika/Chart.yaml
Outdated
| @@ -0,0 +1,10 @@ | |||
| apiVersion: v2 | |||
| name: tika | |||
| version: 0.10.0-2 | |||
There was a problem hiding this comment.
The version can remain as 0.10.0-0 since it's the first release of this subchart.
src/app/charts/tika/values.yaml
Outdated
| persistence: | ||
| enabled: true | ||
| volumeSize: | ||
| init: "1Gi" | ||
| data: "16Gi" | ||
| volumes: | ||
| data: | ||
| capacity: "16Gi" | ||
| accessModes: | ||
| - ReadWriteOnce |
There was a problem hiding this comment.
I don't think Tika requires persistence, so this can be removed or substituted with:
persistence:
enabled: false| httpGet: | ||
| scheme: HTTP | ||
| path: "/tika" | ||
| port: *tika |
There was a problem hiding this comment.
The scheme should be HTTPS, b/c we should be using SSL everywhere.
There was a problem hiding this comment.
Looks like Tika Server 3.x does not support native HTTPS. I tried configuring it via both CLI flags and tika-config.xml, but the server doesn't recognize any SSL/keystore parameters.
How do we handle services that only support HTTP? Can you help with this one ?
There was a problem hiding this comment.
I see they have just committed a change to improve the TLS docs: https://www.mail-archive.com/commits@tika.apache.org/msg20561.html
There was a problem hiding this comment.
Nice catch, but the fix is planned for version 4.0.0 😢
It’s estimated to be released in January, however it’s still not available and there haven’t been any updates so far.
For now, we’ll probably need to use a proxy as a workaround.
There was a problem hiding this comment.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148639291#TikaServer-SSL(Beta)
This is already available since Tika 2.4.0. I'll test this and figure this out. Most likely the challenge had to do with access to the keystore(s).
There was a problem hiding this comment.
@drivera-armedia did you had a chance to look at it?
There was a problem hiding this comment.
This second service declaration is unnecessary. The reason some pods have two service declarations is b/c they need to be aware of their cluster members. Tika is not a cluster per-se, and thus doesn't need to know how many other Tika pods are available.
Thus, the above service declaration is sufficient, and this one is unnecessary.
README.md
Outdated
| - [PostgreSQL](https://www.postgresql.org/)/[MariaDB](https://mariadb.org/), for database storage (only one instance is needed) | ||
| - [Pentaho](https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html) (for reporting services) | ||
| - [Alfresco](https://www.alfresco.com/), for content storage services | ||
| - [Tika](https://tika.apache.org/), for document metadata and text extraction services |
There was a problem hiding this comment.
As above, please fix the indentation to maintain consistency.
| <metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter"> | ||
| <contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes> | ||
| <extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides> | ||
| </metadataFilter> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/> |
There was a problem hiding this comment.
Then we need to add those JAR's Nexus/Maven coordinates to the container's build process. Can you provide the list for them or create an MR for the container?
There was a problem hiding this comment.
If there are no specific helpers needed, let's remove this file.
| strategy: | ||
| type: {{ $.Values.updateStrategy }} | ||
| {{- if (eq $.Values.updateStrategy "RollingUpdate") }} | ||
| rollingUpdate: | ||
| maxSurge: 1 | ||
| maxUnavailable: 0 | ||
| {{- end }} |
There was a problem hiding this comment.
On second thought, let's just replace this with:
strategy:
type: RecreateWe don't really need to do a rolling update since these are meant to be stateless.
| - name: INIT_DIR | ||
| value: &initDir "/app/init" |
There was a problem hiding this comment.
Why do we need an init volume? This is meant to be a stateless pod - what initialization does it need that can't be done as part of the container's boot sequence?
There was a problem hiding this comment.
Actually, we don't need it. It's removed, and DATA_DIR is replaced wiht TEMP_DIR
| - ownership: {{ coalesce ($.Values.persistence).ownership "1999:1999" | quote }} | ||
| permissions: "u=rwX,g=rX,o=" | ||
| flags: [ "recurse", "forced", "create", "changes" ] | ||
| targets: [ "$(DATA_DIR)", "$(INIT_DIR)" ] |
There was a problem hiding this comment.
See above re: init volume
| - name: &initVol "init" | ||
| mountPath: *initDir |
There was a problem hiding this comment.
See above re: init volume
| - name: *initVol | ||
| mountPath: *initDir |
There was a problem hiding this comment.
See above re: init volume
| - name: *initVol | ||
| emptyDir: | ||
| sizeLimit: 1Gi |
There was a problem hiding this comment.
See above re: init volume
| <metadataFilter class="com.armedia.acm.tika.filter.ContentTypeNormalizationFilter"> | ||
| <contentTypeFixes>audio/vnd.wave:audio/wav,audio/mpeg:audio/mp3,audio/x-flac:audio/flac,video/quicktime:video/mp4</contentTypeFixes> | ||
| <extensionOverrides>.mpga:.mp3,.qt:.mp4</extensionOverrides> | ||
| </metadataFilter> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.CreatedDateNormalizationFilter"/> | ||
| <metadataFilter class="com.armedia.acm.tika.filter.GpsEnrichmentFilter"/> |
There was a problem hiding this comment.
You can still add the coordinates to the Dockerfile, commented out ... same with the RUN commands to fetch them. The point here is to start getting everything lined up.
Furthermore, it might actually behoove us to have those JARs be separate from ArkCase since they're not specifically dependent on ArkCase to begin with ... are they?
| httpGet: | ||
| scheme: HTTP | ||
| path: "/tika" | ||
| port: *tika |
There was a problem hiding this comment.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148639291#TikaServer-SSL(Beta)
This is already available since Tika 2.4.0. I'll test this and figure this out. Most likely the challenge had to do with access to the keystore(s).
| @@ -0,0 +1,3 @@ | |||
| {{- if not (include "arkcase.subsystem.external" $) -}} | |||
| {{- include "arkcase.subsystem.service" (dict "ctx" $ "dnsOnly" true) -}} | |||
There was a problem hiding this comment.
Please address this one
No description provided.