Skip Tika parsing with skip_tika new option #858

Open
wants to merge 3 commits into
base: master
@@ -469,6 +469,9 @@ private void indexFile(FileAbstractModel fileAbstractModel, ScanStatistic stats,
} else if (fsSettings.getFs().isXmlSupport()) {
// https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files
doc.setObject(XmlDocParser.generateMap(inputStream));
} else if (fsSettings.getFs().isSkipTika()) {
Owner

I was thinking about this today while adding some tests, and I found that we should move this logic into the TikaDocParser#generate method.

Two reasons:

  • We can unit test it more easily.
  • The UploadAPI class (REST API) must also honor this setting. UploadAPI calls the generate method, so the logic should be applied there.

Could you change that, please?
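
A minimal sketch of the change requested above, assuming the skip check moves into a single `generate`-style entry point shared by the crawler and the REST upload path. `DocSketch`, `ParserSketch`, `readRaw`, and `parseWithTika` are hypothetical stand-ins, not FSCrawler's actual classes or signatures, and the Tika branch is only a stub.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the document being indexed.
class DocSketch {
    String content;
}

// Hypothetical stand-in for a parser class with a single generate entry point.
class ParserSketch {

    // Every caller goes through this method, so the skip flag is honored
    // in one place and can be unit tested without running a crawler.
    static void generate(boolean skipTika, InputStream in, DocSketch doc) {
        if (skipTika) {
            // Skip parsing and store the raw stream content as text.
            doc.content = readRaw(in);
        } else {
            doc.content = parseWithTika(in);
        }
    }

    // Reads the whole stream, keeping line breaks intact.
    // Decoding as UTF-8 is an assumption made for this sketch.
    static String readRaw(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        try {
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Placeholder for the real Tika extraction, which is out of scope here.
    static String parseWithTika(InputStream in) {
        return "<parsed by Tika>";
    }
}
```

With this shape, a unit test only needs an in-memory stream and a flag value, which is the testability benefit mentioned in the first bullet.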

// https://github.com/dadoonet/fscrawler/issues/846 : Skip Tika parser
doc.setContent(inputStreamToString(inputStream));
} else {
// Extracting content with Tika
TikaDocParser.generate(fsSettings, inputStream, filename, fullFilename, doc, messageDigest, filesize);
@@ -592,4 +595,26 @@ private void esDelete(String index, String id) {
}
}

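/**
 * Read the stream and return its raw content as a string.
 *
 * @param inputStream the stream to read
 * @return the content read from the stream, or the part read before an I/O error occurred
 */
private String inputStreamToString(InputStream inputStream) {
    // Note: no charset is given, so the platform default charset is used
    InputStreamReader isReader = new InputStreamReader(inputStream);
    BufferedReader reader = new BufferedReader(isReader);
    StringBuilder sb = new StringBuilder();
    String str;
    try {
        while ((str = reader.readLine()) != null) {
            // readLine() strips the line terminator, so add one back to keep lines separate
            sb.append(str).append('\n');
        }
    } catch (IOException e) {
        logger.error("Failed to read InputStream: {}", e.getMessage());
        logger.trace("Failed to read InputStream.", e);
    }

    return sb.toString();
}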
}
1 change: 1 addition & 0 deletions docs/source/admin/fs/index.rst
@@ -19,6 +19,7 @@ The job file must comply to the following ``yaml`` specifications:
add_filesize: true
remove_deleted: true
add_as_inner_object: false
skip_tika: false
store_source: true
index_content: true
indexed_chars: "10000.0"
15 changes: 15 additions & 0 deletions docs/source/admin/fs/local-fs.rst
@@ -54,6 +54,8 @@ Here is a list of Local FS settings (under ``fs.`` prefix)`:
+----------------------------+-----------------------+---------------------------------+
| ``fs.follow_symlinks``     | ``false``             | `Follow Symlinks`_              |
+----------------------------+-----------------------+---------------------------------+
| ``fs.skip_tika``           | ``false``             | `Skip Tika`_                    |
+----------------------------+-----------------------+---------------------------------+

.. _root-directory:

@@ -747,3 +749,16 @@ If you want FSCrawler to follow the symbolic links, you need to be explicit abou
   name: "test"
   fs:
     follow_symlink: true

Skip Tika
^^^^^^^^^^^^^^^

.. versionadded:: 2.7

To bypass Tika entirely, set ``skip_tika`` to ``true``. FSCrawler then indexes the raw file content as text, without extracting it through Tika.

.. code:: yaml

   name: "test"
   fs:
     skip_tika: true
Original file line number Diff line number Diff line change
@@ -58,6 +58,7 @@ public class Fs {
private Ocr ocr = new Ocr();
private ByteSizeValue ignoreAbove = null;
private boolean followSymlinks = false;
private boolean skipTika = false;

public static Builder builder() {
return new Builder();
@@ -91,6 +92,7 @@ public static class Builder {
private Ocr ocr = new Ocr();
private ByteSizeValue ignoreAbove = null;
private boolean followSymlinks = false;
private boolean skipTika = false;

public Builder setUrl(String url) {
this.url = url;
@@ -246,10 +248,15 @@ public Builder setFollowSymlinks(boolean followSymlinks) {
return this;
}

public Builder setSkipTika(boolean skipTika) {
this.skipTika = skipTika;
return this;
}

public Fs build() {
return new Fs(url, updateRate, includes, excludes, filters, jsonSupport, filenameAsId, addFilesize,
removeDeleted, addAsInnerObject, storeSource, indexedChars, indexContent, attributesSupport, rawMetadata,
- checksum, xmlSupport, indexFolders, langDetect, continueOnError, ocr, ignoreAbove, followSymlinks);
+ checksum, xmlSupport, indexFolders, langDetect, continueOnError, ocr, ignoreAbove, followSymlinks, skipTika);
}
}

@@ -260,7 +267,7 @@ public Fs( ) {
private Fs(String url, TimeValue updateRate, List<String> includes, List<String> excludes, List<String> filters, boolean jsonSupport,
boolean filenameAsId, boolean addFilesize, boolean removeDeleted, boolean addAsInnerObject, boolean storeSource,
Percentage indexedChars, boolean indexContent, boolean attributesSupport, boolean rawMetadata, String checksum, boolean xmlSupport,
- boolean indexFolders, boolean langDetect, boolean continueOnError, Ocr ocr, ByteSizeValue ignoreAbove, boolean followSymlinks) {
+ boolean indexFolders, boolean langDetect, boolean continueOnError, Ocr ocr, ByteSizeValue ignoreAbove, boolean followSymlinks, boolean skipTika) {
this.url = url;
this.updateRate = updateRate;
this.includes = includes;
@@ -284,6 +291,7 @@ private Fs(String url, TimeValue updateRate, List<String> includes, List<String>
this.ocr = ocr;
this.ignoreAbove = ignoreAbove;
this.followSymlinks = followSymlinks;
this.skipTika = skipTika;
}

public String getUrl() {
@@ -478,6 +486,10 @@ public void setIgnoreAbove(ByteSizeValue ignoreAbove) {
this.ignoreAbove = ignoreAbove;
}

public boolean isSkipTika() {
return skipTika;
}

public boolean isFollowSymlinks() {
return followSymlinks;
}
@@ -548,6 +560,7 @@ public String toString() {
", ocr=" + ocr +
", ignoreAbove=" + ignoreAbove +
", followSymlinks=" + followSymlinks +
", skipTika=" + skipTika +
'}';
}
}