Creating your own Text Extractor

You can create your own Text Extractor to add text extraction for the MIME types you need.

Conditions

  • The new Text Extractor class must be declared under the package "com.openkm.extractor".
  • The new Text Extractor class must be annotated with "@PluginImplementation".
  • The new Text Extractor class must extend "AbstractTextExtractor".

Text Extractor interface:

package com.openkm.extractor;

import net.xeoh.plugins.base.Plugin;

import java.io.IOException;
import java.io.InputStream;

public interface TextExtractor extends Plugin {
    
    String[] getContentTypes();

    String extractText(InputStream stream, String type, String encoding) throws IOException;
}

The new class must be placed in the package com.openkm.extractor because the application's plugin system loads extractors from there.

Do not omit the @PluginImplementation annotation; otherwise the application's plugin system will not be able to find the new class.

For more information, see Register a new plugin.

Methods description

Method | Type | Description
getContentTypes() | String[] | Returns the MIME types supported by this extractor. The returned strings must be in lower case, and the returned array must not be empty.
extractText(InputStream stream, String type, String encoding) | String | Returns the text content of the given binary document. The content type and character encoding (if available and applicable) are given as arguments.
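To illustrate how the two methods work together, here is a minimal sketch of how a caller could check the getContentTypes() contract, select an extractor by MIME type and invoke it. It does not show how the application itself resolves extractors. SimpleTextExtractor is a local stand-in that mirrors the two TextExtractor methods without the plugin framework dependency, and ExtractorDispatchSketch, PlainDemoExtractor, hasValidContentTypes and findExtractor are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class ExtractorDispatchSketch {

    // Local stand-in that mirrors the two TextExtractor methods, so the
    // sketch compiles without the plugin framework on the classpath.
    interface SimpleTextExtractor {
        String[] getContentTypes();
        String extractText(InputStream stream, String type, String encoding) throws IOException;
    }

    // Hypothetical extractor used only for this demonstration.
    static class PlainDemoExtractor implements SimpleTextExtractor {
        @Override
        public String[] getContentTypes() {
            return new String[]{"text/plain"};
        }

        @Override
        public String extractText(InputStream stream, String type, String encoding) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            // The encoding argument is ignored in this demo; UTF-8 is assumed.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Checks the documented contract: non-empty array, lower-case entries.
    static boolean hasValidContentTypes(SimpleTextExtractor extractor) {
        String[] types = extractor.getContentTypes();
        if (types == null || types.length == 0) {
            return false;
        }
        for (String type : types) {
            if (type == null || !type.equals(type.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // Returns the first extractor that declares the MIME type, or null.
    static SimpleTextExtractor findExtractor(List<SimpleTextExtractor> extractors, String mimeType) {
        String wanted = mimeType.toLowerCase(Locale.ROOT);
        for (SimpleTextExtractor extractor : extractors) {
            if (Arrays.asList(extractor.getContentTypes()).contains(wanted)) {
                return extractor;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        List<SimpleTextExtractor> extractors =
                Collections.<SimpleTextExtractor>singletonList(new PlainDemoExtractor());
        SimpleTextExtractor extractor = findExtractor(extractors, "text/plain");
        InputStream stream = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(hasValidContentTypes(extractor)); // prints "true"
        System.out.println(extractor.extractText(stream, "text/plain", "UTF-8")); // prints "hello"
    }
}
```

Because the lookup lower-cases the requested type before comparing, an extractor that returned "Text/Plain" would never be matched in this sketch, which is one reason the lower-case requirement matters.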

Example of a Text Extractor implementation

package com.openkm.extractor;

import net.xeoh.plugins.base.annotations.PluginImplementation;
import org.apache.commons.io.IOUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

/**
 * Text extractor for plain text.
 */
@PluginImplementation
public class PlainTextExtractor extends AbstractTextExtractor {

    private static final Logger log = LoggerFactory.getLogger(PlainTextExtractor.class);

    /**
     * Creates a new <code>PlainTextExtractor</code> instance.
     */
    public PlainTextExtractor() {
        super(new String[]{"text/plain"});
    }

    // -------------------------------------------------------< TextExtractor >
    /**
     * Reads the given input stream into a String using the given encoding,
     * or UTF-8 if the encoding is not given or is unsupported.
     *
     * @param stream binary stream
     * @param type ignored
     * @param encoding character encoding, optional
     * @return the plain text content
     * @throws IOException if the binary stream cannot be read
     */
    @Override
    public String extractText(InputStream stream, String type, String encoding) throws IOException {
        log.debug("extractText({}, {}, {})", stream, type, encoding);

        try {
            if (encoding != null) {
                return IOUtils.toString(stream, encoding);
            }
        } catch (UnsupportedEncodingException e) {
            log.warn("Unsupported encoding '{}', using UTF-8 instead.", encoding);
        }

        return IOUtils.toString(stream, StandardCharsets.UTF_8);
    }
}
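
The fallback behaviour in the example above can also be written with the JDK alone. The following is a minimal sketch, not the application's code: CharsetFallbackSketch, resolveCharset and readText are hypothetical names, and it assumes that falling back to UTF-8 is acceptable when the encoding is missing, malformed or unsupported. Note that Charset.forName reports bad names with the unchecked IllegalCharsetNameException and UnsupportedCharsetException, so those are the exceptions this variant catches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetFallbackSketch {

    // Resolves the charset name, falling back to UTF-8 (an assumption of
    // this sketch) when the name is missing, malformed or unsupported.
    static Charset resolveCharset(String encoding) {
        if (encoding == null || encoding.isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(encoding);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    // Reads the whole stream and decodes it with the resolved charset.
    static String readText(InputStream stream, String encoding) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), resolveCharset(encoding));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(new ByteArrayInputStream(data), "ISO-8859-1"));
        System.out.println(resolveCharset("no-such-charset")); // prints "UTF-8"
    }
}
```

Resolving the charset before reading means the stream is consumed only once, after the encoding to use is already known.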