Creating your own Text extractor

To create your own text extractor, you must create a new class that implements the TextExtractor interface:

package com.openkm.plugin.extractor;

import net.xeoh.plugins.base.Plugin;

import java.io.IOException;
import java.io.InputStream;

public interface TextExtractor extends Plugin {
	String[] getContentTypes();
	String extractText(InputStream stream, String type, String encoding) throws IOException;
}

The new class must be placed in the package com.openkm.plugin.extractor, because the application's plugin system loads extractors from there. See the example implementation further below.

Do not omit the @PluginImplementation annotation; otherwise the application's plugin system will not be able to discover the new class.

More information at Register a new plugin.

Methods description

| Method | Type | Description |
|---|---|---|
| getContentTypes() | String[] | Returns the MIME types supported by this extractor. The returned strings must be in lower case, and the returned array must not be empty. |
| extractText(InputStream stream, String type, String encoding) | String | Returns the text content of the given binary document as a string. The content type and character encoding (if available and applicable) are given as arguments. |
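The contract described for getContentTypes() (lower-case strings, non-empty array) can be checked mechanically before an extractor is deployed. The following is a minimal sketch using only JDK classes; ContentTypeCheck, normalize and isValid are hypothetical helper names for illustration and are not part of OpenKM.

```java
import java.util.Locale;

// Hypothetical helper, not part of OpenKM: checks the getContentTypes() contract.
final class ContentTypeCheck {

    // Returns a trimmed, lower-cased copy of the given MIME types.
    static String[] normalize(String[] types) {
        String[] result = new String[types.length];
        for (int i = 0; i < types.length; i++) {
            result[i] = types[i].trim().toLowerCase(Locale.ROOT);
        }
        return result;
    }

    // True if the array is non-empty and every entry is a non-empty,
    // already lower-case string.
    static boolean isValid(String[] types) {
        if (types == null || types.length == 0) {
            return false;
        }
        for (String type : types) {
            if (type == null || type.isEmpty()
                    || !type.equals(type.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

An extractor could, for example, pass its MIME types through normalize() in its constructor so that the array returned by getContentTypes() always satisfies isValid().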

Example of the Text extractor implementation

package com.openkm.plugin.extractor;

import net.xeoh.plugins.base.annotations.PluginImplementation;
import org.apache.commons.io.IOUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

/**
 * Text extractor for plain text.
 */
@PluginImplementation
public class PlainTextExtractor extends AbstractTextExtractor {
	
	/**
	 * Logger instance.
	 */
	private static final Logger logger = LoggerFactory.getLogger(PlainTextExtractor.class);
	
	/**
	 * Creates a new <code>PlainTextExtractor</code> instance.
	 */
	public PlainTextExtractor() {
		super(new String[] { "text/plain" });
	}
	
	// -------------------------------------------------------< TextExtractor >
	
	/**
	 * Reads the given input stream into a string using the given encoding,
	 * or the platform default encoding if the encoding is not given or is
	 * unsupported.
	 * 
	 * @param stream binary stream
	 * @param type ignored
	 * @param encoding character encoding, optional
	 * @return the plain text content
	 * @throws IOException if the binary stream cannot be read
	 */
	public String extractText(InputStream stream, String type, String encoding) throws IOException {
		try {
			if (encoding != null) {
				return IOUtils.toString(stream, encoding);
			}
		} catch (UnsupportedEncodingException e) {
			logger.warn("Unsupported encoding '{}', using default ({}) instead.",
					encoding, System.getProperty("file.encoding"));
		}
		
		return IOUtils.toString(stream);
	}
}
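The example above falls back to the default encoding only after a read attempt has failed. An alternative is to resolve the character set before touching the stream, so the fallback never depends on which exception the reading library throws. The following is a minimal sketch of that idea using only JDK classes; it assumes Java 9 or later (for InputStream.readAllBytes), and EncodingFallback, resolve and readText are hypothetical names that are not part of OpenKM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of OpenKM: picks a usable charset up front.
final class EncodingFallback {

    // Returns the named charset if it is given and supported,
    // otherwise the platform default charset.
    static Charset resolve(String encoding) {
        if (encoding != null) {
            try {
                if (Charset.isSupported(encoding)) {
                    return Charset.forName(encoding);
                }
            } catch (IllegalCharsetNameException e) {
                // The name is syntactically invalid; use the default below.
            }
        }
        return Charset.defaultCharset();
    }

    // Reads the whole stream and decodes it with the resolved charset.
    // Assumes Java 9+ for readAllBytes().
    static String readText(InputStream stream, String encoding) throws IOException {
        Charset charset = resolve(encoding);
        return new String(stream.readAllBytes(), charset);
    }
}
```

Inside an extractText implementation, the body could then reduce to a single call such as readText(stream, encoding), with a log message added wherever resolve() ends up choosing the default.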