Creating your own Text extractor
You can create your own text extractor, which the application loads as a plugin.
Conditions
- The new Text Extractor class must be declared under the package "com.openkm.extractor".
- The new Text Extractor class must be annotated with "@PluginImplementation".
- The new Text Extractor class must extend "AbstractTextExtractor".
Text Extractor interface:
package com.openkm.extractor;
import net.xeoh.plugins.base.Plugin;
import java.io.IOException;
import java.io.InputStream;
public interface TextExtractor extends Plugin {
    String[] getContentTypes();
    String extractText(InputStream stream, String type, String encoding) throws IOException;
}
The new class must be placed in the package com.openkm.extractor because the application's plugin system loads extractors from there.
Do not omit the @PluginImplementation annotation; otherwise the application's plugin system will not be able to find the new class.
For more information, see Register a new plugin.
Methods description
Method | Type | Description |
---|---|---|
getContentTypes() | String[] | Returns the MIME types supported by this extractor. The returned strings must be in lower case, and the returned array must not be empty. |
extractText(InputStream stream, String type, String encoding) | String | Returns the text content of the given binary document as a string. The content type and character encoding (if available and applicable) are given as arguments. |
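The contract of getContentTypes() (a non-empty array of lower-case MIME types) is easy to break by accident, so it can help to check it in a unit test. The following is a minimal sketch of such a check using only the JDK. The class name ContentTypeContractSketch and the method isValid are hypothetical and are not part of OpenKM.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of OpenKM. */
final class ContentTypeContractSketch {

    private ContentTypeContractSketch() {
    }

    /**
     * Returns true if the array satisfies the getContentTypes() contract
     * described above: it is not empty and every entry is lower case.
     */
    static boolean isValid(String[] contentTypes) {
        if (contentTypes == null || contentTypes.length == 0) {
            return false;
        }
        for (String type : contentTypes) {
            // Locale.ROOT keeps the comparison independent of the default locale.
            if (type == null || type.isEmpty() || !type.equals(type.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

In a test for your own extractor you could pass the result of getContentTypes() to isValid and fail the test when it returns false.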
Example of the Text extractor implementation
package com.openkm.extractor;
import net.xeoh.plugins.base.annotations.PluginImplementation;
import org.apache.commons.io.IOUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
/**
 * Text extractor for plain text.
 */
@PluginImplementation
public class PlainTextExtractor extends AbstractTextExtractor {
    private static final Logger log = LoggerFactory.getLogger(PlainTextExtractor.class);
    /**
     * Creates a new <code>PlainTextExtractor</code> instance.
     */
    public PlainTextExtractor() {
        super(new String[]{"text/plain"});
    }
    // -------------------------------------------------------< TextExtractor >
    /**
     * Reads the given input stream into a string using the given encoding,
     * or UTF-8 if the encoding is not given or is not supported.
     *
     * @param stream binary stream
     * @param type ignored
     * @param encoding character encoding, optional
     * @return the plain text content
     * @throws IOException if the binary stream cannot be read
     */
    @Override
    public String extractText(InputStream stream, String type, String encoding) throws IOException {
        log.debug("extractText({}, {}, {})", stream, type, encoding);
        try {
            if (encoding != null) {
                return IOUtils.toString(stream, encoding);
            }
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            // Newer Commons IO versions report an unknown charset name with an
            // unchecked exception (a subclass of IllegalArgumentException).
            log.warn("Unsupported encoding '{}', using UTF-8 instead.", encoding);
        }
        return IOUtils.toString(stream, StandardCharsets.UTF_8);
    }
}
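The encoding fallback used in the example can also be expressed with the JDK alone, which makes the behavior easy to test outside the application. The sketch below is an illustration under that assumption, not the OpenKM implementation. The class EncodingFallbackSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of OpenKM. */
final class EncodingFallbackSketch {

    private EncodingFallbackSketch() {
    }

    /** Returns the named charset, or UTF-8 if the name is null, malformed, or unsupported. */
    static Charset resolveCharset(String encoding) {
        if (encoding == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(encoding);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes the bytes with the requested encoding, falling back to UTF-8. */
    static String decode(byte[] bytes, String encoding) {
        return new String(bytes, resolveCharset(encoding));
    }

    /** Reads the whole stream into memory and decodes it; the caller closes the stream. */
    static String extractText(InputStream stream, String encoding) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return decode(buffer.toByteArray(), encoding);
    }
}
```

Resolving the charset before reading any bytes means that an unknown encoding name never leaves the stream partially consumed, so the UTF-8 fallback always sees the full content.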