Creating automatic key extraction training files

Manual creation

Creating training files is easy: you simply create pairs of files that KEA will use to build the extraction model.

The main file to be analyzed by KEA must be a plain-text file, e.g. foo.txt (if you have a PDF, DOC, RTF or other file type, it must first be converted to TXT). Each foo.txt file must have a matching foo.key file. The foo.key file contains the keywords with which you identify the document; these keywords must be present in your thesaurus.

Example of foo.key

AMARANTHUS
PLANT PRODUCTION
GEOGRAPHICAL DISTRIBUTION
NUTRITIVE VALUE
SEEDS
MERCHANTS

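Keywords in the .key file must match the thesaurus terms exactly, so it is common to trim and upper-case them before use (server-side, this is what the forceKeywordsToUpperCase flag of the REST API is for). As a minimal sketch, a hypothetical helper that normalizes the lines of a .key file could look like this:

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class KeyFileUtil {
    // Trim whitespace, drop empty lines and upper-case each keyword so it
    // matches the (upper-cased) terms of the thesaurus. Hypothetical helper,
    // not part of the KEA or OpenKM API.
    public static List<String> normalize(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());
    }
}
```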
All the file pairs must be placed under the $TOMCAT_HOME/kea/training directory. This is the directory path KEA will use to create the model.

You need a significant number of documents in order to build a good key extraction model. Around 100 or more files (depending on the size of your thesaurus, among other factors) is a good starting point.
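Since every foo.txt needs a matching foo.key companion, it is worth checking the training directory before building the model. A minimal sketch (the class and method names are made up for illustration):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class TrainingDirCheck {
    // Return the names of the .txt files in the training directory that are
    // missing their matching .key file. Hypothetical helper for illustration.
    public static List<String> findMissingKeyFiles(File trainingDir) {
        List<String> missing = new ArrayList<>();
        File[] files = trainingDir.listFiles();
        if (files == null) {
            return missing;
        }
        for (File f : files) {
            String name = f.getName();
            if (name.endsWith(".txt")) {
                String base = name.substring(0, name.length() - ".txt".length());
                if (!new File(trainingDir, base + ".key").exists()) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }
}
```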

RESTful

Method description for sending a training file

Method

POST

URL

Take the URL used in the Java sample below as a reference (e.g. http://localhost:8080/keas/rest/training/file).

Body

TrainingDocument object.

public class TrainingDocument implements Serializable {
    private static final long serialVersionUID = 1L;

    private String body = "";
    private List<String> keywords;
    private boolean forceKeywordsToUpperCase = false;

    public String getBody() {
        return body;
    }

    public void setBody(String body) {
        this.body = body;
    }

    public List<String> getKeywords() {
        return keywords;
    }

    public void setKeywords(List<String> keywords) {
        this.keywords = keywords;
    }

    public boolean isForceKeywordsToUpperCase() {
        return forceKeywordsToUpperCase;
    }

    public void setForceKeywordsToUpperCase(boolean forceKeywordsToUpperCase) {
        this.forceKeywordsToUpperCase = forceKeywordsToUpperCase;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append("{");
        sb.append("body=").append(body.length());
        sb.append(",keywords=").append(keywords);
        sb.append(",forceKeywordsToUpperCase=").append(forceKeywordsToUpperCase);
        sb.append("}");
        return sb.toString();
    }
}
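When the TrainingDocument object is serialized to JSON for the POST body, the request looks roughly like this (the body text and keywords shown here are illustrative, not a real payload):

```json
{
  "body": "Amaranthus is a cosmopolitan genus of annual plants ...",
  "keywords": ["AMARANTHUS", "PLANT PRODUCTION", "SEEDS"],
  "forceKeywordsToUpperCase": true
}
```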

You can download sample files:

JAVA sample

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

import com.google.gson.Gson;
import com.openkm.bean.TrainingDocument;
import com.openkm.util.RestClient;

InputStream is = null;
try {
	// Read the training document body
	is = new FileInputStream("/home/openkm/test/trainingFile.txt");
	String content = IOUtils.toString(is, StandardCharsets.UTF_8);
	TrainingDocument td = new TrainingDocument();
	td.setBody(content);
	// Read the keywords, one per line
	File keywordsFile = new File("/home/openkm/test/trainingFile.key");
	List<String> keywords = FileUtils.readLines(keywordsFile, StandardCharsets.UTF_8);
	td.setKeywords(keywords);
	td.setForceKeywordsToUpperCase(true);
	// Serialize to JSON and POST to the training service
	RestClient rc = new RestClient();
	Gson gson = new Gson();
	String json = gson.toJson(td);
	String response = rc.post("http://localhost:8080/keas/rest/training/file", json, RestClient.FORMAT_JSON);
} finally {
	IOUtils.closeQuietly(is);
}

How to optimize the model

The KEA model is a living thing. The idea is that users manually set keywords on documents in OpenKM, and these are later used to build the model. To support this, we suggest creating some metadata (a property group) to indicate that a user has validated a document's keywords (a flag marking the documents that can be used to build a new model).

You can update your training files daily or weekly from OpenKM via the RESTful web services with a crontab task, and then rebuild the model.
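For example, a crontab entry on the server could trigger the rebuild weekly. The script path below is hypothetical; such a script would call the OpenKM REST services to export the validated documents as .txt/.key pairs into the training directory and then rebuild the model:

```
# Every Sunday at 03:00: refresh the training files and rebuild the KEA model.
# /opt/scripts/rebuild-kea-model.sh is a hypothetical script, shown for illustration.
0 3 * * 0 /opt/scripts/rebuild-kea-model.sh >> /var/log/kea-rebuild.log 2>&1
```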

As your repository grows, your KEA model will become more accurate.