Creating automatic key extraction training files
Manual creation
Creating training files is straightforward: for each document you create a pair of files that KEA will use to build its extraction model.
The main file to be analyzed by KEA must be a plain-text file, foo.txt (if your document is a PDF, DOC, RTF or any other type of file, it must first be converted to txt). Each foo.txt file must have a matching foo.key file. The foo.key file contains the keys that identify the document; those keys must be present in your thesaurus.
Example of foo.key
AMARANTHUS
PLANT PRODUCTION
GEOGRAPHICAL DISTRIBUTION
NUTRITIVE VALUE
SEEDS
MERCHANTS
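Because every key must exist in the thesaurus, it can help to check a key file before using it for training. The following is a minimal sketch, not part of KEA or OpenKM: the class `KeyFileCheck` and its methods are hypothetical, it assumes one key per line as in the example above, and it assumes the thesaurus terms are already available as an in-memory set of upper-case strings (loading the thesaurus itself is out of scope here).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a KEA or OpenKM class) for checking the lines of a .key file. */
final class KeyFileCheck {

    /** Trims each line, drops blank lines and upper-cases what remains. */
    static List<String> normalize(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String key = line.trim();
            if (!key.isEmpty()) {
                keys.add(key.toUpperCase(Locale.ROOT));
            }
        }
        return keys;
    }

    /** Returns the keys that are not in the given thesaurus terms, in input order. */
    static List<String> unknownKeys(List<String> keys, Set<String> thesaurusTerms) {
        List<String> unknown = new ArrayList<>();
        for (String key : keys) {
            if (!thesaurusTerms.contains(key)) {
                unknown.add(key);
            }
        }
        return unknown;
    }
}
```

An empty result from `unknownKeys` means that every key in the file matches a term in the supplied set.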
Both files, along with all the other file pairs, must be placed under the $TOMCAT_HOME/kea/training directory. KEA uses that directory path to create the model.
You need a significant number of document pairs to build a good key extraction model. 100 or more files (depending on how large your thesaurus is, among other factors) is a good size to start with.
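Since every .txt file needs a matching .key file, a quick consistency check of the training directory can catch incomplete pairs before the model is built. This is only a sketch under assumptions: `TrainingPairCheck` and `missingKeyFiles` are hypothetical names, not KEA or OpenKM API, and the directory to check is passed in by the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper (not a KEA or OpenKM class) for checking a training directory. */
final class TrainingPairCheck {

    /** Returns the names of the .txt files in dir that have no matching .key file, sorted by path. */
    static List<String> missingKeyFiles(Path dir) throws IOException {
        List<String> missing = new ArrayList<>();
        // Files.list holds a directory handle open, so close the stream when done.
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().endsWith(".txt"))
                 .sorted()
                 .forEach(p -> {
                     String name = p.getFileName().toString();
                     String base = name.substring(0, name.length() - ".txt".length());
                     // foo.txt is only usable for training if foo.key sits next to it.
                     if (!Files.isRegularFile(dir.resolve(base + ".key"))) {
                         missing.add(name);
                     }
                 });
        }
        return missing;
    }
}
```

Calling `TrainingPairCheck.missingKeyFiles(dir)` on the training directory returns an empty list when every text file has its key file.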
RESTful
Method description for sending training file | |
---|---|
Method | POST |
URL | Take the URL below as a sample. |
Body | TrainingDocument object. |
JAVA sample
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import com.google.gson.Gson;
import com.openkm.bean.TrainingDocument;
import com.openkm.util.RestClient;

// The stream is closed automatically when the try block ends.
try (InputStream is = new FileInputStream("/home/openkm/test/trainingFile.txt")) {
    // Read the training text (the foo.txt file) as UTF-8.
    String content = IOUtils.toString(is, StandardCharsets.UTF_8);
    TrainingDocument td = new TrainingDocument();
    td.setBody(content);

    // Read the keys (the foo.key file), one key per line.
    File keywordsFile = new File("/home/openkm/test/trainingFile.key");
    List<String> keywords = FileUtils.readLines(keywordsFile, StandardCharsets.UTF_8);
    td.setKeywords(keywords);
    td.setForceKeywordsToUpperCase(true);

    // Serialize the TrainingDocument to JSON and POST it to the training service.
    RestClient rc = new RestClient();
    Gson gson = new Gson();
    String json = gson.toJson(td);
    String response = rc.post("http://localhost:8080/keas/rest/training/file", json, RestClient.FORMAT_JSON);
}
How to optimize the model
The KEA model is a living thing. The idea is that users manually assign keywords to documents in OpenKM, and those keywords are later used to build the model. To support this, we suggest creating some metadata (a property group) to indicate that a user has validated a document's keys (a flag marking the documents that can be used to create a new model).
You can update your training files daily or weekly through the OpenKM RESTful web services with a crontab task, and then rebuild the model.
As your repository grows, your KEA model will become more effective.