Question

DocumentParser app service no longer available?


I'm trying to build a POC that extracts metadata from a FileDocument, and I came across this app service: DocumentParser https://appstore.home.mendix.com/link/app/DocumentParser. This would be very useful to showcase to our potential client. However, the app service appears to be down: calling it returns java.net.UnknownHostException. Is there any chance of reviving this app service?

asked 2018-01-29

Rionald Chancellor

1 answer

Erwin 't Hoen · Accepted Answer · 2018-01-30

If you want to get the metadata from a file in much the same way the app service did, the code below may help. It is a Java action that takes a specialization of System.FileDocument (here called MyDoc, in the module MyFirstModule) as input and returns a single string with one metadata attribute per line, each written as name and value separated by a colon (:).

package myfirstmodule.actions;

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.Parser;
import com.mendix.core.Core;
import com.mendix.systemwideinterfaces.core.IContext;
import com.mendix.webui.CustomJavaAction;
import com.mendix.systemwideinterfaces.core.IMendixObject;

public class JA_GetMetaData extends CustomJavaAction<java.lang.String>
{
	private IMendixObject __Document;
	private myfirstmodule.proxies.MyDoc Document;

	public JA_GetMetaData(IContext context, IMendixObject Document)
	{
		super(context);
		this.__Document = Document;
	}

	@Override
	public java.lang.String executeAction() throws Exception
	{
		this.Document = __Document == null ? null : myfirstmodule.proxies.MyDoc.initialize(getContext(), __Document);

		// BEGIN USER CODE
		if (Document == null) {
			throw new IllegalArgumentException("Document must not be empty");
		}
		Parser parser = new AutoDetectParser();
		// -1 disables the handler's default write limit of 100,000 characters
		BodyContentHandler handler = new BodyContentHandler(-1);
		Metadata metadata = new Metadata();
		ParseContext context = new ParseContext();

		// try-with-resources closes the stream even if parsing fails
		try (InputStream inputstream = Core.getFileDocumentContent(getContext(), Document.getMendixObject())) {
			parser.parse(inputstream, handler, metadata, context);
		}
		// the handler now contains the text content of the file being processed

		// build one "name:value" line per metadata element
		StringBuilder sb = new StringBuilder();
		for (String name : metadata.names()) {
			sb.append(name).append(':').append(metadata.get(name)).append('\n');
		}
		return sb.toString();
		// END USER CODE
	}

	/**
	 * Returns a string representation of this action
	 */
	@Override
	public java.lang.String toString()
	{
		return "JA_GetMetaData";
	}

	// BEGIN EXTRA CODE
	// END EXTRA CODE
}

For a sample PDF file I simply show the returned string in a message box, which lists each metadata name and its value on a separate line.

With a little extra effort in the Java code you can create an action that stores the data in an entity associated with the document entity that holds the file. In addition, you can create a more generic version by using the type parameters of the Java action in the Modeler. And if you are interested in the content of the file, have a look at the handler.
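If you want to store the values per attribute, you need name/value pairs rather than one string. Tika metadata names often contain a colon themselves (for example dc:title), and so can the values, so splitting the string returned by the action on ":" is ambiguous. Below is a minimal sketch in plain Java, without Tika or Mendix, that writes the pairs with a tab as separator and parses them back. The class MetadataLines, its methods and the choice of a tab are my own assumptions for illustration, not part of Tika or of the app service.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika or Mendix.
final class MetadataLines {
    // Assumed separator: a tab is unlikely to occur in a metadata name.
    private static final String SEP = "\t";

    /** Formats the pairs as one "name<TAB>value" line per entry. */
    static String format(Map<String, String> pairs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            sb.append(e.getKey()).append(SEP).append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    /** Parses text produced by format back into a map, keeping the order. */
    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int idx = line.indexOf(SEP);
            if (idx < 0) {
                continue; // skip blank or malformed lines
            }
            // split on the first tab only, so colons in names and values survive
            result.put(line.substring(0, idx), line.substring(idx + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("dc:title", "Sample");
        pairs.put("Content-Type", "application/pdf");
        System.out.println(parse(format(pairs)));
    }
}
```

In the Java action you would fill the map from metadata.names() and metadata.get(name) and then create one object per entry. Note that this sketch does not handle values that contain a line break.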

Be aware that you'll need the Apache Tika library in your userlib folder. It can be downloaded from: https://tika.apache.org/download.html

Also be aware that the method does not work for all files. I did a short test with pdf, docx, xlsx and png files, and these all returned metadata.

Hope this helps you further in your showcase.