PDF to image

Has anyone had success using OCR to read a .pdf file and scan a certain section of a page? We were able to get a PDF reader to give us the contents of a PDF as a string; however, we also need the address from the page, and it is being returned in a mixed format. Any suggestions or solutions to this problem are appreciated.
3 answers

We recently used Azure Document Intelligence to extract data from PDF files. You can model / design your own template using a visual tool provided by Azure. Works like a charm!




Mendix also has a module called Amazon Textract as part of their close collaboration with AWS. Did you check that out? Besides text, it can also extract other data from documents.


What we ended up doing was to use the PDF reader, split the string on \n, and loop through the resulting lines until we reached a certain index. We were reading invoices, so there was little difference in the structure of the rows provided. As a non-cloud method, this was the best solution we could come up with.
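For anyone who wants to try the same idea, here is a rough sketch in Python of the split-and-index approach. It is not our actual implementation. The function names, the sample text, and the "Bill to:" marker are all made up for illustration. It assumes the PDF reader has already returned the page as one string, with rows separated by \n.

```python
def split_lines(pdf_text):
    # Split the extracted text on "\n", trim each row, and drop empty rows.
    return [line.strip() for line in pdf_text.split("\n") if line.strip()]


def rows_at(pdf_text, start_index, count):
    # Fixed-position variant: take `count` rows starting at `start_index`.
    # Only reliable when every document has the same row layout.
    lines = split_lines(pdf_text)
    return lines[start_index:start_index + count]


def rows_after_marker(pdf_text, marker, count):
    # Marker variant (an assumption, not from the thread): find the first row
    # containing `marker` and return the `count` rows that follow it.
    lines = split_lines(pdf_text)
    for index, line in enumerate(lines):
        if marker.lower() in line.lower():
            return lines[index + 1:index + 1 + count]
    return []


# Hypothetical invoice text as a PDF reader might return it.
sample_text = (
    "Invoice 1001\n"
    "Bill to:\n"
    "Jane Doe\n"
    "12 Example Street\n"
    "1234 AB Sampletown\n"
    "Total: 100.00"
)

address_by_index = rows_at(sample_text, 2, 3)
address_by_marker = rows_after_marker(sample_text, "bill to", 3)
```

The fixed-index version breaks as soon as a row is added or removed above the address. Searching for a label row first may be a bit more forgiving, provided your documents have such a label.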


The cloud methods that Ivo and Rudd suggested are also good options.
