Question

How to locate and extract the specific lines in a paragraph

Hi colleagues, here is my situation. I have retrieved the content of a PDF file using the Document Reader module. It contains many paragraphs and lines, and I want to locate and extract just one specific line. For example, from a large PDF, the only line I need is the one that starts with "Record No." followed by the number itself, such as "Record No. 54321". Thanks!

asked 2026-01-08

John Chris Valdez

2 answers

This sounds like a job for a regular expression.

So, if you just want to extract the string "Record No. 54321" from the extracted text, I would use the RegexReplaceAll action from the Community Commons module.

For the Haystack, use the extracted text from your PDF

For the Replacement use '$1'

For the Needle Regex use

'^(?s).*?(Record No\. \d+).*$'

The matched string ("Record No. 54321" in your example, or whatever the match is in your document) will be returned in the action's return variable, named $vRecordNumber here.
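To see what this pattern does outside of a microflow, here is a minimal Java sketch. It assumes RegexReplaceAll behaves like Java's String.replaceAll, which I have not verified against the module source; the class name, method name, and sample text are made up for illustration.

```java
final class RecordRegexSketch {
    // Same pattern as the Needle Regex above; (?s) lets '.' match line breaks too,
    // so the pattern spans the whole multi-line text.
    static final String NEEDLE = "^(?s).*?(Record No\\. \\d+).*$";

    // Hypothetical helper: replaces the entire text with capture group 1.
    static String extractRecordLine(String haystack) {
        return haystack.replaceAll(NEEDLE, "$1");
    }

    public static void main(String[] args) {
        // Sample text standing in for the Document Reader output (an assumption).
        String pdfText = "Invoice summary\r\nRecord No. 54321\r\nTotal: 10 items";
        System.out.println(extractRecordLine(pdfText));
    }
}
```

One thing to watch for: with plain Java replaceAll, a text that contains no "Record No." line does not match at all, so the original text comes back unchanged. It is worth checking the result (for example with startsWith) before trusting it.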

I wrote a blog post on this technique a few years ago.

Extracting text in Mendix using RegexReplaceAll

I hope this helps.

answered 2026-01-08

Robert Price

Ahmet Kudu · Accepted Answer · 2026-01-08

First, use the Document Reader module to read the PDF file and store the extracted content in a string variable (for example $PdfText).

Next, normalize the line breaks in the text, because PDF files can contain different line-ending formats. Replace all \r\n with \n, and then replace any remaining \r with \n. Store the result in a new string variable, such as $NormalizedText.

After normalization, split the text into individual lines using String split with \n as the separator. This will give you a list of strings where each item represents one line from the PDF.

Then, loop over this list of lines. Inside the loop, trim each line and check whether it starts with the expected prefix, for example "Record No.", using the startsWith function.

When a matching line is found, store it in a variable and exit the loop. If you only need the number itself, remove the "Record No." prefix using replace and apply trim to get the final value.
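The steps above can be sketched in plain Java, for instance as the body of a custom Java action. This is only an illustration of the logic, not Mendix microflow syntax; the class name, method names, and sample text are assumptions.

```java
import java.util.Optional;

final class RecordLineScan {
    static final String PREFIX = "Record No.";

    // Normalize line endings, split into lines, and return the first trimmed
    // line that starts with the prefix (empty if there is none).
    static Optional<String> findRecordLine(String pdfText) {
        String normalized = pdfText.replace("\r\n", "\n").replace("\r", "\n");
        for (String line : normalized.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(PREFIX)) {
                return Optional.of(trimmed); // stop at the first match
            }
        }
        return Optional.empty();
    }

    // Remove the prefix and trim to keep only the number part.
    static String numberOnly(String recordLine) {
        return recordLine.replace(PREFIX, "").trim();
    }

    public static void main(String[] args) {
        // Sample text with mixed line endings (an assumption for the demo).
        String pdfText = "Header\r\n  Record No. 54321 \rFooter";
        findRecordLine(pdfText).ifPresent(line -> {
            System.out.println(line);
            System.out.println(numberOnly(line));
        });
    }
}
```

Returning an empty Optional when nothing matches makes the "no such line in this PDF" case explicit; in a microflow the equivalent would be leaving the result variable empty and checking it after the loop.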