RWS Language Cloud: How to create custom segmentation rules?

Article Number: 000014738 | Last Updated: 1/27/2021 4:36 PM

Scope/Environment

RWS Language Cloud

Question

How do I create custom segmentation rules on Language Cloud?

Answer

If you want to force segmentation after punctuation marks other than the defaults (full stop, question mark, exclamation mark, etc.), you need to add new Segmentation Rules to the Language Processing Rule.

To create a new Segmentation Rule, first make sure that a Language Processing Rule exists.

If there is none, create one from:

Resources > Language Processing Rules > New Language Processing Rule

Next, open the Language Processing Rule and click Add Entry in the Customized Language Resources section.

The language selected must be the source language.
Note: The new rule applies ONLY to that source language, and only when this Language Processing Rule is used in your Project Template. Other source languages and other project templates are not affected by these settings.

Once you have added the source language, the new Language Processing Rule opens.

Go to the Segmentation Rules section and click New Entry to add a new rule.

In the Before Break and After Break sections, add your search patterns as regular expressions.
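
To make the roles of the two patterns concrete, the following Python sketch treats a break position as any point where the Before Break pattern matches the text ending at that point and the After Break pattern matches the text starting there. This reading is an assumption inferred from the example in this article, not a description of Language Cloud internals. `find_breaks` is a hypothetical helper, the semicolon rule is invented purely for illustration, and Python's `re` module may not behave identically to the regular expression engine used by Language Cloud.

```python
import re

def find_breaks(text, before_break, after_break):
    """Return character offsets where a segment break would be placed.

    Hypothetical helper: assumes a break sits right after a Before Break
    match that is immediately followed by an After Break match.
    """
    # Match the Before Break pattern, then require (without consuming it)
    # that the After Break pattern matches directly afterwards.
    boundary = re.compile(f"(?:{before_break})(?={after_break})")
    return [m.end() for m in boundary.finditer(text)]

# Illustrative rule only: break after a semicolon followed by whitespace.
sample = "First clause; second clause;third clause"
print(find_breaks(sample, r";", r"\s"))  # prints [13]
```

Only the first semicolon yields a break, because the second one is followed directly by a letter and the After Break pattern does not match there.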

Example:

You have the following text and want to segment after the "·" character, but only where the character is both preceded and followed by whitespace:

This is a test · Another test · Example ·No Segmentation

Your Before Break and After Break rules should look like this:

Before Break: \s[·]+

After Break: \s

The final segmentation should look like this:

This is a test ·
Another test ·
Example ·No Segmentation

Note: "Example ·No Segmentation" is not split, because no whitespace follows the "·" character.
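
The example above can be checked outside Language Cloud with a short Python sketch before the rule is saved. It assumes that a break is placed wherever the Before Break pattern is immediately followed by a match of the After Break pattern, and it trims surrounding whitespace from each segment for readability. `split_segments` is a hypothetical helper, and Python's `re` module may differ in details from the engine Language Cloud uses, so treat the result as a rough preview only.

```python
import re

def split_segments(text, before_break, after_break):
    """Split text at every position where before_break ends and
    after_break begins (hypothetical preview helper)."""
    # The lookahead keeps the After Break text out of the match, so it
    # stays at the start of the next segment.
    boundary = re.compile(f"(?:{before_break})(?={after_break})")
    segments, start = [], 0
    for m in boundary.finditer(text):
        segments.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

text = "This is a test · Another test · Example ·No Segmentation"
for segment in split_segments(text, r"\s[·]+", r"\s"):
    print(segment)
```

With the patterns from this article, the sketch yields three segments, and the last "·" does not trigger a break because it is followed by "N" rather than whitespace.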
