To exclude segments that contain only a measurement, such as "10A" or "100mm", you can use File Type Configurations and configure the Embedded Content settings to exclude this content. This requires basic knowledge of regular expressions.
For example, to modify the file type settings for Word (.docx) files, go to Resources → File Type Configurations → select a configuration → select "Microsoft Word 2007-2019".
- Note: this change only affects new projects. For existing projects, go to Projects → select the project → Settings → File Type Configuration. You may need to cancel existing files and re-upload them to the project for the new settings to apply.
In the settings for the Word file type (or whichever file type you chose), expand the "Embedded Content" section and do the following:
- Enable the "allow processing of embedded content" option.
- Set "Extract from" to "All paragraphs".
- Click the "Add new rule" button.
- In the "New Tag Definition Rule" panel on the right-hand side, set the following options:
  - Tag Type: Placeholder
  - Regular expression: ^\d+\s?(mm|cm|km|A)$
    - Note: this matches segments that consist entirely of a whole number followed by one of the measurement symbols listed in the parentheses, optionally separated by a single whitespace character. The symbols are separated by the pipe character '|'. Add each measurement symbol you want this rule to cover. Because \d+ matches digits only, a decimal value such as "1.5mm" is not matched by this expression.
  - Segmentation Hint: Exclude
- Click Save.
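
Before saving the rule, it can help to check the regular expression against a few sample segments. The following is a minimal sketch in Python, not part of the product itself: the helper name is_measurement_only and the sample strings are made up for illustration, and the regex engine used by the file type settings may differ from Python's in edge cases.

```python
import re

# The expression from the rule above: a whole number, an optional single
# whitespace character, then one of the listed measurement symbols.
MEASUREMENT_ONLY = re.compile(r"^\d+\s?(mm|cm|km|A)$")


def is_measurement_only(segment: str) -> bool:
    """Return True if the whole segment is just a number plus a unit.

    Hypothetical helper for testing the pattern outside the tool.
    """
    return MEASUREMENT_ONLY.match(segment) is not None


if __name__ == "__main__":
    # Illustrative sample segments (assumed, not taken from a real project).
    samples = ["10A", "100mm", "100 mm", "1.5mm", "10 A max", "mm"]
    for sample in samples:
        print(f"{sample!r}: {is_measurement_only(sample)}")
```

In this sketch, "10A", "100mm" and "100 mm" match, while "1.5mm", "10 A max" and "mm" do not. To cover more units, extend the alternation inside the parentheses, for example (mm|cm|km|A|kg).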