Segmentation in translation and the SRX standard format

Learn how to standardize segmentation rules across CAT tools using the SRX format to improve translation memory efficiency, control budgets, and maintain quality.

Iván Vázquez 15 Nov 2024 5 min read

Language Technology

Good text segmentation affects several things that separate good project management from poor project management. The most important are budget, translation memory management, and translation quality.

In this post, we will address some of these issues, ranging from the most general aspects of segmentation to more specific and technical matters.

What is segmentation?

When we import a file for translation into a CAT tool like Trados Studio or memoQ, the tool processes the file by dividing the translatable text into segments. Each segment usually corresponds to a sentence, which the tool identifies through end-of-sentence punctuation such as full stops, exclamation marks, and question marks.

Once the text is segmented, the translator's task is to provide a translation for each segment. A source segment paired with its translation forms a translation unit. This is essential for working with translation memories, because it allows the tool to identify matches, that is, segments that are already stored in the translation memory or that are repeated within the text. In this way, the translation of these segments can be automated.
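To make the idea of matches and repetitions concrete, here is a minimal Python sketch. It is not how any particular CAT tool works internally: the function name `analyse`, the sample segments, and the dictionary used as a translation memory are all assumptions made for illustration, and only exact matches are considered (real tools also score fuzzy matches).

```python
from collections import Counter

def analyse(segments, memory):
    """Label each segment as a TM match, an internal repetition, or new text."""
    seen = Counter()
    report = []
    for seg in segments:
        if seg in memory:
            status = "match"        # already stored in the translation memory
        elif seen[seg] > 0:
            status = "repetition"   # appeared earlier in the same text
        else:
            status = "new"          # has to be translated from scratch
        seen[seg] += 1
        report.append((seg, status))
    return report

# Hypothetical translation memory: source segment -> stored translation.
memory = {"Press the red button.": "Pulse el botón rojo."}

segments = ["Press the red button.", "Close the cover.", "Close the cover."]
report = analyse(segments, memory)
statuses = [status for _, status in report]
```

Only the segment labelled "new" needs fresh work from the translator, which is why the way a text is cut into segments has a direct effect on the budget.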

As mentioned above, the basic criterion for defining how a text is segmented is punctuation. In practice, segmentation rules are more complex, and each tool can establish them differently. For example, Trados Studio takes them from the translation memory applied to the project, whereas memoQ applies them directly to the project. Furthermore, each tool offers its own options for modifying these rules, which we will elaborate on later.

In general, segmentation rules determine two aspects: on one hand, the punctuation marks that indicate the end of a segment and, on the other, the exceptions to these rules.

To give a very typical example, if we establish that a segment break should occur after a full stop, we can list a series of abbreviations that end in a full stop (such as "Dr." or "etc.") so that, when they appear, the tool keeps the same segment open until the next full stop.

Finally, it is worth noting that segmentation rules are a language resource. Some elements are common to all languages, such as the full stop at the end of each segment, but others are specific to each language and must be modified language by language.
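The break rule and its exceptions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the logic of any specific tool: the `ABBREVIATIONS` list and the `segment` function are hypothetical, and the check is a plain "ends with" comparison.

```python
import re

# Hypothetical exception list: abbreviations that end in a full stop.
ABBREVIATIONS = ("Dr.", "Mr.", "e.g.", "etc.")

# Break rule: a full stop, question mark or exclamation mark followed by whitespace.
BREAK = re.compile(r"[.?!]\s+")

def segment(text):
    segments, start = [], 0
    for m in BREAK.finditer(text):
        candidate = text[start:m.end()].rstrip()
        # Exception: keep the segment open if it currently ends with an abbreviation.
        if candidate.endswith(ABBREVIATIONS):
            continue
        segments.append(candidate)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

result = segment("Dr. Smith arrived. He was late.")
```

Without the exception list, the same text would be cut after "Dr.", producing a fragment that is unlikely to match anything in a translation memory.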

Standardising segmentation rules: the SRX format

Segmentation Rules eXchange (SRX) is an XML-based open standard that provides a common way to define and share segmentation rules by means of regular expressions. Like the TMX format, it was developed by the Localization Industry Standards Association (LISA), and has been maintained by the Globalization and Localization Association (GALA) since 2011. It was created in response to a specific problem: a CAT tool could segment a text differently from the tool that had produced the translation memory, so the memory could not be applied effectively.

The SRX format is based on regular expressions, which are used to define segmentation rules. Regular expressions are patterns that describe and locate sequences of characters within a text. Thus, for segmentation rules, regular expressions allow the program to locate lowercase and uppercase letters, brackets, closing quotes, numbers, and punctuation marks, and to use them as criteria for deciding where to create a segment break.
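As an illustration of what such rules look like, the Python sketch below embeds a small hand-written SRX 2.0 fragment, reads it, and applies it to a sentence. Several points are assumptions: the sample rules were written for this example and do not come from any tool, the helper names `load_rules` and `segment` are hypothetical, and Python's `re` module stands in for the ICU regular expression syntax that SRX 2.0 specifies, so only simple patterns behave the same way.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample: one exception rule (break="no") and one break rule.
SRX_SAMPLE = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\b(Dr|Mr|etc)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]</beforebreak>
          <afterbreak>\s+</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>"""

NS = {"srx": "http://www.lisa.org/srx20"}

def load_rules(srx_text, rule_name):
    """Return (is_break, before_regex, after_regex) tuples in document order."""
    root = ET.fromstring(srx_text)
    rules = []
    for lang_rule in root.iterfind(".//srx:languagerule", NS):
        if lang_rule.get("languagerulename") != rule_name:
            continue
        for rule in lang_rule.iterfind("srx:rule", NS):
            before = rule.findtext("srx:beforebreak", default="", namespaces=NS)
            after = rule.findtext("srx:afterbreak", default="", namespaces=NS)
            rules.append((
                rule.get("break", "yes") == "yes",
                re.compile("(?:" + before + r")\Z"),  # must end at the candidate position
                re.compile(after),                    # must start at the candidate position
            ))
    return rules

def segment(text, rules):
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first rule that matches decides this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

rules = load_rules(SRX_SAMPLE, "English")
result = segment("Dr. Smith arrived. He was late.", rules)
```

Because the exception rule is listed first, it wins at the position after "Dr." and suppresses the break that the second rule would otherwise create. The sketch strips surrounding whitespace from each segment for readability, and it ignores the `maprules` section, which in a real file links language codes to rule sets.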

However, some programs offer simplified options for entering these characters without resorting to complex regular expressions. Regular expressions remain available for setting up more advanced segmentation rules.

Apart from making it possible to modify segmentation rules, the implementation of the SRX standard in CAT tools allows rule files to be exported and imported, so that the same segmentation rules can be applied in another project or a different tool. Next, we will take a closer look at the possibilities that the SRX format offers in two of the leading tools: Trados Studio and memoQ.

Implementation of the SRX format in Trados Studio

Trados Studio does not implement the SRX standard. When you open a file for translation, the program creates segments based on its own default segmentation rules.

To modify segmentation rules in Trados Studio, right-click on the translation memory and choose Settings. Once there, go to Language Resources and the settings for each language will be displayed. Look for the Segmentation Rules column and open the editor for the language whose rules you want to modify.

Next, two options will be shown: paragraph-based segmentation, which uses the paragraph marks specific to each file type, and sentence-based segmentation, which is the one that can be modified. The default segmentation rules break after a full stop, colon, question mark, or exclamation mark, except when these are followed by a lowercase letter.
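The default behaviour described above can be approximated with a single regular expression. The Python sketch below is only an approximation for illustration, not the rule set that Trados Studio actually ships: the name `split_sentences` is hypothetical, and the lowercase check covers ASCII letters only.

```python
import re

# Break on whitespace that follows . : ? or !, unless the next visible
# character is a lowercase letter (the \s in the lookahead stops the pattern
# from breaking in the middle of a run of spaces).
DEFAULT_BREAK = re.compile(r"(?<=[.:?!])\s+(?![a-z\s])")

def split_sentences(text):
    return [part for part in DEFAULT_BREAK.split(text.strip()) if part]

result = split_sentences("Note: Save first. The file is e.g. a memory. Done!")
```

In this sample, "e.g." does not trigger a break because it is followed by a lowercase letter, whereas the colon after "Note" does, because the next word is capitalised.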

In this panel, you can remove or edit these rules by adding characters before and after the break, as well as defining exceptions by using regular expressions. You can also add new rules following the same procedure.

In short, segmentation rules in Trados Studio are associated with the translation memory and not with a file type, and it is not possible to import or export them as an SRX file.

Implementation of the SRX format in memoQ

Segmentation rules in memoQ are set by default and can be modified for each specific project. To do this, open the project and access Settings. Once there, click the Segmentation Rules icon (the scissors icon) and select the set of segmentation rules for the language you want. A menu will open where you can modify these rules. You will find a simple view, where you can add punctuation marks, proper names starting with a lowercase letter, and abbreviations followed by numbers. In the advanced view, there is the option to use regular expressions for a more complex set-up of segmentation rules.

In the same window, you can find the option of exporting and importing an SRX file to use the same segmentation rules in other projects and tools. It is important to note that when exporting an SRX file, information about exceptions to the segmentation rules may be lost, as these are more sophisticated in memoQ than those allowed in SRX.


Iván Vázquez

Graduate in Translation and Interpreting from the University of Granada, specializing in French and Chinese. He has worked on several literary translation and web translation projects in Spain and France. Currently, he is a project management assistant and content writer at AbroadLink.
