Segmentation in Translation Memory

How we analyze your text for translation.

A segment is the smallest unit of text that can be understood on its own. It is the unit that’s used when working with translation memory.

Document Parts

Just as a wall is made of bricks, a document is made of a number of parts: paragraphs, sentences, phrases, terms, and words.

Our translators work in editing software that, among other things, counts the number of words in your documents. This technology is called Translation Memory (TM). Translation Memory software works at the segment level. It breaks down a source document into a set of segments, a segment usually being a sentence that ends with end-of-segment punctuation such as a full stop or a question mark. Translators then work through the document, translating and editing segments one after the other.

The term segment is used rather than sentence because, in some cases, a chunk of text may not be a complete sentence. Headings are one example.

Smaller units of text, such as individual words or phrases, are managed by terminology dictionaries.

Paragraph marks, page breaks, end-of-cell markers, tab characters, and similar structural breaks will always end a segment.
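
The rules above can be sketched in a few lines of Python. This is only an illustration, not how any particular Translation Memory tool works: the function name segment and both patterns are assumptions made for this example, and real tools use richer rule sets (for instance, to avoid splitting after abbreviations). Here a newline or tab stands in for the structural breaks, and a full stop, question mark, or exclamation mark followed by a space ends a segment.

```python
import re

# Assumed rules for this sketch (not taken from any real TM tool):
# 1. Structural breaks (newline or tab here) always end a segment.
# 2. Inside a block, '.', '?' or '!' followed by whitespace ends a segment.
HARD_BREAK = re.compile(r"[\n\t]+")
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def segment(text):
    """Split text into segments using the two assumed rules above."""
    segments = []
    for block in HARD_BREAK.split(text):
        for piece in SENTENCE_END.split(block.strip()):
            if piece:  # skip empty strings left by consecutive breaks
                segments.append(piece)
    return segments


# A heading with no final punctuation still becomes its own segment.
print(segment("Safety Instructions\nOpen the valve. Is it closed?"))

# A sentence broken by hard line breaks is wrongly split into three segments.
print(segment("This is a sentence\nthat is not on a single line\nbut in multiple lines."))
```

The second call shows why hard line breaks inside a sentence cause trouble: the software has no way to know the three lines belong together.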

What is segmentation?

Example

Bad Segmentation.

This is a sentence

that is not on a single line

but in multiple lines.

This should have been displayed as…

This is a sentence that is not on a single line but in multiple lines.

Machine labels are often laid out like this:

  • POTENTIALLY
  • HAZARDOUS
  • VENT

This label says “Potentially Hazardous Vent”, but it is written as three separate lines of text. When presented to a translator, it looks like three different words, not one sentence. It is better if this text is written as a single sentence and placed into a table cell, so that it automatically wraps to fit the square shape. This way the translator will see it as one sentence.

Repetitions – 100% Matches – Fuzzy Matches

As the translators work, each segment for translation is compared to previous translations stored in the TM, and matches are presented to the translators automatically as they translate, very much like predictive text on your phone. A segment in the TM that is identical to the segment for translation is considered a 100% match.

If there is no exact match, but there are segments in the TM that are similar to the one being translated, then these are presented as fuzzy matches. Each is ranked by a percentage ranging from 0% to 99%, where a higher percentage means the match is closer in content to the sentence being translated. A 99% match might differ by only a single letter or punctuation mark, whereas a 75% match might have several different words. Generally, matches below the 70% mark are not useful.
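
To make the idea of a match percentage concrete, here is a minimal sketch in Python. It uses the standard library's difflib.SequenceMatcher as a stand-in for a similarity score. Commercial TM tools use their own scoring methods, so the percentages they report will differ. The function names, the sample TM entry, and the default threshold of 70 (taken from the rule of thumb above) are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def match_percent(a, b):
    """Character-level similarity of two segments as a whole percentage."""
    return int(SequenceMatcher(None, a, b).ratio() * 100)


def best_match(segment, tm, threshold=70):
    """Return (score, source, target) for the closest TM entry, or None.

    tm is a plain dict mapping source segments to their translations.
    Identical text scores 100; anything else is capped at 99 (fuzzy).
    """
    best = None
    for source, target in tm.items():
        if source == segment:
            score = 100
        else:
            score = min(match_percent(source, segment), 99)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Illustrative TM with a single entry (sample data, not from a real project).
tm = {"Close the valve.": "Fermez la vanne."}

print(best_match("Close the valve.", tm))                  # exact match
print(best_match("Close the valve!", tm))                  # fuzzy match
print(best_match("Wear safety goggles at all times.", tm)) # below threshold
```

The threshold is a parameter so that the cut-off can be adjusted.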

When a document contains several identical segments that are not currently in the TM, these segments are known as repetitions. Most translation memory tools can identify potential repetitions before translation begins. The advantage of repetitions is that after the first occurrence is translated, the rest will become 100% matches.

As the translator works, each newly translated sentence is added to the TM. Thus, that new sentence can become a 100% match or even a fuzzy match for other sentences in the document. Repetitions are those segments that have the potential to become 100% matches.
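
Counting repetitions before translation begins can be sketched as follows. This is a simplified illustration that assumes the document has already been split into segments and that the TM is a plain Python dict. The function name find_repetitions and the sample data are made up for this example.

```python
from collections import Counter


def find_repetitions(segments, tm):
    """Return segments that occur more than once and are not yet in the TM.

    The result maps each repeated segment to the number of times it occurs.
    """
    counts = Counter(s for s in segments if s not in tm)
    return {s: n for s, n in counts.items() if n > 1}


# Sample data for illustration only.
tm = {"Close the valve.": "Fermez la vanne."}
document = [
    "Press Start.",
    "Wait for the green light.",
    "Press Start.",
    "Close the valve.",
    "Press Start.",
]

print(find_repetitions(document, tm))
```

In this sample, "Press Start." occurs three times and is not in the TM. Once its first occurrence is translated and stored, the other two become 100% matches. "Close the valve." is already in the TM, so it is a 100% match from the start rather than a repetition.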