Translation memory (TM) matching and scoring involve several stages, from how segments are stored to how candidates are retrieved and scored:
Storage Format:
- The full segment text and translation unit (TU) metadata are stored in a serialized format (binary for server-based TMs, XML for others). This data is used when scoring candidates, not when retrieving them.
- An index stores hash values for the source and target segments, each computed from a tokenized 'identity string', together with segment features.
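The hash index described above can be sketched roughly as follows. This is an illustration only, not the actual TM implementation: the normalization in identity_string, the choice of SHA-1, and all function names are assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict


def identity_string(segment: str) -> str:
    # Hypothetical normalization: lowercase word tokens joined by single spaces.
    return " ".join(re.findall(r"\w+", segment.lower()))


def segment_hash(segment: str) -> str:
    # Hash of the identity string; SHA-1 is an arbitrary choice for this sketch.
    return hashlib.sha1(identity_string(segment).encode("utf-8")).hexdigest()


def build_hash_index(translation_units):
    # Map each source-segment hash to the IDs of the TUs that share it.
    index = defaultdict(list)
    for tu_id, source_text in translation_units.items():
        index[segment_hash(source_text)].append(tu_id)
    return index


tus = {1: "Click the Save button.", 2: "click the save button", 3: "Click Cancel."}
index = build_hash_index(tus)
same_hash = index[segment_hash("Click the  Save button")]  # IDs of TUs 1 and 2
```

Under a normalization like this one, segments that differ in case or punctuation share a hash, so a hash hit is only a candidate and still has to be scored.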
Candidate Retrieval: Exact Match:
- Exact matches are checked first by retrieving segments with the same hash value as the query segment.
- Retrieved segments are tokenized and scored for similarity. Segments with a BaseScore of 100% are flagged as exact matches.
- Candidates are retrieved in batches, and retrieval stops once a batch contains at least one exact match.
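The exact-match steps above might look like the following sketch. The batch size, the placeholder base_score function, and the shape of the candidate list are assumptions for illustration; the real BaseScore computation is more involved.

```python
def base_score(query_tokens, candidate_tokens):
    # Placeholder scorer (assumption): 100 for identical token lists, else 0.
    return 100 if query_tokens == candidate_tokens else 0


def find_exact_matches(query_tokens, candidates, batch_size=2):
    # candidates: list of (tu_id, tokens) pairs that share the query's hash.
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        exact = [tu_id for tu_id, tokens in batch
                 if base_score(query_tokens, tokens) == 100]
        if exact:
            # Stop after the first batch that yields any exact match.
            return exact
    return []


candidates = [
    (1, ["Open", "file"]),   # hash collision, not an exact match
    (2, ["Save", "file"]),
    (3, ["Save", "file"]),   # never inspected: batch one already matched
]
matches = find_exact_matches(["Save", "file"], candidates)
```

Note that stopping at the first successful batch means later duplicates of the same segment, such as TU 3 here, are not returned.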
Candidate Retrieval: Fuzzy Match:
- Fuzzy matching seeks up to maxResults TUs (the user-defined value or 20, whichever is larger) with similarity scores at or above a threshold, typically 70%.
- The similarity score compared against the threshold is derived from edit-distance calculations.
- Candidate identification does not use full token information; it relies on the segment feature index instead.
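As a rough sketch of feature-based candidate identification: the source does not say which segment features are indexed, so this example assumes character trigrams, and the names features, build_feature_index, fuzzy_candidates, and min_overlap are all hypothetical.

```python
from collections import Counter, defaultdict


def features(segment, n=3):
    # Assumed feature type: the set of character trigrams of the normalized text.
    text = " ".join(segment.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_feature_index(segments):
    # Inverted index: feature -> set of TU IDs containing that feature.
    index = defaultdict(set)
    for tu_id, text in segments.items():
        for feature in features(text):
            index[feature].add(tu_id)
    return index


def fuzzy_candidates(query, index, user_max=5, min_overlap=0.5):
    # maxResults is the user-defined value or 20, whichever is larger.
    max_results = max(user_max, 20)
    query_features = features(query)
    if not query_features:
        return []
    hits = Counter()
    for feature in query_features:
        for tu_id in index.get(feature, ()):
            hits[tu_id] += 1
    # Keep TUs sharing enough features with the query, best overlap first.
    ranked = [(tu_id, count / len(query_features))
              for tu_id, count in hits.most_common()
              if count / len(query_features) >= min_overlap]
    return ranked[:max_results]


segments = {
    1: "Click the Save button",
    2: "Click the Cancel button",
    3: "Completely unrelated text",
}
candidates = fuzzy_candidates("Click the Save button.", build_feature_index(segments))
candidate_ids = [tu_id for tu_id, _ in candidates]
```

The overlap ratio here is only a cheap pre-filter. Candidates that pass it would still go through the edit-distance scoring before being compared against the fuzzy threshold.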
Candidate Scoring:
- Phase One: Computes the cheapest edit path using coarse-grained token comparisons, ignoring character-level differences.
- Phase Two: Refines the cost with fine-grained token comparisons, considering character-level differences.
- Phase Three: Calculates the final 'simplified' match score. The token penalty costs are summed and divided by the number of tokens, with punctuation, whitespace differences, and stop words weighted less than content words.
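The scoring idea can be illustrated with a simplified weighted edit distance over tokens. This sketch folds the three phases into a single pass and is not the actual scoring algorithm: the stop-word list, the weights (0.25, 0.5, 1.0), the use of difflib for character-level similarity, and normalizing by the longer segment's token count are all assumptions.

```python
from difflib import SequenceMatcher

STOP_WORDS = {"the", "a", "an", "of", "to"}  # hypothetical stop-word list


def weight(token):
    # Assumed weights: punctuation/whitespace < stop words < content words.
    if not any(ch.isalnum() for ch in token):
        return 0.25
    if token.lower() in STOP_WORDS:
        return 0.5
    return 1.0


def substitution_cost(a, b):
    # Fine-grained comparison: scale the penalty by character-level difference.
    if a == b:
        return 0.0
    char_difference = 1.0 - SequenceMatcher(None, a, b).ratio()
    return max(weight(a), weight(b)) * char_difference


def match_score(query, candidate):
    # Cheapest edit path over tokens, computed by dynamic programming.
    n, m = len(query), len(candidate)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weight(query[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + weight(candidate[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + weight(query[i - 1]),          # delete a token
                dp[i][j - 1] + weight(candidate[j - 1]),      # insert a token
                dp[i - 1][j - 1]
                + substitution_cost(query[i - 1], candidate[j - 1]),
            )
    longest = max(n, m)
    if longest == 0:
        return 100
    # Total penalty divided by token count, expressed as a percentage.
    return max(0, round(100 * (1 - dp[n][m] / longest)))


query = ["Click", "the", "Save", "button", "."]
punct_variant = ["Click", "the", "Save", "button", "!"]
word_variant = ["Click", "the", "Cancel", "button", "."]
```

With these weights, match_score(query, punct_variant) comes out higher than match_score(query, word_variant), reflecting the lower weight given to punctuation differences.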
Formula: TM Leverage = 100 - ((new words + MT words) * 100 / word count)
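As a worked example of the leverage formula, the short function below computes it directly. The function name and the sample word counts are made up for illustration.

```python
def tm_leverage(new_words, mt_words, word_count):
    # TM Leverage = 100 - ((new words + MT words) * 100 / word count)
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100 - (new_words + mt_words) * 100 / word_count


# 1,000 words in total, of which 200 are new and 100 are machine translated.
leverage = tm_leverage(new_words=200, mt_words=100, word_count=1000)  # 70.0
```

In this example, 300 of the 1,000 words are not covered by the TM, so the leverage is 70%.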