An OmegaT-based TMS for simple translation workflows

Powered by
Bash, OmegaT, GitHub, Nextcloud and Python

Surpass Delta

Client: Prometric

Project: localization of the UI for Surpass Delta product

Periodicity: roughly once a quarter

Task: to translate new strings of text for new features that are being released on Delta

Languages: English to 59 languages (always the same)

Format: Excel (only some sheets/columns)

Hints

  • Repetitive workflow
  • Cumbersome management
  • File format is constant
  • Translation scope:
    • not all files for all language versions
    • not all parts of the (Excel) file

Challenges

  • How each PM organizes things
  • Sending emails
    • Booking subcontractors
    • Assigning jobs
  • File management
    • Uploading files
    • Downloading files
    • Putting files in the right folder
  • Reviewing deliverables
    • Checking completion
    • Checking tags

What are the manual steps that take most of the PM's time?

  • Sending emails
    • Booking subcontractors
    • Assigning jobs
  • File management
    • Uploading files
    • Downloading files
    • Putting files in the right folder
  • Reviewing deliverables
    • Checking completion
    • Checking tags

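[Diagram: how OmegaT uses the project folders. Folder labels: /source (original docs), /target (translated docs), /tm (reference TM(s)), /glossary (terminology), /omegat (working TM), / (master TM). Flow labels: user input, extraction (text, skeleton), concordances, leverage (matches), saved, bilingual, merge.]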

PROJECT
├── dictionary
├── glossary
├── omegat
│   ├── filters.xml
│   └── filter@configuration.fprm
├── omegat.project
├── source
│   └── file.txt
├── target
│   └── file.txt
└── tm
    └── file.tmx

OmegaT project

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<omegat>
  <project version="1.0">
    <source_dir>__DEFAULT__</source_dir>
    <source_dir_excludes>
      <mask>**/.svn/**</mask>
      <mask>**/.git/**</mask>
      <mask>**/.hg/**</mask>
      <mask>**/.repositories/**</mask>
      <mask>**/Thumbs.db</mask>
      <mask>**/.DS_Store</mask>
      <mask>**/~$*</mask>
    </source_dir_excludes>
    <target_dir>__DEFAULT__</target_dir>
    <tm_dir>__DEFAULT__</tm_dir>
    <glossary_dir>__DEFAULT__</glossary_dir>
    <glossary_file>__DEFAULT__</glossary_file>
    <dictionary_dir>__DEFAULT__</dictionary_dir>
    <source_lang>en</source_lang>
    <target_lang>bg-BG</target_lang>
    <source_tok>org.omegat.tokenizer.LuceneEnglishTokenizer</source_tok>
    <target_tok>org.omegat.tokenizer.LuceneBulgarianTokenizer</target_tok>
    <sentence_seg>true</sentence_seg>
    <support_default_translations>true</support_default_translations>
    <remove_tags>false</remove_tags>
  </project>
</omegat>

Project structure

  • omegat/: the working TM sits here (not displayed)
  • tm/: the reference TM(s) sit here
  • project root: the master TMs are generated here
  • omegat.project: these are the project settings

Folder structure in server

~/02_Clients/[CLIENT]/01_PROJECTS/[PROJECT]/01_Translation$ tree -L 1
.
├── 00_Admin
├── 10_History
├── 20_Automation
├── 30_Incoming
├── 40_Jobs
├── 50_Repos
├── 80_Deliverables
└── 90_Assets

input: 30_Incoming (file drop area)

output: 80_Deliverables (file retrieval area)

Application modules

  1. Initiation (common for all versions)
  2. Create repositories for each version
  3. Harvesting translations

1. Initiation

  • Precondition: File format is constant and predictable
  • The client drops a batch of files for translation in the file drop area, which is connected to 30_Incoming
  • A job folder is created for the batch of files under the PM folder (e.g. 40_Jobs > 2022_AUG01 > 01_Source) and the original files are moved there
  • Pre-processing (convert Excel to JSON): extract translatable text and key columns
  • A job folder is created for the batch in the common files repository (e.g. 50_Repos > 01_Common > PROJ_common_files > files > 2022_AUG01) and source (JSON) files are saved there
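
The pre-processing step could look roughly like the sketch below. It is not the actual implementation: it assumes the relevant sheet has already been read into a list of row dictionaries (reading the Excel file itself needs a third-party library and is left out), and the column names `key` and `en` are hypothetical.

```python
import json


def rows_to_json(rows, key_col="key", text_col="en"):
    """Keep only the key column and the translatable text column.

    `rows` is a list of dicts, one per spreadsheet row (assumed shape).
    Rows without a key or without text are skipped.
    """
    extracted = {}
    for row in rows:
        key = row.get(key_col)
        text = row.get(text_col)
        if key and text:
            extracted[str(key)] = str(text)
    # ensure_ascii=False keeps non-ASCII characters readable in the JSON file
    return json.dumps(extracted, ensure_ascii=False, indent=2)


# Hypothetical rows: only "key" and "en" are in scope, "notes" is ignored
rows = [
    {"key": "btn.save", "en": "Save", "notes": "toolbar"},
    {"key": "btn.cancel", "en": "Cancel", "notes": ""},
    {"key": "", "en": "orphan text"},
]
print(rows_to_json(rows))
```

Keeping the key next to each string is what later allows the translation to be put back in the right cell of the original Excel file.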

Job/batch folder

2022_AUG02 = current year (2022) + current month (AUG) + current job within the month (02)
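
The naming scheme can be sketched as follows. This is an illustration only: the function name and the way existing job folders are passed in are assumptions, not the actual script.

```python
from datetime import date

# Explicit month abbreviations, so the result does not depend on the locale
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def next_job_dname(existing, today):
    """Build the next job folder name: year, month, job number within the month."""
    prefix = f"{today.year}_{MONTHS[today.month - 1]}"
    # Collect the job numbers already used in the current month
    used = [
        int(name[len(prefix):])
        for name in existing
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    ]
    return f"{prefix}{max(used, default=0) + 1:02d}"


print(next_job_dname(["2022_JUL01", "2022_AUG01"], date(2022, 8, 19)))
```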

.
├── 30_Incoming
├── 40_Jobs
│   └── 2022_AUG01
│       ├── 00_Admin
│       ├── 01_Source
│       │   ├── file1_en.xls
│       │   └── file2_en.xls
│       ├── 02_Target
│       └── 03_Review
│           ├── Clean_Files
│           └── Notes_Files
├── 50_Repos
├── 80_Deliverables
└── 90_Assets

Folder structure in server

.
├── 30_Incoming
├── 40_Jobs
├── 50_Repos
│   ├── 01_Common
│   │   └── PROJ_common_files
│   │       ├── files
│   │       │   └── 2022_AUG01
│   │       │       ├── file1_en.xls.json
│   │       │       └── file2_en.xls.json
│   │       └── settings
│   ├── 02_Versions
│   ├── 03_Harvest
│   └── repo_urls.txt
├── 80_Deliverables
└── 90_Assets

Folder structure in server

> org="capstan-PROJ"
> common_repo="PROJ_common_files"
> team="translators"

> # --- 

> cd /path/to/PROJ_common_files

> git init
> git add . && git commit -m "initial commit"
> gh repo create "$org/$common_repo" --private --source=. --remote=origin --team "$team"
> git push --set-upstream origin master

Create common files repo

> org="capstan-PROJ"
> common_repo="PROJ_common_files"
> team="translators"
> job_dname="2022_AUG01" # for example

> # add pre-processed json files

> cd /path/to/PROJ_common_files

> git add .
> git commit -m "New files added for job $job_dname"
> git push

Push new batch

2. Create version repos

  • Required: Version specifications
  • For each version:
    • Create GitHub repository (and clone it)
    • Initialize OmegaT project in the local clone
    • Add repository mappings
      • to source files
      • to settings
      • to TMs
    • Mask files that don't need to be translated
    • Push files to the repo
    • Write the repo's URL for the PM
50_Repos/
├── 01_Common
│   └── Delta_common_files
├── 02_Versions
│   ├── Delta_amh-ETH_OMT
│   ├── Delta_ara-ZZZ_OMT
│   ├── Delta_bul-BGR_OMT
│   └── _tech
└── repo_urls.txt

Folder structure in server
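
Going by the folder names in the tree above, each version's repository name seems to combine the project name, the version code and an `_OMT` suffix. A minimal sketch of deriving the names and URLs from the version specs follows. The function name is hypothetical, and writing one URL per line to `repo_urls.txt` is an assumption about that file's format.

```python
def version_repo_urls(org, project, versions):
    """Return the GitHub URL of each version's repository."""
    urls = []
    for version in versions:
        repo = f"{project}_{version}_OMT"  # e.g. Delta_bul-BGR_OMT
        urls.append(f"https://github.com/{org}/{repo}.git")
    return urls


urls = version_repo_urls("capstanlqc-delta", "Delta", ["amh-ETH", "ara-ZZZ", "bul-BGR"])
# Assumed format for repo_urls.txt: one URL per line
print("\n".join(urls))
```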

> org="capstan-PROJ"
> omtprj_dname="PROJ_VERSION_files"
> team="translators"

> cd /path/to/version/omegat_project_dir

> gh repo create "$org/$omtprj_dname" --private --clone --team "$team"
> cd "$omtprj_dname" # --clone puts the new clone in a subfolder

> # add repository mappings, mask files out of scope

> git add .
> git commit -m "Initial commit -- creating omegat team project repo"
> git push --set-upstream origin master

Create each version's repo

PROJECT
├── dictionary
├── glossary
├── omegat
│   ├── filters.xml
│   └── filter@config.fprm
├── omegat.project
├── source
│   └── file.txt
├── target
└── tm
    └── file.tmx

The source files and the filter settings under omegat/ are common to all language versions: they come from the common files repo through repository mappings.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<omegat>
  <project version="1.0">
    <source_dir>__DEFAULT__</source_dir>
    <source_dir_excludes>
      <mask>**/.svn/**</mask>
      <mask>**/.git/**</mask>
      <mask>**/.hg/**</mask>
      <mask>**/.repositories/**</mask>
      <mask>**/Thumbs.db</mask>
      <mask>**/.DS_Store</mask>
      <mask>**/~$*</mask>
    </source_dir_excludes>
    <target_dir>__DEFAULT__</target_dir>
    <tm_dir>__DEFAULT__</tm_dir>
    <glossary_dir>__DEFAULT__</glossary_dir>
    <glossary_file>__DEFAULT__</glossary_file>
    <dictionary_dir>__DEFAULT__</dictionary_dir>
    <source_lang>en</source_lang>
    <target_lang>bg-BG</target_lang>
    <source_tok>org.omegat.tokenizer.LuceneEnglishTokenizer</source_tok>
    <target_tok>org.omegat.tokenizer.LuceneBulgarianTokenizer</target_tok>
    <sentence_seg>true</sentence_seg>
    <support_default_translations>true</support_default_translations>
    <remove_tags>false</remove_tags>
    <repositories>
      <repository type="git" url="https://github.com/capstanlqc-delta/Delta_bul-BGR_OMT.git">
        <mapping local="/" repository="/"/>
      </repository>
      <repository type="git" url="https://github.com/capstanlqc-delta/Delta_common_files.git">
        <mapping local="source" repository="files"/>
        <mapping local="omegat/okf_json@delta.fprm" repository="settings/okf_json@delta.fprm"/>
        <mapping local="omegat/filters.xml" repository="settings/filters.xml"/>
      </repository>
    </repositories>
  </project>
</omegat>

Repository mappings
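
Adding such mappings to `omegat.project` can be scripted. The sketch below is one possible way to do it with Python's standard library; the function name is hypothetical and the input is a cut-down project file used only for illustration.

```python
import xml.etree.ElementTree as ET


def add_repository(project_xml, url, mappings):
    """Return project_xml with one more <repository> and its <mapping> children."""
    root = ET.fromstring(project_xml)
    project = root.find("project")
    repos = project.find("repositories")
    if repos is None:
        # First mapping for this project: create the container element
        repos = ET.SubElement(project, "repositories")
    repo = ET.SubElement(repos, "repository", {"type": "git", "url": url})
    for local, remote in mappings:
        ET.SubElement(repo, "mapping", {"local": local, "repository": remote})
    return ET.tostring(root, encoding="unicode")


# Cut-down project file, for illustration only
minimal = '<omegat><project version="1.0"><source_lang>en</source_lang></project></omegat>'
updated = add_repository(
    minimal,
    "https://github.com/capstanlqc-delta/Delta_common_files.git",
    [("source", "files"), ("omegat/filters.xml", "settings/filters.xml")],
)
print(updated)
```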

Version specs

Config

3. Harvest translations

  • For each version:
    • Clone repo or fetch files from repo
    • Check if target files have been committed, and if so:
      • Run OmegaT on the project to get latest version of target files and get word counts
      • Check if all segments have been translated, and if so:
        • Post-process target files: extract translation from target JSON and put it in original Excel format
        • Put JSON and Excel in the PM folder (e.g. 40_Jobs > 2022_AUG01 > 02_Target > [VERSION] > 20220819-151217)
        • Put JSON and Excel in the deliverables folder (e.g. 80_Deliverables > 2022_AUG01 > [VERSION])
        • Remove target files from repository
50_Repos/
├── 01_Common
│   └── Delta_common_files
├── 02_Versions
│   ├── Delta_amh-ETH_OMT
│   ├── Delta_ara-ZZZ_OMT
│   ├── Delta_bul-BGR_OMT
│   └── _tech
├── 03_Harvest
│   ├── Delta_amh-ETH_OMT
│   ├── Delta_ara-ZZZ_OMT
│   └── Delta_bul-BGR_OMT
└── repo_urls.txt

Folder structure in server
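
One way to picture the completion check in the harvest steps is a comparison of the source and target JSON files. The sketch below is an assumption, not the actual check: it treats both files as flat key-to-text mappings and flags entries that are missing, empty or identical to the source. Identical strings can be legitimate (e.g. product names), so the result is a list to review rather than a verdict.

```python
def untranslated_keys(source, target):
    """List keys whose target text is missing, empty or identical to the source."""
    pending = []
    for key, src_text in source.items():
        tgt_text = target.get(key, "")
        if not tgt_text.strip() or tgt_text == src_text:
            pending.append(key)
    return pending


# Hypothetical content of a source JSON file and its target counterpart
source = {"btn.save": "Save", "btn.cancel": "Cancel", "btn.ok": "OK"}
target = {"btn.save": "Запази", "btn.cancel": "Cancel"}
print(untranslated_keys(source, target))
```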

.
├── 40_Jobs
│   └── 2022_AUG01
│       ├── 00_Admin
│       ├── 01_Source
│       ├── 02_Target
│       └── 03_Review
├── 50_Repos
│   ├── 01_Common
│   ├── 02_Versions
│   └── 03_Harvest
│       ├── Delta_amh-ETH_OMT
│       ├── Delta_ara-ZZZ_OMT
│       └── Delta_bul-BGR_OMT
└── 80_Deliverables

Folder structure in server
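
The two destinations for harvested files can be built as in the sketch below. The function name and the root path are hypothetical; the folder layout and the timestamp pattern follow the examples given in the harvest steps (e.g. 20220819-151217).

```python
from datetime import datetime
from pathlib import PurePosixPath


def harvest_destinations(root, job_dname, version, when):
    """Return the PM folder and the deliverables folder for one harvested version."""
    stamp = when.strftime("%Y%m%d-%H%M%S")  # e.g. 20220819-151217
    root = PurePosixPath(root)
    pm_dir = root / "40_Jobs" / job_dname / "02_Target" / version / stamp
    deliverables_dir = root / "80_Deliverables" / job_dname / version
    return pm_dir, deliverables_dir


pm_dir, deliverables_dir = harvest_destinations(
    "01_Translation", "2022_AUG01", "bul-BGR", datetime(2022, 8, 19, 15, 12, 17)
)
print(pm_dir)
print(deliverables_dir)
```

The timestamped subfolder keeps every harvest of the same version, so earlier deliveries are not overwritten when a translator commits again.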

target JSON files → done XLS files

> org="capstan-PROJ"
> omtprj_dname="PROJ_VERSION_files"
> team="translators"

> cd /path/to/harvest/folder

# if never cloned:
> gh repo clone $org/$omtprj_dname

# if already cloned:
> cd /path/to/harvest/folder/$omtprj_dname
> git fetch --all
> git reset --hard origin/master

Pull version's target files
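
The choice between cloning and refreshing could be automated along these lines. This is a sketch under assumptions: the function name is hypothetical, and it only builds the same commands as above without running them.

```python
import tempfile
from pathlib import Path


def sync_commands(harvest_dir, org, repo):
    """Return (command, working directory) pairs to get the latest target files."""
    clone_dir = Path(harvest_dir) / repo
    if not (clone_dir / ".git").is_dir():
        # Never cloned: clone into the harvest folder
        return [(["gh", "repo", "clone", f"{org}/{repo}"], Path(harvest_dir))]
    # Already cloned: discard local state and align with the remote
    return [
        (["git", "fetch", "--all"], clone_dir),
        (["git", "reset", "--hard", "origin/master"], clone_dir),
    ]


with tempfile.TemporaryDirectory() as harvest_dir:
    for command, cwd in sync_commands(harvest_dir, "capstan-PROJ", "PROJ_VERSION_files"):
        print(" ".join(command))
```

Each pair could then be passed to `subprocess.run(command, cwd=cwd, check=True)` to actually run it.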

@todo

Automated notifications:

  • To each translator, when a new batch is available
  • To the PM, when a translator commits files but not all segments are translated
  • What else?

Automated comment handling:

  • TBD with PM

Automated access management:

  • Revoke rights from translators before revision starts
  • Grant rights to revisers before revision starts

PM's manual actions

  • Prepare the config file and version specs file (once)
  • Share URL of the file drop area with client (once)
  • Share URL of the file retrieval area with client (once)
  • Book translators (one per version + backup?)
  • Send instructions and repo URLs to translators (once)
  • Review deliverables (per job, per version)
  • If necessary, ask linguists to make changes and redeliver
  • Notify client that they can fetch all deliverables (per job)
  • Optional: Delete job folder from 01_Common after delivery

 

In other words: no file uploads/downloads

In-house review

  • PM reviews files in 40_Jobs > etc.
  • PM asks for changes
  • Translator makes changes and commits target files again
  • Translations are harvested, files are post-processed again

Work in progress

Outsourced revision

Work in progress

Three options:

  • PM grants translators and revisers simultaneous access to the same repository
    • PM notifies revisers when they can start
    • PM trusts translators to refrain from making changes after they have committed their work
  • PM grants access consecutively, first to translators and then to revisers, for each version
    • Safer but requires more manual work
  • Two repositories (one for translation, one for revision) accessible by different teams

Delivery to client

  • PM notifies the client that they can fetch all target files for all versions

* bash 5.1

* omegat 5.7

* github

* python 3.8

* nextcloud

Questions?

An OmegaT-based TMS

By cApStAn LQC
