Sentence Level Segmentation and OAuth2 

Pulkit Pushkarna

Introduction

  • In Cloudwords, when translating an asset, we extract the text blocks as paragraphs and translate them.
  • We now have the option to split each paragraph into sentences and translate sentence by sentence.

Scope of current Implementation

  • For now, sentence level segmentation is available only for the Eloqua integration.
  • In the future we will introduce segmentation for other integrations.

Tool we have used for segmentation

  • Pragmatic Segmenter

Languages supported for Segmentation

  • Pragmatic Segmenter supports the following languages for segmentation:
    • Amharic
    • Arabic
    • Armenian
    • Burmese
    • Chinese (Simplified)
    • English
    • Greek
    • Hindi
    • Japanese
    • Persian

Pragmatic Segmenter response to sibling languages

  • We experimented with languages that share a script with a supported language.
  • When we passed English as the language parameter for French and German text, the tool split the sentences correctly.
  • With some customization we have now added French and German as well.
  • Pragmatic Segmenter works well for these two languages with our customization.
  • In the future we will identify more languages that share a script with a supported language, which should let us support many more languages.
  • So Cloudwords supports segmentation for French and German in addition to the languages supported by Pragmatic Segmenter.
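The sibling-language idea above can be sketched as a lookup that maps a requested language to the language actually passed to the segmenter. This is a minimal illustration, not the Cloudwords implementation: the function name, the use of ISO 639-1 codes, and the handling of regional variants such as `fr-BF` are assumptions made for the example.

```python
# Languages listed on the slide as supported by Pragmatic Segmenter
# (ISO 639-1 codes are an assumption; the slide lists names only).
NATIVE_LANGUAGES = {"am", "ar", "hy", "my", "zh", "en", "el", "hi", "ja", "fa"}

# Sibling languages that are segmented by passing English to the tool.
SIBLING_FALLBACKS = {"fr": "en", "de": "en"}


def segmenter_language(language_code):
    """Return the language to pass to the segmenter, or None if unsupported."""
    # Hypothetical normalisation: "fr-BF" and "FR" are both treated as "fr".
    base = language_code.lower().split("-")[0]
    if base in NATIVE_LANGUAGES:
        return base
    return SIBLING_FALLBACKS.get(base)
```

Adding another sibling language would then only require one more entry in `SIBLING_FALLBACKS`, after verifying that the borrowed rules segment it correctly.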

Enabling Sentence Level Segmentation in Cloudwords

  • To enable sentence level segmentation, check and save the following option in the admin panel for the customer.

 

Eloqua assets which can be tested for sentence Level Segmentation

  • Sunil_Dynamic_Email_19- Feb-2020

  • Sunil_Simple_Landing Page_MT

Segmentation problem in non-clonable elements

  • A non-clonable element represents content that must be translated as one piece and cannot be segmented, e.g.
    <source>This is a <g>sentence. It has</g> markup.</source>
  • For this reason we do not break text blocks that contain HTML tags.
  • So for now we segment only plain text blocks that do not contain any HTML tags.
  • For more information please refer to the following link: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_Segmentation
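The rule above (segment plain text, leave any block with markup whole) can be sketched as follows. The regular-expression sentence splitter is only a naive stand-in for Pragmatic Segmenter, and the function and pattern names are hypothetical.

```python
import re

# Any angle-bracket tag marks the block as "contains markup".
TAG_PATTERN = re.compile(r"<[^>]+>")

# Naive stand-in for the real segmenter: split on whitespace that
# follows a sentence-ending punctuation mark.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def segment_text_block(text):
    """Return a list of segments; blocks containing tags are never split."""
    if TAG_PATTERN.search(text):
        return [text]
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]
```

A block with markup comes back as a single segment, so it is translated as one piece exactly as before segmentation was introduced.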

Example of non-clonable elements

  • <source><g>This is a sentence. It has markup.</g></source>
  • Segmenting the paragraph gives:

    • <source><g>This is a sentence.</source>
    • <source>It has markup.</g></source>
  • Neither line is valid XLIFF source content. To make them valid we need to insert the missing tags:
    • <source><g>This is a sentence.</g></source>
    • <source><g>It has markup.</g></source>
  • When we merge the segments back together we get a different block:
    • <source>
      <g>This is a sentence.</g> <g>It has markup.</g>
      </source>
    • This result is different from the original source.
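The walkthrough above can be reproduced with plain string handling. The sketch below is only an illustration of the problem: the `repair` helper is hypothetical and handles a single tag name, which is enough to show that split, repair and merge does not round-trip.

```python
def repair(segment, tag="g"):
    """Add the opening or closing tag that a split left unbalanced."""
    opening, closing = f"<{tag}>", f"</{tag}>"
    opens, closes = segment.count(opening), segment.count(closing)
    if opens > closes:
        # Opened but never closed: close it at the end of the segment.
        segment = segment + closing * (opens - closes)
    elif closes > opens:
        # Closed but never opened: open it at the start of the segment.
        segment = opening * (closes - opens) + segment
    return segment


original = "<g>This is a sentence. It has markup.</g>"
segments = ["<g>This is a sentence.", "It has markup.</g>"]

repaired = [repair(segment) for segment in segments]
merged = " ".join(repaired)
```

Here `merged` contains two `<g>` elements where `original` had one, so the merged target no longer matches the structure of the source.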

 

We can further improve the current implementation of segmentation with the following low-hanging fruit

  • Identify languages that share a script with the languages supported by Pragmatic Segmenter and add segmentation support for them.
  • Identify and segment text units that contain HTML tags but are clonable.

Upcoming plans for Segmentation

  • Find a solution to segment non-clonable elements (POC).
  • Introduce segmentation in other integrations.
  • Introduce segmentation for other local files.

OAuth 2 in Cloudwords

Pulkit Pushkarna

OAuth (Open Authorization) In Cloudwords

  • OAuth is an open protocol for authorization, commonly used to secure access to RESTful APIs.

 

  • Earlier, Cloudwords supported only OAuth 1.0. Now we support both OAuth 1.0 and OAuth 2.0.

 

  • OAuth 2.0 is a complete rewrite of OAuth 1.0.

  • OAuth 2.0 is not backward compatible with OAuth 1.0 and should be thought of as a completely new protocol.

 

  • A high level goal of OAuth is allowing a Resource Owner to give access to a third party in a limited way, without giving away the password.

Advantages of OAuth 2.0

  • OAuth 2.0 issues an access token that is valid for a specific period; after that interval the token expires.
  • Once the token has expired we do not need to submit the credentials again; we just use the refresh token to get a new access token.
  • OAuth 2.0 supports different grant types such as client_credentials, password, and authorization_code.
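The expiry and refresh behaviour described above can be sketched without any network calls. The class and function names are hypothetical; the `grant_type=refresh_token` and `refresh_token` parameters follow the OAuth 2.0 specification, and the bearer `Authorization` header is an assumption about how the access token is presented. Whether the token endpoint also asks the client to authenticate on refresh depends on how the authorization server is configured.

```python
from dataclasses import dataclass


@dataclass
class TokenSet:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp after which the access token is invalid


def is_expired(tokens, now):
    """True once the access token's lifetime has elapsed."""
    return now >= tokens.expires_at


def refresh_request_params(tokens):
    """Form parameters for a refresh-token request to the token endpoint."""
    return {"grant_type": "refresh_token", "refresh_token": tokens.refresh_token}


def next_step(tokens, now):
    """Decide whether to call the API or refresh the access token first."""
    if is_expired(tokens, now):
        return ("refresh", refresh_request_params(tokens))
    return ("call_api", {"Authorization": f"Bearer {tokens.access_token}"})
```

Passing `now` explicitly keeps the sketch deterministic; a real client would typically use the current time and refresh slightly before expiry.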

[Diagram: OAuth token flow between Client, Auth Server, CW-API, CW-APP, Brightcove and Pardot. Arrow labels: Get Token, Send Token.]

[Diagram: OAuth token flow between Client, Auth Server and CW-API. Arrow labels: Get Token, Send Token, Validate Token.]

Release Show and Tell

Pulkit Pushkarna

Major stories and tasks 

  • CW-5482 Eloqua Sentence Level Segmentation for clonable elements (P176) https://localhost:8443/cust.htm#project/3331/review/9321/fr/20001

  • Non-clonable HTML elements

    • <div>This is line 1. This is line 2.</div>

      • <div>This is line 1.

      • This is line 2.</div>

  • Clonable HTML elements

    • <b>This</b> is line 1. <br/><br/><b>This</b> is line 2.

      • <b>This</b> is line 1.

      • <br/><br/><b>This</b> is line 2.

Non-demonstrable / Support tickets

Ticket	Client
Zen-7260 PAN - Sandbox - REST - Landing Page not generating ICR	MakCust
CW-5608 email notification request revision logging	UL
Zen-7292 Project ID 187213 - Review why System User approved language workspace (In Progress)	F5
Zen-7255 REST - Project ID 188303 | Unexpected error occurs (In Progress)	Google Primer
CW-5630 Mongo duplicate index error	System Issue
CW-5634 Task reminder sent for a non-approved bid	UL

Future tasks for segmentation

  • Introduce sentence level segmentation for Marketo Rest and AEM.
  • Glossary (configurable, no longer dependent on department).
  • Translation Memory (configurable, no longer dependent on department).

Release Show and Tell Harrier

Pulkit Pushkarna

 

 

Blackbird release show and tell

Pulkit Pushkarna

Segmentation approach for integrations (i.e., Eloqua and Marketo Rest)

Source HTML → Segmented source XLIFF → Segmented target XLIFF → Target HTML

Segmentation approach when the source file is XLIFF

Source XLIFF → Segmented source XLIFF → Segmented target XLIFF → Target XLIFF

  • CW-5786 Sentence Level Segmentation for XLIFF file

https://localhost:8443/cust.htm#project/369/review/919/fr/2539

  • CW-5771 Sentence Level segmentation for AEM

https://localhost:8443/cust.htm#project/307/review/795/fr/2215

  • CW-5839 Separation of concerns for source and target xlf file for segmentation

https://localhost:8443/cust.htm#project/367/review/915/fr/2525

  • CW-5838 Segmenting complex source tags in XLIFF Edge case
  • CW-5841 API changes for AEM segmentation

https://localhost:8443/cust.htm#project/307/review/795/fr/2215

Sentence Level Segmentation Implementation so far

  • Integrations supported
    • Marketo Rest
    • Eloqua
    • XLIFF (local projects)
    • AEM
  • Languages supported for segmentation
    • Amharic
    • Arabic
    • Armenian
    • Burmese
    • Chinese (Simplified)
    • English
    • Greek
    • Hindi
    • Japanese
    • Persian
  • We do not support segmentation of non-clonable HTML elements (recommended).

Hexa Release Show and Tell

Pulkit Pushkarna


Thar and XUV Release Demoable items Show and Tell

Major Enhancement in Sentence Level Segmentation

  • Earlier, sentence level segmentation had a limitation: it could be applied only if both the source and the target language were among the following languages:
    • Amharic
    • Arabic
    • Armenian
    • Burmese
    • Chinese (Simplified)
    • English
    • Greek
    • Hindi
    • Japanese
    • Persian
  • With the new enhancement, sentence level segmentation depends only on the source language, i.e. if the source language is among the languages listed above, we apply sentence level segmentation for:
    • Eloqua
    • Marketo Rest
    • AEM
    • XLIFF file (local project)
  • SLS still has one limitation: we do not segment non-clonable elements.
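The change in the eligibility rule can be sketched as two predicates. This is an illustration only: the function names are hypothetical, and languages are compared by the names used on the slide.

```python
# Languages listed on the slide as eligible for sentence level segmentation.
SEGMENTABLE_LANGUAGES = {
    "Amharic", "Arabic", "Armenian", "Burmese", "Chinese (Simplified)",
    "English", "Greek", "Hindi", "Japanese", "Persian",
}


def eligible_previous(source_language, target_language):
    """Earlier rule: both source and target had to be in the list."""
    return (source_language in SEGMENTABLE_LANGUAGES
            and target_language in SEGMENTABLE_LANGUAGES)


def eligible_current(source_language, target_language):
    """Current rule: only the source language has to be in the list."""
    return source_language in SEGMENTABLE_LANGUAGES
```

For an English to Hungarian project, `eligible_previous` returns `False` while `eligible_current` returns `True`, which is the case the enhancement unlocks.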

Previous approach

[Diagram: Core Platform and Segmentation Microservice (Pragmatic Segmenter). Source Language TU → Segmented TU; Target Language TU → Segmented TU.]

Current Approach

[Diagram: Core Platform and Segmentation Microservice (Pragmatic Segmenter). Source Language TU → Segmented TU.]

  • The source segmented TU is used by the core platform to generate the target segmented TU and the target TU.
  • Pragmatic Segmenter languages: Amharic, Arabic, Armenian, Burmese, Chinese, English, Greek, Hindi, Japanese, Persian


  • Zen-7962 - Mouser - Formatting of special symbols is being changed by our platform (CW-6104)

https://localhost:8443/cust.htm#project/863/review/2363/fr/6707

  • Sync Failed for PSD for PID 208745  (CW-6097)

https://localhost:8443/cust.htm#project/975/review/2599/ar/7607

  • Encoding issue for XLIFF Segmentation (CW-5921)

https://localhost:8443/cust.htm#project/999

https://localhost:8443/cust.htm#project/999/review/2723/cs/7963

  • Segmentation | MR | Asset sync error appearing while creating project with segmentation languages. (CW-5734)

https://localhost:8443/cust.htm#project/997/review/2715/hu/7885

  • Worked on Microsoft's WCAG Issues

REL-Marazzo and REL-scorpio Demoable items Show and Tell

Pulkit

      https://engage-sj.marketo.com/?munchkinId=529-IDT-654#/classic/PG36201D4

  • CW-6108 Problem with uploading revised files for - 12AUG_CC_Video_Business_Agility_EWF

    https://localhost:8443/lsp.htm#project/389/language/fr

  • CW-6047 Segmentation || Eloqua : Edit Translation / Provide Feedback option is not showing for Dynamic Content

https://localhost:8443/cust.htm#project/383/review/927/en-au/2343

  • CW-6011 Segmentation || Eloqua || Issue with Asset >ComplexLandingPageAsset

https://localhost:8443/cust.htm#project/341/review/795/fr/2069

  • CW-6008 Sentence Level Segmentation || Eloqua > NumberFormatException

https://localhost:8443/cust.htm#project/397/review/963/es-mx/2439

  • CW-5845 Segmentation || MR || Asset Specific Issue : Getting ParseErrorException

https://localhost:8443/cust.htm#project/189/review/423/fr/1129

  • CW-5762 Segmentation || MR || Asset Specific Issue : Getting NoSuchElementException

https://localhost:8443/cust.htm#project/413/review/1023/es-es/2567

 

 

Show and Tell Alturas and Bolero

Pulkit Pushkarna

  • CW-5875 Segmentation || Eloqua || Edits are not reflecting correctly for some of the TUs

https://localhost:8443/cust.htm#project/415/review/1027/fr/2579

https://p03.eloquapreview.com/Preview.aspx?siteId=1774690490&userGuid=a0a17071-8eb2-49e2-9454-257dddc23dd2

  • CW-5935 Add Sibling source languages for sentence level segmentation

https://localhost:8443/cust.htm#project/1119/review/2479/fr-bf/6947

  • CW-6025 Campaign | Reviewers Tab | User not able to click using keyboard on headers for sorting

https://localhost:8443/cust.htm#workspace/1/reviewers

  • CW-6179 Segmentation || MR || Asset Specific Issue : Side by Side ICR is not showing for mentioned asset

https://localhost:8443/cust.htm#project/1121/review/2483/fr/6963

  • CW-6190 Translation Not persisting in Eloqua for Translated Asset in case of one tu for the asset mentioned in description

https://localhost:8443/cust.htm#project/1123/review/2487/fr/6979

https://p03.eloquapreview.com/Preview.aspx?siteId=1774690490&userGuid=56ca5a49-55bf-4d04-a3d3-7af5a80aa2cf

  • CW-6264 Microservice Inter communication for CW app
  • CW-6331 Zen-8275 targetLanguages missing in project search API

Show and Tell for demoable items of REL-punch

Pulkit Pushkarna

  • Zen-8283/CW-6341 Project ID 224665 - Project Status Syncing Content after over an hour. Blank token issue in marketoRest.

https://localhost:8443/cust.htm#project/1231/review/2727/el/7679

Sentence Level Segmentation in Cloudwords

Sentence Repeating in Paragraphs

  • I love Indian food. It is the most flavourful food in the world.
  • I love Indian food. It is very healthy and soul nourishing.
  • I love Indian food. Most of the Indian traditional dishes are vegetarian.

Introduction

  • In Cloudwords, when translating an asset, we extract the text blocks as paragraphs and translate them.
  • With sentence level segmentation we have the option to split each paragraph into sentences and translate sentence by sentence.

We support SLS for the following integrations

  • Marketo Rest
  • Eloqua
  • XLIFF files
  • AEM
  • DOCX (not very mature right now)

Tool we have used for segmentation

  • Pragmatic Segmenter

Languages supported for Segmentation

  • Pragmatic Segmenter supports the following languages for segmentation:
    • Amharic
    • Arabic
    • Armenian
    • Burmese
    • Chinese (Simplified)
    • English
    • Greek
    • Hindi
    • Japanese
    • Persian

Enabling Sentence Level Segmentation in Cloudwords

  • To enable sentence level segmentation, check and save the following option in the admin panel for the customer.

 

Segmentation problem in non-clonable elements

Example of non-clonable elements

  • <source><g>This is a sentence. It has markup.</g></source>
  • Segmenting the paragraph gives:

    • <source><g>This is a sentence.</source>
    • <source>It has markup.</g></source>
  • Neither line is valid XLIFF source content. To make them valid we need to insert the missing tags:
    • <source><g>This is a sentence.</g></source>
    • <source><g>It has markup.</g></source>
  • When we merge the segments back together we get a different block:
    • <source>
      <g>This is a sentence.</g> <g>It has markup.</g>
      </source>
    • This result is different from the original source.

 

Segmentation approach for integrations (i.e., Eloqua and Marketo Rest)

Source HTML → Segmented source XLIFF → Segmented target XLIFF → Target HTML

Segmentation approach when the source file is XLIFF

Source XLIFF → Segmented source XLIFF → Segmented target XLIFF → Target XLIFF

Architectural diagram for Sentence Level Segmentation in Cloudwords

[Diagram: Core Platform and Segmentation Microservice (Pragmatic Segmenter). Source Language TU → Segmented TU. Pragmatic Segmenter languages: Amharic, Arabic, Armenian, Burmese, Chinese, English, Greek, Hindi, Japanese, Persian.]

Bolt and Everito Show and Tell 

Pulkit Pushkarna

 

Tiago and Tigor Release show and tell

Pulkit Pushkarna

Demoable Items

 

  • CW-6614 Restrict adding a PO number when editing/Creating the project

 

  • CW-5728 Add alternative ways to change user's email address

 

  • CW-6575 Upgrade OAuth JDBC tokens to JWT tokens. Perform JMeter load testing to ensure that it works correctly for concurrent requests

UI Architecture for Cloudwords Platform

[Diagram: password grant flow. Components: React App (Front End), Node App (BFF), Spring Boot (OAuth Server), SpringMVC (CW API, Resource server). Arrow labels: Credentials; POST /oauth2/token (password); access token, refresh token; access token.]
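As a rough illustration of the password grant shown in this architecture, the sketch below builds the form-encoded request for `POST /oauth2/token`. The helper name is hypothetical, the parameter names (`grant_type`, `username`, `password`) follow the OAuth 2.0 specification, and which component actually sends the request is not stated on the slide.

```python
from urllib.parse import urlencode

TOKEN_PATH = "/oauth2/token"  # path shown on the slide


def password_grant_request(username, password):
    """Return (method, path, headers, body) for a password grant request."""
    body = urlencode({
        "grant_type": "password",
        "username": username,
        "password": password,
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return ("POST", TOKEN_PATH, headers, body)
```

A successful response would carry the access token and refresh token named in the diagram; the access token is then sent with calls to the resource server.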

Indica and Curvv Show and Tell

Pulkit Pushkarna

  • CW-6806 Distortion in the 2 TU of ICR in case of docx file.

https://app-curvv.stage.cloudwords.com/cust.htm#project/573/language/de

 

  • CW-6871 Apostrophe's Issue || ICR || Edit Translations || Docx-Segmentation

https://app-curvv.stage.cloudwords.com/cust.htm#project/575/language/de

 

  • CW-6857 Docx || Segmentation ON || Target ICR - Paragraph Issue

https://app-curvv.stage.cloudwords.com/cust.htm#project/577/language/fr

  • CW-6640 Tigor - Old HTML Editor - Revision Request breaks up HTML

https://app-curvv.stage.cloudwords.com/cust.htm#project/585/language/fr

 

  • CW-6448 Sentence Level Segmentation || Source File Language Issue

https://app-curvv.stage.cloudwords.com/cust.htm#project/579/source

 

  • CW-6422 Docx Segmentation | Asset Sync error observed when project created post enabling the segmentation.

https://app-curvv.stage.cloudwords.com/cust.htm#project/581/language/fr

 

 

 

  • CW-6418 Sentence Level Segmentation Issue || Docx

https://app-curvv.stage.cloudwords.com/cust.htm#project/583/language/ar

 

  • Design and Implementation of new architecture for React application.
