Sentence Level Segmentation and OAuth2
Pulkit Pushkarna
Introduction
- In Cloudwords, when translating an asset, we extract the text blocks as paragraphs and translate them.
- We now have the option to split each paragraph into sentences and translate at the sentence level.
Scope of current Implementation
- For now, we have introduced sentence level segmentation for the Eloqua integration.
- In the future, we will introduce segmentation for other integrations.
Tool we have used for segmentation
- We use Pragmatic Segmenter for sentence level segmentation.
- Web link for the tool: https://www.tm-town.com/natural-language-processing
- Git URL: https://github.com/diasks2/pragmatic_segmenter
Languages supported for Segmentation
- Pragmatic Segmenter supports the following languages for segmentation:
- Amharic
- Arabic
- Armenian
- Burmese
- Chinese (Simplified)
- English
- Greek
- Hindi
- Japanese
- Persian
Pragmatic Segmenter Response to sibling languages
- We experimented with languages that share a similar script.
- For French and German, when we passed English as the language parameter to the tool, it split the sentences correctly.
- For now, with some customization, we have added French and German as well.
- Pragmatic Segmenter works fine with these two languages with our customization.
- In the future we will identify more languages that share a script with a supported language, so that we can eventually support many more languages.
- So Cloudwords also supports French and German segmentation, in addition to the languages supported by Pragmatic Segmenter.
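The sibling-language idea above can be sketched as a small lookup. This is only an illustration: the function name, the ISO 639-1 codes, and the `SIBLINGS` table are assumptions made for the example, not the actual Cloudwords customization.

```python
# Languages this deck lists as supported by Pragmatic Segmenter (ISO 639-1 codes).
SUPPORTED = {"am", "ar", "hy", "my", "zh", "en", "el", "hi", "ja", "fa"}

# Hypothetical sibling table: a language segmented with another language's rules.
SIBLINGS = {"fr": "en", "de": "en"}


def resolve_segmenter_language(lang_code):
    """Return the language code whose rules should be used, or None."""
    base = lang_code.lower().split("-")[0]  # "fr-BF" -> "fr"
    if base in SUPPORTED:
        return base
    # Fall back to a sibling language; None means no segmentation.
    return SIBLINGS.get(base)
```

Under these assumptions, French and German resolve to the English rules, and a language with no entry in either table is left unsegmented.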
Enabling Sentence Level Segmentation in Cloudwords
- To enable sentence level segmentation, check and save the following option in the admin panel for the customer.
Eloqua assets which can be tested for sentence Level Segmentation
- Sunil_Dynamic_Email_19- Feb-2020
- Sunil_Simple_Landing Page_MT
Segmentation problem in non clonable elements
- A non-clonable element represents a piece of content that must be translated as one piece and cannot be segmented, e.g.
<source>This is a <g>sentence. It has</g> markup.</source>
- For this reason we do not split text blocks that contain HTML tags.
- So for now we perform segmentation only on plain text blocks that do not contain any HTML tags.
- For more information, refer to: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_Segmentation
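The rule above (segment only blocks without tags) can be sketched as follows. The regex-based `naive_split` is a stand-in for the real segmenter, and both function names are hypothetical.

```python
import re

# Anything that looks like an inline tag, e.g. <g>, </g>, <br/>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def naive_split(text):
    """Stand-in splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segment_block(text):
    """Segment plain text; return a tagged block unchanged as one unit."""
    if TAG_PATTERN.search(text):
        return [text]
    return naive_split(text)
```

A plain block comes back as one entry per sentence, while a block containing `<g>` comes back as a single entry.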
Example of non clonable Elements
- <source><g>This is a sentence. It has markup.</g></source>
- By segmenting the paragraph we get:
- <source><g>This is a sentence.</source>
- <source>It has markup.</g></source>
- Neither line is valid XLIFF source content. To make them valid we need to insert the missing tags:
- <source><g>This is a sentence.</g></source>
- <source><g>It has markup.</g></source>
- When we merge the segments back together we get a different block:
- <source>
<g>This is a sentence.</g> <g>It has markup.</g>
</source>
- This result is different from the original source.
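The split, repair, and merge steps from this example can be reproduced in a few lines. This is a minimal sketch with a naive regex splitter and a hypothetical `repair` helper; it only shows why the round trip does not give back the original content.

```python
import re

original = "<g>This is a sentence. It has markup.</g>"

# Naive split after a period that is followed by whitespace.
parts = re.split(r"(?<=\.)\s+", original)


def repair(segment, tag="g"):
    """Add the opening or closing tag that a segment is missing."""
    if not segment.startswith(f"<{tag}>"):
        segment = f"<{tag}>" + segment
    if not segment.endswith(f"</{tag}>"):
        segment = segment + f"</{tag}>"
    return segment


repaired = [repair(p) for p in parts]
merged = " ".join(repaired)

# One <g> element has become two, so the merged block differs from the source.
round_trip_ok = merged == original
```

Here `round_trip_ok` is `False`: the single `<g>` element of the source has turned into two separate elements.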
- We can further improve the current implementation of segmentation with the following low-hanging fruit:
- Identify the languages that share a script with the languages supported by Pragmatic Segmenter, and add segmentation support for them.
- Identify and segment text units that contain HTML tags but are clonable.
Upcoming plans for Segmentation
- Find a solution to segment non clonable elements (POC).
- Introduce segmentation in other integrations.
- Introduce segmentation for other local files.
OAuth 2 in Cloudwords
Pulkit Pushkarna
OAuth (Open Authorization) In Cloudwords
- OAuth is a protocol used for authorization (and, in practice, as a basis for authentication) when working with RESTful APIs.
- Earlier, Cloudwords supported only OAuth 1.0. Now we support both OAuth 1.0 and OAuth 2.0.
- OAuth 2.0 is a complete rewrite of OAuth 1.0.
- OAuth 2.0 is not backward compatible with OAuth 1.0 and should be thought of as a completely new protocol.
- A high-level goal of OAuth is to let a Resource Owner give a third party limited access without giving away the password.
Advantages of OAuth 2.0
- OAuth 2 issues tokens that are valid for a specific amount of time; after that interval the token expires.
- Once the token has expired, we do not need to submit the credentials again; we use the refresh token to get a new access token.
- OAuth 2 supports different grant types, such as client_credentials, password, and authorization_code.
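The expiry and refresh behaviour described above can be sketched on the client side as follows. The `Token` class, `get_valid_token`, and `fake_refresh` are hypothetical names for illustration; a real client would call the authorization server's token endpoint instead of `fake_refresh`.

```python
import time
from dataclasses import dataclass


@dataclass
class Token:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp after which the token is invalid

    def is_expired(self, now=None):
        now = time.time() if now is None else now
        return now >= self.expires_at


def get_valid_token(token, refresh_fn, now=None):
    """Return the token if still valid, otherwise exchange the refresh token."""
    if token.is_expired(now):
        return refresh_fn(token.refresh_token)
    return token


def fake_refresh(refresh_token):
    # Stand-in for a token request with grant_type=refresh_token.
    return Token("new-access", refresh_token, time.time() + 3600)
```

The caller never handles the user's password here: an expired token is swapped for a new one using only the refresh token.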
[Diagram 1. Components: Brightcove, Pardot, CW-APP, CW-API, Auth Server, Client. Arrows: Get Token, Send Token.]
[Diagram 2. Components: CW-API, Auth Server, Client. Arrows: Get Token, Send Token, Validate Token.]
Release Show and Tell
Pulkit Pushkarna
Major stories and tasks
- CW-5482 Eloqua Sentence Level Segmentation for clonable elements (P176) https://localhost:8443/cust.htm#project/3331/review/9321/fr/20001
- Non clonable HTML elements.
- <div>This is line 1. This is line 2.</div>
- <div>This is line 1.
- This is line 2</div>.
- Clonable HTML Elements
- <b>This</b> is line 1. <br/><br/><b>This</b> is line 2.
- <b>This</b> is line 1.
- <br/><br/><b>This</b> is line 2
CW-5574 Sentence Level Segmentation for Plain Text and clonable elements using seg-source.
CW-5593 Sentence Level Segmentation: Missing HTML blocks in case of space is not provided in the start of the sentence. (P178). https://localhost:8443/cust.htm#project/3335/review/9329/fr/20029
CW-5599 Sentence Level Segmentation: Match and Mark is not working for the Assets (P175) https://localhost:8443/cust.htm#project/3329/review/9317/ar/19989
CW-5601 Sentence Level Segmentation: Edit Translation/Provide Feedback option is not available for some HTML blocks (P175). https://localhost:8443/cust.htm#project/3327/review/9313/de/19977
CW-5604 Powerpoint unable to sync back in review (P179) https://localhost:8443/cust.htm#project/3337/review/9337/fr/20049
Support Tickets: Zen-7260, CW-5608, Zen-7292, Zen-7289, CW-5630, CW-5634
Non Demonstrable/ Support tickets
Ticket | Client |
---|---|
Zen-7260 PAN - Sandbox - REST - Landing Page not generating ICR | MakCust |
CW-5608 email notification request revision logging | UL |
Zen-7292 Project ID 187213 - Review why System User approved language workspace (In Progress) | F5 |
Zen-7255 REST - Project ID 188303 \| Unexpected error occurs (In Progress) | Google Primer |
CW-5630 Mongo duplicate index error | System Issue |
CW-5634 Task reminder sent for a non-approved bid | UL |
Future tasks for segmentation
- Introduce sentence level segmentation for Marketo Rest and AEM.
- Glossary (configurable, no longer dependent on department).
- Translation Memory (configurable, no longer dependent on department).
Release Show and Tell Harrier
Pulkit Pushkarna
- CW-5473 Introduce Sentence Level Segmentation in Marketo Rest
- CW-5732 Segmentation | MR | Project are stuck in Asset level syncing.
- CW-5724 Sentence Level Segmentation: Static TUs are also highlighted in case of Dynamic project
- CW-5764 Google - Marketo REST - Bullet items
- CW-5690 Sentence Level Segmentation: Wrong reordering of the HTML Tags for the segments
Blackbird release show and tell
Pulkit Pushkarna
Segmentation approach for integrations (i.e. Eloqua and Marketo Rest)
Source HTML -> Segmented source XLIFF -> Segmented target XLIFF -> Target HTML
Segmentation approach when the source file is XLIFF
Source XLIFF -> Segmented source XLIFF -> Segmented target XLIFF -> Target XLIFF
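As a rough illustration of what a segmented source XLIFF unit can look like, the sketch below builds an XLIFF 1.2 trans-unit whose `seg-source` holds one `mrk mtype="seg"` element per sentence. The helper name and the sample sentences are assumptions for the example; the XLIFF that Cloudwords actually generates may carry more attributes and inline tags.

```python
import xml.etree.ElementTree as ET


def build_trans_unit(unit_id, sentences):
    """Build a trans-unit with a <source> and a sentence-level <seg-source>."""
    tu = ET.Element("trans-unit", id=str(unit_id))
    source = ET.SubElement(tu, "source")
    source.text = " ".join(sentences)
    seg_source = ET.SubElement(tu, "seg-source")
    for index, sentence in enumerate(sentences, start=1):
        mrk = ET.SubElement(seg_source, "mrk", mtype="seg", mid=str(index))
        mrk.text = sentence
        if index < len(sentences):
            mrk.tail = " "  # keep the space between segments
    return tu


unit = build_trans_unit(1, ["This is line 1.", "This is line 2."])
xml_text = ET.tostring(unit, encoding="unicode")
```

Joining the text inside `seg-source` gives back the `source` text, which is what lets the segmented target be merged into an unsegmented target later.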
- CW-5786 Sentence Level Segmentation for XLIFF file
https://localhost:8443/cust.htm#project/369/review/919/fr/2539
- CW-5771 Sentence Level segmentation for AEM
https://localhost:8443/cust.htm#project/307/review/795/fr/2215
- CW-5839 Separation of concern for source and target xlf file for segmentation
https://localhost:8443/cust.htm#project/367/review/915/fr/2525
- CW-5838 Segmenting complex source tags in XLIFF Edge case
- CW-5841 API changes for AEM segmentation
https://localhost:8443/cust.htm#project/307/review/795/fr/2215
Sentence Level Segmentation Implementation so far
- Integrations Support
- Marketo Rest
- Eloqua
- XLIFF (local projects)
- AEM
- Languages Supported for Segmentation
- Amharic
- Arabic
- Armenian
- Burmese
- Chinese (Simplified)
- English
- Greek
- Hindi
- Japanese
- Persian
- We do not support Segmentation of non clonable html elements (Recommended)
Hexa Release Show and Tell
Pulkit Pushkarna
CW-5865 tomcat 9 ars upgradation in jfrog
CW-5849 Segmentation || Eloqua || Asset Specific Issue : Getting ParseErrorException
- https://localhost:8443/cust.htm#project/693
CW-5855 ICR not opening for MR projects when segmentation is ON
https://localhost:8443/cust.htm#project/695/review/2035/fr/5825
https://localhost:8443/cust.htm#project/695/review/2035/fr/5843
CW-5845 Segmentation || MR || Asset Specific Issue : Getting ParseErrorException
https://localhost:8443/cust.htm#project/691/review/2019/nl/5769
CW-5822 IDML Sync Fail
- Worked on WCAG issues
Thar and XUV Release Demoable items Show and Tell
Major Enhancement in Sentence Level Segmentation
- Earlier, sentence level segmentation had a limitation: it was applied only if both the source and the target language were among the following languages:
- Amharic
- Arabic
- Armenian
- Burmese
- Chinese (Simplified)
- English
- Greek
- Hindi
- Japanese
- Persian
- With the new enhancement, sentence level segmentation depends only on the source language, i.e. if the source language is among the languages listed above, we apply sentence level segmentation for:
- Eloqua
- Marketo Rest
- AEM
- XLIFF file (Local Project)
- One limitation remains in SLS: we do not segment non clonable elements.
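The change in the eligibility rule can be written as two small predicates. The function names are hypothetical, and the language set is the one listed on this slide.

```python
SEGMENTABLE = {
    "Amharic", "Arabic", "Armenian", "Burmese", "Chinese (Simplified)",
    "English", "Greek", "Hindi", "Japanese", "Persian",
}


def eligible_previous(source_language, target_language):
    """Earlier rule: both source and target must be segmentable."""
    return source_language in SEGMENTABLE and target_language in SEGMENTABLE


def eligible_current(source_language, target_language):
    """Current rule: only the source language matters."""
    return source_language in SEGMENTABLE
```

For an English to French project, `eligible_previous` returns `False` and `eligible_current` returns `True`, which is the case the enhancement unlocks.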
Previous approach
[Diagram. Components: Core Platform, Segmentation Microservice, Pragmatic Segmenter (Amharic, Arabic, Armenian, Burmese, Chinese, English, Greek, Hindi, Japanese, Persian). Arrows: Source Language TU, Segmented TU, Target Language TU, Segmented TU.]
Current Approach
[Diagram. Components: Core Platform, Segmentation Microservice, Pragmatic Segmenter (Amharic, Arabic, Armenian, Burmese, Chinese, English, Greek, Hindi, Japanese, Persian). Arrows: Source Language TU, Segmented TU.]
The source segmented TU is used by the core platform to generate the target segmented TU and the target TU.
Sentence Level Segmentation on the basis of source language (CW-5919) (Major Change)
- https://localhost:8443/cust.htm#project/989/review/2651/fr/7767
- https://localhost:8443/cust.htm#project/989/review/2655/de/7769
- https://localhost:8443/cust.htm#project/989/review/2659/el/7765
- https://localhost:8443/cust.htm#project/991/review/2675/bg/7795
- https://localhost:8443/cust.htm#project/991/review/2679/ja/7797
- https://localhost:8443/cust.htm#project/991/review/2683/ko/7793
- Zen-7962 - Mouser - Formatting of special symbols is being changed by our platform (CW-6104)
https://localhost:8443/cust.htm#project/863/review/2363/fr/6707
- Sync Failed for PSD for PID 208745 (CW-6097)
https://localhost:8443/cust.htm#project/975/review/2599/ar/7607
- Encoding issue for XLIFF Segmentation (CW-5921)
https://localhost:8443/cust.htm#project/999
https://localhost:8443/cust.htm#project/999/review/2723/cs/7963
- Segmentation | MR | Asset sync error appearing on the while creating project with segmentation languages. (CW-5734)
https://localhost:8443/cust.htm#project/997/review/2715/hu/7885
- Worked on Microsoft's WCAG Issues
REL-Marazzo and REL-scorpio Demoable items Show and Tell
Pulkit
- CW-6142 Zen-8032 Extra spaces added around links when editing a segment https://localhost:8443/cust.htm#project/391/review/943/fr/2401
- CW-6141 Zen-8023 After project creation, Project Status is stuck at Syncing Content https://localhost:8443/cust.htm#project/139/review/327/it/911
- CW-6127 Zen-8017 ICR for an xlsx file doesn't work as expected https://localhost:8443/cust.htm#project/387/review/935/fr/2375
- CW-6119 Tokens being cloned to global level
https://engage-sj.marketo.com/?munchkinId=529-IDT-654#/classic/PG36201D4
- CW-6108 Problem with uploading revised files for - 12AUG_CC_Video_Business_Agility_EWF
- CW-6047 Segmentation || Eloqua : Edit Translation / Provide Feedback option is not showing for Dynamic Content
https://localhost:8443/cust.htm#project/383/review/927/en-au/2343
- CW-6011 Segmentation || Eloqua || Issue with Asset >ComplexLandingPageAsset
https://localhost:8443/cust.htm#project/341/review/795/fr/2069
- CW-6008 Sentence Level Segmentation ||ELoqua > NumberFormatException
https://localhost:8443/cust.htm#project/397/review/963/es-mx/2439
- CW-5845 Segmentation || MR || Asset Specific Issue : Getting ParseErrorException
https://localhost:8443/cust.htm#project/189/review/423/fr/1129
- CW-5762 Segmentation || MR || Asset Specific Issue : Getting NoSuchElementException
https://localhost:8443/cust.htm#project/413/review/1023/es-es/2567
Show and Tell Alturas and Bolero
Pulkit Pushkarna
- CW-5875 Segmentation || Eloqua || Edits are not reflecting correctly for some of the TUs
https://localhost:8443/cust.htm#project/415/review/1027/fr/2579
- CW-5935 Add Sibling source languages for sentence level segmentation
https://localhost:8443/cust.htm#project/1119/review/2479/fr-bf/6947
- CW-6025 Campaign | Reviewers Tab | User not able to click using keyboard on headers for sorting
- CW-6179 Segmentation || MR || Asset Specific Issue : Side by Side ICR is not showing for mentioned asset
https://localhost:8443/cust.htm#project/1121/review/2483/fr/6963
- CW-6190 Translation Not persisting in Eloqua for Translated Asset in case of one tu for the asset mentioned in description
https://localhost:8443/cust.htm#project/1123/review/2487/fr/6979
- CW-6335 Zen-8247 Project ID 222723 - LSP is unable to upload Glossary Excel file to project when assigned to complete task
https://localhost:8443/lsp.htm#project/1139/task/8991
- CW-6264 Microservice Inter communication for CW app
- CW-6331 Zen-8275 targetLanguages missing in project search API
Show and Tell for demoable items of REL-punch
Pulkit Pushkarna
- Zen-8283/CW-6341 Project ID 224665 - Project Status Syncing Content after over an hour. Blank token issue in marketoRest.
https://localhost:8443/cust.htm#project/1231/review/2727/el/7679
- CW-6348 Duplicated requests
- CW-5511 Vendor and SME should see Campaign field in Project details
Sentence Level Segmentation in Cloudwords
Sentence Repeating in Paragraphs
- I Love Indian food. It is the most flavourful food in the world.
- I Love Indian food. It is very healthy and soul nourishing.
- I Love Indian food. Most of the Indian traditional dishes are vegetarian.
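One way to read the three paragraphs above: at paragraph level every block is unique, while at sentence level the opening sentence repeats. The sketch below counts this with a naive regex splitter, which is a stand-in for the real segmenter.

```python
import re
from collections import Counter

paragraphs = [
    "I Love Indian food. It is the most flavourful food in the world.",
    "I Love Indian food. It is very healthy and soul nourishing.",
    "I Love Indian food. Most of the Indian traditional dishes are vegetarian.",
]


def split_sentences(paragraph):
    """Stand-in splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


sentences = [s for p in paragraphs for s in split_sentences(p)]
counts = Counter(sentences)

unique_paragraphs = len(set(paragraphs))  # units to translate without SLS
unique_sentences = len(counts)            # units to translate with SLS
```

With these inputs there are six sentences but only four distinct ones, because "I Love Indian food." occurs three times and needs translating only once.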
Introduction
- In Cloudwords, when translating an asset, we extract the text blocks as paragraphs and translate them.
- With sentence level segmentation we have the option to split each paragraph into sentences and translate at the sentence level.
We support SLS for the following integrations
- Marketo Rest
- Eloqua
- XLIFF files
- AEM
- DOCX (Not Very Mature right now)
Tool we have used for segmentation
- We use Pragmatic Segmenter for sentence level segmentation.
- Web link for the tool: https://www.tm-town.com/natural-language-processing
- Git URL: https://github.com/diasks2/pragmatic_segmenter
Languages supported for Segmentation
- Pragmatic Segmenter supports the following languages for segmentation:
- Amharic
- Arabic
- Armenian
- Burmese
- Chinese (Simplified)
- English
- Greek
- Hindi
- Japanese
- Persian
Enabling Sentence Level Segmentation in Cloudwords
- To enable sentence level segmentation, check and save the following option in the admin panel for the customer.
Segmentation problem in non clonable elements
- A non-clonable element represents a piece of content that must be translated as one piece and cannot be segmented, e.g.
<source>This is a <g>sentence. It has</g> markup.</source>
- For more information, refer to: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_Segmentation
Example of non clonable Elements
- <source><g>This is a sentence. It has markup.</g></source>
- By segmenting the paragraph we get:
- <source><g>This is a sentence.</source>
- <source>It has markup.</g></source>
- Neither line is valid XLIFF source content. To make them valid we need to insert the missing tags:
- <source><g>This is a sentence.</g></source>
- <source><g>It has markup.</g></source>
- When we merge the segments back together we get a different block:
- <source>
<g>This is a sentence.</g> <g>It has markup.</g>
</source>
- This result is different from the original source.
Segmentation approach for integrations (i.e. Eloqua and Marketo Rest)
Source HTML -> Segmented source XLIFF -> Segmented target XLIFF -> Target HTML
Segmentation approach when the source file is XLIFF
Source XLIFF -> Segmented source XLIFF -> Segmented target XLIFF -> Target XLIFF
Architectural diagram for Sentence Level Segmentation in Cloudwords
[Diagram. Components: Core Platform, Segmentation Microservice, Pragmatic Segmenter (Amharic, Arabic, Armenian, Burmese, Chinese, English, Greek, Hindi, Japanese, Persian). Arrows: Source Language TU, Segmented TU.]
- https://localhost:8443/cust.htm#project/415/review/1027/fr/2579
- https://localhost:8443/cust.htm#project/413/review/1023/es-es/2567
- https://localhost:8443/cust.htm#project/383/review/927/en-au/2343
- https://localhost:8443/cust.htm#project/397/review/963/es-mx/2439
- https://localhost:8443/cust.htm#project/1377/review/3023/fr/8417
Bolt and Everito Show and Tell
Pulkit Pushkarna
- CW-6405 Mouser-Spaces are being removed from the segments' ends https://localhost:8443/cust.htm#project/1153/review/2551/es-es/7155
- CW-6343 Zen-8283/CW-6341 Project ID 224665 - Project Status Syncing Content after over an hour. Blank token issue in marketoRest. https://localhost:8443/cust.htm#project/1099/language/fr https://localhost:8443/cust.htm#project/1385/review/3043/sq/8431
- CW-6473 Handle Segmentation Text Unit in XLIFF validations
- CW-6201 Sentence Level Segmentation for docx file. https://localhost:8443/cust.htm#project/1387/review/3047/fr/8449
- CW-6313 Admin permission to enable SLS for DOCX
Tiago and Tigor Release show and tell
Pulkit Pushkarna
Demoable Items
- CW-6614 Restrict adding a PO number when editing/Creating the project
- CW-5728 Add alternative ways to change user's email address
- CW-6575 Upgrade OAuth JDBC tokens to JWT tokens. Perform JMeter load testing to ensure that it works correctly for concurrent requests
UI Architecture for Cloudwords Platform
[Diagram 1. Components: React App, BFF, OAuth2 Server, Resource Server. Arrows: Credentials; POST /oauth2/token (password); access token, refresh token; access token.]
[Diagram 2. Technology labels: React App, Node App, Spring Boot, SpringMVC. Component labels: Front End, BFF, OAuth Server, CW API (Resource server).]
Deployed Component
- OAuth Server
- Resource Server
- Front End
- BFF
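The diagram labels mention a password grant sent to `POST /oauth2/token`. The sketch below builds only the form-encoded body of such a request, using the standard OAuth 2.0 parameter names; which component sends it, and any extra parameters the Cloudwords OAuth server expects, are not shown in the deck and are left out here. The function name is hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def build_password_grant_body(username, password):
    """Form-encoded body for an OAuth 2.0 password grant token request."""
    return urlencode({
        "grant_type": "password",
        "username": username,
        "password": password,
    })


# Example values only; the body would be sent with
# Content-Type: application/x-www-form-urlencoded.
body = build_password_grant_body("alice", "s3cret pw")
decoded = parse_qs(body)
```

The response to such a request would carry the access token and refresh token shown on the diagram arrows.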
Indica and Curvv Show and Tell
Pulkit Pushkarna
- CW-6806 Distortion in the 2 TU of ICR in case of docx file.
https://app-curvv.stage.cloudwords.com/cust.htm#project/573/language/de
- CW-6871 Apostrophe's Issue || ICR || Edit Translations || Docx-Segmentation
https://app-curvv.stage.cloudwords.com/cust.htm#project/575/language/de
- CW-6857 Docx || Segmentation ON || Target ICR - Paragraph Issue
https://app-curvv.stage.cloudwords.com/cust.htm#project/577/language/fr
- CW-6640 Tigor - Old HTML Editor - Revision Request breaks up HTML
https://app-curvv.stage.cloudwords.com/cust.htm#project/585/language/fr
- CW-6448 Sentence Level Segmentation || Source File Language Issue
https://app-curvv.stage.cloudwords.com/cust.htm#project/579/source
- CW-6422 Docx Segmentation | Asset Sync error observed when project created post enabling the segmentation.
https://app-curvv.stage.cloudwords.com/cust.htm#project/581/language/fr
- CW-6418 Sentence Level Segmentation Issue || Docx
https://app-curvv.stage.cloudwords.com/cust.htm#project/583/language/ar
- Design and Implementation of new architecture for React application.