Using PowerShell to translate a site using Microsoft Translator

Recently I was thinking about translating a static html site I once created. Since I do my own translations on this site as well, it reminded me of an implementation I once did for a client to automatically translate multilingual SharePoint pages. For the basis of that implementation I used the ideas from the article Automatically translate pages on multilingual SharePoint sites and extended on that a bit. My current challenge was much simpler: I wanted to crawl the site online and save a local copy with the translated content to a specified directory, so that was the challenge I gave myself.

A prerequisite for the PowerShell script I created is the module PowerHTML, and you need to set up Microsoft Translator in your Azure subscription. You can use the free pricing tier, which covers up to 2 million characters a month; that was quite sufficient for my purpose. Add the key for Microsoft Translator to a file called TranslatorKey.ps1, see comment #* in the script for the expected contents.
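
For reference, installing the PowerHTML module and creating the key file could look like this; the key value is just a placeholder for the key of your own Translator resource:

Install-Module -Name PowerHTML -Scope CurrentUser # one-time install from the PowerShell Gallery

# TranslatorKey.ps1 - dot-sourced by the script
$translatorKey = "[your key]" # replace with the key of your Translator resource

With that in place, here is the full script I created to automate the translations: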

# See: https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-reference

param([uri]$url, [string]$targetDir = ".\translated", $language = "en", [string]$defaultPage = "/index.html", [switch]$crawl)

Import-Module -ErrorAction Stop PowerHTML

# initialize translation variables
. .\TranslatorKey.ps1 #* Contains: $translatorKey = "[your key]"
$global:baseUri = "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0"
$global:headers = @{
    'Ocp-Apim-Subscription-Key' = $translatorKey
    'Ocp-Apim-Subscription-Region' = 'northeurope' # change to the region of your Translator resource if needed
    'Content-type' = 'application/json; charset=utf-8'
}
$global:language = $language

# store for crawling
$global:crawlPaths = @()
$global:crawlPosition = 0

# Simple page 'crawler' (url extractor)
function CrawlPage($url, $htmlDom) {
    $anchors = $htmlDom.SelectNodes("//a")
    $anchors | ForEach-Object {
        if ($_.Attributes["href"]) {
            $href = $_.Attributes["href"]
            $uri = $null
            if ([uri]::IsWellFormedUriString($href.Value, "Relative")) {
                $uri = [uri]::new($url, $href.Value)
            }
            else {
                if ([uri]::IsWellFormedUriString($href.Value, "Absolute")) {
                    $uri = [uri]$href.Value
                }
            }
            if ($uri -and $uri.Host -eq $url.Host) {
                $pagePath = if ($uri.AbsolutePath -eq "/") { $defaultPage } else { $uri.AbsolutePath }
                $pagePath = $pagePath.ToLower()
                if (-not $global:crawlPaths.Contains($pagePath)) {
                    $global:crawlPaths += $pagePath
                }
            }
        }
    }
}

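# Translate a chunk of html via the Translator Text API, retrying up to 3 times with a short back-off when the request limit is exceeded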
function TranslateHtmlText($text, $pagePath, $retry = 0) {
    # Cleanup found at: https://www.powershellgallery.com/packages/SharePoint.Translate/0.6/Content/SharePoint.Translate.psm1
    $text = $text -replace "[^ -\x7e]"," " # strip characters outside the printable ASCII range
    # Create JSON array with 1 object for request body
    $textJson = @{
        "Text" = $text
    } | ConvertTo-Json
    $body = "[$textJson]"

    # Uri for the request includes language code and text type, which is always html
    $uri = "$($global:baseUri)&to=$($global:language)&textType=html"

    # Send request for translation and extract translated text
    try {
        $results = Invoke-RestMethod -Method Post -Uri $uri -Headers $global:headers -Body $body
        $translatedText = $results[0].translations[0].text
    }
    catch {
        if ($_.ToString().IndexOf("the client has exceeded request limits") -ge 0 -and $retry -lt 3) {
            $retry++
            Start-Sleep -Seconds ($retry * 3) # wait
            $translatedText = TranslateHtmlText $text $pagePath $retry
        }
        else {
            Write-Host "Warning, unable to translate: $pagePath"
            Write-Host $_
            $translatedText = $text
        }
    }
    return $translatedText
}

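# Translate the complete html of a page and save it under the target directory, mirroring the page's path on the site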
function TranslateAndSave($pagePath, $htmlDom, $targetDir) {
    $targetPath = "$targetDir$($pagePath.Replace("/", "\"))"
    # Translate the full page html in a single request
    $text = $htmlDom.InnerHtml
    $translatedText = TranslateHtmlText $text $pagePath
    $path = $targetPath.Substring(0, $targetPath.LastIndexOf("\"))
    if(-not (Test-Path -PathType container $path))
    {
        New-Item -ItemType Directory -Path $path | Out-Null
    }
    Set-Content -Path $targetPath -Value $translatedText -Force # Take care this will overwrite existing files
}

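# Retrieve the start page, translate and save it, and register it as the first crawled path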
Write-Host "Retrieving content"
Write-Host "Url: $url"
$response = Invoke-WebRequest -Uri $url -Method Get
$htmlDom = ConvertFrom-Html -Content $response.Content

$pagePath = if ($url.AbsolutePath -eq "/") { $defaultPage } else { $url.AbsolutePath }
TranslateAndSave $pagePath $htmlDom $targetDir
$global:crawlPaths += $pagePath
$global:crawlPosition++

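# With the -crawl switch: extract links from each page and keep translating until all discovered paths on this host have been processed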
if ($crawl) {
    CrawlPage $url $htmlDom
    while ($global:crawlPosition -lt $global:crawlPaths.Length) {
        $nextUrl = [uri]"$($url.Scheme)://$($url.Host):$($url.Port)$($global:crawlPaths[$global:crawlPosition])"
        Write-Host "Retrieving content"
        Write-Host "Url: $nextUrl"
        $response = Invoke-WebRequest -Uri $nextUrl -Method Get
        $htmlDom = ConvertFrom-Html -Content $response.Content
        CrawlPage $nextUrl $htmlDom
        TranslateAndSave $nextUrl.AbsolutePath $htmlDom $targetDir
        $global:crawlPosition++
    } 
}

I saved the script into a file called 'CrawlingTranslator.ps1'. Now I could just start the script as follows:

.\CrawlingTranslator.ps1 -url [url of site] -targetDir .\translated -language [two letter language code] -defaultPage /index.html -crawl

The script uses the url of the site (this can also be a specific page) to retrieve the first page. The targetDir is the directory where the extracted and translated pages are stored. With language you specify the target language using its two letter code, like nl for Dutch. The defaultPage determines which page path is used when the root of the site (/) is encountered. Finally, the crawl switch indicates that you want to crawl the site to find more pages to translate.
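
For example, translating a site to Dutch with crawling enabled could look like this (the url is just a placeholder):

.\CrawlingTranslator.ps1 -url https://www.example.com -targetDir .\translated -language nl -defaultPage /index.html -crawl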

So this script did the trick for me. Please note that I haven't tested it that rigorously, but I thought it might be interesting for others as well. Maybe you are only interested in the translation part, or only in the crawling part (I kept that quite simple); these are just some ideas on how this can be implemented. You might want to test things first and/or find out how many pages are crawled by commenting out the contents of the TranslateAndSave function.
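
For such a dry run, a quick sketch is to temporarily replace the body of the TranslateAndSave function with a Write-Host, so nothing is translated or written:

function TranslateAndSave($pagePath, $htmlDom, $targetDir) {
    # Dry run: only report the page that would have been translated and saved
    Write-Host "Would translate: $pagePath"
}

Have fun with it.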