A simplistic approach to auto-detecting the language of a blog post

written on 2019-05-07

HTML wants you to declare the language of your page.

<html lang="en">

would say: "this page is in English". Why is that useful? Once the browser knows the language of your page, it can, for example, apply automatic hyphenation.

I like writing in both English and German, so the language varies between blog posts. To declare the language correctly in the HTML of each post, I could add a field in which I explicitly state either "en" or "de" for each blog post I create.

But writing is about content, not about wrangling metadata. So there's got to be a better way, right? Let's use some AI algorithm.

The 100 most common words

Take the 100 most common words of each language. For each word, check whether it occurs in the blog post. The language with more of its words found in the post wins; a tie goes to English.

Luckily, Wikipedia has lists of the 100 most common words for both English and German.
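To make the counting rule concrete, here is a minimal sketch in Python (the language Lektor itself is written in). The helper names `count_hits` and `detect_lang` are made up for illustration, and the word lists are truncated three-word samples, not the real 100-word lists. The sketch uses Python's `in`, which is a substring test on strings.

```python
# Truncated sample lists for illustration; the real lists hold 100 words each.
DE_WORDS = ["der", "die", "und"]
EN_WORDS = ["the", "be", "to"]


def count_hits(text, words):
    """Count how many words from the list occur somewhere in the text."""
    return sum(1 for word in words if word in text)


def detect_lang(text, de_words=DE_WORDS, en_words=EN_WORDS):
    """Return "de" if more German than English list words occur, else "en"."""
    de_hits = count_hits(text, de_words)
    en_hits = count_hits(text, en_words)
    # A tie (including no hits at all) falls back to English.
    return "de" if de_hits > en_hits else "en"


print(detect_lang("der Hund und die Katze"))  # de
print(detect_lang("to be or not to be"))      # en
```

Each list word counts at most once, no matter how often it appears in the post, so a long post does not automatically score higher than a short one.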

This can even be implemented in the (slightly limited) Jinja templating language, which Lektor (the static site generator that this blog is based on) uses:

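{%- macro auto_detect_lang(text) -%}
  {#- Renders "de" or "en", depending on which word list matches the text more often. -#}
  {%- set de_words = ["der", "die", "und", "..."] -%}
  {%- set en_words = ["the", "be", "to", "..."] -%}

  {#- Matches are appended to lists: a counter assigned with "set" inside a for loop would not be visible after the loop. -#}
  {%- set contained_de_words = [] -%}
  {%- set contained_en_words = [] -%}

  {#- The "do" tag requires the jinja2.ext.do extension. -#}
  {%- for word in de_words -%}
    {%- if word in text -%}
      {%- do contained_de_words.append(word) -%}
    {%- endif -%}
  {%- endfor -%}

  {%- for word in en_words -%}
    {%- if word in text -%}
      {%- do contained_en_words.append(word) -%}
    {%- endif -%}
  {%- endfor -%}

  {#- A tie goes to English. -#}
  {%- if contained_de_words|length > contained_en_words|length -%}
  de
  {%- else -%}
  en
  {%- endif -%}
{%- endmacro %}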
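One caveat: if `text` is a plain string, `word in text` is a substring test, so "der" also matches inside "under". The following Python sketch contrasts that behaviour with a whole-word comparison. The helper names `substring_hits` and `whole_word_hits` are hypothetical, and the word list is again a truncated sample.

```python
import re


def substring_hits(text, words):
    """Words found by a substring test, like `word in text` on a string."""
    return [word for word in words if word in text]


def whole_word_hits(text, words):
    """Words found when the text is first split into lowercase whole words."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [word for word in words if word in tokens]


sample = "She studied under a wonderful teacher."
de_sample = ["der", "die", "und"]  # truncated sample list

print(substring_hits(sample, de_sample))   # ['der', 'die', 'und']
print(whole_word_hits(sample, de_sample))  # []
```

With 100 words per language, such accidental matches tend to happen on both sides, which may be why the simplistic substring version still works well enough in practice.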

You can find the full code in the blog's git repository.