Aharoni in Unicode

Why Did I Leave Quora, Why It Is Not Such a Big Deal, and Why Do I Nevertheless Hope to Come Back

Published 2018-10-13 Facebook , Internet , Quora , Twitter 3 Comments

This post is sad and angry. I don’t mention any names, but some people may read it and identify themselves here. Here’s my request to these people: I hope that this doesn’t hurt you personally. It’s not my intention to hurt anyone personally. All people have to do their jobs; sometimes they are happy about what they do despite some people’s complaints, sometimes they aren’t happy, but do it anyway because they need to pay bills, or because it’s necessary for some kind of greater good. I totally get it. There’s a certain chance that I’ll meet some of you online or in real life. If this ever happens, I hope you don’t feel embarrassed or intimidated. I’ll be happy to meet you and I promise to be friendly. Thanks for understanding.

I used to be a prolific writer on the question and answer website Quora. I was even named a “Top Writer” four times. Sadly, in 2018 this once-fine website ruined itself.

The problematic signs were there even earlier, but the true catastrophe began with the “Links” feature. This feature adds links to articles on other websites to the Quora feed. Before this feature’s introduction the feed consisted mostly of questions and answers, as one would expect from, you know, a questions and answers website.

The articles and the websites shown as “Links” in the feed are selected automatically by Quora’s software. How this software works is a mystery. There appears to be some intention to show things that are related to the topics that the user follows, but it also suggests unrelated topics. Sometimes they are labelled “Topic you might like”. Sometimes they aren’t labelled at all.

There’s no way to select a website to follow and see links from. There is a way to mute websites, but other sites will be shown instead.

There’s also no way to remove all the Links from the feed completely. By popular demand from Quora users, a volunteer made a browser extension called “Qure” that does it, but it only works on the web and not on the Quora mobile app.

The Link items in the feed look almost exactly like questions, which is severely distracting and feels out of place. Quora staff people who work on this feature know it: “the links feel out of place” is a direct quote from a staff person. They know that many users dislike them, but they choose to show them anyway. “We’ll show links less to people who don’t like them” is also a direct quote from a staff person.

Let this sink in: They know that some people don’t like the links, and they show them to these people anyway. My logic—I won’t even bother calling it “ethics”—tells me that when you know that a person doesn’t like a thing, you don’t show that thing to that person at all unless you have a particularly good reason and you can explain it.

Another problematic feature that Quora introduced in 2018 is “Share”. This sounds like a sensible thing to have on any modern website, but on Quora it has a somewhat different meaning. “Sharing” on Quora means putting an item in your followers’ feed with a comment.

This is similar to retweeting with a comment on Twitter. It works fairly well on Twitter, but Quora is not Twitter. On Twitter everything is limited to 280 characters, both the tweets and the comments on retweets. On Quora answers can and should be longer, but the comments are short, and this feels imbalanced.

What’s worse, even though Quora says that the comments on shared items “provide additional insight”, they are actually rather pointless. In fact, many of them are not even really written by people, but filled semi-automatically: “This is interesting”, “This is informative”, “Great summary”, “I recommend reading this”, etc. Those that are actually written by humans are not much better, for example: “H.R. has been a wonderful teacher and excellent writer. Since joining Quora last year I’ve latched on to his brilliance – he’s earned his place firmly”. This says nothing substantial that couldn’t be expressed by simply upvoting the writer’s answer.

Both links and answers can be shared. I’ve just explained why sharing answers is pointless. Sharing links is a weird thing: On one hand, seeing a link that was shared by a Quora user makes relatively more sense than seeing a link that was added to the feed by faceless software for some reason I don’t know. In practice, however, it doesn’t make the link any more sensible or useful. Shared links feel totally relevant on Facebook and Twitter, but Quora is neither Facebook nor Twitter. It’s a site for questions and answers, or at least it used to be one.

And then there are the items that are questions or answers, but that are shown to me on my feed for mysterious reasons: They are categorized under a topic I don’t follow, they are written by users that I don’t follow, and they weren’t even upvoted or shared by users that I do follow. They are just totally, completely unrelated to me.

Occasionally they are labelled as “Topic you might like” or “Author you might like”, but sometimes they don’t even carry this label.

It’s difficult to discuss this feature because unlike “Share” and “Links” it doesn’t even have a name. It’s just… random stuff that I didn’t ask to see, and that appears in my feed. In this blog post I’ll call it Nonsense. It’s not a nice name, but that’s what it is. (I really want to know this feature’s real name. It surely has one. If you are on Quora staff, please tell me what it is. I won’t reveal your identity.)

I would possibly understand showing this Nonsense to new users: Quora may want to suggest you stuff to follow to get you hooked. But I’ve had the account for seven years, I follow lots of people and topics, I visit the site several times a day, and I know very well what I want.

What’s worse, Nonsense items are shown to me while many items written by people I do follow are not. I followed people on Quora because their personality or knowledge genuinely interested me. To me, “Follow” means that I’m interested in seeing stuff written by these people. But Quora decided to disregard my specific request, and to show me Nonsense instead.

There’s no way to run away from Link items, from Share items, and from Nonsense items. Quora has a Mute feature, but for the most part it does more harm than good:

When you mute a Link item it mutes a particular link source, for example New York Times or Breitbart (yes, both are available), but when you mute one source, other sources are shown instead and there appears to be no end to it.
When you see an answer on a topic you don’t follow, you can mute that topic, but this (probably) means that if an answer is written in this topic by a user that you do follow, you won’t see it. This is often not what one wants. For example, “Entertainment” is a topic on which answers are often shown to me, even though I don’t follow it. I don’t want to see these random answers, but if a user I follow posts an answer to a question for which this is one of the topics, I’d be OK with seeing it.
When you see an “Author you might like”, and you don’t actually like that author, you can mute them. As above, this is not necessarily what I want: If that author happens to write an answer on a topic I follow, I’ll be OK with seeing it. I just don’t want to see that author’s answers when they are completely unrelated to me, but this is a feature, and there’s no way to get rid of it.

When I first saw the Links in February 2018, I was immediately appalled: What is this thing that is neither a question nor an answer?! When I saw that I could not remove them from my feed, I pretty much immediately decided to stop using the site. It was clear to me that something was badly wrong.

Even though I deleted my Facebook account in 2015, I created a new one some time after the links were introduced, just so that I could join the private Quora Top Writers Feedback group. For several months I tried talking to the Quora staff people in that group to understand: Why do the links even exist? Why are they so random and useless? Why are pointless items shown to me? I got almost zero substantial replies.

I intentionally came back to sincerely using Quora, thinking that the algorithms will learn my behavior, and show me more relevant links, or no links at all. This didn’t work, of course, and Quora became even worse when the awful Share and Nonsense features were added, so in June 2018 I stopped posting there almost completely.

After some more time, the Facebook group’s moderator didn’t like my questions about these unfortunate features, and removed me from the group, too. The explanation was that they were repetitive, which is understandable; what is less understandable is that instead of removing me from the group they could try answering the questions. They didn’t. They did suggest sending my complaints to a particular email address for Top Writers. I did it, and I received no reply.

So that’s it, I guess.

A legitimate question arises: Could I use Quora without the feed? Not really, because the best thing about Quora was that before the disastrous 2018 changes it showed me answers that interest me and questions that need answers on topics about which I know something. Without this, the site is not that useful. It moved to being oriented much more towards readers who are prone to click on clickbait and to writers who are local Quora “stars”. I don’t belong to either group.

(Before I go into the last conclusions, I should mention one unrelated and very positive thing that Quora did in 2018: Expansion of its internationalization efforts. For years, Quora used to be explicitly English-only. Later, Quora introduced sites in several new languages, among them Spanish, German, Hindi, Portuguese, Indonesian, and French. It also added an answer translation feature, which, while not yet implemented perfectly, is a step in a very good direction. I hope that it gets developed further and doesn’t get killed.)

I have a bit of a price to pay for publishing this blog post. I probably won’t be a top writer again (this came with pretty nice swag). I might be banned; not that it matters, because I plan to deactivate the account anyway. I may run into Quora staff people at professional conferences, and things may get awkward (see the top of this post; I do hope to meet you, and I hope that it won’t get too awkward).

But at the same time… it’s not actually a big deal. Even though before 2018 Quora was a really nice place to ask my questions and to answer questions for which people need an answer, it is nowhere near being a truly essential site like Wikipedia. No longer reading and writing there every day allowed me to focus better on family and work, and also to revive some old neglected projects, such as translating Wikipedia articles or proofreading Gesenius’ Hebrew Grammar at Wikisource.

All that said, yeah, I’d probably be happy to come back. The web does need a good question and answer site, with relevant topics, with pleasant design, and with good moderation. Quora used to be such a site. It is no longer such a site, with or without me. It can easily go back to being one. However, this will only happen when it becomes possible to remove Links, Shared items, and Nonsense from the feed.

A couple of last conclusions:

On a website that has the characteristics of being a social network or a writers community, users need to be empowered somehow. It’s not easy, and it has costs, but when it’s done right, it’s worth it. Wikipedia empowers its users ridiculously: on no other site can the users edit the site’s CSS and JavaScript (not all users, but a lot of them). Reddit is not as transparent as Wikipedia, but it’s quite empowering as well: subreddit moderators can pressure the site’s management. The results of this pressure may be unpleasant and controversial, but it’s nevertheless good to have balances. Quora users are not empowered at all. It gives the company a lot of control, but is it actually good?
Some people enjoy random weird algorithmically-selected stuff, and some people don’t. I hate the Links, and the Nonsense items, and a lot of other users hate them, but some people are fine with them. And that’s OK. That’s what preferences are for.

Amir Aharoni’s Little Take on the Lodestar Affair

Published 2018-09-16 language , lexicography Leave a Comment
Tags: Merriam-Webster

In case you haven’t heard, an op-ed called I Am Part of the Resistance Inside the Trump Administration was published in the New York Times on September 5. It was allegedly written by an anonymous senior person in the White House, and it made a whole lot of noise in the news.

People immediately started guessing who this is. One of the popular guesses is that it’s vice president Mike Pence, because the article uses the word “lodestar”, which is relatively rare, but unusually common in Pence’s past speeches.

And here’s my tiny, tiny conspiracy theory about it: “lodestar” was Merriam-Webster’s word of the day on August 28. Being a dictionary lover, I listen to Merriam-Webster’s Word of the Day podcast every day using Podcast Addict, a simple RSS-based podcast player. I didn’t hear this episode. If you try to download this episode using Podcast Addict, you’ll see that the title is “lodestar”, but in fact it’s the episode for “rubric“, the previous day’s episode.

It’s kind of weird, but maybe it’s a total coincidence. Maybe the person who wrote the op-ed just follows the word of the day not through the podcast, but elsewhere on the web. And maybe it has nothing to do with Merriam-Webster, and they are just an educated person who knows words like “lodestar”.

But hey, feel free to spread the rumor that Merriam-Webster is trying to subvert the government, or make up whatever other nonsense you want.

There’s Nothing Particularly Good About Long Wikipedia Articles. Let’s Make Them Shorter

Published 2018-05-10 design , Wikipedia 2 Comments

Wikipedia used to have a warning about articles of a certain size. If I recall correctly, it was 64KB. As far as I understand, the reason for this was more engineering-oriented than user-experience-oriented: Loading a larger page was slower, because networks were slower, or at least so some people thought.

Wikipedia no longer has this warning. It’s not unusual to have a page of 250KB or more. I don’t participate in discussions about performance, but the discussions that I do see are about the time that it takes to parse the templates server-side, to load JavaScript modules, and to render the CSS; they are not so much about the kilobyte size of the pages themselves.

I suspect, however, that there is a problem with page length. Not one of performance engineering, but of user experience. Do people actually read whole encyclopedic articles in Wikipedia? In case you haven’t guessed it already, my hypothesis is that most people don’t.

This is my hypothesis because of the famous debunking of a designer myth: people usually don’t read texts.

It should be clarified right away that the notion that people don’t read whole Wikipedia articles is not, by itself, a problem. It may be a bit sad for people who invest hours (or years!) in writing the brilliant prose of each excellent article, but the point of Wikipedia is not supposed to be getting millions of people to read very long articles. Rather, it’s making information that they need accessible, and making it as easy as possible for everybody to edit this information.

Do long articles make finding information easy? Probably not. Experienced Wikipedia editors are familiar with article structure, with tricks like Find in Page, and so on, but a lot of readers are not.

So here’s my call: Let’s bring back the article length warning in some form. The importance of a topic doesn’t necessarily justify having a very long article about it. The purpose is not to have a long page, but to make information easy to find. If splitting an article into several pages makes the information easier to find, then the readers will of course be happy, and the editors who invest their effort in writing a lot about a topic should be happy, too, because their writing is more likely to be actually read.
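To make the idea a bit more concrete, here is a minimal Python sketch of what such a warning could check. It is only an illustration: the 64KB threshold is the figure I recalled above, and the function name and the message text are made up for this example; none of this is how MediaWiki actually does or did it.

```python
# Assumed threshold, taken from the 64KB figure recalled above.
WARNING_THRESHOLD_BYTES = 64 * 1024


def article_size_warning(wikitext, threshold=WARNING_THRESHOLD_BYTES):
    """Return a warning string if the page source is over the threshold, else None."""
    # Measure bytes rather than characters: non-Latin scripts take
    # more than one byte per character in UTF-8.
    size = len(wikitext.encode("utf-8"))
    if size > threshold:
        return (
            f"This page is {size // 1024} KB long; "
            "consider splitting it into several pages."
        )
    return None
```

A size check like this is crude, because it says nothing about whether the information is easy to find, but it is cheap and it could at least prompt editors to think about splitting.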

Wikimedia Strategy Phase 1: What Does It Mean for Me and (Maybe) for Language Diversity in Wikipedia

Published 2018-02-25 Free Software , language , Wikipedia Leave a Comment
Tags: wmcon

The Wikimedia Foundation is leading a process to write a strategy for the Wikimedia movement. This process takes over a year. A few months ago, the conclusion of Phase 1 of this process was published: The strategic direction.

Some central concepts in this document are “knowledge as a service” and “knowledge equity”. Some people said that it’s too vague and high-level, and that it can be interpreted in a lot of ways. This is true, especially in a movement that is as culturally and linguistically diverse as Wikimedia. Perhaps this is intentional, so that people will be able to interpret this in any way that feels right for them.

Recently I was filling out a registration form for Wikimedia Conference 2018. This form was very long, and it asked what the concepts that appear in the strategic direction document mean to me. My answers were longish, and since there’s nothing secret about them, and they may (or may not) interest some people, I copied them from the form to this blog post. I edited them slightly for publishing here so that the context will be clearer, but the essence is the same as what I submitted.

Knowledge as a service

The knowledge that Wikimedia projects already contain is available through all common channels of communication: in addition to being available on the website, it must be findable on all search engines in all languages and countries, browsable on devices of all operating systems whether open or not, browsable as much as possible through social networks and chat applications, embeddable in other apps, etc.

It must be easy for all people, whether they are knowledgeable about computers or not, to contribute their knowledge to Wikimedia sites, and humanity in general should know that Wikimedia sites are the place where they contribute their knowledge and not only learn it.

Knowledge equity

What it means to me is:

That all people, of all ages and all kinds of identities, of all countries, who speak all languages, must be able to read and write in their language.

That we will fight whenever it’s reasonable against censorship and against all kinds of chilling effects that deter potential contributors or threaten their well-being.

That we remain independent of commercial and political entities by strictly refusing to carry political and commercial advertising and to accept unreasonable limited grants.

That all the software that is useful for reading and writing on our sites must be easily usable in all languages, whether it’s core software, extensions, templates, or gadgets.

That we don’t depend on any non-Free or otherwise unethical software, even if it appears to make consuming and contributing knowledge easier.

That we set a goal of having good coverage for core content in all languages and actively pursue it and not leave it only to the community’s “invisible hand”.

That we set a goal that the most popular Wikimedia projects in each country are in that country’s most spoken languages and not in a foreign language.

What kind of conditions do you need to realize these activities?

Describe what you think would be good conditions for you to move forward in this direction. Think of conditions in the broadest sense; e.g., capacity, skills, partnerships, clarification, structures and processes, room for development or experimentation, financial resources, people, access to other means of support etc.

We need to partner with academic institutions that work on topics that are not currently covered by our projects because of systemic bias.

We need to partner more with organizations that have expertise in developing minorized and under-resourced languages, working on the ground in the countries where these languages are spoken.

We need easy access to data about the social and political situations in poorer countries, and if such data doesn’t exist at all, we need to lead research that creates such data ourselves.

We need a new attitude to developing software for our sites: we need to understand what our communities actually do on the sites with gadgets and templates rather than just developing new extensions that may be shiny, but are hard to integrate into the sites, each of which is heavily customized.

What I wrote in that form is a good description of my current attitude to what the priorities of the Wikimedia movement should be, at least in terms of ideology and values. You can clearly see my interests: remembering that language support is important and that most people don’t speak English; remembering that we are not supposed to be an American non-profit organization, but an international movement that happens to have an office in the U.S.; remembering that we are also a part of the Free Software movement; remembering that good software engineering is important, even if engineering alone can’t solve all the problems.

For people who have doubts: This post represents my own opinions, and doesn’t express the opinion of the Wikimedia Foundation or any of its employees or managers.

How Gboard Could Be Better for Hebrew

Published 2017-11-28 Android , Google , Hebrew , keyboard Leave a Comment

Oh (edit): Most of these suggestions are implemented as of February 7 2018. The only significant change that still does not seem to be implemented is the Oleh character. Thank you, Google, for your continued improvements of Gboard.

I mostly use the Gboard app for writing on my phone. The Samsung keyboard is generally not bad, but it doesn’t include Hebrew vowels, and I need them.

There are, however, several characters that are needed for Hebrew, and that aren’t included in Gboard, and some unnecessary characters could be removed.

These can be removed:

Long-pressing the minus (-) in the punctuation keyboard shows the interpunct (·) and the em dash (—). They are unnecessary for Hebrew. The en dash (–) must not be removed, but see below.
The low line (_) appears twice in the punctuation keyboard: as its own key to the left of &, and as an option when long-pressing the minus (-). One of them can be removed. I’ll further argue that the en dash (–) is more useful for Hebrew than the low line (_), and the standalone low line can be replaced with the en dash. The low line is not used much anywhere except programming, while the en dash is useful for typing ranges correctly in Hebrew. I’ll readily admit that not a lot of Hebrew speakers know about the en dash’s correct semantics, but not many more people use the low line anyway.

And these should be added:

Maqaf (־, U+05be): It’s the Hebrew hyphen. It has different appearance and different direction semantics. It should be available when you long-press the minus in the main keyboard, and can also appear when you long-press the minus in the punctuation keyboard (for example, instead of the unnecessary em dash).
Geresh (׳, U+05f3) and Gershayim (״, U+05f4): These punctuation marks are similar in appearance to quotation marks, but they have different semantics. Apple went as far as replacing quotation marks on Hebrew keyboards on its devices with Geresh and Gershayim, which is an exaggeration. The usual quotation marks (‘, “) are used by most people, even though they are not perfect, and they must stay on Gboard where they are. The elegant Hebrew quotation marks (‚’„”) also appear on Gboard and must not be removed. Geresh and Gershayim can be added on the additional punctuation
Rafe (U+05bf): It’s a diacritic that looks like a line above a letter, and the opposite of dagesh, which is already available. It can appear when you long-press the letter resh (ר).
Oleh (U+05ab): It’s a diacritic that looks like a left-pointing arrow above a letter, and in modern Hebrew it signifies stress. It can appear when you long-press the letter ayin (ע).

The five characters that I suggest adding are already part of the standard Hebrew keyboard (SII 1452), which is implemented in Windows 8. They must also be available in Android.
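For anyone who wants to check these code points, here is a small Python sketch that looks them up in the Unicode character database using the standard `unicodedata` module. The dictionary keys are just my labels from the list above; the official names come from the database itself.

```python
import unicodedata

# The five characters suggested above, keyed by the labels used in this post.
SUGGESTED = {
    "maqaf": "\u05be",
    "geresh": "\u05f3",
    "gershayim": "\u05f4",
    "rafe": "\u05bf",
    "oleh": "\u05ab",
}


def describe(char):
    """Return the code point and the official Unicode name of a character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


for label, char in SUGGESTED.items():
    # A non-zero combining class means the character attaches to a letter
    # (a diacritic) rather than standing on its own.
    kind = "combining mark" if unicodedata.combining(char) else "standalone character"
    print(f"{label}: {describe(char)} ({kind})")
```

The distinction between combining marks and standalone characters matters for a keyboard: the maqaf, the geresh, and the gershayim can be typed anywhere, while the rafe and the oleh only make sense after a letter.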

I hope that Google developers see this and make the necessary changes.

Twitter Must Make it Easy to Mass-Report Spam Bots

Published 2017-09-21 diversity , Internet , Russia , Russian , spam , Twitter 3 Comments

I found a network of Russian female bots. Twitter spam bots.

They are not actually female. They just have Russian female names and female photos.

Most of those that I found were created in September 2016, although some were created at other times.

They all have similar taglines:

“In my opinion, everything is wonderful. I wonder what else” (“По-моему всё прекрасно. Интересно что ещё”)
“Right now absolutely everything is excellent. I wonder how else” (“Сейчас вообще всё отлично. Интересно как там ещё”)
“It looks like absolutely everything is wonderful. I’ll see what will happen next” (“Вроде вообще всё прекрасно. Посмотрю что будет дальше”)

… And so forth, with minor variations, which are very easy to detect for a human who knows Russian, although I’m less sure about software. (This reminds me of how I was interviewed for several natural language processing positions around 2011. All of them were about optimizing site text for Google ads, and all of them specifically targeted only English. When you only target English, other languages are used to spam you.)

Their usernames are all almost random and end with two digits: flowoghub90, viotrondo86, chirowsga88 (although “90” seem to be the most frequent digits). As location, they all indicate one of the large cities of Russia: Moscow, Krasnoyarsk, Perm, Saint-Petersburg, Rostov-on-Don, etc.
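I said that I’m less sure about software, so here is a rough Python sketch of how the two signals described so far could be combined: a username that ends with two digits, and a tagline that closely resembles one of the known ones. This is only my illustration, not anything that Twitter does; the function names and the 0.7 threshold are arbitrary assumptions, and a real detector would need far more than this.

```python
import re
from difflib import SequenceMatcher

# Taglines quoted above; a real list would be longer.
KNOWN_TAGLINES = [
    "По-моему всё прекрасно. Интересно что ещё",
    "Сейчас вообще всё отлично. Интересно как там ещё",
    "Вроде вообще всё прекрасно. Посмотрю что будет дальше",
]

# Lowercase letters followed by exactly two digits, like flowoghub90.
USERNAME_PATTERN = re.compile(r"^[a-z]+\d{2}$")


def tagline_similarity(bio):
    """Return the highest similarity (0.0 to 1.0) between a bio and the known taglines."""
    return max(
        SequenceMatcher(None, bio.lower(), tagline.lower()).ratio()
        for tagline in KNOWN_TAGLINES
    )


def looks_like_bot(username, bio, threshold=0.7):
    """Flag an account only when both the username and the tagline signals match."""
    # The threshold is an arbitrary assumption chosen for this sketch.
    return bool(USERNAME_PATTERN.match(username)) and tagline_similarity(bio) >= threshold
```

Requiring both signals keeps the sketch conservative: plenty of real people have usernames that end with two digits, and the taglines alone could be imitated as a joke.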

All of them post nothing but retweets of other accounts popular in Russia:

Russian government: PutinRF, MedvedevRussia
Russian government supporters: VRSoloviev, NickValuev
Mainstream news: ForbesRussia, vesti_news
Entertainment: achekhova, Nyusha_Nyusha
Major internet businesses: yandex, vkontakte, GoogleRussia
And even opposition politicians: navalny

Curiously, all their names are typical only of ethnic Russians. Names of real women from Russia would be much more varied: there would be a lot of typical Armenian, Ukrainian, Jewish, Georgian, and Tatar names that reflect Russia’s diversity: Melikyan, Petrenko, Rivkind, Gamkrelidze, Khamitova. But these spam bot accounts only have names such as Kuznetsova, Romanova, Ershova, Medvedeva, Kiseleva. If you aren’t familiar with Russian culture, let me make a comparison to the U.S.: It’s like having a lot of people named Smith, Harris, Anderson, and Roberts, and nobody named Gonzalez, Khan, O’Connor, Rosenberg, or Kim. Maybe the spammers wanted to be more mainstream than mainstream, and maybe it is just overt racism.

I found them when I noticed that a lot of unfamiliar accounts with Russian female names were retweeting something by Pavel Durov in which I was mentioned. Durov is the founder of VK and Telegram, and I guess that he can be classified under “major internet businesses” in the list above. I noticed the similar taglines of the “women”, and immediately understood they are all spam bots.

These accounts are active. Some of them retweeted stuff while I was writing this post. I also keep getting retweet notifications, more than two weeks after Durov’s original tweet was posted.

When I look at any of these accounts, Twitter suggests similar ones to me, and they are all in the same network: Russian female names, similar “everything is wonderful” taglines, similar content. So Twitter’s software understands that they are similar, but doesn’t understand that they are spam bots that should be utterly banned. I also noticed that some of them are still suggested to me after I blocked them, which goes against the whole point of blocking.

I don’t know how many of them there are in this network. Likely thousands. I reported thirty or so, and I wonder whether it has any effect.

I also don’t know what their purpose is. Boost the popularity of other Russian accounts? But those that they retweet are popular already. Waste the time of people who try to use Twitter productively? Maybe; at least that’s the effect in my case. Function as bot followers in “pay to follow” networks? Possibly, but they have existed for a year, and they don’t follow so many people.

I’m probably not discovering anything very new in this post. But if I’m not, it makes me wonder all the more why this problem hasn’t been addressed already. At the very least it should be possible to report them more efficiently with one click or tap. And Twitter should also provide a form for mass-reporting; currently, Twitter’s guides about spam only suggest this: “The most effective way to report spam is to go directly to the offending account profile, click the drop-down menu in the upper right corner, and select ‘report account as spam’ from the list.” It’s OK for one account, but it requires five clicks, and it doesn’t scale for something as systematic as what I am describing in this post.

I do hope that somebody from Twitter will read this and do something about it. This is obvious systematic abuse, and I have no better way to report it.

The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation

Published 2017-08-29 Belarusian , Free Software , Igbo , Microsoft , Nigeria , Russian , search , translation , Twitter , Ukraine , Wikipedia 3 Comments

Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.

Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:

The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.

Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.

Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example from the Euroradio account:

Both tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian and Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.
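The letter rules above are simple enough to write down as code. Here is a minimal Python sketch; it is my own illustration, not what Bing or Twitter do, and it only tells apart the three languages discussed here by their distinctive letters. It returns `None` when no distinctive letter is found, because guessing would be worse than admitting ignorance.

```python
import unicodedata

SHORT_U = "\u045e"        # ў, used in Belarusian (and Uzbek)
DOTTED_I = "\u0456"       # і, used in Belarusian and Ukrainian, never in Uzbek
HARD_SIGN = "\u044a"      # ъ, among these three languages only in Russian
UKRAINIAN_YE = "\u0454"   # є, among these three languages only in Ukrainian


def guess_language(text):
    """Guess Belarusian, Russian, or Ukrainian from distinctive letters, else None."""
    # NFC makes sure that ў typed as у plus a combining breve is recognized, too.
    letters = set(unicodedata.normalize("NFC", text).lower())
    if SHORT_U in letters and DOTTED_I in letters:
        return "Belarusian"
    if HARD_SIGN in letters:
        return "Russian"
    if UKRAINIAN_YE in letters:
        return "Ukrainian"
    return None
```

A short tweet may not contain any of these letters, so a check like this can only complement statistical language identification, but when the letters are there, the answer is unambiguous.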

And yet, Twitter’s machine translation suggests translating the top tweet from Ukrainian, and the bottom one from Russian!

An even stranger thing happens when you actually try to translate it:

Notice two weird things here:

After clicking, “Ukrainian” turned into “Russian”!
Since the text is actually written in Belarusian, trying to translate it as if it was Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. You can notice how the letter ⟨ў⟩ cannot be transliterated.

Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:

This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.

The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with the addition of diacritical marks, one of the most common of which is the dot below, such as in the word ibụọla in this Igbo tweet, and the words chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying that a text is written in a certain language only by this feature is really not great.
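Here is a small Python sketch that shows why the dot below is such a weak signal. It decomposes the text with the standard `unicodedata` module and looks for the combining dot below (U+0323); the function name is made up for this example, and I’m only guessing that a feature like this is what confuses the language identification.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW


def has_dot_below(text):
    """Return True if any letter in the text carries a dot below."""
    # NFD splits precomposed letters such as ọ into a base letter plus marks.
    return DOT_BELOW in unicodedata.normalize("NFD", text)


igbo = "Nwoke \u1ecdma, ib\u1ee5\u1ecdla chi?"  # the text of the tweet
vietnamese = "ch\u1ecdn l\u1ecdc"

print(has_dot_below(igbo), has_dot_below(vietnamese))  # prints: True True
```

Both texts pass the check, so this feature alone cannot distinguish the two languages.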

If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the Tweet, but this does show a curious thing: Even though the same translation engine is used on both sites, the language of the same text is identified differently.

How could this be resolved?

Neither Belarusian nor Igbo is supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to just skip it completely and not to offer any translation, than to offer this strange and meaningless thing. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but their grammar is so similar that it shouldn’t be that hard. But what to do until that happens?

In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, for the languages that they support, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.
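The logic of “use whichever backend supports the language, and offer nothing if none does” can be sketched in a few lines of Python. This is not the actual code of Content Translation or of Twitter; the backend list and the language sets are illustrative placeholders based only on what is mentioned in this post, and the function name is made up.

```python
# Illustrative placeholders only: real services support many more languages.
BACKENDS = [
    ("Bing", {"fr", "ru", "uk", "vi", "it"}),
    ("Yandex", {"be", "ru", "uk"}),
    ("Apertium", {"be"}),
]


def pick_backend(language_code, backends=BACKENDS):
    """Return the first backend that supports the language, or None."""
    for name, supported in backends:
        if language_code in supported:
            return name
    # No backend supports this language: better to offer no translation
    # than to translate as if the text were in a different language.
    return None
```

With these placeholder sets, `pick_backend("be")` falls through Bing to Yandex, and `pick_backend("ig")` returns `None`, which means that no “Translate” link should be shown at all.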

Another thing for Twitter to try is to let users specify the languages in which they write. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. An explicit list of the languages that a user writes in would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.
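As a sketch of how such a preference could help, here is a bit of Python that narrows down the output of a language detector using the languages that the user declared. Everything here is hypothetical: the function name, the shape of the scores, and the numbers in the usage note are invented for the illustration.

```python
def identify_language(scores, declared_languages=None):
    """Pick the most likely language, preferring the ones the user declared.

    scores maps language codes to detector confidence (hypothetical values).
    """
    if declared_languages:
        allowed = {
            lang: score
            for lang, score in scores.items()
            if lang in declared_languages
        }
        # Fall back to the full list if none of the declared languages
        # was detected; people sometimes write in unexpected languages.
        if allowed:
            scores = allowed
    if not scores:
        return None
    return max(scores, key=scores.get)
```

For example, if a detector gave a tweet the scores `{"vi": 0.6, "ig": 0.3}`, the result without preferences would be `"vi"`, but for a user who declared `{"ig", "en"}` it would be `"ig"`.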

If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.
