Understanding Arabic URL & URI Structure Encoding for Arabic Sites
If you have spent any time working with newer websites in Arabic, you have surely encountered Arabic URLs or URIs. It is great that Arabic speakers can now read the address of a website in their own language. Behind the scenes, however, the Arabic address the user sees in the browser address bar is stored and transmitted in an encoded form, which modern browsers like Chrome, Firefox, and Internet Explorer decode and display as Arabic characters. Search engines that support Arabic, like Google Arabic, can also interpret this encoding, so you see Arabic characters in the display URLs on search result pages. In this article, we will examine Arabic URL and URI encoding to better understand what lies behind the Arabic website address you see in your browser.
Arabic IDNs
To start with, let’s discuss IDNs, or Internationalized Domain Names. Arabic IDNs include domains written in the Arabic script like “.مصر” or “.امارات”. Since the Domain Name System (DNS) does not handle Arabic script directly, these domain names must be converted to an ASCII form, made up of Latin letters, digits, and hyphens, before the DNS can read them. Unlike the system used for converting Arabic website directories or folders to something your browser can understand, there isn’t a one-to-one correlation between how a letter is written in Arabic and how it appears in ASCII for IDNs. There are, however, several converters online that use an encoding called Punycode to convert Arabic-script domains into their ASCII form, which always begins with the prefix “xn--”. Understanding and analyzing URL directories, or what are called URIs, is much easier, and we will go into that next.
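As a rough illustration of the conversion described above, the following Python sketch uses the standard library's built-in "idna" codec to turn an Arabic label into its ASCII form and back. The helper names to_ascii and to_unicode are made up for this example, and the built-in codec follows the older IDNA 2003 rules, so treat the output as indicative rather than as what every registry or browser produces.

```python
def to_ascii(domain: str) -> str:
    """Convert an Arabic-script domain label to its ASCII (Punycode) form."""
    # The "idna" codec applies Punycode and adds the "xn--" prefix.
    return domain.encode("idna").decode("ascii")


def to_unicode(ascii_domain: str) -> str:
    """Convert an ASCII (Punycode) label back to Arabic script."""
    return ascii_domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    for label in ("مصر", "امارات"):
        encoded = to_ascii(label)
        print(label, "->", encoded, "->", to_unicode(encoded))
```

Notice that the ASCII form bears no visible relation to the individual Arabic letters, which is why IDNs are harder to read by eye than encoded URIs.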
Arabic URI structure analyzed
For Arabic characters to be used in URLs, they are encoded as UTF-8 bytes. Each basic Arabic letter takes two bytes in UTF-8, and each byte is written as two hexadecimal digits, which gives a 4-character hexadecimal string. An example would be the Arabic letter و WAW, whose UTF-8 encoding is D988. In URIs, each byte is written separately and preceded by a percent sign %, a convention known as percent-encoding. So in the above example و would be written %D9%88. Below we break down an example from Wikipedia.
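To make the WAW example concrete, here is a small Python sketch. The function name percent_encode is an assumption made for illustration; the standard library's urllib.parse.quote does the same job for non-ASCII text and is shown alongside it.

```python
from urllib.parse import quote


def percent_encode(text: str) -> str:
    """Write every UTF-8 byte of text as a percent sign plus two hex digits."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


waw = "\u0648"  # the Arabic letter WAW

print(waw.encode("utf-8").hex().upper())  # the two UTF-8 bytes as hex: D988
print(percent_encode(waw))                # %D9%88
print(quote(waw))                         # %D9%88
```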
In the example below, we look at a Wikipedia article on Naguib Mahfouz or نجيب محفوظ . The page name is written in Arabic within the URL or URI. However, if you have ever tried to copy and paste one of these URLs, you will have found that what is pasted instead of the Arabic is a long string of percent signs, letters, and numbers like the example below. As mentioned above, your browser decodes this long UTF-8 string back into Arabic. Let’s dissect this code from left to right.
http://ar.wikipedia.org/wiki/نجيب_محفوظ – How the URL appears in your browser
http://ar.wikipedia.org/wiki/%D9%86%D8%AC%D9%8A%D8%A8_%D9%85%D8%AD%D9%81%D9%88%D8%B8 – The actual percent-encoded UTF-8 form of the Arabic URL.
%D9%86 = ن
%D8%AC = ج
%D9%8A = ي
%D8%A8 = ب
%D9%85 = م
%D8%AD = ح
%D9%81 = ف
%D9%88 = و
%D8%B8 = ظ
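The same breakdown can be reproduced in a few lines of Python. This is only a sketch: the helper name breakdown is hypothetical, and it relies on urllib.parse.unquote and quote from the standard library.

```python
from urllib.parse import quote, unquote

# The page name from the Wikipedia URL above, in its percent-encoded form.
ENCODED = "%D9%86%D8%AC%D9%8A%D8%A8_%D9%85%D8%AD%D9%81%D9%88%D8%B8"


def breakdown(encoded: str) -> list:
    """Pair each decoded character with its percent-encoded form."""
    # quote() leaves the underscore as-is, just as it appears in the URL.
    return [(quote(ch), ch) for ch in unquote(encoded)]


if __name__ == "__main__":
    print(unquote(ENCODED))
    for code, ch in breakdown(ENCODED):
        print(code, "=", ch)
```

The underscore between the two words is a plain ASCII character, so it passes through without any percent sign.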
Yes, the encoded words look backward
As you can see here, the encoded version stores the letters in reading order, first letter first. Because the encoded string is laid out from left to right, the first Arabic letter sits at the left end of the coded version, while in the decoded version it appears at the right end, since Arabic is displayed from right to left. This may have some interesting implications related to how these URLs are interpreted by search engines and their impact on SEO, but we will leave that discussion for another article.
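A short Python check can show what is going on with the direction. The function name first_letter_info is made up for this sketch; unicodedata is part of the standard library. The first character stored in the string is the first letter a reader pronounces, and its Unicode bidirectional class tells the display layer to lay it out from right to left.

```python
import unicodedata


def first_letter_info(text: str) -> tuple:
    """Return the name, bidi class, and UTF-8 hex of the first stored character."""
    first = text[0]
    return (
        unicodedata.name(first),
        unicodedata.bidirectional(first),  # "AL" means Arabic letter, shown right to left
        first.encode("utf-8").hex().upper(),
    )


if __name__ == "__main__":
    # The first word of the page name from the Wikipedia example.
    print(first_letter_info("نجيب"))
```

So the order of the bytes never changes; only the direction in which the decoded letters are drawn on screen does.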
Another Example
To illustrate further, here is another example from an Arabic website about cosmetic surgery. The website contains a page about different types of plastic surgery procedures which has a URL that appears in your browser as follows:
https://tajmeeli.com/عمليات-التجميل/
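However, when you copy the URL from your browser and paste it somewhere other than the browser bar, you get the following string of characters.
https://tajmeeli.com/%D8%B9%D9%85%D9%84%D9%8A%D8%A7%D8%AA-%D8%A7%D9%84%D8%AA%D8%AC%D9%85%D9%8A%D9%84/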
%D8%B9 = ع
%D9%85 = م
%D9%84 = ل
%D9%8A = ي
%D8%A7 = ا
%D8%AA = ت
%D8%A7 = ا
%D9%84 = ل
%D8%AA = ت
%D8%AC = ج
%D9%85 = م
%D9%8A = ي
%D9%84 = ل
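Going in the other direction, the sketch below percent-encodes only the path of a URL and leaves the scheme and host untouched. The function name encode_path is an assumption for this example; urlsplit, urlunsplit, and quote come from Python's standard urllib.parse module.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(url: str) -> str:
    """Percent-encode the path of a URL, keeping slashes and hyphens as they are."""
    parts = urlsplit(url)
    # quote() keeps "/" and "-" unescaped by default and encodes Arabic as UTF-8.
    return urlunsplit(parts._replace(path=quote(parts.path)))


if __name__ == "__main__":
    print(encode_path("https://tajmeeli.com/عمليات-التجميل/"))
```

The hyphen between the two Arabic words is an ordinary ASCII character, which is why it has no entry in the letter-by-letter list above.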
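Below you can find a list of UTF-8 codes for the most common Arabic characters so you can read most encoded URIs. In a URI, each code is split into two bytes with a percent sign before each, so D988 appears as %D9%88.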
UTF-8 | Arabic Character |
---|---|
D9AA | ٪ |
D9AD | ٭ |
D88C | ، |
D9A0 | ٠ |
D9A1 | ١ |
D9A2 | ٢ |
D9A3 | ٣ |
D9A4 | ٤ |
D9A5 | ٥ |
D9A6 | ٦ |
D9A7 | ٧ |
D9A8 | ٨ |
D9A9 | ٩ |
D89B | ؛ |
D89F | ؟ |
D8A1 | ء |
D8A2 | آ |
D8A3 | أ |
D8A4 | ؤ |
D8A5 | إ |
D8A6 | ئ |
D8A7 | ا |
D8A8 | ب |
D8A9 | ة |
D8AA | ت |
D8AB | ث |
D8AC | ج |
D8AD | ح |
D8AE | خ |
D8AF | د |
D8B0 | ذ |
D8B1 | ر |
D8B2 | ز |
D8B3 | س |
D8B4 | ش |
D8B5 | ص |
D8B6 | ض |
D8B7 | ط |
D8B8 | ظ |
D8B9 | ع |
D8BA | غ |
D980 | ـ |
D981 | ف |
D982 | ق |
D983 | ك |
D984 | ل |
D985 | م |
D986 | ن |
D987 | ه |
D988 | و |
D989 | ى |
D98A | ي |
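Rows like the ones in this table can be regenerated for any character with a couple of lines of Python. The names utf8_hex and table_rows are hypothetical helpers written for this sketch, not part of any library.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as an uppercase hex string."""
    return ch.encode("utf-8").hex().upper()


def table_rows(chars: str) -> list:
    """Build one table row per character, in the same layout as the table above."""
    return [f"{utf8_hex(ch)} | {ch} |" for ch in chars]


if __name__ == "__main__":
    for row in table_rows("ءابتو"):
        print(row)
```

This can be handy for characters the table does not cover, such as vowel marks or letters used in other languages written in the Arabic script.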