Understanding Arabic URL & URI Structure Encoding for Arabic Sites

If you have spent any time working with newer websites in Arabic, you are sure to have come across Arabic URLs or URIs. It’s great that Arabic speakers can now see the address of a website in their own language and understand it. Behind the scenes, however, what the user sees in the browser address bar is a URL that is actually encoded in a specific way, so that modern browsers like Chrome, Firefox, and Internet Explorer can interpret the code and display Arabic characters. Modern search engines that are compatible with Arabic, like Google Arabic, can also interpret this encoding, so you see Arabic characters in the display URLs on search result pages. In this article we will examine Arabic URL and URI encoding to get a better understanding of what lies behind the Arabic website address you see in your browser.

 


Arabic IDNs

To start with, let’s discuss what are referred to as IDNs, or Internationalized Domain Names. Arabic IDNs include domains written in the Arabic script, like “.مصر” or “.امارات”. Since the Domain Name System (DNS) only handles a limited set of ASCII characters (Latin letters, digits, and the hyphen), these domain names have to be converted to an ASCII form before the DNS can read them properly. Unlike the system used for converting Arabic website directories or folders into something your browser can understand, there isn’t a one-to-one correlation between how a letter is written in Arabic and how it appears in the ASCII form of an IDN. There are, however, several converters online that use what is called Punycode to convert Arabic script domains into their ASCII equivalent, which always begins with the prefix “xn--”. Understanding and analyzing URL directories, or what are called URIs, is much easier, and we will go into that next.
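As a rough illustration of the conversion described above, the following Python sketch uses the `idna` codec from Python's standard library to turn an Arabic label into its ASCII form and back. The helper names `to_ascii` and `to_arabic` are made up for this example, and the built-in codec follows the older IDNA 2003 rules, so treat it as a sketch rather than a complete IDN converter.

```python
def to_ascii(label: str) -> str:
    """Convert one Arabic domain label to its ASCII (Punycode) form."""
    return label.encode("idna").decode("ascii")


def to_arabic(ascii_label: str) -> str:
    """Convert an ASCII 'xn--' label back to Arabic script."""
    return ascii_label.encode("ascii").decode("idna")


egypt = to_ascii("مصر")
print(egypt)             # xn--wgbh1c
print(to_arabic(egypt))  # مصر
```

Notice that the ASCII form gives no visual hint of the individual Arabic letters, which is why a converter is needed in both directions.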

 

Arabic URI structure analyzed

In order for Arabic characters to be displayed in URLs in your browser, the characters are encoded in UTF-8, a Unicode encoding in which each Arabic letter takes two bytes, usually written as a four-character hexadecimal string. An example would be the Arabic letter و WAW, which is converted to D988. In URIs, however, each byte is written as a percent sign % followed by its two hexadecimal digits, a scheme known as percent-encoding. So in the above example و would be written %D9%88. Below we break down an example from Wikipedia.
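To see the encoding of a single letter in practice, this short Python sketch uses `quote` and `unquote` from the standard library's `urllib.parse` module; the variable names are only illustrative. It prints the two UTF-8 bytes of و and then the form in which they appear in a URI, with a percent sign in front of each byte.

```python
from urllib.parse import quote, unquote

letter = "و"  # WAW, U+0648

utf8_hex = letter.encode("utf-8").hex().upper()  # the raw UTF-8 bytes as hex
encoded = quote(letter)                          # percent-encoded form for a URI
decoded = unquote(encoded)                       # back to the Arabic letter

print(utf8_hex)  # D988
print(encoded)   # %D9%88
print(decoded)   # و
```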

In the example below we look at the Wikipedia article on Naguib Mahfouz, or نجيب محفوظ. We see that the page name is written in Arabic within the URL or URI. However, if you have ever tried to copy and paste one of these URLs, you will find that what is pasted instead of the Arabic is a long string of percent signs, letters, and numbers like the example below. As mentioned above, your browser decodes this long UTF-8 encoded string into Arabic. Let’s dissect this code from left to right.

http://ar.wikipedia.org/wiki/نجيب_محفوظ  – How the URL appears in your browser

http://ar.wikipedia.org/wiki/%D9%86%D8%AC%D9%8A%D8%A8_%D9%85%D8%AD%D9%81%D9%88%D8%B8  – The actual percent-encoded (UTF-8) form of the Arabic URL.

%D9%86 = ن

%D8%AC = ج

%D9%8A = ي

%D8%A8 = ب

%D9%85 = م

%D8%AD = ح

%D9%81 = ف

%D9%88 = و

%D8%B8 = ظ
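
The breakdown above can be reproduced with a few lines of Python. This sketch decodes the Wikipedia page name with `urllib.parse.unquote` and then re-encodes it one character at a time; `encoded_name` and `breakdown` are names chosen for this example only.

```python
from urllib.parse import quote, unquote

encoded_name = "%D9%86%D8%AC%D9%8A%D8%A8_%D9%85%D8%AD%D9%81%D9%88%D8%B8"

# Decode the whole page name back into Arabic.
page_name = unquote(encoded_name)

# Pair every character with its own percent-encoded form.
breakdown = [(quote(ch), ch) for ch in page_name]

print(page_name)
for code, ch in breakdown:
    print(code, "=", ch)
```

The underscore is one of the characters that may appear in a URL unencoded, so it passes through unchanged.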

 

Yes, the encoded words run left to right

As you can see, although the Arabic letters appear from right to left in the decoded version of the URL, the encoded version is written from left to right. The letters themselves are not reversed: they are stored in reading order, so the first code in the string, %D9%86, is ن, the first letter of نجيب, and only the direction in which they are displayed differs. This may have some interesting implications for how these URLs are interpreted by search engines and their impact on SEO, but we will save that discussion for another article.
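
A quick way to check the order of the letters is to look at the decoded string position by position. In this Python sketch (the variable names are illustrative) index 0 holds ن, the first letter a reader would pronounce, which suggests that the string is kept in reading order and that the right-to-left layout is applied only when the text is displayed.

```python
from urllib.parse import unquote

first_word = unquote("%D9%86%D8%AC%D9%8A%D8%A8")  # نجيب

# Index 0 is the first letter in reading order, not the leftmost glyph on screen.
for position, letter in enumerate(first_word):
    print(position, letter, f"U+{ord(letter):04X}")
```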

 

Another Example

To illustrate further, here is another example from an Arabic website about cosmetic surgery. The website contains a page about different types of plastic surgery procedures, and its URL appears in your browser as follows:

https://tajmeeli.com/عمليات-التجميل/

However, when you copy the URL from your browser and paste it somewhere other than the address bar, you get the following string of characters.

https://tajmeeli.com/%D8%B9%D9%85%D9%84%D9%8A%D8%A7%D8%AA-%D8%A7%D9%84%D8%AA%D8%AC%D9%85%D9%8A%D9%84/

 

%D8%B9 = ع

%D9%85 = م

%D9%84 = ل

%D9%8A = ي

%D8%A7 = ا

%D8%AA = ت

%D8%A7 = ا

%D9%84 = ل

%D8%AA = ت

%D8%AC = ج

%D9%85 = م

%D9%8A = ي

%D9%84 = ل
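
Going in the other direction, the readable form of this URL can be turned into the encoded one. The Python sketch below is one possible approach, assuming only the path needs encoding: it splits the URL with `urlsplit`, percent-encodes the path with `quote`, and reassembles it. The function name `encode_path` is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(url: str) -> str:
    """Percent-encode only the path of a URL, leaving scheme and host alone."""
    parts = urlsplit(url)
    # quote() keeps "/" and "-" as they are and encodes the Arabic letters.
    return urlunsplit(parts._replace(path=quote(parts.path)))


readable = "https://tajmeeli.com/عمليات-التجميل/"
print(encode_path(readable))
```

Running this on a URL that is already encoded would encode the percent signs a second time, so it should only be applied to the readable form.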

 

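Below you can find a list of the UTF-8 codes for the most common Arabic letters, digits, and punctuation marks so you can decode an encoded URI. In a URI each pair of hexadecimal digits is preceded by a percent sign, so D988 appears as %D9%88.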

 

UTF-8 Arabic Character
D9AA ٪
D9AD ٭
D88C ،
D9A0 ٠
D9A1 ١
D9A2 ٢
D9A3 ٣
D9A4 ٤
D9A5 ٥
D9A6 ٦
D9A7 ٧
D9A8 ٨
D9A9 ٩
D89B ؛
D89F ؟
D8A1 ء
D8A2 آ
D8A3 أ
D8A4 ؤ
D8A5 إ
D8A6 ئ
D8A7 ا
D8A8 ب
D8A9 ة
D8AA ت
D8AB ث
D8AC ج
D8AD ح
D8AE خ
D8AF د
D8B0 ذ
D8B1 ر
D8B2 ز
D8B3 س
D8B4 ش
D8B5 ص
D8B6 ض
D8B7 ط
D8B8 ظ
D8B9 ع
D8BA غ
D980 ـ
D981 ف
D982 ق
D983 ك
D984 ل
D985 م
D986 ن
D987 ه
D988 و
D989 ى
D98A ي
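
The codes in this table do not have to be memorized, because they follow from each character's Unicode code point. As a minimal sketch, the hypothetical helper `utf8_two_bytes` below applies the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx) by hand and cross-checks the result against Python's own encoder. It also shows why every code in the table starts with D8 or D9.

```python
def utf8_two_bytes(ch: str) -> str:
    """Hand-compute the two-byte UTF-8 form (U+0080..U+07FF) as hex."""
    cp = ord(ch)
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("character does not use two bytes in UTF-8")
    lead = 0xC0 | (cp >> 6)     # 110xxxxx: top five bits of the code point
    trail = 0x80 | (cp & 0x3F)  # 10xxxxxx: low six bits of the code point
    return f"{lead:02X}{trail:02X}"


for ch in "ءابتوي٠٩،؟":
    code = utf8_two_bytes(ch)
    # Cross-check against Python's built-in UTF-8 encoder.
    assert code == ch.encode("utf-8").hex().upper()
    print(code, ch)
```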

 

We hope this article was helpful and please don’t hesitate to leave feedback if you have any additional insight.