When translating a document name to a URL path segment, the following rules are used:
- Control characters (ASCII / Unicode / ISO/IEC 8859-1 codes below 0x1f and between 0x80 and 0x9f, such as tabs, carriage returns, line feeds, end-of-text, bell, etc.) are removed;
- A non-breaking space is converted into a regular space;
- Multiple spaces are merged into a single space;
- Spaces are converted to hyphens;
- All common printable characters (ASCII / Unicode / ISO/IEC 8859-1 codes in the range 0x21 to 0x7f) are used as-is, unless there is a replacement specified in table 1;
- All letters from the Unicode Latin-1 Supplement block are converted to their lowercase base letter, e.g. 0x00c0 (Latin Capital Letter A with grave) is converted to "a";
- All letters from the Unicode Latin Extended-A block are converted to their lowercase base letter, e.g. 0x0100 (Latin Capital Letter A with macron) is converted to "a";
- Other special characters (ASCII / Unicode / ISO/IEC 8859-1 codes outside the range 0x21 to 0x7f) that are not listed in table 2 or table 3 are converted to lowercase;
- Leading and trailing spaces are removed.
All rules and translation tables are applied, not just the first matching rule. So if one rule converts a character into a space, and another rule converts spaces to hyphens, then the character ends up as a hyphen.
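The rules above can be sketched in Python as follows. This is only an illustration, not the actual implementation: the names `REPLACEMENTS`, `to_base_letter` and `translate_name` are hypothetical, the replacement map is a small excerpt of tables 1 and 2, and the order of the steps (trim before converting spaces to hyphens, strip a trailing "." after trimming) is an assumption, since the rules do not state an order. Unicode decomposition is used here as a stand-in for the lookup tables; letters that do not decompose (such as ß or ø) still need explicit entries.

```python
import re
import unicodedata

# Excerpt of tables 1 and 2 only; the full tables contain more entries.
REPLACEMENTS = {
    "!": "", "%": "", "&": "", ",": "", "?": "",
    "*": " ", "+": " ", ":": " ", ";": " ",
    "/": "-", "=": "-", "|": "-", "~": "-",
    "$": "usd", "@": "-at-",
    "\u00a3": "gbp", "\u00df": "ss", "\u00e6": "ae", "\u00f8": "o",
}


def to_base_letter(ch):
    """Lowercase base letter of an accented letter (assumed approach: NFKD)."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def translate_name(name):
    """Translate a document name into a URL path segment (sketch)."""
    out = []
    for ch in name:
        code = ord(ch)
        # Control characters are removed. Treating 0x1f itself as a
        # control character is an assumption.
        if code <= 0x1F or 0x80 <= code <= 0x9F:
            continue
        if ch == "\u00a0":
            out.append(" ")                 # non-breaking space -> space
        elif ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])    # table lookup
        elif code > 0x7F:
            out.append(to_base_letter(ch))  # accented letter -> base letter
        else:
            out.append(ch)                  # printable ASCII used as-is
    text = re.sub(r" +", " ", "".join(out))  # merge multiple spaces
    text = text.strip(" ")                   # trim leading/trailing spaces
    text = text.rstrip(".").strip(" ")       # "." is removed at the end
    return text.replace(" ", "-")            # remaining spaces -> hyphens


print(translate_name("  Cr\u00e8me  br\u00fbl\u00e9e: 50% off!\t"))
# prints: Creme-brulee-50-off
```

Note how ":" first becomes a space and that space then becomes a hyphen, which is the chaining of rules described above.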
The following translation tables for printable characters are used:
Table 1: special handling of some regular printable characters

Character  Replacement
!          removed
"          removed
#          removed
$          usd
%          removed
&          removed
'          removed
(          removed
)          removed
*          converted into space
+          converted into space
,          removed
-          converted into hyphen
.          removed at end; otherwise not changed
/          converted into hyphen
:          converted into space
;          converted into space
<          removed
=          converted into hyphen
>          removed
?          removed
@          -at-
{          removed
|          converted into hyphen
}          removed
~          converted into hyphen
Table 2: ISO 8859-1 special characters

Character      Replacement
¡              removed
¢              ct
£              gbp
¤              removed
¥              yen
¦              -
§              removed
¨              removed
©              removed
ª              removed
«              removed
¬              removed
(soft hyphen)  -
®              removed
¯              -
°              removed
±              -
²              removed
³              removed
´              removed
µ              removed
¶              removed
·              removed
¸              removed
¹              removed
º              removed
»              removed
¼              removed
½              removed
¾              removed
Ð              d
Ø              o
Ù              u
Ú              u
Û              u
Ü              u
Ý              y
Þ              y
ß              ss
à              a
á              a
â              a
ã              a
ä              a
å              a
æ              ae
ç              c
è              e
é              e
ê              e
ë              e
ì              i
í              i
î              i
ï              i
ð              d
ñ              n
ò              o
ó              o
ô              o
õ              o
ö              o
÷              removed
ø              o
ù              u
ú              u
û              u
ü              u
ý              u
þ              y
ÿ              y
Table 3: translation of Unicode characters above 0xc200

Code  Replacement
c2a1  removed
c2a2  ct
c2a3  gbp
c2a4  removed
c2a5  yen
c2a6  removed
c2a7  removed
c2a8  removed
c2a9  removed
c2aa  removed
c2ab  removed
c2ac  removed
c2ad  -
c2ae  removed
c2af  -
c2b0  removed
c2b1  removed
c2b2  removed
c2b3  removed
c2b4  removed
c2b5  removed
c2b6  removed
c2b7  removed
c2b8  removed
c2b9  removed
c2ba  removed
c2bb  removed
c2bc  removed
c2bd  removed
c2be  removed
c2bf  removed
c380  a
c381  a
c382  a
c383  a
c384  a
c385  a
c386  ae
c387  c
c388  e
c389  e
c38a  e
c38b  e
c38c  i
c38d  i
c38e  i
c38f  i
c390  d
c391  n
c392  o
c393  o
c394  o
c395  o
c396  o
c397  x
c398  o
c399  u
c39a  u
c39b  u
c39c  u
c39d  y
c39e  y
c39f  ss
c3a0  a
c3a1  a
c3a2  a
c3a3  a
c3a4  a
c3a5  a
c3a6  ae
c3a7  c
c3a8  e
c3a9  e
c3aa  e
c3ab  e
c3ac  i
c3ad  i
c3ae  i
c3af  i
c3b0  d
c3b1  n
c3b2  o
c3b3  o
c3b4  o
c3b5  o
c3b6  o
c3b7  removed
c3b8  o
c3b9  u
c3ba  u
c3bb  u
c3bc  u
c3bd  y
c3be  y
c3bf  y
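The four-digit codes in table 3 appear to be two-byte UTF-8 encodings written in hexadecimal rather than Unicode code points; for example, c3a9 is the UTF-8 encoding of é (U+00E9). Assuming that reading is correct, a code can be mapped back to its character as sketched below. The helper name `key_to_char` is hypothetical.

```python
def key_to_char(hex_key):
    """Decode a table 3 code such as "c3a9", read as UTF-8 bytes in hex."""
    return bytes.fromhex(hex_key).decode("utf-8")


# Show the code point and character behind a few table 3 codes.
for key in ("c2a3", "c3a9", "c39f"):
    ch = key_to_char(key)
    print(key, f"U+{ord(ch):04X}", ch)
```

Under this interpretation, table 3 covers the same characters as table 2 (U+00A1 to U+00FF), keyed by encoded bytes instead of by the character itself.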