Text encoding

A text encoding is a method of representing a piece of text as a sequence of codes (from a character encoding) for the purpose of computer storage or electronic communication of that text. While character encodings like ASCII represent individual characters of a language, a text encoding has to represent much larger things like articles and books, and must represent not only the characters they contain but also the structure and organization of the text, and perhaps information about the text or its appearance. Common examples are HTML and RTF, which represent texts in natural languages, and XML, which can represent many kinds of text not necessarily intended to be human-readable (the contents of a database, for example).

In general, two basic forms of text encoding are widely used. One is to use a markup language, which adds markers to the text itself. Markup has the advantage of being easy to represent, but has the disadvantage of being hard to view without an "aware" reader application. HTML, for instance, is generally unreadable if opened in a text editor, at least to those unfamiliar with the format. The other method is to use "pointers" into the text, which is left in its original format. This has the advantage of keeping the content easily readable in any editor, although the "styling" is lost in such a view. On the downside, editing such a document in a non-aware application typically leaves the pointers pointing to the wrong data. Today the majority of text encoding systems appear to use markup, although whether by choice or simply because "everyone else" does is open to question.
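The difference between the two approaches can be illustrated with a minimal Python sketch. The `(start, end, style)` tuple layout and the function names `annotated_spans` and `to_markup` are assumptions made for this illustration only; they do not correspond to any particular encoding system, and the sketch handles only non-overlapping spans.

```python
# Inline markup: the markers live inside the text itself.
marked_up = "Call me <em>Ishmael</em>."

# "Pointers": the text is left untouched and annotations refer to it
# by character offset. The (start, end, style) layout is hypothetical.
plain = "Call me Ishmael."
annotations = [(8, 15, "em")]


def annotated_spans(text, spans):
    """Return the substring each pointer currently selects, with its style."""
    return [(text[start:end], style) for start, end, style in spans]


def to_markup(text, spans):
    """Render non-overlapping pointer annotations as inline markup."""
    out, pos = [], 0
    for start, end, style in sorted(spans):
        out.append(text[pos:start])
        out.append(f"<{style}>{text[start:end]}</{style}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


# An edit made in a non-aware editor shifts the text but not the offsets.
edited = "Please call me Ishmael."

print(annotated_spans(plain, annotations))   # the pointer selects "Ishmael"
print(annotated_spans(edited, annotations))  # the same offsets now select the wrong text
```

After the edit, the stored offsets still select characters 8 to 15, which no longer cover the name. This is the failure mode described above: the plain text stays readable, but the pointers silently drift.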

Though character encodings like ASCII and Unicode are not, strictly speaking, text encodings in their own right, they may serve as very simple text encodings if one wishes only to preserve the English content of a document and not necessarily its formatting. By far the most common text encoding now in use is what might informally be called "Plain ASCII", which simply encodes a text as a stream of ASCII characters. The specifics of how this is done vary greatly. For example, the end of a text line might be encoded as ASCII code 10 ("line feed" or "new line"), as is common practice on Unix machines; as ASCII code 13 ("carriage return"), as was common on older Apple machines; or as both (the sequence <13, 10> is used to end lines on MS-DOS based machines and many others, while the rather rare sequence <10, 13> was used by some Acorn machines). Some texts also use this line-end sequence inside paragraphs (with a blank line between paragraphs), while others do not. Various texts in this form also interpret code 9 ("tab") and other control characters differently. None of these conventions specifies how to identify text structure like headings and tables, or special text forms like italics. Text in this format is readable by essentially any computer, though some work might be needed to accommodate local variations, and all information besides the actual words of the text is lost.
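As an illustration of the work needed to accommodate these local variations, the following Python sketch rewrites the four line-end conventions mentioned above into a single one. The function name `normalize_line_ends` and its `newline` parameter are assumptions made for this example. The sketch also assumes that a text uses its conventions unambiguously: in a text that mixes them, the pair <10, 13> cannot be told apart from a line feed followed by a separate carriage return, and here it is always treated as one line end.

```python
import re

# Two-character sequences are listed first so that <13, 10> and <10, 13>
# are each treated as one line end rather than two.
_LINE_END = re.compile(r"\r\n|\n\r|\r|\n")


def normalize_line_ends(text, newline="\n"):
    """Replace every recognized line-end sequence with `newline`."""
    return _LINE_END.sub(newline, text)


# One sample containing each convention in turn:
# <13, 10>, then <13>, then <10>, then <10, 13>.
sample = "a\r\nb\rc\nd\n\re"
print(normalize_line_ends(sample).split("\n"))
```

Normalizing only the line ends leaves the other variations untouched. Whether a line end inside a paragraph is a real break, and how wide a tab is, still has to be decided by whoever reads the text.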