Encoding & Character Sets

Introduction

Computers don’t read letters. They work in terms of “on” and “off,” represented in binary code as 1s and 0s. To use computers effectively, we need encoding and decoding to translate between human and computer languages. Encoding is the process of taking something human-readable, such as the English alphabet, and transforming it into something computer-readable, like binary code. Decoding is the opposite: taking that computer code and translating it back into a human language.

Number Systems

Decimal

The human numbering system is decimal, referred to by computer scientists as “base10” because each digit can take one of ten values, 0 through 9. Decimal is written big endian, which means that the least significant digit is on the right and the most significant is on the left. Each position is worth ten times the position to its right, which is why adding 1 to 9 gives 10, to 99 gives 100, and to 999 gives 1,000. Therefore, the first digit (starting on the right) is the 1s value, the second is the 10s, the third is the 100s, et cetera.

Example: In the number 1,234, the 4 is the least significant digit because it represents a value of 4. 1 is the most significant because it represents a value of 1,000.

Binary

Instead of the human decimal system, computers use binary - a base2 system. Each digit has two potential values - on or off - represented by human numbers 1 and 0. Like decimal, binary is generally big endian, meaning that the significance of the digit increases from right to left.

For binary, the value of each digit doubles. Therefore, the first digit represents 1, the second 2, then 4, 8, 16, 32, 64, 128….

00000001 = 1
00000010 = 2
00000011 = 3
00000100 = 4
00000101 = 5
00000110 = 6
00000111 = 7
00001000 = 8

Each 1 or 0 in binary is called a “bit,” or binary digit. A collection of eight bits, like each of the examples above, is called a “byte.” Each byte can represent a number from 0 to 255.
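
As a small illustration of the place values described above, the Python sketch below adds up the value of every 1 bit in a byte. The helper name bits_to_int is made up for this example and is not part of any Aerialink API.

```python
def bits_to_int(bits: str) -> int:
    """Sum the place value of every 1 bit, reading from right to left."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position  # place values: 1, 2, 4, 8, 16, ...
    return total


print(bits_to_int("00000101"))  # 5
print(bits_to_int("11111111"))  # 255, the largest value one byte can hold
print(format(8, "08b"))         # 00001000, the built-in way to go the other direction
```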

Hexadecimal

Hexadecimal is a base16 number system used by programmers as a compact way to write binary values: each hex digit stands for exactly four bits. Like decimal and binary, hex uses the left digit as the most significant. Each digit can have one of 16 values from 0 to F: 0-9, then A-F for 10 through 15.

For identification purposes, “0x” precedes hex values to indicate that they are hex. Hex uses two characters to represent one byte:

0x00 = 0
0x01 = 1
0x09 = 9
0x0A = 10
0x0F = 15
0x10 = 16
0x17 = 23
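
The conversions listed above can be reproduced with Python's built-in formatting. This is only a sketch; the helper name to_hex_byte is an assumption made for the example.

```python
def to_hex_byte(value: int) -> str:
    """Format a value from 0 to 255 as two hex digits with the 0x prefix."""
    if not 0 <= value <= 255:
        raise ValueError("a single byte holds 0 through 255")
    return f"0x{value:02X}"


for value in (0, 1, 9, 10, 15, 16, 23):
    # Show hex, decimal and binary side by side.
    print(to_hex_byte(value), "=", value, "=", format(value, "08b"))

# int() with base 16 converts the other way and accepts the 0x prefix.
print(int("0x17", 16))
```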

Character Sets

Character sets are described as 7-bit, 8-bit or 16-bit-per-character. The bit count sets the largest number that can be used to identify a character, and so limits the number of unique characters available in the set’s “alphabet”: 128 for 7 bits, 256 for 8 bits and 65,536 for 16 bits. It also determines how much data each character costs to send. For instance, a message sent in a 16-bit character set uses twice as much data per character as one sent in an 8-bit character set. The tradeoff depends on what you are trying to accomplish.

ASCII: the most basic English character set.
GSM7: a 7-bit character set loosely based on ASCII and created for use with text messages.
Latin1: an enhanced version of ASCII which includes some additional characters.
Latin9: an update to Latin1 which replaces some of Latin1’s characters with others, such as the euro sign (€).
Unicode: a two-byte-per-character set that allows a much wider variety of characters.
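
To compare how much data the same text costs in several of these character sets, the sketch below uses Python's standard codecs. The codec names are Python's own ("latin-1" for Latin1, "iso8859-15" for Latin9, "utf-16-be" for two-byte Unicode); Python ships no GSM7 codec, so GSM7 is left out. The helper name encoded_size is hypothetical.

```python
def encoded_size(text: str, codec: str) -> int:
    """Return how many bytes `text` occupies before any bit packing."""
    return len(text.encode(codec))


for codec in ("ascii", "latin-1", "iso8859-15", "utf-16-be"):
    print(codec, encoded_size("Hello", codec))

# The euro sign is one of the characters present in Latin9 but not in Latin1.
print("€".encode("iso8859-15").hex())
try:
    "€".encode("latin-1")
except UnicodeEncodeError:
    print("the euro sign cannot be encoded in latin-1")
```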

Note: Aerialink will pass along whatever character set you indicate to us in your API call (via the dcs value). We will not alter the encoding on our end.

Bit Packing

All character sets with the exception of Unicode are technically 8-bits-per-character. That is to say, they are first encoded as 8 bits per character, and can then be bit packed so that each character ultimately costs only seven bits. Bit packing is the act of removing the first bit of each character, the value of which is always “0” in a 7-bit set, in order to fit more data in a single message. This allows for 160 characters in 140 bytes of data (140 × 8 = 1,120 bits, and 1,120 ÷ 7 = 160).
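
A minimal sketch of bit packing in Python follows. It assumes the packing order used by GSM 03.38, in which the low bits of each character fill the free bits of the previous byte; the function name pack_septets is made up for this example, and a carrier or gateway may pack differently.

```python
def pack_septets(septets: bytes) -> bytes:
    """Pack 7-bit character codes into 8-bit bytes, dropping each unused top bit."""
    accumulator = 0  # bits waiting to be written out
    bit_count = 0    # how many bits are currently waiting
    packed = bytearray()
    for code in septets:
        accumulator |= (code & 0x7F) << bit_count  # keep only the low 7 bits
        bit_count += 7
        while bit_count >= 8:
            packed.append(accumulator & 0xFF)
            accumulator >>= 8
            bit_count -= 8
    if bit_count:
        packed.append(accumulator & 0xFF)  # final partial byte, zero padded
    return bytes(packed)


# Lowercase letters have the same codes in ASCII and GSM.
print(pack_septets(b"hello").hex())  # 5 characters fit in 5 bytes instead of 5 * 8 bits
print(len(pack_septets(bytes(160))))  # 160 characters occupy 140 bytes
```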

ASCII

ASCII is the most basic English character set. Standardized at 7 bits per character when bit packed, it has a maximum of 128 characters.

Basic ASCII

	0	1	2	3	4	5	6	7
0	NUL	DLE	spc	0	@	P	`	p
1	SOH	DC1 XON	!	1	A	Q	a	q
2	STX	DC2	"	2	B	R	b	r
3	ETX	DC3 XOFF	#	3	C	S	c	s
4	EOT	DC4	$	4	D	T	d	t
5	ENQ	NAK	%	5	E	U	e	u
6	ACK	SYN	&	6	F	V	f	v
7	BEL	ETB	'	7	G	W	g	w
8	BS	CAN	(	8	H	X	h	x
9	HT	EM	)	9	I	Y	i	y
A	LF	SUB	*	:	J	Z	j	z
B	VT	ESC	+	;	K	[	k	{
C	FF	FS	,	<	L	\	l	|
D	CR	GS	-	=	M	]	m	}
E	SO	RS	.	>	N	^	n	~
F	SI	US	/	?	O	_	o	DEL

There is an extension set for ASCII to add some additional characters, but it is not typically used for SMS.

GSM

GSM was developed as a 7-bit alternative to ASCII that supports languages other than English. It reuses the positions which ASCII reserves for “control” characters for visible printed characters. As you can see below, most control characters have been replaced with viewable letters and figures.

Basic GSM

	0x00	0x10	0x20	0x30	0x40	0x50	0x60	0x70
0x00	@	Δ	SP	0	¡	P	¿	p
0x01	£	_	!	1	A	Q	a	q
0x02	$	Φ	"	2	B	R	b	r
0x03	¥	Γ	#	3	C	S	c	s
0x04	è	Λ	¤	4	D	T	d	t
0x05	é	Ω	%	5	E	U	e	u
0x06	ù	Π	&	6	F	V	f	v
0x07	ì	Ψ	'	7	G	W	g	w
0x08	ò	Σ	(	8	H	X	h	x
0x09	Ç	Θ	)	9	I	Y	i	y
0x0A	LF	Ξ	*	:	J	Z	j	z
0x0B	Ø	ESC	+	;	K	Ä	k	ä
0x0C	ø	Æ	,	<	L	Ö	l	ö
0x0D	CR	æ	-	=	M	Ñ	m	ñ
0x0E	Å	ß	.	>	N	Ü	n	ü
0x0F	å	É	/	?	O	§	o	à

Basic GSM Extension

GSM can be extended with a small set of characters which are accessed by first sending the escape character (0x1B), then the code of the character in the table below. For instance, 0x1B40 is |.

	0x00	0x10	0x20	0x30	0x40	0x60
0x00					|
0x01
0x02
0x03
0x04		^
0x05						€
0x06
0x07
0x08			{
0x09			}
0x0A	FF
0x0B		SS2
0x0C				[
0x0D	CR2			~
0x0E				]
0x0F			\
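
The escape mechanism can be sketched in Python as a lookup from character to its two-code sequence. The codes are read from the extension table above (column value plus row value), and 0x1B is the escape code that GSM 03.38 assigns to ESC in the basic table. The names GSM_ESCAPE, GSM_EXTENSION and encode_extension are hypothetical.

```python
GSM_ESCAPE = 0x1B  # ESC in the basic GSM table

# Extension characters and their codes, taken from the table above.
GSM_EXTENSION = {
    "\f": 0x0A,  # FF (form feed)
    "^": 0x14,
    "{": 0x28,
    "}": 0x29,
    "\\": 0x2F,
    "[": 0x3C,
    "~": 0x3D,
    "]": 0x3E,
    "|": 0x40,
    "€": 0x65,
}


def encode_extension(char: str) -> bytes:
    """Return the escape code followed by the extension code for one character."""
    return bytes([GSM_ESCAPE, GSM_EXTENSION[char]])


print(encode_extension("|").hex())  # escape code, then 40
print(encode_extension("€").hex())  # escape code, then 65
```

Because every extension character is sent as two codes, it takes up the space of two ordinary characters in a message.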

Unicode

UTF-16 Unicode is a 16-bit character set which is far too large to show in a simple chart: two bytes allow 65,536 possible values. Using two bytes per character greatly increases the number of characters available, from simple characters such as A (0x0041) to symbols like ☺ (0x263A). Characters outside this 16-bit range, such as most emoji, are encoded in UTF-16 as a pair of 16-bit units and therefore take four bytes.
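
The two example code values can be checked with Python's "utf-16-be" codec (big endian, no byte order mark). The helper name utf16_hex is an assumption for this sketch.

```python
def utf16_hex(text: str) -> str:
    """Return the UTF-16 (big endian) bytes of `text` as uppercase hex."""
    return text.encode("utf-16-be").hex().upper()


print(utf16_hex("A"))  # 0041
print(utf16_hex("☺"))  # 263A

# 70 two-byte characters fill the 140 bytes available in one SMS.
print(len(("x" * 70).encode("utf-16-be")))

# A character beyond the 16-bit range takes two 16-bit units (four bytes).
print(len("😀".encode("utf-16-be")))
```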

Data Coding Schemes

The SMPP standard allows 140 bytes for the data portion of a message. This is the limit of message content for any SMS. To translate this data from binary to text, we must know how the data is encoded (that is to say, which character set is intended to be used). To indicate this, an optional parameter can be included in an API call. This is the “dcs” value, which signifies the character set which should be used to translate the data.

The table below indicates the DCS values and the character sets they correspond to based on sender ID type.

DCS Value	Sender ID	Character Set	Bits per Character	Characters per Single Message	Max per Concat. Part
0 (Default)	Short Code, Long Code	ASCII	7	160*	153
0 (Default)	8XX (Toll-Free)	GSM	7	160*	153
1	Short Code, Long Code	ASCII	8**	140**	136**
1	8XX (Toll-Free)	Latin-9	8	140	134
3	All	Latin-1	8	140	134
8	All	UTF-16	16	70	67

* U.S. Cellular imposes a unique limit of 150 characters per message.
** While ASCII can be bit-packed into 7 bits per character, and this is done for DCS 0, it is not always done for DCS 1. We therefore advise customers to expect DCS 1 to cost 8 bits per character in all cases. For more information, read about Bit Packing above.

Message Concatenation

A single SMS message has a maximum length of 140 bytes. This translates to 160, 140 or 70 characters per message depending on the bit size of each character. Note that including “special characters” not found in GSM may change the way your message is encoded, which reduces the number of characters that fit. Furthermore, GSM, while it is a 7-bit character set, includes the following extension characters, each of which takes the space of two characters (the escape character plus the character itself) and may cause your messages to exceed the 160-character GSM limit: |, ^, {, }, €, [, ~, ] and \

Long messages exceeding 140 bytes are cut into multiple messages (segments) and, when supported by the carrier, stitched back together into a single long message on the destination handset. These segments have a reduced character length to accommodate the UDH (User Data Header), which is used to reunite the message segments at the destination. This header occupies six bytes (48 bits) of data, so the number of characters given up to the UDH varies with the bit size of each character.
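
The arithmetic behind these limits can be sketched as follows: take the 140 bytes, subtract the six UDH bytes for a concatenated segment, and divide the remaining bits by the bits per character, rounding down. The function and constant names are made up for this example.

```python
SMS_BYTES = 140  # data portion of a single SMS
UDH_BYTES = 6    # header added to each concatenated segment


def chars_per_message(bits_per_char: int, concatenated: bool = False) -> int:
    """Return how many whole characters fit in one message or segment."""
    payload_bytes = SMS_BYTES - (UDH_BYTES if concatenated else 0)
    return (payload_bytes * 8) // bits_per_char  # partial characters do not fit


for bits in (7, 8, 16):
    print(bits, chars_per_message(bits), chars_per_message(bits, concatenated=True))
# 7 160 153
# 8 140 134
# 16 70 67
```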

For more information about individual character sets, see the preceding sections.

Quick Facts About Concatenation

Billing: Although a concatenated message is composed as one thought and sent in one request, it is charged as separate messages.
UDH Header Length: The “character” length of the UDH header changes depending on the bit-size of the characters in the character set used.
Automatic Segmentation: Long messages sent to the Aerialink HTTP API v4 will be segmented automatically and sent to the carrier SMSC with a UDH indicating that the messages are concatenated segments of a whole which are intended to be knit back together.
Your Application: Your application will need to specify the segmentation in the message.
Message Order: Aerialink sends message segments to the carrier in the order we receive them, but messages sent to carriers who don’t support concatenation may not always arrive at the destination handset in order.
Industry Abilities: Some providers offer “250 characters per message,” but this is a marketing spin - they are using concatenation to achieve this.
Carrier Support: Not all carrier networks support concatenation, and some of them support it only in some regions or for some handsets.

Example Concatenation Scenario

Your customer submits the following message content via API:

How now brown cow. See the quick brown fox jump over the lazy dog. Now is the time for all men to come to the aid of their country. How much wood would a wood chuck chuck, if a wood chuck could chuck wood?

The message is 208 characters long. A single SMS cannot exceed a total of 140 bytes. Messages in excess of 140 bytes are split into as many segments as it takes to send the message in parts no longer than 140 bytes. In this case, assuming we’re sticking to simple ASCII characters, the message is sent as two segments.

Segment 1: How now brown cow. See the quick brown fox jump over the lazy dog. Now is the time for all men to come to the aid of their country. How much wood would a
Segment 2: wood chuck chuck, if a wood chuck could chuck wood?

The user would receive the whole message as one long string on their phone, assuming their device and carrier support concatenation. Users with handsets and/or carriers who do not support concatenation would receive each segment as an individual message. So, from your system you see just one submit, but for our system to process this message we have to split it up into two segments - two messages. We charge per segment because the carriers charge per segment.

Note that the DCS (Data Coding Scheme) selection used to submit the message will determine how it is segmented. The example above depicts how the default DCS value of 0 will package and segment the message with 7-bit encoding. Using another DCS value such as 1 or 8 would segment the message differently. Please refer to the chart in the Data Coding Schemes section above for the number of characters per segment for each DCS value.
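
The following Python sketch shows how the DCS value changes segmentation. The limits are copied from the DCS table above for DCS 0, 3 and 8; the names LIMITS and split_message are hypothetical, and this is not how the Aerialink platform is implemented. It counts one position per character, so it does not account for GSM extension characters, which take two positions each.

```python
# DCS value -> (characters in a single message, characters per concatenated segment)
LIMITS = {0: (160, 153), 3: (140, 134), 8: (70, 67)}


def split_message(text: str, dcs: int = 0) -> list[str]:
    """Split `text` into the segments that would be needed for the given DCS."""
    single_limit, segment_limit = LIMITS[dcs]
    if len(text) <= single_limit:
        return [text]  # fits in one SMS, no UDH required
    return [text[i:i + segment_limit] for i in range(0, len(text), segment_limit)]


message = "a" * 208
print([len(part) for part in split_message(message, dcs=0)])  # two segments
print([len(part) for part in split_message(message, dcs=8)])  # four segments
```

The same 208-character message that needs two segments with DCS 0 needs four with DCS 8, which matters because each segment is charged separately.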

Concatenated Message Billing

Although sent in a single request, concatenated messages are composed of multiple message transactions, each with its own transaction GUID (Global Unique Identifier). Each message segment counts and is charged as a separate message. In other words, the cost of a concatenated message is the sum of the costs of its message segments.

Tips & Best Practices

Limit the character count of each message to 160 whenever possible to ensure that the message will not be broken into multiple segments that risk arriving out of order or incomplete.
If possible, pass UDH header details in the message via an SMPP bind.

User Data Header

A UDH can be provided in SMS strings sent via GSM to dictate a message’s format and processing and is applicable to message concatenation, SMS containing enriched text content as well as MMS (Multimedia Messaging Service).

Octet	Designates	Possible Value	Details
1st	Overall UDH Length	05	Excludes first octet.
2nd	Information Element Identifier	00	CSMS 8-bit reference number.
3rd	UDH Content Length	03	Excludes first three octets.
4th	CSMS Group Reference Number	00-FF	One of 256 possible values. Must be the same for all segments of CSMS.
5th	Total Number of Segments	00-FF	Must remain constant for all segments of message. If the value is zero, UDH will be ignored.
6th	Segment Number	00-FF	The current segment’s number in sequential order of total segments (“part 1 of 4,” in which 04 is the value of octet 5 and 01 is the value of octet 6). The value should begin with 01 as the first segment in the series. If the value of octet 6 is zero or is greater than the value of octet 5, UDH will be ignored.
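
The six octets in the table can be assembled as in the Python sketch below. The function name build_udh is made up for this example. Because the table says a UDH with a zero total or an out-of-range segment number is ignored, this sketch rejects those values instead of producing them.

```python
def build_udh(reference: int, total_segments: int, segment_number: int) -> bytes:
    """Build the 6-octet concatenation UDH described in the table above."""
    if not 0x00 <= reference <= 0xFF:
        raise ValueError("reference number must fit in one octet")
    if not 0x01 <= total_segments <= 0xFF:
        raise ValueError("total segments must be 1 through 255")
    if not 0x01 <= segment_number <= total_segments:
        raise ValueError("segment number must be 1 through total segments")
    return bytes([
        0x05,            # 1st octet: overall UDH length, excluding this octet
        0x00,            # 2nd octet: identifier for CSMS with 8-bit reference
        0x03,            # 3rd octet: length of the three octets that follow
        reference,       # 4th octet: same for all segments of one message
        total_segments,  # 5th octet
        segment_number,  # 6th octet: starts at 1
    ])


# "Part 1 of 4" for a message with reference number 0x2A.
print(build_udh(0x2A, 4, 1).hex())
```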

Message Length

While the UDH does not contain any of a message’s text content itself, it does take space from an SMS’s already limited character length, which is why concatenated segments (as listed above) are shortened from 160 characters to 153 characters. The longer your UDH (and the more robust its information), the shorter the available text content within its SMS becomes.

Deliverability

Although Aerialink can fully process UDH content, its deliverability is subject to the processing capabilities of all carriers the message passes through. If a carrier cannot process the UDH correctly, your message’s UDH content may not display. This can result in concatenated messages arriving split, out of order, or displaying some of your usually invisible UDH within the message text.

Other Message Formats

Binary Message Format

Binary-encoded SMS messages can contain a high data content and are therefore used to send more complex data, such as rich-content Over-the-Air configuration messages for Mobile Device Management. They can contain a maximum of 140 bytes per SMS.

Restrictions

Carriers who support 7-bit ASCII characters (all US carriers and many more) limit usage of binary. Furthermore, binary SMS messages have different formats and meanings between the GSM and CDMA network technology worlds, which is another reason why binary is not supported by US carriers. Binary data must instead be encoded as base64 (or something similar), which increases the size of any message (by about a third for base64, and by double for hex), making it inefficient given the strict size limitations inherent to SMS.
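
To see how much a text encoding inflates a binary payload, the sketch below encodes 140 arbitrary bytes with Python's standard base64 module and as hex. The function name encoded_sizes is hypothetical. Base64 emits four characters for every three input bytes, while hex emits two characters per byte.

```python
import base64


def encoded_sizes(payload: bytes) -> tuple[int, int, int]:
    """Return (raw size, base64 size, hex size) for a binary payload."""
    return len(payload), len(base64.b64encode(payload)), len(payload.hex())


raw, b64, hexed = encoded_sizes(bytes(range(140)))
print(raw, b64, hexed)  # 140 188 280
```

Either way, a payload that would fill exactly one binary SMS no longer fits in a single 140-byte message once it is encoded as text.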

There are also some handsets which do not read binary messages properly. The iPhone, for example, is incapable of displaying binary messages. Aerialink is configured to pass your user data header intact to the carriers. The carriage of that UDH to the mobile end-user is carrier-dependent and will vary.

Flash Message Format

GSM 03.38
Unicode

Flash SMS push directly to the main screen of a handset’s display rather than arriving in or being stored within the inbox. They require no user interaction, and are therefore useful in emergencies (e.g., fire alarm, tornado siren) as well as cases in which confidentiality is important (e.g., a one-time password).

Presently, flash messaging is only supported by GSM networks.
