author | Roland Pesch <pesch@cygnus> | 1991-01-17 15:34:55 +0000 |
---|---|---|
committer | Roland Pesch <pesch@cygnus> | 1991-01-17 15:34:55 +0000 |
commit | 93b4551441109edfc4c5348e97c11abb62b5a5a6 (patch) | |
tree | f8edc021507f8a31f5d56250f92d818ee9e6cabc /gas/doc | |
parent | bca4316904e02748c96d9c0e5d5a6180a13287cf (diff) | |
Initial revision
Diffstat (limited to 'gas/doc')
-rw-r--r-- | gas/doc/as.texinfo | 3227 |
1 files changed, 3227 insertions, 0 deletions
diff --git a/gas/doc/as.texinfo b/gas/doc/as.texinfo new file mode 100644 index 0000000..ee3c3d2 --- /dev/null +++ b/gas/doc/as.texinfo @@ -0,0 +1,3227 @@ +\input texinfo @c -*-texinfo-*- +@tex +\special{twoside} +@end tex +@setfilename as +@settitle as +@titlepage +@center @titlefont{as} +@sp 1 +@center The GNU Assembler +@sp 2 +@center Dean Elsner, Jay Fenlason & friends +@sp 13 +The Free Software Foundation Inc. thanks The Nice Computer +Company of Australia for loaning Dean Elsner to write the +first (Vax) version of @code{as} for Project GNU. +The proprietors, management and staff of TNCCA thank FSF for +distracting the boss while they got some work +done. +@sp 3 + +Copyright @copyright{} 1986,1987 Free Software Foundation, Inc. + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. + +@ignore +Permission is granted to process this file through Tex and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual). + +@end ignore +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided that the entire +resulting derived work is distributed under the terms of a permission +notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the same conditions as for modified versions. + +@end titlepage +@node top, Syntax, top, top +@chapter Overview, Usage +@menu +* Syntax:: The (machine independent) syntax that assembly language + files must follow. The machine dependent syntax + can be found in the machine dependent section of + the manual for the machine that you are using. 
+* Segments:: How to use segments and subsegments, and how the + assembler and linker will relocate things. +* Symbols:: How to set up and manipulate symbols. +* Expressions:: And how the assembler deals with them. +* PseudoOps:: The assorted machine directives that tell the + assembler exactly what to do with its input. +* MachineDependent:: Information specific to each machine. +* Maintenance:: Keeping the assembler running. +* Retargeting:: Teaching the assembler about new machines. +@end menu + +This document describes the GNU assembler @code{as}. This document +does @emph{not} describe what an assembler does, or how it works. +This document also does @emph{not} describe the opcodes, registers +or addressing modes that @code{as} uses on any particular computer +that @code{as} runs on. Consult a good book on assemblers or the +machine's architecture if you need that information. + +This document describes the directives that @code{as} understands, +and their syntax. This document also describes some of the +machine-dependent features of various flavors of the assembler. +This document also describes how the assembler works internally, and +provides some information that may be useful to people attempting to +port the assembler to another machine. + + +Throughout this document, we assume that you are running @dfn{GNU}, +the portable operating system from the @dfn{Free Software +Foundation, Inc.}. This restricts our attention to certain kinds of +computer (in particular, the kinds of computers that GNU can run on); +once this assumption is granted examples and definitions need less +qualification. 
+ +Readers should already comprehend: +@itemize @bullet +@item +Central processing unit +@item +registers +@item +memory address +@item +contents of memory address +@item +bit +@item +8-bit byte +@item +2's complement arithmetic +@end itemize + +@code{as} is part of a team of programs that turn a high-level +human-readable series of instructions into a low-level +computer-readable series of instructions. Different versions of +@code{as} are used for different kinds of computer. In particular, +at the moment, @code{as} only works for the DEC Vax, the Motorola +680x0, the Intel 80386, the Sparc, and the National Semiconductor +32032/32532. + +@section Notation +GNU and @code{as} assume the computer that will run the programs it +assembles will obey these rules. + +A (memory) @dfn{address} is 32 bits. The lowest address is zero. + +The @dfn{contents} of any memory address is one @dfn{byte} of +exactly 8 bits. + +A @dfn{word} is 16 bits stored in two bytes of memory. The addresses +of the bytes differ by exactly 1. Notice that the interpretation of +the bits in a word and of how to address a word depends on which +particular computer you are assembling for. + +A @dfn{long word}, or @dfn{long}, is 32 bits composed of four bytes. +It is stored in 4 bytes of memory; these bytes have contiguous +addresses. Again the interpretation and addressing of those bits is +machine dependent. National Semiconductor 32x32 computers say +@i{double word} where we say @i{long}. + +Numeric quantities are usually @i{unsigned} or @i{2's complement}. +Bytes, words and longs may store numbers. @code{as} manipulates +integer expressions as 32-bit numbers in 2's complement format. +When asked to store an integer in a byte or word, the lowest order +bits are stored. The order of bytes in a word or long in memory is +determined by what kind of computer will run the assembled program. +We won't mention this important @i{caveat} again. + +The meaning of these terms has changed over time. 
Although @i{byte} +used to mean any length of contiguous bits, @i{byte} now pervasively +means exactly 8 contiguous bits. A @i{word} of 16 bits made sense +for 16-bit computers. Even on 32-bit computers, a @i{word} still +means 16 bits (to machine language programmers). To many other +programmers of GNU a @i{word} means 32 bits, so beware. Similarly +@i{long} means 32 bits: from ``long word''. National Semiconductor +32x32 machine language calls a 32-bit number a ``double word''. + +@example + + Names for integers of different sizes: some conventions + + +length as vax 32x32 680x0 GNU C +(bits) + + 8 byte byte byte byte char + 16 word word word word short (int) + 32 long long(-word) double-word long(-word) long (int) + 64 quad quad(-word) +128 octa octa-word + +@end example + +@section as, the GNU Assembler +@dfn{As} is an assembler; it is one of the team of programs that +`compile' your programs into the binary numbers that a computer uses +to `run' your program. Often @code{as} reads a @i{source} program +written by a compiler and writes an @dfn{object} program for the +linker (sometimes referred to as a @dfn{loader}) @code{ld} to read. + +The source program consists of @dfn{statements} and comments. Each +statement might @dfn{assemble} to one (and only one) machine +language instruction or to one very simple datum. + +Mostly you don't have to think about the assembler because the +compiler invokes it as needed; in that sense the assembler is just +another part of the compiler. If you write your own assembly +language program, then you must run the assembler yourself to get an +object file suitable for linking. You can read below how to do this. + +@code{as} is only intended to assemble the output of the C compiler +@code{cc} for use by the linker @code{ld}. @code{as} tries to +assemble correctly everything that the standard assembler would +assemble, with a few exceptions (described in the machine-dependent +chapters.) 
Note that this doesn't mean @code{as} will use the same +syntax as the standard assembler. For example, we know of several +incompatible syntaxes for the 680x0. + +Each version of the assembler knows about just one kind of machine +language, but much is common between the versions, including object +file formats, (most) assembler directives (often called +@dfn{pseudo-ops}) and assembler syntax. + +Unlike older assemblers, @code{as} tries to assemble a source program +in one pass of the source file. This subtly changes the meaning of +the @kbd{.org} directive (@xref{Org}.). + +If you want to write assembly language programs, you must tell +@code{as} what numbers should be in a computer's memory, and which +addresses should contain them, so that the program may be executed +by the computer. Using symbols will prevent many bookkeeping +mistakes that can occur if you use raw numbers. + +@section Command Line Synopsis +@example +as [ options @dots{} ] [ file1 @dots{} ] +@end example + +After the program name @code{as}, the command line may contain +options and file names. Options may be in any order, and may be +before, after, or between file names. The order of file names is +significant. + +@subsection Options + +Except for @samp{--} any command line argument that begins with a +hyphen (@samp{-}) is an option. Each option changes the behavior of +@code{as}. No option changes the way another option works. An +option is a @samp{-} followed by one or more letters; the case of +the letter is important. No option (letter) should be used twice on +the same command line. (Nobody has decided what two copies of the +same option should mean.) All options are optional. + +Some options expect exactly one file name to follow them. The file +name may either immediately follow the option's letter (compatible +with older assemblers) or it may be the next command argument (GNU +standard). 
These two command lines are equivalent: + +@example +as -o my-object-file.o mumble +as -omy-object-file.o mumble +@end example + +Always, @file{--} (that's two hyphens, not one) by itself names the +standard input file. + +@section Input File(s) + +We use the words @dfn{source program}, abbreviated @dfn{source}, to +describe the program input to one run of @code{as}. The program may +be in one or more files; how the source is partitioned into files +doesn't change the meaning of the source. + +The source text is a catenation of the text in each file. + +Each time you run @code{as} it assembles exactly one source +program. A source program text is made of one or more files. +(The standard input is also a file.) + +You give @code{as} a command line that has zero or more input file +names. The input files are read (from left file name to right). A +command line argument (in any position) that has no special meaning +is taken to be an input file name. If @code{as} is given no file +names it attempts to read one input file from @code{as}'s standard +input. + +Use @file{--} if you need to explicitly name the standard input file +in your command line. + +It is OK to assemble an empty source. @code{as} will produce a +small, empty object file. + +If you try to assemble no files then @code{as} will try to read +standard input, which is normally your terminal. You may have to +type @key{ctl-D} to tell @code{as} there is no more program to +assemble. + +@subsection Input Filenames and Line-numbers +A line is text up to and including the next newline. The first line +of a file is numbered @b{1}, the next @b{2} and so on. + +There are two ways of locating a line in the input file(s) and both +are used in reporting error messages. One way refers to a line +number in a physical file; the other refers to a line number in a +logical file. + +@dfn{Physical files} are those files named in the command line given +to @code{as}. 
+ +@dfn{Logical files} are ``pretend'' files which bear no relation to +physical files. Logical file names help error messages reflect the +proper source file. Often they are used when @code{as}' source is +itself synthesized from other files. + +@section Output (Object) File +Every time you run @code{as} it produces an output file, which is +your assembly language program translated into numbers. This file +is the object file; named @code{a.out} unless you tell @code{as} to +give it another name by using the @code{-o} option. Conventionally, +object file names end with @file{.o}. The default name of +@file{a.out} is used for historical reasons. Older assemblers were +capable of assembling self-contained programs directly into a +runnable program. This may still work, but hasn't been tested. + +The object file is for input to the linker @code{ld}. It contains +assembled program code, information to help @code{ld} to integrate +the assembled program into a runnable file and (optionally) symbolic +information for the debugger. The precise format of object files is +described elsewhere. + +@comment link above to some info file(s) like the description of a.out. +@comment don't forget to describe GNU info as well as Unix lossage. + +@section Error and Warning Messages + +@code{as} may write warnings and error messages to the standard +error file (usually your terminal). This should not happen when +@code{as} is run automatically by a compiler. Error messages are +useful for those (few) people who still write in assembly language. + +Warnings report an assumption made so that @code{as} could keep +assembling a flawed program. + +Errors report a grave problem that stops the assembly. + +Warning messages have the format +@example +file_name:line_number:Warning Message Text +@end example +If a logical file name has been given (@xref{File}.) it is used for +the filename, otherwise the name of the current input file is used. +If a logical line number was given (@xref{Line}.) 
then it is used to +calculate the number printed, otherwise the actual line in the +current source file is printed. The message text is intended to be +self explanatory (In the grand Unix tradition). + +Error messages have the format +@example +file_name:line_number:FATAL:Error Message Text +@end example +The file name and line number are derived the same as for warning +messages. The actual message text may be rather less explanatory +because many of them aren't supposed to happen. + +@section Options +@subsection -f Works Faster +@samp{-f} should only be used when assembling programs written by a +(trusted) compiler. @samp{-f} causes the assembler to not bother +pre-processing the input file(s) before assembling them. Needless +to say, if the files actually need to be pre-processed (if they +contain comments, for example), @code{as} will not work correctly if +@samp{-f} is used. + +@subsection -L Includes Local Labels +For historical reasons, labels beginning with @samp{L} (upper case +only) are called @dfn{local labels}. Normally you don't see such +labels because they are intended for the use of programs (like +compilers) that compose assembler programs, not for your notice. +Normally both @code{as} and @code{ld} discard such labels, so you +don't normally debug with them. + +This option tells @code{as} to retain those @samp{L@dots{}} symbols +in the object file. Usually if you do this you also tell the linker +@code{ld} to preserve symbols whose names begin with @samp{L}. + +@subsection -o Names the Object File +There is always one object file output when you run @code{as}. By +default it has the name @file{a.out}. You use this option (which +takes exactly one filename) to give the object file a different name. + +Whatever the object file is called, @code{as} will overwrite any +existing file of the same name. 
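The command line rules described in this manual (options start with a hyphen, `--` names the standard input, `-o` takes its file name either attached or as the next argument, and the object file defaults to `a.out`) can be sketched in Python. This is only an illustration of those rules, not the code `as` actually uses; the function name `split_command_line` and the placeholder `STDIN` are made up for the sketch.

```python
STDIN = "<standard input>"  # hypothetical placeholder, not a name used by as


def split_command_line(args):
    """Split as-style arguments into (options, object_file, input_files)."""
    options = []
    object_file = "a.out"  # default object file name per the manual
    input_files = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--":
            # Two hyphens by themselves name the standard input file.
            input_files.append(STDIN)
        elif arg.startswith("-o"):
            if len(arg) > 2:
                # Older style: file name immediately follows the letter.
                object_file = arg[2:]
            else:
                # GNU style: file name is the next argument.
                i += 1
                if i >= len(args):
                    raise ValueError("-o needs a file name")
                object_file = args[i]
        elif arg.startswith("-"):
            options.append(arg)
        else:
            # Anything with no special meaning is an input file name.
            input_files.append(arg)
        i += 1
    if not input_files:
        # With no file names, as reads one input file from standard input.
        input_files.append(STDIN)
    return options, object_file, input_files
```

Under these assumptions, `["-o", "my-object-file.o", "mumble"]` and `["-omy-object-file.o", "mumble"]` split identically, which matches the two equivalent command lines shown earlier.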
+ +@subsection -R Folds Data Segment into Text Segment +@code{-R} tells @code{as} to write the object file as if all +data-segment data lives in the text segment. This is only done at +the very last moment: your binary data are the same, but data +segment parts are relocated differently. The data segment part of +your object file is zero bytes long because all its bytes are +appended to the text segment. (@xref{Segments}.) + +When you use @code{-R} it would be nice to generate shorter address +displacements (possible because we don't have to cross segments) +between text and data segment. We don't do this simply for +compatibility with older versions of @code{as}. @code{-R} may work +this way in future. + +@subsection -W Represses Warnings +@code{as} should never give a warning or error message when +assembling compiler output. But programs written by people often +cause @code{as} to give a warning that a particular assumption was +made. All such warnings are directed to the standard error file. +If you use this option, any warning is repressed. This option only +affects warning messages: it cannot change any detail of how +@code{as} assembles your file. Errors, which stop the assembly, are +still reported. + +@section Special Features to support Compilers + +In order to assemble compiler output into something that will work, +@code{as} will occasionally do strange things to @samp{.word} +directives. In particular, when @code{as} assembles a directive of +the form @samp{.word sym1-sym2}, and the difference between +@code{sym1} and @code{sym2} does not fit in 16 bits, @code{as} will +create a @dfn{secondary jump table}, immediately before the next +label. This @var{secondary jump table} will be preceded by a +short-jump to the first byte after the table. The short-jump +prevents the flow-of-control from accidentally falling into the +table. Inside the table will be a long-jump to @code{sym2}. 
The +original @samp{.word} will contain @code{sym1} minus (the address of +the long-jump to sym2). If there were several @samp{.word sym1-sym2} +before the secondary jump table, all of them will be adjusted. If +there was a @samp{.word sym3-sym4}, that also did not fit in sixteen +bits, a long-jump to @code{sym4} will be included in the secondary +jump table, and the @code{.word}(s) will be adjusted to contain +@code{sym3} minus (the address of the long-jump to sym4), etc. + +@emph{This feature may be disabled by compiling @code{as} with the +@samp{-DWORKING_DOT_WORD} option.} This feature is likely to confuse +assembly language programmers. + +@node Syntax, Segments, top, top +@chapter Syntax +This chapter informally defines the machine-independent syntax +allowed in a source file. @code{as} has ordinary syntax; it tries +to be upward compatible from BSD 4.2 assembler except @code{as} does +not assemble Vax bit-fields. + +@section The Pre-processor +The preprocess phase handles several aspects of the syntax. The +pre-processor will be disabled by the @samp{-f} option, or if the +first line of the source file is @code{#NO_APP}. The option to +disable the pre-processor was designed to make compiler output +assemble as fast as possible. + +The pre-processor adjusts and removes extra whitespace. It leaves +one space or tab before the keywords on a line, and turns any other +whitespace on the line into a single space. + +The pre-processor removes all comments, replacing them with a single +space (for /* @dots{} */ comments), or an appropriate number of +newlines. + +The pre-processor converts character constants into the appropriate +numeric values. + +This means that excess whitespace, comments, and character constants +cannot be used in the portions of the input text that are not +pre-processed. + +If the first line of an input file is @code{#NO_APP} or the +@samp{-f} option is given, the input file will not be +pre-processed. 
Within such an input file, parts of the file can be +pre-processed by putting a line that says @code{#APP} before the +text that should be pre-processed, and putting a line that says +@code{#NO_APP} after them. This feature is mainly intended to support +asm statements in compilers whose output normally does not need to +be pre-processed. + +@section Whitespace +@dfn{Whitespace} is one or more blanks or tabs, in any order. +Whitespace is used to separate symbols, and to make programs neater +for people to read. Unless within character constants +(@xref{Characters}.), any whitespace means the same as exactly one +space. + +@section Comments +There are two ways of rendering comments to @code{as}. In both +cases the comment is equivalent to one space. + +Anything from @samp{/*} through the next @samp{*/} is a comment. + +@example +/* + The only way to include a newline ('\n') in a comment + is to use this sort of comment. +*/ +/* This sort of comment does not nest. */ +@end example + +Anything from the @dfn{line comment} character to the next newline is +considered a comment and is ignored. The line comment character is +@samp{#} on the Vax, and @samp{|} on the 680x0. +@xref{MachineDependent}. On some machines there are two different +line comment characters. One will only begin a comment if it is the +first non-whitespace character on a line, while the other will +always begin a comment. + +To be compatible with past assemblers a special interpretation is +given to lines that begin with @samp{#}. Following the @samp{#} an +absolute expression (@pxref{Expressions}) is expected: this will be +the logical line number of the @b{next} line. Then a string +(@xref{Strings}.) is allowed: if present it is a new logical file +name. The rest of the line, if any, should be whitespace. + +If the first non-whitespace characters on the line are not numeric, +the line is ignored. (Just like a comment.) +@example + # This is an ordinary comment. 
+# 42-6 "new_file_name" # New logical file name + # This is logical line # 36. +@end example +This feature is deprecated, and may disappear from future versions +of @code{as}. + +@section Symbols +A @dfn{symbol} is one or more characters chosen from the set of all +letters (both upper and lower case), digits and the three characters +@samp{_.$}. No symbol may begin with a digit. Case is +significant. There is no length limit: all characters are +significant. Symbols are delimited by characters not in that set, +or by begin/end-of-file. (@xref{Symbols}.) + +@section Statements +A @dfn{statement} ends at a newline character (@samp{\n}) or at a +semicolon (@samp{;}). The newline or semicolon is considered part +of the preceding statement. Newlines and semicolons within +character constants are an exception: they don't end statements. +It is an error to end any statement with end-of-file: the last +character of any input file should be a newline. + +You may write a statement on more than one line if you put a +backslash (@kbd{\}) immediately in front of any newlines within the +statement. When @code{as} reads a backslashed newline both +characters are ignored. You can even put backslashed newlines in +the middle of symbol names without changing the meaning of your +source program. + +An empty statement is OK, and may include whitespace. It is ignored. + +Statements begin with zero or more labels, followed by a @dfn{key +symbol} which determines what kind of statement it is. The key +symbol determines the syntax of the rest of the statement. If the +symbol begins with a dot (@t{.}) then the statement is an assembler +directive: typically valid for any computer. If the symbol begins +with a letter the statement is an assembly language +@dfn{instruction}: it will assemble into a machine language +instruction. Different versions of @code{as} for different +computers will recognize different instructions. 
In fact, the same +symbol may represent a different instruction in a different +computer's assembly language. + +A label is usually a symbol immediately followed by a colon +(@code{:}). Whitespace before a label or after a colon is OK. You +may not have whitespace between a label's symbol and its colon. +Labels are explained below. +@xref{Labels}. + +@example +label: .directive followed by something +another$label: # This is an empty statement. + instruction operand_1, operand_2, @dots{} +@end example + +@section Constants +A constant is a number, written so that its value is known by +inspection, without knowing any context. Like this: +@example +.byte 74, 0112, 092, 0x4A, 0X4a, 'J, '\J # All the same value. +.ascii "Ring the bell\7" # A string constant. +.octa 0x123456789abcdef0123456789ABCDEF0 # A bignum. +.float 0f-314159265358979323846264338327\ +95028841971.693993751E-40 # - pi, a flonum. +@end example + +@node Characters, Strings, , Syntax +@subsection Character Constants +There are two kinds of character constants. @dfn{Characters} stand +for one character in one byte and their values may be used in +numeric expressions. String constants (properly called string +@i{literals}) are potentially many bytes and their values may not be +used in arithmetic expressions. + +@node Strings, , Characters, Syntax +@subsubsection Strings +A @dfn{string} is written between double-quotes. It may contain +double-quotes or null characters. The way to get weird characters +into a string is to @dfn{escape} these characters: precede them with +a backslash (@code{\}) character. For example @samp{\\} represents +one backslash: the first @code{\} is an escape which tells +@code{as} to interpret the second character literally as a backslash +(which prevents @code{as} from recognizing the second @code{\} as an +escape character). The complete list of escapes follows. + +@table @kbd +@item \EOF +A @kbd{\} followed by end-of-file erroneous. 
It is treated just +like an end-of-file without a preceding backslash. +@c @item \a +@c Mnemonic for ACKnowledge; for ASCII this is octal code 007. +@item \b +Mnemonic for backspace; for ASCII this is octal code 010. +@c @item \e +@c Mnemonic for EOText; for ASCII this is octal code 004. +@item \f +Mnemonic for FormFeed; for ASCII this is octal code 014. +@item \n +Mnemonic for newline; for ASCII this is octal code 012. +@c @item \p +@c Mnemonic for prefix; for ASCII this is octal code 033, usually known as @code{escape}. +@item \r +Mnemonic for carriage-Return; for ASCII this is octal code 015. +@c @item \s +@c Mnemonic for space; for ASCII this is octal code 040. Included for compliance with +@c other assemblers. +@item \t +Mnemonic for horizontal Tab; for ASCII this is octal code 011. +@c @item \v +@c Mnemonic for Vertical tab; for ASCII this is octal code 013. +@c @item \x @var{digit} @var{digit} @var{digit} +@c A hexadecimal character code. The numeric code is 3 hexadecimal digits. +@item \ @var{digit} @var{digit} @var{digit} +An octal character code. The numeric code is 3 octal digits. +For compatibility with other Unix systems, 8 and 9 are legal digits +with values 010 and 011 respectively. +@item \\ +Represents one @samp{\} character. +@c @item \' +@c Represents one @samp{'} (accent acute) character. +@c This is needed in single character literals +@c (@xref{Characters}.) to represent +@c a @samp{'}. +@item \" +Represents one @samp{"} character. Needed in strings to represent +this character, because an unescaped @samp{"} would end the string. +@item \ @var{anything-else} +Any other character when escaped by @kbd{\} will give a warning, but +assemble as if the @samp{\} was not present. The idea is that if +you used an escape sequence you clearly didn't want the literal +interpretation of the following character. However @code{as} has no +other interpretation, so @code{as} knows it is giving you the wrong +code and warns you of the fact. 
+@end table + +Which characters are escapable, and what those escapes represent, +varies widely among assemblers. The current set is what we think +BSD 4.2 @code{as} recognizes, and is a subset of what most C +compilers recognize. If you are in doubt, don't use an escape +sequence. + +@subsubsection Characters +A single character may be written as a single quote immediately +followed by that character. The same escapes apply to characters as +to strings. So if you want to write the character backslash, you +must write @kbd{'\\} where the first @code{\} escapes the second +@code{\}. As you can see, the quote is an accent acute, not an +accent grave. A newline (or semicolon (@samp{;})) immediately +following an accent acute is taken as a literal character and does +not count as the end of a statement. The value of a character +constant in a numeric expression is the machine's byte-wide code for +that character. @code{as} assumes your character code is ASCII: @kbd{'A} +means 65, @kbd{'B} means 66, and so on. + +@subsection Number Constants +@code{as} distinguishes 3 flavors of numbers according to how they +are stored in the target machine. @i{Integers} are numbers that +would fit into an @code{int} in the C language. @i{Bignums} are +integers, but they are stored in more than 32 bits. @i{Flonums} +are floating point numbers, described below. + +@subsubsection Integers +An octal integer is @samp{0} followed by zero or more of the octal +digits (@samp{01234567}). + +A decimal integer starts with a non-zero digit followed by zero or +more digits (@samp{0123456789}). + +A hexadecimal integer is @samp{0x} or @samp{0X} followed by one or +more hexadecimal digits chosen from @samp{0123456789abcdefABCDEF}. + +Integers have the obvious values. To denote a negative integer, use +the unary operator @samp{-} discussed under expressions +(@xref{Unops}.). 
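The three integer notations just described can be checked with a small Python sketch. It follows the wording of the Integers subsection literally, so it rejects `8` and `9` in octal constants even though the `.byte` example earlier lists `092`; how `as` itself treats that case is not settled by this section. The name `parse_integer` is hypothetical.

```python
import re

# One pattern per notation, taken from the prose description.
_OCTAL = re.compile(r"0[0-7]*\Z")            # 0 then zero or more octal digits
_DECIMAL = re.compile(r"[1-9][0-9]*\Z")      # non-zero digit, then digits
_HEX = re.compile(r"0[xX][0-9a-fA-F]+\Z")    # 0x or 0X, then one or more hex digits


def parse_integer(text):
    """Return the value of an unsigned integer constant, or raise ValueError."""
    if _HEX.match(text):
        return int(text[2:], 16)
    if _OCTAL.match(text):
        return int(text, 8)
    if _DECIMAL.match(text):
        return int(text, 10)
    raise ValueError("not an integer constant: %r" % text)
```

A leading `-` is deliberately not accepted here, since the text treats negation as a unary operator applied to the constant rather than as part of it.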
+ +@subsubsection Bignums +A @dfn{bignum} has the same syntax and semantics as an integer +except that the number (or its negative) takes more than 32 bits to +represent in binary. The distinction is made because in some places +integers are permitted while bignums are not. + +@subsubsection Flonums +A @dfn{flonum} represents a floating point number. The translation +is complex: a decimal floating point number from the text is +converted by @code{as} to a generic binary floating point number of +more than sufficient precision. This generic floating point number +is converted to the particular computer's floating point format(s) +by a portion of @code{as} specialized to that computer. + +A flonum is written by writing (in order) +@itemize @bullet +@item +The digit @samp{0}. +@item +A letter, to tell @code{as} the rest of the number is a flonum. +@kbd{e} +is recommended. Case is not important. +(Any otherwise illegal letter will work here, +but that might be changed. Vax BSD 4.2 assembler +seems to allow any of @samp{defghDEFGH}.) +@item +An optional sign: either @samp{+} or @samp{-}. +@item +An optional integer part: zero or more decimal digits. +@item +An optional fraction part: @samp{.} followed by zero +or more decimal digits. +@item +An optional exponent, consisting of: +@itemize @bullet +@item +A letter; the exact significance varies according to +the computer that executes the program. @code{as} +accepts any letter for now. Case is not important. +@item +Optional sign: either @samp{+} or @samp{-}. +@item +One or more decimal digits. +@end itemize +@end itemize + +At least one of @var{integer part} or @var{fraction part} must be +present. The floating point number has the obvious value. + +The computer running @code{as} needs no floating point hardware. +@code{as} does all processing using integers. 
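The flonum layout listed above (digit `0`, a letter, optional sign, integer part, fraction part, optional exponent) translates fairly directly into a regular expression. The Python sketch below is an assumption-laden reading of that list, not the scanner `as` uses: it accepts any ASCII letter in both letter positions and treats a bare `.` as a present fraction part. The name `is_flonum` is hypothetical.

```python
import re

_FLONUM = re.compile(
    r"0"                                  # the digit 0
    r"[A-Za-z]"                           # a letter marking a flonum
    r"[+-]?"                              # optional sign
    r"(?P<integer>[0-9]*)"                # optional integer part
    r"(?P<fraction>\.[0-9]*)?"            # optional fraction part
    r"(?:[A-Za-z][+-]?[0-9]+)?"           # optional exponent
    r"\Z"
)


def is_flonum(text):
    """True if text matches the flonum syntax as described in the list."""
    match = _FLONUM.match(text)
    if match is None:
        return False
    # At least one of integer part or fraction part must be present.
    return bool(match.group("integer")) or match.group("fraction") is not None
```

For instance, `0f-3.14E+2` and `0e.5` satisfy the description, while `0e` (no integer or fraction part) and `1e5` (no leading `0` and letter) do not.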
+ +@node Segments, Symbols, Syntax, top +@chapter (Sub)Segments & Relocation +Roughly, a @dfn{segment} is a range of addresses, with no gaps, with +all data ``in'' those addresses being treated the same. For example +there may be a ``read only'' segment. + +The linker @code{ld} reads many object files (partial programs) and +combines their contents to form a runnable program. When @code{as} +emits an object file, the partial program is assumed to start at +address 0. @code{ld} will assign the final addresses the partial +program occupies, so that different partial programs don't overlap. +That explanation is too simple, but it will suffice to explain how +@code{as} works. + +@code{ld} moves blocks of bytes of your program to their run-time +addresses. These blocks slide to their run-time addresses as rigid +units; their length does not change and neither does the order of +bytes within them. Such a rigid unit is called a @i{segment}. +Assigning run-time addresses to segments is called +@dfn{relocation}. It includes the task of adjusting mentions of +object-file addresses so they refer to the proper run-time addresses. + +An object file written by @code{as} has three segments, any of which +may be empty. These are named @i{text}, @i{data} and @i{bss} +segments. Within the object file, the text segment starts at +address 0, the data segment follows, and the bss segment follows the +data segment. + +To let @code{ld} know which data will change when the segments are +relocated, and how to change that data, @code{as} also writes to the +object file details of the relocation needed. To perform relocation +@code{ld} must know for each mention of an address in the object +file: +@itemize @bullet +@item +At what address in the object file does this mention of +an address begin? +@item +How long (in bytes) is this mention? +@item +Which segment does the address refer to? +What is the numeric value of (@var{address} @t{-} +@var{start-address of segment})? 
+@item +Is the mention of an address ``Program counter relative''? +@end itemize + +In fact, every address @code{as} ever thinks about is expressed as +(@var{segment} @t{+} @var{offset into segment}). Further, every +expression @code{as} computes is of this segmented nature. So +@dfn{absolute expression} means an expression with segment +``absolute'' (@xref{LdSegs}.). A @dfn{pass1 expression} means an +expression with segment ``pass1'' (@xref{MythSegs}.). In this +document ``(segment, offset)'' will be written as @{ segment-name +(offset into segment) @}. + +Apart from text, data and bss segments you need to know about the +@dfn{absolute} segment. When @code{ld} mixes partial programs, +addresses in the absolute segment remain unchanged. That is, +address @{absolute 0@} is ``relocated'' to run-time address 0 by +@code{ld}. Although two partial programs' data segments will not +overlap addresses after linking, @b{by definition} their absolute +segments will overlap. Address @{absolute 239@} in one partial +program will always be the same address when the program is running +as address @{absolute 239@} in any other partial program. + +The idea of segments is extended to the @dfn{undefined} segment. +Any address whose segment is unknown at assembly time is by +definition rendered @{undefined (something, unknown yet)@}. Since +numbers are always defined, the only way to generate an undefined +address is to mention an undefined symbol. A reference to a named +common block would be such a symbol: its value is unknown at assembly +time so it has segment @i{undefined}. + +By analogy the word @i{segment} is used to describe groups of segments in +the linked program. @code{ld} puts all partial programs' text +segments in contiguous addresses in the linked program. It is +customary to refer to the @i{text segment} of a program, meaning all +the addresses of all partial programs' text segments. Likewise for +data and bss segments. 
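The @{segment offset@} notation and the effect of relocation can be sketched as follows. This is a minimal illustration, not how @code{ld} is implemented: the function @code{relocate} and the base addresses in @code{BASES} are invented for the example.

```python
# Hypothetical run-time base addresses that a linker might assign to
# one partial program's segments; the numbers are made up.
BASES = {"text": 0x1000, "data": 0x4000, "bss": 0x6000}


def relocate(segment, offset, bases=BASES):
    """Return the run-time address of {segment offset} (illustrative only)."""
    if segment == "absolute":
        # {absolute n} is always "relocated" to run-time address n.
        return offset
    if segment == "undefined":
        # The segment is not known until another partial program defines it.
        raise ValueError("undefined address cannot be relocated yet")
    # text, data and bss slide as rigid units: base plus unchanged offset.
    return bases[segment] + offset
```

With these assumed bases, @{text 4@} lands at 0x1004, while @{absolute 239@} stays at 239 whatever bases are chosen.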
+ +@section Segments +Some segments are manipulated by @code{ld}; others are invented for +use of @code{as} and have no meaning except during assembly. + +@node LdSegs, , , +@subsection ld segments +@code{ld} deals with just 5 kinds of segments, summarized below. +@table @b +@item text segment +@itemx data segment +These segments hold your program bytes. @code{as} and @code{ld} +treat them as separate but equal segments. Anything you can say of +one segment is true of the other. When the program is running +however it is customary for the text segment to be unalterable: it +will contain instructions, constants and the like. The data segment +of a running program is usually alterable: for example, C variables +would be stored in the data segment. +@item bss segment +This segment contains zeroed bytes when your program begins +running. It is used to hold uninitialized variables or common +storage. The length of each partial program's bss segment is +important, but because it starts out containing zeroed bytes there +is no need to store explicit zero bytes in the object file. The bss +segment was invented to eliminate those explicit zeros from object +files. +@item absolute segment +Address 0 of this segment is always ``relocated'' to run-time address +0. This is useful if you want to refer to an address that @code{ld} +must not change when relocating. In this sense we speak of absolute +addresses being ``unrelocatable'': they don't change during +relocation. +@item undefined segment +This ``segment'' is a catch-all for address references to objects +not in the preceding segments. See the description of @file{a.out} +for details. +@end table +An idealized example of the 3 relocatable segments follows. Memory +addresses are on the horizontal axis. + +@example + +-----+----+--+ +partial program # 1: |ttttt|dddd|00| + +-----+----+--+ + + text data bss + seg. seg. seg. 
+ + +---+---+---+ +partial program # 2: |TTT|DDD|000| + +---+---+---+ + + +--+---+-----+--+----+---+-----+~~ +linked program: | |TTT|ttttt| |dddd|DDD|00000| + +--+---+-----+--+----+---+-----+~~ + + addresses: 0 @dots{} +@end example + +@node MythSegs, , , +@subsection Mythical Segments +These segments are invented for the internal use of @code{as}. They +have no meaning at run-time. You don't need to know about these +segments except that they might be mentioned in @code{as}' warning +messages. These segments are invented to permit the value of every +expression in your assembly language program to be a segmented +address. + +@table @b +@item absent segment +An expression was expected and none was found. +@item goof segment +An internal assembler logic error has been found. This means there +is a bug in the assembler. +@item grand segment +A @dfn{grand number} is a bignum or a flonum, but not an integer. +If a number can't be written as a C @code{int} constant, it is a +grand number. @code{as} has to remember that a flonum or a bignum +does not fit into 32 bits, and cannot be a primary (@xref{Primary}.) +in an expression: this is done by making a flonum or bignum be of +type ``grand''. This is purely for internal @code{as} convenience; +grand segment behaves similarly to absolute segment. +@item pass1 segment +The expression was impossible to evaluate in the first pass. The +assembler will attempt a second pass (second reading of the source) +to evaluate the expression. Your expression mentioned an undefined +symbol in a way that defies the one-pass (segment + offset in +segment) assembly process. No compiler need emit such an expression. +@item difference segment +As an assist to the C compiler, expressions of the forms +@itemize @bullet +@item +(undefined symbol) @t{-} (expression) +@item +(something) @t{-} (undefined symbol) +@item +(undefined symbol) @t{-} (undefined symbol) +@end itemize +are permitted to belong to the ``difference'' segment. 
@code{as} +re-evaluates such expressions after the source file has been read +and the symbol table built. If by that time there are no undefined +symbols in the expression then the expression assumes a new segment. +The intention is to permit statements like @samp{.word label - +base_of_table} to be assembled in one pass where both @code{label} +and @code{base_of_table} are undefined. This is useful for +compiling C and Algol switch statements, Pascal case statements, +FORTRAN computed goto statements and the like. +@end table + +@section Sub-Segments +Assembled bytes fall into two segments: text and data. Because you +may have groups of text or data that you want to end up near to each +other in the object file, @code{as} allows you to use +@dfn{subsegments}. Within each segment, there can be numbered +subsegments with values from 0 to 8192. Objects assembled into the +same subsegment will be grouped with other objects in the same +subsegment when they are all put into the object file. For example, +a compiler might want to store constants in the text segment, but +might not want to have them interspersed with the program being +assembled. In this case, the compiler could issue a @code{.text 0} +before each section of code being output, and a @code{.text 1} before +each group of constants being output. + +Subsegments are optional. If you don't use subsegments, everything +will be stored in subsegment number zero. + +Each subsegment is zero-padded up to a multiple of four bytes. +(Subsegments may be padded a different amount on different flavors +of @code{as}.) Subsegments appear in your object file in numeric +order, lowest numbered to highest. (All this to be compatible with +other people's assemblers.) The object file, @code{ld} @i{etc.} +have no concept of subsegments. They just see all your text +subsegments as a text segment, and all your data subsegments as a +data segment. 
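The grouping and padding of subsegments described above can be sketched like this. It is an illustration only: @code{layout_segment} is a hypothetical name, and the four-byte padding is the default stated in the text, which other flavors of @code{as} may change.

```python
def layout_segment(pieces, pad_to=4):
    """Lay out one segment from (subsegment_number, data) pieces.

    pieces is given in source order; the result places subsegments in
    numeric order and zero-pads each to a multiple of pad_to bytes.
    """
    grouped = {}
    for number, data in pieces:
        # Bytes for the same subsegment accumulate in source order.
        grouped.setdefault(number, bytearray()).extend(data)

    out = bytearray()
    for number in sorted(grouped):  # lowest numbered subsegment first
        chunk = grouped[number]
        chunk.extend(b"\0" * (-len(chunk) % pad_to))  # zero padding
        out.extend(chunk)
    return bytes(out)
```

For instance, pieces assembled as subsegment 0, then 1, then 0 again come out with both subsegment 0 pieces adjacent, followed by the padded subsegment 1 bytes.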
+ +To specify which subsegment you want subsequent statements assembled +into, use a @samp{.text @var{expression}} or a @samp{.data +@var{expression}} statement. @var{Expression} should be an absolute +expression. (@xref{Expressions}.) If you just say @samp{.text} +then @samp{.text 0} is assumed. Likewise @samp{.data} means +@samp{.data 0}. Assembly begins in @code{text 0}. +For instance: +@example +.text 0 # The default subsegment is text 0 anyway. +.ascii "This lives in the first text subsegment. *" +.text 1 +.ascii "But this lives in the second text subsegment." +.data 0 +.ascii "This lives in the data segment," +.ascii "in the first data subsegment." +.text 0 +.ascii "This lives in the first text segment," +.ascii "immediately following the asterisk (*)." +@end example + +Each segment has a @dfn{location counter} incremented by one for +every byte assembled into that segment. Because subsegments are +merely a convenience restricted to @code{as} there is no concept of +a subsegment location counter. There is no way to directly +manipulate a location counter. The location counter of the segment +that statements are being assembled into is said to be the +@dfn{active} location counter. + +@section Bss Segment +The @code{bss} segment is used for local common variable storage. +You may allocate address space in the @code{bss} segment, but you may +not dictate data to load into it before your program executes. When +your program starts running, all the contents of the @code{bss} +segment are zeroed bytes. + +Addresses in the bss segment are allocated with a special statement; +you may not assemble anything directly into the bss segment. Hence +there are no bss subsegments. + +@node Symbols, Expressions, Segments, top +@chapter Symbols +Because the linker uses symbols to link, the debugger uses symbols +to debug and the programmer uses symbols to name things, symbols are +a central concept. Symbols do not appear in the object file in the +order they are declared. 
This may break some debuggers. + +@node Labels, , , Symbols +@section Labels +A @dfn{label} is written as a symbol immediately followed by a colon +(@samp{:}). The symbol then represents the current value of the +active location counter, and is, for example, a suitable instruction +operand. You are warned if you use the same symbol to represent two +different locations: the first definition overrides any other +definitions. + +@section Giving Symbols Other Values +A symbol can be given an arbitrary value by writing a symbol followed +by an equals sign (@samp{=}) followed by an expression +(@pxref{Expressions}). This is equivalent to using the @code{.set} +directive. (@xref{Set}.) + +@section Symbol Names +Symbol names begin with a letter or with one of @samp{$._}. That +character may be followed by any string of digits, letters, +underscores and dollar signs. Case of letters is significant: +@code{foo} is a different symbol name than @code{Foo}. + +Each symbol has exactly one name. Each name in an assembly program +refers to exactly one symbol. You may use that symbol name any +number of times in an assembly program. + +@subsection Local Symbol Names + +Local symbols help compilers and programmers use names temporarily. +There are ten @dfn{local} symbol names, which are re-used throughout +the program. Their names are @samp{0} @samp{1} @dots{} @samp{9}. +To define a local symbol, write a label of the form +@var{digit}@t{:}. To refer to the most recent previous definition +of that symbol write @var{digit}@t{b}, using the same digit as when +you defined the label. To refer to the next definition of a local +label, write @var{digit}@t{f} where @var{digit} gives you a choice +of 10 forward references. The @samp{b} stands for ``backwards'' and +the @samp{f} stands for ``forwards''. + +Local symbols are not used by the current C compiler. 
+ +There is no restriction on how you can use these labels, but +remember that at any point in the assembly you can refer to at most +10 prior local labels and to at most 10 forward local labels. + +Local symbol names are only a notation device. They are immediately +transformed into more conventional symbol names before the assembler +thinks about them. The symbol names stored in the symbol table, +appearing in error messages and optionally emitted to the object +file have these parts: +@table @kbd +@item L +All local labels begin with @samp{L}. Normally both @code{as} and +@code{ld} forget symbols that start with @samp{L}. These labels are +used for symbols you are never intended to see. If you give the +@samp{-L} option then @code{as} will retain these symbols in the +object file. By instructing @code{ld} to also retain these symbols, +you may use them in debugging. +@item @i{a digit} +If the label is written @samp{0:} then the digit is @samp{0}. +If the label is written @samp{1:} then the digit is @samp{1}. +And so on up through @samp{9:}. +@item @i{control}-A +This unusual character is included so you don't accidentally invent +a symbol of the same name. The character has ASCII value +@samp{\001}. +@item @i{an ordinal number} +This is like a serial number to keep the labels distinct. The first +@samp{0:} gets the number @samp{1}; The 15th @samp{0:} gets the +number @samp{15}; @i{etc.}. Likewise for the other labels @samp{1:} +through @samp{9:}. +@end table +For instance, the +first @code{1:} is named @code{L1^A1}, the 44th @code{3:} is named @code{L3^A44}. + +@section The Special Dot Symbol + +The special symbol @code{.} refers to the current address that +@code{as} is assembling into. Thus, the expression @samp{melvin: +.long .} will cause @var{melvin} to contain its own address. +Assigning a value to @code{.} is treated the same as a @code{.org} +directive. Thus, the expression @samp{.=.+4} is the same as saying +@samp{.space 4}. 
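The renaming of local symbols described earlier (an @samp{L}, the digit, control-A, then an ordinal number) can be sketched as follows. The class @code{LocalLabels} and its method names are hypothetical; the sketch only mirrors the naming rule in the text and is not the code inside @code{as}.

```python
class LocalLabels:
    """Track definitions of the ten local labels 0: through 9:."""

    def __init__(self):
        # How many times each digit has been defined so far.
        self.count = {str(d): 0 for d in range(10)}

    @staticmethod
    def _name(digit, ordinal):
        # "L", the digit, control-A (ASCII \001), then the ordinal number.
        return "L" + digit + "\x01" + str(ordinal)

    def define(self, digit):
        """Handle a label written as digit followed by a colon."""
        self.count[digit] += 1
        return self._name(digit, self.count[digit])

    def backward(self, digit):
        """Resolve a reference written as digit followed by b."""
        if self.count[digit] == 0:
            raise LookupError("no previous definition of " + digit + ":")
        return self._name(digit, self.count[digit])

    def forward(self, digit):
        """Resolve a reference written as digit followed by f."""
        return self._name(digit, self.count[digit] + 1)
```

After the first @code{1:} has been defined, @code{1b} resolves to the name of that first definition and @code{1f} to the name the second @code{1:} will receive.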
+ +@section Symbol Attributes +Every symbol has the attributes discussed below. The detailed +definitions are in <a.out.h>. + +If you use a symbol without defining it, @code{as} assumes zero for +all these attributes, and probably won't warn you. This makes the +symbol an externally defined symbol, which is generally what you +would want. + +@subsection Value +The value of a symbol is (usually) 32 bits, the size of one C +@code{int}. For a symbol which labels a location in the +@code{text}, @code{data}, @code{bss} or @code{Absolute} segments the +value is the number of addresses from the start of that segment to +the label. Naturally for @code{text} @code{data} and @code{bss} +segments the value of a symbol changes as @code{ld} changes segment +base addresses during linking. @code{absolute} symbols' values do +not change during linking: that is why they are called absolute. + +The value of an undefined symbol is treated in a special way. If it +is 0 then the symbol is not defined in this assembler source +program, and @code{ld} will try to determine its value from other +programs it is linked with. You make this kind of symbol simply by +mentioning a symbol name without defining it. A non-zero value +represents a @code{.comm} common declaration. The value is how much +common storage to reserve, in bytes (@i{i.e.} addresses). The +symbol refers to the first address of the allocated storage. + +@subsection Type +The type attribute of a symbol is 8 bits encoded in a devious way. +We kept this coding standard for compatibility with older operating +systems. + +@example + + 7 6 5 4 3 2 1 0 bit numbers + +-----+-----+-----+-----+-----+-----+-----+-----+ + | | | | + | N_STAB bits | N_TYPE bits |N_EXT| + | | | bit | + +-----+-----+-----+-----+-----+-----+-----+-----+ + + n_type byte +@end example + +@subsubsection N_EXT bit +This bit is set if @code{ld} might need to use the symbol's value +and type bits. 
If this bit is re-set then @code{ld} can ignore the +symbol while linking. It is set in two cases. If the symbol is +undefined, then @code{ld} is expected to find the symbol's value +elsewhere in another program module. Otherwise the symbol has the +value given, but this symbol name and value are revealed to any other +programs linked in the same executable program. This second use of +the @code{N_EXT} bit is most often done by a @code{.globl} statement. + +@subsubsection N_TYPE bits +These establish the symbol's ``type'', which is mainly a relocation +concept. Common values are detailed in the manual describing the +executable file format. + +@subsubsection N_STAB bits +Common values for these bits are described in the manual on the +executable file format. + +@subsection Desc(riptor) +This is an arbitrary 16-bit value. You may establish a symbol's +descriptor value by using a @code{.desc} statement (@xref{Desc}.). +A descriptor value means nothing to @code{as}. + +@subsection Other +This is an arbitrary 8-bit value. It means nothing to @code{as}. + +@node Expressions, PseudoOps, Symbols, top +@chapter Expressions +An @dfn{expression} specifies an address or numeric value. +Whitespace may precede and/or follow an expression. + +@section Empty Expressions +An empty expression has no operands: it is just whitespace or null. +Wherever an absolute expression is required, you may omit the +expression and @code{as} will assume a value of (absolute) 0. This +is compatible with other assemblers. + +@section Integer Expressions +An @dfn{integer expression} is one or more @i{primaries} delimited +by @i{operators}. + +@node Primary, Unops, , Expressions +@subsection Primaries +@dfn{Primaries} are symbols, numbers or subexpressions. Other +languages might call primaries ``arithmetic operands'' but we don't +want them confused with ``instruction operands'' of the machine +language so we give them a different name. 
+ +Symbols are evaluated to yield @{@var{segment} @var{value}@} where +@var{segment} is one of @b{text}, @b{data}, @b{bss}, @b{absolute}, +or @b{undefined}. @var{value} is a signed 2's complement 32-bit +integer. + +Numbers are usually integers. + +A number can be a flonum or bignum. In this case, you are warned +that only the low order 32 bits are used, and @code{as} pretends +these 32 bits are an integer. You may write integer-manipulating +instructions that act on exotic constants, compatible with other +assemblers. + +Subexpressions are a left parenthesis (@t{(}) followed by an integer +expression followed by a right parenthesis (@t{)}), or a unary +operator followed by a primary. + +@subsection Operators +@dfn{Operators} are arithmetic marks, like @t{+} or @t{%}. Unary +operators are followed by a primary. Binary operators appear +between primaries. Operators may be preceded and/or followed by +whitespace. + +@subsection Unary Operators +@node Unops, , Primary, Expressions +@code{as} has the following @dfn{unary operators}. They each take +one primary, which must be absolute. +@table @t +@item - +Hyphen. @dfn{Negation}. Two's complement negation. +@item ~ +Tilde. @dfn{Complementation}. Bitwise not. +@end table + +@subsection Binary Operators +@dfn{Binary operators} are infix. Operators are prioritized, but +equal priority operators are performed left to right. Apart from +@samp{+} or @samp{-}, both primaries must be absolute, and the +result is absolute, else one primary can be either undefined or +pass1 and the result is pass1. +@enumerate +@item +Highest Priority +@table @code +@item * +@dfn{Multiplication}. +@item / +@dfn{Division}. Truncation is the same as the C operator @samp{/} +of the compiler that compiled @code{as}. +@item % +@dfn{Remainder}. +@item < +@itemx << +@dfn{Shift Left}. Same as the C operator @samp{<<} of +the compiler that compiled @code{as}. +@item > +@itemx >> +@dfn{Shift Right}. 
Same as the C operator @samp{>>} of +the compiler that compiled @code{as}. +@end table +@item +Intermediate priority +@table @t +@item | +@dfn{Bitwise Inclusive Or}. +@item & +@dfn{Bitwise And}. +@item ^ +@dfn{Bitwise Exclusive Or}. +@item ! +@dfn{Bitwise Or Not}. +@end table +@item +Lowest Priority +@table @t +@item + +@dfn{Addition}. If either primary is absolute, the result +has the segment of the other primary. +If either primary is pass1 or undefined, result is pass1. +Otherwise @t{+} is illegal. +@item - +@dfn{Subtraction}. If the right primary is absolute, the +result has the segment of the left primary. +If either primary is pass1 the result is pass1. +If either primary is undefined the result is difference segment. +If both primaries are in the same segment, the result is absolute; provided +that segment is one of text, data or bss. +Otherwise @t{-} is illegal. +@end table +@end enumerate + +The sense of the rules is that you can't add or subtract quantities +from two different segments. If both primaries are in one of these +segments, they must be in the same segment: @b{text}, @b{data} or +@b{bss}, and the operator must be @samp{-}. + +@node PseudoOps, MachineDependent, Expressions, top +@chapter Assembler Directives +@menu +* Abort:: The Abort directive causes as to abort +* Align:: Pad the location counter to a power of 2 +* Ascii:: Fill memory with bytes of ASCII characters +* Asciz:: Fill memory with bytes of ASCII characters followed + by a null. 
+* Byte:: Fill memory with 8-bit integers +* Comm:: Reserve public space in the BSS segment +* Data:: Change to the data segment +* Desc:: Set the n_desc of a symbol +* Double:: Fill memory with double-precision floating-point numbers +* File:: Set the logical file name +* Fill:: Fill memory with repeated values +* Float:: Fill memory with single-precision floating-point numbers +* Global:: Make a symbol visible to the linker +* Int:: Fill memory with 32-bit integers +* Lcomm:: Reserve private space in the BSS segment +* Line:: Set the logical line number +* Long:: Fill memory with 32-bit integers +* Lsym:: Create a local symbol +* Octa:: Fill memory with 128-bit integers +* Org:: Change the location counter +* Quad:: Fill memory with 64-bit integers +* Set:: Set the value of a symbol +* Short:: Fill memory with 16-bit integers +* Space:: Fill memory with a repeated value +* Stab:: Store debugging information +* Text:: Change to the text segment +* Word:: Fill memory with 16-bit integers +@end menu + +All assembler directives begin with a symbol that begins with a +period (@samp{.}). The rest of the symbol is letters: their case +does not matter. + +@node Abort, Align, PseudoOps, PseudoOps +@section .abort +This directive stops the assembly immediately. It is for +compatibility with other assemblers. The original idea was that the +assembler program would be piped into the assembler. If the source +of the program wanted to quit, then this directive tells @code{as} to +quit also. One day @code{.abort} will not be supported. + +@node Align, Ascii, Abort, PseudoOps +@section .align @var{absolute-expression} , @var{absolute-expression} +Pad the location counter (in the current subsegment) to a word, +longword or whatever boundary. The first expression is the number +of low-order zero bits the location counter will have after +advancement. For example @samp{.align 3} will advance the location +counter until it is a multiple of 8. 
If the location counter is +already a multiple of 8, no change is needed. + +The second expression gives the value to be stored in the padding +bytes. It (and the comma) may be omitted. If it is omitted, the +padding bytes are zeroed. + +@node Ascii, Asciz, Align, PseudoOps +@section .ascii @var{strings} +This expects zero or more string literals (@xref{Strings}.) +separated by commas. It assembles each string (with no automatic +trailing zero byte) into consecutive addresses. + +@node Asciz, Byte, Ascii, PseudoOps +@section .asciz @var{strings} +This is just like .ascii, but each string is followed by a zero byte. +The `z' in `.asciz' stands for `zero'. + +@node Byte, Comm, Asciz, PseudoOps +@section .byte @var{expressions} + +This expects zero or more expressions, separated by commas. +Each expression is assembled into the next byte. + +@node Comm, Data, Byte, PseudoOps +@section .comm @var{symbol} , @var{length} +This declares a named common area in the bss segment. Normally +@code{ld} reserves memory addresses for it during linking, so no +partial program defines the location of the symbol. Tell @code{ld} +that it must be at least @var{length} bytes long. @code{ld} will +allocate space that is at least as long as the longest @code{.comm} +request in any of the partial programs linked. @var{length} is an +absolute expression. + +@node Data, Desc, Comm, PseudoOps +@section .data @var{subsegment} +This tells @code{as} to assemble the following statements onto the +end of the data subsegment numbered @var{subsegment} (which is an +absolute expression). If @var{subsegment} is omitted, it defaults +to zero. + +@node Desc, Double, Data, PseudoOps +@section .desc @var{symbol}, @var{absolute-expression} +This sets @code{n_desc} of the symbol to the low 16 bits of +@var{absolute-expression}. + +@node Double, File, Desc, PseudoOps +@section .double @var{flonums} +This expects zero or more flonums, separated by commas. It assembles +floating point numbers. 
The exact kind of floating point numbers +emitted depends on what computer @code{as} is assembling for. See +the machine-specific part of the manual for the machine the +assembler is running on for more information. + +@node File, Fill, Double, PseudoOps +@section .file @var{string} +This tells @code{as} that we are about to start a new logical +file. @var{String} is the new file name. An empty file name +is OK, but you must still give the quotes: @code{""}. This +statement may go away in future: it is only recognized to +be compatible with old @code{as} programs. + +@node Fill, Float, File, PseudoOps +@section .fill @var{repeat} , @var{size} , @var{value} +@var{repeat}, @var{size} and @var{value} are absolute expressions. +This emits @var{repeat} copies of @var{size} bytes. @var{Repeat} +may be zero or more. @var{Size} may be zero or more, but if it is +more than 8, then it is deemed to have the value 8, compatible with +other people's assemblers. The contents of each repetition +are taken from an 8-byte number. The highest order 4 bytes are +zero. The lowest order 4 bytes are @var{value} rendered in the +byte-order of an integer on the computer @code{as} is assembling for. +Each @var{size} bytes in a repetition is taken from the lowest order +@var{size} bytes of this number. Again, this bizarre behavior is +compatible with other people's assemblers. + +@var{Size} and @var{value} are optional. +If the second comma and @var{value} are absent, @var{value} is +assumed zero. If the first comma and following tokens are absent, +@var{size} is assumed to be 1. + +@node Float, Global, Fill, PseudoOps +@section .float @var{flonums} +This directive assembles zero or more flonums, separated by commas. +The exact kind of floating point numbers emitted depends on what +computer @code{as} is assembling for. See the machine-specific part +of the manual for the machine the assembler is running on for more +information. 
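The @code{.fill} rules given earlier are easy to misread, so here is one possible reading as a sketch. The function @code{fill} and its @code{byteorder} parameter are hypothetical, and the interpretation of ``lowest order @var{size} bytes'' for each byte order is an assumption, not a statement of what any particular assembler emits.

```python
def fill(repeat, size=1, value=0, byteorder="little"):
    """Return the bytes one reading of `.fill repeat, size, value` emits."""
    size = min(size, 8)  # sizes above 8 are deemed to be 8
    # An 8-byte number: high 4 bytes zero, low 4 bytes hold value,
    # rendered in the (assumed) byte order of the target machine.
    number = (value & 0xFFFFFFFF).to_bytes(8, byteorder)
    if byteorder == "little":
        unit = number[:size]      # lowest-order bytes come first
    else:
        unit = number[8 - size:]  # lowest-order bytes come last
    return unit * repeat
```

Under this reading, a repetition with @var{size} greater than 4 always carries zero bytes in its high-order part, which matches the statement that the highest order 4 bytes are zero.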
+ +@node Global, Int, Float, PseudoOps +@section .global @var{symbol} +This makes the symbol visible to @code{ld}. If you define +@var{symbol} in your partial program, its value is made available to +other partial programs that are linked with it. Otherwise, +@var{symbol} will take its attributes from a symbol of the same name +from another partial program it is linked with. + +This is done by setting the @code{N_EXT} bit +of that symbol's @code{n_type} to 1. + +@node Int, Lcomm, Global, PseudoOps +@section .int @var{expressions} +Expect zero or more @var{expressions}, of any segment, separated by +commas. For each expression, emit a 32-bit number that will, at run +time, be the value of that expression. The byte order of the +expression depends on what kind of computer will run the program. + +@node Lcomm, Line, Int, PseudoOps +@section .lcomm @var{symbol} , @var{length} +Reserve @var{length} (an absolute expression) bytes for a local +common and denoted by @var{symbol}, whose segment and value are +those of the new local common. The addresses are allocated in the +@code{bss} segment, so at run-time the bytes will start off zeroed. +@var{Symbol} is not declared global (@xref{Global}.), so is normally +not visible to @code{ld}. + +@node Line, Long, Lcomm, PseudoOps +@section .line @var{logical line number} +This tells @code{as} to change the logical line number. +@var{logical line number} is an absolute expression. The next line +will have that logical line number. So any other statements on the +current line (after a @code{;}) will be reported as on logical line +number @var{logical line number} - 1. One day this directive will +be unsupported: it is used only for compatibility with existing +assembler programs. + +@node Long, Lsym, Line, PseudoOps +@section .long @var{expressions} +This is the same as @samp{.int}, @pxref{Int}. 
+ +@node Lsym, Octa, Long, PseudoOps +@section .lsym @var{symbol}, @var{expression} +This creates a new symbol named @var{symbol}, but does not put it in +the hash table, ensuring it cannot be referenced by name during the +rest of the assembly. This sets the attributes of the symbol to be +the same as the expression value. @code{n_other} = @code{n_desc} = +0. @code{n_type} = (whatever segment the expression has); the +@code{N_EXT} bit of @code{n_type} is zero. @code{n_value} = +(expression's value). + +@node Octa, Org, Lsym, PseudoOps +@section .octa @var{bignums} +This expects zero or more bignums, separated by commas. For each +bignum, it emits a 16-byte (@b{octa}-word) integer. + +@node Org, Quad, Octa, PseudoOps +@section .org @var{new-lc} , @var{fill} +This will advance the location counter of the current segment to +@var{new-lc}. @var{new-lc} is either an absolute expression or an +expression with the same segment as the current subsegment. That +is, you can't use @code{.org} to cross segments. Because @code{as} +tries to assemble programs in one pass @var{new-lc} must be defined. +If you really detest this restriction we eagerly await a chance to +share your improved assembler. To be compatible with former +assemblers, if the segment of @var{new-lc} is absolute then we +pretend the segment of @var{new-lc} is the same as the current +subsegment. + +Beware that the origin is relative to the start of the segment, not +to the start of the subsegment. This is compatible with other +people's assemblers. + +If the location counter (of the current subsegment) is advanced, the +intervening bytes are filled with @var{fill} which should be an +absolute expression. If the comma and @var{fill} are omitted, +@var{fill} defaults to zero. + +@node Quad, Set, Org, PseudoOps +@section .quad @var{bignums} +This expects zero or more bignums, separated by commas. For each +bignum, it emits an 8-byte (@b{quad}-word) integer. 
If the bignum +won't fit in a quad-word, it prints a warning message and just +takes the lowest order 8 bytes of the bignum. + +@node Set, Short, Quad, PseudoOps +@section .set @var{symbol}, @var{expression} + +This sets the value of @var{symbol} to @var{expression}. This will change +@code{n_value} and @code{n_type} to conform to the @var{expression}. +If the @code{N_EXT} bit is set, it remains set. + +It is OK to @code{.set} a symbol many times in the same assembly. +If the expression's segment is unknowable during pass 1, a second +pass over the source program will be forced. The second pass is +currently not implemented. @code{as} will abort with an error +message if one is required. + +If you @code{.set} a global symbol, the value stored in the object +file is the last value stored into it. + +@node Short, Space, Set, PseudoOps +@section .short @var{expressions} +Except on the Sparc this is the same as @samp{.word}. @xref{Word}. +On the Sparc, this expects zero or more @var{expressions}, and emits +a 16-bit number for each. + +@node Space, Stab, Short, PseudoOps +@section .space @var{size} , @var{fill} +This emits @var{size} bytes, each of value @var{fill}. Both +@var{size} and @var{fill} are absolute expressions. If the comma +and @var{fill} are omitted, @var{fill} is assumed to be zero. + +@node Stab, Text, Space, PseudoOps +@section .stabd, .stabn, .stabs +There are three directives that begin @code{.stab@dots{}}. +All emit symbols, for use by symbolic debuggers. +The symbols are not entered in @code{as}' hash table: they +cannot be referenced elsewhere in the source file. +Up to five fields are required: +@table @var +@item string +This is the symbol's name. It may contain any character except @samp{\000}, +so is more general than ordinary symbol names. Some debuggers used to +code arbitrarily complex structures into symbol names using this technique. +@item type +An absolute expression. The symbol's @code{n_type} is set to the low 8 +bits of this expression. 
+Any bit pattern is permitted, but @code{ld} and debuggers will choke on +silly bit patterns. +@item other +An absolute expression. +The symbol's @code{n_other} is set to the low 8 bits of this expression. +@item desc +An absolute expression. +The symbol's @code{n_desc} is set to the low 16 bits of this expression. +@item value +An absolute expression which becomes the symbol's @code{n_value}. +@end table + +If a warning is detected while reading the @code{.stab@dots{}} +statement the symbol has probably already been created and you will +get a half-formed symbol in your object file. This is compatible +with earlier assemblers (!) + +.stabd @var{type} , @var{other} , @var{desc} + +The ``name'' of the symbol generated is not even an empty string. +It is a null pointer, for compatibility. Older assemblers used a +null pointer so they didn't waste space in object files with empty +strings. + +The symbol's @code{n_value} is set to the location counter, +relocatably. When your program is linked, the value of this symbol +will be where the location counter was when the @code{.stabd} was +assembled. + +.stabn @var{type} , @var{other} , @var{desc} , @var{value} + +The name of the symbol is set to the empty string @code{""}. + +.stabs @var{string} , @var{type} , @var{other} , @var{desc} , @var{value} + +@node Text, Word, Stab, PseudoOps +@section .text @var{subsegment} +Tells @code{as} to assemble the following statements onto the end of +the text subsegment numbered @var{subsegment}, which is an absolute +expression. If @var{subsegment} is omitted, subsegment number zero +is used. + +@node Word, , Text, PseudoOps +@section .word @var{expressions} +On the Sparc, this produces 32-bit numbers instead of 16-bit ones. +This expects zero or more @var{expressions}, of any segment, +separated by commas. For each expression, it emits a 16-bit number that +will, at run time, be the value of that expression.
The byte order +of the expression depends on what kind of computer will run the +program. + +@section Deprecated Directives +One day these directives won't work. +They are included for compatibility with older assemblers. +@table @t +@item .abort +@item .file +@item .line +@end table + +@node MachineDependent, Maintenance, PseudoOps, top +@chapter Machine Dependent Features +@section Vax +@subsection Options + +The Vax version of @code{as} accepts any of the following options, +gives a warning message that the option was ignored and proceeds. +These options are for compatibility with scripts designed for other +people's assemblers. + +@table @asis +@item @kbd{-D} (Debug) +@itemx @kbd{-S} (Symbol Table) +@itemx @kbd{-T} (Token Trace) +These are obsolete options used to debug old assemblers. + +@item @kbd{-d} (Displacement size for JUMPs) +This option expects a number following the @kbd{-d}. Like options +that expect filenames, the number may immediately follow the +@kbd{-d} (old standard) or constitute the whole of the command line +argument that follows @kbd{-d} (GNU standard). + +@item @kbd{-V} (Virtualize Interpass Temporary File) +Some other assemblers use a temporary file. This option +commanded them to keep the information in active memory rather +than in a disk file. @code{as} always does this, so this +option is redundant. + +@item @kbd{-J} (JUMPify Longer Branches) +Many 32-bit computers permit a variety of branch instructions +to do the same job. Some of these instructions are short (and +fast) but have a limited range; others are long (and slow) but +can branch anywhere in virtual memory. Often there are 3 +flavors of branch: short, medium and long. Some other +assemblers would emit short and medium branches, unless told by +this option to emit short and long branches. + +@item @kbd{-t} (Temporary File Directory) +Some other assemblers may use a temporary file, and this option +takes a filename being the directory to site the temporary +file. 
@code{as} does not use a temporary disk file, so this +option makes no difference. @kbd{-t} needs exactly one +filename. +@end table + +The Vax version of the assembler accepts two options when +compiled for VMS. They are @kbd{-h}, and @kbd{-+}. The +@kbd{-h} option prevents @code{as} from modifying the +symbol-table entries for symbols that contain lowercase +characters (I think). The @kbd{-+} option causes @code{as} to +print warning messages if the FILENAME part of the object file, +or any symbol name is longer than 31 characters. The @kbd{-+} +option also inserts some code following the @samp{_main} +symbol so that the object file will be compatible with Vax-11 +"C". + +@subsection Floating Point +Conversion of flonums to floating point is correct, and +compatible with previous assemblers. Rounding is +towards zero if the remainder is exactly half the least significant bit. + +@code{D}, @code{F}, @code{G} and @code{H} floating point formats +are understood. + +Immediate floating literals (@i{e.g.} @samp{S`$6.9}) +are rendered correctly. Again, rounding is towards zero in the +boundary case. + +The @code{.float} directive produces @code{f} format numbers. +The @code{.double} directive produces @code{d} format numbers. + +@subsection Machine Directives +The Vax version of the assembler supports four directives for +generating Vax floating point constants. They are described in the +table below. + +@table @code +@item .dfloat +This expects zero or more flonums, separated by commas, and +assembles Vax @code{d} format 64-bit floating point constants. + +@item .ffloat +This expects zero or more flonums, separated by commas, and +assembles Vax @code{f} format 32-bit floating point constants. + +@item .gfloat +This expects zero or more flonums, separated by commas, and +assembles Vax @code{g} format 64-bit floating point constants.
+ +@item .hfloat +This expects zero or more flonums, separated by commas, and +assembles Vax @code{h} format 128-bit floating point constants. + +@end table + +@subsection Opcodes +All DEC mnemonics are supported. Beware that @code{case@dots{}} +instructions have exactly 3 operands. The dispatch table that +follows the @code{case@dots{}} instruction should be made with +@code{.word} statements. This is compatible with all Unix +assemblers we know of. + +@subsection Branch Improvement +Certain pseudo opcodes are permitted. They are for branch +instructions. They expand to the shortest branch instruction that +will reach the target. Generally these mnemonics are made by +substituting @samp{j} for @samp{b} at the start of a DEC mnemonic. +This feature is included both for compatibility and to help +compilers. If you don't need this feature, don't use these +opcodes. Here are the mnemonics, and the code they can expand into. + +@table @code +@item jbsb +@samp{Jsb} is already an instruction mnemonic, so we chose @samp{jbsb}. +@table @asis +@item (byte displacement) +@kbd{bsbb @dots{}} +@item (word displacement) +@kbd{bsbw @dots{}} +@item (long displacement) +@kbd{jsb @dots{}} +@end table +@item jbr +@itemx jr +Unconditional branch. +@table @asis +@item (byte displacement) +@kbd{brb @dots{}} +@item (word displacement) +@kbd{brw @dots{}} +@item (long displacement) +@kbd{jmp @dots{}} +@end table +@item j@var{COND} +@var{COND} may be any one of the conditional branches +@code{neq nequ eql eqlu gtr geq lss gtru lequ vc vs gequ cc lssu cs}. +@var{COND} may also be one of the bit tests +@code{bs bc bss bcs bsc bcc bssi bcci lbs lbc}. +@var{NOTCOND} is the opposite condition to @var{COND}. +@table @asis +@item (byte displacement) +@kbd{b@var{COND} @dots{}} +@item (word displacement) +@kbd{b@var{NOTCOND} foo ; brw @dots{} ; foo:} +@item (long displacement) +@kbd{b@var{NOTCOND} foo ; jmp @dots{} ; foo:} +@end table +@item jacb@var{X} +@var{X} may be one of @code{b d f g h l w}.
+@table @asis +@item (word displacement) +@kbd{@var{OPCODE} @dots{}} +@item (long displacement) +@kbd{@var{OPCODE} @dots{}, foo ; brb bar ; foo: jmp @dots{} ; bar:} +@end table +@item jaob@var{YYY} +@var{YYY} may be one of @code{lss leq}. +@item jsob@var{ZZZ} +@var{ZZZ} may be one of @code{geq gtr}. +@table @asis +@item (byte displacement) +@kbd{@var{OPCODE} @dots{}} +@item (word displacement) +@kbd{@var{OPCODE} @dots{}, foo ; brb bar ; foo: brw @var{destination} ; bar:} +@item (long displacement) +@kbd{@var{OPCODE} @dots{}, foo ; brb bar ; foo: jmp @var{destination} ; bar: } +@end table +@item aobleq +@itemx aoblss +@itemx sobgeq +@itemx sobgtr +@table @asis +@item (byte displacement) +@kbd{@var{OPCODE} @dots{}} +@item (word displacement) +@kbd{@var{OPCODE} @dots{}, foo ; brb bar ; foo: brw @var{destination} ; bar:} +@item (long displacement) +@kbd{@var{OPCODE} @dots{}, foo ; brb bar ; foo: jmp @var{destination} ; bar:} +@end table +@end table + +@subsection operands +The immediate character is @samp{$} for Unix compatibility, not +@samp{#} as DEC writes it. + +The indirect character is @samp{*} for Unix compatibility, not +@samp{@@} as DEC writes it. + +The displacement sizing character is @samp{`} (an accent grave) for +Unix compatibility, not @samp{^} as DEC writes it. The letter +preceding @samp{`} may have either case. @samp{G} is not +understood, but all other letters (@code{b i l s w}) are understood. + +Register names understood are @code{r0 r1 r2 @dots{} r15 ap fp sp +pc}. Any case of letters will do. + +For instance +@example +tstb *w`$4(r5) +@end example + +Any expression is permitted in an operand. Operands are comma +separated. + +@c There is some bug to do with recognizing expressions +@c in operands, but I forget what it is. It is +@c a syntax clash because () is used as an address mode +@c and to encapsulate sub-expressions. +@subsection Not Supported +Vax bit fields can not be assembled with @code{as}. 
Someone +can add the required code if they really need it. + +@section 680x0 +@subsection Options +The 680x0 version of @code{as} has two machine dependent options. +One shortens undefined references from 32 to 16 bits, while the +other is used to tell @code{as} what kind of machine it is +assembling for. + +You can use the @kbd{-l} option to shorten the size of references to +undefined symbols. If the @kbd{-l} option is not given, references +to undefined symbols will be a full long (32 bits) wide. (Since +@code{as} cannot know where these symbols will end up being, +@code{as} can only allocate space for the linker to fill in later. +Since @code{as} doesn't know how far away these symbols will be, it +allocates as much space as it can.) If this option is given, the +references will only be one word wide (16 bits). This may be useful +if you want the object file to be as small as possible, and you know +that the relevant symbols will be less than 17 bits away. + +The 680x0 version of @code{as} is usually used to assemble programs +for the Motorola MC68020 microprocessor. Occasionally it is used to +assemble programs for the mostly-similar-but-slightly-different +MC68000 or MC68010 microprocessors. You can give @code{as} the +options @samp{-m68000}, @samp{-mc68000}, @samp{-m68010}, +@samp{-mc68010}, @samp{-m68020}, and @samp{-mc68020} to tell it what +processor it should be assembling for. Unfortunately, these options +are almost entirely unused and untried. They may work, but nobody +has tested them much. + +@subsection Syntax + +The 680x0 version of @code{as} uses syntax similar to the Sun +assembler. Size modifiers are appended directly to the end of the +opcode without an intervening period. Thus, @samp{move.l} is +written @samp{movl}, etc. + +@c This is no longer true +@c Explicit size modifiers for branch instructions are ignored; @code{as} +@c automatically picks the smallest size that will reach the +@c destination.
+ +If @code{as} is compiled with SUN_ASM_SYNTAX defined, it will also +allow Sun-style local labels of the form @samp{1$} through @samp{9$}. + +In the following table @dfn{apc} stands for any of the address +registers (@samp{a0} through @samp{a7}), nothing, (@samp{}), the +Program Counter (@samp{pc}), or the zero-address relative to the +program counter (@samp{zpc}). + +The following addressing modes are understood: +@table @dfn +@item Immediate +@samp{#@var{digits}} + +@item Data Register +@samp{d0} through @samp{d7} + +@item Address Register +@samp{a0} through @samp{a7} + +@item Address Register Indirect +@samp{a0@@} through @samp{a7@@} + +@item Address Register Postincrement +@samp{a0@@+} through @samp{a7@@+} + +@item Address Register Predecrement +@samp{a0@@-} through @samp{a7@@-} + +@item Indirect Plus Offset +@samp{@var{apc}@@(@var{digits})} + +@item Index +@samp{@var{apc}@@(@var{digits},@var{register}:@var{size}:@var{scale})} +or @samp{@var{apc}@@(@var{register}:@var{size}:@var{scale})} + +@item Postindex +@samp{@var{apc}@@(@var{digits})@@(@var{digits},@var{register}:@var{size}:@var{scale})} +or @samp{@var{apc}@@(@var{digits})@@(@var{register}:@var{size}:@var{scale})} + +@item Preindex +@samp{@var{apc}@@(@var{digits},@var{register}:@var{size}:@var{scale})@@(@var{digits})} +or @samp{@var{apc}@@(@var{register}:@var{size}:@var{scale})@@(@var{digits})} + +@item Memory Indirect +@samp{@var{apc}@@(@var{digits})@@(@var{digits})} + +@item Absolute +@samp{@var{symbol}}, or @samp{@var{digits}}, or either of the above followed +by @samp{:b}, @samp{:w}, or @samp{:l}. +@end table + +@subsection Floating Point +The floating point code is not too well tested, and may have +subtle bugs in it. + +Packed decimal (P) format floating literals are not supported. +Feel free to add the code yourself. + +The floating point formats generated by directives are these. +@table @code +@item .float +@code{Single} precision floating point constants.
+@item .double +@code{Double} precision floating point constants. +@end table + +There is no directive to produce regions of memory holding +extended precision numbers; however, they can be used as +immediate operands to floating-point instructions. Adding a +directive to create extended precision numbers would not be +hard. Nobody has felt any burning need to do it. + +@subsection Machine Directives +In order to be compatible with the Sun assembler the 680x0 assembler +understands the following directives. +@table @code +@item .data1 +This directive is identical to a @code{.data 1} directive. +@item .data2 +This directive is identical to a @code{.data 2} directive. +@item .even +This directive is identical to a @code{.align 1} directive. +@c Is this true? does it work??? +@item .skip +This directive is identical to a @code{.space} directive. +@end table + +@subsection Opcodes +Danger: Several bugs have been found in the opcode table (and +fixed). More bugs may exist. Be careful when using obscure +instructions. + +The assembler automatically chooses the proper size for branch +instructions. However, most attempts to force a short displacement +will be honored. Branches that are forced to use a short +displacement will not be adjusted if the target is out of range. +Let The User Beware. + +The immediate character is @samp{#} for Sun compatibility. The +line-comment character is @samp{|}. If a @samp{#} appears at the +beginning of a line, it is treated as a comment unless it looks like +@samp{# line file}, in which case it is treated normally. + +@section 32x32 +@subsection Options +The 32x32 version of @code{as} accepts a @kbd{-m32032} option to +specify that it is compiling for a 32032 processor, or a +@kbd{-m32532} to specify that it is compiling for a 32532 processor. +The default (if neither is specified) is chosen when the assembler +is compiled. + +@subsection Syntax +I don't know anything about the 32x32 syntax assembled by +@code{as}.
Someone who understands the processor (I've never seen +one) and the possible syntaxes should write this section. + +@subsection Floating Point +The 32x32 uses IEEE floating point numbers, but @code{as} will only +create single or double precision values. I don't know if the 32x32 +understands extended precision numbers. + +@subsection Machine Directives +The 32x32 has no machine dependent directives. + +@section Sparc +@subsection Options +The Sparc has no machine dependent options. + +@subsection Syntax +I don't know anything about Sparc syntax. Someone who does +will have to write this section. + +@subsection Floating Point +The Sparc uses IEEE floating-point numbers. + +@subsection Machine Directives +The Sparc version of @code{as} supports the following additional +machine directives: + +@table @code +@item .common +This must be followed by a symbol name, a positive number, and +@code{"bss"}. This behaves somewhat like @code{.comm}, but the +syntax is different. + +@item .global +This is functionally identical to @code{.globl}. + +@item .half +This is functionally identical to @code{.short}. + +@item .proc +This directive is ignored. Any text following it on the same +line is also ignored. + +@item .reserve +This must be followed by a symbol name, a positive number, and +@code{"bss"}. This behaves somewhat like @code{.lcomm}, but the +syntax is different. + +@item .seg +This must be followed by @code{"text"}, @code{"data"}, or +@code{"data1"}. It behaves like @code{.text}, @code{.data}, or +@code{.data 1}. + +@item .skip +This is functionally identical to the .space directive. + +@item .word +On the Sparc, the .word directive produces 32 bit values, +instead of the 16 bit values it produces on every other machine. + +@end table + +@section Intel 80386 +@subsection Options +The 80386 has no machine dependent options.
+ +@subsection AT&T Syntax versus Intel Syntax +In order to maintain compatibility with the output of @code{GCC}, +@code{as} supports AT&T System V/386 assembler syntax. This is quite +different from Intel syntax. We mention these differences because +almost all 80386 documents use only Intel syntax. Notable differences +between the two syntaxes are: +@itemize @bullet +@item +AT&T immediate operands are preceded by @samp{$}; Intel immediate +operands are undelimited (Intel @samp{push 4} is AT&T @samp{pushl $4}). +AT&T register operands are preceded by @samp{%}; Intel register operands +are undelimited. AT&T absolute (as opposed to PC relative) jump/call +operands are prefixed by @samp{*}; they are undelimited in Intel syntax. + +@item +AT&T and Intel syntax use the opposite order for source and destination +operands. Intel @samp{add eax, 4} is @samp{addl $4, %eax}. The +@samp{source, dest} convention is maintained for compatibility with +previous Unix assemblers. + +@item +In AT&T syntax the size of memory operands is determined from the last +character of the opcode name. Opcode suffixes of @samp{b}, @samp{w}, +and @samp{l} specify byte (8-bit), word (16-bit), and long (32-bit) +memory references. Intel syntax accomplishes this by prefixing memory +operands (@emph{not} the opcodes themselves) with @samp{byte ptr}, +@samp{word ptr}, and @samp{dword ptr}. Thus, Intel @samp{mov al, byte +ptr @var{foo}} is @samp{movb @var{foo}, %al} in AT&T syntax. + +@item +Immediate form long jumps and calls are +@samp{lcall/ljmp $@var{segment}, $@var{offset}} in AT&T syntax; the +Intel syntax is +@samp{call/jmp far @var{segment}:@var{offset}}. Also, the far return +instruction +is @samp{lret $@var{stack-adjust}} in AT&T syntax; Intel syntax is +@samp{ret far @var{stack-adjust}}. + +@item +The AT&T assembler does not provide support for multiple segment +programs. Unix style systems expect all programs to be single segments.
+@end itemize + +@subsection Opcode Naming +Opcode names are suffixed with one character modifiers which specify the +size of operands. The letters @samp{b}, @samp{w}, and @samp{l} specify +byte, word, and long operands. If no suffix is specified by an +instruction and it contains no memory operands then @code{as} tries to +fill in the missing suffix based on the destination register operand +(the last one by convention). Thus, @samp{mov %ax, %bx} is equivalent +to @samp{movw %ax, %bx}; also, @samp{mov $1, %bx} is equivalent to +@samp{movw $1, %bx}. Note that this is incompatible with the AT&T Unix +assembler which assumes that a missing opcode suffix implies long +operand size. (This incompatibility does not affect compiler output +since compilers always explicitly specify the opcode suffix.) + +Almost all opcodes have the same names in AT&T and Intel format. There +are a few exceptions. The sign extend and zero extend instructions need +two sizes to specify them. They need a size to sign/zero extend +@emph{from} and a size to sign/zero extend @emph{to}. This is accomplished +by using two opcode suffixes in AT&T syntax. Base names for sign extend +and zero extend are @samp{movs@dots{}} and @samp{movz@dots{}} in AT&T +syntax (@samp{movsx} and @samp{movzx} in Intel syntax). The opcode +suffixes are tacked on to this base name, the @emph{from} suffix before +the @emph{to} suffix. Thus, @samp{movsbl %al, %edx} is AT&T syntax for +``move sign extend @emph{from} %al @emph{to} %edx.'' Possible suffixes, +thus, are @samp{bl} (from byte to long), @samp{bw} (from byte to word), +and @samp{wl} (from word to long).
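As a brief sketch of the two-suffix naming described above, the following lines show each @emph{from}/@emph{to} combination for both base names. The register choices are arbitrary and made up for illustration, and the @samp{#} comments assume the usual 80386 comment character.

```asm
	movsbw	%al, %dx	# sign extend from byte %al to word %dx
	movsbl	%al, %edx	# sign extend from byte %al to long %edx
	movswl	%ax, %ecx	# sign extend from word %ax to long %ecx
	movzbw	%bl, %cx	# zero extend from byte %bl to word %cx
	movzbl	%bl, %ecx	# zero extend from byte %bl to long %ecx
	movzwl	%bx, %esi	# zero extend from word %bx to long %esi
```

In each case the first suffix letter matches the size of the source register and the second matches the size of the destination register.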
+ +The Intel syntax conversion instructions +@itemize @bullet +@item +@samp{cbw} --- sign-extend byte in @samp{%al} to word in @samp{%ax}, +@item +@samp{cwde} --- sign-extend word in @samp{%ax} to long in @samp{%eax}, +@item +@samp{cwd} --- sign-extend word in @samp{%ax} to long in @samp{%dx:%ax}, +@item +@samp{cdq} --- sign-extend dword in @samp{%eax} to quad in @samp{%edx:%eax}, +@end itemize +are called @samp{cbtw}, @samp{cwtl}, @samp{cwtd}, and @samp{cltd} in +AT&T naming. @code{as} accepts either naming for these instructions. + +Far call/jump instructions are @samp{lcall} and @samp{ljmp} in +AT&T syntax, but are @samp{call far} and @samp{jmp far} in Intel +convention. + +@subsection Register Naming +Register operands are always prefixed with @samp{%}. The 80386 registers +consist of +@itemize @bullet +@item +the 8 32-bit registers @samp{%eax} (the accumulator), @samp{%ebx}, +@samp{%ecx}, @samp{%edx}, @samp{%edi}, @samp{%esi}, @samp{%ebp} (the +frame pointer), and @samp{%esp} (the stack pointer). + +@item +the 8 16-bit low-ends of these: @samp{%ax}, @samp{%bx}, @samp{%cx}, +@samp{%dx}, @samp{%di}, @samp{%si}, @samp{%bp}, and @samp{%sp}. + +@item +the 8 8-bit registers: @samp{%ah}, @samp{%al}, @samp{%bh}, +@samp{%bl}, @samp{%ch}, @samp{%cl}, @samp{%dh}, and @samp{%dl} (These +are the high-bytes and low-bytes of @samp{%ax}, @samp{%bx}, +@samp{%cx}, and @samp{%dx}) + +@item +the 6 segment registers @samp{%cs} (code segment), @samp{%ds} +(data segment), @samp{%ss} (stack segment), @samp{%es}, @samp{%fs}, +and @samp{%gs}. + +@item +the 3 processor control registers @samp{%cr0}, @samp{%cr2}, and +@samp{%cr3}. + +@item +the 6 debug registers @samp{%db0}, @samp{%db1}, @samp{%db2}, +@samp{%db3}, @samp{%db6}, and @samp{%db7}. + +@item +the 2 test registers @samp{%tr6} and @samp{%tr7}.
+ +@item +the 8 floating point register stack @samp{%st} or equivalently +@samp{%st(0)}, @samp{%st(1)}, @samp{%st(2)}, @samp{%st(3)}, +@samp{%st(4)}, @samp{%st(5)}, @samp{%st(6)}, and @samp{%st(7)}. +@end itemize + +@subsection Opcode Prefixes +Opcode prefixes are used to modify the following opcode. They are used +to repeat string instructions, to provide segment overrides, to perform +bus lock operations, and to give operand and address size (16-bit +operands are specified in an instruction by prefixing what would +normally be 32-bit operands with an ``operand size'' opcode prefix). +Opcode prefixes are usually given as single-line instructions with no +operands, and must directly precede the instruction they act upon. For +example, the @samp{scas} (scan string) instruction is repeated with: +@example + repne + scas +@end example + +Here is a list of opcode prefixes: +@itemize @bullet +@item +Segment override prefixes @samp{cs}, @samp{ds}, @samp{ss}, @samp{es}, +@samp{fs}, @samp{gs}. These are automatically added by +using the @var{segment}:@var{memory-operand} form for memory references. + +@item +Operand/Address size prefixes @samp{data16} and @samp{addr16} +change 32-bit operands/addresses into 16-bit operands/addresses. Note +that 16-bit addressing modes (i.e. 8086 and 80286 addressing modes) +are not supported (yet). + +@item +The bus lock prefix @samp{lock} inhibits interrupts during +execution of the instruction it precedes. (This is only valid with +certain instructions; see an 80386 manual for details). + +@item +The wait for coprocessor prefix @samp{wait} waits for the +coprocessor to complete the current instruction. This should never be +needed for the 80386/80387 combination. + +@item +The @samp{rep}, @samp{repe}, and @samp{repne} prefixes are added +to string instructions to make them repeat @samp{%ecx} times.
+@end itemize + +@subsection Memory References +An Intel syntax indirect memory reference of the form +@example +@var{segment}:[@var{base} + @var{index}*@var{scale} + @var{disp}] +@end example +is translated into the AT&T syntax +@example +@var{segment}:@var{disp}(@var{base}, @var{index}, @var{scale}) +@end example +where @var{base} and @var{index} are the optional 32-bit base and +index registers, @var{disp} is the optional displacement, and +@var{scale}, taking the values 1, 2, 4, and 8, multiplies @var{index} +to calculate the address of the operand. If no @var{scale} is +specified, @var{scale} is taken to be 1. @var{segment} specifies the +optional segment register for the memory operand, and may override the +default segment register (see an 80386 manual for segment register +defaults). Note that segment overrides in AT&T syntax @emph{must} +be preceded by a @samp{%}. If you specify a segment override which +coincides with the default segment register, @code{as} will @emph{not} +output any segment register override prefixes to assemble the given +instruction. Thus, segment overrides can be specified to emphasize which +segment register is used for a given memory operand. + +Here are some examples of Intel and AT&T style memory references: +@table @asis + +@item AT&T: @samp{-4(%ebp)}, Intel: @samp{[ebp - 4]} +@var{base} is @samp{%ebp}; @var{disp} is @samp{-4}. @var{segment} is +missing, and the default segment is used (@samp{%ss} for addressing with +@samp{%ebp} as the base register). @var{index}, @var{scale} are both missing. + +@item AT&T: @samp{foo(,%eax,4)}, Intel: @samp{[foo + eax*4]} +@var{index} is @samp{%eax} (scaled by a @var{scale} 4); @var{disp} is +@samp{foo}. All other fields are missing. The segment register here +defaults to @samp{%ds}. + +@item AT&T: @samp{foo(,1)}; Intel @samp{[foo]} +This uses the value pointed to by @samp{foo} as a memory operand.
+Note that @var{base} and @var{index} are both missing, but there is only +@emph{one} @samp{,}. This is a syntactic exception. + +@item AT&T: @samp{%gs:foo}; Intel @samp{gs:foo} +This selects the contents of the variable @samp{foo} with segment +register @var{segment} being @samp{%gs}. + +@end table + +Absolute (as opposed to PC relative) call and jump operands must be +prefixed with @samp{*}. If no @samp{*} is specified, @code{as} will +always choose PC relative addressing for jump/call labels. + +Any instruction that has a memory operand @emph{must} specify its size (byte, +word, or long) with an opcode suffix (@samp{b}, @samp{w}, or @samp{l}, +respectively). + +@subsection Handling of Jump Instructions +Jump instructions are always optimized to use the smallest possible +displacements. This is accomplished by using byte (8-bit) displacement +jumps whenever the target is sufficiently close. If a byte displacement +is insufficient a long (32-bit) displacement is used. We do not support +word (16-bit) displacement jumps (i.e. prefixing the jump instruction +with the @samp{addr16} opcode prefix), since the 80386 insists upon masking +@samp{%eip} to 16 bits after the word displacement is added. + +Note that the @samp{jcxz}, @samp{jecxz}, @samp{loop}, @samp{loopz}, +@samp{loope}, @samp{loopnz} and @samp{loopne} instructions only come in +byte displacements, so that it is possible that use of these +instructions (@code{GCC} does not use them) will cause the assembler to +print an error message (and generate incorrect code). The AT&T 80386 +assembler tries to get around this problem by expanding @samp{jcxz foo} to +@example + jcxz cx_zero + jmp cx_nonzero +cx_zero: jmp foo +cx_nonzero: +@end example + +@subsection Floating Point +All 80387 floating point types except packed BCD are supported. +(BCD support may be added without much difficulty). 
These data +types are 16-, 32-, and 64- bit integers, and single (32-bit), +double (64-bit), and extended (80-bit) precision floating point. +Each supported type has an opcode suffix and a constructor +associated with it. Opcode suffixes specify operand's data +types. Constructors build these data types into memory. + +@itemize @bullet +@item +Floating point constructors are @samp{.float} or @samp{.single}, +@samp{.double}, and @samp{.tfloat} for 32-, 64-, and 80-bit formats. +These correspond to opcode suffixes @samp{s}, @samp{l}, and @samp{t}. +Note that @samp{t} stands for temporary real, and that the 80387 only supports +this format via the @samp{fldt} (load temporary real to stack top) and +@samp{fstpt} (store temporary real and pop stack) instructions. + +@item +Integer constructors are @samp{.word}, @samp{.long} or @samp{.int}, and +@samp{.quad} for the 16-, 32-, and 64-bit integer formats. The corresponding +opcode suffixes are @samp{s} (single), @samp{l} (long), and @samp{q} +(quad). As with the temporary real format the 64-bit @samp{q} format is +only present in the @samp{fildq} (load quad integer to stack top) and +@samp{fistpq} (store quad integer and pop stack) instructions. +@end itemize + +Register to register operations do not require opcode suffixes, +so that @samp{fst %st, %st(1)} is equivalent to @samp{fstl %st, %st(1)}. + +Since the 80387 automatically synchronizes with the 80386 @samp{fwait} +instructions are almost never needed (this is not the case for the +80286/80287 and 8086/8087 combinations). Therefore, @code{as} suppresses +the @samp{fwait} instruction whenever it is implicitly selected by one +of the @samp{fn@dots{}} instructions. For example, @samp{fsave} and +@samp{fnsave} are treated identically. In general, all the @samp{fn@dots{}} +instructions are made equivalent to @samp{f@dots{}} instructions. If +@samp{fwait} is desired it must be explicitly coded.
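The constructors and opcode suffixes described earlier in this subsection might be combined as in the sketch below. The labels (@samp{x_single}, @samp{n_quad}, and so on) and the constant values are invented for illustration, and the @samp{#} comments assume the usual 80386 comment character; treat this as an example of the naming scheme, not as tested output.

```asm
	.data
x_single:	.float	1.5	# 32-bit single, suffix s
x_double:	.double	1.5	# 64-bit double, suffix l
x_ext:		.tfloat	1.5	# 80-bit temporary real, suffix t
n_word:		.word	7	# 16-bit integer, suffix s
n_long:		.long	7	# 32-bit integer, suffix l
n_quad:		.quad	7	# 64-bit integer, suffix q

	.text
	flds	x_single	# load single to stack top
	fldl	x_double	# load double to stack top
	fldt	x_ext		# load temporary real to stack top
	filds	n_word		# load 16-bit integer to stack top
	fildl	n_long		# load 32-bit integer to stack top
	fildq	n_quad		# load quad integer to stack top
	fstpt	x_ext		# store temporary real and pop stack
	fistpq	n_quad		# store quad integer and pop stack
```

The same suffix letter can name both a floating point and an integer size (@samp{s} and @samp{l}); the @samp{fi@dots{}} opcode names are what select the integer interpretation.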
+ +@subsection Notes +There is some trickery concerning the @samp{mul} and @samp{imul} +instructions that deserves mention. The 16-, 32-, and 64-bit expanding +multiplies (base opcode @samp{0xf6}; extension 4 for @samp{mul} and 5 +for @samp{imul}) can be output only in the one operand form. Thus, +@samp{imul %ebx, %eax} does @emph{not} select the expanding multiply; +the expanding multiply would clobber the @samp{%edx} register, and this +would confuse @code{GCC} output. Use @samp{imul %ebx} to get the +64-bit product in @samp{%edx:%eax}. + +We have added a two operand form of @samp{imul} when the first operand +is an immediate mode expression and the second operand is a register. +This is just a shorthand, so that, multiplying @samp{%eax} by 69, for +example, can be done with @samp{imul $69, %eax} rather than @samp{imul +$69, %eax, %eax}. + +@node Maintenance, Retargeting, MachineDependent, top +@chapter Maintaining the Assembler +[[this chapter is still being built]] + +@section Design +We had these goals, in descending priority: +@table @b +@item Accuracy. +For every program composed by a compiler, @code{as} should emit +``correct'' code. This leaves some latitude in choosing addressing +modes, order of @code{relocation_info} structures in the object +file, @i{etc}. + +@item Speed, for usual case. +By far the most common use of @code{as} will be assembling compiler +emissions. + +@item Upward compatibility for existing assembler code. +Well @dots{} we don't support Vax bit fields but everything else +seems to be upward compatible. + +@item Readability. +The code should be maintainable with few surprises. (JF: ha!) + +@end table + +We assumed that disk I/O was slow and expensive while memory was +fast and access to memory was cheap. We expect the in-memory data +structures to be less than 10 times the size of the emitted object +file. (Contrast this with the C compiler where in-memory structures +might be 100 times object file size!) 
+This suggests: +@itemize @bullet +@item +Try to read the source file from disk only one time. For other +reasons, we keep large chunks of the source file in memory during +assembly so this is not a problem. Also the assembly algorithm +should only scan the source text once if the compiler composed the +text according to a few simple rules. +@item +Emit the object code bytes only once. Don't store values and then +backpatch later. +@item +Build the object file in memory and do direct writes to disk of +large buffers. +@end itemize + +RMS suggested a one-pass algorithm which seems to work well. By not +parsing text during a second pass considerable time is saved on +large programs (@i{e.g.} the sort of C program @code{yacc} would +emit). + +It happened that the data structures needed to emit relocation +information to the object file were neatly subsumed into the data +structures that do backpatching of addresses after pass 1. + +Many of the functions began life as re-usable modules, loosely +connected. RMS changed this to gain speed. For example, input +parsing routines which used to work on pre-sanitized strings now +must parse raw data. Hence they have to import knowledge of the +assemblers' comment conventions @i{etc}. + +@section Deprecated Feature(?)s +We have stopped supporting some features: +@itemize @bullet +@item +@code{.org} statements must have @b{defined} expressions. +@item +Vax Bit fields (@kbd{:} operator) are entirely unsupported. +@end itemize + +It might be a good idea to not support these features in a future release: +@itemize @bullet +@item +@kbd{#} should begin a comment, even in column 1. +@item +Why support the logical line & file concept any more? +@item +Subsegments are a good candidate for flushing. +Depends on which compilers need them I guess. +@end itemize + +@section Bugs, Ideas, Further Work +Clearly the major improvement is DON'T USE A TEXT-READING +ASSEMBLER for the back end of a compiler. 
It is much faster to +interpret binary gobbledygook from a compiler's tables than to +ask the compiler to write out human-readable code just so the +assembler can parse it back to binary. + +Assuming you use @code{as} for human-written programs, here are +some ideas: +@itemize @bullet +@item +Document (here) @code{APP}. +@item +Take advantage of knowing no spaces except after opcode +to speed up @code{as}. (Modify @code{app.c} to flush useless spaces: +only keep space/tabs at the beginning of a line or between 2 +symbols.) +@item +Put pointers in this documentation to @file{a.out} documentation. +@item +Split the assembler into parts so it can gobble direct binary +from @i{e.g.} @code{cc}. It is silly for @code{cc} to compose text +just so @code{as} can parse it back to binary. +@item +Rewrite hash functions: I want a more modular, faster library. +@item +Clean up LOTS of code. +@item +Include all the non-@file{.c} files in the maintenance chapter. +@item +Document flonums. +@item +Implement flonum short literals. +@item +Change all talk of expression operands to expression quantities, +or perhaps to expression primaries. +@item +Implement pass 2. +@item +Whenever a @code{.text} or @code{.data} statement is seen, we close +off the current frag with an imaginary @code{.fill 0}. This is +because we only have one obstack for frags, and we can't grow new +frags for a new subsegment, then go back to the old subsegment and +append bytes to the old frag. All this nonsense goes away if we +give each subsegment its own obstack. It makes code simpler in +about 10 places, but nobody has bothered to do it because C compiler +output rarely changes subsegments (compared to ending frags with +relaxable addresses, which is common). +@end itemize + +@section Sources +@c The following files in the @file{as} directory +@c are symbolic links to other files, of +@c the same name, in a different directory. 
+@c @itemize @bullet +@c @item +@c @file{atof_generic.c} +@c @item +@c @file{atof_vax.c} +@c @item +@c @file{flonum_const.c} +@c @item +@c @file{flonum_copy.c} +@c @item +@c @file{flonum_get.c} +@c @item +@c @file{flonum_multip.c} +@c @item +@c @file{flonum_normal.c} +@c @item +@c @file{flonum_print.c} +@c @end itemize + +Here is a list of the source files in the @file{as} directory. + +@table @file +@item app.c +This contains the pre-processing phase, which deletes comments, +handles whitespace, etc. This was recently re-written, since app +used to be a separate program, but RMS wanted it to be inline. + +@item append.c +This is a subroutine to append a string to another string returning a +pointer just after the last @code{char} appended. (JF: All these +little routines should probably all be put in one file.) + +@item as.c +Here you will find the main program of the assembler @code{as}. + +@item expr.c +This is a branch office of @file{read.c}. This understands +expressions, primaries. Inside @code{as}, primaries are called +(expression) @i{operands}. This is confusing, because we also talk +(elsewhere) about instruction @i{operands}. Also, expression +operands are called @i{quantities} explicitly to avoid confusion +with instruction operands. What a mess. + +@item frags.c +This implements the @b{frag} concept. Without frags, finding the +right size for branch instructions would be a lot harder. + +@item hash.c +This contains the symbol table, opcode table @i{etc.} hashing +functions. + +@item hex_value.c +This is a table of values of digits, for use in atoi() type +functions. Could probably be flushed by using calls to strtol(), or +something similar. + +@item input-file.c +This contains Operating system dependent source file reading +routines. Since error messages often say where we are in reading +the source file, they live here too. Since @code{as} is intended to +run under GNU and Unix only, this might be worth flushing. 
Anyway, +almost all C compilers support stdio. + +@item input-scrub.c +This deals with calling the pre-processor (if needed) and feeding the +chunks back to the rest of the assembler the right way. + +@item messages.c +This contains operating system independent parts of fatal and +warning message reporting. See @file{append.c} above. + +@item output-file.c +This contains operating system dependent functions that write an +object file for @code{as}. See @file{input-file.c} above. + +@item read.c +This implements all the directives of @code{as}. This also deals +with passing input lines to the machine dependent part of the +assembler. + +@item strstr.c +This is a C library function that isn't in most C libraries yet. +See @file{append.c} above. + +@item subsegs.c +This implements subsegments. + +@item symbols.c +This implements symbols. + +@item write.c +This contains the code to perform relaxation, and to write out +the object file. It is mostly operating system independent, but +different OSes have different object file formats in any case. + +@item xmalloc.c +This implements @code{malloc()} or bust. See @file{append.c} above. + +@item xrealloc.c +This implements @code{realloc()} or bust. See @file{append.c} above. + +@item atof-generic.c +The following files were taken from a machine-independent subroutine +library for manipulating floating point numbers and very large +integers. + +@file{atof-generic.c} turns a string into a flonum internal format +floating-point number. + +@item flonum-const.c +This contains some potentially useful floating point numbers in +flonum format. + +@item flonum-copy.c +This copies a flonum. + +@item flonum-multip.c +This multiplies two flonums together. + +@item bignum-copy.c +This copies a bignum. + +@end table + +Here is a table of all the machine-specific files (this includes +both source and header files). Typically, there is a +@var{machine}.c file, a @var{machine}-opcode.h file, and an +atof-@var{machine}.c file. 
The @var{machine}-opcode.h file should +be identical to the one used by GDB (which uses it for disassembly). + +@table @file + +@item atof-ieee.c +This contains code to turn a flonum into an IEEE literal constant. +This is used by the 680x0, 32x32, sparc, and i386 versions of @code{as}. + +@item i386-opcode.h +This is the opcode-table for the i386 version of the assembler. + +@item i386.c +This contains all the code for the i386 version of the assembler. + +@item i386.h +This defines constants and macros used by the i386 version of the assembler. + +@item m-generic.h +Generic 68020 header file. To be linked to m68k.h on a +non-sun3, non-hpux system. + +@item m-sun2.h +68010 header file for Sun2 workstations. Not well tested. To be linked +to m68k.h on a sun2. (See also @samp{-DSUN_ASM_SYNTAX} in the +@file{Makefile}.) + +@item m-sun3.h +68020 header file for Sun3 workstations. To be linked to m68k.h before +compiling on a Sun3 system. (See also @samp{-DSUN_ASM_SYNTAX} in the +@file{Makefile}.) + +@item m-hpux.h +68020 header file for a HPUX (system 5?) box. Which box, which +version of HPUX, etc? I don't know. + +@item m68k.h +A hard- or symbolic- link to one of @file{m-generic.h}, +@file{m-hpux.h} or @file{m-sun3.h} depending on which kind of +680x0 you are assembling for. (See also @samp{-DSUN_ASM_SYNTAX} in the +@file{Makefile}.) + +@item m68k-opcode.h +Opcode table for 68020. This is now a link to the opcode table +in the @code{GDB} source directory. + +@item m68k.c +All the mc680x0 code, in one huge, slow-to-compile file. + +@item ns32k.c +This contains the code for the ns32032/ns32532 version of the +assembler. + +@item ns32k-opcode.h +This contains the opcode table for the ns32032/ns32532 version +of the assembler. + +@item vax-inst.h +Vax specific file for describing Vax operands and other Vax-ish things. + +@item vax-opcode.h +Vax opcode table. + +@item vax.c +Vax specific parts of @code{as}. 
Also includes the former files +@file{vax-ins-parse.c}, @file{vax-reg-parse.c} and @file{vip-op.c}. + +@item atof-vax.c +Turns a flonum into a Vax constant. + +@item vms.c +This file contains the special code needed to put out a VMS +style object file for the Vax. + +@end table + +Here is a list of the header files in the source directory. +(Warning: This section may not be very accurate. I didn't +write the header files; I just report them.) Also note that I +think many of these header files could be cleaned up or +eliminated. + +@table @file + +@item a.out.h +This describes the structures used to create the binary header data +inside the object file. Perhaps we should use the one in +@file{/usr/include}? + +@item as.h +This defines all the globally useful things, and pulls in <stdio.h> +and <assert.h>. + +@item bignum.h +This defines macros useful for dealing with bignums. + +@item expr.h +Structure and macros for dealing with expression() + +@item flonum.h +This defines the structure for dealing with floating point +numbers. It #includes @file{bignum.h}. + +@item frags.h +This contains a macro for appending a byte to the current frag. + +@item hash.h +Structures and function definitions for the hashing functions. + +@item input-file.h +Function headers for the input-file.c functions. + +@item md.h +Structures and function headers for things defined in the +machine dependent part of the assembler. + +@item obstack.h +This is the GNU systemwide include file for manipulating obstacks. +Since nobody is running under real GNU yet, we include this file. + +@item read.h +Macros and function headers for reading in source files. + +@item struct-symbol.h +Structure definition and macros for dealing with the gas +internal form of a symbol. + +@item subsegs.h +Structure definition for dealing with the numbered subsegments +of the text and data segments. + +@item symbols.h +Macros and function headers for dealing with symbols. 
+ +@item write.h +Structure for doing segment fixups. +@end table + +@comment ~subsection Test Directory +@comment (Note: The test directory seems to have disappeared somewhere +@comment along the line. If you want it, you'll probably have to find a +@comment REALLY OLD dump tape~dots{}) +@comment +@comment The ~file{test/} directory is used for regression testing. +@comment After you modify ~code{as}, you can get a quick go/nogo +@comment confidence test by running the new ~code{as} over the source +@comment files in this directory. You use a shell script ~file{test/do}. +@comment +@comment The tests in this suite are evolving. They are not comprehensive. +@comment They have, however, caught hundreds of bugs early in the debugging +@comment cycle of ~code{as}. Most test statements in this suite were naturally +@comment selected: they were used to demonstrate actual ~code{as} bugs rather +@comment than being written ~i{a priori}. +@comment +@comment Another testing suggestion: over 30 bugs have been found simply by +@comment running examples from this manual through ~code{as}. +@comment Some examples in this manual are selected +@comment to distinguish boundary conditions; they are good for testing ~code{as}. +@comment +@comment ~subsubsection Regression Testing +@comment Each regression test involves assembling a file and comparing the +@comment actual output of ~code{as} to ``known good'' output files. Both +@comment the object file and the error/warning message file (stderr) are +@comment inspected. Optionally ~code{as}' exit status may be checked. +@comment Discrepancies are reported. Each discrepancy means either that +@comment you broke some part of ~code{as} or that the ``known good'' files +@comment are now out of date and should be changed to reflect the new +@comment definition of ``good''. +@comment +@comment Each regression test lives in its own directory, in a tree +@comment rooted in the directory ~file{test/}. 
Each such directory +@comment has a name ending in ~file{.ret}, where `ret' stands for +@comment REgression Test. The ~file{.ret} ending allows ~code{find +@comment (1)} to find all regression tests in the tree, without +@comment needing to list them explicitly. +@comment +@comment Any ~file{.ret} directory must contain a file called +@comment ~file{input} which is the source file to assemble. During +@comment testing an object file ~file{output} is created, as well as +@comment a file ~file{stdouterr} which contains the output to both +@comment stdout and stderr. If there is a file ~file{output.good} in +@comment the directory, and if ~file{output} contains exactly the +@comment same data as ~file{output.good}, the file ~file{output} is +@comment deleted. Likewise ~file{stdouterr} is removed if it exactly +@comment matches a file ~file{stdouterr.good}. If file +@comment ~file{status.good} is present, containing a decimal number +@comment before a newline, the exit status of ~code{as} is compared +@comment to this number. If the status numbers are not equal, a file +@comment ~file{status} is written to the directory, containing the +@comment actual status as a decimal number followed by newline. +@comment +@comment Should any of the ~file{*.good} files fail to match their corresponding +@comment actual files, this is noted by a 1-line message on the screen during +@comment the regression test, and you can use ~code{find (1)} to find any +@comment files named ~file{status}, ~file{output} or ~file{stdouterr}. +@comment +@node Retargeting, , Maintenance, top +@chapter Teaching the Assembler about a New Machine + +This chapter describes the steps required in order to make the +assembler work with another machine's assembly language. This +chapter is not complete, and only describes the steps in the +broadest terms. You should look at the source for the +currently supported machines in order to discover some of the +details that aren't mentioned here. 
+ +You should create a new file called @file{@var{machine}.c}, and +add the appropriate lines to the file @file{Makefile} so that +you can compile your new version of the assembler. This should +be straightforward; simply add lines similar to the ones there +for the four current versions of the assembler. + +If you want to be compatible with GDB (and the current +machine-dependent versions of the assembler), you should create +a file called @file{@var{machine}-opcode.h} which should +contain all the information about the names of the machine +instructions, their opcodes, and what addressing modes they +support. If you do this right, the assembler and GDB can share +this file, and you'll only have to write it once. Note that +while you're writing @code{as}, you may want to use an +independent program (if you have access to one) to make sure +that @code{as} is emitting the correct bytes. Since @code{as} +and @code{GDB} share the opcode table, an incorrect opcode +table entry may make invalid bytes look OK when you disassemble +them with @code{GDB}. + +@section Functions You will Have to Write + +Your file @file{@var{machine}.c} should contain definitions for +the following functions and variables. It will need to include +some header files in order to use some of the structures +defined in the machine-independent part of the assembler. The +needed header files are mentioned in the descriptions of the +functions that will need them. + +@table @code + +@item long omagic; +This long integer holds the value to place at the beginning of +the @file{a.out} file. It is usually @samp{OMAGIC}, except on +machines that store additional information in the magic-number. + +@item char comment_chars[]; +This character array holds the values of the characters that +start a comment anywhere in a line. Comments are stripped off +automatically by the machine independent part of the +assembler. 
Note that the @samp{/*} will always start a +comment, and that only @samp{*/} will end a comment started by +@samp{/*}. + +@item char line_comment_chars[]; +This character array holds the values of the chars that start a +comment only if they are the first (non-whitespace) character +on a line. If the character @samp{#} does not appear in this +list, you may get unexpected results. (Various +machine-independent parts of the assembler treat the comments +@samp{#APP} and @samp{#NO_APP} specially, and assume that lines +that start with @samp{#} are comments.) + +@item char EXP_CHARS[]; +This character array holds the letters that can separate the +mantissa and the exponent of a floating point number. Typical +values are @samp{e} and @samp{E}. + +@item char FLT_CHARS[]; +This character array holds the letters that--when they appear +immediately after a leading zero--indicate that a number is a +floating-point number. (Sort of how 0x indicates that a +hexadecimal number follows.) + +@item pseudo_typeS md_pseudo_table[]; +(@var{pseudo_typeS} is defined in @file{md.h}) +This array contains a list of the machine-dependent directives +the assembler must support. It contains the name of each +pseudo op (without the leading @samp{.}), a pointer to a +function to be called when that directive is encountered, and +an integer argument to be passed to that function. + +@item void md_begin(void) +This function is called as part of the assembler's +initialization. It should do any initialization required by +any of your other routines. + +@item int md_parse_option(char **optionPTR, int *argcPTR, char ***argvPTR) +This routine is called once for each option on the command line +that the machine-independent part of @code{as} does not +understand. This function should return non-zero if the option +pointed to by @var{optionPTR} is a valid option. If it is not +a valid option, this routine should return zero. 
The variables +@var{argcPTR} and @var{argvPTR} are provided in case the option +requires a filename or something similar as an argument. If +the option is multi-character, @var{optionPTR} should be +advanced past the end of the option, otherwise every letter in +the option will be treated as a separate single-character +option. + +@item void md_assemble(char *string) +This routine is called for every machine-dependent +non-directive line in the source file. It does all the real +work involved in reading the opcode, parsing the operands, +etc. @var{string} is a pointer to a null-terminated string +that comprises the input line, with all excess whitespace and +comments removed. + +@item void md_number_to_chars(char *outputPTR,long value,int nbytes) +This routine is called to turn a C long int, short int, or char +into the series of bytes that represents that number on the +target machine. @var{outputPTR} points to an array where the +result should be stored; @var{value} is the value to store; and +@var{nbytes} is the number of bytes in @var{value} that should be +stored. + +@item void md_number_to_imm(char *outputPTR,long value,int nbytes) +This routine is called to turn a C long int, short int, or char +into the series of bytes that represent an immediate value on +the target machine. It is identical to the function @code{md_number_to_chars}, +except on NS32K machines.@refill + +@item void md_number_to_disp(char *outputPTR,long value,int nbytes) +This routine is called to turn a C long int, short int, or char +into the series of bytes that represent a displacement value on +the target machine. It is identical to the function @code{md_number_to_chars}, +except on NS32K machines.@refill + +@item void md_number_to_field(char *outputPTR,long value,int nbytes) +This routine is identical to @code{md_number_to_chars}, +except on NS32K machines. 
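As a rough illustration of the number conversion routines described above, here is a minimal sketch of @code{md_number_to_chars} for a hypothetical little-endian target. The byte order is an assumption made for the example, not something this manual specifies; a big-endian target would store the same bytes in the opposite order, and the routines in the existing @file{@var{machine}.c} files may well be written differently.

```c
/* Sketch only: assumes a little-endian target, so the least
   significant byte of VALUE is stored first.  */
void md_number_to_chars(char *outputPTR, long value, int nbytes)
{
    unsigned char *out = (unsigned char *) outputPTR;
    unsigned long bits = (unsigned long) value;
    int i;

    for (i = 0; i < nbytes; i++) {
        out[i] = (unsigned char) (bits & 0xff); /* low byte first */
        bits >>= 8;                             /* move to the next byte */
    }
}
```

On a target where immediates, displacements and fields use the same byte order, @code{md_number_to_imm}, @code{md_number_to_disp} and @code{md_number_to_field} could simply call this routine.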
+ +@item void md_ri_to_chars(struct relocation_info *riPTR,ri) +(@code{struct relocation_info} is defined in @file{a.out.h}) +This routine emits the relocation info in @var{ri} +in the appropriate bit-pattern for the target machine. +The result should be stored in the location pointed +to by @var{riPTR}. This routine may be a no-op unless you are +attempting to do cross-assembly. + +@item char *md_atof(char type,char *outputPTR,int *sizePTR) +This routine turns a series of digits into the appropriate +internal representation for a floating-point number. +@var{type} is a character from @var{FLT_CHARS[]} that describes +what kind of floating point number is wanted; @var{outputPTR} +is a pointer to an array that the result should be stored in; +and @var{sizePTR} is a pointer to an integer where the size (in +bytes) of the result should be stored. This routine should +return an error message, or an empty string (not (char *)0) for +success. + +@item int md_short_jump_size; +This variable holds the (maximum) size in bytes of a short (16 +bit or so) jump created by @code{md_create_short_jump()}. This +variable is used as part of the broken-word feature, and isn't +needed if the assembler is compiled with +@samp{-DWORKING_DOT_WORD}. + +@item int md_long_jump_size; +This variable holds the (maximum) size in bytes of a long (32 +bit or so) jump created by @code{md_create_long_jump()}. This +variable is used as part of the broken-word feature, and isn't +needed if the assembler is compiled with +@samp{-DWORKING_DOT_WORD}. + +@item void md_create_short_jump(char *resultPTR,long from_addr, +@code{long to_addr,fragS *frag,symbolS *to_symbol)} +This function emits a jump from @var{from_addr} to @var{to_addr} in +the array of bytes pointed to by @var{resultPTR}. If this creates a +type of jump that must be relocated, this function should call +@code{fix_new()} with @var{frag} and @var{to_symbol}. 
The jump +emitted by this function may be smaller than @var{md_short_jump_size}, +but it must never create a larger one. +(If it creates a smaller jump, the extra bytes of memory will not be +used.) This function is used as part of the broken-word feature, +and isn't needed if the assembler is compiled with +@samp{-DWORKING_DOT_WORD}.@refill + +@item void md_create_long_jump(char *ptr,long from_addr, +@code{long to_addr,fragS *frag,symbolS *to_symbol)} +This function is similar to the previous function, +@code{md_create_short_jump()}, except that it creates a long +jump instead of a short one. This function is used as part of +the broken-word feature, and isn't needed if the assembler is +compiled with @samp{-DWORKING_DOT_WORD}. + +@item int md_estimate_size_before_relax(fragS *fragPTR,int segment_type) +This function does the initial setting up for relaxation. This +includes forcing references to still-undefined symbols to the +appropriate addressing modes. + +@item relax_typeS md_relax_table[]; +(relax_typeS is defined in md.h) +This array describes the various machine dependent states a +frag may be in before relaxation. You will need one group of +entries for each type of addressing mode you intend to relax. + +@item void md_convert_frag(fragS *fragPTR) +(@var{fragS} is defined in @file{as.h}) +This routine does the required cleanup after relaxation. +Relaxation has changed the type of the frag to a type that can +reach its destination. This function should adjust the opcode +of the frag to use the appropriate addressing mode. +@var{fragPTR} points to the frag to clean up. + +@item void md_end(void) +This function is called just before the assembler exits. It +need not free up memory unless the operating system doesn't do +it automatically on exit. (In which case you'll also have to +track down all the other places where the assembler allocates +space but never frees it.) 
+ +@end table + +@section External Variables You will Need to Use + +You will need to refer to or change the following external variables +from within the machine-dependent part of the assembler. + +@table @code +@item extern char flagseen[]; +This array holds non-zero values in locations corresponding to +the options that were on the command line. Thus, if the +assembler was called with @samp{-W}, @var{flagseen['W']} would +be non-zero. + +@item extern fragS *frag_now; +This pointer points to the current frag--the frag that bytes +are currently being added to. If nothing else, you will need +to pass it as an argument to various machine-independent +functions. It is maintained automatically by the +frag-manipulating functions; you should never have to change it +yourself. + +@item extern LITTLENUM_TYPE generic_bignum[]; +(@var{LITTLENUM_TYPE} is defined in @file{bignum.h}.) +This is where @dfn{bignums}--numbers larger than 32 bits--are +returned when they are encountered in an expression. You will +need to use this if you need to implement directives (or +anything else) that must deal with these large numbers. +@code{Bignums} are of @code{segT} @code{SEG_BIG} (defined in +@file{as.h}), and have a positive @code{X_add_number}. The +@code{X_add_number} of a @code{bignum} is the number of +@code{LITTLENUMS} in @var{generic_bignum} that the number takes +up. + +@item extern FLONUM_TYPE generic_floating_point_number; +(@var{FLONUM_TYPE} is defined in @file{flonum.h}.) +This is where @dfn{flonums}--floating-point numbers within +expressions--are returned. @code{Flonums} are of @code{segT} +@code{SEG_BIG}, and have a negative @code{X_add_number}. +@code{Flonums} are returned in a generic format. You will have +to write a routine to turn this generic format into the +appropriate floating-point format for your machine. + +@item extern int need_pass_2; +If this variable is non-zero, the assembler has encountered an +expression that cannot be assembled in a single pass. 
Since +the second pass isn't implemented, this flag means that the +assembler is punting, and is only looking for additional syntax +errors. (Or something like that.) + +@item extern segT now_seg; +This variable holds the value of the segment the assembler is +currently assembling into. + +@end table + +@section External Functions You Will Need + +You will find the following external functions useful (or +indispensable) when you're writing the machine-dependent part +of the assembler. + +@table @code + +@item char *frag_more(int bytes) +This function allocates @var{bytes} more bytes in the current +frag (or starts a new frag, if it can't expand the current frag +any more) for you to store some object-file bytes in. It +returns a pointer to the bytes, ready for you to store data in. + +@item void fix_new(fragS *frag, int where, short size, symbolS *add_symbol, symbolS *sub_symbol, long offset, int pcrel) +This function stores a relocation fixup to be acted on later. +@var{frag} points to the frag the relocation belongs in; +@var{where} is the location within the frag where the relocation begins; +@var{size} is the size of the relocation, and is usually 1 (a single byte), + 2 (sixteen bits), or 4 (a longword). +The value @var{add_symbol} @minus{} @var{sub_symbol} + @var{offset}, is added to the byte(s) +at @var{frag->literal[where]}. If @var{pcrel} is non-zero, the address of the +location is subtracted from the result. A relocation entry is also added +to the @file{a.out} file. @var{add_symbol}, @var{sub_symbol}, and/or +@var{offset} may be NULL.@refill + +@item char *frag_var(relax_stateT type, int max_chars, int var, +@code{relax_substateT subtype, symbolS *symbol, char *opcode)} +This function creates a machine-dependent frag of type @var{type} +(usually @code{rs_machine_dependent}). +@var{max_chars} is the maximum size in bytes that the frag may grow by; +@var{var} is the current size of the variable end of the frag; +@var{subtype} is the sub-type of the frag. 
The sub-type is used to index into +@var{md_relax_table[]} during @code{relaxation}. +@var{symbol} is the symbol whose value should be used when relaxing this frag. +@var{opcode} points into a byte whose value may have to be modified if the +addressing mode used by this frag changes. It typically points into the +@var{fr_literal[]} of the previous frag, and is used to point to a location +that @code{md_convert_frag()} may have to change.@refill + +@item void frag_wane(fragS *fragPTR) +This function is useful from within @code{md_convert_frag}. It +changes a frag to type rs_fill, and sets the variable-sized +piece of the frag to zero. The frag will never change in size +again. + +@item segT expression(expressionS *retval) +(@var{segT} is defined in @file{as.h}; @var{expressionS} is defined in @file{expr.h}) +This function parses the string pointed to by the external char +pointer @var{input_line_pointer}, and returns the segment-type +of the expression. It also stores the results in the +@var{expressionS} pointed to by @var{retval}. +@var{input_line_pointer} is advanced to point past the end of +the expression. (@var{input_line_pointer} is used by other +parts of the assembler. If you modify it, be sure to restore +it to its original value.) + +@item as_warn(char *message,@dots{}) +If warning messages are disabled, this function does nothing. +Otherwise, it prints out the current file name, and the current +line number, then uses @code{fprintf} to print the +@var{message} and any arguments it was passed. + +@item as_bad(char *message,@dots{}) +This function should be called when @code{as} encounters +conditions that are bad enough that @code{as} should not +produce an object file, but should continue reading input and +printing warning and bad error messages. 
+ +@item as_fatal(char *message,@dots{}) +This function prints out the current file name and line number, +prints the word @samp{FATAL:}, then uses @code{fprintf} to +print the @var{message} and any arguments it was passed. Then +the assembler exits. This function should only be used for +serious, unrecoverable errors. + +@item void float_const(int float_type) +This function reads floating-point constants from the current +input line, and calls @code{md_atof} to assemble them. It is +useful as the function to call for the directives +@samp{.single}, @samp{.double}, @samp{.float}, etc. +@var{float_type} must be a character from @var{FLT_CHARS}. + +@item void demand_empty_rest_of_line(void); +This function can be used by machine-dependent directives to +make sure the rest of the input line is empty. It prints a +warning message if there are additional characters on the line. + +@item long int get_absolute_expression(void) +This function can be used by machine-dependent directives to +read an absolute number from the current input line. It +returns the result. If it isn't given an absolute expression, +it prints a warning message and returns zero. + +@end table + + +@section The concept of Frags + +This assembler works to optimize the size of certain addressing +modes (e.g. branch instructions). This means the size of many +pieces of object code cannot be determined until after assembly +is finished. (This means that the addresses of symbols cannot be +determined until assembly is finished.) In order to do this, +@code{as} stores the output bytes as @dfn{frags}. + +Here is the definition of a frag (from @file{as.h}): +@example +struct frag +@{ + long int fr_fix; + long int fr_var; + relax_stateT fr_type; + relax_substateT fr_substate; + unsigned long fr_address; + long int fr_offset; + struct symbol *fr_symbol; + char *fr_opcode; + struct frag *fr_next; + char fr_literal[]; +@}; +@end example + +@table @var +@item fr_fix +is the size of the fixed-size piece of the frag. 
+
+@item fr_var
+The maximum (?) size of the variable-sized piece of the frag.
+
+@item fr_type
+The type of the frag. Current types are @code{rs_fill},
+@code{rs_align}, @code{rs_org}, and @code{rs_machine_dependent}.
+
+@item fr_substate
+This stores the type of machine-dependent frag this is (what
+kind of addressing mode is being used, and what size is being
+tried, will fit, etc.).
+
+@item fr_address
+@var{fr_address} is only valid after relaxation is finished.
+Before relaxation, the only way to store an address is (pointer
+to frag containing the address) plus (offset into the frag).
+
+@item fr_offset
+This contains a number whose meaning depends on the type of
+the frag. For @code{rs_machine_dependent} frags, this contains
+the offset from @var{fr_symbol} that the frag wants to go to.
+Thus, for branch instructions it is usually zero (unless the
+instruction was @samp{jba foo+12} or something like that).
+
+@item fr_symbol
+For @code{rs_machine_dependent} frags, this points to the symbol
+the frag needs to reach.
+
+@item fr_opcode
+This points to the location in the frag (or in a previous frag)
+of the opcode for the instruction that caused this to be a frag.
+@var{fr_opcode} is needed if the actual opcode must be changed
+in order to use a different form of the addressing mode.
+(For example, if a conditional branch only comes in size tiny,
+a large-size branch could be implemented by reversing the sense
+of the test, and turning it into a tiny branch over a large jump.
+This would require changing the opcode.)
+
+@item fr_literal
+@var{fr_literal} is a variable-size array that contains the
+actual object bytes. A frag consists of a fixed-size piece of
+object data (which may be zero bytes long), followed by a
+piece of object data whose size may not have been determined
+yet. Other information includes the type of the frag (which
+controls how it is relaxed).
+
+@item fr_next
+This is the next frag in the singly-linked list. This is
+usually only needed by the machine-independent part of
+@code{as}.
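As a rough illustration of why @var{fr_address} only becomes valid once relaxation is finished, the C sketch below walks a frag list through @var{fr_next} and assigns each frag the address that follows the previous one. It is not code from @code{as}: the @code{demo_frag} structure and the @code{assign_addresses} function are hypothetical names, only four of the fields are kept, and it assumes every frag's final size is simply @var{fr_fix} plus @var{fr_var}, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the frag structure described
   above.  Only the fields needed to assign addresses are kept. */
struct demo_frag {
    long fr_fix;                /* size of the fixed-size piece, in bytes */
    long fr_var;                /* size of the variable piece; assumed final */
    unsigned long fr_address;   /* filled in by assign_addresses() */
    struct demo_frag *fr_next;  /* next frag in the singly-linked list */
};

/* Walk the list starting at head and give each frag the address that
   follows the previous frag.  Returns the total size of all frags.
   Assumption: relaxation is already done, so the sizes no longer change. */
unsigned long assign_addresses(struct demo_frag *head, unsigned long start)
{
    unsigned long address = start;

    for (struct demo_frag *f = head; f != NULL; f = f->fr_next) {
        assert(f->fr_fix >= 0 && f->fr_var >= 0);
        f->fr_address = address;
        address += (unsigned long)f->fr_fix + (unsigned long)f->fr_var;
    }
    return address - start;
}
```

Because a frag's address depends on the sizes of all the frags before it, any change in the size of an earlier frag would shift every later address. That is why, before relaxation, an address can only be stored as a frag pointer plus an offset into that frag.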
+
+@end table
+
+@c Is this really a good idea?
+@iftex
+@center [end of manual]
+@end iftex
+@summarycontents
+@contents
+@bye |