Commands Reference, Volume 3

lex Command

Purpose

Generates a C or C++ language program that matches patterns for simple lexical analysis of an input stream.

Syntax

lex [ -C ] [ -t ] [ -v | -n ] [ File... ]

The lex command reads File or standard input, generates a C language program, and writes it to a file named lex.yy.c. This file, lex.yy.c, is a compilable C language program. A C++ compiler can also compile the output of the lex command. The -C flag renames the output file to lex.yy.C for the C++ compiler.

The C++ program generated by the lex command can use either STDIO or IOSTREAMS. If the cpp define _CPP_IOSTREAMS is true during a C++ compilation, the program uses IOSTREAMS for all I/O. Otherwise, STDIO is used.

The lex command uses rules and actions contained in File to generate a program, lex.yy.c, which can be compiled with the cc command. The compiled lex.yy.c can then receive input, break the input into the logical pieces defined by the rules in File, and run program fragments contained in the actions in File.

The generated program is a C language function called yylex. The lex command stores the yylex function in a file named lex.yy.c. You can use the yylex function alone to recognize simple one-word input, or you can use it with other C language programs to perform more difficult input analysis functions. For example, you can use the lex command to generate a program that simplifies an input stream before sending it to a parser program generated by the yacc command.

The yylex function analyzes the input stream using a program structure called a finite state machine. This structure allows the program to exist in only one state (or condition) at a time. There is a finite number of states allowed. The rules in File determine how the program moves from one state to another.
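The idea of a finite state machine can be illustrated with a small hand-written sketch. This is not the code that lex generates; the state names and the step and run_machine functions are hypothetical, chosen only to show a program that is in exactly one state at a time and moves between states according to fixed rules.

```c
#include <ctype.h>

/* Hypothetical states for a tiny machine that accepts either an
   all-digit string or an all-lowercase-letter string. */
enum state { S_START, S_NUMBER, S_WORD, S_ERROR };

/* Move from one state to the next based on a single input byte. */
static enum state step(enum state s, unsigned char c)
{
    switch (s) {
    case S_START:
        if (isdigit(c)) return S_NUMBER;
        if (islower(c)) return S_WORD;
        return S_ERROR;
    case S_NUMBER:
        return isdigit(c) ? S_NUMBER : S_ERROR;
    case S_WORD:
        return islower(c) ? S_WORD : S_ERROR;
    default:
        return S_ERROR;   /* once in the error state, stay there */
    }
}

/* Run the machine over a whole string and report the final state. */
enum state run_machine(const char *input)
{
    enum state s = S_START;

    for (; *input != '\0'; input++)
        s = step(s, (unsigned char)*input);
    return s;
}
```

Calling run_machine("123") ends in S_NUMBER, while run_machine("12a") ends in S_ERROR. In a real lex program, the rules in File play the role of the step function.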

If you do not specify a File, the lex command reads standard input. It treats multiple files as a single file.

Note: Since the lex command uses fixed names for intermediate and output files, you can have only one program generated by lex in a given directory.

lex Specification File

The input file can contain three sections: definitions, rules, and user subroutines. Each section must be separated from the others by a line containing only the delimiter, %% (double percent signs). The format is:
definitions
%%
rules
%%
user subroutines

The purpose and format of each are described in the following sections.

Definitions

If you want to use variables in your rules, you must define them in this section. The variables make up the left column, and their definitions make up the right column. For example, if you want to define D as a numerical digit, you would write the following:

D   [0-9]

You can use a defined variable in the rules section by enclosing the variable name in {} (braces), for example:

{D}

Lines in the definitions section that begin with a blank, or that are enclosed between %{ and %} delimiter lines, are copied to the lex.yy.c file. You can use this construct to declare C language variables to be used in the lex actions or to include header files, for example:

%{
#include <math.h>
int count;
%}

Such lines can also appear at the beginning of the rules section, immediately after the first %% delimiter, but they must not appear after the first rule. If the line is in the definitions section of File, the lex command copies it to the external declarations section of the lex.yy.c file. If the line appears in the rules section, before the first rule, the lex command copies it to the local declaration section of the yylex subroutine in lex.yy.c.

The type of the lex external, yytext, can be set to either a null-terminated character array (default) or a pointer to a null-terminated character string by specifying one of the following in the definitions section:

%array    (default)
%pointer

In the definitions section, you can set table sizes for the resulting finite state machine. The default sizes are large enough for small programs. You may want to set larger sizes for more complex programs.

%an	Number of transitions is n (default 5000)
%en	Number of parse tree nodes is n (default 2000)
%hn	Number of multibyte character output slots (default is 0)
%kn	Number of packed character classes (default 1000)
%mn	Number of multibyte "character class" character output slots (default is 0)
%nn	Number of states is n (default 2500)
%on	Number of output slots (default 5000, minimum 257)
%pn	Number of positions is n (default 5000)
%vp	Percentage of slots vacant in the hash tables controlled by %h and %m (default 20, range 0 <= p < 100)
%zn	Number of multibyte character class output slots (default 0)

If multibyte characters appear in extended regular expression strings, you may need to reset the output array size with the %o argument (possibly to array sizes in the range 10,000 to 20,000). This reset reflects the much larger number of characters relative to the number of single-byte characters.

If multibyte characters appear in extended regular expressions, you must set the multibyte hash table sizes with the %h and %m arguments to sizes greater than the total number of multibyte characters contained in the lex file.

If no multibyte characters appear in extended regular expressions but you want '.' to match multibyte characters, you must set %z greater than zero. Similarly, for inverse character classes (for example, [^abc]) to match multibyte characters, you must set both %h and %m greater than zero.

When using multibyte characters, the lex.yy.c file must be compiled with the -qmbcs compiler option.

Rules

Once you have defined your terms, you can write the rules section. It contains strings and expressions to be matched by the yylex subroutine, and C commands to execute when a match is made. This section is required, and it must be preceded by the delimiter %% (double percent signs), whether or not you have a definitions section. The lex command does not recognize your rules without this delimiter.

In this section, the left column contains the pattern in the form of an extended regular expression, which will be recognized in an input file to the yylex subroutine. The right column contains the C program fragment executed when that pattern is recognized, called an action.

When the lexical analyzer finds a match for the extended regular expression, the lexical analyzer executes the action associated with that extended regular expression.

Patterns can include extended characters. If multibyte locales are installed on your system, patterns can also include multibyte characters that are part of the installed code set.

The columns are separated by a tab or blanks. For example, if you want to search files for the keyword KEY, you can write the following:

(KEY) printf ("found KEY");

If you include this rule in File, the yylex lexical analyzer matches the pattern KEY and runs the printf subroutine.

Each pattern can have a corresponding action, that is, a C command to execute when the pattern is matched. Each statement must end with a ; (semicolon). If you use more than one statement in an action, you must enclose all of them in { } (braces). A second delimiter, %%, must follow the rules section if you have a user subroutine section. Without a specified action for a pattern match, the lexical analyzer copies the input pattern to the output without changing it.

When the yylex lexical analyzer matches a string in the input stream, it copies the matched string to an external character array (or a pointer to a character string), yytext, before it executes any commands in the rules section. Similarly, the external int, yyleng, is set to the length of the matched string in bytes (therefore, multibyte characters will have a size greater than 1).
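Because yyleng counts bytes, it can differ from the number of characters in yytext. The following sketch illustrates the difference outside of lex. It assumes the input is encoded in UTF-8, which is only one of the possible multibyte code sets, and utf8_char_count and byte_length are hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>

/* Count characters in a UTF-8 string by skipping continuation bytes
   (bytes of the form 10xxxxxx). Assumes well-formed UTF-8 input. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Byte length of a string, which corresponds to what yyleng reports
   for a matched string. */
size_t byte_length(const char *s)
{
    return strlen(s);
}
```

For a string such as "été" encoded in UTF-8, byte_length is larger than utf8_char_count, because each accented letter occupies two bytes.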

For information on how to form extended regular expressions, see "Extended Regular Expressions in the lex Command" in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs.

User Subroutines

The lex library defines the following subroutines as macros that you can use in the rules section of the lex specification file:

input	Reads a byte from yyin.
unput	Replaces a byte after it has been read.
output	Writes an output byte to yyout.
winput	Reads a multibyte character from yyin.
wunput	Replaces a multibyte character after it has been read.
woutput	Writes a multibyte output character to yyout.
yysetlocale	Calls the setlocale(LC_ALL, ""); subroutine to determine the current locale.

The winput, wunput, and woutput macros are defined to use the yywinput, yywunput, and yywoutput subroutines coded in the lex.yy.c file. For compatibility, these yy subroutines subsequently use the input, unput, and output subroutines to read, replace, and write the necessary number of bytes in a complete multibyte character.

You can override these macros by writing your own code for these routines in the user subroutines section. If you write your own, you must undefine these macros in the definitions section as follows:

%{
#undef input
#undef unput
#undef output
#undef winput
#undef wunput
#undef woutput
#undef yysetlocale
%}

There is no main subroutine in lex.yy.c, because the lex library contains the main subroutine that calls the yylex lexical analyzer, as well as the yywrap subroutine called by yylex( ) at the end of File. Therefore, if you do not include main( ), yywrap( ), or both in the user subroutines section, you must link the lex library when you compile lex.yy.c by entering cc lex.yy.c -ll, where the -ll flag links the lex library.

External names generated by the lex command all begin with the preface yy, as in yyin, yyout, yylex, and yytext.

Finite State Machine

The default skeleton for the finite state machine is defined in /usr/ccs/lib/lex/ncform. The user can use a personally configured finite state machine by setting an environment variable LEXER=PATH. The PATH variable designates the user-defined finite state machine path and file name. The lex command checks the environment for this variable and, if it is set, uses the supplied path.

Putting Blanks in an Expression

Normally, a blank or tab ends the expression in a rule. To include blanks or tab characters in the expression, enclose them in " " (quotation marks). Use quotation marks around all blanks in expressions that are not already within sets of [ ] (brackets).

Other Special Characters

The lex program recognizes many of the normal C language special characters. These character sequences are:

Sequence	Meaning
\a	Alert
\b	Backspace
\f	Form Feed
\n	Newline (Do not use the actual new-line character in an expression.)
\r	Return
\t	Tab
\v	Vertical Tab
\\	Backslash
\digits	The character with encoding represented by the one-, two-, or three-digit octal integer specified by digits.
\xdigits	The character with encoding represented by the sequence of hexadecimal characters specified by digits.
\c	Where c is none of the characters listed above, represents the character c unchanged.

Note: Do not use \0 or \x0 in lex rules.

When using these special characters in an expression, you do not need to enclose them in quotes. Every character, except these special characters and the operator symbols described in "Extended Regular Expressions in the lex Command" in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs, is always a text character.
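The table of sequences can be read as a small decoding procedure. The following sketch shows one way to decode a single escape in C. The decode_escape function is a hypothetical helper, not part of lex, and its handling of \x with no digits (treated as a plain x) is an assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Decode one backslash escape at the start of s. Stores the resulting
   character value in *out and returns the number of characters
   consumed, or 0 if s does not start with a complete escape. */
size_t decode_escape(const char *s, int *out)
{
    size_t i;
    int value = 0;

    if (s[0] != '\\' || s[1] == '\0')
        return 0;

    switch (s[1]) {
    case 'a':  *out = '\a'; return 2;
    case 'b':  *out = '\b'; return 2;
    case 'f':  *out = '\f'; return 2;
    case 'n':  *out = '\n'; return 2;
    case 'r':  *out = '\r'; return 2;
    case 't':  *out = '\t'; return 2;
    case 'v':  *out = '\v'; return 2;
    case '\\': *out = '\\'; return 2;
    case 'x':
        /* \xdigits: consume every following hexadecimal digit. */
        for (i = 2; isxdigit((unsigned char)s[i]); i++) {
            int c = tolower((unsigned char)s[i]);
            value = value * 16 + (isdigit(c) ? c - '0' : c - 'a' + 10);
        }
        if (i == 2) {            /* assumption: no digits means plain x */
            *out = 'x';
            return 2;
        }
        *out = value;
        return i;
    default:
        break;
    }

    if (s[1] >= '0' && s[1] <= '7') {
        /* \digits: one to three octal digits. */
        for (i = 1; i <= 3 && s[i] >= '0' && s[i] <= '7'; i++)
            value = value * 8 + (s[i] - '0');
        *out = value;
        return i;
    }

    /* \c: any other character represents itself. */
    *out = (unsigned char)s[1];
    return 2;
}
```

The helper decodes \0 and \x0 like any other value, but as the note above says, those sequences should not be used in lex rules.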

Matching Rules

When more than one expression can match the current input, the lex command chooses the longest match first. When several rules match the same number of characters, the lex command chooses the rule that occurs first. For example, if the rules

integer    keyword action...;
[a-z]+       identifier action...;

are given in that order, and integers is the input word, lex matches the input as an identifier, because [a-z]+ matches eight characters while integer matches only seven. However, if the input is integer, both rules match seven characters. lex selects the keyword rule because it occurs first. A shorter input, such as int, does not match the expression integer, and so lex selects the identifier rule.
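The selection logic described above (longest match first, then earliest rule) can be sketched in C as follows. The select_rule function is a hypothetical helper written for illustration; it is not how the generated yylex function is implemented, which uses a finite state machine instead of comparing lengths rule by rule.

```c
#include <stddef.h>

/* Given the number of characters each rule can match at the current
   input position (0 means the rule does not match), return the index
   of the winning rule: the longest match, with ties going to the rule
   listed first. Returns -1 if no rule matches. */
int select_rule(const size_t match_len[], int rule_count)
{
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < rule_count; i++) {
        /* Strictly greater: an equal length never displaces an
           earlier rule, which gives the first-rule tie-break. */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

With rule 0 standing for integer and rule 1 for [a-z]+, the lengths {7, 8} for the input integers select rule 1, the lengths {7, 7} for integer select rule 0, and the lengths {0, 3} for int select rule 1.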

Matching a String Using Wildcard Characters

Because lex chooses the longest match first, do not use rules containing expressions like .*. For example:

'.*'

might seem like a good way to recognize a string in single quotes. However, the lexical analyzer reads far ahead, looking for a distant single quote to complete the long match. If a lexical analyzer with such a rule gets the following input:

'first' quoted string here, 'second' here

it matches:

'first' quoted string here, 'second'

To find the smaller strings, first and second, use the following rule:

'[^'\n]*'

This rule stops after 'first'.

Errors of this type are not far reaching, because the . (period) operator does not match a new-line character. Therefore, expressions like .* (period asterisk) stop on the current line. Do not try to defeat this with expressions like [.\n]+. The lexical analyzer tries to read the entire input file and an internal buffer overflow occurs.
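The difference between the two quoted-string rules can be sketched with two hand-written C functions that compute how many characters each pattern would match at the start of a string. The names match_greedy and match_bounded are hypothetical, and the functions only model the behavior described above; they are not taken from lex.

```c
#include <stddef.h>

/* Length matched by the pattern '.*' at the start of s, or 0 if it
   does not match. The . operator matches anything except a new-line
   character, and the longest match wins, so the match runs to the
   LAST single quote on the current line. */
size_t match_greedy(const char *s)
{
    size_t last_quote = 0;

    if (s[0] != '\'')
        return 0;
    for (size_t i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last_quote = i;
    return last_quote ? last_quote + 1 : 0;
}

/* Length matched by the pattern '[^'\n]*' at the start of s, or 0.
   The character class excludes the quote itself, so the match ends
   at the FIRST closing quote. */
size_t match_bounded(const char *s)
{
    size_t i = 1;

    if (s[0] != '\'')
        return 0;
    while (s[i] != '\0' && s[i] != '\n' && s[i] != '\'')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```

For the sample input line shown earlier, match_greedy covers everything through the quote that closes 'second', while match_bounded covers only 'first'.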

Finding Strings within Strings

The lex program partitions the input stream and does not search for all possible matches of each expression. Each character is accounted for once and only once. For example, to count occurrences of both she and he in an input text, try the following rules:

she         s++;
he          h++;
\n          |
.           ;

where the last two rules ignore everything besides he and she. However, because she includes he, lex does not recognize the instances of he that are included in she.

To override this choice, use the action REJECT. This directive tells lex to go to the next rule. lex then adjusts the position of the input pointer to where it was before the first rule was executed and executes the second choice rule. For example, to count the included instances of he, use the following rules:

she                 {s++; REJECT;}
he                  {h++; REJECT;}
\n                  |
.                   ;

After counting the occurrences of she, lex rejects the input stream and then counts the occurrences of he. Because in this case she includes he but not vice versa, you can omit the REJECT action on he. In other cases, it may be difficult to determine which input characters are in both classes.

In general, REJECT is useful whenever the purpose of lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other.
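The net effect of the two rule sets on the she and he counters can be modeled with two hand-written C functions. These are illustrative sketches with hypothetical names, not the mechanism lex uses internally. The first consumes each character exactly once, and the second counts every occurrence, including overlapping ones, as the REJECT rules do.

```c
#include <string.h>

/* Count matches the way a partitioning scanner does: each input
   character belongs to exactly one match, so the "he" inside "she"
   is never counted on its own. */
void count_partitioned(const char *text, int *s, int *h)
{
    *s = *h = 0;
    while (*text != '\0') {
        if (strncmp(text, "she", 3) == 0) {
            (*s)++;
            text += 3;      /* the longest match consumes all of "she" */
        } else if (strncmp(text, "he", 2) == 0) {
            (*h)++;
            text += 2;
        } else {
            text++;         /* any other character is ignored */
        }
    }
}

/* Count matches with the net effect of the REJECT rules: every
   position is tested against both patterns, so overlapping
   occurrences are counted too. */
void count_overlapping(const char *text, int *s, int *h)
{
    *s = *h = 0;
    for (; *text != '\0'; text++) {
        if (strncmp(text, "she", 3) == 0)
            (*s)++;
        if (strncmp(text, "he", 2) == 0)
            (*h)++;
    }
}
```

For the text "she said he", count_partitioned reports one she and one he, while count_overlapping reports one she and two occurrences of he.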

Flags

-C	Produces the lex.yy.C file instead of lex.yy.c for use with a C++ compiler. To get the I/O Stream Library, use the macro, _CPP_IOSTREAMS, as well.
-n	Suppresses the statistics summary. When you set your own table sizes for the finite state machine, the lex command automatically produces this summary if you do not select this flag.
-t	Writes lex.yy.c to standard output instead of to a file.
-v	Provides a one-line summary of the generated finite-state-machine statistics.

Exit Status

This command returns the following exit values:

0	Successful completion.
>0	An error occurred.

Examples

To draw lex instructions from the file lexcommands and place the output in lex.yy.c, use the following command:
```
lex lexcommands
```
To create a lex program that converts uppercase to lowercase, removes blanks at the end of a line, and replaces multiple blanks by single blanks, include the following in a lex command file:
```
%%
[A-Z]   putchar(yytext[0] + 'a' - 'A');
[ ]+$   ;
[ ]+    putchar(' ');
```
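To make the effect of these three rules concrete, the following is a rough hand-written C equivalent. It is a sketch only: filter_text is a hypothetical name, the uppercase conversion assumes an ASCII-like character layout (as the rule itself does), and the $ operator is modeled as "followed by a new-line character".

```c
#include <stddef.h>

/* Apply the three rules to the string in, writing the result to out,
   which must have room for at least as many bytes as in (including
   the terminating null). Returns the number of bytes written, not
   counting the terminating null. */
size_t filter_text(const char *in, char *out)
{
    char *start = out;

    while (*in != '\0') {
        if (*in >= 'A' && *in <= 'Z') {
            /* [A-Z]: convert to lowercase */
            *out++ = (char)(*in++ + ('a' - 'A'));
        } else if (*in == ' ') {
            /* [ ]+: consume the whole run of blanks */
            while (*in == ' ')
                in++;
            /* [ ]+$: a run followed by a new-line is dropped;
               any other run becomes a single blank */
            if (*in != '\n')
                *out++ = ' ';
        } else {
            /* default: unmatched characters are copied unchanged */
            *out++ = *in++;
        }
    }
    *out = '\0';
    return (size_t)(out - start);
}
```

Given the text "Hello   World  " followed by a new-line character, the function writes "hello world" followed by the new-line character.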

Files

/usr/ccs/lib/libl.a	Contains the run-time library.
/usr/ccs/lib/lex/ncform	Defines a finite state machine.

Related Information

The yacc command.

Creating an Input Language with the lex and yacc Commands in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs.

Using the lex Program with the yacc Program in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs.

Example Program for the lex and yacc Programs in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs.

National Language Support Overview for Programming in AIX 5L Version 5.1 General Programming Concepts: Writing and Debugging Programs.