
See Our team
Wondering how we keep quality?
Got unsolved questions? Ask Questions
GATE
GMAT
CBSE
NCERT
Career
Interview
Railway
UPSC
NID
NIFT-UG
NIFT-PG
PHP
AJAX
JavaScript
Node Js
Shell Script
Research
SYSTEM SOFTWARE [10CS52] Unit-7
Unit 7 LEX AND YACC-1
Lex is a program generator designed for lexical processing of character input streams. It
accepts a high-level, problem oriented specification for character string matching, and
produces a program in a general purpose language which recognizes regular expressions.The
regular expressions are specified by the user in the source specifications given to Lex.
7.1 Lex and Yacc
The Lex written code recognizes these expressions in an input stream and partitions the
input stream into strings matching the expressions. At the boundaries between strings
program sections provided by the user are executed. The Lex source file associates the
regular expressions and the program fragments. As each expression appears in the input to
the program written by Lex, the corresponding fragment is executed.
Lex turns the user's expressions and actions (called source in this memo) into the host
general-purpose language; the generated program is named yylex. The yylex program will
recognize expressions in a stream (called input in this memo) and perform the specified
actions for each expression as it is detected. See Figure 1.
+-------+
Source -> | Lex | -> yylex
+-------+
+-------+
Input -> | yylex | -> Output
+-------+
7.2 A YACC Parser
The structure of a lex file is intentionally similar to that of a yacc file; files are divided up
into three sections, separated by lines that contain only two percent signs, as follows:
Definition section
%%
Rules section
%%
C code section
? The definition section is the place to define macros and to import header files written
in C. It is also possible to write any C code here, which will be copied verbatim into
the generated source file.
? The rules section is the most important section; it associates patterns with C
statements. Patterns are simply regular expressions. When the lexer sees some text in
the input matching a given pattern, it executes the associated C code. This is the basis
of how lex operates.
? The C code section contains C statements and functions that are copied verbatim to
the generated source file. These statements presumably contain code called by the
rules in the rules section. In large programs it is more convenient to place this code in
a separate file and link it in at compile time.
Example:
/*** Definition section ***/
%{
/* C code to be copied verbatim */
#include
%}
/* This tells lex to read only one input file */
%%
/*** Rules section ***/
/* [0-9]+ matches a string of one or more digits */
[0-9]+ {
/* yytext is a string containing the matched text. */
printf("Saw an integer: %s\n", yytext);
}
. { /* Ignore all other characters. */ }
%%
/*** C Code section ***/
int main(void)
{
/* Call the lexer, then quit. */
yylex();
return 0;
REGULAR EXPRESSIONS:
Regular expression specifies a set of strings to be matched. It contains text characters and
operator characters The letters of the alphabet and the digits are always text characters; thus
the regular expression integer matches the string integer wherever it appears and the
expression
a57D
looks for the string a57D.
Operators:
The operator characters are
" \ [ ] ^ - ? . * + | ( ) $ / { } % <>
and if they are to be used as text characters, an escape should be used. The quotation mark
operator (") indicates that whatever is contained between a pair of quotes is to be taken as
text characters.
Thus
xyz"++"
matches the string xyz++ when it appears.
? Note that a part of a string may be quoted. It is harmless but unnecessary to quote an
ordinary text character; the expression
"xyz++"
is the same as the one above. Thus by quoting every non-alphanumeric character being used
as a text character, the user can avoid remembering the list above of current operator
characters, and is safe should further extensions to Lex lengthen the list.
? An operator character may also be turned into a text character by preceding it with
\ as in
xyz\+\+
which is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into an expression; blanks or tabs end
a rule. Any blank character not contained within []must be quoted.
? Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is
backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must
be used; it is not required to escape tab and backspace. Every character but blank,
tab, newline and the list above is always a text character.
? Character classes. Classes of characters can be specified using the operator pair [].
The construction [abc] matches a single character, which may be a, b, or c. Within
square brackets, most operator meanings are ignored. Only three characters are
special: these are \ - and ^. The - character indicates ranges.
For example:
[a-z0-9<>_]indicates the character class containing all the lower case letters, the digits, the
angle brackets, and underline. Ranges may be given in either order.
? Using - between any pair of characters which are not both upper case letters, both
lower case letters, or both digits is implementation dependent and will get a warning
message. If it is desired to include the character - in a character class, it should be first
or last; thus
[-+0-9]
matches all the digits and the two signs.
In character classes, the ^ operator must appear as the first character after the left bracket; it
indicates that the resulting string is to be complemented with respect to the computer
character set. Thus , [^abc] matches all characters except a, b, or c, including all special
or control characters
or [^a-zA-Z]
is any character which is not a letter. The \ character provides the usual escapes within
character class brackets.
? Optional expressions.: The operator ? indicates an optional element of an expression.
Thus ab?c
matches either ac or abc.
? Repeated expressions: Repetitions of classes are indicated by the operators * and +.
Ex: a*
is any number of consecutive a characters, including zero, while a+ is one or more instances
of a.
For example [a-z]+
is all strings of lower case letters.
Defining regular expressions in Lex :
Character Meaning
A-Z, 0-9, a-z Characters and numbers that form part of the pattern.
. Matches any character except \n.
- Used to denote range. Example: A-Z implies all characters from A to Z.
[ ] A character class. Matches any character in the brackets. If the first
character is ^ then it indicates a negation pattern. Example: [abC]
matches either of a, b, and C.
* Match zero or more occurrences of the preceding pattern.
+ Matches one or more occurrences of the preceding pattern.
? Matches zero or one occurrences of the preceding pattern.
$ Matches end of line as the last character of the pattern.
{ } Indicates how many times a pattern can be present. Example: A{1,3}
implies one or three occurrences of A may be present.
\ Used to escape meta characters. Also used to remove the special meaning
of characters as defined in this table.
^ Negation.
| Logical OR between expressions.
"" Literal meanings of characters. Meta characters hold.
/ Look ahead. Matches the preceding pattern only if followed by the
succeeding expression. Example: A0/1 matches A0 only if A01 is the
input.
( ) Groups a series of regular expressions.
Examples of regular expressions:
Regular expression Meaning
joke[rs] Matches either jokes or joker.
A{1,2}shis+ Matches AAshis, Ashis, AAshi, Ashi.
(A[b-e])+ Matches zero or one occurrences of A followed by any character from
b to e.
Tokens in Lex are declared like variable names in C. Every token has an associated expression.
(Examples of tokens and expression are given in the following table.) Using the examples in our
tables, we'll build a word-counting program. Our first task will be to show how tokens are declared.
Examples of token declarations
Token Associated expression Meaning
number ([0-9])+ 1 or more occurrences of a digit
chars [A-Za-z] Any character
blank " " A blank space
word (chars)+ 1 or more occurrences of chars
variable (chars)+(number)*(chars)*( number)*
7.3. USING LEX:
If lex.l is the file containing the lex specification, the C source for the lexical analyzer is
produced by running lex with the following command:
lex lex.l
lex produces a C file called lex.yy.c.
Options
There are several options available with the lex command. If you use one or more of them,
place them between the command name lex and the filename argument.
The -t option sends lex's output to the standard output rather than to the file lex.yy.c.
The -v option prints out a small set of statistics describing the so-called finite automata that
lex produces with the C program lex.yy.c.
WORD COUNTING PROGRAM
In this section we can add C variable declarations. We will declare an integer variable here
for our word-counting program that holds the number of words counted by the program.
We'll also perform token declarations of Lex.
Declarations for the word-counting program
%{
int wordCount = 0;
%}
chars [A-za-z\_\'\.\"]
numbers ([0-9])+
delim [" "\n\t]
whitespace {delim}+
words {chars}+
%%
The double percent sign implies the end of this section and the beginning of the second of the
three sections in Lex programming.
Lex rules for matching patterns
Let's look at the Lex rules for describing the token that we want to match. (We'll use C to
define what to do when a token is matched.) Continuing with our word-counting program,
here are the rules for matching tokens.
Lex rules for the word-counting program
{words} { wordCount++; /*
increase the word count by one*/ }
{whitespace} { /* do
nothing*/ }
{numbers} { /* one may
want to add some processing here*/ }
%%
C code
The third and final section of programming in Lex covers C function declarations (and
occasionally the main function) Note that this section has to include the yywrap() function.
Lex has a set of functions and variables that are available to the user. One of them is yywrap.
Typically, yywrap() is defined as shown in the example below.
C code section for the word-counting program
void main()
{
yylex(); /* start the analysis*/
printf(" No of words:
%d\n", wordCount);
}
int yywrap()
{
return 1;
}
LEXER
lexical analysis is the process of converting a sequence of characters into a sequence of
tokens. A program or function which performs lexical analysis is called a lexical analyzer,
lexer or scanner. A lexer often exists as a single function which is called by a parser or
another function
Token
A token is a string of characters, categorized according to the rules as a symbol (e.g.
IDENTIFIER, NUMBER, COMMA, etc.). The process of forming tokens from an input
stream of characters is called tokenization and the lexer categorizes them according to a
symbol type. A token can look like anything that is useful for processing an input text stream
or text file.
A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser.
For example, a typical lexical analyzer recognizes parenthesis as tokens, but does nothing to
ensure that each '(' is matched with a ')'.
Consider this expression in the C programming language:
sum=3+2;
Tokenized in the following table:
lexeme token type
sum Identifier
= Assignment operator
3 Number
+ Addition operator
2 Number
; End of statement
Tokens are frequently defined by regular expressions, which are understood by a lexical
analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool
like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream,
and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid
token, it will report an error.
Following tokenizing is parsing. From there, the interpreted data may be loaded into data
structures for general use, interpretation, or compiling.
Examples:
1. Write a Lex source program to copy an input file while adding 3 to every positive number
divisible by 7.
%%
int k;
[0-9]+ {
k = atoi(yytext);
if (k%7 == 0)
printf("%d", k+3);
else
printf("%d",k);
}
to do just that. The rule [0-9]+ recognizes strings of digits; atoi converts the digits to binary and
stores the result in k. The operator % (remainder) is used to check whether k is divisible by 7; if it is,
it is incremented by 3 as it is written out. It may be objected that this program will alter such input
items as 49.63 or X7. Furthermore, it increments the absolute value of all negative numbers divisible
by 7. To avoid this, just add a few more rules after the active one, as here:
%%
int k;
-?[0-9]+ {
k = atoi(yytext);
printf("%d",
k%7 == 0 ? k+3 : k);
}
-?[0-9.]+ ECHO;
[A-Za-z][A-Za-z0-9]+ ECHO;
Numerical strings containing a “.'' or preceded by a letter will be picked up by one of the last
two rules, and not changed. The if-else has been replaced by a C conditional expression to save
space; the form a?b:c means “if a then b else c''.
2. Write a Lex program that histograms the lengths of words, where a word is defined as a
string of letters.
int lengs[100];
%%
[a-z]+ lengs[yyleng]++;
. |
\n ;
%%
yywrap()
{
int i;
printf("Length No. words\n");
for(i=0; i<100; i++)
if (lengs[i] > 0)
printf("%5d%10d\n",i,lengs[i]);
return(1);
}
3.. Write a lex program to find the number of vowels and consonants.
%{
/* to find vowels and consonents*/
int vowels = 0;
int consonents = 0;
%}
%%
[ \t\n]+
[aeiouAEIOU] vowels++;
[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ] consonents++;
.
%%
main()
{
yylex();
printf(" The number of vowels = %d\n", vowels);
printf(" number of consonents = %d \n", consonents);
return(0);
}
The same program can be executed by giving alternative grammar. It is as follows:
Here a file is opened which is given as a argument and reads to text and counts the number of
vowels and consonants.
%{
unsigned int vowelcount=0;
unsigned int consocount=0;
%}
vowel [aeiouAEIOU]
consonant [bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ]
eol \n
%%
{vowel} { vowelcount++;}
{consonant} { consocount++; }
%%
main(int argc,char *argv[])
{
if(argc > 1)
{
FILE *fp;
fp=fopen(argv[1],"r");
if(!fp)
{
fprintf(stderr,"could not open %s\n",argv[1]);
exit(1);
}
yyin=fp;
}
yylex();
printf(" vowelcount=%u consonantcount=%u\n ",vowelcount,consocount);
return(0);
}
4. Write a Lex program to count the number of words, characters, blanks and lines in a
given text.
%{
unsigned int charcount=0;
int wordcount=0;
int linecount=0;
int blankcount =0;
%}
word[^ \t\n]+
eol \n
%%
[ ] blankcount++;
{word} { wordcount++; charcount+=yyleng;}
{eol} {charcount++; linecount++;}
. { ECHO; charcount++;}
%%
main(argc, argv)
int argc;
char **argv;
{
if(argc > 1)
{
FILE *file;
file = fopen(argv[1],"r");
if(!file)
{
fprintf(stderr, "could not open %s\n", argv[1]);
exit(1);
}
yyin = file;
yylex();
printf("\nThe number of characters = %u\n", charcount);
printf("The number of wordcount = %u\n", wordcount);
printf("The number of linecount = %u\n", linecount);
printf("The number of blankcount = %u\n", blankcount);
return(0);
}
else
printf(" Enter the file name along with the program \n");
}
5. Write a lex program to find the number of positive integer, negative integer, positive
floating positive number and negative floating point number.
%{
int posnum = 0;
int negnum = 0;
int posflo = 0;
int negflo = 0;
%}
%%
[\n\t ];
([0-9]+) {posnum++;}
-?([0-9]+) {negnum++; }
([0-9]*\.[0-9]+) { posflo++; }
-?([0-9]*\.[0-9]+) { negflo++; }
. ECHO;
%%
main()
{
yylex();
printf("Number of positive numbers = %d\n", posnum);
printf("number of negative numbers = %d\n", negnum);
printf("number of floting positive number = %d\n", posflo);
printf("number of floating negative number = %d\n", negflo);
}
6. Write a lex program to find the given c program has right number of brackets. Count
the number of comments. Check for while loop.
%{
/* find main, comments, {, (, ), } */
int comments=0;
int opbr=0;
int clbr=0;
int opfl=0;
int clfl=0;
int j=0;
int k=0;
%}
%%
"main()" j=1;
"/*"[ \t].*[ \t]"*/" comments++;
"while("[0-9a-zA-Z]*")"[ \t]*\n"{"[ \t]*.*"}" k=1;
^[ \t]*"{"[ \t]*\n
^[ \t]*"}" k=1;
"(" opbr++;
")" clbr++;
"{" opfl++;
"}" clfl++;
[^ \t\n]+
. ECHO;
%%
main(argc, argv)
int argc;
char *argv[];
{
if (argc > 1)
{
FILE *file;
file = fopen(argv[1], "r");
if (!file)
{
printf("error opeing a file \n");
exit(1);
}
yyin = file;
}
yylex();
if(opbr != clbr)
printf("open brackets is not equal to close brackets\n");
if(opfl != clfl)
printf("open flower brackets is not equal to close flower brackets\n");
printf(" the number of comments = %d\n",comments);
if (!j)
printf("there is no main function \n");
if (k)
printf("there is loop\n");
else printf("there is no valid for loop\n");
return(0);
}
6. Write a lex program to replace scanf with READ and printf with WRITE statement
also find the number of scanf and printf.
%{
int pc=0,sc=0;
%}
%%
printf fprintf(yyout,"WRITE");pc++;
scanf fprintf(yyout,"READ");sc++;
. ECHO;
%%
main(int argc,char* argv[])
{
if(argc!=3)
{
printf("\nUsage: %s \n",argv[0]);
return;
}
yyin=fopen(argv[1],"r");
yyout=fopen(argv[2],"w");
yylex();
printf("\nno. of printfs:%d\nno. of scanfs:%d\n",pc,sc);
}
7. Write a lex program to find whether the given expression is valid.
%{
#include
int valid=0,ctr=0,oc = 0;
%}
NUM [0-9]+
OP [+*/-]
%%
{NUM}({OP}{NUM})+ {
valid = 1;
for(ctr = 0;yytext[ctr];ctr++)
{
switch(yytext[ctr])
{
case '+':
case '-':
case '*':
case '/': oc++;
}
}
}
{NUM}\n {printf("\nOnly a number.");}
\n { if(valid) printf("valid \n operatorcount = %d",oc);
else printf("Invalid");
valid = oc = 0;ctr=0;
}
%%
main()
{
yylex();
}
/* Another solution for the same problem */
%{
int oprc=0,digc=0,top=-1,flag=0;
char stack[20];
%}
digit [0-9]+
opr [+*/-]
%%
[ \n\t]+
['('] {stack[++top]='(';}
[')'] {flag=1;
if(stack[top]=='('&&(top!=-1))
top--;
else
{
printf("\n Invalid expression\n");
exit(0);
}
}
{digit} {digc++;}
{opr}/['('] { oprc++; printf("%s",yytext);}
{opr}/{digit} {oprc++; printf("%s",yytext);}
. {printf("Invalid "); exit(0);}
%%
main()
{
yylex();
if((digc==oprc+1||digc==oprc) && top==-1)
{
printf("VALID");
printf("\n oprc=%d\n digc=%d\n",oprc,digc);
}
else
printf("INVALID");
}
8.Write a lex program to find the given sentence is simple or compound.
%{
int flag=0;
%}
%%
(" "[aA][nN][dD]" ")|(" "[oO][rR]" ")|(" "[bB][uU][tT]" ") flag=1;
. ;
%%
main()
{yylex();
if (flag==1)
printf("COMPOUND SENTENCE \n");
else
printf("SIMPLE SENTENCE \n");
}
9. Write a lex program to find the number of valid identifiers.
%{
int count=0;
%}
%%
(" int ")|(" float ")|(" double ")|(" char ")
{
int ch; ch = input();
for(;;)
{
if (ch==',') {count++;}
else
156
if(ch==';') {break;}
ch = input();
}
count++;
}
%%
main(int argc,char *argv[])
{
yyin=fopen(argv[1],"r");
yylex();
printf("the no of identifiers used is %d\n",count);
}
RECOMMENDED QUESTIONS:
1. write the specification of lex with an example? (10)
2. what is regular expressions? With examples explain? (8)
3. write a lex program to count the no of words , lines , space, characters? (8)
4. write a lex program to count the no of vowels and consonants? (8)
5. what is lexer- parser communication? Explain? (5)
6. write a program to count no of words by the method of substitution? (7)