Example

How Pleasant Is The Weather?

Consider this lexical analysis example. Here, we easily recognize that there are five words: How, Pleasant, Is, The, Weather. This comes naturally to us because we can recognize the separators, blanks, and the punctuation symbol.

HowPl easantIs Th ewe ather?

Now, check this example. We can still read it, but it takes some time because the separators are put in odd places. It is not something that comes to you immediately. In this tutorial, you will learn:

Basic Terminologies:
Lexical Analyzer Architecture: How tokens are recognized
Roles of the Lexical analyzer
Lexical Errors
Error Recovery in Lexical Analyzer
Lexical Analyzer vs. Parser
Why separate Lexical Analysis and Parsing?
Advantages of Lexical analysis
Disadvantage of Lexical analysis

Basic Terminologies

What’s a lexeme?

A lexeme is a sequence of characters in the source program that matches the pattern for a token. It is nothing but an instance of a token.

What’s a token?

A token is a sequence of characters that represents a unit of information in the source program, such as a keyword, identifier, operator, or separator.

What is Pattern?

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword used as a token, the pattern is simply the keyword's own sequence of characters; for identifiers and numbers, the pattern is typically given by a regular expression.
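To make the keyword case concrete, here is a minimal sketch in C (the keyword list and the helper name `is_keyword` are illustrative, not from any particular compiler): since the pattern for a keyword is its exact character sequence, matching reduces to a string comparison, and a lexeme that fails every keyword pattern would fall through to the identifier pattern.

```c
#include <string.h>

/* Illustrative keyword patterns: each pattern is the keyword's
   exact character sequence. */
static const char *keywords[] = { "int", "if", "else", "return" };

/* Returns 1 if the lexeme matches a keyword pattern, 0 otherwise. */
int is_keyword(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}
```

So the lexeme "int" matches a keyword pattern, while "maximum" does not and would instead be classified as an identifier.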

Lexical Analyzer Architecture: How tokens are recognized

The main task of lexical analysis is to read the input characters of the code and produce tokens. The lexical analyzer scans the entire source code of the program and identifies each token one by one. Scanners are usually implemented to produce tokens only when requested by the parser. Here is how recognition of tokens works:

“Get next token” is a command sent from the parser to the lexical analyzer. On receiving this command, the lexical analyzer scans the input until it finds the next token, which it returns to the parser.

The lexical analyzer skips whitespace and comments while creating these tokens. If any error is present, the lexical analyzer correlates that error with the source file and line number.
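The “get next token” interplay described above can be sketched as a small C function. This is a minimal sketch, not a production scanner: the token kinds, the 32-character lexeme limit, and the cursor-through-a-string interface are all simplifying assumptions made for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Token kinds recognized by this minimal sketch. */
typedef enum { TOK_IDENT, TOK_NUMBER, TOK_SYMBOL, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* the lexeme */
} Token;

/* get_next_token: called by the parser on demand; skips whitespace,
   scans one token starting at *src, and advances the cursor past it. */
Token get_next_token(const char **src) {
    Token t = { TOK_EOF, "" };
    const char *p = *src;

    while (isspace((unsigned char)*p)) p++;   /* skip whitespace */

    if (*p == '\0') { *src = p; return t; }

    size_t n = 0;
    if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier or keyword */
        t.kind = TOK_IDENT;
        while ((isalnum((unsigned char)*p) || *p == '_') && n < 31)
            t.text[n++] = *p++;
    } else if (isdigit((unsigned char)*p)) {         /* number */
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p) && n < 31)
            t.text[n++] = *p++;
    } else {                                         /* single-character symbol */
        t.kind = TOK_SYMBOL;
        t.text[n++] = *p++;
    }
    t.text[n] = '\0';
    *src = p;
    return t;
}
```

Calling get_next_token repeatedly on the input "if (x > 10)" yields the tokens "if", "(", "x", ">", "10", ")" in order, with the surrounding blanks silently discarded.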

Roles of the Lexical analyzer

The lexical analyzer performs the following tasks:

Reads input characters from the source program
Removes white space and comments from the source program
Helps identify tokens and enter them into the symbol table
Correlates error messages with the source program
Expands macros if they are found in the source program

Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>

int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}

Examples of tokens created:

Keyword: int, if, else, return
Identifier: maximum, x, y
Operator: >
Separator: ( ) { } ; ,

Examples of non-tokens:

Comment: // This will compare 2 numbers
Preprocessor directive: #include <stdio.h>
Whitespace: blanks, tabs, newlines

Lexical Errors

A lexical error is a character sequence that cannot be scanned into any valid token. Important facts about lexical errors:

Lexical errors are not very common, but they should still be managed by the scanner
Misspellings of identifiers, operators, and keywords are considered lexical errors
Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token

Error Recovery in Lexical Analyzer

Here are a few of the most common error recovery techniques:

Panic mode: successive characters are ignored until a well-formed token is found
Deleting one character from the remaining input
Inserting a missing character into the remaining input
Replacing a character with another character
Transposing two adjacent characters
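The panic-mode technique from the list above can be sketched as follows. This is an illustrative fragment, not part of any real scanner: the rule used here for “a character that can legally start a token” (a letter, digit, underscore, or whitespace) is an assumption chosen to keep the example short.

```c
#include <ctype.h>

/* Panic-mode recovery sketch: on hitting an illegal character,
   discard input until a character that could legally start a token
   appears, and report how many characters were skipped. */
int panic_mode_skip(const char **src) {
    int skipped = 0;
    const char *p = *src;
    while (*p != '\0' &&
           !(isalnum((unsigned char)*p) || *p == '_' ||
             isspace((unsigned char)*p))) {
        p++;        /* ignore the offending character */
        skipped++;
    }
    *src = p;       /* resume scanning from a safe point */
    return skipped;
}
```

For the input "@#$x = 1", the three illegal characters at the front are discarded and scanning resumes at "x"; the count of skipped characters can then be used to report the error with its position.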

Lexical Analyzer vs. Parser

Lexical analyzer: scans the input program, identifies tokens, and helps enter them into the symbol table
Parser: performs syntax analysis on the token stream produced by the lexical analyzer

Why separate Lexical Analysis and Parsing?

Simplicity of design: separating the two eases both lexical analysis and syntax analysis by eliminating unwanted tokens early
Compiler efficiency: a specialized scanner with dedicated input buffering speeds up reading the source
Specialization: specialized techniques can be applied to improve the lexical analysis process
Portability: only the scanner needs to communicate with the outside world, so input-device-specific peculiarities are restricted to the lexer

Advantages of Lexical analysis

The lexical analyzer method is used by programs like compilers, which can use the parsed data from a programmer's code to create compiled binary executable code
It is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS
A separate lexical analyzer helps you construct a specialized and potentially more efficient processor for the task

Disadvantage of Lexical analysis

You need to spend significant time reading the source program and partitioning it into tokens
Some regular expressions are quite difficult to understand compared to PEG or EBNF rules
More effort is needed to develop and debug the lexer and its token descriptions
Additional runtime overhead is required to generate the lexer tables and construct the tokens

Summary

Lexical analysis is the very first phase of compiler design
A lexeme is a sequence of characters in the source program that matches the pattern for a token; a token is the unit of information it represents
The lexical analyzer scans the entire source code of the program
The lexical analyzer helps identify tokens and enter them into the symbol table
A lexical error is a character sequence that cannot be scanned into any valid token
Deleting one character from the remaining input is a useful error recovery method
The lexical analyzer scans the input program, while the parser performs syntax analysis
Separating the two eases both lexical analysis and syntax analysis by eliminating unwanted tokens
Lexical analyzers are used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS
The biggest drawback of using a lexical analyzer is the additional runtime overhead required to generate the lexer tables and construct the tokens