This project implements a Lexical Analyzer (Lexer) written in the C programming language. The lexer reads C-like source code and breaks it into meaningful components called tokens such as keywords, identifiers, operators, literals, and delimiters.
A lexical analyzer is the first stage of a compiler pipeline, responsible for converting raw program text into a structured stream of tokens that can later be processed by a parser.
This project demonstrates how programming languages internally analyze source code and identify its syntactic elements.
GitHub Repository
https://github.com/Abineshabee/Lexer-in-c
Compilers do not directly understand source code. Instead, they process programs through several stages.
The lexer is responsible for scanning characters and grouping them into tokens.
Example source code:
int x = 10;The lexer converts it into tokens:
INT
IDENTIFIER(x)
ASSIGN
NUMBER(10)
SEMICOLON
This project demonstrates how such a system can be implemented from scratch using C.
The lexer is the first step in the compilation process.
Source Code
↓
Lexical Analysis (Lexer)
↓
Token Stream
↓
Syntax Analysis (Parser)
↓
Abstract Syntax Tree
↓
Semantic Analysis
↓
Code Generation
This project implements the Lexical Analysis stage.
The lexer supports a large number of tokens including:
• Identifiers • Numeric literals • String literals • Character literals • Arithmetic operators • Logical operators • Bitwise operators • Assignment operators • Control flow keywords • C/C++ keywords • Python-like built-in functions • String manipulation functions • Mathematical functions
| File | Description |
|---|---|
| lexer.c | Main lexer implementation |
| Token enum | Defines token types |
| Token struct | Stores token information |
| get_next_token() | Core token extraction logic |
| token_type_to_string() | Converts token enums to readable text |
The program defines an enumeration for token types.
enum TokenType {
TOKEN_IDENTIFIER,
TOKEN_NUMBER,
TOKEN_PLUS,
TOKEN_MINUS,
TOKEN_MULTIPLY,
TOKEN_DIVIDE
};This helps categorize different components of the source code.
Each token has two properties:
struct Token {
enum TokenType type
char value[MAX_TOKEN_LEN]
}
Example token:
Type : IDENTIFIER
Value: x
The lexer works using a character scanning algorithm.
Steps:
- Skip whitespace
- Detect identifiers or keywords
- Detect numbers
- Detect operators
- Detect string literals
- Detect character literals
- Return token
Identifiers represent variable or function names.
Example:
int count;Tokens:
INT
IDENTIFIER(count)
SEMICOLON
Numbers can be integers or floating point values.
Example:
5
3.14
Token output:
NUMBER(5)
NUMBER(3.14)
Strings are enclosed in double quotes.
Example:
"Hello World"Token:
STRING_LITERAL("Hello World")
Example:
'a'Token:
CHAR_LITERAL('a')
| Operator | Token |
|---|---|
| + | PLUS |
| - | MINUS |
| * | MULTIPLY |
| / | DIVIDE |
| % | MODULO |
Example:
x + yTokens:
IDENTIFIER(x)
PLUS
IDENTIFIER(y)
| Operator | Token |
|---|---|
| = | ASSIGN |
| += | PLUS_EQUAL |
| -= | MINUS_EQUAL |
| *= | MULTIPLY_EQUAL |
| /= | DIVIDE_EQUAL |
| %= | MODULO_EQUAL |
Example:
x += 5
Tokens:
IDENTIFIER(x)
PLUS_EQUAL
NUMBER(5)
| Operator | Token |
|---|---|
| == | EQUAL |
| != | NOT_EQUAL |
| < | LESS_THAN |
| > | GREATER_THAN |
| <= | LESS_EQUAL |
| >= | GREATER_EQUAL |
Example:
x > 10Tokens:
IDENTIFIER(x)
GREATER_THAN
NUMBER(10)
| Operator | Token |
|---|---|
| && | AND |
| || | OR |
| ! | NOT |
Example:
x > 0 && y < 10| Operator | Token |
|---|---|
| & | BIT_AND |
| | | BIT_OR |
| ^ | BIT_XOR |
| ~ | BIT_NOT |
| << | LEFT_SHIFT |
| >> | RIGHT_SHIFT |
| Symbol | Token |
|---|---|
| ( | LEFT_PAREN |
| ) | RIGHT_PAREN |
| { | LEFT_BRACE |
| } | RIGHT_BRACE |
| [ | LEFT_BRACKET |
| ] | RIGHT_BRACKET |
| ; | SEMICOLON |
| , | COMMA |
| Keyword |
|---|
| if |
| else |
| while |
| for |
| do |
| switch |
| case |
| break |
| continue |
| return |
Example:
if (x > 0)Tokens:
IF
LEFT_PAREN
IDENTIFIER(x)
GREATER_THAN
NUMBER(0)
RIGHT_PAREN
Supported data types:
| Type |
|---|
| int |
| float |
| double |
| char |
| short |
| long |
| void |
| bool |
Example:
float y = 3.14
Tokens:
FLOAT
IDENTIFIER(y)
ASSIGN
NUMBER(3.14)
The lexer also detects built-in string operations.
| Function |
|---|
| upper |
| lower |
| capitalize |
| count |
| find |
| replace |
| reverse |
| title |
Example:
upper(str)
Tokens:
UPPER
LEFT_PAREN
IDENTIFIER(str)
RIGHT_PAREN
Supported math functions:
| Function |
|---|
| abs |
| ceil |
| floor |
| round |
Example:
abs(x)
Token:
ABS
Input code:
int main() {
int x = 5;
float y = 3.14;
char* str = "Hello, World!";
if (x > 0 && y < 10.0) {
printf("%d %f\n", x, y);
}
return 0;
}Token: Type=INT Value=int
Token: Type=IDENTIFIER Value=main
Token: Type=LEFT_PAREN Value=(
Token: Type=RIGHT_PAREN Value=)
Token: Type=LEFT_BRACE Value={
Token: Type=INT Value=int
Token: Type=IDENTIFIER Value=x
Token: Type=ASSIGN Value==
Token: Type=NUMBER Value=5
Token: Type=SEMICOLON Value=;
Compile the lexer using GCC.
gcc lexer.c -o lexer./lexer
The program will analyze the input string and print tokens.
This project demonstrates key compiler concepts:
• Lexical Analysis • Tokenization • Programming language syntax • Compiler front-end design
It is useful for students learning:
• Compilers • Programming language design • Static analysis tools
Possible improvements include:
• Comment detection • Full C grammar parser • File input support • Error handling • AST generation • Syntax highlighting engine
Abinesh N
GitHub https://github.com/Abineshabee