This pure Lua module implements (1) a standard-compliant C preprocessor with a couple of useful extensions, and (2) a parser that provides a Lua-friendly description of all global declarations and definitions in a C header or C program file.
The driver program lcpp invokes the preprocessor and outputs
preprocessed code. Although it can be used as a replacement for the
normal preprocessor, it is more useful as an extra preprocessing step
(see option -Zpass which is on by default.) The same capabilities
are offered by functions cparser.cpp and cparser.cppTokenIterator
provided by the module cparser.
The driver program lcdecl analyzes a C header file or a C program
file and outputs short descriptions of the declarations and
definitions. This program is mostly useful to understand the
representations produced by the cparser function
cparser.declarationIterator.
This code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
lcpp [options] inputfile.c [-o outputfile.c]
Preprocess file inputfile.c and write the preprocessed code into
file outputfile.c or to the standard output.
The following options are recognized:
-
-Werror
Cause all warnings to be treated as errors. Note that parsing cannot resume after an error: the parser simply throws a Lua error.
-
-w
Do not print warning messages.
-
-Dsym[=val]
Define preprocessor symbol sym to value val. The default value of val is 1. Note that it is possible to define function-like symbols with syntax -Dsym(args)=val.
-
-Usym
Undefine preprocessor symbol sym.
-
-Idir
Add directory dir to the search path for included files. Note that there is no default search path. When an include file is not found, the include directive is simply ignored with a warning (but see also option -Zpass). Therefore all include directives are ignored unless one uses option -I to specify the search path.
-
-I-
Marks the beginning of the system include path. When an included file is given with angle brackets (as in #include <stdio.h>), one only searches directories specified by the -I options that follow -I-. Therefore all these include directives are ignored unless one uses option -I- followed by one or more -I options.
-
-dM
Instead of producing the preprocessed file, dumps all macros defined at the end of the parse.
-
-Zcppdef
Run the native preprocessor using command cpp -dM < /dev/null and copy its predefined symbols. This is useful when using lcpp as a full replacement for the standard preprocessor.
-
-Zpass
This option is enabled by default (use -Znopass to disable) and indicates that the output of lcpp is going to be reprocessed by a C preprocessor and compiler. This option triggers the following behavior:
    - The preprocessor directives #pragma and #ident are copied verbatim into the output file.
    - When the included file cannot be found in the provided search path, the preprocessor directive #include is copied into the output file.
    - Preprocessor directives prefixed with a double ## are copied verbatim into the output file with a single # prefix. This feature is useful for #if directives that depend on symbols defined by unresolved #include directives.
-
-std=(c|gnu)(89|99|11)
This option selects a C dialect. In the context of the preprocessor, this impacts the symbols predefined by lcpp and potentially enables GCC extensions of the variadic macro definition syntax.
    - Symbol __CPARSER__ is always defined with value <1>.
    - Symbols __STDC__ and __STDC_VERSION__ are either defined by option -Zcppdef or take values suitable for the target C dialect.
    - Symbols __GNUC__ and __GNUC_MINOR__ are either defined by option -Zcppdef or are defined to values 4 and 2 if the target dialect starts with string gnu.
This can be further adjusted using the -D or -U options. The default dialect is gnu99.
The lcpp preprocessor implements several useful nonstandard features.
The main feature is multiline macros. The other features are mostly
here because they make multiline macros more useful.
The C standard specifies that the expressions following #if
directives are constant expressions of integral type. However, this
preprocessor also handles strings. The only valid operations on strings
are the equality and ordering comparisons. This is quite useful to
make special cases for certain values of the parameters of a multiline
macro, as shown later.
Preprocessor directives #defmacro and #endmacro can be used to
define a function-like macro whose body spans several lines. The
#defmacro directive contains the macro name and a mandatory argument
list. The body of the macro is composed of all the following lines up
to the matching #endmacro. This offers several benefits:
-
The line numbers of the macro-expansion are preserved. This ensures that the compiler produces error messages with meaningful line numbers.
-
The multi-line macro can contain preprocessor directives. Conditional directives are very useful in this context. Note however that preprocessor definitions (with #define, #defmacro, or #undef) nested inside multiline macros are only valid within the macro.
-
The standard stringification # and token concatenation ## operators can be used freely in the body of multiline macros. Note that these operators only work with the parameters of the multiline macros and not with ordinary preprocessor definitions. This is consistent with the standard behavior of these operators in ordinary preprocessor macros.
Example
#defmacro DEFINE_VDOT(TNAME, TYPE)
  TYPE TNAME##Vector_dot(TYPE *a, TYPE *b, int n)
  {
    /* try cblas */
#if #TYPE == "float"
    return cblas_sdot(n, a, 1, b, 1);
#elif #TYPE == "double"
    return cblas_ddot(n, a, 1, b, 1);
#else
    int i;
    TYPE s = 0;
    for(i=0;i<n;i++)
      s += a[i] * b[i];
    return s;
#endif
  }
#endmacro
DEFINE_VDOT(Float,float);
DEFINE_VDOT(Double,double);
DEFINE_VDOT(Int,int);
Details -- The values of the macro parameters are normally macro-expanded before substituting them into the text of the macro. However this macro-expansion does not happen when the substitution occurs in the context of a stringification or token concatenation operator. All this is consistent with the standard. The novelty is that this macro-expansion does not occur either when the parameter appears in a nested preprocessor directive or multiline macro.
More details -- The stringification operator only works when the next non-space token is a macro parameter. This provides a good way to distinguish a nested directive from a stringification operator appearing in the beginning of a line.
Even more details -- The standard mandates that the tokens generated by a macro-expansion can be combined with the following tokens to compose a new macro invocation. This is not allowed for multiline macros. An error is signaled if the expansion of a multiline macro generates an incomplete macro argument list.
Consider the following variadic macro:
#define macro(msg, ...) printf(msg, __VA_ARGS__)
The C standard says that it is an error to call this macro with only
one argument. Calling this macro with an empty second argument
--macro(msg,)-- leaves an annoying comma in the expansion
--printf(msg,)-- and causes a compiler syntax error.
This preprocessor accepts invocations of such a macro with a single
argument. The value of parameter __VA_ARGS__ is then a so-called
negative comma, meaning that the preceding comma is eliminated when
this parameter appears in the macro definition between a comma and a
closing parenthesis.
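The rule can be pictured with a toy textual substitution. This is only a sketch of the behavior described above, not the lcpp implementation. The function name substituteVarArgs is made up for illustration, and the handling of occurrences that are not between a comma and a closing parenthesis is an assumption.

```lua
-- Toy model of the "negative comma" rule; not the lcpp implementation.
-- `body` is the macro body, `va` the text bound to __VA_ARGS__,
-- or nil when the invocation supplied no variadic argument at all.
local function substituteVarArgs(body, va)
  if va == nil then
    -- Negative comma: drop ", __VA_ARGS__" when it is followed by ")".
    body = body:gsub(",%s*__VA_ARGS__%s*%)", ")")
    -- Assumption: any other occurrence expands to nothing.
    body = body:gsub("__VA_ARGS__", "")
    return body
  end
  -- A function replacement keeps any '%' in `va` literal.
  body = body:gsub("__VA_ARGS__", function() return va end)
  return body
end

local body = "printf(msg, __VA_ARGS__)"
print(substituteVarArgs(body, nil))     -- printf(msg)
print(substituteVarArgs(body, "a, b"))  -- printf(msg, a, b)
print(substituteVarArgs(body, ""))      -- printf(msg, )
```

The last call mirrors the standard behavior for an explicitly empty argument, where the comma stays in the expansion.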
When a new invocation of the macro appears in the expansion of a
macro, the standard specifies that the preprocessor must rescan the
expansion but should not recursively expand the macro. Although this
restriction is both wise and useful, there are rare cases where one
would like to use recursive macros. As an experiment, this recursion
prevention feature is turned off when one defines a multiline macro
with #defrecursivemacro instead of #defmacro. Note that this might
prevent the preprocessor from terminating unless the macro eventually
takes a conditional branch that does not recursively invoke the macro.
lcdecl [options] inputfile.c [-o outputfile.txt]
Preprocess and parse file inputfile.c.
The output of the parser is a sequence of Lua data structures
representing each C definition or declaration encountered in the code.
Program lcdecl prints each of them in two forms. The first form
directly represents the Lua tables composing the data structure. The
second form reconstructs a piece of C code representing the definition
or declaration of interest.
This program is mostly useful to people working with the Lua functions
offered by the cparser module because it provides a quick way to inspect
the resulting data structures.
Program lcdecl accepts all the preprocessing options
documented for program lcpp. It also accepts the additional
options -Ttypename and -Ztag, and adds to the meaning of
options -Zpass and -std=dialect.
-
-Ttypename
Similar to lcpp, program lcdecl only reads the include files that are found along the path specified by the -I options. It is generally not desirable to read all include files because they often contain declarations that are not directly useful. This also means that the C parser may not be aware of type definitions found in ignored include files. Although the C syntax is sufficiently unambiguous to allow the parser to guess that an identifier is a type name rather than a variable name, this can lead to confusing error messages. Option -Ttypename can then be used to inform the parser that symbol typename represents a type and not a constant, a variable, or a function.
-
-Zpass
Unlike lcpp, program lcdecl processes the input file with option -Zpass off by default. Turning it on will just eliminate potentially useful warning messages.
-
-Ztag
This option causes lcdecl to treat all structs, unions, and enums as tagged types, possibly using synthetic tags of the form __anon_XXXXX. It is assumed that such names are not used anywhere in the parsed program. This is useful for certain code transformation applications.
-
-std=(c|gnu)(89|99|11)
The dialect selection options also control whether the parser recognizes keywords introduced by later versions of the C standard (e.g., restrict, _Bool, _Complex, _Atomic, _Pragma, inline) or by the GCC compiler (e.g., asm). Many of these keywords have a double-underscore-delimited variant that is recognized in all cases (e.g., __restrict__).
Example.
Running lcdecl on the following program
const int size = (3+2)*2;
float arr[size];
typedef struct symtable_s { const char *name; SymVal value; } symtable_t;
void printSymbols(symtable_t *p, int n) { do_something(p,n); }
produces the following output (with very long lines).
+--------------------------
| Definition{where="test.c:2",intval=10,type=Qualified{t=Type{n="int"},const=true},name="size",init={..}}
| const int size = 10
+--------------------------
| Definition{where="test.c:3",type=Array{t=Type{n="float"},size=10},name="arr"}
| float arr[10]
+--------------------------
| TypeDef{sclass="[typetag]",where="test.c:4",type=Struct{Pair{Pointer{t=Qualified{t=Type{n="char"},const=true}},"name"},Pair{Type{n="SymVal"},"value"},n="symtable_s"},name="struct symtable_s"}
| [typetag] struct symtable_s{const char*name;SymVal value;}
+--------------------------
| TypeDef{sclass="typedef",where="test.c:4",type=Type{_def={..},n="struct symtable_s"},name="symtable_t"}
| typedef struct symtable_s symtable_t
+--------------------------
| Definition{where="test.c:5",type=Function{Pair{Pointer{t=Type{_def={..},n="symtable_t"}},"p"},Pair{Type{n="int"},"n"},t=Type{n="void"}},name="printSymbols",init={..}}
| void printSymbols(symtable_t*p,int n){..}
+--------------------------
Module cparser exports the following functions:
Program lcpp is implemented by function cparser.cpp.
Calling this function preprocesses file filename and writes the
preprocessed code to the specified output. The optional argument
outputfile can be a file name or a Lua file descriptor. When this
argument is nil, the preprocessed code is written to the standard
output. The optional argument options is an array of option
strings. All the options documented with program lcpp are
supported.
Calling function cparser.cppTokenIterator(options, lines, prefix) produces two results:
- A token iterator function.
- A macro definition table.
Argument options is an array of options.
All the options documented for program lcpp are supported.
Argument lines is an iterator that returns input lines.
Lua provides many such iterators, including io.lines(filename) to
return the lines of the file named filename and filedesc:lines()
to return lines from the file descriptor filedesc. You can also use
string.gmatch(somestring,'[^\n]+') to return lines from string
somestring.
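A short sketch of the string-based iterator mentioned above, using only the standard Lua library. One thing to keep in mind is that the pattern '[^\n]+' never yields empty lines. The helper linesKeepingEmpty is a hypothetical alternative for cases where blank lines should still be delivered; whether that matters for reported line numbers is an assumption, not something this document states.

```lua
-- Build a line iterator over an in-memory string.
local source = "int a;\n\nint b;\n"

-- The pattern '[^\n]+' matches runs of non-newline characters,
-- so the empty second line is skipped entirely.
local lines = {}
for line in string.gmatch(source, '[^\n]+') do
  lines[#lines + 1] = line
end
print(#lines)  -- 2

-- Hypothetical variant that also yields empty lines.
local function linesKeepingEmpty(s)
  if s:sub(-1) ~= "\n" then s = s .. "\n" end
  return string.gmatch(s, "([^\n]*)\n")
end

local kept = {}
for line in linesKeepingEmpty(source) do
  kept[#kept + 1] = line
end
print(#kept)   -- 3
```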
Each successive call of the token iterator function describes a token
of the preprocessed code by returning two strings. The first string
represents the token text. The second string follows format
"filename:lineno" and indicates on which line the token was
found. The filename either is the argument prefix or is the actual
name of an included file. When all the tokens have been produced, the
token iterator function returns nil.
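Since the location is a plain "filename:lineno" string, it can be split with a Lua pattern. The helper below is a sketch with a made-up name (splitLocation) and is not part of the module.

```lua
-- Split a location string of the form "filename:lineno".
-- The greedy "(.*)" keeps any earlier colons inside the file name.
local function splitLocation(location)
  local file, line = location:match("^(.*):(%d+)$")
  if not file then return nil end
  return file, tonumber(line)
end

print(splitLocation("testmacro.c:12"))   -- testmacro.c   12
print(splitLocation("C:/src/test.c:7"))  -- C:/src/test.c 7
```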
Each named entry of the macro definition table contains the definition
of the corresponding preprocessor macro. Function
cparser.macroToString can be used to reconstruct the macro
definition from this information.
Example:
ti,macros = cparser.cppTokenIterator(nil, io.lines('test/testmacro.c'), 'testmacro.c')
for token,location in ti do
  print(token, location)
end
for symbol,_ in pairs(macros) do
  local s = cparser.macroToString(macros,symbol)
  if s then print(s) end
end
Function cparser.macroToString(macros, name) returns a string representing the definition
of the preprocessor macro name found in macro definition table macros.
It returns nil if no such macro is defined.
Note that the macro definition table contains named entries that
are not macro definitions but functions implementing
magic symbols such as __FILE__ or __LINE__.
Program lcdecl is implemented by function cparser.parse.
Calling this function preprocesses and parses file filename, writing
a trace into the specified file. The optional argument outputfile
can be a file name or a Lua file descriptor. When this argument is
nil, the trace is written to the standard output. The
optional argument options is an array of option strings. All the
options documented with program lcdecl are supported.
Calling function cparser.declarationIterator(options, lines, prefix) produces three results:
- A declaration iterator function.
- A symbol table.
- A macro definition table.
Argument options is an array of options.
All the options documented for program lcdecl are supported.
Argument lines is an iterator that returns input lines.
Lua provides many such iterators, including io.lines(filename) to
return the lines of the file named filename and filedesc:lines()
to return lines from the file descriptor filedesc. You can also use
string.gmatch(somestring,'[^\n]+') to return lines from string
somestring.
Each successive call of the declaration iterator function returns a Lua
data structure that represents a declaration, a definition, or certain
preprocessor events. The format of these data structures is described
under function cparser.declToString.
The symbol table contains the definition of all the C language
identifiers defined or declared by the parsed files. Type names are
represented by the Type{} data structure documented under function
cparser.typeToString. Constants, variables, and functions are
represented by Definition{} or Declaration{} data structures
similar to those returned by the declaration iterator.
The macro definition table contains
the definition of the preprocessor macros.
See the documentation of function macroToString for details.
Example
di = cparser.declarationIterator(nil, io.lines('tests/testmacro.c'), 'testmacro.c')
for decl in di do
  print(decl)
  print(">>", cparser.declToString(decl))
end
Function cparser.typeToString(ty, nam) produces a string suitable for
declaring a variable nam with type ty in a C program.
Argument ty is a type data structure.
Argument nam should be a string representing a legal identifier.
However it defaults to %s in order to compute a format string
suitable for the standard Lua function string.format.
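To illustrate the format-string use, the sketch below starts from a hand-written string with %s where the variable name belongs. The exact text and spacing the module produces may differ, so the value of fmt is an assumption and is not obtained from cparser.

```lua
-- Hypothetical format string of the kind described above: a C
-- declarator with %s where the variable name belongs. It is written
-- by hand here rather than obtained from the module.
local fmt = "int (*%s)(const char*)"

-- Filling in a name yields a declaration for that variable.
local decl = string.format(fmt, "callback")
print(decl .. ";")  -- int (*callback)(const char*);

-- The same format can be reused for several variables.
local all = {}
for _, name in ipairs({ "on_open", "on_close" }) do
  all[#all + 1] = string.format(fmt, name) .. ";"
end
print(table.concat(all, "\n"))
```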
Module cparser represents each type with a tree whose nodes are Lua
tables tagged by their tag field. These tables are equipped with a
convenient metatable method that prints them compactly by first
displaying the tag, then the remaining fields using the standard Lua
table constructor syntax.
For instance, the type const int is printed as
Qualified{t=Type{n="int"},const=true}
and corresponds to
{
  tag="Qualified",
  const=true,
  t= {
    tag="Type",
    n = "int"
  }
}
The following tags are used to represent types.
-
Type{n=name} is used to represent a named type name. There is only one instance of each named type. Names can be made of multiple keywords, such as int or unsigned long int; they can also be typedef identifiers, such as size_t, or composed names, such as struct foo or enum bar. This construct can also contain a field _def that points to the definition of the named type if such a definition is known.
-
Qualified{t=basetype,...} is used to represent a qualified variant of basetype. Fields named const, volatile, or restrict are set to true to represent the applicable type qualifiers. When the type appears in function parameters and the base type is a pointer, a field named static may contain the guaranteed array size.
-
Pointer{t=basetype} is used to represent a pointer to an object of type basetype. This construct may also contain a field block=true to indicate that the pointer refers to a code block (a C extension found in Apple compilers) or a field ref=true to indicate a reference type (a C extension inspired by C++).
-
Array{t=basetype,size=s} is used to represent an array of objects of type basetype. The optional field size contains the array size when an array size is specified. The size is usually an integer. However there are situations in which the parser is unable to evaluate the size, for instance because it relies on the C keyword sizeof(x). In such cases, field size is a string containing a C expression for the size.
-
Struct{} and Union{} are used to represent the corresponding C types. The optional field n contains the structure tag. Each entry is represented by a Pair{type,name} construct located at successive integer indices. This means that the type of the third entry of structure type ty can be accessed as ty[3][1] and the corresponding name is ty[3][2]. In the case of Struct{} tables, the pairs optionally contain a field bitfield to indicate a bitfield size for the structure entry. Field bitfield usually contains a small integer but can also contain a string representing a C expression (just like field size in the Array{} construct).
-
Enum{} is used to represent an enumerated type. The optional field n may contain the enumeration tag name. The enumeration constants are represented as Pair{name,value} located at successive integer indices. The second pair element is only given when the C code contains an explicit value. It can be an integer or an expression string (just like field size in Array{}).
-
Function{t=returntype} is used to represent functions returning an object of type returntype. Field withoutProto is set to true when the function does not provide a prototype. Otherwise the arguments are described by Pair{type,name} located at integer indices. The prototype of a variadic function ends with a Pair{ellipsis=true} to represent the ... argument.
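The tagged-table layout lends itself to small recursive walkers. The sketch below builds two type trees by hand, following the field names listed above, and describes them in words. The function describe and the sample tables are illustrative and not part of the module; real trees come from the parser and carry a metatable that is omitted here.

```lua
-- Hand-built type tree for `const char *argv[8]`, following the tags above.
local ty = {
  tag = "Array", size = 8,
  t = { tag = "Pointer",
        t = { tag = "Qualified", const = true,
              t = { tag = "Type", n = "char" } } },
}

-- Describe a type in words by recursing through the `t` fields.
local function describe(t)
  if t.tag == "Type" then
    return t.n
  elseif t.tag == "Qualified" then
    local quals = {}
    for _, q in ipairs({ "const", "volatile", "restrict" }) do
      if t[q] then quals[#quals + 1] = q end
    end
    quals[#quals + 1] = describe(t.t)
    return table.concat(quals, " ")
  elseif t.tag == "Pointer" then
    return "pointer to " .. describe(t.t)
  elseif t.tag == "Array" then
    -- `size` may be an integer, an expression string, or absent.
    local size = t.size and ("[" .. tostring(t.size) .. "]") or ""
    return "array" .. size .. " of " .. describe(t.t)
  elseif t.tag == "Function" then
    return "function returning " .. describe(t.t)
  end
  return "<" .. tostring(t.tag) .. ">"
end

print(describe(ty))  -- array[8] of pointer to const char

-- Struct entries live at integer indices as Pair{type,name}.
local point = {
  tag = "Struct", n = "point",
  { tag = "Pair", { tag = "Type", n = "int" }, "x" },
  { tag = "Pair", { tag = "Type", n = "int" }, "y" },
}
for i, pair in ipairs(point) do
  print(i, pair[2], describe(pair[1]))
end
```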
The Qualified{}, Function{}, Struct{}, Union{}, and Enum{}
tables may additionally have a field attr whose contents represent
attribute information, such as C11 attributes [[...]], MSVC-style
attributes __declspec(...) or GNU attributes __attribute__(...).
This is represented by an array containing all the attribute tokens
(on odd indices) and their locations (on even indices).
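The interleaved layout can be traversed with a step of two. The attr array below is a hand-written stand-in; the exact way the parser splits an attribute into tokens is an assumption made for illustration.

```lua
-- Hand-written stand-in for an `attr` field: tokens on odd indices,
-- their locations on even indices.
local attr = {
  "__attribute__", "test.c:3",
  "(", "test.c:3",
  "(", "test.c:3",
  "packed", "test.c:3",
  ")", "test.c:3",
  ")", "test.c:3",
}

-- Collect only the tokens by stepping through the odd indices.
local tokens = {}
for i = 1, #attr, 2 do
  tokens[#tokens + 1] = attr[i]
end
print(table.concat(tokens, " "))  -- __attribute__ ( ( packed ) )
```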
Function cparser.stringToType(s) parses string s as an abstract type name or a variable declaration
and returns the type object and possibly the variable name. This
function returns nil when the string cannot be interpreted as a type
or a declaration, or when the declaration specifies a storage class.
Example
> return cparser.stringToType("int(*)(const char*)")
Pointer{t=Function{Pair{Pointer{t=Qualified{const=true,t=Type{n="char"}}}},t=Type{n="int"}}} nil
Function cparser.declToString(decl) produces a string that describes the data structures
returned by the declaration iterator. There are in fact four kinds
of data structures. All these structures have very similar fields.
In particular, field where always contains the location of the
definition or declaration.
-
TypeDef{name=n,sclass=s,type=ty} represents a type definition. This construct is produced in two different situations. When the C program contains a typedef keyword, field sclass contains the string "typedef", field name contains the new type name, and field type contains the type description. When the C program defines a tagged struct, union, or enum type, field sclass contains the string "[typetag]", field name contains the tagged type name (e.g., "struct foo"), and field type contains the type definition (e.g., Struct{...}).
-
Declaration{name=n,sclass=s,type=ty,...} represents the declaration of a variable or function that is defined elsewhere. Field name gives the variable or function name. Field type gives its type. Field sclass can be empty, "extern", or "static".
-
Definition{name=n,sclass=s,type=ty,...} represents the definition of a constant, a variable, or a function. Field name again gives the name, field type gives its type, field sclass gives its storage class, and field init may contain an array of tokens and token locations representing a variable initializer or a function body. Constant definitions may also have a field intval containing the value of an integer constant. This field works like the size of an array: it often contains a small integer but can also contain a string representing the C expression that the parser was unable to evaluate for one reason or another. Enumeration constants are reported with storage class "[enum]" and with a constant integer type containing an additional field _enum that points to the corresponding enumerated type.
-
CppEvent{directive=dir,...} describes certain preprocessor events that are potentially relevant to a C API. In particular, the definition of an object-like macro s with an integer value v is reported as CppEvent{directive="define",name="s",intval=v} and its deletion as CppEvent{directive="undef",name="s"}. Finally, CppEvent{directive="include",name="fspec"} indicates that an include directive was not resolved.
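A typical consumer dispatches on the tag field of each value. The sketch below gathers integer constants from both Definition{} entries and CppEvent{} macro definitions. The decls list is made up by hand to mimic the documented fields, and integerConstants is a hypothetical helper, not a function of the module.

```lua
-- Hand-built stand-ins for values a declaration iterator might yield.
local decls = {
  { tag = "Definition", name = "size", intval = 10, where = "test.c:2",
    type = { tag = "Qualified", const = true,
             t = { tag = "Type", n = "int" } } },
  { tag = "Definition", name = "big", intval = "sizeof(x)*2",
    type = { tag = "Type", n = "int" } },
  { tag = "CppEvent", directive = "define", name = "MAXLEN", intval = 256 },
  { tag = "CppEvent", directive = "undef", name = "OLD" },
}

-- Gather every name bound to a known integer value. The check on the
-- Lua type skips `intval` fields that hold an unevaluated expression.
local function integerConstants(list)
  local out = {}
  for _, d in ipairs(list) do
    local isDefine = d.tag == "CppEvent" and d.directive == "define"
    if (d.tag == "Definition" or isDefine)
        and type(d.intval) == "number" then
      out[d.name] = d.intval
    end
  end
  return out
end

local consts = integerConstants(decls)
print(consts.size, consts.MAXLEN, consts.big)  -- 10  256  nil
```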